    ollamarunner: Improve multimodal input handling · a7e63b82
    Jesse Gross authored
    Various vision models have different requirements for how they
    receive their inputs. For example:
     - Mllama wants images together with text; the image embeddings
       themselves have no positions and are not stored in the main KV
       cache.
     - Llava-style models feed image embeddings in like tokens, and each
       image corresponds to a varying number of tokens in the cache.
    
    In addition, the strategy for providing inputs must support batching
    and multiple sequences, which are managed by the runner. At the same
    time, we want to keep data handling fully in the model so that new
    architectures are not bottlenecked by runner code that does not
    understand their particular requirements.
    
    This provides a method for models to edit the input stream so that
    it meets their needs while remaining in a format the runner
    understands (see the sketch below). This lets the runner avoid
    special processing for different models.
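    
    A minimal sketch of the kind of hook this describes, using hypothetical
    Go names (Input, MultimodalProcessor, PostTokenize) rather than the
    exact API in model.go:
    
        // Hypothetical sketch; names do not necessarily match model.go.
        package model
    
        // Input is one element of the stream the runner feeds to a model:
        // either a plain text token or a multimodal embedding.
        type Input struct {
            Token      int32
            Multimodal any // e.g. an image embedding tensor; nil for text tokens
        }
    
        // MultimodalProcessor is an optional interface. A model that needs
        // to reshape the input stream (merge images with surrounding text,
        // expand an image into a varying number of placeholder inputs, etc.)
        // implements it; the runner calls it once after tokenization and
        // then batches and caches the result like any other input sequence.
        type MultimodalProcessor interface {
            PostTokenize(inputs []Input) ([]Input, error)
        }
    
    On the runner side this reduces to a type assertion such as
    "if p, ok := m.(MultimodalProcessor); ok { ... }", so the runner itself
    carries no model-specific branches.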
    
    In addition, this fixes a regression where non-vision models could
    incorrectly attempt to interpret images.