1. 20 Mar, 2025 3 commits
    • model: Pass input tensor instead of raw data to models · 0fbfcf3c
      Jesse Gross authored
      Rather than directly giving the input data to models, we can
      pass a tensor instead. In the short term, this saves some duplicated
      code.
      
      Longer term, we will want to overlap setting up the next batch with
      processing of the current one. In that case, we will only have the
      shape of the tensor at graph generation time; it will not yet be
      loaded with data. By passing only a tensor to models now, we set up
      this possibility and prevent them from relying on data that they won't
      have in the future.
      
      Although the same could be done for Positions and Outputs, in some
      cases we either need the raw input data or don't use them at all.
      Therefore, for now we leave them as they are and allow models to
      convert them to tensors as needed.
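As a rough illustration of the idea in this commit, the sketch below plans a forward pass from a tensor's shape alone and loads the data afterwards. Every name in it (`Tensor`, `NewTensor`, `Load`, `Loaded`, `PlanForward`) is hypothetical and chosen for illustration; it is not the project's actual API.

```go
package main

import "fmt"

// Tensor is a hypothetical placeholder: its shape is known up front,
// while its data may be filled in later.
type Tensor struct {
	shape []int
	data  []int32 // nil until Load is called
}

// NewTensor creates a tensor with the given shape and no data.
func NewTensor(shape ...int) *Tensor { return &Tensor{shape: shape} }

// Shape returns the dimensions of the tensor.
func (t *Tensor) Shape() []int { return t.shape }

// Loaded reports whether data has been attached to the tensor.
func (t *Tensor) Loaded() bool { return t.data != nil }

// Load attaches data, checking that its length matches the shape.
func (t *Tensor) Load(data []int32) error {
	n := 1
	for _, d := range t.shape {
		n *= d
	}
	if len(data) != n {
		return fmt.Errorf("got %d values, want %d", len(data), n)
	}
	t.data = data
	return nil
}

// PlanForward stands in for graph generation (an assumption of this
// sketch): it inspects only the input's shape, never its contents, and
// returns the output shape of an embedding lookup.
func PlanForward(inputs *Tensor, hiddenSize int) []int {
	out := append([]int{}, inputs.Shape()...)
	return append(out, hiddenSize)
}

func main() {
	inputs := NewTensor(4)
	outShape := PlanForward(inputs, 8) // planned before any data exists
	if err := inputs.Load([]int32{1, 2, 3, 4}); err != nil {
		panic(err)
	}
	fmt.Println(outShape, inputs.Loaded())
}
```

Because `PlanForward` never reads the data, the plan can be built while the batch contents are still being prepared.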
    • input: Rename Options to Batch · 0c220935
      Jesse Gross authored
      Options is no longer very descriptive of this struct.
    • gemma2: Remove second call to Rows · b078dd15
      Jesse Gross authored
      Looks like a merge conflict that broke the model.
  2. 19 Mar, 2025 1 commit
  3. 14 Mar, 2025 2 commits
    • ollamarunner: Use a separate context per multimodal input · 282bfaaa
      Jesse Gross authored
      Currently there is a single context per sequence, shared by
      all multimodal inputs. Since we build a vision encoder graph per
      image, with a large number of inputs we can eventually hit the
      maximum number of graph nodes per context.

      This change uses a separate context for each image, ensuring
      that available resource limits are consistent.
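The node-limit problem described in this commit can be modeled with a toy budget, shown below. The `Context` type, its `AddNodes` method, and the node counts are all assumptions made for illustration; the real contexts and limits live in the backend.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTooManyNodes is returned when a context's node budget is exhausted.
var ErrTooManyNodes = errors.New("graph node limit exceeded")

// Context is a hypothetical stand-in for a compute context with a
// fixed budget of graph nodes.
type Context struct {
	maxNodes int
	used     int
}

// NewContext creates a context that can hold at most maxNodes nodes.
func NewContext(maxNodes int) *Context { return &Context{maxNodes: maxNodes} }

// AddNodes reserves n nodes, failing if the budget would be exceeded.
func (c *Context) AddNodes(n int) error {
	if c.used+n > c.maxNodes {
		return ErrTooManyNodes
	}
	c.used += n
	return nil
}

// EncodeShared builds every image's encoder graph in one shared context,
// so node usage accumulates across images.
func EncodeShared(images, nodesPerImage, maxNodes int) error {
	ctx := NewContext(maxNodes)
	for i := 0; i < images; i++ {
		if err := ctx.AddNodes(nodesPerImage); err != nil {
			return fmt.Errorf("image %d: %w", i, err)
		}
	}
	return nil
}

// EncodeSeparate gives each image its own context, so every image sees
// the same full budget regardless of how many images came before it.
func EncodeSeparate(images, nodesPerImage, maxNodes int) error {
	for i := 0; i < images; i++ {
		ctx := NewContext(maxNodes)
		if err := ctx.AddNodes(nodesPerImage); err != nil {
			return fmt.Errorf("image %d: %w", i, err)
		}
	}
	return nil
}

func main() {
	fmt.Println("shared:  ", EncodeShared(5, 300, 1000))
	fmt.Println("separate:", EncodeSeparate(5, 300, 1000))
}
```

With a shared context the failure depends on how many images precede the current one; with one context per image it depends only on the size of that image's graph.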
    • ml: Allow models to constrain inputs to a single batch · 9679f401
      Jesse Gross authored
      Models may require that a set of inputs all be processed as part
      of the same batch. For example, if an image has multiple patches
      with fully connected attention between them, we should not split
      the batch in the middle of an image.
      
      Fixes #9697
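One way to picture the batching constraint is a splitter that refuses to break a marked group of inputs. The sketch below is a minimal version under stated assumptions: the `Input` type, its `SameBatch` field (the number of following inputs that must share a batch with this one), and `SplitBatches` are hypothetical names, and only the first input of a group is assumed to carry the constraint.

```go
package main

import "fmt"

// Input is a hypothetical element of the input stream. SameBatch is an
// assumed field: how many of the following inputs must be processed in
// the same batch as this one.
type Input struct {
	Token     int
	SameBatch int
}

// SplitBatches greedily packs inputs into batches of at most batchSize.
// A batch is closed early when the next constrained group would not fit
// in the remaining space, so a group is never split.
func SplitBatches(inputs []Input, batchSize int) ([][]Input, error) {
	var batches [][]Input
	var cur []Input
	for i := 0; i < len(inputs); {
		group := 1 + inputs[i].SameBatch
		if i+group > len(inputs) {
			group = len(inputs) - i // clamp a group that runs past the end
		}
		if group > batchSize {
			return nil, fmt.Errorf("group of %d inputs exceeds batch size %d", group, batchSize)
		}
		if len(cur)+group > batchSize {
			batches = append(batches, cur)
			cur = nil
		}
		cur = append(cur, inputs[i:i+group]...)
		i += group
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches, nil
}

func main() {
	// Three text tokens, a four-patch image, then one more text token.
	inputs := []Input{
		{Token: 1}, {Token: 2}, {Token: 3},
		{Token: 100, SameBatch: 3}, {Token: 100}, {Token: 100}, {Token: 100},
		{Token: 4},
	}
	batches, err := SplitBatches(inputs, 4)
	if err != nil {
		panic(err)
	}
	for i, b := range batches {
		fmt.Printf("batch %d: %d inputs\n", i, len(b))
	}
}
```

In this example the image patches do not fit after the three text tokens, so the first batch is closed early and the image gets a batch to itself.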
  4. 13 Mar, 2025 2 commits
  5. 12 Mar, 2025 1 commit
  6. 11 Mar, 2025 23 commits
  7. 10 Mar, 2025 1 commit
    • model: Update encoder cache to use multimodal input processing handler · a1cda80b
      Jesse Gross authored
      The encoder cache needs to know the position of images in the input
      stream so that it knows when to delete them. Previously images didn't
      have a position, so we implied one by breaking batches before an
      image and then assuming the image was in the first position. However,
      multimodal objects are now given explicit positions in the input
      stream, so we can use that instead.
      
      Breaking batches was also a way to simulate a cross attention mask
      for mllama. However, given that it only supports a single sequence
      and a single image, this mask doesn't serve any real purpose.
      Removing the batch break does not appear to affect the quality of
      the output.
      
      Most of this is simply moving the input data structures to a new
      package to avoid import cycles.
  8. 08 Mar, 2025 1 commit
  9. 07 Mar, 2025 5 commits
    • additional review comments · 98272fbd
      Jesse Gross authored
    • ml/backend/ggml: create tensor on specific backend · 7bae7fa5
      Michael Yang authored
      some tensors should be created on specific backends to reduce number of
      copies and improve performance
    • ml/backend/ggml: update model loading for hybrid/multi backends · bab6f34d
      Michael Yang authored
      use a similar strategy as llama.cpp for deciding where tensors should be
      allocated. this will be improved later to be aware of usable memory
      before assigning the tensor
    • ollamarunner: Improve multimodal input handling · a7e63b82
      Jesse Gross authored
      Various vision models have different requirements for how they
      receive their inputs. For example:
       - Mllama wants images together with text and the image embeddings
         don't themselves have positions or get stored in the main KV cache
       - Llava-style models feed in embeddings similar to tokens and
         images correspond to a varying number of tokens in the cache.
      
      In addition, the strategy for providing inputs must support batching
      and multiple sequences, which are managed by the runner. At the same
      time, we want to keep data handling fully in the model so that new
      architectures are not bottlenecked by runner code which does not
      understand their particular requirements.
      
      This provides a method for models to edit the input stream so that
      it meets their needs while still being in a format that the runner
      understands. This allows the runner to avoid special processing
      for different models.
      
      In addition, this fixes a regression where non-vision models may
      try to incorrectly interpret images.
    • model: Don't unconditionally add special tokens · b70fc4d5
      Jesse Gross authored
      We sometimes tokenize partial strings. For example, with
      multimodal inputs, we split the input string around the images
      and then tokenize each piece. In these cases, we should only add
      the special tokens on the first piece.
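The behavior described here can be sketched with a toy tokenizer that takes an explicit flag for special tokens. The `encode` and `encodePieces` functions, the `bos` constant, and the word-length "token IDs" are all invented for illustration and do not reflect the real tokenizer.

```go
package main

import (
	"fmt"
	"strings"
)

// bos is a hypothetical beginning-of-sequence token ID.
const bos = -1

// encode is a toy tokenizer: each whitespace-separated word becomes one
// token whose ID is the word's length. When addSpecial is true, the
// beginning-of-sequence token is prepended.
func encode(s string, addSpecial bool) []int {
	var out []int
	if addSpecial {
		out = append(out, bos)
	}
	for _, w := range strings.Fields(s) {
		out = append(out, len(w))
	}
	return out
}

// encodePieces tokenizes the text pieces that surround multimodal
// inputs, adding special tokens only on the first piece.
func encodePieces(pieces []string) []int {
	var out []int
	for i, p := range pieces {
		out = append(out, encode(p, i == 0)...)
	}
	return out
}

func main() {
	// The prompt is assumed to have been split around a single image.
	fmt.Println(encodePieces([]string{"describe this", "in detail"}))
}
```

Passing `true` for every piece would instead insert a beginning-of-sequence token in the middle of the prompt, which is the situation this commit avoids.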
  10. 04 Mar, 2025 1 commit
    • New engine: vision models and auto-fallback (#9113) · 1fdb351c
      Daniel Hiltgen authored
      * Include unified vision layers in memory prediction
      
      For newer vision models with a single gguf, include
      the projection estimates.
      
      * Adjust CLI to handle both styles of vision model metadata
      
      * Wire up new tokenizers for new engine
      
      If we're loading the new engine, utilize the new model
      text processor instead of calling into cgo wrappers for
      llama.cpp.  This also cleans up some tech debt from the
      older tokenization flow for the C++ server which was
      no longer used.
      
      This also adjusts the grammar handling logic to pass
      through to the new engine instead of utilizing the cgo
      schema to grammar call.
      
      * Lay foundation for auto selection of new engine