  1. 22 May, 2025 2 commits
    • ml: Panic rather than return error on tensor allocation failure · 1f371ea9
      Jesse Gross authored
      FromFloatSlice and FromIntSlice return an error if the shape doesn't
      match the passed data or if memory can't be allocated. Since these
      are inputs, the memory being allocated is system memory rather than VRAM.
      
      In many cases, the caller can't really handle the error and panics.
      
      Empty and Zeros directly panic if they can't allocate memory.
      
      This makes things consistent by panicking for the first two cases,
      removing a fair amount of error handling code. This is also consistent
      with how Go typically handles these situations.
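
The change described above can be illustrated with a minimal Go sketch. It is not the project's code: the `Tensor` struct and this `FromFloatSlice` signature are simplified assumptions made up for the example. It only shows the calling convention, in which a shape mismatch panics instead of returning an error that every caller has to handle.

```go
package main

import "fmt"

// Tensor is a hypothetical, simplified stand-in for a real tensor type.
type Tensor struct {
	Shape []int
	Data  []float32
}

// FromFloatSlice copies data into a new Tensor. It panics if the shape does
// not describe exactly len(data) elements, rather than returning an error.
func FromFloatSlice(data []float32, shape ...int) *Tensor {
	n := 1
	for _, d := range shape {
		n *= d
	}
	if n != len(data) {
		panic(fmt.Sprintf("invalid shape %v for %d elements", shape, len(data)))
	}
	return &Tensor{
		Shape: append([]int(nil), shape...),
		Data:  append([]float32(nil), data...),
	}
}
```

A caller then writes `x := FromFloatSlice(values, 2, 3)` with no error branch. A caller that does want to survive a bad shape can still wrap the call in a function with a deferred `recover`.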
    • fix: mllama quality (#10807) · adff143b
      Michael Yang authored
      * fix mllama convert
      
      - transform attn_gate and ffn_gate
      - swap attention heads for vision models
      
      * fix mllama
      
      the mlp gate was applied in the wrong place
  2. 21 May, 2025 3 commits
    • feat: port qwen2 model (#10782) · c8900113
      Michael Yang authored
    • feat: qwen3 dense and sparse models (#10708) · e0ed984c
      Michael Yang authored
      * feat: qwen3 dense
      * feat: qwen3moe
      * fix llama4 moe
    • fix: qwen25vl assign samebatch in multimodal input (#10789) · 69b2fe92
      Michael Yang authored
      setting samebatch on the vision start token is problematic because it
      will be shared with other inputs that also use images. this will cause
      the input to be cached and the runner will not see SameBatch. SameBatch
      will also be incorrect since it may be for a different image.
      
      assigning samebatch to the input tokens resolves this by ensuring it's
      assigned correctly to inputs corresponding to the image.
      
      not setting same batch correctly may cause panics during inference since
      images are no longer guaranteed to be in the same batch.
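
The fix described above can be sketched as follows. Everything here is hypothetical: the `Input` struct, the token IDs, and the `ExpandImage` helper are invented for illustration, and `SameBatch` is assumed to mean "the number of following inputs that must share this input's batch". The point is only that the constraint sits on the first input that belongs to the image, not on the vision start token, which is identical for every image and may therefore be served from the cache.

```go
package main

// Input is a hypothetical, simplified stand-in for the runner's input type.
type Input struct {
	Token          int32
	MultimodalHash uint64 // identifies the image this input belongs to
	SameBatch      int    // assumed: count of following inputs pinned to this batch
}

// Token IDs are made up for this example.
const (
	visionStart int32 = 1
	imagePad    int32 = 2
	visionEnd   int32 = 3
)

// ExpandImage returns the inputs for one image made of `patches` patches.
// The vision start token carries no SameBatch; the first image input does,
// covering the remaining placeholders and the vision end token.
func ExpandImage(hash uint64, patches int) []Input {
	if patches < 1 {
		panic("patches must be at least 1")
	}
	out := make([]Input, 0, patches+2)
	out = append(out, Input{Token: visionStart})
	out = append(out, Input{Token: imagePad, MultimodalHash: hash, SameBatch: patches})
	for i := 1; i < patches; i++ {
		out = append(out, Input{Token: imagePad})
	}
	out = append(out, Input{Token: visionEnd})
	return out
}
```

Because the input that carries `SameBatch` also carries the image hash, it differs from one image to the next, so a cached prefix from another image cannot hide it from the runner.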
  3. 20 May, 2025 1 commit
  4. 19 May, 2025 1 commit
  5. 16 May, 2025 1 commit
  6. 15 May, 2025 2 commits
    • ollamarunner: Separate text and multimodal graphs · 3c14461d
      Jesse Gross authored
      For some multimodal models (such as gemma3), we create a single
      graph that generates the image embedding and then use this in the
      text model. The embedding tensor is completely opaque to the runner.
      
      However, this doesn't work if we need to use the embedding in multiple
      batches. This can arise if the embedding is larger than the batch size.
      In these cases (as with llama4), we would like to create views that
      are more appropriately sized. However, if we do this then the original
      source tensor is used in multiple graphs, which isn't allowed. To
      avoid that problem, models with this pattern compute the embedding
      tensor on first use and recreate the individual views. There is no
      longer a single vision and text graph.
      
      This codifies the pattern of separating vision and text graphs. The
      logic of computing tensors on demand is moved to the runner, so models
      no longer have to worry about this. It also gives the runner visibility
      into the multimodal tensors, which is important for memory management.
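
The "compute on first use, then hand out views" pattern described above can be sketched in Go. This is a toy model, not the runner's implementation: `LazyEmbedding`, `NewLazyEmbedding`, and `View` are hypothetical names, and plain slices stand in for tensors and graph execution.

```go
package main

// LazyEmbedding is a hypothetical holder for an image embedding that is
// computed once and then sliced into per-batch views.
type LazyEmbedding struct {
	compute  func() [][]float32 // stands in for running the vision graph
	rows     [][]float32        // one row per embedded token
	computed bool
}

// NewLazyEmbedding stores the compute function without running it.
func NewLazyEmbedding(compute func() [][]float32) *LazyEmbedding {
	return &LazyEmbedding{compute: compute}
}

// View returns up to n rows starting at offset. The embedding is computed on
// the first call only; later calls reuse the stored rows.
func (e *LazyEmbedding) View(offset, n int) [][]float32 {
	if !e.computed {
		e.rows = e.compute()
		e.computed = true
	}
	if offset < 0 || offset > len(e.rows) || n < 0 {
		panic("view out of range")
	}
	end := offset + n
	if end > len(e.rows) {
		end = len(e.rows)
	}
	return e.rows[offset:end]
}
```

With a batch size of 3 and a 5-row embedding, `View(0, 3)` and `View(3, 3)` give one view per batch while the vision computation runs only once. The sketch is not safe for concurrent use; a real implementation would need its own synchronization and memory accounting.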
    • fix pixel values padding (#10718) · ef202789
      Michael Yang authored
      * panic if trying to pad 4d
      
      * fix pixel values padding
  7. 14 May, 2025 2 commits
  8. 13 May, 2025 1 commit
  9. 12 May, 2025 1 commit
  10. 26 Apr, 2025 1 commit
  11. 25 Apr, 2025 6 commits
  12. 24 Apr, 2025 1 commit
  13. 18 Apr, 2025 1 commit
  14. 03 Apr, 2025 2 commits
  15. 02 Apr, 2025 1 commit
  16. 20 Mar, 2025 3 commits
    • model: Pass input tensor instead of raw data to models · 0fbfcf3c
      Jesse Gross authored
      Rather than directly giving the input data to models, we can
      pass a tensor instead. In the short term, this saves some duplicated
      code.
      
      Longer term, we will want to overlap setting up the next batch with
      processing of the current one. In this case, we will only have the
      shape of the tensor but it will not be loaded with data at the time of
      graph generation. By passing only a tensor to models now, we set up
      this possibility and prevent them from relying on data that they won't
      have in the future.
      
      Although the same could be done for Positions and Outputs, in some
      cases we either need the raw input data or don't use them at all.
      Therefore, for now we leave them as they are and allow models to
      convert them to tensors as needed.
    • input: Rename Options to Batch · 0c220935
      Jesse Gross authored
      Options is no longer very descriptive of this struct.
    • gemma2: Remove second call to Rows · b078dd15
      Jesse Gross authored
      Looks like a merge conflict that broke the model.
  17. 19 Mar, 2025 1 commit
  18. 14 Mar, 2025 2 commits
    • ollamarunner: Use a separate context per multimodal input · 282bfaaa
      Jesse Gross authored
      Currently there is a single context per sequence, shared by
      all multimodal inputs. Since we build a vision encoder graph per
      image, with a large number of inputs we can eventually hit the
      maximum number of graph nodes per context.
      
      This changes to use a separate context for each image, ensuring
      that available resource limits are consistent.
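
The resource argument above can be made concrete with a toy model. The `Context` type, its node budget, and both encode functions are hypothetical and greatly simplified; they only illustrate why a shared context eventually overflows while per-image contexts see the same limit every time.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrTooManyNodes reports that a context's node budget is exhausted.
var ErrTooManyNodes = errors.New("graph node limit exceeded")

// Context is a hypothetical compute context with a fixed node budget.
type Context struct {
	maxNodes int
	used     int
}

// AddNodes reserves n graph nodes or fails if the budget would be exceeded.
func (c *Context) AddNodes(n int) error {
	if c.used+n > c.maxNodes {
		return ErrTooManyNodes
	}
	c.used += n
	return nil
}

// EncodeShared builds every image's graph in one context, so usage accumulates.
func EncodeShared(images, nodesPerImage, maxNodes int) error {
	ctx := &Context{maxNodes: maxNodes}
	for i := 0; i < images; i++ {
		if err := ctx.AddNodes(nodesPerImage); err != nil {
			return fmt.Errorf("image %d: %w", i, err)
		}
	}
	return nil
}

// EncodePerImage gives each image a fresh context with the full budget.
func EncodePerImage(images, nodesPerImage, maxNodes int) error {
	for i := 0; i < images; i++ {
		ctx := &Context{maxNodes: maxNodes}
		if err := ctx.AddNodes(nodesPerImage); err != nil {
			return fmt.Errorf("image %d: %w", i, err)
		}
	}
	return nil
}
```

With a budget of 1000 nodes and 300 nodes per image, the shared variant fails on the fourth image, while the per-image variant succeeds for any number of images.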
    • ml: Allow models to constrain inputs to a single batch · 9679f401
      Jesse Gross authored
      Models may require that a set of inputs all be processed as part
      of the same batch. For example, if an image has multiple patches
      with fully connected attention between them, we should not split
      the batch in the middle of an image.
      
      Fixes #9697
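
A batch splitter that honors such a constraint might look like the sketch below. It is an illustration under stated assumptions, not the runner's scheduler: the `Input` struct and `SplitBatches` are hypothetical, and `SameBatch` is assumed to be the number of following inputs that must land in the same batch as the input that carries it.

```go
package main

// Input is a hypothetical, simplified input token.
type Input struct {
	Token     int32
	SameBatch int // assumed: count of following inputs pinned to this batch
}

// SplitBatches packs inputs into batches of at most batchSize entries and
// never splits a group pinned together by SameBatch. It panics if a single
// group cannot fit in one batch.
func SplitBatches(inputs []Input, batchSize int) [][]Input {
	var batches [][]Input
	var cur []Input
	for i := 0; i < len(inputs); {
		// A group is this input plus the inputs it pins to its batch.
		n := 1 + inputs[i].SameBatch
		if i+n > len(inputs) {
			n = len(inputs) - i
		}
		if n > batchSize {
			panic("SameBatch group is larger than the batch size")
		}
		// Start a new batch early rather than split the group.
		if len(cur)+n > batchSize {
			batches = append(batches, cur)
			cur = nil
		}
		cur = append(cur, inputs[i:i+n]...)
		i += n
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}
```

For six inputs where the third carries `SameBatch: 2` and the batch size is 4, the first batch is closed after two inputs so that the pinned group of three stays together in the second batch.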
  19. 13 Mar, 2025 1 commit
  20. 12 Mar, 2025 1 commit
  21. 11 Mar, 2025 6 commits