1. 22 May, 2025 2 commits
    •
      ml: Panic rather than return error on tensor allocation failure · 1f371ea9
      Jesse Gross authored
      FromFloatSlice and FromIntSlice return an error if the shape doesn't
      match the passed data or if memory can't be allocated. Since these
      are inputs, the memory being allocated is system memory rather than VRAM.
      
      In many cases, the caller can't really handle the error and panics.
      
      Empty and Zeros directly panic if they can't allocate memory.
      
      This makes things consistent by panicking for the first two cases,
      removing a fair amount of error handling code. This is also consistent
      with how Go typically handles these situations.
    •
      fix: mllama quality (#10807) · adff143b
      Michael Yang authored
      * fix mllama convert
      
      - transform attn_gate and ffn_gate
      - swap attention heads for vision models
      
      * fix mllama
      
      the mlp gate which was applied in the wrong place
  2. 21 May, 2025 3 commits
    •
      feat: port qwen2 model (#10782) · c8900113
      Michael Yang authored
    •
      feat: qwen3 dense and sparse models (#10708) · e0ed984c
      Michael Yang authored
      * feat: qwen3 dense
      * feat: qwen3moe
      * fix llama4 moe
    •
      fix: qwen25vl assign samebatch in multimodal input (#10789) · 69b2fe92
      Michael Yang authored
      setting samebatch on the vision start token is problematic because it
      will be shared with other inputs that also use images. this will cause
      the input to be cached and the runner will not see SameBatch. SameBatch
      will also be incorrect since it may be for a different image.
      
      assigning samebatch to the input tokens resolves this by ensuring it's
      assigned correctly to inputs corresponding to the image.
      
      not setting SameBatch correctly may cause panics during inference since
      images are no longer guaranteed to be in the same batch.
  3. 20 May, 2025 1 commit
  4. 19 May, 2025 2 commits
    •
      fix llama and mistral3 models (#10774) · ff180c34
      Michael Yang authored
      * fix llama model
      
      * fix mistral3.1 model
      
      do not set default vision layers
    •
      ggml: Separate tensor load from backend creation · 94ab428e
      Jesse Gross authored
      Currently, when the backend is created, the tensors are loaded at the
      same time, which is a slow operation. This separates them to be two
      steps:
       - Create backend, including enumerating tensors and memory allocation
       - Loading tensor data
      
      This allows more flexibility in managing model loading.
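The two-step split described above can be sketched as a cheap constructor followed by an explicit, slow load phase. This is a hypothetical interface to show the shape of the change, not ollama's actual backend API:

```go
package main

import "fmt"

// Backend models the two-phase pattern: construction enumerates tensors
// and reserves memory, while LoadTensors streams in the (slow) tensor
// data later. Field and method names are illustrative.
type Backend struct {
	reserved map[string]int // tensor name -> expected byte size
	loaded   bool
}

// NewBackend is the cheap step: discover tensors and allocate space.
func NewBackend(sizes map[string]int) *Backend {
	return &Backend{reserved: sizes}
}

// LoadTensors is the expensive step, now decoupled so callers can
// schedule it independently of backend creation.
func (b *Backend) LoadTensors(read func(name string) []byte) error {
	for name, size := range b.reserved {
		if got := len(read(name)); got != size {
			return fmt.Errorf("%s: expected %d bytes, got %d", name, size, got)
		}
	}
	b.loaded = true
	return nil
}

func main() {
	b := NewBackend(map[string]int{"blk.0.attn": 8})
	err := b.LoadTensors(func(string) []byte { return make([]byte, 8) })
	fmt.Println(err == nil && b.loaded) // true
}
```

Splitting the phases lets a caller validate memory layout first and defer, retry, or parallelize the data load.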
  5. 16 May, 2025 1 commit
  6. 15 May, 2025 2 commits
    •
      ollamarunner: Separate text and multimodal graphs · 3c14461d
      Jesse Gross authored
      For some multimodal models (such as gemma3), we create a single
      graph that generates the image embedding and then use this in the
      text model. The embedding tensor is completely opaque to the runner.
      
      However, this doesn't work if we need to use the embedding in multiple
      batches. This can arise if the embedding is larger than the batch size.
      In these cases (as with llama4), we would like to create views that
      are more appropriately sized. However, if we do this then the original
      source tensor is used in multiple graphs, which isn't allowed. To
      avoid that problem, models with this pattern compute the embedding
      tensor on first use and recreate the individual views. There is no
      longer a single vision and text graph.
      
      This codifies the pattern of separating vision and text graphs. The
      logic of computing tensors on demand is moved to the runner, so models
      no longer have to worry about this. It also gives the runner visibility
      into the multimodal tensors, which is important for memory management.
    •
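The compute-on-first-use pattern the runner now owns can be sketched with `sync.Once`: the embedding is generated once, then appropriately sized views are cut from it for each batch. A minimal sketch with hypothetical types:

```go
package main

import (
	"fmt"
	"sync"
)

// lazyEmbedding computes an image embedding on first use and reuses it
// across later batches. Types and names are illustrative only.
type lazyEmbedding struct {
	once    sync.Once
	compute func() []float32 // runs the vision graph
	value   []float32
}

func (l *lazyEmbedding) get() []float32 {
	l.once.Do(func() { l.value = l.compute() })
	return l.value
}

// view returns an appropriately sized slice for one batch, so an
// embedding larger than the batch size can span several batches.
func (l *lazyEmbedding) view(offset, n int) []float32 {
	return l.get()[offset : offset+n]
}

func main() {
	calls := 0
	emb := &lazyEmbedding{compute: func() []float32 {
		calls++
		return []float32{0, 1, 2, 3, 4, 5}
	}}
	a := emb.view(0, 3) // first batch triggers the computation
	b := emb.view(3, 3) // second batch reuses the cached embedding
	fmt.Println(len(a), len(b), calls) // 3 3 1
}
```

Keeping this logic in the runner, rather than in each model, also gives the runner the visibility into multimodal tensors that the commit message calls out.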
      fix pixel values padding (#10718) · ef202789
      Michael Yang authored
      * panic if trying to pad 4d
      
      * fix pixel values padding
  7. 14 May, 2025 2 commits
  8. 13 May, 2025 1 commit
  9. 12 May, 2025 2 commits
  10. 26 Apr, 2025 1 commit
  11. 25 Apr, 2025 6 commits
  12. 24 Apr, 2025 1 commit
  13. 18 Apr, 2025 1 commit
  14. 08 Apr, 2025 1 commit
    •
      ollamarunner: Preallocate worst case graph at startup · dbb149e6
      Jesse Gross authored
      Currently, the KV cache and graph are lazily allocated as needed.
      The cache is fully allocated on first use of the corresponding
      layer whereas the graph grows with the size of the context.
      
      This can be an issue if another application allocates more VRAM
      after we do our calculations - Ollama will crash in the middle of
      inference. If we instead allocate the maximum needed memory at
      startup of the runner, we will either succeed or fail at that point
      rather than at some surprising time in the future.
      
      Currently, this only generates a worst case batch for text, which
      means that vision models may get a partial allocation and continue
      to lazily allocate the rest.
  15. 03 Apr, 2025 2 commits
  16. 02 Apr, 2025 1 commit
  17. 21 Mar, 2025 1 commit
  18. 20 Mar, 2025 3 commits
    •
      model: Pass input tensor instead of raw data to models · 0fbfcf3c
      Jesse Gross authored
      Rather than directly giving the input data to models, we can
      pass a tensor instead. In the short term, this saves some duplicated
      code.
      
      Longer term, we will want to overlap setting up the next batch with
      processing of the current one. In this case, we will only have the
      shape of tensor but it will not be loaded with data at the time of
      graph generation. By passing only a tensor to models now, we set up
      this possibility and prevent them from relying on data that they won't
      have in the future.
      
      Although the same could be done for Positions and Outputs, in some
      cases we either need the raw input data or don't use them at all.
      Therefore, for now we leave them as they are and allow models to
      convert them to tensors as needed.
    •
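The key discipline this change enforces is that models may use a tensor's shape at graph-build time but must not touch its data, since in the overlapped-setup future the data won't be there yet. A hypothetical sketch of that contract:

```go
package main

import "fmt"

// Tensor carries a shape that is always known at graph-build time;
// data may be filled in later (e.g. while the previous batch runs).
// This is an illustrative stand-in, not ollama's ml.Tensor interface.
type Tensor struct {
	shape []int
	data  []float32 // may be nil during graph construction
}

func (t *Tensor) Shape() []int { return t.shape }

// buildGraph shows a model using only the tensor's shape, never its
// data, which keeps it correct once batch setup overlaps computation.
func buildGraph(inputs *Tensor) string {
	return fmt.Sprintf("embedding node over %v input tokens", inputs.Shape())
}

func main() {
	// During overlapped setup the data is not yet loaded:
	pending := &Tensor{shape: []int{4}}
	fmt.Println(buildGraph(pending))
}
```

Positions and Outputs are exempt for now, as the commit notes, because some models still need their raw values on the CPU side.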
      input: Rename Options to Batch · 0c220935
      Jesse Gross authored
      Options is no longer very descriptive of this struct.
    •
      gemma2: Remove second call to Rows · b078dd15
      Jesse Gross authored
      Looks like a merge conflict that broke the model.
  19. 19 Mar, 2025 1 commit
  20. 14 Mar, 2025 2 commits
    •
      ollamarunner: Use a separate context per multimodal input · 282bfaaa
      Jesse Gross authored
      Currently there is a single context per sequence, shared by
      all multimodal inputs. Since we build a vision encoder graph per
      image, with a large number of inputs we can eventually hit the
      maximum number of graph nodes per context.
      
      This changes to use a separate context for each image, ensuring
      that available resource limits are consistent.
    •
      ml: Allow models to constrain inputs to a single batch · 9679f401
      Jesse Gross authored
      Models may require that a set of inputs all be processed as part
      of the same batch. For example, if an image has multiple patches
      with fully connected attention between them, we should not split
      the batch in the middle of an image.
      
      Fixes #9697
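The batching constraint can be sketched as a splitter that respects a per-input "keep the next N inputs with me" marker, so a run of image patches never straddles a batch boundary. Field and function names below are illustrative, not ollama's actual input types:

```go
package main

import "fmt"

// input carries sameBatch: the number of following inputs that must be
// processed in the same batch as this one (e.g. the patches of one
// image with fully connected attention between them).
type input struct {
	token     int
	sameBatch int
}

// splitBatches chunks inputs into batches of at most n, but never
// splits inside a sameBatch run.
func splitBatches(inputs []input, n int) [][]input {
	var batches [][]input
	var cur []input
	for i := 0; i < len(inputs); {
		run := 1 + inputs[i].sameBatch // this input plus its bound followers
		if len(cur) > 0 && len(cur)+run > n {
			batches = append(batches, cur)
			cur = nil
		}
		cur = append(cur, inputs[i:i+run]...)
		i += run
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches
}

func main() {
	// Token 2 binds the two inputs after it into its batch.
	ins := []input{{token: 1}, {token: 2, sameBatch: 2}, {token: 3}, {token: 4}, {token: 5}}
	for _, b := range splitBatches(ins, 3) {
		fmt.Println(len(b))
	}
	// batch sizes: 1, 3, 1 — the sameBatch run of three stays together
}
```

A run larger than the batch size is kept whole here, matching the "must be in one batch" semantics rather than silently splitting it.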
  21. 13 Mar, 2025 2 commits
  22. 12 Mar, 2025 1 commit
  23. 11 Mar, 2025 1 commit