1. 05 Aug, 2025 1 commit
    • gpt-oss (#11672) · fa7776fd
      Michael Yang authored
      
      
      * bf16
      
      * tests
      
      * gpt-oss
      
      * enable gptoss for engine
      
      * rough estimate
      
      * convert to mxfp4
      
      * handle safetensors U8
      
      * clamp glu/linear
      
      * update tokenizer
      
      * MXFP4 support
      
      This implements the Open Compute Microscaling (MX) FP4 format
      as a tensor type with backend implementations focusing
      on mulmat and mulmatid on CPU, CUDA, and Metal.
      
      * Unit tests for MXFP4 support
      
      This exercises various operations and shapes on both CPU and GPU (if detected
      on the system)
      
      * cuda graph
      
      * unit test adjustments
      
      * cuda: optimize memory access
      
      Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
      
      * mac: fix crash on old macos versions
      
      cblas_sgemm is only supported on v13.3 and up, but bf16 is
      only supported on v14+, so we were falling back to ggml-blas and
      crashing on bf16 tensors. Checking whether the function is null
      seems to be the simplest way to conditionally avoid registering the
      backend.
      
      * server: Minimum context length for gptoss
      
      This model requires a minimum context length of 8192 to function
      effectively. Users can set higher values through all normal mechanisms
      but lower values will be silently reset.
      
      * ggml: Multiply by numParallel for gptoss sliding window
      
      When computing the graph size estimate, the context size is already
      multiplied by numParallel so estimates reflect that. However, since
      sliding window models use a smaller, fixed context size, they need
      to manually take numParallel into account.
      
      * gpt-oss integration
      
      includes harmony parser and thinking levels, etc.
      
      * fix sync
      
      * fix tests
      
      * fix lint
      
      ---------
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
      Co-authored-by: Jesse Gross <jesse@ollama.com>
      Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
      fa7776fd
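As a rough illustration of the MXFP4 format mentioned in the commit above, the following Go sketch dequantizes a single block. It relies on the Open Compute MX definition (32 FP4 E2M1 elements sharing one E8M0 scale, i.e. a power of two with bias 127). The names `decodeFP4` and `dequantizeBlock` are hypothetical, and the nibble order inside each byte is an assumption made for this sketch; it is not taken from the commit and may differ from the layout the backends actually use. The reserved NaN scale code (255) is not handled.

```go
package main

import (
	"fmt"
	"math"
)

// e2m1 lists the eight non-negative magnitudes an FP4 (E2M1) code can encode.
var e2m1 = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

// decodeFP4 converts a 4-bit code (1 sign bit + 3-bit magnitude index) to float32.
func decodeFP4(code byte) float32 {
	v := e2m1[code&0x7]
	if code&0x8 != 0 {
		return -v
	}
	return v
}

// dequantizeBlock expands one 32-element MX block (hypothetical helper).
// scale is the shared E8M0 exponent and represents 2^(scale-127).
// Assumption: element 2*i sits in the low nibble of packed[i] and
// element 2*i+1 in the high nibble; real implementations may pack differently.
func dequantizeBlock(scale byte, packed [16]byte) [32]float32 {
	s := float32(math.Ldexp(1, int(scale)-127))
	var out [32]float32
	for i, b := range packed {
		out[2*i] = decodeFP4(b&0x0F) * s
		out[2*i+1] = decodeFP4(b>>4) * s
	}
	return out
}

func main() {
	var packed [16]byte
	packed[0] = 0x7A // low nibble 0xA encodes -1.0, high nibble 0x7 encodes 6.0
	out := dequantizeBlock(128, packed) // scale code 128 means 2^1
	fmt.Println(out[0], out[1])
}
```

Each block costs 17 bytes for 32 values, which is where the format's compactness comes from; the shared scale means a block can only represent values within a narrow dynamic range of each other.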
  2. 29 Jul, 2025 1 commit
  3. 11 Jul, 2025 1 commit
  4. 27 Jun, 2025 1 commit
  5. 26 Jun, 2025 1 commit
  6. 11 Jun, 2025 1 commit
  7. 22 May, 2025 2 commits
    • ml: Panic rather than return error on tensor allocation failure · 1f371ea9
      Jesse Gross authored
      FromFloatSlice and FromIntSlice return an error if the shape doesn't
      match the passed data or if memory can't be allocated. Since these
      are inputs, the memory being allocated is system memory rather than VRAM.
      
      In many cases, the caller can't really handle the error and panics.
      
      Empty and Zeros directly panic if they can't allocate memory.
      
      This makes things consistent by panicking for the first two cases,
      removing a fair amount of error handling code. This is also consistent
      with how Go typically handles these situations.
      1f371ea9
    • fix: mllama quality (#10807) · adff143b
      Michael Yang authored
      * fix mllama convert
      
      - transform attn_gate and ffn_gate
      - swap attention heads for vision models
      
      * fix mllama
      
      the mlp gate was applied in the wrong place
      adff143b
  8. 21 May, 2025 3 commits
    • feat: port qwen2 model (#10782) · c8900113
      Michael Yang authored
      c8900113
    • feat: qwen3 dense and sparse models (#10708) · e0ed984c
      Michael Yang authored
      * feat: qwen3 dense
      * feat: qwen3moe
      * fix llama4 moe
      e0ed984c
    • fix: qwen25vl assign samebatch in multimodal input (#10789) · 69b2fe92
      Michael Yang authored
      setting samebatch on the vision start token is problematic because it
      will be shared with other inputs that also use images. this will cause
      the input to be cached and the runner will not see SameBatch. SameBatch
      will also be incorrect since it may be for a different image.
      
      assigning samebatch to the input tokens resolves this by ensuring it's
      assigned correctly to inputs corresponding to the image.
      
      not setting same batch correctly may cause panics during inference since
      images are no longer guaranteed to be in the same batch.
      69b2fe92
  9. 20 May, 2025 1 commit
  10. 19 May, 2025 1 commit
  11. 16 May, 2025 1 commit
  12. 15 May, 2025 2 commits
    • ollamarunner: Separate text and multimodal graphs · 3c14461d
      Jesse Gross authored
      For some multimodal models (such as gemma3), we create a single
      graph that generates the image embedding and then use this in the
      text model. The embedding tensor is completely opaque to the runner.
      
      However, this doesn't work if we need to use the embedding in multiple
      batches. This can arise if the embedding is larger than the batch size.
      In these cases (as with llama4), we would like to create views that
      are more appropriately sized. However, if we do this then the original
      source tensor is used in multiple graphs, which isn't allowed. To
      avoid that problem, models with this pattern compute the embedding
      tensor on first use and recreate the individual views. There is no
      longer a single vision and text graph.
      
      This codifies the pattern of separating vision and text graphs. The
      logic of computing tensors on demand is moved to the runner, so models
      no longer have to worry about this. It also gives the runner visibility
      into the multimodal tensors, which is important for memory management.
      3c14461d
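The "compute on first use, then hand out views" pattern described in this commit can be sketched in Go as follows. Everything here is hypothetical: `lazyEmbedding`, its `compute` callback (a stand-in for running the vision graph), and the use of plain float slices instead of backend tensors are assumptions for illustration, not the runner's actual types.

```go
package main

import "fmt"

// lazyEmbedding is a hypothetical holder for one image's embedding.
type lazyEmbedding struct {
	compute func() []float32 // stand-in for running the vision graph
	dim     int              // number of values per embedding row
	data    []float32
	ready   bool
}

// view returns rows [start, start+n) of the embedding, running the
// vision computation only the first time any view is requested.
func (e *lazyEmbedding) view(start, n int) []float32 {
	if !e.ready {
		e.data = e.compute()
		e.ready = true
	}
	return e.data[start*e.dim : (start+n)*e.dim]
}

func main() {
	runs := 0
	emb := &lazyEmbedding{dim: 2, compute: func() []float32 {
		runs++
		return []float32{0, 1, 2, 3, 4, 5, 6, 7} // 4 rows of 2 values
	}}
	// A 4-row embedding consumed by two text batches of 2 rows each.
	first := emb.view(0, 2)
	second := emb.view(2, 2)
	fmt.Println(first, second, "vision runs:", runs)
}
```

Keeping this logic in one place, as the commit does by moving it into the runner, means each model only has to say which rows it needs for the current batch.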
    • fix pixel values padding (#10718) · ef202789
      Michael Yang authored
      * panic if trying to pad 4d
      
      * fix pixel values padding
      ef202789
  13. 14 May, 2025 2 commits
  14. 13 May, 2025 1 commit
  15. 12 May, 2025 1 commit
  16. 26 Apr, 2025 1 commit
  17. 25 Apr, 2025 6 commits
  18. 24 Apr, 2025 1 commit
  19. 18 Apr, 2025 1 commit
  20. 03 Apr, 2025 2 commits
  21. 02 Apr, 2025 1 commit
  22. 20 Mar, 2025 3 commits
    • model: Pass input tensor instead of raw data to models · 0fbfcf3c
      Jesse Gross authored
      Rather than directly giving the input data to models, we can
      pass a tensor instead. In the short term, this saves some duplicated
      code.
      
      Longer term, we will want to overlap setting up the next batch with
      processing of the current one. In this case, we will only have the
      shape of the tensor but it will not be loaded with data at the time of
      graph generation. By passing only a tensor to models now, we set up
      this possibility and prevent them from relying on data that they won't
      have in the future.
      
      Although the same could be done for Positions and Outputs, in some
      cases we either need the raw input data or don't use them at all.
      Therefore, for now we leave them as they are and allow models to
      convert them to tensors as needed.
      0fbfcf3c
    • input: Rename Options to Batch · 0c220935
      Jesse Gross authored
      Options is no longer very descriptive of this struct.
      0c220935
    • gemma2: Remove second call to Rows · b078dd15
      Jesse Gross authored
      Looks like a merge conflict that broke the model.
      b078dd15
  23. 19 Mar, 2025 1 commit
  24. 14 Mar, 2025 2 commits
    • ollamarunner: Use a separate context per multimodal input · 282bfaaa
      Jesse Gross authored
      Currently there is a single context per sequence, shared by
      all multimodal inputs. Since we build a vision encoder graph per
      image, with a large number of inputs we can eventually hit the
      maximum number of graph nodes per context.
      
      This changes to use a separate context for each image, ensuring
      that available resource limits are consistent.
      282bfaaa
    • ml: Allow models to constrain inputs to a single batch · 9679f401
      Jesse Gross authored
      Models may require that a set of inputs all be processed as part
      of the same batch. For example, if an image has multiple patches
      with fully connected attention between them, we should not split
      the batch in the middle of an image.
      
      Fixes #9697
      9679f401
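One way to picture the single-batch constraint from this commit is a batch splitter that never cuts through a marked group of inputs. The Go sketch below is an assumption-laden illustration: the `input` struct, its `sameBatch` field (taken here to mean "this many following inputs must share my batch"), and `splitBatches` are hypothetical and are not the runner's actual scheduling code.

```go
package main

import "fmt"

// input is a hypothetical token-level input.
type input struct {
	token     int
	sameBatch int // assumed meaning: number of following inputs that must stay in this input's batch
}

// splitBatches packs inputs into batches of at most batchSize without
// splitting a group. It returns an error if a group cannot fit at all.
func splitBatches(inputs []input, batchSize int) ([][]input, error) {
	var batches [][]input
	var cur []input
	for i := 0; i < len(inputs); {
		group := 1 + inputs[i].sameBatch
		if group < 1 {
			group = 1 // ignore negative markers
		}
		if i+group > len(inputs) {
			group = len(inputs) - i // clamp a group that runs past the end
		}
		if group > batchSize {
			return nil, fmt.Errorf("group of %d inputs exceeds batch size %d", group, batchSize)
		}
		if len(cur)+group > batchSize {
			// Not enough room left: close the current batch first.
			batches = append(batches, cur)
			cur = nil
		}
		cur = append(cur, inputs[i:i+group]...)
		i += group
	}
	if len(cur) > 0 {
		batches = append(batches, cur)
	}
	return batches, nil
}

func main() {
	// Token 3 starts an image whose two following patches must stay with it.
	inputs := []input{{token: 1}, {token: 2}, {token: 3, sameBatch: 2}, {token: 4}, {token: 5}, {token: 6}}
	batches, err := splitBatches(inputs, 4)
	fmt.Println(len(batches), err)
}
```

With a batch size of 4, the first batch is closed early after two tokens so that the three-input image group lands together in the second batch, at the cost of a partially filled first batch.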
  25. 13 Mar, 2025 1 commit
  26. 12 Mar, 2025 1 commit