1. 21 Jan, 2026 1 commit
  2. 05 Aug, 2025 2 commits
    • ggml: Prevent kv cache quantization on gpt-oss · 8253ad4d
      Jesse Gross authored
      KV cache quantization has a dependency on the flash attention kernel.
      We currently cannot use flash attention with gpt-oss as it requires
      additional operations.
      
      The model definition does not call flash attention, so it works
      regardless of the setting, but the cache still picks up the
      quantization type. This change updates the flash attention setting
      earlier in the loading flow so that all downstream settings are also
      set correctly.
      
      Fixes: #11671
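      A minimal Go sketch of the ordering, with hypothetical names
      (resolveFlashAttention and resolveCacheType are illustrative, not the
      actual Ollama functions): flash attention is resolved first and the
      KV cache type is derived from that result, so a model that cannot use
      flash attention never ends up with a quantized cache.

      package main

      import "fmt"

      // Hypothetical helpers mirroring the ordering described above: decide
      // flash attention first, then derive downstream settings from it.
      func resolveFlashAttention(requested, modelSupports bool) bool {
          return requested && modelSupports
      }

      func resolveCacheType(requestedType string, flashAttention bool) string {
          // KV cache quantization depends on the flash attention kernel, so
          // fall back to the default f16 cache whenever it is disabled.
          if !flashAttention {
              return "f16"
          }
          return requestedType
      }

      func main() {
          fa := resolveFlashAttention(true, false)  // gpt-oss: unsupported
          fmt.Println(resolveCacheType("q8_0", fa)) // prints f16
      }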
    • gpt-oss (#11672) · fa7776fd
      Michael Yang authored
      
      
      * bf16
      
      * tests
      
      * gpt-oss
      
      * enable gptoss for engine
      
      * rough estimate
      
      * convert to mxfp4
      
      * handle safetensors U8
      
      * clamp glu/linear
      
      * update tokenizer
      
      * MXFP4 support
      
      This implements the Open Compute Microscaling (MX) FP4 format
      as a tensor type with backend implementations focusing
      on mulmat and mulmatid on CPU, CUDA, and Metal.
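      For reference, a hedged Go sketch of MXFP4 dequantization based on
      the OCP Microscaling spec, assuming 32 FP4 (E2M1) elements packed two
      per byte, low nibble first, plus one shared E8M0 scale per block; the
      actual ggml kernels and data layout may differ.

      package main

      import (
          "fmt"
          "math"
      )

      // e2m1 maps a 4-bit code (sign + E2M1) to its value per the MX spec.
      var e2m1 = [16]float32{
          0, 0.5, 1, 1.5, 2, 3, 4, 6,
          0, -0.5, -1, -1.5, -2, -3, -4, -6,
      }

      // dequantMXFP4Block expands one block: an E8M0 scale byte plus 16
      // bytes holding 32 packed FP4 elements. Sketch only.
      func dequantMXFP4Block(scale byte, packed [16]byte) [32]float32 {
          s := float32(math.Exp2(float64(int(scale) - 127))) // E8M0: 2^(e-127)
          var out [32]float32
          for i, b := range packed {
              out[2*i] = s * e2m1[b&0x0f]
              out[2*i+1] = s * e2m1[b>>4]
          }
          return out
      }

      func main() {
          var packed [16]byte
          packed[0] = 0x31 // low nibble 0x1 -> 0.5, high nibble 0x3 -> 1.5
          vals := dequantMXFP4Block(127, packed)
          fmt.Println(vals[0], vals[1]) // 0.5 1.5 with a 2^0 scale
      }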
      
      * Unit tests for MXFP4 support
      
      This exercises various operations and shapes on both CPU and GPU (if detected
      on the system)
      
      * cuda graph
      
      * unit test adjustments
      
      * cuda: optimize memory access
      
      Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
      
      * mac: fix crash on old macos versions
      
      cblas_sgemm is only supported on macOS v13.3 and up; however, bf16 is
      only supported on v14+, so we were falling back to ggml-blas and
      crashing on bf16 tensors. Checking whether the function is null seems
      to be the simplest way to conditionally avoid registering the backend.
      
      * server: Minimum context length for gptoss
      
      This model requires a minimum context length of 8192 to function
      effectively. Users can set higher values through all normal
      mechanisms, but lower values will be silently reset.
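      A minimal sketch of the floor, with illustrative names rather than
      the actual server code: requests below the minimum are silently
      raised, larger requests pass through unchanged.

      package main

      import "fmt"

      const gptossMinContext = 8192 // minimum context the model needs

      // clampContext silently raises a requested context length to the
      // model minimum. Illustrative only.
      func clampContext(requested int) int {
          if requested < gptossMinContext {
              return gptossMinContext
          }
          return requested
      }

      func main() {
          fmt.Println(clampContext(4096))  // 8192
          fmt.Println(clampContext(16384)) // 16384
      }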
      
      * ggml: Multiply by numParallel for gptoss sliding window
      
      When computing the graph size estimate, the context size is already
      multiplied by numParallel so estimates reflect that. However, since
      sliding window models use a smaller, fixed context size, they need
      to manually take numParallel into account.
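      A hedged sketch of the accounting, with illustrative names and window
      size: the regular context is assumed to already include the
      numParallel multiplier, so a fixed sliding window has to be scaled by
      it explicitly.

      package main

      import "fmt"

      // kvCacheEntries estimates the cache entries a layer contributes to
      // the graph size. numCtx already includes numParallel; a fixed
      // sliding window does not, so it is multiplied here. Sketch only.
      func kvCacheEntries(numCtx, slidingWindow, numParallel int) int {
          if slidingWindow > 0 {
              return slidingWindow * numParallel
          }
          return numCtx
      }

      func main() {
          // 8192-token context, 4 parallel sequences, 128-token window.
          fmt.Println(kvCacheEntries(8192*4, 0, 4))   // 32768
          fmt.Println(kvCacheEntries(8192*4, 128, 4)) // 512
      }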
      
      * gpt-oss integration
      
      includes harmony parser and thinking levels, etc.
      
      * fix sync
      
      * fix tests
      
      * fix lint
      
      ---------
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
      Co-authored-by: Jesse Gross <jesse@ollama.com>
      Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
  3. 04 Aug, 2025 1 commit
    • kvcache: Log contents of cache when unable to find a slot · 0d38b665
      Jesse Gross authored
      There is a bug when using sliding window attention where we run
      out of KV cache slots. This is likely due to not correctly removing
      all of the entries as they slide out of range. This adds additional
      logging when this occurs to track down the source.
      
      Bug #10127
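      A hedged sketch of the kind of diagnostics this adds; the cell layout
      and field names are illustrative, not the actual kvcache types. When
      no slot can be found, it summarizes which sequences occupy the cache
      and how recent their entries are, so entries that should have slid
      out of range stand out.

      package main

      import "log/slog"

      // cell is an illustrative stand-in for a KV cache slot: the sequences
      // using it and the token position it holds.
      type cell struct {
          seqs []int
          pos  int
      }

      // logCacheContents reports per-sequence occupancy when slot
      // allocation fails.
      func logCacheContents(cells []cell) {
          counts := map[int]int{}
          newest := map[int]int{}
          for _, c := range cells {
              for _, s := range c.seqs {
                  counts[s]++
                  if c.pos > newest[s] {
                      newest[s] = c.pos
                  }
              }
          }
          for s, n := range counts {
              slog.Warn("kv cache occupancy", "seq", s, "cells", n, "newestPos", newest[s])
          }
      }

      func main() {
          logCacheContents([]cell{
              {seqs: []int{0}, pos: 40},
              {seqs: []int{0}, pos: 41},
          })
      }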
  4. 31 Jul, 2025 1 commit
    • kvcache: Enable SWA to retain additional entries · 4183bb05
      Jesse Gross authored
      Models that use sliding window attention can only resume a sequence
      from the cache if it falls within the saved windows. This works well
      if the next message picks up where the old one left off. However, it
      generally prevents a partial prefix match unless the entire conversation
      falls within the sliding window.
      
      This can be a problem with reasoning models where the traces are
      supposed to be removed from future messages, forcing the entire
      history to be re-evaluated.
      
      This change allows models to specify that a larger amount of the
      history be retained in memory, allowing partial resumption in more
      cases. It still respects the window that the model was trained on for
      token generation.
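      A minimal sketch of the idea, with illustrative names and numbers:
      eviction uses the larger retained span, resuming from a shared prefix
      only requires the trained window preceding the divergence point to
      still be cached, and the attention mask continues to use the trained
      window.

      package main

      import "fmt"

      // swaCache models the change: window is the trained sliding window
      // (still used for attention masking), retain is the larger span of
      // history actually kept in the cache.
      type swaCache struct {
          window int
          retain int
      }

      // canEvict reports whether a cached position may be dropped once
      // generation has reached curPos; eviction uses the retain bound.
      func (c swaCache) canEvict(pos, curPos int) bool {
          return pos < curPos-c.retain
      }

      // canResume reports whether a sequence of cachedLen tokens sharing a
      // prefixLen-token prefix can be resumed: the window just before the
      // divergence point must still be cached.
      func (c swaCache) canResume(prefixLen, cachedLen int) bool {
          oldestCached := cachedLen - c.retain
          return prefixLen-c.window >= oldestCached
      }

      func main() {
          cache := swaCache{window: 128, retain: 1024}
          fmt.Println(cache.canResume(3000, 3500)) // true: divergence is recent
          fmt.Println(cache.canResume(100, 3500))  // false: needs evicted entries
          fmt.Println(cache.canEvict(10, 2000))    // true: outside retained span
      }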
  5. 30 Jul, 2025 3 commits
  6. 29 Jul, 2025 3 commits
  7. 28 Jul, 2025 1 commit
  8. 27 Jul, 2025 1 commit
  9. 25 Jul, 2025 2 commits
    • kvcache: Group shift operations into batches · 764be748
      Jesse Gross authored
      Currently, when we need to do a shift on the cache, it is one
      RoPE operation on the entire size of the cache (per layer). In
      some cases, this can create a compute graph that is larger than
      the forward pass since the forward pass is working in batches.
      Since we don't consider shifting in our memory estimates, it's
      possible for this to cause a crash if we run out of memory.
      
      By limiting the size of the RoPE calls to batch-size chunks, we
      ensure that the shift will never exceed the size of the forward pass,
      since the forward pass will also contain a RoPE of the same size.
      This does not have a significant impact on performance since RoPE is
      a math operation that is mostly proportional to the size of its
      inputs.
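      A minimal sketch of the chunking; applyRoPEShift stands in for
      building and running the real shift graph. The shift is issued over
      at most batch-size cells at a time, so it never exceeds the size of
      the forward pass.

      package main

      import "fmt"

      // applyRoPEShift stands in for the RoPE operation that rotates a
      // range of cached keys by the shift amount.
      func applyRoPEShift(start, count, shift int) {
          fmt.Printf("shift cells [%d, %d) by %d\n", start, start+count, shift)
      }

      // shiftCache applies the shift over the used portion of the cache in
      // chunks of at most batchSize cells. Sketch only.
      func shiftCache(usedCells, batchSize, shift int) {
          for start := 0; start < usedCells; start += batchSize {
              count := batchSize
              if start+count > usedCells {
                  count = usedCells - start
              }
              applyRoPEShift(start, count, shift)
          }
      }

      func main() {
          shiftCache(2500, 1024, -512) // three chunks: 1024, 1024, 452
      }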
      
      In theory, defrag could have the same issue since it also creates a
      compute graph outside of the forward pass; however, since it only
      performs copies, it does not require any working space.
    • b72e5adb
      Ruyut authored
  10. 24 Jul, 2025 2 commits
  11. 23 Jul, 2025 2 commits
  12. 22 Jul, 2025 2 commits
  13. 20 Jul, 2025 2 commits
  14. 19 Jul, 2025 1 commit
  15. 17 Jul, 2025 5 commits
  16. 16 Jul, 2025 3 commits
  17. 11 Jul, 2025 4 commits
  18. 09 Jul, 2025 1 commit
    • ggml: Report ordinal IDs for AMD GPUs on Windows · 35fda7b4
      Jesse Gross authored
      We don't get valid UUIDs for AMD GPUs on Windows, so the best option
      is to use the ordinal IDs. This brings us in line with what we
      currently do on the Ollama server - the only exception is AMD GPUs on
      Linux, where we fall back to using ordinal IDs. The GGML
      implementation has no fallback, but that case doesn't appear to occur
      for any of the GPUs that we support.
      
      It's also possible that there are collisions between ordinal IDs for
      different libraries - however the only places where we use them are
      AMD on Windows and Metal on Mac, which can never occur on the same
      system.
  19. 08 Jul, 2025 3 commits
    • doc: add MacOS docs (#11334) · 66fb8575
      Daniel Hiltgen authored
      Also removes stale model directory instructions for Windows.
    • Reduce default parallelism to 1 (#11330) · 20c3266e
      Daniel Hiltgen authored
      The current scheduler algorithm of picking parallelism based on
      available VRAM complicates the upcoming dynamic layer memory
      allocation algorithm. This changes the default to 1, with the intent
      going forward that parallelism is explicit and will no longer be
      dynamically determined. Removal of the dynamic logic will come in a
      follow-up.
    • API/CLI context enhancements (#11331) · 34088dbc
      Daniel Hiltgen authored
      * API: expose context size of loaded models
      
      * CLI: add context UX
      
      This adds a column to the ps output to show the model's context size.