1. 13 Aug, 2025 2 commits
    • cuda: leverage JIT for smaller footprint (#11635) · dc5a6454
      Daniel Hiltgen authored
      Prior to this change, our official binaries contained both the JIT PTX code
      and the cubin binary code for our chosen compute capabilities. This change
      switches to compiling only the PTX code and relying on JIT at runtime to
      generate the cubin specific to the user's GPU. The cubins are cached on the
      user's system, so users should only see a small lag on the very first model
      load for a given Ollama release. This also adds the first generation of
      Blackwell GPUs so they aren't reliant on the Hopper PTX.
      
      This change reduces ggml-cuda.dll from 1.2G to 460M.
    • bb71654e
      youzichuan authored
  2. 12 Aug, 2025 2 commits
  3. 11 Aug, 2025 4 commits
  4. 10 Aug, 2025 1 commit
  5. 08 Aug, 2025 3 commits
    • ggml: No-alloc mode · 79f6376f
      Jesse Gross authored
      Callers can set a backend buffer type to be no-alloc, meaning that
      it does not allocate memory for tensors or operations. This can
      be used for calculating memory requirements. Tensors and graphs
      must be recreated with no-alloc set to false before loading data.
      
      Defaults to false for newly created backend buffer types.
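      A minimal sketch of the two-pass flow this enables, using hypothetical names (BufferType, BuildGraph) rather than Ollama's actual API: build the graph once against a no-alloc buffer type to total up the memory each backend would need, then recreate it with no-alloc set to false before loading data.
```go
package main

import "fmt"

// Hypothetical sketch, not Ollama's API: a buffer type that only counts
// bytes when NoAlloc is set, so the same graph-building code can be used
// to measure memory requirements before committing to real allocations.
type BufferType struct {
    NoAlloc   bool
    allocated int64
}

func (bt *BufferType) Alloc(n int64) {
    bt.allocated += n
    if !bt.NoAlloc {
        // a real backend allocation would happen here
    }
}

// BuildGraph stands in for creating all tensors and operations for a model.
func BuildGraph(bt *BufferType) {
    for _, size := range []int64{4096 * 4096 * 2, 4096 * 11008 * 2} {
        bt.Alloc(size)
    }
}

func main() {
    // Pass 1: measure only.
    measure := &BufferType{NoAlloc: true}
    BuildGraph(measure)
    fmt.Printf("graph needs %d bytes\n", measure.allocated)

    // Pass 2: recreate with no-alloc set to false before loading data.
    BuildGraph(&BufferType{NoAlloc: false})
}
```
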
    • ggml: Support closing backends · 756c78cf
      Jesse Gross authored
      In order to iteratively find the best memory allocation, we need to
      be able to free backend memory so we can try again.
    • ggml: Use GGML's typedef'ed pointer types · d7f4f788
      Jesse Gross authored
      For many backend data structures, GGML defines a typedef of a pointer
      type and returns these from functions. In most cases, CGo understands
      that these are interchangeable but some parts of Go (such as generics)
      think they are two different types. We should prefer the form that
      GGML uses.
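      A pure-Go illustration of why this matters (the type names below are illustrative, not the CGo-generated ones): a named pointer type and the underlying pointer type are freely assignable, but generic type inference treats them as distinct.
```go
package main

import "fmt"

type buffer struct{ size int }

// bufferHandle plays the role of a CGo'd typedef such as
// ggml_backend_buffer_type_t: a named type whose underlying type is *buffer.
type bufferHandle *buffer

func typeName[T any](v T) string     { return fmt.Sprintf("%T", v) }
func same[T comparable](a, b T) bool { return a == b }

func main() {
    b := &buffer{size: 1}
    var raw *buffer = b
    var h bufferHandle = b // assignable: the underlying types match

    fmt.Println(typeName(raw)) // *main.buffer
    fmt.Println(typeName(h))   // main.bufferHandle: generics see a different type

    // same(raw, h) would not compile: inference cannot unify *buffer with
    // bufferHandle, which is why consistently using the typedef'ed form
    // keeps generic code simple.
    fmt.Println(same(raw, raw))
}
```
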
  6. 07 Aug, 2025 6 commits
  7. 06 Aug, 2025 7 commits
  8. 05 Aug, 2025 6 commits
    • tools: support anyOf types · 30f8a68c
      Devon Rifkin authored
      As far as we know, gpt-oss is the first model that meaningfully transforms tool
      function definitions in its template. We found that relatively common
      definitions that include `anyOf` were not working because the template
      was assuming that types were always defined via a `type` field.
      
      anyOf allows for fully recursive types, so I exposed a
      `toTypeScriptType()` function to handle this recursive logic in Go and
      keep the templates cleaner. The gpt-oss templates will need to be
      updated to use this.
      
      We should keep building out our function definition support to more
      fully support the parts of json schema that make sense for this use
      case, but in the meantime this will unblock some users (e.g., zed's
      ollama integration w/ gpt-oss). Probably the most urgent is proper array
      support.
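      A rough sketch of the recursion involved; the schema fields and type mappings here are simplified assumptions, not Ollama's actual toTypeScriptType() implementation.
```go
package main

import (
    "fmt"
    "strings"
)

// schema is a simplified JSON Schema fragment: a type may be given either
// via "type" or via "anyOf", and anyOf members can themselves nest anyOf.
type schema struct {
    Type  string
    Items *schema  // element type for arrays
    AnyOf []schema // union of alternatives
}

// toTypeScriptType recursively renders a schema as a TypeScript-style type.
func toTypeScriptType(s schema) string {
    if len(s.AnyOf) > 0 {
        parts := make([]string, 0, len(s.AnyOf))
        for _, alt := range s.AnyOf {
            parts = append(parts, toTypeScriptType(alt))
        }
        return strings.Join(parts, " | ")
    }
    switch s.Type {
    case "string":
        return "string"
    case "number", "integer":
        return "number"
    case "boolean":
        return "boolean"
    case "array":
        if s.Items != nil {
            return toTypeScriptType(*s.Items) + "[]"
        }
        return "any[]"
    default:
        return "any"
    }
}

func main() {
    // e.g. a parameter that is either a string or a list of numbers
    s := schema{AnyOf: []schema{
        {Type: "string"},
        {Type: "array", Items: &schema{Type: "number"}},
    }}
    fmt.Println(toTypeScriptType(s)) // string | number[]
}
```
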
    • win: static link msvc libs (#11612) · e378e334
      Daniel Hiltgen authored
      This should help reduce the runtime dependencies on Windows.
    • gptoss: fix memory calc (#11700) · fcec04bf
      Michael Yang authored
    • docs: add docs for Ollama Turbo (#11687) · ee92ca3e
      Jeffrey Morgan authored
    • ggml: Prevent kv cache quantization on gpt-oss · 8253ad4d
      Jesse Gross authored
      KV cache quantization has a dependency on the flash attention kernel.
      We currently cannot use flash attention with gpt-oss as it requires
      additional operations.
      
      The model definition does not call flash attention, so it works
      regardless of the setting, but the cache will still pick up the
      quantization type. This updates the flash attention setting earlier in
      the loading flow so that all downstream settings are also set correctly.
      
      Fixes: #11671
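      A small illustrative sketch of the ordering dependency (function and parameter names are hypothetical, not Ollama's loader): the cache type can only be quantized once the flash-attention decision has been finalized.
```go
package main

import "fmt"

// resolveCacheType mirrors the dependency described above: a quantized KV
// cache requires the flash attention kernel, so the flash-attention flag
// must be settled before the cache type is derived from it.
func resolveCacheType(flashAttention bool, requested string) string {
    if flashAttention && requested != "" {
        return requested // e.g. "q8_0"
    }
    return "f16" // fall back when flash attention is unavailable
}

func main() {
    // gpt-oss cannot use flash attention yet, so a requested q8_0 cache
    // must quietly fall back to f16.
    fmt.Println(resolveCacheType(false, "q8_0")) // f16
    fmt.Println(resolveCacheType(true, "q8_0"))  // q8_0
}
```
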
    • gpt-oss (#11672) · fa7776fd
      Michael Yang authored
      
      
      * bf16
      
      * tests
      
      * gpt-oss
      
      * enable gptoss for engine
      
      * rough estimate
      
      * convert to mxfp4
      
      * handle safetensors U8
      
      * clamp glu/linear
      
      * update tokenizer
      
      * MXFP4 support
      
      This implements the Open Compute Microscaling (MX) FP4 format as a
      tensor type, with backend implementations focused on mulmat and
      mulmatid on CPU, CUDA, and Metal (a decoding sketch follows this
      commit message).
      
      * Unit tests for MXFP4 support
      
      This exercises various operations and shapes on both CPU and GPU (if detected
      on the system).
      
      * cuda graph
      
      * unit test adjustments
      
      * cuda: optimize memory access
      
      Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
      
      * mac: fix crash on old macos versions
      
      cblas_sgemm is only supported on v13.3 and up; however, bf16 is only
      supported on v14+, so we were falling back to ggml-blas and crashing on
      bf16 tensors. Checking whether the function is null seems to be the
      simplest way to conditionally avoid registering the backend.
      
      * server: Minimum context length for gptoss
      
      This model requires a minimum context length of 8192 to function
      effectively. Users can set higher values through all normal mechanisms
      but lower values will be silently reset.
      
      * ggml: Multiply by numParallel for gptoss sliding window
      
      When computing the graph size estimate, the context size is already
      multiplied by numParallel so estimates reflect that. However, since
      sliding window models use a smaller, fixed context size, they need
      to manually take numParallel into account.
      
      * gpt-oss integration
      
      includes harmony parser and thinking levels, etc.
      
      * fix sync
      
      * fix tests
      
      * fix lint
      
      ---------
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
      Co-authored-by: Jesse Gross <jesse@ollama.com>
      Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
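      A minimal sketch of decoding one MXFP4 block, assuming the layout from the OCP Microscaling spec (one E8M0 scale byte followed by 32 E2M1 values packed two per byte); ggml's actual on-disk layout and nibble order may differ in detail.
```go
package main

import (
    "fmt"
    "math"
)

// e2m1 lookup table: sign bit + 2 exponent bits + 1 mantissa bit gives the
// magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6} in both signs.
var e2m1 = [16]float32{
    0, 0.5, 1, 1.5, 2, 3, 4, 6,
    0, -0.5, -1, -1.5, -2, -3, -4, -6,
}

// decodeBlock expands a 17-byte MXFP4 block into 32 float32 values.
func decodeBlock(block [17]byte) [32]float32 {
    // E8M0 shared scale: a pure power of two, 2^(e-127).
    scale := float32(math.Exp2(float64(int(block[0]) - 127)))

    var out [32]float32
    for i := 0; i < 16; i++ {
        b := block[1+i]
        out[2*i] = e2m1[b&0x0F] * scale   // low nibble first (assumed order)
        out[2*i+1] = e2m1[b>>4] * scale   // high nibble second
    }
    return out
}

func main() {
    var block [17]byte
    block[0] = 127               // scale = 2^0 = 1.0
    block[1] = 0x2 | (0x9 << 4)  // elements: +1.0 and -0.5
    vals := decodeBlock(block)
    fmt.Println(vals[0], vals[1]) // 1 -0.5
}
```
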
  9. 04 Aug, 2025 1 commit
    • kvcache: Log contents of cache when unable to find a slot · 0d38b665
      Jesse Gross authored
      There is a bug when using sliding window attention where we run
      out of KV cache slots. This is likely due to not correctly removing
      all of the entries as they slide out of range. This adds additional
      logging when this occurs to track down the source.
      
      Bug #10127
  10. 31 Jul, 2025 1 commit
    • kvcache: Enable SWA to retain additional entries · 4183bb05
      Jesse Gross authored
      Models that use sliding window attention can only resume a sequence
      from the cache if it falls within the saved windows. This works well
      if the next message picks up where the old one left off. However, it
      generally prevents a partial prefix match unless the entire conversation
      falls within the sliding window.
      
      This can be a problem with reasoning models where the traces are
      supposed to be removed from future messages, forcing the entire
      history to be re-evaluated.
      
      This change allows models to specify that a larger amount of the
      history be retained in memory, to allow more partial resumption.
      It still respects the window that the model was trained on for
      token generation.
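      A toy illustration of the policy described above (the window sizes are made up, and this is not the actual kvcache code): entries are kept if they fall within a larger retention window so a later request can reuse them as a prefix, but attention still only reaches back over the window the model was trained on.
```go
package main

import "fmt"

const (
    trainedWindow = 4 // sliding window the model attends over
    retainWindow  = 8 // extra history kept for partial prefix reuse
)

func canAttend(pos, cur int) bool  { return cur-pos < trainedWindow }
func shouldKeep(pos, cur int) bool { return cur-pos < retainWindow }

func main() {
    cur := 10
    for pos := 0; pos <= cur; pos++ {
        fmt.Printf("pos %2d: keep=%v attend=%v\n",
            pos, shouldKeep(pos, cur), canAttend(pos, cur))
    }
}
```
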
  11. 30 Jul, 2025 3 commits
  12. 29 Jul, 2025 3 commits
  13. 28 Jul, 2025 1 commit