1. 05 Aug, 2025 1 commit
    • gpt-oss (#11672) · fa7776fd
      Michael Yang authored
      
      
      * bf16
      
      * tests
      
      * gpt-oss
      
      * enable gptoss for engine
      
      * rough estimate
      
      * convert to mxfp4
      
      * handle safetensors U8
      
      * clamp glu/linear
      
      * update tokenizer
      
      * MXFP4 support
      
      This implements the Open Compute Microscaling (MX) FP4 format
      as a tensor type with backend implementations focusing
      on mulmat and mulmatid on CPU, CUDA, and Metal.
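
      As a minimal sketch of the assumed layout, one OCP MXFP4 block packs 32
      FP4 (E2M1) elements behind a single shared E8M0 scale byte (17 bytes per
      block). The Go below is illustrative only; the type and function names
      are not the actual ggml/ollama symbols, and the nibble packing order is
      an assumption.

        package mxfp4

        import "math"

        // e2m1 maps a 3-bit FP4 magnitude code to its value (sign handled separately).
        var e2m1 = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

        // block mirrors the assumed OCP MXFP4 layout: one shared E8M0 scale byte
        // plus 32 packed 4-bit elements, two per byte.
        type block struct {
            scale uint8     // E8M0 exponent with bias 127
            data  [16]uint8 // 32 x FP4 (E2M1), low nibble assumed first
        }

        // dequantize expands one block into 32 float32 values.
        func dequantize(b block) [32]float32 {
            scale := float32(math.Pow(2, float64(b.scale)-127))
            var out [32]float32
            for i, by := range b.data {
                for j, nib := range [2]uint8{by & 0x0F, by >> 4} {
                    v := e2m1[nib&0x7]
                    if nib&0x8 != 0 {
                        v = -v
                    }
                    out[2*i+j] = v * scale
                }
            }
            return out
        }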
      
      * Unit tests for MXFP4 support
      
      This exercises various operations and shapes on both CPU and GPU (if one
      is detected on the system).
      
      * cuda graph
      
      * unit test adjustments
      
      * cuda: optimize memory access
      
      Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
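
      Eight 4-bit elements fit in each 32-bit load, so the wider read cuts the
      number of memory transactions. A rough Go illustration of the unpacking
      (the real kernel is CUDA; the function name and nibble order are
      assumptions):

        // unpack8 splits one 32-bit load into eight 4-bit element codes,
        // lowest nibble first (packing order assumed).
        func unpack8(word uint32) [8]uint8 {
            var codes [8]uint8
            for i := range codes {
                codes[i] = uint8(word>>(4*uint(i))) & 0x0F
            }
            return codes
        }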
      
      * mac: fix crash on old macos versions
      
      cblas_sgemm is only supported on v13.3 and up; however, bf16 is
      only supported on v14+, so we were falling back to ggml-blas and
      crashing on bf16 tensors. Checking whether the function is null
      seems to be the simplest way to conditionally avoid registering the
      backend.
      
      * server: Minimum context length for gptoss
      
      This model requires a minimum context length of 8192 to function
      effectively. Users can set higher values through all normal mechanisms,
      but lower values will be silently reset to this minimum.
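
      The clamp might look roughly like the sketch below; the constant and
      function names are made up for illustration, not the server's actual
      identifiers.

        // gptossMinContext is the smallest context length gpt-oss works well with.
        const gptossMinContext = 8192

        // effectiveNumCtx silently raises too-small requests to the minimum;
        // anything at or above the minimum passes through unchanged.
        func effectiveNumCtx(requested int) int {
            if requested < gptossMinContext {
                return gptossMinContext
            }
            return requested
        }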
      
      * ggml: Multiply by numParallel for gptoss sliding window
      
      When computing the graph size estimate, the context size is already
      multiplied by numParallel, so estimates reflect that. However, since
      sliding window models use a smaller, fixed context size, they need
      to manually take numParallel into account.
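
      In other words, the sliding-window path needs the multiply that the
      full-context path already gets upstream. A hedged sketch with
      illustrative names:

        // estimateCacheTokens returns the token count used in the graph size
        // estimate. numCtx is assumed to already include the numParallel factor.
        func estimateCacheTokens(numCtx, slidingWindow, numParallel int) int {
            if slidingWindow > 0 && slidingWindow < numCtx {
                // Sliding-window models cap each sequence at a fixed window,
                // so the parallel factor must be applied by hand.
                return slidingWindow * numParallel
            }
            return numCtx // already multiplied by numParallel upstream
        }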
      
      * gpt-oss integration
      
      includes the harmony parser, thinking levels, etc.
      
      * fix sync
      
      * fix tests
      
      * fix lint
      
      ---------
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
      Co-authored-by: Jesse Gross <jesse@ollama.com>
      Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
  2. 23 Jul, 2025 1 commit
  3. 22 Jul, 2025 1 commit
  4. 08 Jul, 2025 2 commits
    • Reduce default parallelism to 1 (#11330) · 20c3266e
      Daniel Hiltgen authored
      The current scheduler algorithm of picking parallelism based on available
      VRAM complicates the upcoming dynamic layer memory allocation algorithm. This
      changes the default to 1, with the intent going forward that parallelism is
      explicit and will no longer be dynamically determined. Removal of the dynamic
      logic will come in a follow-up.
    • API/CLI context enhancements (#11331) · 34088dbc
      Daniel Hiltgen authored
      * API: expose context size of loaded models
      
      * CLI: add context UX
      
      This adds a column to the ps output to show the model's context size.
  5. 27 Jun, 2025 1 commit
  6. 20 Jun, 2025 1 commit
  7. 18 Jun, 2025 2 commits
  8. 12 Jun, 2025 2 commits
  9. 07 Jun, 2025 1 commit
  10. 06 Jun, 2025 1 commit
  11. 05 Jun, 2025 1 commit
  12. 04 Jun, 2025 1 commit
  13. 29 May, 2025 1 commit
    • add thinking support to the api and cli (#10584) · 5f57b0ef
      Devon Rifkin authored
      - Both `/api/generate` and `/api/chat` now accept a `"think"`
        option that allows specifying whether thinking mode should be on or
        not (see the request sketch after this list)
      - Templates get passed this new option so, e.g., qwen3's template can
        put `/think` or `/no_think` in the system prompt depending on the
        value of the setting
      - Models' thinking support is inferred by inspecting model templates.
        The prefix and suffix the parser uses to identify thinking content are
        also automatically inferred from templates
      - Thinking control & parsing is opt-in via the API to prevent breaking
        existing API consumers. If the `"think"` option is not specified, the
        behavior is unchanged from previous versions of ollama
      - Add parsing for thinking blocks in both streaming/non-streaming mode
        in both `/generate` and `/chat`
      - Update the CLI to make use of these changes. Users can pass `--think`
        or `--think=false` to control thinking, or during an interactive
        session they can use the commands `/se...
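
      As a concrete illustration of the request shape, the sketch below sends a
      single /api/chat call with thinking enabled. Only the `"think"` field is
      the new part described above; the rest is the usual chat payload, and the
      model name is just an example.

        package main

        import (
            "bytes"
            "fmt"
            "io"
            "net/http"
        )

        func main() {
            // "think": true opts in to thinking; leaving it out keeps the old behavior.
            body := []byte(`{
                "model": "qwen3",
                "messages": [{"role": "user", "content": "Why is the sky blue?"}],
                "think": true,
                "stream": false
            }`)
            resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
            if err != nil {
                panic(err)
            }
            defer resp.Body.Close()
            out, _ := io.ReadAll(resp.Body)
            fmt.Println(string(out))
        }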
  14. 27 May, 2025 1 commit
  15. 24 May, 2025 1 commit
  16. 23 May, 2025 1 commit
  17. 22 May, 2025 2 commits
    • sched: fix runner leak during reloading unload (#10819) · d950ff12
      Daniel Hiltgen authored
      When the same model is being reloaded rapidly with client connections
      being canceled before the model finishes loading, the queued unload
      event could cause a leak of runners by deleting a different runner from
      the loaded list.
    • server: improve tensor quantization fallback logic (#10806) · fbe6ae28
      Bruce MacDonald authored
      Fall back to alternative quantization types when a tensor's dimensions aren't
      divisible by the block size required by the originally requested quantization
      type. If the retried quantization types also fail, the system ultimately falls
      back to F16 (half-precision floating point), which has a block size of 1 and
      can handle any tensor dimension.
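
      A sketch of that kind of fallback chain, using illustrative block sizes
      (e.g. 256 for K-quants, 32 for Q4_0/Q8_0, 1 for F16); the actual
      candidate order and types live in the server code.

        // blockSize lists elements-per-block for a few illustrative tensor types.
        var blockSize = map[string]int{
            "Q4_K": 256,
            "Q4_0": 32,
            "Q8_0": 32,
            "F16":  1, // divides anything, so it is the final fallback
        }

        // pickQuantType returns the first candidate whose block size divides the
        // tensor's innermost dimension, falling back to F16 otherwise.
        func pickQuantType(dim int, candidates []string) string {
            for _, t := range candidates {
                if bs, ok := blockSize[t]; ok && dim%bs == 0 {
                    return t
                }
            }
            return "F16"
        }

      For example, pickQuantType(736, []string{"Q4_K", "Q4_0"}) would skip Q4_K
      (736 is not a multiple of 256) and settle on Q4_0, while a dimension like
      1000 would fall all the way through to F16.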
  18. 21 May, 2025 1 commit
  19. 19 May, 2025 2 commits
  20. 14 May, 2025 2 commits
  21. 13 May, 2025 1 commit
  22. 12 May, 2025 3 commits
  23. 08 May, 2025 2 commits
  24. 07 May, 2025 2 commits
    • sched: fix race leading to orphaned runners (#10599) · 5e380c3b
      Daniel Hiltgen authored
      If a model is loading, and the request context is canceled during the load
      by a client closing the connection, and another request is inbound for the
      same model with a different configuration (context size, etc.) thus requiring
      a reload, two unload events can be in flight.  The first shuts down the
      original model load, but the second one causes the loss of the new
      reloading runner reference, thus triggering the leak.
      
      The primary fix is detecting the duplicate unload and ignoring the second
      instance.  The load routine is also hardened to ensure we detect
      clobbering an already present runner and unload it with a warning.
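
      In spirit, the duplicate-unload guard might look like the sketch below:
      an unload event carries the runner it was issued for, and a stale event
      that no longer matches the loaded entry is ignored instead of deleting
      the replacement. Names are illustrative; the real scheduler tracks more
      state.

        package sched

        import (
            "log/slog"
            "sync"
        )

        // runner stands in for a loaded model runner.
        type runner struct{}

        func (r *runner) stop() {}

        // scheduler keeps the set of loaded runners keyed by model name.
        type scheduler struct {
            mu     sync.Mutex
            loaded map[string]*runner
        }

        // unload removes a runner, but only if it is still the one this unload
        // event was issued for; otherwise the event is stale and ignored.
        func (s *scheduler) unload(model string, target *runner) {
            s.mu.Lock()
            defer s.mu.Unlock()
            if current, ok := s.loaded[model]; !ok || current != target {
                slog.Warn("ignoring stale unload", "model", model)
                return
            }
            delete(s.loaded, model)
            target.stop()
        }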
    • 392de840
      Jeffrey Morgan authored
  25. 06 May, 2025 3 commits
  26. 05 May, 2025 1 commit
  27. 03 May, 2025 1 commit
    • sched: logging improvements (#10550) · 76ea735a
      Daniel Hiltgen authored
      This enhances our logging in the scheduler. The initial "waiting for server" log
      no longer claims an initial error state (it now reports "not responding", which
      better reflects the actual state). Runners now have slog wiring to report more
      details about the runner, including PID.
  28. 01 May, 2025 1 commit