1. 20 Jun, 2025 1 commit
  2. 18 Jun, 2025 2 commits
  3. 12 Jun, 2025 2 commits
  4. 07 Jun, 2025 1 commit
  5. 06 Jun, 2025 1 commit
  6. 05 Jun, 2025 1 commit
  7. 04 Jun, 2025 1 commit
  8. 29 May, 2025 1 commit
    • add thinking support to the api and cli (#10584) · 5f57b0ef
      Devon Rifkin authored
      - Both `/api/generate` and `/api/chat` now accept a `"think"`
        option that allows specifying whether thinking mode should be on or
        not
      - Templates get passed this new option so, e.g., qwen3's template can
        put `/think` or `/no_think` in the system prompt depending on the
        value of the setting
      - Models' thinking support is inferred by inspecting model templates.
        The prefix and suffix the parser uses to identify thinking content are
        also automatically inferred from templates
      - Thinking control & parsing are opt-in via the API to prevent breaking
        existing API consumers. If the `"think"` option is not specified, the
        behavior is unchanged from previous versions of ollama
      - Add parsing for thinking blocks in both streaming and non-streaming
        modes for both `/generate` and `/chat`
      - Update the CLI to make use of these changes. Users can pass `--think`
        or `--think=false` to control thinking, or during an interactive
        session they can use the commands `/se...
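      The new option described in this commit can be exercised directly against the
      API. Below is a minimal sketch in Go, assuming a local ollama server on the
      default port (11434) and a thinking-capable model such as qwen3; the `thinking`
      field name in the response is inferred from the commit description, not quoted
      from it:

      ```go
      package main

      import (
          "bytes"
          "encoding/json"
          "fmt"
          "net/http"
      )

      func main() {
          // Opt in to thinking; omitting "think" keeps the old behavior.
          body, _ := json.Marshal(map[string]any{
              "model": "qwen3",
              "messages": []map[string]string{
                  {"role": "user", "content": "Why is the sky blue?"},
              },
              "think":  true,
              "stream": false,
          })

          resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
          if err != nil {
              panic(err)
          }
          defer resp.Body.Close()

          var out struct {
              Message struct {
                  Thinking string `json:"thinking"` // parsed thinking block (assumed field name)
                  Content  string `json:"content"`  // final answer with thinking stripped
              } `json:"message"`
          }
          if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
              panic(err)
          }
          fmt.Println("thinking:", out.Message.Thinking)
          fmt.Println("content:", out.Message.Content)
      }
      ```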
  9. 27 May, 2025 1 commit
  10. 24 May, 2025 1 commit
  11. 23 May, 2025 1 commit
  12. 22 May, 2025 2 commits
    • sched: fix runner leak during reloading unload (#10819) · d950ff12
      Daniel Hiltgen authored
      When the same model is being reloaded rapidly with client connections
      being canceled before the model finishes loading, the queued unload
      event could cause a leak of runners by deleting a different runner from
      the loaded list.
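      A hypothetical sketch of the kind of guard this fix implies: an unload event
      keeps a reference to the runner it was queued for, and the scheduler only
      evicts that exact runner, so a stale unload cannot delete a newer runner
      loaded under the same model. Types and names below are illustrative, not
      ollama's actual scheduler code:

      ```go
      package main

      import (
          "fmt"
          "sync"
      )

      type runner struct{ model string }

      type scheduler struct {
          mu     sync.Mutex
          loaded map[string]*runner
      }

      // unload removes r only if it is still the runner registered for its model;
      // a stale unload queued for an already-replaced runner becomes a no-op.
      func (s *scheduler) unload(r *runner) {
          s.mu.Lock()
          defer s.mu.Unlock()
          if cur, ok := s.loaded[r.model]; ok && cur == r {
              delete(s.loaded, r.model)
          }
      }

      func main() {
          s := &scheduler{loaded: map[string]*runner{}}
          old := &runner{model: "llama3"}
          s.loaded["llama3"] = old

          // A rapid reload replaces the runner before the old unload event runs.
          s.loaded["llama3"] = &runner{model: "llama3"}

          s.unload(old)              // stale event: ignored
          fmt.Println(len(s.loaded)) // 1 — the new runner is still loaded
      }
      ```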
    • server: improve tensor quantization fallback logic (#10806) · fbe6ae28
      Bruce MacDonald authored
      Fall back to alternative quantization types when a tensor's dimensions
      aren't divisible by the block size required for the originally desired
      quantization type. If the retried quantization types also fail, the
      system ultimately falls back to F16 (half-precision floating point),
      which has a block size of 1 and can handle any tensor dimension.
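      A minimal sketch of the fallback idea: try the desired type first, step down
      when the tensor's row length isn't divisible by the type's block size, and land
      on F16, whose block size of 1 fits any shape. The type names and block sizes
      below are illustrative, not taken from ollama's quantization code:

      ```go
      package main

      import "fmt"

      type quantType struct {
          name      string
          blockSize int
      }

      // Candidate types from most to least compressed; F16 (block size 1) always fits.
      var fallbacks = []quantType{
          {"Q4_K", 256},
          {"Q4_0", 32},
          {"F16", 1},
      }

      // pickQuant returns the first type whose block size divides the tensor's row length.
      func pickQuant(rowLen int) quantType {
          for _, q := range fallbacks {
              if rowLen%q.blockSize == 0 {
                  return q
              }
          }
          return fallbacks[len(fallbacks)-1] // unreachable: block size 1 divides everything
      }

      func main() {
          fmt.Println(pickQuant(4096)) // {Q4_K 256}
          fmt.Println(pickQuant(1056)) // {Q4_0 32} — 1056 is not a multiple of 256
          fmt.Println(pickQuant(50))   // {F16 1}   — falls all the way back
      }
      ```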
  13. 21 May, 2025 1 commit
  14. 19 May, 2025 2 commits
  15. 14 May, 2025 2 commits
  16. 13 May, 2025 1 commit
  17. 12 May, 2025 3 commits
  18. 08 May, 2025 2 commits
  19. 07 May, 2025 2 commits
    • sched: fix race leading to orphaned runners (#10599) · 5e380c3b
      Daniel Hiltgen authored
      If a model is loading, and the request context is canceled during the load
      by a client closing the connection, and another request is inbound for the
      same model with a different configuration (context size, etc.) thus requiring
      a reload, two unload events can be in flight.  The first shuts down the
      original model load, but the second one causes the loss of the new
      reloading runner's reference, thus triggering the leak.

      The primary fix is detecting the duplicate unload and ignoring the second
      instance.  The load routine is also hardened to detect when it would
      clobber an already present runner, in which case the old runner is
      unloaded with a warning.
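      The load-side hardening can be pictured roughly as below: when registering a
      freshly loaded runner, detect that a different runner is already present for
      the same model, warn, and unload the old one instead of silently dropping the
      reference. The types here are a hypothetical illustration, not ollama's code:

      ```go
      package main

      import (
          "log/slog"
          "sync"
      )

      type runner struct{ model string }

      // stop stands in for shutting down the runner subprocess.
      func (r *runner) stop() {}

      type scheduler struct {
          mu     sync.Mutex
          loaded map[string]*runner
      }

      // register installs a freshly loaded runner. If a different runner is already
      // present for the same model, that is a clobber: warn and unload the old one
      // rather than silently losing the reference.
      func (s *scheduler) register(r *runner) {
          s.mu.Lock()
          defer s.mu.Unlock()
          if prev, ok := s.loaded[r.model]; ok && prev != r {
              slog.Warn("clobbering already loaded runner", "model", r.model)
              prev.stop()
          }
          s.loaded[r.model] = r
      }

      func main() {
          s := &scheduler{loaded: map[string]*runner{}}
          s.register(&runner{model: "llama3"})
          s.register(&runner{model: "llama3"}) // warns and stops the first runner
      }
      ```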
    • 392de840
      Jeffrey Morgan authored
  20. 06 May, 2025 3 commits
  21. 05 May, 2025 1 commit
  22. 03 May, 2025 1 commit
    • sched: logging improvements (#10550) · 76ea735a
      Daniel Hiltgen authored
      This enhances our logging in the scheduler.  The initial "waiting for server"
      log no longer claims an initial error state (now "not responding", which better
      reflects the actual state).  Runners now have slog wiring to report more details
      about the runner, including the PID.
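      The slog wiring can be sketched roughly as below: each runner gets its own
      logger carrying identifying attributes (including the PID) so later log lines
      say which runner they belong to. The attribute names and the subprocess are
      illustrative assumptions, not ollama's exact wiring:

      ```go
      package main

      import (
          "log/slog"
          "os"
          "os/exec"
      )

      func main() {
          // Stand-in for launching a runner subprocess.
          cmd := exec.Command("sleep", "5")
          if err := cmd.Start(); err != nil {
              slog.Error("starting runner", "error", err)
              os.Exit(1)
          }

          // Per-runner logger: every message logged through it carries these fields.
          runnerLog := slog.Default().With("runner", "llama3", "pid", cmd.Process.Pid)
          runnerLog.Info("waiting for server to become available", "status", "not responding")

          _ = cmd.Wait()
      }
      ```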
  23. 01 May, 2025 1 commit
  24. 30 Apr, 2025 2 commits
    • strip out thinking tags in message history for qwen3 & r1 (#10490) · ad3c7c9b
      Devon Rifkin authored
      * strip out thinking tags in message history for qwen3 & r1
      
      This is in advance of "proper" support, where we'll make reasoning
      configurable, parse out thinking/reasoning tags, and provide
      them to the caller. These models expect there to be no thinking tags in
      the message history, so this should improve quality.
      
      * parse model names instead of hacky prefix check
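      A minimal sketch of the stripping step, assuming the tags take the form
      <think>...</think> as emitted by qwen3 and deepseek-r1; the regular expression
      and message type below are illustrative, not ollama's exact implementation:

      ```go
      package main

      import (
          "fmt"
          "regexp"
          "strings"
      )

      // thinkBlock matches a <think>...</think> span, including across newlines.
      var thinkBlock = regexp.MustCompile(`(?s)<think>.*?</think>\s*`)

      type message struct {
          Role    string
          Content string
      }

      // stripThinking removes thinking blocks from prior assistant turns before
      // the history is sent back to the model.
      func stripThinking(history []message) []message {
          out := make([]message, len(history))
          for i, m := range history {
              if m.Role == "assistant" {
                  m.Content = strings.TrimSpace(thinkBlock.ReplaceAllString(m.Content, ""))
              }
              out[i] = m
          }
          return out
      }

      func main() {
          history := []message{
              {Role: "user", Content: "What is 2+2?"},
              {Role: "assistant", Content: "<think>Simple arithmetic.</think>\n2+2 = 4."},
          }
          for _, m := range stripThinking(history) {
              fmt.Printf("%s: %s\n", m.Role, m.Content)
          }
      }
      ```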
    • Fix "Stopping..." scheduler hang (#10487) · 415c8fcc
      Daniel Hiltgen authored
      * Adjust initial scheduler refCount
      
      Ensure we only set the refCount on success
      
      * sched: fix lock order inversion deadlock
      
      Under certain race conditions, the scheduler could deadlock while trying to
      update free space information at the same time as a model was trying to
      unload.
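      The lock-order-inversion part of the fix can be illustrated with a generic
      sketch: whenever two goroutines need both the scheduler lock and a runner lock
      (for example, updating free space while an unload is in progress), both must
      take the locks in one agreed order. The types below are hypothetical, not
      ollama's scheduler:

      ```go
      package main

      import (
          "fmt"
          "sync"
      )

      type runner struct {
          mu   sync.Mutex
          vram uint64
      }

      type scheduler struct {
          mu     sync.Mutex
          loaded []*runner
      }

      // freeSpace takes the scheduler lock first, then each runner's lock:
      // the single agreed-upon order.
      func (s *scheduler) freeSpace() uint64 {
          s.mu.Lock()
          defer s.mu.Unlock()
          var total uint64
          for _, r := range s.loaded {
              r.mu.Lock()
              total += r.vram
              r.mu.Unlock()
          }
          return total
      }

      // unload uses the same order (scheduler lock before runner lock). Taking
      // r.mu first while freeSpace holds s.mu is the inversion that can deadlock.
      func (s *scheduler) unload(r *runner) {
          s.mu.Lock()
          defer s.mu.Unlock()
          r.mu.Lock()
          defer r.mu.Unlock()
          for i, x := range s.loaded {
              if x == r {
                  s.loaded = append(s.loaded[:i], s.loaded[i+1:]...)
                  break
              }
          }
      }

      func main() {
          r := &runner{vram: 1 << 30}
          s := &scheduler{loaded: []*runner{r}}
          fmt.Println(s.freeSpace()) // 1073741824
          s.unload(r)
          fmt.Println(len(s.loaded)) // 0
      }
      ```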
  25. 29 Apr, 2025 1 commit
    • lower default num parallel to 2 · fe5b9bb2
      Devon Rifkin authored
      This is in part to "pay" for #10452, which doubled the default context length.
      The combination isn't fully neutral, though: even though the old 4x2k limit and
      the new 2x4k limit are memory-equivalent, the 1x fallback is larger with 4k.
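      Illustrative arithmetic for the tradeoff described above, where the total
      context budget is num_parallel x context length per slot (actual memory also
      depends on the model's per-token KV-cache size, which is omitted here):

      ```go
      package main

      import "fmt"

      func main() {
          fmt.Println("old default:    ", 4*2048, "tokens") // 4 parallel x 2k context
          fmt.Println("new default:    ", 2*4096, "tokens") // 2 parallel x 4k context — same total
          fmt.Println("old 1x fallback:", 1*2048, "tokens")
          fmt.Println("new 1x fallback:", 1*4096, "tokens") // the fallback now reserves twice the context
      }
      ```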
  26. 28 Apr, 2025 1 commit
  27. 25 Apr, 2025 2 commits
    • explicitly decode maxarraysize 1024 · 340448d2
      Michael Yang authored
    • fix superfluous call to WriteHeader · 214a7678
      Michael Yang authored
      The first call to http.ResponseWriter.Write implicitly calls WriteHeader
      with http.StatusOK if it hasn't already been called. Once WriteHeader
      has been called, subsequent calls have no effect. Write is called when
      JSON encoding progressUpdateJSON{}, so calls to
      http.ResponseWriter.WriteHeader after the first encode are useless and
      produce a warning:
      
      http: superfluous response.WriteHeader call from github.com/ollama/ollama/server/internal/registry.(*statusCodeRecorder).WriteHeader (server.go:77)
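      A small standalone sketch of the Go behavior behind that warning: the first
      Write on an http.ResponseWriter implicitly sends the headers with status
      200 OK, so any later WriteHeader call is superfluous and triggers the log
      message quoted above. The handler and types below are illustrative, not
      ollama's registry code:

      ```go
      package main

      import (
          "encoding/json"
          "net/http"
      )

      type progressUpdate struct {
          Completed int64 `json:"completed"`
          Total     int64 `json:"total"`
      }

      func handler(w http.ResponseWriter, r *http.Request) {
          enc := json.NewEncoder(w)

          // This Write (inside Encode) implicitly calls WriteHeader(http.StatusOK).
          _ = enc.Encode(progressUpdate{Completed: 1, Total: 10})

          // Calling w.WriteHeader(...) from here on has no effect and logs
          // "http: superfluous response.WriteHeader call ...", so any non-200
          // status has to be chosen before the first Encode, not after it.
          _ = enc.Encode(progressUpdate{Completed: 10, Total: 10})
      }

      func main() {
          http.HandleFunc("/progress", handler)
          _ = http.ListenAndServe("localhost:8080", nil)
      }
      ```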