1. 06 Jan, 2026 1 commit
  2. 26 Sep, 2025 1 commit
      • bugfix: restore the current runOptions if loading fails in the CLI (#12402) · b04e46da
      Patrick Devine authored
      There are two bugs when using `/load <model>` for a model that doesn't exist, namely:
        1. it will not restore the current model settings if the current model is a thinking model; and
  2. it will crash if the current model is a non-thinking model
      
This fix saves the current runOptions and restores them if the model load fails. It also fixes the crash for non-thinking models.
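
The save-then-restore pattern described above can be sketched as follows. This is an illustrative Python sketch, not the actual Go implementation; `load_model` and `fetch_info` are hypothetical names standing in for the CLI's load path.

```python
import copy

def load_model(opts, name, fetch_info):
    """Attempt to switch models; restore the previous options on failure."""
    saved = copy.deepcopy(opts)          # snapshot before any mutation
    try:
        opts["model"] = name
        opts["info"] = fetch_info(name)  # raises if the model doesn't exist
        return opts
    except Exception:
        return saved                     # load failed: restore prior settings
```

The key point is that the snapshot is taken before anything is mutated, so a failed `/load` leaves the session exactly as it was.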
  3. 05 Aug, 2025 1 commit
      • gpt-oss (#11672) · fa7776fd
      Michael Yang authored
      
      
      * bf16
      
      * tests
      
      * gpt-oss
      
      * enable gptoss for engine
      
      * rough estimate
      
      * convert to mxfp4
      
      * handle safetensors U8
      
      * clamp glu/linear
      
      * update tokenizer
      
      * MXFP4 support
      
      This implements the Open Compute Microscaling (MX) FP4 format
      as a tensor type with backend implementations focusing
      on mulmat and mulmatid on CPU, CUDA, and Metal.
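
As a rough illustration of the format itself: an MX block pairs a shared E8M0 scale byte (a pure power of two, 2^(e − 127)) with packed 4-bit FP4 (E2M1) codes. The sketch below is a hypothetical Python decoder, not the CPU/CUDA/Metal kernels in the commit; the nibble packing order is an assumption, and E8M0 special values (e.g. NaN encoding) are ignored.

```python
# FP4 (E2M1) code points per the OCP Microscaling spec:
# 1 sign bit, 2 exponent bits, 1 mantissa bit.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
            -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0]

def decode_mxfp4_block(scale_byte, packed):
    """Decode one MX block: a shared E8M0 scale plus packed 4-bit codes.

    Assumes the low nibble of each byte comes first in element order.
    """
    scale = 2.0 ** (scale_byte - 127)
    out = []
    for byte in packed:                      # each byte carries two elements
        out.append(FP4_E2M1[byte & 0x0F] * scale)
        out.append(FP4_E2M1[byte >> 4] * scale)
    return out
```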
      
      * Unit tests for MXFP4 support
      
This exercises various operations and shapes on both CPU and GPU (if detected
on the system).
      
      * cuda graph
      
      * unit test adjustments
      
      * cuda: optimize memory access
      
      Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
      
      * mac: fix crash on old macos versions
      
cblas_sgemm is only supported on v13.3 and up, but bf16 is
only supported on v14+, so we were falling back to ggml-blas and
crashing on bf16 tensors. Checking whether the function is null
seems to be the simplest way to conditionally avoid registering the
backend.
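
The same "probe the symbol before trusting the library" idea can be sketched generically. This is a hypothetical Python illustration using ctypes, not the actual Go/cgo check; `symbol_available` is an invented helper name.

```python
import ctypes
import ctypes.util

def symbol_available(libname, symbol):
    """True only if the shared library exists and exports the symbol.

    A backend would be registered only when this returns True, rather
    than assuming every function in the header is actually linkable.
    """
    try:
        path = ctypes.util.find_library(libname)
        if path is None:
            return False
        getattr(ctypes.CDLL(path), symbol)  # AttributeError if missing
        return True
    except (OSError, AttributeError):
        return False
```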
      
      * server: Minimum context length for gptoss
      
      This model requires a minimum context length of 8192 to function
      effectively. Users can set higher values through all normal mechanisms
      but lower values will be silently reset.
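
The clamp described above amounts to a one-liner. A minimal sketch, assuming a hypothetical `effective_context` helper and the 8192 minimum stated in the commit:

```python
MIN_GPTOSS_CONTEXT = 8192

def effective_context(requested):
    """Silently raise too-small context lengths to the model's minimum;
    larger values pass through unchanged."""
    return max(requested, MIN_GPTOSS_CONTEXT)
```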
      
      * ggml: Multiply by numParallel for gptoss sliding window
      
      When computing the graph size estimate, the context size is already
      multiplied by numParallel so estimates reflect that. However, since
      sliding window models use a smaller, fixed context size, they need
      to manually take numParallel into account.
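
The asymmetry above can be made concrete with a small sketch. This is a hypothetical simplification of the estimate's context term, not the real ggml code; the function name and signature are invented.

```python
def context_for_estimate(num_ctx, num_parallel, sliding_window=None):
    """Context term used in a graph-size estimate.

    num_ctx is assumed to have been multiplied by num_parallel already;
    a fixed sliding window has not, so it is scaled here before taking
    the smaller of the two.
    """
    if sliding_window is None:
        return num_ctx
    return min(num_ctx, sliding_window * num_parallel)
```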
      
      * gpt-oss integration
      
      includes harmony parser and thinking levels, etc.
      
      * fix sync
      
      * fix tests
      
      * fix lint
      
      ---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
  4. 22 Jul, 2025 1 commit
  5. 29 May, 2025 1 commit
      • add thinking support to the api and cli (#10584) · 5f57b0ef
      Devon Rifkin authored
      - Both `/api/generate` and `/api/chat` now accept a `"think"`
        option that allows specifying whether thinking mode should be on or
        not
      - Templates get passed this new option so, e.g., qwen3's template can
        put `/think` or `/no_think` in the system prompt depending on the
        value of the setting
      - Models' thinking support is inferred by inspecting model templates.
        The prefix and suffix the parser uses to identify thinking support is
        also automatically inferred from templates
      - Thinking control & parsing is opt-in via the API to prevent breaking
        existing API consumers. If the `"think"` option is not specified, the
        behavior is unchanged from previous versions of ollama
      - Add parsing for thinking blocks in both streaming/non-streaming mode
        in both `/generate` and `/chat`
      - Update the CLI to make use of these changes. Users can pass `--think`
        or `--think=false` to control thinking, or during an interactive
        session they can use the commands `/set think` or `/set nothink`
      - A `--hidethinking` option has also been added to the CLI. This makes
        it easy to use thinking in scripting scenarios like
        `ollama run qwen3 --think --hidethinking "my question here"` where you
        just want to see the answer but still want the benefits of thinking
        models
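
The opt-in behavior in the first and fourth bullets can be sketched as payload construction: `"think"` is only included when the caller sets it, so older clients see unchanged behavior. A hypothetical Python sketch (the helper name is invented; the `/api/chat` field names match the commit message):

```python
import json

def chat_request(model, prompt, think=None):
    """Build an /api/chat request body.

    "think" is omitted entirely unless the caller specifies it, which
    keeps behavior identical for existing API consumers.
    """
    body = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    if think is not None:
        body["think"] = think
    return json.dumps(body)
```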
  6. 13 May, 2025 1 commit
  7. 10 May, 2025 1 commit
  8. 20 Apr, 2025 1 commit
  9. 15 Mar, 2025 1 commit
      • fix: correctly save in interactive mode (#9788) · 2c8b4846
      Patrick Devine authored
This fixes the case where a FROM line in a previous modelfile points to a
file which may or may not be present in a different ollama instance. We
shouldn't rely on the filename; instead, check whether the FROM line is a
valid model name and point to that.
  10. 13 Mar, 2025 1 commit
  11. 12 Mar, 2025 1 commit
  12. 01 Jan, 2025 1 commit
  13. 22 Dec, 2024 1 commit
  14. 26 Nov, 2024 1 commit
  15. 21 Nov, 2024 1 commit
  16. 18 Oct, 2024 1 commit
  17. 10 Oct, 2024 1 commit
      • cli: Send all images in conversation history · 7fe39025
      Jesse Gross authored
Currently the CLI only sends images from the most recent
image-containing message. This prevents things like sending
one message with an image, then a follow-up message with a
second image, and asking for a comparison based on information
not present in any text that was output.
      
It's possible that some models have a problem with this, but the
CLI is not the right place to restrict it, since any adjustments are
model-specific and should affect all clients.
      
      Both llava:34b and minicpm-v do reasonable things with multiple
      images in the history.
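
The change amounts to flattening images across the whole history instead of taking only the last image-bearing message. A hypothetical Python sketch (the helper name and message shape are invented for illustration):

```python
def collect_images(messages):
    """Gather the images from every message in the history, in order,
    rather than only from the most recent image-bearing message."""
    return [img for msg in messages for img in msg.get("images", [])]
```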
  18. 11 Sep, 2024 2 commits
  19. 02 Aug, 2024 1 commit
  20. 27 Jul, 2024 1 commit
  21. 26 Jul, 2024 2 commits
  22. 22 Jul, 2024 1 commit
  23. 14 Jul, 2024 1 commit
  24. 28 Jun, 2024 1 commit
  25. 25 Jun, 2024 1 commit
      • cmd: defer stating model info until necessary (#5248) · 2aa91a93
      Blake Mizerany authored
This commit changes the 'ollama run' command to defer fetching model
information until it actually needs it, that is, when in interactive mode.

It also removes one case where the model information was fetched in
duplicate: once just before calling generateInteractive and then again,
first thing, inside generateInteractive.
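
The deferral is a standard lazy-fetch-with-cache pattern. As an illustrative Python sketch (the class name is invented; `fetch` stands in for the model-info request):

```python
class LazyModelInfo:
    """Defer the model-info request until first use, and cache the result
    so it is never fetched twice."""

    def __init__(self, fetch):
        self._fetch = fetch   # e.g. a closure over the info API call
        self._info = None

    def get(self):
        if self._info is None:
            self._info = self._fetch()
        return self._info
```

Non-interactive runs never call `get()`, so they skip the round-trip entirely, which is where the timing improvement below comes from.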
      
      This positively impacts the performance of the command:
      
          ; time ./before run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
          ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.168 total
          ; time ./before run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
          ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.220 total
          ; time ./before run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
          ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.217 total
          ; time ./after run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
          ./after run llama3 'hi'  0.02s user 0.01s system 4% cpu 0.652 total
          ; time ./after run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
          ./after run llama3 'hi'  0.01s user 0.01s system 5% cpu 0.498 total
          ; time ./after run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?
      
          ./after run llama3 'hi'  0.01s user 0.01s system 3% cpu 0.479 total
          ; time ./after run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
          ./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
          ; time ./after run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
          ./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
  26. 04 Jun, 2024 1 commit
  27. 24 May, 2024 1 commit
  28. 21 May, 2024 1 commit
  29. 18 May, 2024 1 commit
  30. 14 May, 2024 3 commits
  31. 07 May, 2024 1 commit
  32. 01 May, 2024 1 commit
  33. 23 Apr, 2024 1 commit
  34. 22 Apr, 2024 1 commit
  35. 29 Mar, 2024 1 commit
  36. 26 Mar, 2024 1 commit