1. 25 Sep, 2025 1 commit
  2. 23 Sep, 2025 1 commit
    • auth: fix problems with the ollama keypairs (#12373) · 64883e3c
      Patrick Devine authored
      * auth: fix problems with the ollama keypairs
      
      This change adds several fixes including:
        - reading in the pubkey files correctly
        - fixing the push unit test to create a keypair file in a temp directory
        - not returning 500 errors for normal status errors
      64883e3c
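      As background for the pubkey-reading fix above, here is a minimal sketch of loading and parsing an Ollama public key file with golang.org/x/crypto/ssh; the path and error handling are illustrative assumptions, not the code from this change.

          package main

          import (
              "fmt"
              "os"
              "path/filepath"

              "golang.org/x/crypto/ssh"
          )

          func main() {
              home, err := os.UserHomeDir()
              if err != nil {
                  panic(err)
              }
              // Ollama keeps its keypair under ~/.ollama by default.
              data, err := os.ReadFile(filepath.Join(home, ".ollama", "id_ed25519.pub"))
              if err != nil {
                  panic(err)
              }
              // ParseAuthorizedKey handles the "<type> <base64> <comment>" format
              // used by OpenSSH-style public key files.
              key, _, _, _, err := ssh.ParseAuthorizedKey(data)
              if err != nil {
                  panic(err)
              }
              fmt.Println(key.Type())
          }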
  3. 17 Sep, 2025 1 commit
  4. 11 Sep, 2025 1 commit
  5. 15 Aug, 2025 1 commit
  6. 05 Aug, 2025 1 commit
    • gpt-oss (#11672) · fa7776fd
      Michael Yang authored
      
      
      * bf16
      
      * tests
      
      * gpt-oss
      
      * enable gptoss for engine
      
      * rough estimate
      
      * convert to mxfp4
      
      * handle safetensors U8
      
      * clamp glu/linear
      
      * update tokenizer
      
      * MXFP4 support
      
      This implements the Open Compute Microscaling (MX) FP4 format
      as a tensor type, with backend implementations focusing
      on mul_mat and mul_mat_id on CPU, CUDA, and Metal.
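
      As a reference for the format itself, here is a rough sketch of one MX block per the OCP Microscaling spec: 32 E2M1 elements packed two per byte, sharing a single E8M0 (power-of-two) scale. The nibble ordering and helper names are assumptions for illustration, not ollama's implementation.

          package main

          import (
              "fmt"
              "math"
          )

          // e2m1 lists the eight non-negative FP4 (E2M1) values; bit 3 is the sign.
          var e2m1 = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

          func fp4(code byte) float32 {
              v := e2m1[code&0x7]
              if code&0x8 != 0 {
                  return -v
              }
              return v
          }

          // decodeMXFP4 expands one block: a 1-byte E8M0 scale plus 16 bytes
          // holding 32 packed 4-bit elements (low nibble first, assumed).
          func decodeMXFP4(scale byte, packed [16]byte) [32]float32 {
              s := float32(math.Exp2(float64(int(scale) - 127))) // E8M0: 2^(e-127)
              var out [32]float32
              for i, b := range packed {
                  out[2*i] = fp4(b&0x0F) * s
                  out[2*i+1] = fp4(b>>4) * s
              }
              return out
          }

          func main() {
              var packed [16]byte
              packed[0] = 0x92 // low nibble 0x2 -> +1.0, high nibble 0x9 -> -0.5
              out := decodeMXFP4(127, packed)
              fmt.Println(out[:2]) // scale 2^0 -> [1 -0.5]
          }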
      
      * Unit tests for MXFP4 support
      
      This exercises various operations and shapes on both CPU and GPU (if
      one is detected on the system).
      
      * cuda graph
      
      * unit test adjustments
      
      * cuda: optimize memory access
      
      Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4
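
      The access pattern, sketched in Go rather than CUDA: pull in one 32-bit word (4 bytes, i.e. 8 packed FP4 elements) per load and peel off nibbles, instead of issuing eight separate byte loads. fp4 and e2m1 are the decoder pieces from the sketch above; everything here is illustrative.

          import "encoding/binary"

          // dot8 consumes 8 packed FP4 elements with a single 4-byte load, the
          // Go analogue of the mul_mat_vec_mxfp4 optimization described above.
          func dot8(packed []byte, x []float32, scale float32) float32 {
              word := binary.LittleEndian.Uint32(packed[:4])
              var acc float32
              for k := 0; k < 8; k++ {
                  acc += fp4(byte(word>>(4*k))&0x0F) * scale * x[k]
              }
              return acc
          }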
      
      * mac: fix crash on old macos versions
      
      cblas_sgemm is only supported on v13.3 and up; however, bf16 is only
      supported on v14+, so we were falling back to ggml-blas and crashing
      on bf16 tensors. Checking whether the function is null seems to be
      the simplest way to conditionally avoid registering the backend.
      
      * server: Minimum context length for gptoss
      
      This model requires a minimum context length of 8192 to function
      effectively. Users can set higher values through all normal mechanisms
      but lower values will be silently reset.
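
      A minimal sketch of the silent reset, assuming an option named along the lines of the existing NumCtx:

          // clampGptossCtx raises a too-small request to the model minimum
          // instead of erroring, so existing clients keep working unchanged.
          func clampGptossCtx(numCtx int) int {
              const minCtx = 8192
              if numCtx < minCtx {
                  return minCtx
              }
              return numCtx
          }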
      
      * ggml: Multiply by numParallel for gptoss sliding window
      
      When computing the graph size estimate, the context size is already
      multiplied by numParallel so estimates reflect that. However, since
      sliding window models use a smaller, fixed context size, they need
      to manually take numParallel into account.
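
      In effect the estimate changes like this (hypothetical names):

          // Full-context models: ctxSize was already multiplied by numParallel.
          // Sliding-window models use a fixed window, so scale it manually.
          func windowForEstimate(slidingWindow, numParallel int) int {
              return slidingWindow * numParallel
          }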
      
      * gpt-oss integration
      
      includes harmony parser and thinking levels, etc.
      
      * fix sync
      
      * fix tests
      
      * fix lint
      
      ---------
      Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
      Co-authored-by: Jesse Gross <jesse@ollama.com>
      Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
      fa7776fd
  7. 24 Jul, 2025 1 commit
  8. 22 Jul, 2025 1 commit
  9. 17 Jul, 2025 1 commit
  10. 16 Jul, 2025 1 commit
  11. 08 Jul, 2025 1 commit
  12. 09 Jun, 2025 1 commit
  13. 08 Jun, 2025 1 commit
  14. 06 Jun, 2025 2 commits
  15. 29 May, 2025 1 commit
    • add thinking support to the api and cli (#10584) · 5f57b0ef
      Devon Rifkin authored
      - Both `/api/generate` and `/api/chat` now accept a `"think"`
        option that allows specifying whether thinking mode should be on or
        not
      - Templates get passed this new option so, e.g., qwen3's template can
        put `/think` or `/no_think` in the system prompt depending on the
        value of the setting
      - Models' thinking support is inferred by inspecting model templates.
        The prefix and suffix the parser uses to identify thinking support is
        also automatically inferred from templates
      - Thinking control & parsing is opt-in via the API to prevent breaking
        existing API consumers. If the `"think"` option is not specified, the
        behavior is unchanged from previous versions of ollama
      - Add parsing for thinking blocks in both streaming/non-streaming mode
        in both `/generate` and `/chat`
      - Update the CLI to make use of these changes. Users can pass `--think`
        or `--think=false` to control thinking, or during an interactive
        session they can use the commands `/se...
      5f57b0ef
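      A minimal sketch of opting in to the new option from Go, using the fields named above; the model and prompt are illustrative, and the default localhost:11434 endpoint is assumed.

          package main

          import (
              "bytes"
              "encoding/json"
              "fmt"
              "io"
              "net/http"
          )

          func main() {
              body, _ := json.Marshal(map[string]any{
                  "model": "qwen3",
                  "messages": []map[string]string{
                      {"role": "user", "content": "Why is the sky blue?"},
                  },
                  "think":  true, // opt in; omitting it preserves the old behavior
                  "stream": false,
              })
              resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
              if err != nil {
                  panic(err)
              }
              defer resp.Body.Close()
              out, _ := io.ReadAll(resp.Body)
              fmt.Println(string(out)) // parsed thinking arrives separately from the content
          }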
  16. 21 May, 2025 1 commit
  17. 15 May, 2025 2 commits
  18. 13 May, 2025 1 commit
  19. 10 May, 2025 1 commit
  20. 08 May, 2025 1 commit
  21. 06 May, 2025 1 commit
    • Move quantization to new backend (#10363) · 42481045
      Daniel Hiltgen authored
      * Move quantization logic to GGML via new backend
      
      This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
      
      * Remove "add model quantizations"
      
      This is no longer needed now that quantization is implemented in Go+GGML code directly.
      42481045
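      In spirit, the split described above keeps policy in Go and delegates the per-tensor math to GGML; a hypothetical sketch (every name here is invented for illustration):

          // quantizeModel applies Go-side, model-aware policy per tensor and
          // calls into GGML's quantization kernels for the conversion itself.
          func quantizeModel(tensors []Tensor, target TensorType) error {
              for _, t := range tensors {
                  tt := target
                  if keepHighPrecision(t.Name) { // e.g. embeddings or output layers
                      tt = TypeF16
                  }
                  if err := ggmlQuantize(t, tt); err != nil {
                      return err
                  }
              }
              return nil
          }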
  22. 05 May, 2025 1 commit
  23. 28 Apr, 2025 1 commit
  24. 22 Apr, 2025 1 commit
    • increase default context length to 4096 (#10364) · 424f6486
      Devon Rifkin authored
      * increase default context length to 4096
      
      We lower the default numParallel from 4 to 2 and use these "savings" to
      double the default context length from 2048 to 4096.
      
      We're memory-neutral in cases where we previously would've used
      numParallel == 4, and we add the following mitigation to handle some
      cases where we would have previously fallen back to 1x2048 due to low
      VRAM: we decide between 2048 and 4096 using a runtime check, choosing
      2048 on single-GPU systems with total VRAM of <= 4 GB. We
      purposefully don't check the available VRAM because we don't want the
      context window size to change unexpectedly based on the available VRAM.
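
      The runtime check reads roughly like this (thresholds from the message above; the GPU info type is a stand-in):

          type GpuInfo struct{ TotalMemory uint64 }

          // defaultContextLength returns 2048 only on single-GPU systems with
          // at most 4 GiB of total VRAM; free VRAM is deliberately ignored so
          // the default never shifts between runs.
          func defaultContextLength(gpus []GpuInfo) int {
              if len(gpus) == 1 && gpus[0].TotalMemory <= 4<<30 {
                  return 2048
              }
              return 4096
          }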
      
      We plan on making the default even larger, but this is a relatively
      low-risk change we can make to quickly double it.
      
      * fix tests
      
      Add an explicit context length so the tests don't get truncated. The
      code that converts -1 from being a signal for a runtime check into an
      actual default isn't running as part of these tests.
      
      * tweak small gpu message
      
      * clarify context length default
      
      also make it actually show up in `ollama serve --help`
      424f6486
  25. 20 Apr, 2025 1 commit
  26. 16 Apr, 2025 1 commit
    • cmd: add retry/backoff (#10069) · 1e7f62cb
      Blake Mizerany authored
      This commit adds retry/backoff to the registry client for pull requests.
      
      Also, revert the progress indication to match the original client's
      until we can "get it right."
      
      Also, make WithTrace wrap existing traces instead of clobbering them.
      This allows clients to compose traces.
      1e7f62cb
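      The retry loop described above is, in outline, plain exponential backoff; a sketch, not the registry client's exact code:

          import (
              "context"
              "time"
          )

          func withRetry(ctx context.Context, do func() error) error {
              delay := 500 * time.Millisecond
              var err error
              for attempt := 0; attempt < 5; attempt++ {
                  if err = do(); err == nil {
                      return nil
                  }
                  select {
                  case <-time.After(delay):
                      delay *= 2 // back off between attempts
                  case <-ctx.Done():
                      return ctx.Err()
                  }
              }
              return err
          }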
  27. 14 Apr, 2025 1 commit
  28. 08 Apr, 2025 1 commit
  29. 02 Apr, 2025 1 commit
  30. 01 Apr, 2025 1 commit
  31. 21 Mar, 2025 1 commit
  32. 15 Mar, 2025 1 commit
    • fix: correctly save in interactive mode (#9788) · 2c8b4846
      Patrick Devine authored
      This fixes the case where a FROM line in a previous Modelfile points
      to a file which may or may not be present in a different ollama
      instance. We shouldn't rely on the filename; instead, we check
      whether the FROM line is a valid model name and point to that.
      2c8b4846
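      The check described above amounts to asking whether the FROM value parses as a valid model name; a sketch using ollama's types/model package (the integration details differ):

          import "github.com/ollama/ollama/types/model"

          // fromIsModelName reports whether a FROM line should be resolved as
          // a model reference rather than as a local file path.
          func fromIsModelName(from string) bool {
              return model.ParseName(from).IsValid()
          }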
  33. 13 Mar, 2025 1 commit
  34. 12 Mar, 2025 1 commit
  35. 04 Mar, 2025 2 commits
    • ml/backend/ggml: consolidate system info logging · 05a01fde
      Michael Yang authored
      - output backend system info when initializing the backend; this
        ensures the information is always present without needing to be
        requested explicitly
      - convert to structured logging
      - enumerate devices rather than backends since devices are ordered
      - track device indices grouped by device name
      05a01fde
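      "Convert to structured logging" here means log/slog key-value pairs instead of formatted strings; the keys and values below are illustrative, not ollama's exact fields.

          package main

          import "log/slog"

          func main() {
              slog.Info("system info",
                  "backend", "ggml-cuda",
                  "device.0.name", "NVIDIA GeForce RTX 4090",
                  "device.0.total_vram", "24 GiB",
              )
          }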
    • New engine: vision models and auto-fallback (#9113) · 1fdb351c
      Daniel Hiltgen authored
      * Include unified vision layers in memory prediction
      
      For newer vision models with a single gguf, include
      the projection estimates.
      
      * Adjust CLI to handle both styles of vision model metadata
      
      * Wire up new tokenizers for new engine
      
      If we're loading the new engine, utilize the new model text
      processor instead of calling into cgo wrappers for llama.cpp. This
      also cleans up some tech debt from the older tokenization flow for
      the C++ server, which is no longer used.
      
      This also adjusts the grammar handling logic to pass through to the
      new engine instead of using the cgo schema-to-grammar call.
      
      * Lay foundation for auto selection of new engine
      1fdb351c
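      The auto-selection foundation laid above boils down to a capability check; a hypothetical sketch of the shape (names invented for illustration):

          // chooseRunner prefers the new Go engine, with its native text
          // processor, when the architecture is supported; otherwise it
          // falls back to the llama.cpp-based runner and its cgo tokenizer.
          func chooseRunner(arch string) string {
              if newEngineSupports(arch) {
                  return "ollama-engine"
              }
              return "llama.cpp"
          }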
  36. 03 Mar, 2025 1 commit
  37. 19 Feb, 2025 1 commit