1. 09 Jan, 2026 1 commit
    • Add experimental MLX backend and engine with imagegen support (#13648) · 33ee7168
      Daniel Hiltgen authored
      * WIP - MLX backend with gemma3
      
      * MLX: add cmake and go tag build toggles
      
      To build the new MLX backend code:
        cmake --preset MLX
        cmake --build --preset MLX --parallel
        cmake --install build --component MLX
        go build -tags mlx .
      
      Note: the main.go entrypoint for the MLX engine will change in a follow up commit.
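
      As a rough sketch of how the go tag toggle typically looks (file and
      package names here are illustrative, not necessarily what this commit
      uses), MLX-dependent code sits behind a build constraint, with a stub
      compiled into default builds:

        // mlx_enabled.go: compiled only with `go build -tags mlx .`

        //go:build mlx

        package backend

        // mlxAvailable reports that this build includes the MLX backend.
        // Real MLX/cgo calls would live in files guarded like this one.
        func mlxAvailable() bool { return true }

        // mlx_disabled.go: the fallback compiled into default builds

        //go:build !mlx

        package backend

        // mlxAvailable reports that the MLX backend was not compiled in.
        func mlxAvailable() bool { return false }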
      
      * add experimental image generation runtime
      
      * MLX: wire up cuda build for linux
      
      * MLX: get dependencies correct and dedup
      
      This is still too large for a unified GitHub artifact, but is now "correct" for the mlx_cuda_v13
      directory.
      
      * fix relative link bug in dedup
      
      * Add darwin build and readme
      
      * add go build tag for mlx dependent code and wire up build_darwin.sh
      
      * lint cleanup
      
      * macos: build mlx for x86
      
      This will be CPU only.
      
      * cuda build instructions and fix drift from mlx bump
      
      * stale comment
      
      * Delete agent helper doc
      
      * Clean up readme.md
      
      * Revise README for tokenizer clarity and details
      
      Updated README to clarify tokenizer functionality and removed correctness section.
      
      ---------
      Co-authored-by: jmorganca <jmorganca@gmail.com>
  2. 06 Jan, 2026 1 commit
    • preserve tool definition and call JSON ordering (#13525) · e51dead6
      Devon Rifkin authored
      * preserve tool definition and call JSON ordering
      
      This is another iteration of
      <https://github.com/ollama/ollama/pull/12518>, but this time we've
      simplified things by relaxing the competing requirements of being both
      compatible and order-preserving with templates (vs. renderers). We
      maintain backwards compatibility at the cost of not guaranteeing order
      for templates. We plan to move more and more models to renderers,
      which have been updated to use the new order-preserving data types
      (see the sketch at the end of this entry), and we could additionally
      add an opt-in way for templates to get an order-preserved list
      (e.g., via sibling template vars).
      
      * orderedmap_test: remove testify
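
      A minimal sketch of the insertion-order-preserving idea behind these
      data types (independent of ollama's actual type names, which may
      differ): keep a key slice alongside a map and implement json.Marshaler
      so keys serialize in the order they were first set.

        package orderedmap

        import (
            "bytes"
            "encoding/json"
        )

        // Map preserves key insertion order when marshaling to JSON; the
        // built-in map type has no defined iteration order.
        type Map struct {
            keys   []string
            values map[string]json.RawMessage
        }

        func New() *Map {
            return &Map{values: make(map[string]json.RawMessage)}
        }

        // Set records the key on first insertion; values must be valid JSON.
        func (m *Map) Set(key string, value json.RawMessage) {
            if _, ok := m.values[key]; !ok {
                m.keys = append(m.keys, key)
            }
            m.values[key] = value
        }

        // MarshalJSON writes the keys in insertion order.
        func (m *Map) MarshalJSON() ([]byte, error) {
            var buf bytes.Buffer
            buf.WriteByte('{')
            for i, k := range m.keys {
                if i > 0 {
                    buf.WriteByte(',')
                }
                kb, err := json.Marshal(k)
                if err != nil {
                    return nil, err
                }
                buf.Write(kb)
                buf.WriteByte(':')
                buf.Write(m.values[k])
            }
            buf.WriteByte('}')
            return buf.Bytes(), nil
        }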
  3. 03 Jan, 2026 1 commit
  4. 18 Dec, 2025 3 commits
  5. 16 Dec, 2025 1 commit
    • types: ConfigV2 and RootFS (#13504) · 45c47393
      Bruce MacDonald authored
      Refactored the ConfigV2 and RootFS types out of server/images.go into a new types/model/config.go file under the model package, and updated all references to use model.ConfigV2 and model.RootFS. This allows other projects to use these types without having to compile the C code in the llama package (a usage sketch follows below).
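
      Assuming the usual module path, a downstream project can now import
      just the types package; a minimal sketch:

        package main

        import (
            "fmt"

            // Importing the pure-Go types package does not pull in the
            // cgo-based llama code.
            "github.com/ollama/ollama/types/model"
        )

        func main() {
            var cfg model.ConfigV2 // zero value; see types/model/config.go for fields
            var rootfs model.RootFS
            fmt.Printf("%+v %+v\n", cfg, rootfs)
        }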
  6. 11 Dec, 2025 2 commits
  7. 08 Dec, 2025 1 commit
  8. 05 Dec, 2025 1 commit
  9. 20 Nov, 2025 1 commit
  10. 18 Nov, 2025 1 commit
  11. 16 Nov, 2025 2 commits
  12. 13 Nov, 2025 1 commit
  13. 11 Nov, 2025 2 commits
    • llm: Use Ollama engine memory layouts for both old and new engines · f560bd07
      Jesse Gross authored
      Currently, both the old and new engines have code to calculate how
      much memory a model requires and to lay out its layers across GPUs.
      This change reuses the new engine's layout code for the old engine as
      well, bringing them closer together. The old engine continues to use
      its current method of estimating required memory.
      
      This reduces maintenance effort and improves consistency, as new
      features only need to be implemented in one place. The newer code
      is also more accurate, especially with multiple GPUs.
    • server: add logprobs and top_logprobs support to Ollama's API (#12899) · 59241c5b
      Baptiste Jamin authored
      Adds logprobs support to Ollama's API, including Ollama's
      OpenAI-compatible API. When the new 'logprobs' boolean parameter is
      set, Ollama returns the log probability of each generated token. An
      integer 'top_logprobs' parameter, up to a value of 20, can also be
      specified; when set, the API additionally returns that many of the
      most likely tokens at each token position.
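
      For example, an /api/generate request might opt in like this (a
      sketch; the model name is illustrative and the exact response shape
      may differ):

        package main

        import (
            "bytes"
            "fmt"
            "io"
            "net/http"
        )

        func main() {
            // Ask for per-token log probabilities plus the 5 most likely
            // alternatives at each position (top_logprobs maxes out at 20).
            body := []byte(`{
                "model": "llama3.2",
                "prompt": "Why is the sky blue?",
                "stream": false,
                "logprobs": true,
                "top_logprobs": 5
            }`)

            resp, err := http.Post("http://localhost:11434/api/generate",
                "application/json", bytes.NewReader(body))
            if err != nil {
                panic(err)
            }
            defer resp.Body.Close()
            out, _ := io.ReadAll(resp.Body)
            fmt.Println(string(out))
        }
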
      Co-authored-by: Baptiste Jamin <baptiste@crisp.chat>
  14. 06 Nov, 2025 2 commits
  15. 05 Nov, 2025 2 commits
  16. 04 Nov, 2025 1 commit
  17. 29 Oct, 2025 1 commit
  18. 28 Oct, 2025 1 commit
  19. 27 Oct, 2025 2 commits
    • create: inherit FROM model's renderer/parser · 1bdd8169
      Devon Rifkin authored
      On main, the `RENDERER` and `PARSER` fields from the `Modelfile` don't
      get propagated to a new model created with a `req.From` parameter. This
      is easily triggered via `ollama run qwen3-coder`, then running a save
      command like `/save qwen3-coder-custom`.
      
      Added a regression test for this, and now open the config of the
      "from" model in order to use its renderer/parser as defaults for the
      new model. This fixes both CLI and API-based creates.
      
      Fixes: https://github.com/ollama/ollama/issues/12792
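
      Roughly, the fallback looks like this (identifiers are illustrative,
      not the exact ollama internals):

        package server

        import "github.com/ollama/ollama/types/model"

        // inheritFromConfig sketches the fix: fields the new Modelfile
        // leaves empty default to the FROM model's config.
        func inheritFromConfig(newCfg, fromCfg *model.ConfigV2) {
            if newCfg.Renderer == "" {
                newCfg.Renderer = fromCfg.Renderer
            }
            if newCfg.Parser == "" {
                newCfg.Parser = fromCfg.Parser
            }
        }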
    • server: Consolidate embedding truncation in runner (#12730) · 5d347f6d
      nicole pardal authored
      Currently, checking that embedding prompts fit in the context window
      (and truncating them if necessary) happens in two places: the Ollama
      server and the runner. This can lead to inconsistencies in both the
      checks and the reported number of tokens processed. Since this
      processing has to happen in the runner anyway, this consolidates all
      of the logic there.
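
      A sketch of the consolidated runner-side check (identifiers are
      illustrative, not the exact runner code):

        package runner

        import "fmt"

        // truncate enforces the context window in one place: inputs that
        // fit pass through, oversized inputs are truncated when allowed
        // and rejected otherwise, so the reported token count matches
        // what was actually processed.
        func truncate(tokens []int, numCtx int, allowTruncate bool) ([]int, error) {
            if len(tokens) <= numCtx {
                return tokens, nil
            }
            if !allowTruncate {
                return nil, fmt.Errorf("input length %d exceeds context length %d", len(tokens), numCtx)
            }
            return tokens[:numCtx], nil
        }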
  20. 25 Oct, 2025 1 commit
  21. 23 Oct, 2025 1 commit
    • DRY out the runner lifecycle code (#12540) · 3258a89b
      Daniel Hiltgen authored
      * DRY out the runner lifecycle code
      
      Now that discovery uses the runners as well, this unifies the runner spawning code
      into a single place. It also unifies the GPU discovery types with the newer ml.DeviceInfo.
      
      * win: make incremental builds better
      
      Place build artifacts in discrete directories so incremental builds don't have to start fresh
      
      * Adjust sort order to consider iGPUs
      
      * handle CPU inference OOM scenarios
      
      * review comments
  22. 22 Oct, 2025 1 commit
  23. 20 Oct, 2025 1 commit
  24. 17 Oct, 2025 1 commit
    • test: harden scheduler tests (#12662) · 68e04c7f
      Daniel Hiltgen authored
      * test: harden scheduler tests
      
      This removes reschedDelay, which was stale code, and adds a new
      configurable timeout for waitForVRAMRecovery, so tests can set the
      timeout very short and avoid the scheduler getting stuck and hitting
      a test timeout (see the sketch at the end of this entry).
      
      * test: tune tests for partial loads
      
      Give stress tests more time when the model is split between CPU/GPU
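
      One common shape for such a configurable timeout (a sketch, not
      necessarily the PR's exact code) is a field on the scheduler with a
      production default that tests override with a very small value:

        package server

        import "time"

        // Scheduler carries the VRAM-recovery timeout as a field so tests
        // can shrink it instead of waiting out the production default.
        type Scheduler struct {
            vramRecoveryTimeout time.Duration
        }

        func NewScheduler() *Scheduler {
            return &Scheduler{vramRecoveryTimeout: 5 * time.Second} // illustrative default
        }

        // waitForVRAMRecovery polls until VRAM is free or the timeout
        // elapses, so a stuck scheduler fails fast instead of hanging.
        func (s *Scheduler) waitForVRAMRecovery(recovered func() bool) bool {
            deadline := time.After(s.vramRecoveryTimeout)
            ticker := time.NewTicker(10 * time.Millisecond)
            defer ticker.Stop()
            for {
                select {
                case <-deadline:
                    return false
                case <-ticker.C:
                    if recovered() {
                        return true
                    }
                }
            }
        }

      A test can then construct the scheduler with a millisecond-scale
      timeout and exercise the stuck path without tripping the test timeout.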
  25. 16 Oct, 2025 1 commit
  26. 14 Oct, 2025 1 commit
  27. 13 Oct, 2025 1 commit
    • Qwen3VL Cloud Parser and Renderer (#12526) · 05982a95
      Grace authored
      * working for tool calls and tools (other than tool calls being in the incorrect order)
      
      * Tests work, other than image tags (tests do not go through the server) and tools (not in the correct order, but the contents are the same)
      
      * testing for qwen3vl parser - toolparser is working
      
      * made changes to the JSON tool parser: wrap the ToolCallFunction in a ToolCall object (see the sketch at the end of this entry)
      
      * Working parser for thinking models: assumes a thinking state, emits unambiguous content while thinking, and does not emit tool calls while thinking
      
      * changed the parser to start with collecting content
      
      * thinking prefill
      
      * add hasThinkingSupport parameter to parser
      
      * qwen3-vl -> qwen3-vl-instruct for renderer/parser
      
      * Add hasThinkingSupport=false to QwenVLParser
      
      ---------
      Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
  28. 11 Oct, 2025 2 commits
  29. 10 Oct, 2025 2 commits
  30. 09 Oct, 2025 1 commit
    • logs: quiet down context canceled on completion and scheduler noise (#12553) · 15e3611d
      Daniel Hiltgen authored
      * logs: quiet down context canceled on completion
      
      If the client closes the connection before Completion finishes, we were
      logging at error level, implying the runner had crashed, which was
      misleading (see the sketch at the end of this entry). For example:
      
      time=2025-10-08T22:59:20.566-07:00 level=ERROR source=server.go:1490 msg="post predict" error="Post \"http://127.0.0.1:57736/completion\": context canceled"
      
      * quiet down scheduler log error on expected case
      
      Since we don't hold the lock while performing memory load calculations, other
      runners can unload in parallel, so finding no runner to unload is a valid scenario
      that we shouldn't log at error level.
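
      The usual Go idiom for the completion case (a sketch; the PR's exact
      code may differ) is to branch on context.Canceled before picking a
      log level:

        package server

        import (
            "context"
            "errors"
            "log/slog"
        )

        // logPredictErr demotes client-initiated cancellations to debug
        // level; only unexpected failures are logged as errors.
        func logPredictErr(ctx context.Context, err error) {
            if errors.Is(err, context.Canceled) && ctx.Err() != nil {
                // The client hung up before completion finished:
                // expected, not a runner crash.
                slog.Debug("post predict canceled by client", "error", err)
                return
            }
            slog.Error("post predict", "error", err)
        }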