15 Sep, 2025 (2 commits)
    • address comments · 472feec2
      Devon Rifkin authored
    • add qwen3-coder tool support · 47991940
      Devon Rifkin authored
      The format qwen3-coder uses is relatively unique, both in rendering and
      in parsing. To implement parsing, I wrote a custom parser in a similar
      style to harmony. For rendering, I found the logic would be much harder
      to follow in a template, so I introduced the concept of a built-in
      renderer: Go code, rather than a template, that generates prompts.
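
      As a rough illustration of the idea (not Ollama's actual API: the
      `Renderer` interface, `Message` type, and the plain ChatML framing
      below are all assumptions), a built-in renderer is just Go code that
      produces the prompt string directly:

      ```go
      // Hypothetical sketch of a built-in renderer: prompt construction in
      // Go rather than in a template. Names and signatures are illustrative.
      package renderer

      import (
          "fmt"
          "strings"
      )

      // Message is a minimal chat message; the real type lives elsewhere.
      type Message struct {
          Role    string
          Content string
      }

      // Renderer turns a conversation into the exact prompt a model expects.
      type Renderer interface {
          Render(msgs []Message) (string, error)
      }

      type qwen3CoderRenderer struct{}

      // Render emits ChatML-style turns; the real qwen3-coder format
      // (especially its tool-call blocks) is more involved than this.
      func (qwen3CoderRenderer) Render(msgs []Message) (string, error) {
          var sb strings.Builder
          for _, m := range msgs {
              fmt.Fprintf(&sb, "<|im_start|>%s\n%s<|im_end|>\n", m.Role, m.Content)
          }
          sb.WriteString("<|im_start|>assistant\n")
          return sb.String(), nil
      }
      ```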
      
      I set us up for future built-in parsers and renderers by allowing them
      to be specified in a Modelfile like so:
      
      ```
      RENDERER "qwen3-coder"
      PARSER "qwen3-coder"
      ```
      
      These need to be provided explicitly because the architecture alone is
      not enough to determine what format the model expects to receive and
      what format we expect it to output (e.g., qwen3-coder's architecture is
      `qwen3moe`, which it shares with other qwen3-family models).
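
      Building on the hypothetical types in the sketch above (same
      illustrative package, not the real registry), selection is then a
      lookup keyed by the explicit Modelfile name rather than by
      architecture:

      ```go
      // Illustrative only: several distinct models can share an
      // architecture like qwen3moe, so built-ins are keyed by name.
      var builtinRenderers = map[string]Renderer{
          "qwen3-coder": qwen3CoderRenderer{},
      }

      func rendererFor(name string) (Renderer, error) {
          if r, ok := builtinRenderers[name]; ok {
              return r, nil
          }
          return nil, fmt.Errorf("no built-in renderer named %q", name)
      }
      ```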
      
      I haven't converted harmony to be one of these "built-ins" yet, since
      some of it is in flux with the changes @ParthSareen has been making to
      move harmony to the runner. It is likely that many other built-ins will
      need to move to the runner as well, but I'm able to slightly defer that
      decision since qwen3-coder doesn't have thinking (and therefore doesn't
      need to be in the runner to make structured outputs work). I expect to
      unify harmony with this approach very soon.
      
      Whether a particular model supports tools or thinking was previously
      inferred from its template, but without a template we now also use the
      parser itself to declare what it supports. If future models reuse the
      same parsing format but have different capabilities, we'll want to
      parameterize them and give them distinct names to be specified as a
      `PARSER`.
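
      A sketch of what declaring capabilities on a parser might look like;
      the `Parser` interface, `Capability` type, and `ToolCall` struct are
      all assumptions, not Ollama's actual types:

      ```go
      package parser

      // Capability names a feature that a model/parser pair supports.
      type Capability string

      const (
          CapabilityTools    Capability = "tools"
          CapabilityThinking Capability = "thinking"
      )

      // ToolCall is a minimal parsed tool invocation.
      type ToolCall struct {
          Name      string
          Arguments map[string]any
      }

      // Parser consumes raw model output and declares what it supports,
      // so capability detection no longer needs a template to inspect.
      type Parser interface {
          // Add consumes a streamed chunk of model output, returning any
          // plain content plus tool calls completed so far.
          Add(chunk string) (content string, calls []ToolCall, err error)
          Capabilities() []Capability
      }
      ```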
      
      Misc changes:
      
      - I worked on the renderer by diffing outputs between the reference
        implementation and ours. To make that easier, I extended
        <https://github.com/ollama/ollama/pull/11875> to also support
        returning the prompt via the OpenAI compat layer.
29 Aug, 2025 (2 commits)
    • perf: build graph for next batch async to keep GPU busy (#11863) · 517807cd
      Daniel Hiltgen authored
      * perf: build graph for next batch in parallel to keep GPU busy

      This refactors the main run loop of the ollama runner to perform the
      GPU-intensive tasks (Compute+Floats) in a goroutine, so the next batch
      can be prepared in parallel and the GPU spends less time stalled
      waiting for its next batch of work (see the sketch after this list).
      
      * tests: tune integration tests for ollama engine
      
      This tunes the integration tests to focus more on models supported
      by the new engine.
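
      A minimal sketch of the pipelining pattern described above, with
      placeholder prepareBatch/compute helpers standing in for the real
      runner's batch preparation and GPU work:

      ```go
      package main

      // batch stands in for a prepared unit of work (tokens, positions, ...).
      type batch struct{}

      func prepareBatch() (*batch, bool) { return nil, false } // placeholder CPU work
      func compute(b *batch)             {}                    // placeholder GPU work

      // runLoop overlaps GPU compute for the current batch with CPU
      // preparation of the next one, so the GPU is not idle between batches.
      func runLoop() {
          cur, ok := prepareBatch()
          for ok {
              done := make(chan struct{})
              go func(b *batch) { // GPU-intensive work runs off the main loop
                  compute(b)
                  close(done)
              }(cur)

              next, more := prepareBatch() // overlaps with the compute above
              <-done                       // GPU must finish before the next iteration
              cur, ok = next, more
          }
      }

      func main() { runLoop() }
      ```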
    • Always filter devices (#12108) · ead4a9a1
      Daniel Hiltgen authored
      * Always filter devices

      Avoid crashing on unsupported AMD iGPUs.

      * Remove CUDA device filtering

      This interferes with mixed setups.
27 Aug, 2025 (2 commits)
    • ggml: Avoid allocating CUDA primary context on unused GPUs · 9d97e6a9
      Jesse Gross authored
      The recent memory management changes caused all GPUs to be visible
      to the runner, regardless of whether they are ultimately used. This
      caused CUDA devices to allocate a primary context (~300 MB of VRAM) on
      each GPU, for each model. This is unnecessary, so we can both avoid
      touching GPUs that we exclude in the early stage of allocation and
      free the memory for any that we touch but don't use.
      
      The issue will continue to exist for the old engine, since it touches
      all devices during initialization.
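
      The shape of the fix, sketched with made-up Device helpers standing
      in for the real ggml/CUDA bindings (illustrative only):

      ```go
      package ggml

      // Device stands in for a GPU handle; touching one may allocate
      // driver state such as a CUDA primary context (~300 MB of VRAM).
      type Device struct{ ID int }

      // Free releases any driver state held for the device (placeholder).
      func (d Device) Free() {}

      // filterDevices skips excluded GPUs entirely, so they are never
      // touched and no primary context is created on them.
      func filterDevices(all []Device, want func(Device) bool) []Device {
          used := make([]Device, 0, len(all))
          for _, d := range all {
              if want(d) {
                  used = append(used, d)
              }
          }
          return used
      }

      // releaseUnused frees devices that were probed during allocation
      // but not ultimately scheduled, reclaiming their primary contexts.
      func releaseUnused(probed, used map[int]Device) {
          for id, d := range probed {
              if _, ok := used[id]; !ok {
                  d.Free()
              }
          }
      }
      ```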
    • fix keep alive (#12041) · 10815324
      Michael Yang authored