1. 30 Oct, 2025 11 commits
    • win: avoid ID mixups on refresh (#12869) · db973c8f
      Daniel Hiltgen authored
      On Windows, AMD GPU IDs are numeric and can reorder based on the filter environment.
      By passing the filter env on a full discovery refresh, we'll only look at the actual devices
      and ignore unsupported iGPUs. Without this, on some systems iGPU VRAM was incorrectly
      being used to populate the dGPU (a sketch of the filtering idea follows below).
      db973c8f
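      A minimal sketch of the idea with hypothetical names (the real discovery code differs):
      apply the same device-visibility environment variable during a full refresh, so filtered-out
      iGPU entries never make it into the device list at all.

          package sketch

          import (
              "os"
              "strings"
          )

          // DeviceInfo is a stand-in for the real discovery type.
          type DeviceInfo struct {
              ID         string
              Integrated bool
              FreeVRAM   uint64
          }

          // filterByEnv keeps only devices whose IDs appear in the comma-separated
          // visibility list (e.g. HIP_VISIBLE_DEVICES); an empty list keeps everything.
          func filterByEnv(devs []DeviceInfo, envKey string) []DeviceInfo {
              val := os.Getenv(envKey)
              if val == "" {
                  return devs
              }
              allowed := map[string]bool{}
              for _, id := range strings.Split(val, ",") {
                  allowed[strings.TrimSpace(id)] = true
              }
              var out []DeviceInfo
              for _, d := range devs {
                  if allowed[d.ID] {
                      out = append(out, d)
                  }
              }
              return out
          }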
    • ggml: Enable op_offload to improve partial offload performance · afaf7ce8
      Jesse Gross authored
      When a model is partially offloaded to system RAM, we can either
      do the calculations on the CPU or we can temporarily transfer the
      data to the GPU to do the calculations there. Small batches tend
      to be better on the CPU, large batches on the GPU.
      
      The llamarunner used the GPU in most cases and the ollamarunner
      used the CPU. Although the ollamarunner saw an improvement in
      token generation performance, there was a large performance hit
      in prompt processing (3-10x).
      
      There is an existing heuristic to dynamically switch between these
      two modes, but in practice it doesn't have enough information to
      accurately make that decision. This adds authoritative data so the
      check can make the right choice and get the best of both worlds
      (a rough sketch of the decision is below).
      
      Fixes #12037
      afaf7ce8
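      A rough sketch of the kind of decision involved, with illustrative names and an assumed
      threshold (not the actual ggml op_offload logic): for layers left in system RAM, small
      batches stay on the CPU while large batches justify shipping the data to the GPU.

          package main

          import "fmt"

          // offloadBatchThreshold is an assumed cutoff, not the value the backend uses.
          const offloadBatchThreshold = 32

          // runOnGPU decides where to compute for weights held in system RAM:
          // small batches stay on the CPU, large batches are worth the transfer.
          func runOnGPU(batchSize int, weightsOnGPU bool) bool {
              if weightsOnGPU {
                  return true // already resident on the GPU
              }
              return batchSize >= offloadBatchThreshold
          }

          func main() {
              fmt.Println(runOnGPU(1, false))    // token generation: false (CPU)
              fmt.Println(runOnGPU(2048, false)) // prompt processing: true (GPU)
          }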
    • ollamarunner: Worst case batch for token generation · 26465fb8
      Jesse Gross authored
      We currently allocate the worst case batch using max sized
      batches, which corresponds to prompt processing. However,
      in some cases the generated graph differs between small and
      large batches. To ensure that we don't need to allocate memory
      later, after layout has taken place, we should run the worst
      case batch both ways and take the larger amount of memory
      (sketched below).
      
      This does not noticeably affect loading speed as the most expensive
      part of this logic is from image processing and that does not
      occur during token generation.
      26465fb8
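      In rough terms (hypothetical names, not the runner's actual API), the reservation becomes
      the maximum over both worst-case graphs rather than only the prompt-processing one:

          package sketch

          // reserveWorstCase is a sketch: measureGraph stands in for building and
          // measuring a compute graph for a given batch size. Reserving the larger
          // of the two worst cases avoids having to allocate again after layout.
          func reserveWorstCase(measureGraph func(batchSize int) uint64, maxBatch int) uint64 {
              promptCase := measureGraph(maxBatch) // large batch: prompt processing
              genCase := measureGraph(1)           // single token: generation
              if genCase > promptCase {
                  return genCase
              }
              return promptCase
          }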
    • win: use copy for subprocess logs (#12864) · 88236bc0
      Daniel Hiltgen authored
      Windows gets confused when we try to hand the stderr file descriptor to subprocess children,
      so the output is copied instead (a sketch is below). This ensures the log output always shows up.
      88236bc0
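      A standard-library sketch of the approach (the actual runner-spawning code is more
      involved): read the child's stderr through a pipe and copy it ourselves, rather than
      assigning our own stderr handle to the subprocess.

          package main

          import (
              "io"
              "log"
              "os"
              "os/exec"
          )

          func main() {
              cmd := exec.Command("some-runner") // placeholder; the real runner binary differs
              stderr, err := cmd.StderrPipe()
              if err != nil {
                  log.Fatal(err)
              }
              if err := cmd.Start(); err != nil {
                  log.Fatal(err)
              }
              // Copy the child's stderr to ours instead of handing over the raw
              // file descriptor, so log output always shows up on Windows.
              go io.Copy(os.Stderr, stderr)
              cmd.Wait()
          }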
    • 76eb7d0f
      Patrick Devine authored
    • interleaved mrope (#12807) · f67a6df1
      Michael Yang authored
      * ml(ggml): mrope
      * interleave mrope
      f67a6df1
    • 75e75d9a
      Michael Yang authored
    • fix(cmd): unload model before removal (#12832) · ed78e127
      Michael Yang authored
      This change fixes two bugs with `ollama rm`:
      
      1. Before a model is removed, it should first be stopped. Previously this
         only happened for the first argument and was skipped for all other models.
      2. Models were unloaded indiscriminately. This errors for cloud models,
         so the unload is now skipped for them (see the sketch below).
      ed78e127
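      A simplified sketch of the corrected loop, using hypothetical helpers rather than the
      actual CLI code:

          package sketch

          // removeModels stops every local model before deleting it, and skips the
          // unload for cloud models, where it would error.
          func removeModels(names []string, isCloud func(string) bool, stopModel, deleteModel func(string) error) error {
              for _, name := range names {
                  if !isCloud(name) {
                      if err := stopModel(name); err != nil {
                          return err
                      }
                  }
                  if err := deleteModel(name); err != nil {
                      return err
                  }
              }
              return nil
          }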
    • fix: qwen2.5vl, qwen3vl composite image (#12841) · d432ade7
      Michael Yang authored
      This change fixes images with an alpha channel by compositing the image
      onto a white background (sketched below).
      d432ade7
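      The fix amounts to flattening the image over white before further processing; a
      standard-library sketch (not necessarily the exact code in the vision pipeline):

          package sketch

          import (
              "image"
              "image/color"
              "image/draw"
          )

          // flattenOnWhite composites an image with an alpha channel over an opaque
          // white background, so transparent regions become white instead of black.
          func flattenOnWhite(img image.Image) *image.RGBA {
              bounds := img.Bounds()
              out := image.NewRGBA(bounds)
              draw.Draw(out, bounds, &image.Uniform{C: color.White}, image.Point{}, draw.Src)
              draw.Draw(out, bounds, img, bounds.Min, draw.Over)
              return out
          }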
    • tests: add tests and docs for commonly used ops (#12844) · 06b3422d
      Michael Yang authored
      * mulmat
      * permute
      06b3422d
    • Update README.md (#12822) · cbe1cf06
      Athiban Sharon authored
      Fixed broken docs links
      cbe1cf06
  2. 29 Oct, 2025 8 commits
  3. 28 Oct, 2025 12 commits
  4. 27 Oct, 2025 2 commits
    • create: inherit FROM model's renderer/parser · 1bdd8169
      Devon Rifkin authored
      On main, the `RENDERER` and `PARSER` fields from the `Modelfile` don't
      get propagated to a new model created with a `req.From` parameter. This
      is easily triggered via `ollama run qwen3-coder`, then running some save
      command like `/save qwen3-coder-custom`.
      
      Added a regression test for this, and then opened the config for the
      "from" model in order to use its renderer/parser as defaults for the
      new model (a sketch is below). This fixes both CLI and API-based creates.
      
      Fixes: https://github.com/ollama/ollama/issues/12792
      1bdd8169
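      Conceptually the create path now defaults these fields from the base model's config; a
      hedged sketch with made-up type and function names:

          package sketch

          // ModelConfig is a stand-in for the relevant fields of a model's config.
          type ModelConfig struct {
              Renderer string
              Parser   string
          }

          // inheritFrom defaults the new model's renderer and parser from the base
          // ("FROM") model's config when the Modelfile doesn't set them.
          func inheritFrom(base, requested ModelConfig) ModelConfig {
              out := requested
              if out.Renderer == "" {
                  out.Renderer = base.Renderer
              }
              if out.Parser == "" {
                  out.Parser = base.Parser
              }
              return out
          }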
    • server: Consolidate embedding truncation in runner (#12730) · 5d347f6d
      nicole pardal authored
      Currently, checking the length of embedding prompts to ensure they
      fit in the context window (and truncating them if necessary) occurs in
      two places: the Ollama server and the runner. This can lead to
      inconsistencies in both the checks and the reported number of tokens
      processed. Since we have to do this processing in the runner anyway,
      this consolidates all of the logic there (sketched below).
      5d347f6d
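      A hedged sketch of the single length check now owned by the runner (names are
      illustrative):

          package sketch

          import "fmt"

          // fitToContext enforces the context window in one place: clip the token
          // sequence if truncation is allowed, otherwise report the real length.
          func fitToContext(tokens []int, numCtx int, truncate bool) ([]int, error) {
              if len(tokens) <= numCtx {
                  return tokens, nil
              }
              if !truncate {
                  return nil, fmt.Errorf("input length %d exceeds context length %d", len(tokens), numCtx)
              }
              return tokens[:numCtx], nil
          }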
  5. 25 Oct, 2025 1 commit
  6. 23 Oct, 2025 4 commits
    • llm: Change memory allocation backoff from exponential to incremental · ad6f6a1d
      Jesse Gross authored
      If we create a memory layout that should fit based on reported free VRAM
      but allocation still fails, we start applying a backoff. This reduces the
      assumed free VRAM by an exponentially growing percentage (1%, 2%, 4%...).
      However, the points chosen tend to be too dense at the beginning and too
      sparse at the end. Therefore, this switches to an incremental backoff
      (10%, 20%, 30%...), as sketched below.
      ad6f6a1d
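      The two schedules differ as follows; a sketch, with the attempt numbering assumed to
      start at zero:

          package sketch

          // Fraction of the reported free VRAM to subtract on a given retry.

          // exponentialBackoff: 1%, 2%, 4%, 8%, ... (the old behavior)
          func exponentialBackoff(attempt int) float64 {
              return 0.01 * float64(int(1)<<attempt)
          }

          // incrementalBackoff: 10%, 20%, 30%, ... (the new behavior)
          func incrementalBackoff(attempt int) float64 {
              return 0.10 * float64(attempt+1)
          }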
    • Vinh Nguyen
    • DRY out the runner lifecycle code (#12540) · 3258a89b
      Daniel Hiltgen authored
      * DRY out the runner lifecycle code
      
      Now that discovery uses the runners as well, this unifies the runner spawning code
      into a single place. This also unifies GPU discovery types with the newer ml.DeviceInfo.
      
      * win: make incremental builds better
      
      Place build artifacts in discrete directories so incremental builds don't have to start fresh
      
      * Adjust sort order to consider iGPUs
      
      * handle cpu inference oom scenarios
      
      * review comments
      3258a89b
    • kvcache: Remove special case for reservation mask · 1c093e97
      Jesse Gross authored
      We currently short-circuit generation of the cache mask and just
      generate an empty tensor of the correct size. However, in some
      cases this can also skip a cast operation, which can result in the
      worst case graph not being fully worst case.
      
      We don't actually need the fast path for mask generation, so it's
      better to just use the normal code path.
      1c093e97
  7. 22 Oct, 2025 2 commits
    • llamarunner: Record the time for all batches during prompt processing · a8d9c264
      Jesse Gross authored
      Currently, we only record the time for the last batch when processing
      the prompt. This results in unrealistically high numbers for the
      old llama runner; a sketch of the fix follows this entry.
      
      Before:
      total duration:       31.273112939s
      load duration:        4.97054657s
      prompt eval count:    32768 token(s)
      prompt eval duration: 235.137439ms
      prompt eval rate:     139356.80 tokens/s
      eval count:           1873 token(s)
      eval duration:        18.173182374s
      eval rate:            103.06 tokens/s
      
      After:
      total duration:       30.024798033s
      load duration:        4.758588663s
      prompt eval count:    32768 token(s)
      prompt eval duration: 7.779621548s
      prompt eval rate:     4212.03 tokens/s
      eval count:           1769 token(s)
      eval duration:        17.148014223s
      eval rate:            103.16 tokens/s
      a8d9c264
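      The gist of the fix, sketched with hypothetical timing fields: add up prompt-processing
      time across every batch rather than keeping only the last batch's duration.

          package sketch

          import "time"

          // timings is a stand-in for the runner's timing bookkeeping.
          type timings struct {
              promptEvalDuration time.Duration
          }

          // recordBatch accumulates the time spent on every prompt batch instead of
          // overwriting the total with only the final batch's duration.
          func (t *timings) recordBatch(start time.Time) {
              t.promptEvalDuration += time.Since(start)
          }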
    • frob