1. 02 Oct, 2025 1 commit
  2. 01 Oct, 2025 3 commits
    • Use runners for GPU discovery (#12090) · bc8909fb
      Daniel Hiltgen authored
      This revamps how we discover GPUs in the system by leveraging the Ollama
      runner.  This should eliminate inconsistency between our GPU discovery and the
      runner's capabilities at runtime, particularly for cases where we try to filter
      out unsupported GPUs.  Now the runner does that implicitly based on the actual
      device list.  In some cases free VRAM reporting can be unreliable, which can
      lead to scheduling mistakes, so this also includes a patch to leverage more
      reliable VRAM reporting libraries if available.
      
      Automatic workarounds have been removed as only one GPU leveraged this, which
      is now documented. This GPU will soon fall off the support matrix with the next
      ROCm bump.
      
      Additional cleanup of the scheduler and discovery packages can be done in the
      future once we have switched on the new memory management code, and removed
      support for the llama runner.
      bc8909fb
    • Merge pull request #12461 from ollama/drifkin/qwen3-coder-tweaks · 6b50f2b9
      Devon Rifkin authored
      qwen3-coder: fix tool definition type rendering
      6b50f2b9
    • fix keep alive · 35ac4eb1
      Michael Yang authored
      this reference to keep alive was missed in #12041, so chat has a
      different behaviour than generate
      35ac4eb1
  3. 30 Sep, 2025 5 commits
    • ggml: Preallocate CUDA pool memory · 3d0b1734
      Jesse Gross authored
      The GGML CUDA backend allocates additional memory for intermediate
      results during calculation. This memory isn't currently allocated
      during worst case graph reservation and therefore not included in
      scheduling. This means that as these buffers potentially grow
      with context length, we could crash.
      
      This extends the memory allocation system down a layer from the GGML
      graph to the CUDA layer, preallocating the worst case memory there
      as well.
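
      A toy sketch of the idea, not the actual GGML or Ollama code: a
      measuring pass records the peak scratch usage at the largest context,
      and the pool is then created with that capacity so the scheduler can
      account for it up front. The `pool` type, `runGraph`, `reserveWorstCase`
      and the `bytesPerToken` cost are all hypothetical names and numbers
      chosen for illustration.

      ```go
      package main

      import "fmt"

      // bytesPerToken is a made-up scratch cost per context token.
      const bytesPerToken = 1024

      // pool is a toy stand-in for a backend scratch-memory pool.
      type pool struct {
      	capacity int // bytes reserved up front; 0 means "measure only"
      	used     int
      	peak     int
      }

      // alloc takes n bytes from the pool and tracks the peak usage.
      func (p *pool) alloc(n int) error {
      	if p.capacity > 0 && p.used+n > p.capacity {
      		return fmt.Errorf("pool exhausted: need %d bytes, %d free", n, p.capacity-p.used)
      	}
      	p.used += n
      	if p.used > p.peak {
      		p.peak = p.used
      	}
      	return nil
      }

      // free returns n bytes to the pool.
      func (p *pool) free(n int) { p.used -= n }

      // runGraph simulates one pass whose scratch need grows with context length.
      func runGraph(p *pool, contextLen int) error {
      	scratch := contextLen * bytesPerToken
      	if err := p.alloc(scratch); err != nil {
      		return err
      	}
      	p.free(scratch)
      	return nil
      }

      // reserveWorstCase runs a measuring pass at the largest context and
      // returns a pool whose capacity equals the observed peak.
      func reserveWorstCase(maxContext int) *pool {
      	probe := &pool{}
      	_ = runGraph(probe, maxContext) // a measuring pool never fails
      	return &pool{capacity: probe.peak}
      }

      func main() {
      	p := reserveWorstCase(4096)
      	fmt.Println("reserved bytes:", p.capacity)
      	fmt.Println("4096 tokens:", runGraph(p, 4096))
      	fmt.Println("8192 tokens:", runGraph(p, 8192))
      }
      ```

      In this sketch any request within the reserved context succeeds, while
      a larger one fails with an explicit error instead of growing the pool
      after scheduling decisions have already been made.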
      
      Fixes #11753
      3d0b1734
    • ggml: Backport scale kernel fixes · efaee8c2
      Jesse Gross authored
      The GGML scale kernel uses signed 32-bit ints to represent
      the number of elements in the tensor. For large images,
      mistral-small3.2 overflows this, triggering CUDA errors due
      to negative arguments.
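
      The overflow itself is easy to reproduce outside of CUDA. The sketch
      below is illustrative only and is not the kernel code: it multiplies
      tensor dimensions into a signed 32-bit counter and into a 64-bit one.
      The 50000 x 50000 shape and both function names are assumptions picked
      so that the product exceeds the int32 range.

      ```go
      package main

      import (
      	"fmt"
      	"math"
      )

      // elementCount32 mimics tracking the element count in a signed 32-bit int.
      // Go wraps silently on signed overflow, so large shapes can go negative.
      func elementCount32(dims ...int32) int32 {
      	n := int32(1)
      	for _, d := range dims {
      		n *= d
      	}
      	return n
      }

      // elementCount64 widens each dimension before multiplying.
      func elementCount64(dims ...int32) int64 {
      	n := int64(1)
      	for _, d := range dims {
      		n *= int64(d)
      	}
      	return n
      }

      func main() {
      	dims := []int32{50000, 50000} // hypothetical shape
      	n32 := elementCount32(dims...)
      	n64 := elementCount64(dims...)
      	fmt.Println("int32 count:", n32) // negative after wrapping
      	fmt.Println("int64 count:", n64)
      	fmt.Println("exceeds int32:", n64 > math.MaxInt32)
      }
      ```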
      
      Currently, this can happen when the user passes a large image
      to mistral-small3.2. However, with upcoming changes to reserve
      CUDA memory, it happens every time mistral-small is loaded as
      we reserve using a worst case batch.
      
      This patch is part of an upstream GGML commit and should be removed
      after GGML is updated past 0a1b398 "ggml: add ops for WAN video model
      (cuda && cpu) (#15669)".
      
      Fixes #10388
      efaee8c2
    • ggml: Remove allocation status reporting · 734b57da
      Jesse Gross authored
      For each memory allocation we report the size of the (attempted)
      allocation and whether it succeeded or failed. The latter status
      reporting proved to be not that useful in practice as systems
      such as Windows can automatically overflow from VRAM into RAM,
      resulting in successful allocations even when there isn't
      enough memory where we wanted.
      
      As a result, this information is only used for debug logging,
      which isn't worthwhile enough for the amount of code. It
      also isn't fully accurate, as multiple allocations may result
      in partial failures.
      734b57da
    • Devon Rifkin's avatar
      83021fcf
    • Michael Yang's avatar
      0469861d
  4. 26 Sep, 2025 2 commits
  5. 25 Sep, 2025 4 commits
  6. 24 Sep, 2025 5 commits
    • Grace/deepseek v3 migration (#12385) · fbd82ba5
      Grace authored
      
      
      * init deepseek model file
      
      * temp removal of flash attention implementation
      
      * shapes and proper, can make a pass
      
      * query, key, value have good cosine similarity, but the max diff is a bit high
      
      * Attention block is working! ** with eager for now, have not added the mask line
      
      * Attention block is working! ** with eager for now, have not added the mask line
      
      * working MoE at around 0.95 cosine sim
      
      * added cosine similarity function
      
      * Starting end to end structure
      
      * Trying (and failing) to get rope to work, going to test full thing on tater
      
      * running on tater36... just not the right outputs
      
      * we have the right values for rope... but it's still not working?
      
      * change Extrapolation Factor to 1
      
      * removed adding residuals twice, removed normalization from shared expert, refactored Norms (Attention, MLP) to be outside the (Attention, MLP) blocks and in the Transformer block instead, add cache setLayer
      
      * Temporary modelfiles for cpu
      
      * change kpass intermediate step to kv, two layer outputs [0,1] look fine
      
      * this calls for 16 chicken nuggets
      
      * whoops
      
      * cleaning up code
      
      * delete stuff we dont need
      
      * getting rid of debug statements for llama cpp
      
      * working with long contexts
      
      * fix long context view error
      
      * reverting some changes I made for files that are not a part of the PR
      
      * Added proper tokenizer for deepseek3
      
      * clean up model and go test
      
      * remove Modelfile
      
      * not passing the tests
      
      * whoops
      
      * how to pass the ci tests
      
      * resolving some of the comments
      
      * rename
      
      * linted and renamed deepseek3 -> deepseek2
      
      * remove name go
      
      * addressed changes - main change was adopting qwen3 naming scheme
      
      * I cannot with linters
      
      * clean up logs
      
      * clean up logs
      
      ---------
      Co-authored-by: Grace Guo <graceguo@Graces-MBP.localdomain>
      Co-authored-by: Grace Guo <graceguo@Graces-MacBook-Pro.local>
      Co-authored-by: graceguo <graceguo@tater36.localdomain>
      fbd82ba5
    • prefer ollama engine for qwen3moe (#12374) · 2e742544
      Michael Yang authored
      2e742544
    • Merge pull request #12393 from ollama/drifkin/fix-built-ins · bbb195a6
      Devon Rifkin authored
      harmony: don't sanitize built-ins
      bbb195a6
    • harmony: don't sanitize built-ins · fd88cd7c
      Devon Rifkin authored
      In #11910 we started sanitizing function names, but we were accidentally
      modifying built-ins like `browser.open` to `browser_open`. This was
      removing the special prompt rendering for built-ins, but this wasn't
      immediately apparent since the models seem to be reasonably good at
      remembering the built-ins even when presented with these slightly
      renamed versions. This fix prevents built-ins from ever being renamed.
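
      A minimal sketch of that guard, not the code from this commit: names
      recognised as built-ins are returned untouched, and everything else
      has disallowed characters replaced with `_`. The `browser.` prefix
      check, the allowed character set, and the function names are
      assumptions made for illustration.

      ```go
      package main

      import (
      	"fmt"
      	"regexp"
      	"strings"
      )

      // invalidChars matches anything outside an assumed allowed character set.
      var invalidChars = regexp.MustCompile(`[^A-Za-z0-9_]`)

      // isBuiltin reports whether name looks like a built-in tool.
      // Only the `browser.` namespace is checked here; this is an assumption.
      func isBuiltin(name string) bool {
      	return strings.HasPrefix(name, "browser.")
      }

      // sanitizeFunctionName rewrites user-defined names but never built-ins.
      func sanitizeFunctionName(name string) string {
      	if isBuiltin(name) {
      		return name
      	}
      	return invalidChars.ReplaceAllString(name, "_")
      }

      func main() {
      	fmt.Println(sanitizeFunctionName("browser.open")) // kept as-is
      	fmt.Println(sanitizeFunctionName("my.tool-name")) // rewritten
      }
      ```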
      fd88cd7c
    • fix: leaf alt name (#12390) · e1979c57
      Michael Yang authored
      a leaf node with an alternative name gets all its alternative names
      added into the same branch rather than creating branches of their own
      e1979c57
  7. 23 Sep, 2025 3 commits
  8. 22 Sep, 2025 4 commits
  9. 20 Sep, 2025 2 commits
    • Merge pull request #12358 from ollama/drifkin/qwen3-coder-ampersands · 3677842f
      Devon Rifkin authored
      parsers: fix `&`s in qwen3coder parameter values
      3677842f
    • parsers: fix `&`s in qwen3coder parameter values · 242df70a
      Devon Rifkin authored
      In <https://github.com/ollama/ollama/issues/12357> we saw that the model
      will output tool calls such as
      
      ```
      <function=shell>
      <parameter=command>
      pwd && ls -la
      </parameter>
      </function>
      ```
      
      We parse this by transforming the output into valid xml and then
      using an xml parser. While we do transform the function and parameter
      names, we weren't escaping the parameter values (which in this example
      are invalid since `pwd && ls -la` contains unescaped ampersands).
      
      This has been fixed by first transforming the tags in the same way, and
      then walking the transformed string and escaping the text in between the
      tags. This also fixes a case where `<` in the middle of a parameter
      value would cause an xml parse failure.
      
      Fixes: #12357
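
      The escaping step described above can be sketched roughly as follows.
      This is an illustration under assumptions, not the parser from this
      commit: it assumes the tags have already been transformed into the
      form `<function name="...">` and `<parameter name="...">`, and the
      names `tagPattern`, `escapeBetweenTags` and `parseCall` are
      hypothetical. Only known tags are treated as markup, so a stray `<`
      or `&` inside a value is escaped as text.

      ```go
      package main

      import (
      	"encoding/xml"
      	"fmt"
      	"regexp"
      	"strings"
      )

      // tagPattern matches the already-transformed tags assumed by this sketch.
      var tagPattern = regexp.MustCompile(`</?(?:function|parameter)(?: name="[^"]*")?>`)

      // escapeBetweenTags copies known tags verbatim and XML-escapes all other text.
      func escapeBetweenTags(s string) string {
      	var b strings.Builder
      	last := 0
      	for _, loc := range tagPattern.FindAllStringIndex(s, -1) {
      		xml.EscapeText(&b, []byte(s[last:loc[0]])) // text before the tag
      		b.WriteString(s[loc[0]:loc[1]])            // the tag itself
      		last = loc[1]
      	}
      	xml.EscapeText(&b, []byte(s[last:])) // trailing text, if any
      	return b.String()
      }

      type param struct {
      	Name  string `xml:"name,attr"`
      	Value string `xml:",chardata"`
      }

      type function struct {
      	Name   string  `xml:"name,attr"`
      	Params []param `xml:"parameter"`
      }

      // parseCall escapes the text between tags and then runs a standard XML parse.
      func parseCall(transformed string) (function, error) {
      	var f function
      	err := xml.Unmarshal([]byte(escapeBetweenTags(transformed)), &f)
      	return f, err
      }

      func main() {
      	in := `<function name="shell"><parameter name="command">pwd && ls -la</parameter></function>`
      	f, err := parseCall(in)
      	if err != nil {
      		fmt.Println("parse error:", err)
      		return
      	}
      	fmt.Printf("%s: %q\n", f.Name, f.Params[0].Value)
      }
      ```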
      242df70a
  10. 19 Sep, 2025 1 commit
  11. 18 Sep, 2025 8 commits
  12. 17 Sep, 2025 2 commits