- 13 Oct, 2025 5 commits
-
-
Grace authored
* Working (other than tool calls being in the incorrect order) for tool calls and tools
* Tests work, other than image tags (tests do not go through the server) and tools (not in the correct order, but contents are the same)
* Testing for the qwen3vl parser - the tool parser is working
* Changed the JSON tool parser to wrap the ToolCallFunction with a ToolCall object (see the sketch below)
* Working parser for thinking models - assumes a thinking state, emits unambiguous content while thinking, and does not emit tool calls while thinking
* Changed the parser to start by collecting content
* Thinking prefill
* Add hasThinkingSupport parameter to the parser
* qwen3-vl -> qwen3-vl-instruct for the renderer/parser
* Add hasThinkingSupport=false to QwenVLParser

---------

Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
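As an illustration of the wrapping described above, a minimal Go sketch with hypothetical ToolCall / ToolCallFunction types modeled on the API shape (type and field names are assumptions, not the actual parser code):

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical shapes; the real types live in Ollama's API/parser packages.
type ToolCallFunction struct {
	Name      string         `json:"name"`
	Arguments map[string]any `json:"arguments"`
}

type ToolCall struct {
	Function ToolCallFunction `json:"function"`
}

// wrapToolCall wraps a parsed function payload in a ToolCall object,
// mirroring the change described in the commit above.
func wrapToolCall(fn ToolCallFunction) ToolCall {
	return ToolCall{Function: fn}
}

func main() {
	fn := ToolCallFunction{Name: "get_weather", Arguments: map[string]any{"city": "Berlin"}}
	out, _ := json.Marshal(wrapToolCall(fn))
	fmt.Println(string(out)) // {"function":{"name":"get_weather","arguments":{"city":"Berlin"}}}
}
```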
-
Gabe Goodhart authored
Llama cpp bump (df1b612): granite docling / mamba2 optimizations / multimodal encoding fixes (#12552)

* feat: Bump llama.cpp to df1b612
* fix(mtmd): Correctly encode text chunks during mtmd tokenization
  There can be text chunks that appear interspersed with the image embeddings that contain template delimiter tokens for some models. These need to be correctly translated to text tokens.
* tests: Use MtmdChunk in image_test
* style: Fix unnecessary conversion linting
* fix(ggml): Revert changes to ggml_hip.cpp
  These changes were done largely by our code assistant and are likely wrong
* fix: Revert changes in mem_nvml.cpp
* feat: Update sync point to 1deee0
  This brings in several more optimization commits and model support for EmbeddingGemma
* feat: Update patches for 1deee0
* feat: sync for bump to 1deee0
* fix: Bad patch updates with errant `+`
* feat: Bump llama.cpp/ggml to 7049736
* fix: format-patches after latest bump

All steps on Branch: LlamaCPPBump-GraniteDocling

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
-
Jeffrey Morgan authored
-
Michael Yang authored
This reverts commit 3d32249c.
-
Michael Yang authored
DeepSeek's Qwen3 distill uses a different RoPE scheme, so support both.
-
- 11 Oct, 2025 5 commits
-
-
Jeffrey Morgan authored
-
Devon Rifkin authored
routes: fix built-in renderers for `api/generate`
-
Devon Rifkin authored
When `api/generate` builds up a message array and generates the prompt, it now goes through the same function as `api/chat` for consistency. That function is where the optional built-in renderers hook in to bypass templates, which was missing for `api/generate` before this change. Closes: #12578
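As an illustration, a minimal Go sketch of routing both endpoints through one prompt-building function (function and type names here are assumptions, not Ollama's actual route code):

```go
package routes

// Message is a hypothetical stand-in for the API message shape.
type Message struct {
	Role    string
	Content string
}

// buildPrompt stands in for the shared function both endpoints now call:
// it applies either a built-in renderer or the model template to the messages.
func buildPrompt(msgs []Message, renderer, applyTemplate func([]Message) string) string {
	if renderer != nil {
		return renderer(msgs) // built-in renderer bypasses the template
	}
	return applyTemplate(msgs)
}

// generateHandler sketches api/generate: wrap the raw prompt as a single user
// message, then go through the same buildPrompt path that api/chat uses.
func generateHandler(prompt string, renderer, applyTemplate func([]Message) string) string {
	msgs := []Message{{Role: "user", Content: prompt}}
	return buildPrompt(msgs, renderer, applyTemplate)
}
```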
-
frob authored
-
Daniel Hiltgen authored
-
- 10 Oct, 2025 10 commits
-
-
Michael Yang authored
Sending to hardErrCh will deadlock since forwardBatch is blocked on computeStartedCh, which never gets sent to. Since the response to hardErrCh is to panic anyway, just panic instead.
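A minimal Go sketch of this kind of deadlock; the channel names follow the commit message, but the surrounding logic is simplified and assumed:

```go
package main

import "fmt"

func main() {
	computeStartedCh := make(chan struct{})
	hardErrCh := make(chan error)

	// The receiver side is stuck waiting for compute to start, so it never
	// gets around to draining hardErrCh.
	go func() {
		<-computeStartedCh // never signaled in this failure case
		panic(<-hardErrCh)
	}()

	err := fmt.Errorf("hard error")

	// hardErrCh <- err // would block forever: the only receiver is stuck above
	panic(err) // the fix described in the commit: panic directly instead
}
```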
-
Daniel Hiltgen authored
* Implement NVML for Linux
* Improve scheduler logging when VRAM doesn't recover
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
yajianggroup authored
Signed-off-by: yajianggroup <yajianggroup@outlook.com>
-
Jeffrey Morgan authored
-
Patrick Devine authored
-
- 09 Oct, 2025 9 commits
-
-
shengxinjing authored
-
shengxinjing authored
-
Michael Yang authored
-
Michael Yang authored
This change updates how metrics are collected. Until now, performance metrics, specifically initial input processing and subsequent generation durations, were collected by taking timestamps at sequence creation, at first token generation, and at generation completion. The processing duration was computed as first token generation minus sequence creation, while the generation duration was computed as generation completion minus first token generation. While this approach is an accurate end-to-end measure of processing and generation, it is not comparable to other tools, which only measure the active (i.e. decode) duration. This change updates the metrics to only capture the decode duration so they can be compared more directly to other tools.
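A simplified Go sketch of the before/after arithmetic, assuming hypothetical timestamp fields (not the actual runner code):

```go
package metrics

import "time"

// seqTimings holds timestamps a runner might record for one sequence; field
// names are hypothetical and only illustrate the arithmetic described above.
type seqTimings struct {
	created     time.Time     // sequence created (prompt received)
	firstToken  time.Time     // first generated token emitted
	done        time.Time     // generation finished
	decodeTotal time.Duration // accumulated time spent inside decode calls only
}

// endToEnd reflects the old behavior: durations include queueing/setup time.
func endToEnd(t seqTimings) (processing, generation time.Duration) {
	return t.firstToken.Sub(t.created), t.done.Sub(t.firstToken)
}

// decodeOnly reflects the new behavior: count only active decode time,
// which is what other tools typically report.
func decodeOnly(t seqTimings) time.Duration {
	return t.decodeTotal
}
```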
-
Daniel Hiltgen authored
* logs: quiet down context canceled on completion
  If the client closes the connection before Completion finishes, we were logging at error level, implying the runner crashed, which was misleading (see the sketch after this entry):
  time=2025-10-08T22:59:20.566-07:00 level=ERROR source=server.go:1490 msg="post predict" error="Post \"http://127.0.0.1:57736/completion\": context canceled"
* quiet down scheduler log error on expected case
  Since we don't hold the lock while performing memory load calculations, other runners can unload in parallel, so finding no runner to unload is a valid scenario that we shouldn't log at error level.
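A small Go sketch of the usual way to demote this kind of log line, assuming the error is wrapped so errors.Is can see context.Canceled (illustrative, not the actual server code):

```go
package main

import (
	"context"
	"errors"
	"log/slog"
)

// logPredictErr demotes client-initiated cancellations to debug level and
// keeps error level for everything else, matching the behavior described
// in the commit above.
func logPredictErr(err error) {
	if errors.Is(err, context.Canceled) {
		slog.Debug("post predict canceled by client", "error", err)
		return
	}
	slog.Error("post predict", "error", err)
}

func main() {
	logPredictErr(context.Canceled)   // logged at debug
	logPredictErr(errors.New("boom")) // logged at error
}
```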
-
Parth Sareen authored
-
Patrick Devine authored
-
Jeffrey Morgan authored
This reverts commit 6a62b894.
-
Jeffrey Morgan authored
-
- 08 Oct, 2025 4 commits
-
-
Patrick Devine authored
-
Jesse Gross authored
Sliding window models (e.g. gpt-oss, gemma3) remove tokens that are outside the cache's window each time we start a new forward pass. The cache storage needs to handle the window size for each sequence plus the batch size, since the batch needs to attend to the full window; this means we have more than a window's worth stored while processing the batch.

When the next batch comes, we currently only look at the sequences in the incoming batch to slide the window forward. However, we also need to clean up the other sequences that might be occupying space in the batch processing buffer, to ensure each sequence is only using its window size of storage. Failure to do this can result in "no kv cache slot found" errors.

Fixes: #10127
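A simplified Go sketch of the cleanup described above, assuming a hypothetical per-sequence token store (not the actual KV cache implementation):

```go
package cache

// slideAll trims every tracked sequence down to its attention window, not
// just the sequences present in the incoming batch, freeing the extra slots
// that were needed while the previous batch attended to the full window.
func slideAll(seqTokens map[int][]int32, windowSize int) {
	for seq, toks := range seqTokens {
		if extra := len(toks) - windowSize; extra > 0 {
			seqTokens[seq] = toks[extra:] // drop tokens that fell out of the window
		}
	}
}
```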
-
Jesse Gross authored
GGML picks the wrong kernel and these systems fail with:
Sep 28 22:25:39 xavier ollama[48999]: //ml/backend/ggml/ggml/src/ggml-cuda/fattn-wmma-f16.cu:437: ERROR: CUDA kernel flash_attn_ext_f16 has no device code compatible with CUDA arch 720. ggml-cuda.cu was compiled for: __CUDA_ARCH_LIST__

Fixes #12442
-
Daniel Hiltgen authored
Remove some flaky scenarios, and switch to chat for better reliability
-
- 07 Oct, 2025 2 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
* Bring back escape valve for llm libraries
  If the new discovery logic picks the wrong library, this gives users the ability to force a specific one using the same pattern as before. This can also potentially speed up bootstrap discovery if one of the libraries takes a long time to load and ultimately binds to no devices; for example, unsupported AMD iGPUs can sometimes take a while to discover and rule out. (See the sketch after this entry.)
* Bypass extra discovery on Jetpack systems
  On at least JetPack 6, cuda_v12 appears to expose the iGPU but crashes later in cublasInit, so if we detect a Jetpack, short-circuit and use that variant.
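A hedged Go sketch of such an escape valve, assuming an environment-variable override of library selection (the variable name OLLAMA_LLM_LIBRARY follows the pre-existing pattern referenced above but is an assumption here; the discovery details are simplified):

```go
package discover

import "os"

// pickLibrary returns the forced library variant if the user set the escape
// valve, otherwise falls back to the normal bootstrap discovery result.
func pickLibrary(discovered []string) string {
	if forced := os.Getenv("OLLAMA_LLM_LIBRARY"); forced != "" {
		return forced // user override: skip extra probing entirely
	}
	if len(discovered) > 0 {
		return discovered[0]
	}
	return "cpu"
}
```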
-
- 06 Oct, 2025 3 commits
-
-
Devon Rifkin authored
openai: refactor to split compat layer and middleware
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
This variable isn't currently documented or intended as something the user can override, but if the user happens to set OLLAMA_LIBRARY_PATH, we were doubling it in the subprocess environment, which causes problems with the new bootstrap discovery logic.
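For illustration, a minimal Go sketch of building the subprocess environment with exactly one copy of the variable (a sketch under assumed names, not the actual code):

```go
package main

import (
	"fmt"
	"strings"
)

// withLibraryPath returns env with exactly one OLLAMA_LIBRARY_PATH entry,
// replacing any value already present instead of appending a second one.
func withLibraryPath(env []string, libraryPath string) []string {
	out := make([]string, 0, len(env)+1)
	for _, kv := range env {
		if strings.HasPrefix(kv, "OLLAMA_LIBRARY_PATH=") {
			continue // drop the inherited value so it isn't doubled
		}
		out = append(out, kv)
	}
	return append(out, "OLLAMA_LIBRARY_PATH="+libraryPath)
}

func main() {
	env := []string{"PATH=/usr/bin", "OLLAMA_LIBRARY_PATH=/stale/path"}
	fmt.Println(withLibraryPath(env, "/opt/ollama/lib"))
}
```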
-
- 05 Oct, 2025 1 commit
-
-
Devon Rifkin authored
This makes the core OpenAI compat layer independent of the middleware that adapts it to our particular Gin routes.
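A small Go sketch of that kind of split: a pure conversion function with no Gin dependency, plus a thin Gin middleware that adapts it (names and request shapes are assumptions for illustration):

```go
package openai

import (
	"net/http"

	"github.com/gin-gonic/gin"
)

// Pure compat layer: translate an OpenAI-style request into the native
// request shape without touching HTTP or Gin.
type ChatCompletionRequest struct {
	Model    string   `json:"model"`
	Messages []string `json:"messages"`
}

type NativeChatRequest struct {
	Model    string
	Messages []string
}

func FromChatCompletion(req ChatCompletionRequest) NativeChatRequest {
	return NativeChatRequest{Model: req.Model, Messages: req.Messages}
}

// Middleware/adapter layer: all Gin-specific concerns live here.
func ChatCompletionMiddleware() gin.HandlerFunc {
	return func(c *gin.Context) {
		var req ChatCompletionRequest
		if err := c.ShouldBindJSON(&req); err != nil {
			c.AbortWithStatusJSON(http.StatusBadRequest, gin.H{"error": err.Error()})
			return
		}
		c.Set("nativeRequest", FromChatCompletion(req))
		c.Next()
	}
}
```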
-
- 04 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
Resolve subtle ErrorAction stickiness difference between the x86 and ARM builder setup.
-