- 10 Oct, 2025 4 commits
-
-
Michael Yang authored
-
yajianggroup authored
Signed-off-by: yajianggroup <yajianggroup@outlook.com>
-
Jeffrey Morgan authored
-
Patrick Devine authored
-
- 09 Oct, 2025 9 commits
-
-
shengxinjing authored
-
shengxinjing authored
-
Michael Yang authored
-
Michael Yang authored
This change updates how metrics are collected. Until now, performance metrics, specifically the initial input processing and subsequent generation durations, were collected by taking timestamps at sequence creation, first token generation, and completion of generation. The processing duration was computed as first token generation minus sequence creation, while the generation duration was computed as completion of generation minus first token generation. While this approach is an accurate end-to-end measure of processing and generation, it's not comparable to other tools, which only measure the active, i.e. decode, duration. This change updates the metrics to capture only decode duration so it can be more directly compared to other tools.
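A minimal Go sketch of the decode-only timing idea described above; the type and field names are illustrative stand-ins, not the actual runner code:

```go
package main

import (
	"fmt"
	"time"
)

type seqMetrics struct {
	decodeTokens   int
	decodeDuration time.Duration // accumulates only active decode time
}

// decodeStep wraps a single decode pass and accumulates only the time
// spent inside it, rather than wall-clock time since sequence creation.
func decodeStep(m *seqMetrics, step func()) {
	start := time.Now()
	step()
	m.decodeDuration += time.Since(start)
	m.decodeTokens++
}

func main() {
	var m seqMetrics
	for i := 0; i < 3; i++ {
		// stand-in for one decode pass of the model
		decodeStep(&m, func() { time.Sleep(10 * time.Millisecond) })
	}
	fmt.Printf("decode: %d tokens in %v (%.1f tok/s)\n",
		m.decodeTokens, m.decodeDuration,
		float64(m.decodeTokens)/m.decodeDuration.Seconds())
}
```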
-
Daniel Hiltgen authored
* logs: quiet down context canceled on completion

  If the client closes the connection before Completion finishes, we were logging at error level, implying the runner crashed, which was misleading (a sketch of the quieter handling follows below):

  time=2025-10-08T22:59:20.566-07:00 level=ERROR source=server.go:1490 msg="post predict" error="Post \"http://127.0.0.1:57736/completion\": context canceled"

* quiet down scheduler log error on expected case

  Since we don't hold the lock while performing memory load calculations, other runners can unload in parallel, so finding no runner to unload is a valid scenario which we shouldn't log at error level.
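A hedged sketch, assuming Go's standard context, errors, and log/slog packages, of demoting expected client disconnects below error level; the function name and log messages are illustrative, not the actual server code:

```go
package main

import (
	"context"
	"errors"
	"log/slog"
)

// logPredictError demotes the expected "client went away" case so it no
// longer looks like a runner crash, while real failures stay at error.
func logPredictError(err error) {
	if err == nil {
		return
	}
	if errors.Is(err, context.Canceled) {
		slog.Debug("post predict canceled by client", "error", err)
		return
	}
	slog.Error("post predict", "error", err)
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	cancel()
	logPredictError(ctx.Err())                        // expected: logged at debug
	logPredictError(errors.New("connection refused")) // unexpected: logged at error
}
```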
-
Parth Sareen authored
-
Patrick Devine authored
-
Jeffrey Morgan authored
This reverts commit 6a62b894.
-
Jeffrey Morgan authored
-
- 08 Oct, 2025 4 commits
-
-
Patrick Devine authored
-
Jesse Gross authored
Sliding window models (e.g. gpt-oss, gemma3) remove tokens that are outside the cache's window each time we start a new forward pass. The cache storage needs to handle the window size for each sequence plus the batch size, since the batch needs to attend to the full window. This means we have more than a window's worth of tokens stored while processing the batch.

When the next batch comes in, we currently only look at the sequences in the incoming batch to slide the window forward. However, we also need to clean up the other sequences that might be occupying space in the batch processing buffer, to ensure each sequence is only using its window size of storage. Failing to do this can result in "no kv cache slot found" errors.

Fixes: #10127
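A simplified Go sketch of the cleanup described above, trimming every tracked sequence to its window rather than only the sequences in the incoming batch; the types and window size are hypothetical:

```go
package main

import "fmt"

type sequence struct {
	id     int
	tokens []int // token positions currently held in cache storage
}

const windowSize = 4

// slideAll trims every sequence down to windowSize, including sequences
// that are not in the current batch but still occupy cache slots.
func slideAll(seqs []*sequence) {
	for _, s := range seqs {
		if extra := len(s.tokens) - windowSize; extra > 0 {
			s.tokens = s.tokens[extra:]
		}
	}
}

func main() {
	seqs := []*sequence{
		{id: 0, tokens: []int{0, 1, 2, 3, 4, 5}},    // in the incoming batch
		{id: 1, tokens: []int{0, 1, 2, 3, 4, 5, 6}}, // idle, but still holding cache slots
	}
	slideAll(seqs)
	for _, s := range seqs {
		fmt.Printf("seq %d now holds %d tokens\n", s.id, len(s.tokens))
	}
}
```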
-
Jesse Gross authored
GGML picks the wrong kernel and these systems fail with:

Sep 28 22:25:39 xavier ollama[48999]: //ml/backend/ggml/ggml/src/ggml-cuda/fattn-wmma-f16.cu:437: ERROR: CUDA kernel flash_attn_ext_f16 has no device code compatible with CUDA arch 720. ggml-cuda.cu was compiled for: __CUDA_ARCH_LIST__

Fixes #12442
-
Daniel Hiltgen authored
Remove some flaky scenarios, and switch to chat for better reliability
-
- 07 Oct, 2025 2 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
* Bring back escape valve for llm libraries

  If the new discovery logic picks the wrong library, this gives users the ability to force a specific one using the same pattern as before (a hedged sketch of the pattern follows below). This can also potentially speed up bootstrap discovery if one of the libraries takes a long time to load and ultimately binds to no devices. For example, unsupported AMD iGPUs can sometimes take a while to discover and rule out.

* Bypass extra discovery on Jetpack systems

  On at least JetPack 6, cuda_v12 appears to expose the iGPU but crashes later on in cublasInit, so if we detect a Jetpack, short-circuit and use that variant.
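A rough Go sketch of both behaviors; the environment variable name, the Jetpack detection, and the library variant names are assumptions for illustration, not the actual discovery code:

```go
package main

import (
	"fmt"
	"os"
)

// isJetpack is a stand-in check; a real detection might inspect
// /etc/nv_tegra_release or similar (assumption, not the actual logic).
func isJetpack() bool {
	_, err := os.Stat("/etc/nv_tegra_release")
	return err == nil
}

// pickLibrary applies the escape valve and Jetpack short-circuit before
// falling back to whatever bootstrap discovery found.
func pickLibrary(discovered []string) string {
	// 1. A user-forced library wins and bypasses discovery entirely.
	if forced := os.Getenv("OLLAMA_LLM_LIBRARY"); forced != "" { // variable name assumed
		return forced
	}
	// 2. On Jetpack systems, avoid the generic cuda_v12 path that can
	//    crash later in cublasInit and use the Jetpack variant instead.
	if isJetpack() {
		return "cuda_jetpack" // variant name assumed
	}
	// 3. Otherwise use the first discovered library, or fall back to CPU.
	if len(discovered) > 0 {
		return discovered[0]
	}
	return "cpu"
}

func main() {
	fmt.Println("using library:", pickLibrary([]string{"cuda_v12", "cpu"}))
}
```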
-
- 06 Oct, 2025 3 commits
-
-
Devon Rifkin authored
openai: refactor to split compat layer and middleware
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
This variable isn't currently documented or intended as something the user can override, but if the user happens to set OLLAMA_LIBRARY_PATH, we were doubling it in the subprocess environment, which will cause problems with the new bootstrap discovery logic.
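A small Go sketch of the general idea of not duplicating a path-list variable when building a subprocess environment; how the actual fix handles this is not shown here, so treat the helper below as an assumption:

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// mergeLibraryPath prepends ourPath to any inherited value only if it
// is not already present, so the subprocess never sees the entry twice.
func mergeLibraryPath(ourPath string) string {
	existing := os.Getenv("OLLAMA_LIBRARY_PATH")
	if existing == "" {
		return ourPath
	}
	for _, p := range strings.Split(existing, string(os.PathListSeparator)) {
		if p == ourPath {
			return existing // already present; don't double it
		}
	}
	return ourPath + string(os.PathListSeparator) + existing
}

func main() {
	os.Setenv("OLLAMA_LIBRARY_PATH", "/opt/ollama/lib")
	fmt.Println(mergeLibraryPath("/opt/ollama/lib")) // prints the path once, not twice
}
```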
-
- 05 Oct, 2025 1 commit
-
-
Devon Rifkin authored
This makes the core openai compat layer independent of the middleware that adapts it to our particular gin routes
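A rough sketch of that separation: a pure conversion function with no web-framework types, plus a thin HTTP adapter standing in for the gin middleware (net/http is used here only to keep the sketch self-contained; all names are illustrative):

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

// ---- compat layer: pure data conversion, no framework types ----

type openaiChatRequest struct {
	Model    string `json:"model"`
	Messages []struct {
		Role    string `json:"role"`
		Content string `json:"content"`
	} `json:"messages"`
}

type nativeChatRequest struct {
	Model  string
	Prompt string
}

// fromOpenAI converts an OpenAI-style request into a native request
// shape. It knows nothing about HTTP routing or middleware.
func fromOpenAI(r openaiChatRequest) nativeChatRequest {
	var prompt string
	for _, m := range r.Messages {
		prompt += m.Role + ": " + m.Content + "\n"
	}
	return nativeChatRequest{Model: r.Model, Prompt: prompt}
}

// ---- adapter: binds the compat layer to a concrete route ----

func chatCompatHandler(w http.ResponseWriter, r *http.Request) {
	var req openaiChatRequest
	if err := json.NewDecoder(r.Body).Decode(&req); err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	native := fromOpenAI(req)
	fmt.Fprintf(w, "would dispatch model=%s prompt=%q\n", native.Model, native.Prompt)
}

func main() {
	http.HandleFunc("/v1/chat/completions", chatCompatHandler)
	http.ListenAndServe("127.0.0.1:8080", nil)
}
```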
-
- 04 Oct, 2025 2 commits
-
-
Daniel Hiltgen authored
Resolve subtle ErrorAction stickiness difference between x86 and arm builder setup
-
Daniel Hiltgen authored
-
- 03 Oct, 2025 10 commits
-
-
Jesse Gross authored
With the new version of GGML in #12245, KV cache quantization no longer causes a fallback to CPU.
-
Grace authored
-
Daniel Hiltgen authored
The CUDA APIs for reporting free VRAM are useless on NVIDIA iGPU systems, as they only return the kernel's actual free memory and ignore buff/cache allocations, which on a typical system will quickly fill up most of the free system memory. As a result, we incorrectly think there's very little memory available for GPU allocations.
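A hedged, Linux-only Go sketch of one way to get a more realistic figure on a unified-memory iGPU: read MemAvailable from /proc/meminfo, which accounts for reclaimable buff/cache. Whether the actual change uses this exact source is an assumption:

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// memAvailableBytes parses MemAvailable from /proc/meminfo. Unlike a
// naive free-memory query, this counts memory the kernel can reclaim
// from buff/cache, which matters on unified-memory iGPU systems.
func memAvailableBytes() (uint64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "MemAvailable:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil
		}
	}
	return 0, fmt.Errorf("MemAvailable not found in /proc/meminfo")
}

func main() {
	b, err := memAvailableBytes()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("approx. memory available for GPU allocations: %d MiB\n", b/(1024*1024))
}
```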
-
Patrick Devine authored
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
This will likely yield builds that have problems with Unicode characters, but at least we can start testing the release while we try to find an alternate clang compiler for Windows, or until MinGW ships a fixed version.
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
-
Patrick Devine authored
-
Jesse Gross authored
-
- 02 Oct, 2025 4 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
Notable EOLs with this change:
- macOS v12 and v13 are no longer supported (v14+ required)
- AMD gfx900 and gfx906 are no longer supported
-
Jesse Gross authored
As we automatically enable flash attention for more models, there are likely some cases where we get it wrong. This allows setting OLLAMA_FLASH_ATTENTION=0 to disable it, even for models that usually have flash attention.
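A minimal Go sketch of the override behavior described above; the exact set of accepted values and the surrounding function name are assumptions:

```go
package main

import (
	"fmt"
	"os"
)

// flashAttentionEnabled returns the model's default unless the user set
// OLLAMA_FLASH_ATTENTION, in which case the user's choice wins, so
// setting it to "0" disables flash attention even for models that
// normally enable it.
func flashAttentionEnabled(modelDefault bool) bool {
	switch os.Getenv("OLLAMA_FLASH_ATTENTION") {
	case "":
		return modelDefault
	case "0", "false":
		return false
	default:
		return true
	}
}

func main() {
	os.Setenv("OLLAMA_FLASH_ATTENTION", "0")
	fmt.Println(flashAttentionEnabled(true)) // false: the override wins over the model default
}
```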
-
Daniel Hiltgen authored
Wrong index variable was used.
-
- 01 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
This revamps how we discover GPUs in the system by leveraging the Ollama runner. This should eliminate inconsistency between our GPU discovery and the runner's capabilities at runtime, particularly for cases where we try to filter out unsupported GPUs: now the runner does that implicitly based on the actual device list.

In some cases free VRAM reporting can be unreliable, which can lead to scheduling mistakes, so this also includes a patch to leverage more reliable VRAM reporting libraries if available.

Automatic workarounds have been removed, as only one GPU leveraged this, which is now documented. This GPU will soon fall off the support matrix with the next ROCm bump. Additional cleanup of the scheduler and discovery packages can be done in the future once we have switched on the new memory management code and removed support for the llama runner.
-