- 28 Oct, 2025 1 commit
Michael Yang authored
- 27 Oct, 2025 1 commit
nicole pardal authored
Currently, checking that embedding prompts fit in the context window (and truncating them if necessary) happens in two places: the Ollama server and the runner. This can lead to inconsistencies in both the checks and the reported number of tokens processed. Since we have to do this processing in the runner anyway, this consolidates all of the logic there.
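A minimal sketch of the consolidated runner-side check, assuming hypothetical names for the token slice and context-length parameters; this is not the actual runner code:

```go
package runner

import "fmt"

// truncatePrompt is an illustrative stand-in for the consolidated check:
// the runner alone decides whether an embedding prompt fits the context
// window and, if allowed, truncates it so the reported token count matches
// what was actually processed.
func truncatePrompt(tokens []int32, numCtx int, allowTruncate bool) ([]int32, error) {
	if len(tokens) <= numCtx {
		return tokens, nil
	}
	if !allowTruncate {
		return nil, fmt.Errorf("prompt length (%d tokens) exceeds context length (%d)", len(tokens), numCtx)
	}
	return tokens[:numCtx], nil
}
```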
- 25 Oct, 2025 1 commit
Patrick Devine authored
- 23 Oct, 2025 4 commits
Jesse Gross authored
If we create a memory layout that should fit based on reported free VRAM but allocation still fails, we start applying a backoff. This previously reduced the assumed free VRAM by an exponentially growing percentage (1%, 2%, 4%, ...). However, the points chosen tend to be too dense at the beginning and too sparse at the end, so this switches to an incremental backoff (10%, 20%, 30%, ...).
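A small sketch contrasting the two backoff schedules; the function names and the loop are illustrative rather than the real allocation code:

```go
package main

import "fmt"

// exponentialBackoff returns the old schedule: 1%, 2%, 4%, 8%, ...
func exponentialBackoff(attempt int) float64 {
	return 0.01 * float64(uint(1)<<attempt)
}

// incrementalBackoff returns the new schedule: 10%, 20%, 30%, ...
func incrementalBackoff(attempt int) float64 {
	return 0.10 * float64(attempt+1)
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("attempt %d: old %3.0f%%  new %3.0f%%\n",
			attempt, 100*exponentialBackoff(attempt), 100*incrementalBackoff(attempt))
	}
}
```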
Vinh Nguyen authored
Daniel Hiltgen authored
* DRY out the runner lifecycle code. Now that discovery uses the runners as well, this unifies the runner spawning code into a single place. This also unifies GPU discovery types with the newer ml.DeviceInfo.
* win: make incremental builds better. Place build artifacts in discrete directories so incremental builds don't have to start fresh.
* Adjust sort order to consider iGPUs.
* Handle CPU inference OOM scenarios.
* Address review comments.
Jesse Gross authored
We currently short-circuit generation of the cache mask and just generate an empty tensor of the correct size. However, in some cases this can also skip a cast operation, which can result in the worst-case graph not actually being worst case. We don't need the fast path for mask generation, so it's better to always use the normal code path.
- 22 Oct, 2025 4 commits
Jesse Gross authored
Currently, we only record the time for the last batch when processing the prompt. This results in unrealistically high numbers for the old llama runner.

Before:
total duration: 31.273112939s
load duration: 4.97054657s
prompt eval count: 32768 token(s)
prompt eval duration: 235.137439ms
prompt eval rate: 139356.80 tokens/s
eval count: 1873 token(s)
eval duration: 18.173182374s
eval rate: 103.06 tokens/s

After:
total duration: 30.024798033s
load duration: 4.758588663s
prompt eval count: 32768 token(s)
prompt eval duration: 7.779621548s
prompt eval rate: 4212.03 tokens/s
eval count: 1769 token(s)
eval duration: 17.148014223s
eval rate: 103.16 tokens/s
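A minimal sketch of the timing fix, using hypothetical names for the batch and stats types; the point is that the prompt eval duration is accumulated across every batch instead of only the last one:

```go
package runner

import "time"

type promptStats struct {
	evalCount    int
	evalDuration time.Duration
}

// processPrompt sums the elapsed time of every prompt batch. Timing only the
// final batch is what made the reported prompt eval rate unrealistically high.
func processPrompt(batches [][]int32, decode func([]int32)) promptStats {
	var stats promptStats
	for _, batch := range batches {
		start := time.Now()
		decode(batch)
		stats.evalDuration += time.Since(start)
		stats.evalCount += len(batch)
	}
	return stats
}
```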
frob authored
nicole pardal authored
Patrick Devine authored
- 20 Oct, 2025 5 commits
Jeffrey Morgan authored
Michael Yang authored
Jeffrey Morgan authored
Daniel Hiltgen authored
Users on Windows without GPUs are reporting errors related to cudaDriverGetVersion with the device set to -1. This ensures we only query the driver once we're enumerating actual devices.
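A hedged sketch of the guard, using stand-in functions in place of the real CUDA bindings; the names and signatures here are illustrative only:

```go
package discover

// Stand-ins for the real CUDA bindings; illustrative only.
func cudaGetDeviceCount() int           { return 0 }
func cudaGetDriverVersion() (int, bool) { return 0, false }

type deviceInfo struct {
	index         int
	driverVersion int
}

// enumerateCUDA queries the driver version only once at least one real
// device is known to exist, so GPU-less Windows machines never reach the
// driver call with an invalid (-1) device.
func enumerateCUDA() []deviceInfo {
	count := cudaGetDeviceCount()
	if count <= 0 {
		return nil
	}
	version, ok := cudaGetDriverVersion()
	if !ok {
		return nil
	}
	devices := make([]deviceInfo, 0, count)
	for i := 0; i < count; i++ {
		devices = append(devices, deviceInfo{index: i, driverVersion: version})
	}
	return devices
}
```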
Daniel Hiltgen authored
Some users are hitting timeouts. We'd like to make this faster, but for now make sure we don't time out too aggressively.
- 18 Oct, 2025 2 commits
Daniel Hiltgen authored
Co-authored-by: Michael Yang <git@mxy.ng>
Daniel Hiltgen authored
When loading the dynamic libraries, if something goes wrong, report some details. Unfortunately this won't explain which dependencies are missing, but this breadcrumb in the logs should help us diagnose GPU discovery failures.
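A brief sketch of the breadcrumb, assuming a hypothetical loadLibrary wrapper around the platform's dynamic loader; not the actual discovery code:

```go
package discover

import "log/slog"

// loadLibrary is a hypothetical stand-in for the dlopen/LoadLibrary wrapper.
func loadLibrary(path string) (uintptr, error) { return 0, nil }

// loadBackends logs the path and the loader's error message when a library
// fails to load. The message won't list missing dependencies, but it leaves
// a breadcrumb for diagnosing GPU discovery failures.
func loadBackends(paths []string) []uintptr {
	var handles []uintptr
	for _, path := range paths {
		h, err := loadLibrary(path)
		if err != nil {
			slog.Warn("unable to load GPU library", "path", path, "error", err)
			continue
		}
		handles = append(handles, h)
	}
	return handles
}
```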
- 17 Oct, 2025 1 commit
Daniel Hiltgen authored
* test: harden scheduler tests. This removes reschedDelay, which was stale code, and adds a new configurable timeout for waitForVRAMRecovery so tests can set the timeout to be very short, avoiding the scheduler getting stuck and hitting a test timeout.
* test: tune tests for partial loads. Give stress tests more time when the model is split between CPU/GPU.
- 16 Oct, 2025 11 commits
Daniel Hiltgen authored
8.7 is Jetpack only, so there is no need for it on x86 builds. 10.3 covers [G]B300.
Jeffrey Morgan authored
Adds a temporary global flag to renderers that causes them to always render images as [img]. In a follow-up change, we will consider making this the default, and this flag could eventually be removed.
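A minimal sketch of what such a flag could look like; the variable and function names are illustrative, not the actual renderer API:

```go
package renderers

// renderImagesAsImgTag is a temporary global switch; when enabled, every
// renderer emits the generic [img] marker instead of its model-specific
// image placeholder.
var renderImagesAsImgTag = true

func imagePlaceholder(modelSpecific string) string {
	if renderImagesAsImgTag {
		return "[img]"
	}
	return modelSpecific
}
```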
Grace authored
* Change the initial status to take prefill into consideration
* Add separate strings for the content and thinking builders
* Add thinking tests
* Remove whitespace from the string before the closing think tag
Daniel Hiltgen authored
Forward compat on the newer driver doesn't seem to be working. This should get 5.2 working on newer drivers again.
Daniel Hiltgen authored
Daniel Hiltgen authored
The workaround has been moved into the underlying C++ code. This reverts commit e4340667.
Thomas Stocker authored
* vulkan: Get FilterID from Backend for Vulkan
* Fix patch
weedge authored
zhetaicheleba authored
Devon Rifkin authored
openai: make tool call conversion fns public
Devon Rifkin authored
- 15 Oct, 2025 7 commits
Daniel Hiltgen authored
Initially, Vulkan support in Ollama will require building from source. Once it has been more thoroughly tested and we have fixed any critical bugs, we can bundle Vulkan into the official binary releases.
Daniel Hiltgen authored
Santosh Bhavani authored
* Simplify the NVML fallback for unified memory GPUs. Remove the device-specific checks and the environment variable dependency for the NVML_ERROR_NOT_SUPPORTED fallback: when NVML doesn't support memory queries, unconditionally use /proc/meminfo instead of checking device names or the OLLAMA_UNIFIED_MEMORY environment variable. This provides better memory reporting by using MemAvailable, which accounts for reclaimable memory, avoiding the underreporting issue described in NVIDIA support article a_id/5728. Tested on an NVIDIA GB10 unified memory iGPU with consistent and accurate memory reporting across multiple model load/unload cycles.
* Add an NVML fallback patch for unified memory GPUs.
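A sketch of the /proc/meminfo fallback idea in isolation; the function name is hypothetical, and the real integration sits behind the NVML error handling described above:

```go
package discover

import (
	"bufio"
	"os"
	"strconv"
	"strings"
)

// memAvailableBytes reads MemAvailable from /proc/meminfo, which includes
// reclaimable memory and so avoids the underreporting seen when counting
// only free memory.
func memAvailableBytes() (uint64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		// Line format: "MemAvailable:   12345678 kB"
		fields := strings.Fields(scanner.Text())
		if len(fields) >= 2 && fields[0] == "MemAvailable:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil
		}
	}
	return 0, scanner.Err()
}
```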
Jesse Gross authored
Jeffrey Morgan authored
Parth Sareen authored
Jesse Gross authored
Currently, if you set num_gpu, the model is forced to load with that number of layers in the current configuration. This is done regardless of any other information, which means that no eviction is performed even if another model is loaded. This behavior differs from the old estimates (and still happens for models that run on the llama engine): in those cases, models would be evicted if needed to load at the requested number of layers. That behavior is more useful and less surprising, so this changes the new estimates to match. Fixes #12580
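A hedged sketch of the decision this describes, with illustrative types and names rather than the real scheduler code: honoring an explicit num_gpu setting may require evicting already-loaded models when doing so would let the request fit.

```go
package scheduler

type loadRequest struct {
	requiredVRAM uint64 // estimate for the explicitly requested num_gpu layers
}

type loadedModel struct {
	vramUsed uint64
}

// shouldEvict reports whether loaded models need to be evicted to honor an
// explicit num_gpu setting, rather than loading without eviction regardless
// of what else is resident.
func shouldEvict(req loadRequest, loaded []loadedModel, freeVRAM uint64) bool {
	if req.requiredVRAM <= freeVRAM {
		return false // fits alongside what's already loaded
	}
	var reclaimable uint64
	for _, m := range loaded {
		reclaimable += m.vramUsed
	}
	// Evict only if doing so could actually make the request fit.
	return req.requiredVRAM <= freeVRAM+reclaimable
}
```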
- 14 Oct, 2025 3 commits
Devon Rifkin authored
qwen3-coder: support anyOf when parsing tool calls
Devon Rifkin authored
Daniel Hiltgen authored
On the llama runner, after the recent GGML bump, a new log line incorrectly reports 0 MiB free because of our patch that removes memory from the device props. This adjusts the llama.cpp code to fetch the actual free memory of the active device.