- 31 Oct, 2025 3 commits
-
-
Daniel Hiltgen authored
In CPU-only setups, LibOllamaPath was omitted, causing the ggml-cpu-XXX libraries not to be loaded during inference.
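A minimal sketch of the idea, with hypothetical names (`libOllamaPath`, `gpuLibDirs` are illustrative, not the actual variables): the base library directory is always included in the search path, even when no GPU-specific directories are present.

```go
// Hedged sketch: always seed the runner's library search path with the base
// ollama library directory so ggml-cpu-* backends remain discoverable in
// CPU-only setups where the GPU library list is empty.
package main

import (
	"fmt"
	"path/filepath"
)

func libraryDirs(libOllamaPath string, gpuLibDirs []string) []string {
	dirs := []string{libOllamaPath} // never omit the base path
	for _, d := range gpuLibDirs {
		dirs = append(dirs, filepath.Join(libOllamaPath, d))
	}
	return dirs
}

func main() {
	fmt.Println(libraryDirs("/usr/lib/ollama", nil)) // CPU-only: still non-empty
}
```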
-
Daniel Hiltgen authored
This will help bubble up more crash errors
-
nicole pardal authored
This PR removes a redundant test from TestAPIEmbeddings. The contents of this test already exist in embed_test.go and model_arch_test.go.
-
- 30 Oct, 2025 11 commits
-
-
Daniel Hiltgen authored
On Windows, AMD IDs are numeric and can reorder based on the filter environment variable. By passing the filter env into a full discovery refresh, we only look at the actual devices and ignore unsupported iGPUs. Without this, on some systems iGPU VRAM was incorrectly being used to populate the dGPU.
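A small sketch of the filtering step, under the assumption that the filter env holds comma-separated numeric device IDs (the struct and function names here are illustrative, not ollama's discovery code): devices excluded by the filter never contribute their VRAM to the reported dGPUs.

```go
// Illustrative only: filter discovered devices by a comma-separated ID list.
package main

import (
	"fmt"
	"strings"
)

type device struct {
	id   string // numeric on Windows AMD, so order-dependent
	vram uint64
}

// filterDevices keeps only devices whose IDs appear in the filter env value.
// An empty filter keeps everything.
func filterDevices(devs []device, filterEnv string) []device {
	if filterEnv == "" {
		return devs
	}
	allowed := map[string]bool{}
	for _, id := range strings.Split(filterEnv, ",") {
		allowed[strings.TrimSpace(id)] = true
	}
	var out []device
	for _, d := range devs {
		if allowed[d.id] {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	devs := []device{{"0", 16 << 30}, {"1", 2 << 30} /* unsupported iGPU */}
	fmt.Println(filterDevices(devs, "0")) // iGPU VRAM is never counted
}
```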
-
Jesse Gross authored
When a model is partially offloaded to system RAM, we can either do the calculations on the CPU or temporarily transfer the data to the GPU and do them there. Small batches tend to be better on the CPU, large batches on the GPU. The llamarunner used the GPU in most cases and the ollamarunner used the CPU. Although the ollamarunner saw an improvement in token generation performance, there was a large performance hit in prompt processing (3-10x). There is an existing heuristic to dynamically switch between these two modes, but in practice it doesn't have enough information to make that decision accurately. This adds authoritative data so the check can make the right choice, getting the best of both worlds. Fixes #12037
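A minimal sketch of the decision being described, assuming an illustrative threshold (the real heuristic uses authoritative data from the runner, not a hard-coded cutoff): token generation stays on the CPU, large prompt-processing batches go to the GPU.

```go
// Sketch only, not the actual runner code.
package main

import "fmt"

const gpuBatchThreshold = 32 // illustrative cutoff, not the real heuristic

// useGPUForHostLayers reports whether it is worth transferring host-resident
// weights to the GPU for this batch.
func useGPUForHostLayers(batchSize int, fullyOffloaded bool) bool {
	if fullyOffloaded {
		return true // everything is already on the GPU
	}
	// Token generation (batch of 1) stays on the CPU; prompt processing
	// (large batches) benefits from the GPU despite the transfer cost.
	return batchSize >= gpuBatchThreshold
}

func main() {
	fmt.Println(useGPUForHostLayers(1, false))   // generation: CPU
	fmt.Println(useGPUForHostLayers(512, false)) // prompt processing: GPU
}
```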
-
Jesse Gross authored
We currently size the worst-case batch allocation for maximum-sized batches, which corresponds to prompt processing. However, in some cases the generated graph is different for small and large batches. To ensure that we don't need to allocate memory later, after layout has taken place, we should run the worst-case batch both ways and take the larger amount of memory. This does not noticeably affect loading speed, as the most expensive part of this logic comes from image processing, which does not occur during token generation.
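A sketch of "measure both ways and take the larger", with hypothetical function names (the real measurement comes from the backend's graph planner):

```go
// Illustrative only: reserve enough memory for whichever worst-case graph
// (max-sized batch or single-token batch) is larger.
package main

import "fmt"

type graphSize struct{ bytes uint64 }

// measureWorstCase stands in for building (but not executing) the compute
// graph for the given batch size and returning its memory requirement.
func measureWorstCase(batchSize int) graphSize {
	if batchSize > 1 {
		return graphSize{512 << 20}
	}
	return graphSize{640 << 20} // some graphs are larger for small batches
}

func reserveBytes(maxBatch int) uint64 {
	large := measureWorstCase(maxBatch).bytes
	small := measureWorstCase(1).bytes
	if small > large {
		return small
	}
	return large
}

func main() { fmt.Println(reserveBytes(512)) }
```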
-
Daniel Hiltgen authored
Windows gets confused when we try to hand the stderr file descriptor to subprocess children. This ensures the log output always shows up.
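A hedged sketch of the general technique (not the actual server code): rather than sharing our stderr handle with the child, read the child's stderr through a pipe and forward it to the log ourselves.

```go
// Sketch: drain the child's stderr via a pipe and log it, instead of passing
// our own stderr handle to the subprocess.
package main

import (
	"bufio"
	"log"
	"os/exec"
)

func runWithLoggedStderr(name string, args ...string) error {
	cmd := exec.Command(name, args...)
	stderr, err := cmd.StderrPipe()
	if err != nil {
		return err
	}
	if err := cmd.Start(); err != nil {
		return err
	}
	done := make(chan struct{})
	go func() {
		defer close(done)
		sc := bufio.NewScanner(stderr)
		for sc.Scan() {
			log.Println("runner:", sc.Text())
		}
	}()
	<-done // finish draining stderr before Wait closes the pipe
	return cmd.Wait()
}

func main() {
	_ = runWithLoggedStderr("ollama", "--version")
}
```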
-
Patrick Devine authored
-
Michael Yang authored
* ml(ggml): mrope
* interleave mrope
-
Michael Yang authored
-
Michael Yang authored
This change fixes two bugs with `ollama rm`:
1. Before a model is removed, it is first stopped. Previously this only happened for the first argument and was skipped for all other models.
2. Models were unloaded indiscriminately. This errors for cloud models, so unloading is now skipped for them.
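A sketch of the corrected flow (the helper names are illustrative, not the actual CLI code): stop every model named on the command line, not just the first, and skip the unload step where it would error.

```go
// Illustrative only: stop each non-cloud model before deleting it.
package main

import (
	"fmt"
	"strings"
)

func isCloudModel(name string) bool {
	return strings.HasSuffix(name, "-cloud") // stand-in check
}

func stopModel(name string) error   { fmt.Println("stop", name); return nil }
func deleteModel(name string) error { fmt.Println("delete", name); return nil }

func removeModels(names []string) error {
	for _, name := range names { // previously only names[0] was stopped
		if !isCloudModel(name) { // unloading a cloud model errors, so skip it
			if err := stopModel(name); err != nil {
				return err
			}
		}
		if err := deleteModel(name); err != nil {
			return err
		}
	}
	return nil
}

func main() { _ = removeModels([]string{"llama3", "qwen3-cloud"}) }
```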
-
Michael Yang authored
This change fixes images with an alpha channel by overlaying the image onto a white background.
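A standalone illustration of the technique using the standard library (the actual change lives in ollama's image handling code; this just shows the compositing step):

```go
// Composite an image with transparency onto an opaque white canvas so
// transparent regions become white rather than black when alpha is dropped.
package main

import (
	"image"
	"image/color"
	"image/draw"
)

func flattenAlpha(img image.Image) *image.RGBA {
	bounds := img.Bounds()
	out := image.NewRGBA(bounds)
	// Fill with white first...
	draw.Draw(out, bounds, image.NewUniform(color.White), image.Point{}, draw.Src)
	// ...then composite the source image over it.
	draw.Draw(out, bounds, img, bounds.Min, draw.Over)
	return out
}

func main() {
	src := image.NewNRGBA(image.Rect(0, 0, 8, 8)) // fully transparent test image
	_ = flattenAlpha(src)
}
```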
-
Michael Yang authored
* mulmat
* permute
-
Athiban Sharon authored
Fixed broken docs links
-
- 29 Oct, 2025 8 commits
-
-
Grace authored
Eats extra whitespace at the end/beginning of content
-
Daniel Hiltgen authored
This should reduce zombie processes during integration runs.
-
Patrick Devine authored
-
Michael Yang authored
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
Michael Yang authored
-
- 28 Oct, 2025 12 commits
-
-
Patrick Devine authored
-
Parth Sareen authored
-
Daniel Hiltgen authored
* Fix Vulkan PCI ID and ID handling. Intel GPUs may not report PCI IDs, which was leading to incorrect overlap detection, so switch to using the existing PCI IDs. AMD GPUs claim not to report PCI IDs but actually do, so try anyway; this is required for ADLX to find the GPUs on Windows. Numeric IDs lead to scheduling problems, so this also switches Vulkan to use UUID-based IDs. The GPU discovery patches have been squashed into a single patch to simplify future rebases.
* Review comments
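A hedged sketch of the ID preference described above (the fields and fallbacks are illustrative, not the actual discovery code): prefer a stable UUID when the driver reports one, fall back to the PCI bus ID, and use a bare numeric index only as a last resort since it can reorder.

```go
// Illustrative only: choose the most stable available identifier for a device.
package main

import "fmt"

type vulkanDevice struct {
	uuid  string
	pciID string
	index int
}

func deviceID(d vulkanDevice) string {
	switch {
	case d.uuid != "":
		return d.uuid // stable across runs and filter env changes
	case d.pciID != "":
		return d.pciID // many GPUs report this even when they claim otherwise
	default:
		return fmt.Sprintf("%d", d.index) // numeric IDs can reorder; last resort
	}
}

func main() {
	fmt.Println(deviceID(vulkanDevice{uuid: "GPU-1234"}))
	fmt.Println(deviceID(vulkanDevice{pciID: "0000:03:00.0", index: 1}))
}
```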
-
Patrick Devine authored
This reverts commit 5d347f6d.
-
Parth Sareen authored
-
Parth Sareen authored
-
Parth Sareen authored
This reverts commit 934dd9e1.
-
Parth Sareen authored
-
Michael Yang authored
-
nicole pardal authored
-
Devon Rifkin authored
create: inherit FROM model's renderer/parser
-
Michael Yang authored
-
- 27 Oct, 2025 2 commits
-
-
Devon Rifkin authored
On main, the `RENDERER` and `PARSER` fields from the `Modelfile` don't get propagated to a new model created with a `req.From` parameter. This is easily triggered via `ollama run qwen3-coder` and then running a save command like `/save qwen3-coder-custom`. This adds a regression test and opens the config for the "from" model in order to use its renderer/parser as defaults for the new model, which fixes both the CLI and API-based creates. Fixes: https://github.com/ollama/ollama/issues/12792
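A sketch of the defaulting behavior, with illustrative struct and field names rather than ollama's actual config types:

```go
// Illustrative only: fall back to the FROM model's renderer/parser when the
// new Modelfile doesn't set them explicitly.
package main

import "fmt"

type modelConfig struct {
	Renderer string
	Parser   string
}

func applyFromDefaults(newCfg, fromCfg modelConfig) modelConfig {
	if newCfg.Renderer == "" {
		newCfg.Renderer = fromCfg.Renderer
	}
	if newCfg.Parser == "" {
		newCfg.Parser = fromCfg.Parser
	}
	return newCfg
}

func main() {
	base := modelConfig{Renderer: "qwen3-coder", Parser: "qwen3-coder"}
	fmt.Println(applyFromDefaults(modelConfig{}, base)) // inherits both fields
}
```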
-
nicole pardal authored
Currently, checking the length of prompts for embeddings to ensure they fit in the context window (and possibly truncating them) occurs in two places: the Ollama server and the runner. This can lead to inconsistencies in both the checks and the reported number of tokens processed. Since we have to do this processing in the runner anyway, this consolidates all of the logic there.
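A minimal sketch of the consolidated check, using a stand-in tokenizer and hypothetical helper names: the runner tokenizes, truncates to the context window when allowed, and is the single source of truth for the processed token count.

```go
// Illustrative only, not the actual runner code.
package main

import (
	"errors"
	"fmt"
	"strings"
)

func tokenize(prompt string) []string { return strings.Fields(prompt) } // stand-in tokenizer

func prepareEmbeddingInput(prompt string, numCtx int, truncate bool) ([]string, int, error) {
	tokens := tokenize(prompt)
	if len(tokens) > numCtx {
		if !truncate {
			return nil, 0, errors.New("input exceeds maximum context length")
		}
		tokens = tokens[:numCtx]
	}
	// The runner reports the token count it actually processed.
	return tokens, len(tokens), nil
}

func main() {
	_, n, _ := prepareEmbeddingInput("a b c d e", 3, true)
	fmt.Println(n) // 3
}
```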
-
- 25 Oct, 2025 1 commit
-
-
Patrick Devine authored
-
- 23 Oct, 2025 3 commits
-
-
Jesse Gross authored
If we create a memory layout that should fit based on reported free VRAM but allocation still fails, we start applying a backoff. This reduces the assumed free VRAM by an exponential percentage (1%, 2%, 4%, ...). However, the points chosen tend to be too dense at the beginning and too sparse at the end. Therefore, this switches to an incremental backoff (10%, 20%, 30%, ...).
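A small illustration of the two schedules (functions are illustrative, not the scheduler's code): the old exponential percentages bunch up near zero, while the incremental steps spread evenly across the range.

```go
// Compare the old exponential backoff with the new incremental backoff.
package main

import "fmt"

func exponentialBackoff(attempt int) float64 { // 1%, 2%, 4%, 8%, ...
	return 0.01 * float64(1<<attempt)
}

func incrementalBackoff(attempt int) float64 { // 10%, 20%, 30%, ...
	return 0.10 * float64(attempt+1)
}

func main() {
	for i := 0; i < 5; i++ {
		fmt.Printf("attempt %d: exponential %.0f%%  incremental %.0f%%\n",
			i, exponentialBackoff(i)*100, incrementalBackoff(i)*100)
	}
}
```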
-
Vinh Nguyen authored
-
Daniel Hiltgen authored
* DRY out the runner lifecycle code. Now that discovery uses the runners as well, this unifies the runner spawning code into a single place. This also unifies GPU discovery types with the newer ml.DeviceInfo.
* win: make incremental builds better. Place build artifacts in discrete directories so incremental builds don't have to start fresh.
* Adjust sort order to consider iGPUs
* Handle CPU inference OOM scenarios
* Review comments
-