  1. 06 Nov, 2025 5 commits
  2. 05 Nov, 2025 9 commits
  3. 04 Nov, 2025 5 commits
    • discovery: only retry AMD GPUs (#12894) · 27f1fde4
      Daniel Hiltgen authored
      * discovery: only retry AMD GPUs
      
      CUDA and Vulkan don't crash on unsupported devices, so retry isn't necessary.
      This also refactors the code to shift the Library specific logic into the ml
      package.
      
      * review comments
    • vulkan: Add memory detection for Intel GPU using DXGI+PDH (#12664) · 220e133f
      virajwad authored
      * PDH free memory skeleton
      
      * Add PDH printing
      
      * Add LUID support for Vulkan
      
      * wire luid from ggml-vulkan to mem-dxgi-pdh file
      
      * Fix to ggml-impl
      
      * Continue skeleton
      
      * Implemented ggml_dxgi_pdh_get_device_memory
      
      * fix comments
      
      * Fix - change value GB to bytes
      
      * add ifdefs to only support windows and not linux
      
      * modify error codes
      
      * Finished ggml_dxgi_pdh_init() function
      
      * completed ggml_dxgi_pdh_release()
      
      * Formatting changes, add static to functions
      
      * fix build errors
      
      * fix go build error
      
      * fix luid - now should match between dxgi and vulkan
      
      * Fix the free memory reporting (was using copy by value, change to reference)
      
      * keep only dxgi1_2.h
      
      * Modifications based on PR feedback
      
      * fix merge conflicts (2) and fix desc1.description printout
      
      * move dxgi + pdh api calls to before the vendor specific library calls
      
      * change from 3 samples to 1 sample for PDH
      
      * modify when old_mode is set
      
      * add fix for building macOS
      
      * fix release and returns for other vendors
      
      * add patch file
    • app: add code for macOS and Windows apps under 'app' (#12933) · d3b4b997
      Daniel Hiltgen authored
      
      * app: add code for macOS and Windows apps under 'app'
      
      * app: add readme
      
      * app: windows and linux only for now
      
      * ci: fix ui CI validation
      
      ---------
      Co-authored-by: jmorganca <jmorganca@gmail.com>
    • vulkan: enable flash attention (#12937) · a4770107
      Daniel Hiltgen authored
      Also adjusts the Vulkan Windows build pattern to match recent changes in other
      backends, so incremental builds are faster.
    • ggml: Increase maximum graph size · ef549d51
      Jesse Gross authored
      The initial implementation of qwen3-vl:235b exceeded the maximum graph
      size based on the number of tensors. Although this was later fixed
      through the use of the mrope operation, we are close to the limit in
      some cases. This updates the limit to track current llama.cpp usage of GGML.
  4. 03 Nov, 2025 3 commits
  5. 02 Nov, 2025 1 commit
  6. 31 Oct, 2025 4 commits
  7. 30 Oct, 2025 11 commits
    • win: avoid ID mixups on refresh (#12869) · db973c8f
      Daniel Hiltgen authored
      On Windows, AMD IDs are numeric and can reorder based on the filter environment.
      By passing the filter env on a full discovery refresh, we only look at the actual
      devices and ignore unsupported iGPUs. Without this, on some systems iGPU VRAM was
      incorrectly used to populate the dGPU.
    • ggml: Enable op_offload to improve partial offload performance · afaf7ce8
      Jesse Gross authored
      When a model is partially offloaded to system RAM, we can either
      do the calculations on the CPU or we can temporarily transfer the
      data to the GPU to do the calculations there. Small batches tend
      to be better on the CPU, large batches on the GPU.
      
      The llamarunner used the GPU in most cases and the ollamarunner
      used the CPU. Although the ollamarunner saw an improvement in
      token generation performance, there was a large performance hit
      in prompt processing (3-10x).
      
      There is an existing heuristic to dynamically switch between these
      two modes but in practice it doesn't have enough information to
      accurately make that decision. This adds authoritative data to make
      the check work to get the best of both worlds.
      
      Fixes #12037
    • ollamarunner: Worst case batch for token generation · 26465fb8
      Jesse Gross authored
      We currently allocate the worst case batch for max sized
      batches, which corresponds to prompt processing. However,
      there are some cases where the generated graph is different
      for small and large batches. To ensure that we don't need
      to allocate memory later after layout has taken place, we
      should run the worst case batch both ways and take the larger
      amount of memory.
      
      This does not noticeably affect loading speed as the most expensive
      part of this logic is from image processing and that does not
      occur during token generation.
    • win: use copy for subprocess logs (#12864) · 88236bc0
      Daniel Hiltgen authored
      Windows gets confused when we hand the stderr file descriptor to subprocess
      children. This ensures the log output always shows up.
    • Patrick Devine authored · 76eb7d0f
    • interleaved mrope (#12807) · f67a6df1
      Michael Yang authored
      * ml(ggml): mrope
      * interleave mrope
    • Michael Yang authored · 75e75d9a
    • fix(cmd): unload model before removal (#12832) · ed78e127
      Michael Yang authored
      this change fixes two bugs with `ollama rm`:
      
      1. before a model is removed, it is first stopped; previously this only
         happened for the first argument and was skipped for all other models
      2. models were unloaded indiscriminately; this errors for cloud models
         and should be omitted
    • fix: qwen2.5vl, qwen3vl composite image (#12841) · d432ade7
      Michael Yang authored
      this change fixes images with an alpha channel by overlaying the image
      onto a white background
    • tests: add tests and docs for commonly used ops (#12844) · 06b3422d
      Michael Yang authored
      * mulmat
      * permute
    • Update README.md (#12822) · cbe1cf06
      Athiban Sharon authored
      Fixed broken docs links
  8. 29 Oct, 2025 2 commits