- 04 Nov, 2025 4 commits
-
virajwad authored
* PDH free memory skeleton
* Add PDH printing
* Add LUID support for Vulkan
* wire luid from ggml-vulkan to mem-dxgi-pdh file
* Fix to ggml-impl
* Continue skeleton
* Implemented ggml_dxgi_pdh_get_device_memory
* fix comments
* Fix - change value GB to bytes
* add ifdefs to only support windows and not linux
* modify error codes
* Finished ggml_dxgi_pdh_init() function
* completed ggml_dxgi_pdh_release()
* Formatting changes, add static to functions
* fix build errors
* fix go build error
* fix luid - now should match between dxgi and vulkan
* Fix the free memory reporting (was using copy by value, change to reference)
* keep only dxgi1_2.h
* Modifications based on PR feedback
* fix merge conflicts (2) and fix desc1.description printout
* move dxgi + pdh api calls to before the vendor specific library calls
* change from 3 samples to 1 sample for PDH
* modify when old_mode is set
* add fix for building MacOS
* fix release and returns for other vendors
* add patch file
-
Daniel Hiltgen authored
* app: add code for macOS and Windows apps under 'app'
* app: add readme
* app: windows and linux only for now
* ci: fix ui CI validation

Co-authored-by: jmorganca <jmorganca@gmail.com>
-
Daniel Hiltgen authored
Also adjusts the vulkan windows build pattern to match recent changes in other backends so incremental builds are faster.
-
Jesse Gross authored
The initial implementation of qwen3-vl:235b exceeded the maximum graph size based on the number of tensors. Although this was later fixed through the use of the mrope operation, we are close to the limit in some cases. This updates the limit to track the current llama.cpp usage of GGML.
-
- 03 Nov, 2025 3 commits
-
Rajath Bail authored
-
Michael Yang authored
-
Ryan Coleman authored
-
- 02 Nov, 2025 1 commit
-
Attogram Project authored
-
- 31 Oct, 2025 4 commits
-
Jesse Gross authored
We pass invalid pointers when we check the size of the required compute graph before fitting. Some CUDA APIs validate these pointers, but we can simply skip them during this phase. cudaMemsetAsync is one call that we weren't skipping; previously we never took the code path that used it, but now that op_offload is enabled, we can hit it in memory-pressured situations.
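A conceptual sketch of the skip-during-sizing idea, written in Go even though the real code lives in C/CUDA inside ggml. The `ctx` type and `memsetAsync` name are illustrative, not the actual API: the point is that while the graph is only being measured, buffer pointers are placeholders, so calls that would validate or dereference them must be bypassed.

```go
package main

import "fmt"

// ctx tracks whether we are in the measurement pass, where buffer
// pointers are invalid placeholders used only to size the graph.
type ctx struct{ measuring bool }

// memsetAsync stands in for a backend call like cudaMemsetAsync that
// validates its pointer. During measurement it must be skipped entirely.
func memsetAsync(c *ctx, buf []byte, v byte) {
	if c.measuring {
		return // pointers are invalid during graph sizing; skip the call
	}
	for i := range buf {
		buf[i] = v
	}
}

func main() {
	buf := make([]byte, 4)
	memsetAsync(&ctx{measuring: true}, buf, 0xFF) // sizing pass: no write
	fmt.Println(buf[0])
	memsetAsync(&ctx{measuring: false}, buf, 0xFF) // real pass: writes
	fmt.Println(buf[0])
}
```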
-
Daniel Hiltgen authored
In CPU-only setups, the LibOllamaPath was omitted, causing us not to load the ggml-cpu-XXX libraries during inference.
-
Daniel Hiltgen authored
This will help bubble up more crash errors
-
nicole pardal authored
This PR removes a redundant test from TestAPIEmbeddings. The contents of this test already exist in embed_test.go and model_arch_test.go.
-
- 30 Oct, 2025 11 commits
-
Daniel Hiltgen authored
On Windows, AMD IDs are numeric and can reorder based on the filter environment. By passing the filter environment into a full discovery refresh, we only look at the actual devices and ignore unsupported iGPUs. Without this, on some systems iGPU VRAM was incorrectly being used to populate the dGPU.
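A minimal Go sketch of applying a device-filter environment variable during discovery. The function name is hypothetical, and a comma-separated ID list (in the style of ROCR_VISIBLE_DEVICES) is assumed as the filter format; the idea is that only IDs in the set are enumerated, so an unsupported iGPU never contributes its VRAM to a dGPU entry.

```go
package main

import (
	"fmt"
	"strings"
)

// visibleIDs parses a comma-separated device-filter value (hypothetical
// helper, assuming a ROCR_VISIBLE_DEVICES-style format) into a set of
// IDs that discovery is allowed to enumerate.
func visibleIDs(filterEnv string) map[string]bool {
	ids := map[string]bool{}
	for _, id := range strings.Split(filterEnv, ",") {
		if id = strings.TrimSpace(id); id != "" {
			ids[id] = true
		}
	}
	return ids
}

func main() {
	vis := visibleIDs("0, 2")
	// Device "1" (e.g. an unsupported iGPU) is filtered out of discovery.
	fmt.Println(vis["0"], vis["1"], vis["2"])
}
```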
-
Jesse Gross authored
When a model is partially offloaded to system RAM, we can either do the calculations on the CPU or temporarily transfer the data to the GPU and do them there. Small batches tend to be better on the CPU, large batches on the GPU. The llamarunner used the GPU in most cases and the ollamarunner used the CPU. Although the ollamarunner saw an improvement in token generation performance, there was a large performance hit in prompt processing (3-10x).

There is an existing heuristic to dynamically switch between these two modes, but in practice it doesn't have enough information to make that decision accurately. This adds authoritative data to make the check work, getting the best of both worlds.

Fixes #12037
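The trade-off described above can be sketched as a simple batch-size check. This is an illustrative Go sketch, not Ollama's actual heuristic: the function name and threshold are assumptions, and the real decision uses richer data than batch size alone.

```go
package main

import "fmt"

// preferGPU decides where to run computation for weights held in system
// RAM: small batches avoid the transfer cost and stay on the CPU, large
// batches amortize it and benefit from GPU parallelism. Hypothetical
// helper; the threshold value is illustrative.
func preferGPU(batchSize, threshold int) bool {
	return batchSize >= threshold
}

func main() {
	// Token generation processes one token at a time: stay on the CPU.
	fmt.Println(preferGPU(1, 32))
	// Prompt processing handles large batches: move to the GPU.
	fmt.Println(preferGPU(512, 32))
}
```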
-
Jesse Gross authored
We currently allocate for the worst case assuming a maximum-sized batch, which corresponds to prompt processing. However, in some cases the generated graph differs between small and large batches. To ensure that we don't need to allocate memory later, after layout has taken place, we should run the worst case batch both ways and reserve the larger amount of memory. This does not noticeably affect loading speed, as the most expensive part of this logic is image processing, which does not occur during token generation.
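The reservation logic above amounts to measuring the graph under each batch shape and keeping the maximum. A hedged Go sketch, where `graphSize` is a stand-in for the real graph measurement and its numbers are purely illustrative:

```go
package main

import "fmt"

// graphSize simulates measuring the compute-graph footprint for a given
// batch size. Illustrative only: the small-batch layout can differ from
// the large-batch one and may even need more memory.
func graphSize(batch int) int {
	if batch > 1 {
		return 4096 + batch*8 // large-batch (prompt processing) layout
	}
	return 6144 // small-batch (token generation) layout
}

// worstCase measures every candidate batch shape and reserves the
// largest footprint, so no allocation is needed after layout.
func worstCase(batches ...int) int {
	max := 0
	for _, b := range batches {
		if s := graphSize(b); s > max {
			max = s
		}
	}
	return max
}

func main() {
	fmt.Println(worstCase(512, 1)) // reserve the max over both layouts
}
```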
-
Daniel Hiltgen authored
Windows gets confused when we try to hand the stderr file descriptor to subprocess children. This ensures the log output always shows up.
-
Patrick Devine authored
-
Michael Yang authored
* ml(ggml): mrope
* interleave mrope
-
Michael Yang authored
-
Michael Yang authored
This change fixes two bugs with `ollama rm`:

1. Before a model is removed, it is first stopped. Previously this only happened for the first argument and was skipped for all other models.
2. Models were unloaded indiscriminately. This errors for cloud models, so unloading is now skipped for them.
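A minimal Go sketch of the corrected `ollama rm` flow described above. The `model` type, `removeAll` helper, and the string action log are all hypothetical; the sketch only shows the two fixes: every model (not just the first) is stopped before removal, and cloud models are not stopped/unloaded locally.

```go
package main

import "fmt"

// model is a hypothetical stand-in for a removable model reference.
type model struct {
	name  string
	cloud bool
}

// removeAll stops each local model before removing it, and skips the
// stop/unload step for cloud models, where it would error.
func removeAll(models []model) []string {
	var actions []string
	for _, m := range models {
		if !m.cloud {
			actions = append(actions, "stop "+m.name)
		}
		actions = append(actions, "rm "+m.name)
	}
	return actions
}

func main() {
	fmt.Println(removeAll([]model{
		{name: "llama3", cloud: false},
		{name: "some-cloud-model", cloud: true},
	}))
}
```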
-
Michael Yang authored
this change fixes images with an alpha channel by overlaying the image onto a white background
-
Michael Yang authored
* mulmat
* permute
-
Athiban Sharon authored
Fixed broken docs links
-
- 29 Oct, 2025 8 commits
-
Grace authored
Trims extra whitespace at the beginning/end of content.
-
Daniel Hiltgen authored
This should reduce zombie processes during integration runs.
-
Patrick Devine authored
-
Michael Yang authored
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
Michael Yang authored
-
- 28 Oct, 2025 9 commits
-
Patrick Devine authored
-
Parth Sareen authored
-
Daniel Hiltgen authored
* Fix vulkan PCI ID and ID handling

  Intel GPUs may not report PCI IDs, which led to incorrect overlap detection. Switch to using the existing PCI IDs; AMD GPUs claim not to report PCI IDs but actually do, so try anyway, since this is required for ADLX to find the GPUs on Windows. Numeric IDs lead to scheduling problems, so this also switches Vulkan to use UUID-based IDs. The GPU discovery patches have been squashed into a single patch to simplify future rebases.

* review comments
-
Patrick Devine authored
This reverts commit 5d347f6d.
-
Parth Sareen authored
-
Parth Sareen authored
-
Parth Sareen authored
This reverts commit 934dd9e1.
-
Parth Sareen authored
-
Michael Yang authored
-