- 08 Dec, 2025 1 commit
-
-
Michael Yang authored
Change to a flatter directory structure and group the options with the function; update models to call rope in one place.
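As a rough illustration of the refactor (names and layout are hypothetical, not the repo's actual rope package), grouping the options next to the function lets every model route its RoPE call through one place:

```go
// Minimal sketch of the "options grouped with the function" pattern.
package rope

type Options struct {
	Dim   int     // rotary dimension
	Base  float32 // theta base
	Scale float32 // context-extension scale
}

type Option func(*Options)

func WithBase(base float32) Option   { return func(o *Options) { o.Base = base } }
func WithScale(scale float32) Option { return func(o *Options) { o.Scale = scale } }

// Apply collects per-model settings so every model's RoPE call goes through one place.
func Apply(dim int, opts ...Option) Options {
	o := Options{Dim: dim, Base: 10000, Scale: 1}
	for _, opt := range opts {
		opt(&o)
	}
	return o
}
```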
-
- 06 Dec, 2025 1 commit
-
-
Daniel Hiltgen authored
Follow-up from #12992: free all streams, and keep the alloc logic aligned across streams.
-
- 04 Dec, 2025 3 commits
-
-
Jesse Gross authored
Although the vision component of multimodal models typically already calls the optimized nn.Attention, it gets converted into non-fused operations. That is because the backend-specific fused kernels may have requirements, such as padding, which are normally handled by the cache, and vision encoders don't use the cache. This implements a fallback path in the backend, softening the requirements into optimizations. In turn, this allows flash attention to be used for vision encoders, saving a significant amount of VRAM and improving performance.
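A minimal sketch of the idea, with hypothetical types rather than the actual nn.Attention/backend API: the fused kernel's padding requirement becomes an optimization, and callers without a cache either pad locally or fall back to unfused ops:

```go
package attention

type kernel struct {
	flashAttention bool
	kvPadding      int // preferred KV-length multiple for the fused kernel
}

// useFused reports whether the fused (flash attention) path can be taken for a
// sequence of kvLen rows, and how many rows to pad to if so.
func (k kernel) useFused(kvLen int, canPad bool) (padded int, ok bool) {
	if !k.flashAttention {
		return kvLen, false
	}
	if k.kvPadding <= 1 || kvLen%k.kvPadding == 0 {
		return kvLen, true
	}
	if canPad { // e.g. a vision encoder with no KV cache: pad locally
		return ((kvLen + k.kvPadding - 1) / k.kvPadding) * k.kvPadding, true
	}
	return kvLen, false // softened requirement: fall back to unfused ops
}
```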
-
Jesse Gross authored
We currently use a cache padding of 32 when not using flash attention and 256 with flash attention, based on the historic alignment requirements of these kernels. The restrictions have since been loosened, but there are still performance benefits, such as better CUDA graph reuse. Since the requirement is no longer kernel-specific, set the padding uniformly to 256, matching llama.cpp.
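A small sketch of the uniform padding rule (the constant name is illustrative, not necessarily the one used in the code base):

```go
package cache

const kvCachePadding = 256 // previously 32 without flash attention, 256 with

// paddedLength rounds the active KV length up to the padding multiple so the
// same graph shapes (and therefore CUDA graphs) can be reused across steps.
func paddedLength(n int) int {
	return ((n + kvCachePadding - 1) / kvCachePadding) * kvCachePadding
}
```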
-
Daniel Hiltgen authored
* Revert "vulkan: temporary cary of vulkan fixes (#12971)" This reverts commit 3a9e8e9f. * ggml update to b7087 * fix argsort on metal * update to b7108 * fix bakllava regression This model lacks the metadata for the projector type. * update to b7209 * fix TopK perf * only build arm code on arm
-
- 03 Dec, 2025 1 commit
-
-
Daniel Hiltgen authored
We now do a deeper probe of CUDA devices to verify the library version has the correct compute capability coverage for the device. Because ROCm also interprets the CUDA env var to filter AMD devices, we try to avoid setting it, since doing so leads to problems in mixed-vendor systems. However, without setting it for this deeper probe, each CUDA library subprocess discovers all CUDA GPUs, and on systems with lots of GPUs this can lead to hitting timeouts. The fix is to turn on the CUDA visibility env var just for this deeper-probe use case.
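A sketch of how the env var can be scoped to just the probe subprocess, assuming a hypothetical probe helper and binary path:

```go
package discover

import (
	"fmt"
	"os"
	"os/exec"
)

// probeDevice runs the deeper CUDA probe for a single device index, limiting
// the subprocess to that device so it doesn't enumerate every GPU. The parent
// process (and ROCm discovery) never sees the variable.
func probeDevice(runnerPath string, index int) error {
	cmd := exec.Command(runnerPath, "--probe")
	cmd.Env = append(os.Environ(), fmt.Sprintf("CUDA_VISIBLE_DEVICES=%d", index))
	return cmd.Run()
}
```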
-
- 02 Dec, 2025 1 commit
-
-
Daniel Hiltgen authored
-
- 19 Nov, 2025 5 commits
-
-
Jesse Gross authored
We currently copy data into the KV cache in contiguous buffers using ggml_cpy(). ggml_set_rows() was introduced to allow scatter operations, so contiguous buffers are no longer required. The primary direct benefit of this is that we no longer need to perform defragmentation. However, GGML recently removed an optimization for ggml_cpy(), and we picked that up in 544b6739 "ggml update to b6840 (#12791)". This caused a roughly 40% drop in token generation performance on CUDA due to CUDA graphs no longer being used. By switching to ggml_set_rows(), the original optimization is no longer necessary and CUDA performance is restored. Fixes #13112
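A conceptual sketch of why scatter writes remove the need for defragmentation, with plain Go slices standing in for ggml_cpy vs. ggml_set_rows:

```go
package kvcache

// setRows writes each batch row to the cache slot named by rowIdx, the way a
// set-rows/scatter op does, so free slots never have to be contiguous and the
// cache never needs to be defragmented to make room for a batch.
func setRows(cache [][]float32, batch [][]float32, rowIdx []int) {
	for i, row := range batch {
		copy(cache[rowIdx[i]], row)
	}
}
```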
-
Jesse Gross authored
GGML requires tensors to be contiguous for reshape and will fail an assertion if this is not the case. Making a tensor contiguous is an expensive operation, so it's best to do it lazily, when it is actually required, rather than ahead of time when it may not be needed.
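A sketch of the lazy pattern against a hypothetical Tensor interface, paying for the copy only when the reshape precondition would otherwise fail:

```go
package nn

type Tensor interface {
	IsContiguous() bool
	Contiguous() Tensor
	Reshape(dims ...int) Tensor
}

// reshape makes t contiguous only if GGML's reshape precondition would fail.
func reshape(t Tensor, dims ...int) Tensor {
	if !t.IsContiguous() {
		t = t.Contiguous() // expensive copy, deferred until truly needed
	}
	return t.Reshape(dims...)
}
```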
-
Daniel Hiltgen authored
Calling abort on Windows triggers the C++ runtime to attempt a debugger attach, which causes crashed runners to hang instead of exiting, leading to a timeout instead of a fast failure during discovery.
-
Michael Yang authored
CUDA panics on batches larger than 1024, so skip those and fall back to the CPU.
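A tiny sketch of the guard; the 1024 limit comes from the message above, everything else is illustrative:

```go
package imageproc

const maxCUDABatch = 1024

// backendFor picks the device for a batch: batches above the CUDA limit go to
// the CPU instead of panicking on the GPU.
func backendFor(batchSize int, haveCUDA bool) string {
	if haveCUDA && batchSize <= maxCUDABatch {
		return "cuda"
	}
	return "cpu"
}
```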
-
Michael Yang authored
-
- 18 Nov, 2025 2 commits
-
-
Michael Yang authored
* migrate to golangci-lint v2
* copyloopvar
-
Grace authored
* Add mla for flash attention
* Revert to using chunks
-
- 17 Nov, 2025 1 commit
-
-
Daniel Hiltgen authored
* build: optimize dockerfile context for iterating
  This moves the copy of the source into the layer AFTER doing software installs so we don't have to go through the RPM install for cuda, etc. every time you touch a source file.
* amd: implement linux sysfs based VRAM lookup
  This adds a C++ implementation of sysfs DRM VRAM discovery for more accurate free VRAM data on linux for AMD GPUs.
-
- 13 Nov, 2025 2 commits
-
-
Michael Yang authored
* use slice/chunks
* bert
* llama4
* gemma3n
* gptoss
* mistral3
* qwen3vl
* qwen25vl
* deepseek2
* remove unused ops
-
Michael Yang authored
* slice
* chunk, chunksections
-
- 12 Nov, 2025 1 commit
-
-
Daniel Hiltgen authored
This should be reverted once we update ggml past b6897
-
- 11 Nov, 2025 2 commits
-
-
Jesse Gross authored
We currently assign model layers to GPUs according to free VRAM, which assumes that GPU performance is roughly equal. This does not work well for mixed dGPU and iGPU systems because iGPUs typically use system memory, which is large, but their performance is slow. This instead assigns layers to dGPUs first and then iGPUs. In the future, this could be generalized to a more fine-grained notion of GPU performance, but the dGPU vs. iGPU gap is the most extreme case.
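A sketch of the dGPU-first ordering, with illustrative fields rather than the actual ml.DeviceInfo definition:

```go
package scheduler

import "sort"

type DeviceInfo struct {
	ID         string
	Integrated bool
	FreeVRAM   uint64
}

// orderForOffload sorts discrete GPUs ahead of integrated ones, then by free
// VRAM, so layers are assigned to dGPUs first and iGPUs only as overflow.
func orderForOffload(devs []DeviceInfo) []DeviceInfo {
	sort.SliceStable(devs, func(i, j int) bool {
		if devs[i].Integrated != devs[j].Integrated {
			return !devs[i].Integrated // dGPUs first
		}
		return devs[i].FreeVRAM > devs[j].FreeVRAM
	})
	return devs
}
```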
-
Jesse Gross authored
We used to control the way that llama.cpp saw devices using CUDA_VISIBLE_DEVICES or similar. This would ensure that the layers offloaded to a device were actually the ones intended. This is particularly important because we might reorder devices based on free memory or performance. When we started explicitly scheduling layers, this logic went away but the llamarunner didn't have any way to set the correct order of devices. This meant that the correct number of layers would be assigned to a device but not necessarily the layers that were expected. This change sets up the devices correctly based on the offload information.
-
- 06 Nov, 2025 2 commits
-
-
Thomas Stocker authored
* Remove unnecessary macos 13 Patch
* Remove unnecessary MacOs Version Guard patch
* rename patches
* remove again macos13 patch
* rename files
-
Daniel Hiltgen authored
-
- 04 Nov, 2025 4 commits
-
-
Daniel Hiltgen authored
* discovery: only retry AMD GPUs
  CUDA and Vulkan don't crash on unsupported devices, so retry isn't necessary. This also refactors the code to shift the Library specific logic into the ml package.
* review comments
-
virajwad authored
* PDH free memory skeleton
* Add PDH printing
* Add LUID support for Vulkan
* wire luid from ggml-vulkan to mem-dxgi-pdh file
* Fix to ggml-impl
* Continue skeleton
* Implemented ggml_dxgi_pdh_get_device_memory
* fix comments
* Fix - change value GB to bytes
* add ifdefs to only support windows and not linux
* modify error codes
* Finished ggml_dxgi_pdh_init() function
* completed ggml_dxgi_pdh_release()
* Formatting changes, add static to functions
* fix build errors
* fix go build error
* fix luid - now should match between dxgi and vulkan
* Fix the free memory reporting (was using copy by value, change to reference)
* keep only dxgi1_2.h
* Modifications based on PR feedback
* fix merge conflicts (2) and fix desc1.description printout
* move dxgi + pdh api calls to before the vendor specific library calls
* change from 3 samples to 1 sample for PDH
* modify when old_mode is set
* add fix for building MacOS
* fix release and returns for other vendors
* add patch file
-
Daniel Hiltgen authored
Also adjusts the vulkan windows build pattern to match recent changes in other backends so incremental builds are faster.
-
Jesse Gross authored
The initial implementation of qwen3-vl:235b exceeded the maximum graph size, which is based on the number of tensors. Although this was later fixed through the use of the mrope operation, we are close to the limit in some cases. This updates the limit to track the current llama.cpp usage of GGML.
-
- 31 Oct, 2025 2 commits
-
-
Jesse Gross authored
We pass invalid pointers when we check the size of the required compute graph before fitting. Some CUDA APIs validate these pointers, but we can just skip them during this phase. cudaMemsetAsync is one of these that we weren't skipping, but previously we never took the code path that used it. Now that we have enabled op_offload, we can hit it in memory-pressured situations.
-
Daniel Hiltgen authored
In CPU-only setups, LibOllamaPath was omitted, causing us not to load the ggml-cpu-XXX libraries during inference.
-
- 30 Oct, 2025 3 commits
-
-
Jesse Gross authored
When a model is partially offloaded to system RAM, we can either do the calculations on the CPU or temporarily transfer the data to the GPU and do the calculations there. Small batches tend to be better on the CPU, large batches on the GPU. The llamarunner used the GPU in most cases and the ollamarunner used the CPU. Although the ollamarunner saw an improvement in token generation performance, there was a large performance hit in prompt processing (3-10x). There is an existing heuristic to dynamically switch between these two modes, but in practice it doesn't have enough information to accurately make that decision. This adds authoritative data so the check can make the right choice and get the best of both worlds. Fixes #12037
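A rough sketch of the batch-size decision described above; the cutoff and names are illustrative, not the values the runner actually uses:

```go
package runner

// offloadHostOps reports whether ops on host-resident weights should be copied
// to the GPU for this batch: small batches stay on the CPU, large batches are
// worth the transfer cost.
func offloadHostOps(batchTokens int, weightsOnHost bool) bool {
	const minBatchForGPU = 32 // illustrative cutoff
	if !weightsOnHost {
		return false // nothing is partially offloaded; nothing to decide
	}
	return batchTokens >= minBatchForGPU
}
```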
-
Michael Yang authored
* ml(ggml): mrope
* interleave mrope
-
Michael Yang authored
* mulmat
* permute
-
- 29 Oct, 2025 2 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
- 28 Oct, 2025 2 commits
-
-
Daniel Hiltgen authored
* Fix vulkan PCI ID and ID handling
  Intel GPUs may not report PCI IDs, which was leading to incorrect overlap detection. Switch to using the existing PCI IDs; AMD GPUs claim not to report PCI IDs but actually do, so try anyway, as this is required for ADLX to find the GPUs on Windows. Numeric IDs lead to scheduling problems, so this also switches Vulkan to use UUID-based IDs. The GPU discovery patches have been squashed into a single patch to simplify future rebases.
* review comments
-
Michael Yang authored
-
- 23 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
* DRY out the runner lifecycle code
  Now that discovery uses the runners as well, this unifies the runner spawning code into a single place. This also unifies GPU discovery types with the newer ml.DeviceInfo.
* win: make incremental builds better
  Place build artifacts in discrete directories so incremental builds don't have to start fresh.
* Adjust sort order to consider iGPUs
* handle cpu inference oom scenarios
* review comments
-
- 20 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
Users on Windows without GPUs are reporting errors relating to cudaDriverGetVersion with the device set to -1. This ensures we only grab the driver once we're enumerating actual devices.
-
- 18 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
When loading the dynamic libraries, if something goes wrong, report some details. Unfortunately this won't explain which dependencies are missing, but this breadcrumb in the logs should help us diagnose GPU discovery failures.
-
- 16 Oct, 2025 1 commit
-
-
Thomas Stocker authored
* vulkan: Get FilterID from Backend for Vulkan
* Fixing patch
-
- 15 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
-