- 08 Dec, 2025 1 commit
-
-
Michael Yang authored
Change to a flatter directory structure and group the options with the function; update models to call rope in one place.
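As a rough illustration of the refactor (names and layout are hypothetical, not the repo's actual rope package), grouping the options next to the function lets every model route its RoPE call through one place:

```go
// Minimal sketch of the "options grouped with the function" pattern.
package rope

type Options struct {
	Dim   int     // rotary dimension
	Base  float32 // theta base
	Scale float32 // context-extension scale
}

type Option func(*Options)

func WithBase(base float32) Option   { return func(o *Options) { o.Base = base } }
func WithScale(scale float32) Option { return func(o *Options) { o.Scale = scale } }

// Apply collects per-model settings so every model's RoPE call goes through one place.
func Apply(dim int, opts ...Option) Options {
	o := Options{Dim: dim, Base: 10000, Scale: 1}
	for _, opt := range opts {
		opt(&o)
	}
	return o
}
```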
-
- 06 Dec, 2025 1 commit
-
-
Daniel Hiltgen authored
Follow-up from #12992: free all streams, and keep the alloc logic aligned across streams.
-
- 04 Dec, 2025 3 commits
-
-
Jesse Gross authored
Although the vision component of multimodal models typically already calls the optimized nn.Attention, it gets converted into non-fused operations. That is because the backend-specific fused kernels may have requirements, such as padding, which are normally handled by the cache, and vision encoders don't use the cache. This implements a fallback path in the backend, softening the requirements into optimizations. In turn, this allows flash attention to be used for vision encoders, saving a significant amount of VRAM and improving performance.
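A minimal sketch of the idea, with hypothetical types rather than the actual nn.Attention/backend API: the fused kernel's padding requirement becomes an optimization, and callers without a cache either pad locally or fall back to unfused ops:

```go
package attention

type kernel struct {
	flashAttention bool
	kvPadding      int // preferred KV-length multiple for the fused kernel
}

// useFused reports whether the fused (flash attention) path can be taken for a
// sequence of kvLen rows, and how many rows to pad to if so.
func (k kernel) useFused(kvLen int, canPad bool) (padded int, ok bool) {
	if !k.flashAttention {
		return kvLen, false
	}
	if k.kvPadding <= 1 || kvLen%k.kvPadding == 0 {
		return kvLen, true
	}
	if canPad { // e.g. a vision encoder with no KV cache: pad locally
		return ((kvLen + k.kvPadding - 1) / k.kvPadding) * k.kvPadding, true
	}
	return kvLen, false // softened requirement: fall back to unfused ops
}
```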
-
Jesse Gross authored
We currently use a cache padding of 32 when not using flash attention and 256 with flash attention, based on the historic alignment requirements of these kernels. The restrictions have since been loosened, but there are still performance benefits, such as better CUDA graph reuse. Since the requirement is no longer kernel-specific, set the padding uniformly to 256, matching llama.cpp.
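A small sketch of the uniform padding rule (the constant name is illustrative, not necessarily the one used in the code base):

```go
package cache

const kvCachePadding = 256 // previously 32 without flash attention, 256 with

// paddedLength rounds the active KV length up to the padding multiple so the
// same graph shapes (and therefore CUDA graphs) can be reused across steps.
func paddedLength(n int) int {
	return ((n + kvCachePadding - 1) / kvCachePadding) * kvCachePadding
}
```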
-
Daniel Hiltgen authored
* Revert "vulkan: temporary cary of vulkan fixes (#12971)" This reverts commit 3a9e8e9f. * ggml update to b7087 * fix argsort on metal * update to b7108 * fix bakllava regression This model lacks the metadata for the projector type. * update to b7209 * fix TopK perf * only build arm code on arm
-
- 03 Dec, 2025 1 commit
-
-
Daniel Hiltgen authored
We now do a deeper probe of CUDA devices to verify the library version has the correct compute capability coverage for the device. Because ROCm also interprets the CUDA env var to filter AMD devices, we try to avoid setting it, since doing so leads to problems in mixed-vendor systems. However, without setting it for this deeper probe, each CUDA library subprocess discovers all CUDA GPUs, and on systems with lots of GPUs this can lead to hitting timeouts. The fix is to turn on the CUDA visibility env var just for this deeper-probe use case.
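A sketch of how the env var can be scoped to just the probe subprocess, assuming a hypothetical probe helper and binary path:

```go
package discover

import (
	"fmt"
	"os"
	"os/exec"
)

// probeDevice runs the deeper CUDA probe for a single device index, limiting
// the subprocess to that device so it doesn't enumerate every GPU. The parent
// process (and ROCm discovery) never sees the variable.
func probeDevice(runnerPath string, index int) error {
	cmd := exec.Command(runnerPath, "--probe")
	cmd.Env = append(os.Environ(), fmt.Sprintf("CUDA_VISIBLE_DEVICES=%d", index))
	return cmd.Run()
}
```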
-
- 02 Dec, 2025 1 commit
-
-
Daniel Hiltgen authored
-
- 19 Nov, 2025 5 commits
-
-
Jesse Gross authored
We currently copy data into the KV cache in contiguous buffers using ggml_cpy(). ggml_set_rows() was introduced to allow scatter operations, so contiguous buffers are no longer required. The primary direct benefit of this is that we no longer need to perform defragmentation. However, GGML recently removed an optimization for ggml_cpy(), and we picked that up in 544b6739 "ggml update to b6840 (#12791)". This caused a roughly 40% drop in token generation performance on CUDA due to CUDA graphs no longer being used. By switching to ggml_set_rows(), the original optimization is no longer necessary and CUDA performance is restored. Fixes #13112
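A conceptual sketch of why scatter writes remove the need for defragmentation, with plain Go slices standing in for ggml_cpy vs. ggml_set_rows:

```go
package kvcache

// setRows writes each batch row to the cache slot named by rowIdx, the way a
// set-rows/scatter op does, so free slots never have to be contiguous and the
// cache never needs to be defragmented to make room for a batch.
func setRows(cache [][]float32, batch [][]float32, rowIdx []int) {
	for i, row := range batch {
		copy(cache[rowIdx[i]], row)
	}
}
```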
-
Jesse Gross authored
GGML requires tensors to be contiguous for reshape and will fail an assertion if this is not the case. Making a tensor contiguous is an expensive operation, so it's best to do it lazily, when it is actually required, rather than ahead of time when it may not be needed.
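A sketch of the lazy pattern against a hypothetical Tensor interface, paying for the copy only when the reshape precondition would otherwise fail:

```go
package nn

type Tensor interface {
	IsContiguous() bool
	Contiguous() Tensor
	Reshape(dims ...int) Tensor
}

// reshape makes t contiguous only if GGML's reshape precondition would fail.
func reshape(t Tensor, dims ...int) Tensor {
	if !t.IsContiguous() {
		t = t.Contiguous() // expensive copy, deferred until truly needed
	}
	return t.Reshape(dims...)
}
```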
-
Daniel Hiltgen authored
Calling abort on Windows triggers the C++ runtime to attempt a debugger attach, which causes crashed runners to hang instead of exiting, leading to a timeout instead of a fast failure during discovery.
-
Michael Yang authored
CUDA panics on batches larger than 1024, so skip those and fall back to the CPU.
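A tiny sketch of the guard; the 1024 limit comes from the message above, everything else is illustrative:

```go
package imageproc

const maxCUDABatch = 1024

// backendFor picks the device for a batch: batches above the CUDA limit go to
// the CPU instead of panicking on the GPU.
func backendFor(batchSize int, haveCUDA bool) string {
	if haveCUDA && batchSize <= maxCUDABatch {
		return "cuda"
	}
	return "cpu"
}
```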
-
Michael Yang authored
-
- 18 Nov, 2025 2 commits
-
-
Michael Yang authored
* migrate to golangci-lint v2
* copyloopvar
-
Grace authored
* Add mla for flash attention
* Revert to using chunks
-
- 17 Nov, 2025 1 commit
-
-
Daniel Hiltgen authored
* build: optimize dockerfile context for iterating
  This moves the copy of the source into the layer AFTER doing software installs so we don't have to go through the RPM install for cuda, etc. every time you touch a source file.
* amd: implement linux sysfs based VRAM lookup
  This adds a C++ implementation of sysfs DRM VRAM discovery for more accurate free VRAM data on linux for AMD GPUs.
-
- 13 Nov, 2025 2 commits
-
-
Michael Yang authored
* use slice/chunks
* bert
* llama4
* gemma3n
* gptoss
* mistral3
* qwen3vl
* qwen25vl
* deepseek2
* remove unused ops
-
Michael Yang authored
* slice
* chunk, chunksections
-
- 12 Nov, 2025 1 commit
-
-
Daniel Hiltgen authored
This should be reverted once we update ggml past b6897
-
- 11 Nov, 2025 2 commits
-
-
Jesse Gross authored
We currently assign model layers to GPUs according to free VRAM, which assumes that GPU performance is roughly equal. This does not work well for mixed dGPU and iGPU systems because iGPUs typically use system memory, which is large, but their performance is slow. This instead assigns layers to dGPUs first and then iGPUs. In the future, this could be generalized to a more fine-grained notion of GPU performance, but the dGPU vs. iGPU gap is the most extreme case.
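A sketch of the dGPU-first ordering, with illustrative fields rather than the actual ml.DeviceInfo definition:

```go
package scheduler

import "sort"

type DeviceInfo struct {
	ID         string
	Integrated bool
	FreeVRAM   uint64
}

// orderForOffload sorts discrete GPUs ahead of integrated ones, then by free
// VRAM, so layers are assigned to dGPUs first and iGPUs only as overflow.
func orderForOffload(devs []DeviceInfo) []DeviceInfo {
	sort.SliceStable(devs, func(i, j int) bool {
		if devs[i].Integrated != devs[j].Integrated {
			return !devs[i].Integrated // dGPUs first
		}
		return devs[i].FreeVRAM > devs[j].FreeVRAM
	})
	return devs
}
```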
-
Jesse Gross authored
We used to control the way that llama.cpp saw devices using CUDA_VISIBLE_DEVICES or similar. This would ensure that the layers offloaded to a device were actually the ones intended. This is particularly important because we might reorder devices based on free memory or performance. When we started explicitly scheduling layers, this logic went away but the llamarunner didn't have any way to set the correct order of devices. This meant that the correct number of layers would be assigned to a device but not necessarily the layers that were expected. This change sets up the devices correctly based on the offload information.
-
- 06 Nov, 2025 2 commits
-
-
Thomas Stocker authored
* Remove unnecessary macos 13 Patch
* Remove unnecessary MacOs Version Guard patch
* rename patches
* remove again macos13 patch
* rename files
-
Daniel Hiltgen authored
-
- 04 Nov, 2025 4 commits
-
-
Daniel Hiltgen authored
* discovery: only retry AMD GPUs
  CUDA and Vulkan don't crash on unsupported devices, so retry isn't necessary. This also refactors the code to shift the Library specific logic into the ml package.
* review comments
-
virajwad authored
* PDH free memory skeleton
* Add PDH printing
* Add LUID support for Vulkan
* wire luid from ggml-vulkan to mem-dxgi-pdh file
* Fix to ggml-impl
* Continue skeleton
* Implemented ggml_dxgi_pdh_get_device_memory
* fix comments
* Fix - change value GB to bytes
* add ifdefs to only support windows and not linux
* modify error codes
* Finished ggml_dxgi_pdh_init() function
* completed ggml_dxgi_pdh_release()
* Formatting changes, add static to functions
* fix build errors
* fix go build error
* fix luid - now should match between dxgi and vulkan
* Fix the free memory reporting (was using copy by value, change to reference)
* keep only dxgi1_2.h
* Modifications based on PR feedback
* fix merge conflicts (2) and fix desc1.description printout
* move dxgi + pdh api calls to before the vendor specific library calls
* change from 3 samples to 1 sample for PDH
* modify when old_mode is set
* add fix for building MacOS
* fix release and returns for other vendors
* add patch file
-
Daniel Hiltgen authored
Also adjusts the vulkan windows build pattern to match recent changes in other backends so incremental builds are faster.
-
Jesse Gross authored
The initial implementation of qwen3-vl:235b exceeded the maximum graph size, which is based on the number of tensors. Although this was later fixed through the use of the mrope operation, we are close to the limit in some cases. This updates the limit to track the current llama.cpp usage of GGML.
-
- 31 Oct, 2025 2 commits
-
-
Jesse Gross authored
We pass invalid pointers when we check the size of the required compute graph before fitting. Some CUDA APIs validate these pointers, but we can just skip them during this phase. cudaMemsetAsync is one of these that we weren't skipping, but previously we never took the code path that used it. Now that we have enabled op_offload, we can hit it in memory-pressured situations.
-
Daniel Hiltgen authored
In CPU-only setups, LibOllamaPath was omitted, causing us not to load the ggml-cpu-XXX libraries during inference.
-
- 30 Oct, 2025 3 commits
-
-
Jesse Gross authored
When a model is partially offloaded to system RAM, we can either do the calculations on the CPU or temporarily transfer the data to the GPU and do the calculations there. Small batches tend to be better on the CPU, large batches on the GPU. The llamarunner used the GPU in most cases and the ollamarunner used the CPU. Although the ollamarunner saw an improvement in token generation performance, there was a large performance hit in prompt processing (3-10x). There is an existing heuristic to dynamically switch between these two modes, but in practice it doesn't have enough information to accurately make that decision. This adds authoritative data so the check can make the right choice and get the best of both worlds. Fixes #12037
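A rough sketch of the batch-size decision described above; the cutoff and names are illustrative, not the values the runner actually uses:

```go
package runner

// offloadHostOps reports whether ops on host-resident weights should be copied
// to the GPU for this batch: small batches stay on the CPU, large batches are
// worth the transfer cost.
func offloadHostOps(batchTokens int, weightsOnHost bool) bool {
	const minBatchForGPU = 32 // illustrative cutoff
	if !weightsOnHost {
		return false // nothing is partially offloaded; nothing to decide
	}
	return batchTokens >= minBatchForGPU
}
```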
-
Michael Yang authored
* ml(ggml): mrope
* interleave mrope
-
Michael Yang authored
* mulmat
* permute
-
- 29 Oct, 2025 2 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
- 28 Oct, 2025 2 commits
-
-
Daniel Hiltgen authored
* Fix vulkan PCI ID and ID handling
  Intel GPUs may not report PCI IDs, which was leading to incorrect overlap detection. Switch to using the existing PCI IDs; AMD GPUs claim not to report PCI IDs but actually do, so try anyway, as this is required for ADLX to find the GPUs on Windows. Numeric IDs lead to scheduling problems, so this also switches Vulkan to use UUID-based IDs. The GPU discovery patches have been squashed into a single patch to simplify future rebases.
* review comments
-
Michael Yang authored
-
- 23 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
* DRY out the runner lifecycle code
  Now that discovery uses the runners as well, this unifies the runner spawning code into a single place. This also unifies GPU discovery types with the newer ml.DeviceInfo.
* win: make incremental builds better
  Place build artifacts in discrete directories so incremental builds don't have to start fresh.
* Adjust sort order to consider iGPUs
* handle cpu inference oom scenarios
* review comments
-
- 20 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
Users on Windows without GPUs are reporting errors relating to cudaDriverGetVersion with the device set to -1. This ensures we only grab the driver once we're enumerating actual devices.
-
- 18 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
When loading the dynamic libraries, if something goes wrong, report some details. Unfortunately this won't explain which dependencies are missing, but this breadcrumb in the logs should help us diagnose GPU discovery failures.
-
- 16 Oct, 2025 1 commit
-
-
Thomas Stocker authored
* vulkan: Get FilterID from Backend for Vulkan
* Fixing patch
-
- 15 Oct, 2025 1 commit
-
-
Daniel Hiltgen authored
-