    ggml: Preallocate CUDA pool memory · 3d0b1734
    Jesse Gross authored
    The GGML CUDA backend allocates additional memory for intermediate
    results during calculation. This memory isn't currently allocated
    during worst-case graph reservation and is therefore not included in
    scheduling. Since these buffers can grow with context length, we
    could crash.
    
    This extends the memory allocation system down a layer, from the
    GGML graph to the CUDA backend, preallocating the worst-case memory
    there as well.
    
    Fixes #11753
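    A minimal sketch of the pattern the commit describes, using
    hypothetical names rather than the actual ggml CUDA pool API: during
    worst-case graph reservation the pool runs in a "no-alloc" mode that
    only records the peak size requested; a single commit step then
    preallocates that worst case, so real allocations during inference
    never grow the pool mid-run.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstddef>
    #include <vector>

    // Hypothetical illustration, not the ggml implementation.
    class WorstCasePool {
    public:
        // Enter reservation ("no-alloc") mode: track sizes, don't allocate.
        void begin_reserve() { reserving_ = true; peak_ = 0; used_ = 0; }

        // In reserve mode this only advances the high-water mark; in normal
        // mode it hands out an offset into the preallocated buffer.
        size_t alloc(size_t n) {
            size_t off = used_;
            used_ += n;
            peak_ = std::max(peak_, used_);
            if (!reserving_) {
                assert(used_ <= buf_.size()); // never grows after commit()
            }
            return off;
        }

        // Release all intermediate allocations between graph evaluations.
        void free_all() { used_ = 0; }

        // Preallocate the recorded worst case in one shot.
        void commit() { buf_.resize(peak_); reserving_ = false; }

        size_t capacity() const { return buf_.size(); }

    private:
        bool reserving_ = false;
        size_t peak_ = 0, used_ = 0;
        std::vector<char> buf_;
    };
    ```

    Because the reservation pass sees the worst-case graph, the committed
    capacity covers every later allocation pattern, which is what brings
    these buffers into the scheduler's memory accounting.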
0022-ggml-No-alloc-mode.patch 25.6 KB