ml/backend/ggml/ggml.go · 34c3b68fc8a14eb5a93f6bdd175fa94e2e8fa12b · OpenDAS / ollama

ggml: Don't allocate CPU buffers as CUDA Host buffers · 34c3b68f

Jesse Gross authored Apr 09, 2025

Allocating (and, in particular, freeing) CUDA host buffers, which are
page-locked, is expensive and can cause a significant performance hit
if we do it for every token. Using normal system memory avoids this
issue and also gives the OS more flexibility to manage it, since
ordinary pages are not pinned in physical memory.

This patch has no direct performance impact (either positive or
negative), but it makes a difference once we start freeing memory
correctly.
