ggml: Don't allocate CPU buffers as CUDA Host buffers

Allocating (and in particular, freeing) memory from CUDA host buffers is expensive and can cause a significant performance hit if we do it for every token. Using normal system memory avoids this issue and also gives the OS more flexibility to manage it. This patch has no direct performance impact, positive or negative, but it makes a difference once we start freeing memory correctly.

Commit 34c3b68f authored Apr 09, 2025 by Jesse Gross Committed by Jesse Gross Apr 11, 2025

Showing 1 changed file with 0 additions and 6 deletions

ml/backend/ggml/ggml.go +0 -6

--- a/ml/backend/ggml/ggml.go
+++ b/ml/backend/ggml/ggml.go
@@ -384,12 +384,6 @@ func New(ctx context.Context, r *os.File, params ml.BackendParams) (ml.Backend,
 	for _, d := range append(gpus, append(accels, cpus...)...) {
 		b := C.ggml_backend_dev_init(d, nil)
 		bt := C.ggml_backend_get_default_buffer_type(b)
-		if d := C.ggml_backend_get_device(b); C.ggml_backend_dev_type(d) == C.GGML_BACKEND_DEVICE_TYPE_CPU && len(gpus) > 0 {
-			// use the first gpu host buffer type for gpu if possible
-			if hbt := C.ggml_backend_dev_host_buffer_type(gpus[0]); hbt != nil {
-				bt = hbt
-			}
-		}

 		deviceBufferTypes[d] = bt
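
The effect of the removed branch can be sketched in plain Go without cgo. This is only an illustration: the `device` struct, the `bufferTypeBefore` and `bufferTypeAfter` helpers, and the string labels such as "CUDA_Host" are hypothetical stand-ins for the ggml device and buffer-type handles, not the real bindings.

```go
package main

import "fmt"

// deviceKind is a hypothetical stand-in for the ggml device type enum.
type deviceKind int

const (
	kindCPU deviceKind = iota
	kindGPU
)

// device is a hypothetical model of a ggml backend device. Buffer types
// are represented by plain string labels instead of C handles.
type device struct {
	name       string
	kind       deviceKind
	defaultBuf string // label of the device's default buffer type
	hostBuf    string // label of its pinned host buffer type, "" if none
}

// bufferTypeBefore models the removed branch: a CPU device borrows the
// host buffer type of the first GPU when one is available.
func bufferTypeBefore(d device, gpus []device) string {
	bt := d.defaultBuf
	if d.kind == kindCPU && len(gpus) > 0 {
		if hbt := gpus[0].hostBuf; hbt != "" {
			bt = hbt
		}
	}
	return bt
}

// bufferTypeAfter models the code after the patch: every device keeps
// its own default buffer type, so CPU buffers use normal system memory.
func bufferTypeAfter(d device) string {
	return d.defaultBuf
}

func main() {
	gpus := []device{{name: "gpu0", kind: kindGPU, defaultBuf: "CUDA0", hostBuf: "CUDA_Host"}}
	cpu := device{name: "cpu", kind: kindCPU, defaultBuf: "CPU"}

	fmt.Println("before:", bufferTypeBefore(cpu, gpus))
	fmt.Println("after: ", bufferTypeAfter(cpu))
}
```

In this model only the CPU device's choice changes; a GPU device gets its default buffer type both before and after the patch.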