    ggml: Don't allocate CPU buffers as CUDA Host buffers · 34c3b68f
    Jesse Gross authored
    Allocating (and in particular, freeing) memory from CUDA host buffers
    is expensive and can cause a significant performance hit if we do
    it for every token. Using normal system memory avoids this issue
    and also gives the OS more flexibility to manage it.
    
    This patch has no direct performance impact (either positive or
    negative), but the change makes a difference once we start
    freeing memory correctly.
ggml.go 27.6 KB