    ggml: Avoid allocating CUDA primary context on unused GPUs · 9d97e6a9
    Jesse Gross authored
    The recent memory management changes caused all GPUs to be visible
    to the runner, regardless of whether they are ultimately used. This
    caused CUDA devices to allocate a primary context (~300 MB of VRAM) on
    each GPU for each model. This is unnecessary, so we can both avoid
    touching GPUs that we exclude in the early stage of allocation and
    free the memory for any that we touch but don't use.
    
    The issue will continue to exist for the old engine, since it touches
    all devices during initialization.
ggml.go 40.2 KB