    llm: Don't always evict models in CPU-only mode · 073fa31d
    Jesse Gross authored
    With old memory estimates, it's currently impossible to load more
    than one model at a time when no GPUs are available. This is because
    the check for whether we need to evict a model looks to see if all
    layers of the new model can be loaded onto GPUs, which is never true
    if there are no GPUs. Before the memory management changes, there
    was a special code path for CPU-only systems.
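
The failure mode described above can be illustrated with a minimal Go sketch. It is not the code from memory.go: the `loadRequest` type, its fields, and both function names are hypothetical, and the CPU-only branch (comparing the estimated requirement against free system memory) is an assumption about what a special code path might look like, not the actual fix.

```go
package main

import "fmt"

// loadRequest is a hypothetical summary of one model load attempt.
// None of these names come from the real scheduler.
type loadRequest struct {
	gpuCount    int    // number of GPUs detected
	totalLayers int    // layers in the new model
	layersOnGPU int    // layers the memory estimate places on GPUs
	requiredRAM uint64 // estimated system memory needed, in bytes
	freeRAM     uint64 // free system memory, in bytes
}

// needsEvictionGPUOnly mirrors the check described in the commit message:
// evict an existing model unless every layer of the new one fits on GPUs.
// With zero GPUs, layersOnGPU is always 0, so this always returns true.
func needsEvictionGPUOnly(r loadRequest) bool {
	return r.layersOnGPU < r.totalLayers
}

// needsEvictionWithCPUPath adds an assumed CPU-only branch: with no GPUs,
// decide based on system memory instead of GPU layer placement.
func needsEvictionWithCPUPath(r loadRequest) bool {
	if r.gpuCount == 0 {
		return r.requiredRAM > r.freeRAM
	}
	return r.layersOnGPU < r.totalLayers
}

func main() {
	// A CPU-only host with plenty of free RAM for a second model.
	cpuOnly := loadRequest{
		gpuCount:    0,
		totalLayers: 32,
		layersOnGPU: 0,
		requiredRAM: 4 << 30,
		freeRAM:     16 << 30,
	}

	fmt.Println(needsEvictionGPUOnly(cpuOnly))     // prints "true"
	fmt.Println(needsEvictionWithCPUPath(cpuOnly)) // prints "false"
}
```

In the sketch, the GPU-only check forces an eviction on every load for a CPU-only host, which is why only one model could be resident at a time.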
    
    This problem does not exist with new memory estimates.
    
    Fixes #11974
memory.go 14.9 KB