    Enable concurrency by default · 17b7186c
    Daniel Hiltgen authored
    This adjusts our default settings to enable loading multiple models and
    serving parallel requests to a single model.  Users can still override
    these defaults with the same env var settings as before.  The parallel
    setting has a direct impact on num_ctx, which in turn can have a
    significant impact on GPUs with little VRAM, so this change also refines
    the algorithm: when parallel is not explicitly set by the user, we try to
    find a reasonable default that fits the model on their GPU(s).  As before,
    multiple models will only load concurrently if they fully fit in VRAM.
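
    The default-selection idea described above can be illustrated with a
    minimal sketch. It is not the code in server.go: the names pickParallel,
    estimateVRAM and candidates, the candidate values, and the assumption
    that memory grows linearly with num_ctx times parallel are all
    hypothetical, chosen only to show the shape of the logic.

```go
package main

import "fmt"

const MiB uint64 = 1 << 20

// candidates is a hypothetical list of parallel values to try, largest first.
var candidates = []int{4, 2, 1}

// estimateVRAM is a stand-in estimator (an assumption, not the real one):
// model weights plus a KV cache that grows linearly with numCtx * parallel.
func estimateVRAM(weights, kvPerToken uint64, numCtx, parallel int) uint64 {
	return weights + kvPerToken*uint64(numCtx)*uint64(parallel)
}

// pickParallel returns the user's value when it was set explicitly (> 0).
// Otherwise it returns the largest candidate whose estimate fits in freeVRAM,
// falling back to 1 when nothing fits.
func pickParallel(userParallel int, weights, kvPerToken, freeVRAM uint64, numCtx int) int {
	if userParallel > 0 {
		return userParallel // an explicit user setting always wins
	}
	for _, p := range candidates {
		if estimateVRAM(weights, kvPerToken, numCtx, p) <= freeVRAM {
			return p
		}
	}
	return 1
}

func main() {
	weights := 4096 * MiB          // illustrative 4 GiB of model weights
	kvPerToken := uint64(128 * 1024) // illustrative 128 KiB of KV cache per token
	numCtx := 2048

	// Large GPU: the biggest candidate fits.
	fmt.Println(pickParallel(0, weights, kvPerToken, 8192*MiB, numCtx))
	// Small GPU: only a reduced parallel value fits.
	fmt.Println(pickParallel(0, weights, kvPerToken, 4700*MiB, numCtx))
	// Explicit user setting is honored even on the small GPU.
	fmt.Println(pickParallel(3, weights, kvPerToken, 4700*MiB, numCtx))
}
```

    Trying candidates from largest to smallest keeps the behavior
    predictable: a small-VRAM GPU degrades to fewer parallel slots rather
    than failing to fit the model, while an explicit user setting bypasses
    the search entirely.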
server.go 29.2 KB