1. 21 Jun, 2024 2 commits
    • Disable concurrency for AMD + Windows · 9929751c
      Daniel Hiltgen authored
      Until ROCm v6.2 ships, we won't be able to get accurate free memory
      reporting on Windows, which makes automatic concurrency too risky.
      Users can still opt in, but they will need to pay attention to model
      sizes; otherwise they may thrash/page VRAM or cause OOM crashes.
      All other platforms and GPUs have accurate VRAM reporting wired
      up now, so we can turn on concurrency by default.
    • Enable concurrency by default · 17b7186c
      Daniel Hiltgen authored
      This adjusts our default settings to enable multiple models and parallel
      requests to a single model. Users can still override these with the same
      env var settings as before. The parallel setting has a direct impact on
      num_ctx, which in turn can have a significant impact on small-VRAM GPUs,
      so this change also refines the algorithm: when parallel is not
      explicitly set by the user, we try to find a reasonable default that fits
      the model on their GPU(s). As before, multiple models will only load
      concurrently if they fully fit in VRAM.
2. 20 Jun, 2024 9 commits
3. 19 Jun, 2024 15 commits
4. 18 Jun, 2024 7 commits
5. 17 Jun, 2024 7 commits
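
The default-selection behavior described in commits 17b7186c and 9929751c can be illustrated with a rough sketch. This is not the project's actual code. Every name here (Gpu, concurrency_default_on, estimate_vram, pick_parallel), the candidate values (4, 1), and the cost model (weights plus a KV cache that grows with num_ctx times parallel, checked against the summed free VRAM) are assumptions made for illustration. The user_parallel argument stands in for whatever value the user supplies through the env var override.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Gpu:
    """Hypothetical GPU record; field names are illustrative only."""
    vendor: str       # e.g. "nvidia" or "amd"
    free_vram: int    # bytes the driver reports as free


def concurrency_default_on(os_name: str, gpus: Sequence[Gpu]) -> bool:
    """Off by default only for AMD GPUs on Windows, where free-memory
    reporting is described as unreliable; on everywhere else."""
    return not (os_name == "windows" and any(g.vendor == "amd" for g in gpus))


def estimate_vram(weights: int, kv_bytes_per_token: int,
                  num_ctx: int, parallel: int) -> int:
    """Assumed cost model: model weights plus a KV cache sized for
    num_ctx tokens in each of `parallel` request slots."""
    return weights + kv_bytes_per_token * num_ctx * parallel


def pick_parallel(user_parallel: Optional[int], os_name: str,
                  gpus: Sequence[Gpu], weights: int,
                  kv_bytes_per_token: int, num_ctx: int,
                  candidates: Sequence[int] = (4, 1)) -> int:
    """Return the number of parallel request slots to use."""
    # An explicit user setting always wins, including on AMD + Windows.
    if user_parallel is not None:
        return user_parallel
    # Without trustworthy free-memory numbers, stay conservative.
    if not concurrency_default_on(os_name, gpus):
        return 1
    # Simplification: treat the free VRAM of all GPUs as one pool.
    free = sum(g.free_vram for g in gpus)
    # Try the largest candidate first and keep the first one that fits.
    for p in candidates:
        if estimate_vram(weights, kv_bytes_per_token, num_ctx, p) <= free:
            return p
    # Nothing fits fully in VRAM; fall back to a single slot.
    return 1


if __name__ == "__main__":
    GiB = 1024 ** 3
    gpus = [Gpu("nvidia", 8 * GiB)]
    print(pick_parallel(None, "linux", gpus, 4 * GiB, 128 * 1024, 2048))
```

Trying candidates from largest to smallest keeps the check cheap and predictable. A real implementation would also need to account for per-GPU layer placement and runtime overhead, which this sketch ignores.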