1. 15 Oct, 2025 1 commit
    • llm: Perform eviction when num_gpu is set with new estimates · 3dcfd5f6
      Jesse Gross authored
      Currently, setting num_gpu forces the model to load with that
      number of layers in the current configuration. This is done
      regardless of any other information, which means that no
      eviction is performed even if another model is loaded.
      
      This behavior is different from the old estimates (and still
      happens for models that run on the llama engine). In those
      cases, models would be evicted if needed to load at the requested
      number of layers. That behavior is more useful and less surprising,
      so this changes the new estimates to match.
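
      As a rough sketch of that behavior (all names below are hypothetical,
      not the actual scheduler code), the explicit-layer-count path now frees
      VRAM by evicting loaded models first instead of loading unconditionally:

        package main

        import "fmt"

        // loadedModel is a hypothetical stand-in for a model already resident on the GPU.
        type loadedModel struct {
            name string
            vram uint64 // bytes this model currently holds on the GPU
        }

        // loadWithNumGPU loads a model with an explicitly requested layer count.
        // If the layers do not fit in free VRAM, already-loaded models are evicted
        // first (matching the old estimates) rather than loading regardless.
        func loadWithNumGPU(need, free uint64, loaded []loadedModel) []loadedModel {
            for len(loaded) > 0 && free < need {
                evicted := loaded[0]
                loaded = loaded[1:]
                free += evicted.vram
                fmt.Printf("evicting %s to free %d bytes\n", evicted.name, evicted.vram)
            }
            // ...now load the model with the requested number of layers...
            return loaded
        }

        func main() {
            loadWithNumGPU(8<<30, 2<<30, []loadedModel{{name: "other-model", vram: 7 << 30}})
        }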
      
      Fixes #12580
  2. 01 Oct, 2025 1 commit
    • Use runners for GPU discovery (#12090) · bc8909fb
      Daniel Hiltgen authored
      This revamps how we discover GPUs in the system by leveraging the Ollama
      runner.  This should eliminate inconsistency between our GPU discovery and the
      runners' capabilities at runtime, particularly for cases where we try to filter
      out unsupported GPUs.  Now the runner does that implicitly based on the actual
      device list.  In some cases free VRAM reporting can be unreliable, which can
      lead to scheduling mistakes, so this also includes a patch to leverage more
      reliable VRAM reporting libraries if available.
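
      A minimal sketch of the idea, using hypothetical types rather than the
      real discovery API:

        package main

        import "fmt"

        // Device is a hypothetical stand-in for whatever the runner reports.
        type Device struct {
            ID       string
            Library  string // e.g. "cuda" or "rocm"
            FreeVRAM uint64 // bytes, ideally from the more reliable reporting path
        }

        // discoverGPUs asks the runner for its own device list, so any GPU the
        // runner cannot drive never appears and needs no explicit filtering.
        func discoverGPUs(listRunnerDevices func() ([]Device, error)) ([]Device, error) {
            devs, err := listRunnerDevices()
            if err != nil {
                return nil, fmt.Errorf("runner device discovery failed: %w", err)
            }
            return devs, nil
        }

        func main() {
            devs, _ := discoverGPUs(func() ([]Device, error) {
                return []Device{{ID: "GPU-0", Library: "cuda", FreeVRAM: 8 << 30}}, nil
            })
            fmt.Println(devs)
        }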
      
      Automatic workarounds have been removed, as only one GPU relied on them;
      that workaround is now documented. This GPU will soon fall off the support
      matrix with the next ROCm bump.
      
      Additional cleanup of the scheduler and discovery packages can be done in the
      future once we have switched on the new memory management code, and removed
      support for the llama runner.
  3. 30 Sep, 2025 1 commit
    • ggml: Remove allocation status reporting · 734b57da
      Jesse Gross authored
      For each memory allocation we report the size of the (attempted)
      allocation and whether it succeeded or failed. The latter status
      reporting proved not to be very useful in practice, as systems
      such as Windows can automatically overflow from VRAM into RAM,
      resulting in successful allocations even when there isn't
      enough memory where we wanted it.
      
      As a result, this information is only used for debug logging,
      which isn't worthwhile enough for the amount of code. It
      also isn't fully accurate, as multiple allocations may result
      in partial failures.
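
      Roughly, the report shrinks from a size-plus-status pair to size alone;
      the shapes below are hypothetical illustrations, not the actual ggml
      bindings:

        package main

        // Old shape (removed): a success flag that systems which spill VRAM
        // into RAM can report as true even when the memory did not land where
        // intended.
        type allocReportOld struct {
            Size uint64 // bytes requested
            OK   bool   // unreliable status bit
        }

        // New shape: only the attempted size is tracked.
        type allocReport struct {
            Size uint64 // bytes requested
        }

        func main() {
            _ = allocReportOld{Size: 1 << 20, OK: true}
            _ = allocReport{Size: 1 << 20}
        }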
  4. 14 Aug, 2025 1 commit
    • llm: New memory management · d5a0d8d9
      Jesse Gross authored
      This changes the memory allocation strategy from upfront estimation to
      tracking actual allocations done by the engine and reacting to that. The
      goal is to avoid issues caused by both under-estimation (crashing) and
      over-estimation (low performance due to under-utilized GPUs).
      
      It is currently opt-in and can be enabled for models running on the
      Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
      cases is unchanged and will continue to use the existing estimates.
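
      A minimal sketch of reading the opt-in flag; only the variable name
      OLLAMA_NEW_ESTIMATES comes from this message, the helper itself is
      hypothetical:

        package main

        import (
            "fmt"
            "os"
        )

        // newEstimatesEnabled reports whether the new allocation-tracking
        // memory management should be used instead of the upfront estimates.
        func newEstimatesEnabled() bool {
            return os.Getenv("OLLAMA_NEW_ESTIMATES") == "1"
        }

        func main() {
            if newEstimatesEnabled() {
                fmt.Println("tracking actual engine allocations")
            } else {
                fmt.Println("using the existing upfront estimates")
            }
        }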
  5. 08 May, 2025 1 commit
  6. 17 Dec, 2024 2 commits
    • llm: do not error on "null" format (#8139) · 2ddc32d5
      Blake Mizerany authored
      This fixes another regression introduced in the previous commit, which
      fixed other known bugs.
    • llm: do not silently fail for supplied, but invalid formats (#8130) · 87f0a49f
      Blake Mizerany authored
      Changes in #8002 introduced fixes for bugs with mangling JSON Schemas.
      It also fixed a bug where the server would silently fail when clients
      requested invalid formats. It also, unfortunately, introduced a bug
      where the server would reject requests with an empty format, which
      should be allowed.
      
      The change in #8127 updated the code to allow the empty format, but also
      reintroduced the regression where the server would silently fail when
      the format was set, but invalid.
      
      This commit fixes both regressions. The server does not reject the empty
      format, but it does reject invalid formats. It also adds tests to help
      us catch regressions in the future.
      
      Also, the updated code provides a more detailed error message when a
      client sends a non-empty, but invalid format, echoing the invalid format
      in the response.
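
      A minimal sketch of that behavior, using a hypothetical helper rather
      than the actual server code:

        package main

        import (
            "encoding/json"
            "fmt"
        )

        // checkFormat accepts an empty or "null" format, requires anything else
        // to at least parse as JSON, and echoes invalid input back in the error.
        func checkFormat(format []byte) error {
            if len(format) == 0 || string(format) == "null" {
                return nil // no format requested; nothing to validate
            }
            var v any
            if err := json.Unmarshal(format, &v); err != nil {
                return fmt.Errorf("invalid format %q: %w", format, err)
            }
            return nil
        }

        func main() {
            fmt.Println(checkFormat(nil))                     // <nil>
            fmt.Println(checkFormat([]byte(`null`)))          // <nil>
            fmt.Println(checkFormat([]byte(`{"type":"obj}`))) // invalid format ...
        }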
      
      This commit also takes the opportunity to remove superfluous linter
      checks.