1. 29 Aug, 2025 1 commit
  2. 28 Aug, 2025 1 commit
  3. 27 Aug, 2025 2 commits
    • ggml: Avoid allocating CUDA primary context on unused GPUs · 9d97e6a9
      Jesse Gross authored
      The recent memory management changes caused all GPUs to be visible
      to the runner, regardless of whether they are ultimately used. This
      caused CUDA devices to allocate a primary context (~300 MB of VRAM)
      on each GPU, for each model. This is unnecessary, so we now both
      avoid touching GPUs that we exclude in the early stage of allocation
      and free the memory for any that we touch but don't use.
      
      The issue will continue to exist for the old engine, since it touches
      all devices during initialization.
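      A minimal sketch of the idea using the stock CUDA runtime API (not the
      actual ggml backend code): for each device that was touched during
      allocation planning but ended up hosting no layers, resetting it
      releases the primary context and returns its VRAM.

      // Sketch only: the helper name and the used[] bookkeeping are assumptions.
      #include <cuda_runtime.h>
      #include <stdbool.h>

      void release_unused_devices(const bool *used, int device_count) {
          int current = 0;
          cudaGetDevice(&current);

          for (int dev = 0; dev < device_count; ++dev) {
              if (used[dev]) {
                  continue;  // leave devices that actually host model layers alone
              }
              // cudaDeviceReset destroys the primary context of the selected
              // device, freeing the roughly 300 MB of VRAM it holds.
              if (cudaSetDevice(dev) == cudaSuccess) {
                  cudaDeviceReset();
              }
          }

          cudaSetDevice(current);  // restore the previously active device
      }
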
    • fix keep alive (#12041) · 10815324
      Michael Yang authored
  4. 26 Aug, 2025 3 commits
  5. 25 Aug, 2025 1 commit
  6. 22 Aug, 2025 6 commits
  7. 21 Aug, 2025 1 commit
  8. 20 Aug, 2025 6 commits
  9. 19 Aug, 2025 2 commits
    • kvcache: Use Cast instead of Copy for flash attention masks · 05ccb17c
      Jesse Gross authored
      Flash attention kernels require the mask of the KV cache to be F16
      rather than F32. We can use the GGML operation ggml_cast to do this
      rather than doing it ourselves, which allows reuse of a preallocated
      buffer in the graph rather than allocating a new one for each batch.
      This improves token generation performance with flash attention by
      10-30% (with gpt-oss). It also makes performance with flash attention
      better than without it, as expected.
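      For reference, ggml_cast is the upstream GGML graph operation the
      message refers to. A minimal sketch of producing the F16 mask in-graph
      (the helper name and surrounding graph-building code are assumptions,
      not Ollama's actual kvcache code):

      #include "ggml.h"

      // Adds a type-converting node to the graph instead of copying into a
      // separately allocated F16 tensor each batch; the graph allocator can
      // then reuse a preallocated buffer for the result.
      struct ggml_tensor * fa_mask_f16(struct ggml_context * ctx,
                                       struct ggml_tensor * mask_f32) {
          return ggml_cast(ctx, mask_f32, GGML_TYPE_F16);
      }
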
    • disable output_all (#11959) · f804e8a4
      Michael Yang authored
  10. 18 Aug, 2025 8 commits
  11. 15 Aug, 2025 7 commits
  12. 14 Aug, 2025 2 commits
    • 6eaf194b
      Daniel Hiltgen authored
    • llm: New memory management · d5a0d8d9
      Jesse Gross authored
      This changes the memory allocation strategy from upfront estimation to
      tracking actual allocations done by the engine and reacting to them.
      The goal is to avoid issues caused by both under-estimation (crashing)
      and over-estimation (low performance due to under-utilized GPUs).
      
      It is currently opt-in and can be enabled for models running on the
      Ollama engine by setting OLLAMA_NEW_ESTIMATES=1. Behavior in other
      cases is unchanged and will continue to use the existing estimates.
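      A rough illustration of the track-and-react idea, with made-up helpers
      and numbers (the real implementation lives in Ollama's Go engine):
      measure what the engine actually allocates and shrink the GPU layer
      count until that fits, instead of trusting an upfront estimate.

      /* Sketch only: measured_vram_for stands in for whatever the engine
       * really allocates; the per-layer size and free VRAM are invented. */
      #include <stddef.h>
      #include <stdio.h>

      static size_t measured_vram_for(int gpu_layers) {
          return (size_t)gpu_layers * 512u * 1024u * 1024u;  /* pretend 512 MB per layer */
      }

      /* React to measurements: back off only when a real allocation doesn't fit. */
      static int fit_gpu_layers(int total_layers, size_t vram_free) {
          for (int layers = total_layers; layers > 0; layers--) {
              if (measured_vram_for(layers) <= vram_free) {
                  return layers;  /* accepted based on measurement, not estimation */
              }
          }
          return 0;  /* nothing fits on the GPU; run fully on CPU */
      }

      int main(void) {
          size_t vram_free = (size_t)8 * 1024 * 1024 * 1024;  /* 8 GB free, for the demo */
          printf("offloading %d layers\n", fit_gpu_layers(32, vram_free));
          return 0;
      }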