1. 03 Dec, 2025 1 commit
      CUDA: filter devices on secondary discovery (#13317) · 3f308367
      Daniel Hiltgen authored
      We now do a deeper probe of CUDA devices to verify that the library version
      has the correct compute capability coverage for the device. Because ROCm
      also interprets the CUDA visibility env var to filter AMD devices, we
      normally avoid setting it, since doing so causes problems on mixed-vendor
      systems. Without it, however, each CUDA library subprocess discovers all
      CUDA GPUs, and on systems with many GPUs this can hit timeouts. The fix is
      to set the CUDA visibility env var only for this deeper-probe use case.
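
      A minimal Go sketch of the pattern, assuming a hypothetical probe binary
      and probeDevice helper (only the CUDA_VISIBLE_DEVICES variable itself is
      from the commit):

      ```go
      package discover

      import (
          "fmt"
          "os"
          "os/exec"
      )

      // probeDevice runs the deeper probe against a single CUDA device by
      // scoping the visibility env var to the child process only.
      func probeDevice(probeBin string, deviceIndex int) error {
          cmd := exec.Command(probeBin)
          // Inherit the parent environment, then pin visibility to one device;
          // the parent itself never sets the variable, so ROCm discovery in
          // mixed-vendor systems is unaffected.
          cmd.Env = append(os.Environ(),
              fmt.Sprintf("CUDA_VISIBLE_DEVICES=%d", deviceIndex))
          return cmd.Run()
      }
      ```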
  2. 02 Dec, 2025 1 commit
  3. 11 Nov, 2025 2 commits
      llm: Prefer dedicated GPUs over iGPUs when allocating memory · 8bf38552
      Jesse Gross authored
      We currently assign model layers to GPUs according to free VRAM, which
      assumes that GPU performance is roughly equal. This does not work well on
      mixed dGPU and iGPU systems, because iGPUs typically use system memory,
      which is plentiful but slow. This change assigns layers to dGPUs first and
      only then to iGPUs.
      
      In the future this could be generalized into a finer-grained notion of GPU
      performance, but the dGPU vs. iGPU gap is the most extreme case.
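
      A minimal Go sketch of the ordering, using a hypothetical deviceInfo type:

      ```go
      package sched

      import "sort"

      type deviceInfo struct {
          ID         string
          Integrated bool   // true for iGPUs
          FreeVRAM   uint64 // bytes
      }

      // orderForAllocation places dGPUs ahead of iGPUs, and within each group
      // prefers the device with the most free VRAM.
      func orderForAllocation(devs []deviceInfo) {
          sort.SliceStable(devs, func(i, j int) bool {
              if devs[i].Integrated != devs[j].Integrated {
                  return !devs[i].Integrated // dGPUs first
              }
              return devs[i].FreeVRAM > devs[j].FreeVRAM
          })
      }
      ```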
      llamarunner: Respect device ordering for offloaded layers · 4372d0bf
      Jesse Gross authored
      We used to control which devices llama.cpp saw using CUDA_VISIBLE_DEVICES
      or similar, which ensured that the layers offloaded to a device were
      actually the ones intended. This is particularly important because we may
      reorder devices based on free memory or performance.
      
      When we started explicitly scheduling layers, this logic went
      away but the llamarunner didn't have any way to set the correct
      order of devices. This meant that the correct number of layers
      would be assigned to a device but not necessarily the layers
      that were expected. This change sets up the devices correctly
      based on the offload information.
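
      A minimal Go sketch of the idea, with hypothetical types and names
      (gpuLayers, orderedDevices):

      ```go
      package llamarunner

      // gpuLayers is a hypothetical pairing of a device with the layers the
      // scheduler assigned to it.
      type gpuLayers struct {
          DeviceID string
          Layers   []int
      }

      // orderedDevices returns device IDs in scheduling order, so llama.cpp
      // sees devices in the order layers were assigned rather than in its own
      // enumeration order.
      func orderedDevices(assignments []gpuLayers) []string {
          ids := make([]string, 0, len(assignments))
          for _, a := range assignments {
              ids = append(ids, a.DeviceID)
          }
          return ids
      }
      ```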
  4. 04 Nov, 2025 2 commits
      discovery: only retry AMD GPUs (#12894) · 27f1fde4
      Daniel Hiltgen authored
      * discovery: only retry AMD GPUs
      
      CUDA and Vulkan don't crash on unsupported devices, so retrying isn't
      necessary for them. This also refactors the code to move the
      library-specific logic into the ml package (a sketch of the retry gate
      follows this entry).
      
      * review comments
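
      A minimal Go sketch of the retry gate; the function name and library
      strings are hypothetical:

      ```go
      package ml

      // shouldRetryDiscovery reports whether device discovery for a backend is
      // worth retrying. Only ROCm (AMD) probes can crash on unsupported
      // devices; CUDA and Vulkan fail gracefully.
      func shouldRetryDiscovery(library string) bool {
          return library == "ROCm"
      }
      ```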
      vulkan: enable flash attention (#12937) · a4770107
      Daniel Hiltgen authored
      Also adjusts the Vulkan Windows build pattern to match recent changes in
      other backends, so incremental builds are faster.
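
      A minimal Go sketch of what such a capability gate might look like; the
      function name and library strings are hypothetical:

      ```go
      package ml

      // flashAttentionSupported reports whether a backend can run flash
      // attention, with Vulkan newly added alongside CUDA and ROCm.
      func flashAttentionSupported(library string) bool {
          switch library {
          case "CUDA", "ROCm", "Vulkan":
              return true
          default:
              return false
          }
      }
      ```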
  5. 31 Oct, 2025 1 commit
  6. 28 Oct, 2025 1 commit
      Fix vulkan PCI ID and ID handling (#12775) · 14977a93
      Daniel Hiltgen authored
      * Fix vulkan PCI ID and ID handling
      
      Intel GPUs may not report PCI IDs, which was leading to incorrect overlap
      detection, so switch to using the existing PCI IDs. AMD GPUs claim not to
      report PCI IDs but actually do, so try anyway; this is required for ADLX
      to find the GPUs on Windows. Numeric IDs lead to scheduling problems, so
      this also switches Vulkan to UUID-based IDs (a sketch of the fallback
      order follows this entry). The GPU discovery patches have been squashed
      into a single patch to simplify future rebases.
      
      * review comments
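
      A minimal Go sketch of the ID fallback order, with a hypothetical
      vulkanDevice type:

      ```go
      package ml

      import "strconv"

      type vulkanDevice struct {
          UUID  string
          PCIID string
          Index int
      }

      // deviceID prefers a stable UUID, then a PCI ID, and falls back to the
      // numeric index only as a last resort, since numeric IDs vary with
      // enumeration order and confuse the scheduler.
      func deviceID(d vulkanDevice) string {
          switch {
          case d.UUID != "":
              return d.UUID
          case d.PCIID != "":
              return d.PCIID
          default:
              return strconv.Itoa(d.Index)
          }
      }
      ```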
  7. 23 Oct, 2025 1 commit
      DRY out the runner lifecycle code (#12540) · 3258a89b
      Daniel Hiltgen authored
      * DRY out the runner lifecycle code
      
      Now that discovery uses the runners as well, this unifies the runner
      spawning code into a single place (sketched after this list). It also
      unifies the GPU discovery types with the newer ml.DeviceInfo.
      
      * win: make incremental builds better
      
      Place build artifacts in discrete directories so incremental builds don't
      have to start fresh.
      
      * Adjust sort order to consider iGPUs
      
      * handle CPU inference OOM scenarios
      
      * review comments
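
      A minimal Go sketch of a single spawn path shared by discovery and
      inference; the names here are hypothetical:

      ```go
      package runners

      import "os/exec"

      type runnerOpts struct {
          Binary string
          Args   []string
          Env    []string
      }

      // spawn is the one entry point used both when probing devices and when
      // starting an inference runner, so lifecycle and environment handling
      // live in a single place.
      func spawn(opts runnerOpts) (*exec.Cmd, error) {
          cmd := exec.Command(opts.Binary, opts.Args...)
          cmd.Env = opts.Env
          if err := cmd.Start(); err != nil {
              return nil, err
          }
          return cmd, nil
      }
      ```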
  8. 01 Oct, 2025 1 commit
      Use runners for GPU discovery (#12090) · bc8909fb
      Daniel Hiltgen authored
      This revamps how we discover GPUs in the system by leveraging the Ollama
      runner (see the sketch below). This should eliminate inconsistency between
      our GPU discovery and the runners' capabilities at runtime, particularly
      for cases where we try to filter out unsupported GPUs: now the runner does
      that implicitly based on the actual device list. In some cases free VRAM
      reporting can be unreliable, which can lead to scheduling mistakes, so this
      also includes a patch to leverage more reliable VRAM reporting libraries
      where available.
      
      Automatic workarounds have been removed, as only one GPU relied on them;
      that workaround is now documented. This GPU will soon fall off the support
      matrix with the next ROCm bump.
      
      Additional cleanup of the scheduler and discovery packages can be done in
      the future, once we have switched on the new memory management code and
      removed support for the llama runner.
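
      A minimal Go sketch of runner-based discovery; the --list-devices flag,
      the JSON shape, and the DeviceInfo fields here are all hypothetical
      stand-ins for ml.DeviceInfo:

      ```go
      package discover

      import (
          "encoding/json"
          "os/exec"
      )

      // DeviceInfo is an illustrative stand-in for the real ml.DeviceInfo type.
      type DeviceInfo struct {
          ID       string `json:"id"`
          Library  string `json:"library"`
          FreeVRAM uint64 `json:"free_vram"`
      }

      // discoverViaRunner launches a runner binary in a discovery mode and
      // decodes the device list it reports. Unsupported GPUs simply never
      // appear, so discovery matches the runner's capabilities by construction.
      func discoverViaRunner(runnerBin string) ([]DeviceInfo, error) {
          out, err := exec.Command(runnerBin, "--list-devices").Output()
          if err != nil {
              return nil, err
          }
          var devices []DeviceInfo
          if err := json.Unmarshal(out, &devices); err != nil {
              return nil, err
          }
          return devices, nil
      }
      ```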