  1. 06 Nov, 2025 5 commits
  2. 05 Nov, 2025 9 commits
  3. 04 Nov, 2025 5 commits
    • discovery: only retry AMD GPUs (#12894) · 27f1fde4
      Daniel Hiltgen authored
      * discovery: only retry AMD GPUs
      
      CUDA and Vulkan don't crash on unsupported devices, so retry isn't necessary.
      This also refactors the code to shift the Library specific logic into the ml
      package.
      
      * review comments
    • vulkan: Add memory detection for Intel GPU using DXGI+PDH (#12664) · 220e133f
      virajwad authored
      * PDH free memory skeleton
      
      * Add PDH printing
      
      * Add LUID support for Vulkan
      
      * wire luid from ggml-vulkan to mem-dxgi-pdh file
      
      * Fix to ggml-impl
      
      * Continue skeleton
      
      * Implemented ggml_dxgi_pdh_get_device_memory
      
      * fix comments
      
      * Fix - change value GB to bytes
      
      * add ifdefs to only support windows and not linux
      
      * modify error codes
      
      * Finished ggml_dxgi_pdh_init() function
      
      * completed ggml_dxgi_pdh_release()
      
      * Formatting changes, add static to functions
      
      * fix build errors
      
      * fix go build error
      
      * fix luid - now should match between dxgi and vulkan
      
      * Fix the free memory reporting (was using copy by value, change to reference)
      
      * keep only dxgi1_2.h
      
      * Modifications based on PR feedback
      
      * fix merge conflicts (2) and fix desc1.description printout
      
      * move dxgi + pdh api calls to before the vendor specific library calls
      
      * change from 3 samples to 1 sample for PDH
      
      * modify when old_mode is set
      
      * add fix for building macOS
      
      * fix release and returns for other vendors
      
      * add patch file
    • app: add code for macOS and Windows apps under 'app' (#12933) · d3b4b997
      Daniel Hiltgen authored
      
      * app: add code for macOS and Windows apps under 'app'
      
      * app: add readme
      
      * app: windows and linux only for now
      
      * ci: fix ui CI validation
      
      ---------
      Co-authored-by: jmorganca <jmorganca@gmail.com>
    • vulkan: enable flash attention (#12937) · a4770107
      Daniel Hiltgen authored
      Also adjusts the Vulkan Windows build pattern to match recent changes in other
      backends, so incremental builds are faster.
    • ggml: Increase maximum graph size · ef549d51
      Jesse Gross authored
      The initial implementation of qwen3-vl:235b exceeded the maximum graph
      size based on the number of tensors. Although this was later fixed
      through the use of the mrope operation, we are close to the limit in
      some cases. This updates the limit to track current llama.cpp usage of GGML.
  4. 03 Nov, 2025 3 commits
  5. 02 Nov, 2025 1 commit
  6. 31 Oct, 2025 4 commits
  7. 30 Oct, 2025 11 commits
    • win: avoid ID mixups on refresh (#12869) · db973c8f
      Daniel Hiltgen authored
      On Windows, AMD IDs are numeric and can reorder based on the filter environment.
      By passing the filter env on a full discovery refresh, we only look at the actual
      devices and ignore unsupported iGPUs. Without this, on some systems iGPU VRAM was
      incorrectly used to populate the dGPU.
    • ggml: Enable op_offload to improve partial offload performance · afaf7ce8
      Jesse Gross authored
      When a model is partially offloaded to system RAM, we can either
      do the calculations on the CPU or we can temporarily transfer the
      data to the GPU to do the calculations there. Small batches tend
      to be better on the CPU, large batches on the GPU.
      
      The llamarunner used the GPU in most cases and the ollamarunner
      used the CPU. Although the ollamarunner saw an improvement in
      token generation performance, there was a large performance hit
      in prompt processing (3-10x).
      
      There is an existing heuristic to dynamically switch between these
      two modes but in practice it doesn't have enough information to
      accurately make that decision. This adds authoritative data to make
      the check work to get the best of both worlds.
      
      Fixes #12037
    • ollamarunner: Worst case batch for token generation · 26465fb8
      Jesse Gross authored
      We currently allocate the worst case batch for max sized
      batches, which corresponds to prompt processing. However,
      there are some cases where the generated graph is different
      for small and large batches. To ensure that we don't need
      to allocate memory later after layout has taken place, we
      should run the worst case batch both ways and take the larger
      amount of memory.
      
      This does not noticeably affect loading speed as the most expensive
      part of this logic is from image processing and that does not
      occur during token generation.
    • win: use copy for subprocess logs (#12864) · 88236bc0
      Daniel Hiltgen authored
      Windows gets confused when we hand the stderr file descriptor to subprocess
      children. This ensures the log output always shows up.
    • Patrick Devine authored · 76eb7d0f
    • interleaved mrope (#12807) · f67a6df1
      Michael Yang authored
      * ml(ggml): mrope
      * interleave mrope
    • Michael Yang authored · 75e75d9a
    • fix(cmd): unload model before removal (#12832) · ed78e127
      Michael Yang authored
      this change fixes two bugs with `ollama rm`:
      
      1. before a model is removed, it is first stopped; previously this only
         happened for the first argument and was skipped for all other models
      2. models were unloaded indiscriminately; this errors for cloud models
         and should be omitted
    • fix: qwen2.5vl, qwen3vl composite image (#12841) · d432ade7
      Michael Yang authored
      this change fixes images with an alpha channel by overlaying the image
      onto a white background
    • tests: add tests and docs for commonly used ops (#12844) · 06b3422d
      Michael Yang authored
      * mulmat
      * permute
    • Update README.md (#12822) · cbe1cf06
      Athiban Sharon authored
      Fixed broken docs links
  8. 29 Oct, 2025 2 commits