  1. 15 Dec, 2025 2 commits
  2. 13 Dec, 2025 2 commits
  3. 12 Dec, 2025 10 commits
  4. 11 Dec, 2025 5 commits
  5. 10 Dec, 2025 6 commits
  6. 09 Dec, 2025 5 commits
  7. 08 Dec, 2025 5 commits
  8. 06 Dec, 2025 1 commit
  9. 05 Dec, 2025 1 commit
  10. 04 Dec, 2025 3 commits
    • 9191dfaf
    • ggml: Enable flash attention for vision encoders · 1108d8b3
      Jesse Gross authored
      Although the vision component of multimodal models typically already
      calls the optimized nn.Attention, it is converted into non-fused
      operations. That is because the backend-specific fused kernels may
      have requirements, such as padding, which are normally handled by the
      cache, and vision encoders don't use a cache.
      
      This implements a fallback path in the backend, softening the
      requirements into optimizations. In turn, this allows flash attention
      to be used for vision encoders, saving a significant amount of VRAM
      and improving performance.
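      A minimal sketch of the idea in Go (the names roundUp, attend,
      fusedAttention, unfusedAttention, and kernelAlign are hypothetical,
      not taken from the ollama codebase): the backend pads the sequence
      length itself when the caller hasn't, so a fused kernel's alignment
      requirement becomes an internal optimization rather than a
      precondition that only the cache can satisfy.

      // Hypothetical sketch, not ollama's backend code: pad on the fly so a
      // fused-kernel alignment requirement no longer has to be met by the
      // cache (which vision encoders don't use).
      package main

      import "fmt"

      // kernelAlign is the alignment the fused kernel prefers (assumed here).
      const kernelAlign = 256

      // roundUp rounds n up to the next multiple of align.
      func roundUp(n, align int) int {
          return (n + align - 1) / align * align
      }

      // attend chooses between the fused and unfused paths; fusedAttention
      // and unfusedAttention stand in for real backend kernels.
      func attend(seqLen int, fusedAvailable bool) string {
          if !fusedAvailable {
              return unfusedAttention(seqLen)
          }
          // Pad here instead of requiring the caller to have done it.
          return fusedAttention(roundUp(seqLen, kernelAlign))
      }

      func fusedAttention(paddedLen int) string {
          return fmt.Sprintf("flash attention over %d (padded) positions", paddedLen)
      }

      func unfusedAttention(seqLen int) string {
          return fmt.Sprintf("unfused attention over %d positions", seqLen)
      }

      func main() {
          fmt.Println(attend(1037, true))  // 1037 is padded to 1280 internally
          fmt.Println(attend(1037, false)) // falls back to the unfused path
      }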
    • ggml: Always set cache padding to 256 · 7837a5bc
      Jesse Gross authored
      We currently use cache padding of 32 without flash attention and 256
      with flash attention, based on the historical alignment requirements
      of those kernels. The restrictions have since been loosened, but there
      are still performance benefits, such as better CUDA graph reuse.
      
      Since the requirement is no longer kernel-specific, set the padding
      uniformly to 256, as llama.cpp does.
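      A minimal sketch in Go of uniform cache padding (cachePadding and
      paddedLength are hypothetical names, not the actual kvcache code):
      rounding the cached length up to a single multiple keeps tensor shapes
      stable as the cache grows, which is what makes CUDA graph reuse easier.

      // Hypothetical sketch, not ollama's kvcache code: round the cache
      // length up to one uniform padding (256), matching llama.cpp's choice.
      package main

      import "fmt"

      const cachePadding = 256

      // paddedLength rounds the number of cached positions up to a multiple
      // of cachePadding.
      func paddedLength(numCached int) int {
          return (numCached + cachePadding - 1) / cachePadding * cachePadding
      }

      func main() {
          // From 1 to 300 cached positions the padded length takes only two
          // values (256 and 512), so attention tensor shapes, and therefore
          // captured CUDA graphs, repeat across steps.
          for _, n := range []int{1, 200, 256, 257, 300} {
              fmt.Printf("cached=%d padded=%d\n", n, paddedLength(n))
          }
      }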