04 Dec, 2025 2 commits
    • 9191dfaf
    • ggml: Enable flash attention for vision encoders · 1108d8b3
      Jesse Gross authored
      Although the vision component of multimodal models typically already
      calls the optimized nn.Attention, it is converted into non-fused
      operations. That is because the backend-specific fused kernels may
      have requirements, such as padding, which are normally satisfied by
      the cache, and vision encoders don't use a cache.
      
      This implements a fallback path in the backend, softening the
      requirements into optimizations. In turn, this allows flash attention
      to be used for vision encoders, saving a significant amount of VRAM
      and improving performance.
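
      As a rough illustration of the "requirement softened into an
      optimization" idea, the Go sketch below pads unaligned key/value
      tensors inside the attention call instead of insisting that the
      caller (normally the KV cache) do it first. The Tensor type,
      blockSize, and function names here are hypothetical stand-ins, not
      ollama's or ggml's actual API.

      package main

      import "fmt"

      // Minimal stand-in for a key/value tensor; only the sequence
      // length matters for this sketch.
      type Tensor struct {
          SeqLen int
      }

      // Stand-in kernel constraint: the fused flash-attention kernel
      // wants the K/V sequence length to be a multiple of blockSize.
      const blockSize = 256

      // padTo grows the sequence length to the next multiple of
      // `multiple`, standing in for padding with masked-out positions.
      func padTo(t Tensor, multiple int) Tensor {
          if rem := t.SeqLen % multiple; rem != 0 {
              t.SeqLen += multiple - rem
          }
          return t
      }

      // flashAttention treats the padding requirement as an optimization:
      // if the caller already padded K and V (as a KV cache normally
      // would), they are used directly; otherwise they are padded here as
      // a fallback so the fused kernel can still run. Vision encoders,
      // which don't use the cache, take the fallback path.
      func flashAttention(q, k, v Tensor) Tensor {
          if k.SeqLen%blockSize != 0 {
              k = padTo(k, blockSize)
              v = padTo(v, blockSize)
          }
          // the fused kernel would run here with conforming shapes;
          // the output keeps the query's sequence length
          return Tensor{SeqLen: q.SeqLen}
      }

      func main() {
          // e.g. a ViT-style patch sequence that is not block-aligned
          q := Tensor{SeqLen: 577}
          fmt.Println(flashAttention(q, q, q)) // {577}
      }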