    ggml: Enable flash attention for vision encoders · 1108d8b3
    Jesse Gross authored
    Although the vision component of multimodal models typically already
    calls the optimized nn.Attention, that call is converted into non-fused
    operations. This is because the backend-specific fused kernels may
    have requirements, such as padding, and these are handled by the
    cache, which vision encoders don't use.
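    
    As a rough illustration of the dispatch this describes (hypothetical
    names and interfaces; the real types live in this repo's ml package),
    the attention helper might choose between a fused kernel and a
    decomposed path like this:
    
        package attention
        
        import "math"
        
        // Tensor is a hypothetical stand-in for the backend's tensor type.
        type Tensor interface {
            MulMat(other Tensor) Tensor // matrix multiply
            Scale(s float64) Tensor     // elementwise scale
            Softmax() Tensor            // softmax over the last dimension
            Dim(i int) int              // size of dimension i
        }
        
        // FusedAttention is an optional interface a backend can implement
        // when it has a flash-attention kernel available.
        type FusedAttention interface {
            ScaledDotProductAttention(q, k, v Tensor) Tensor
        }
        
        // Attention uses the fused kernel when the backend offers one and
        // otherwise decomposes into unfused matmul/softmax operations,
        // which is the conversion the commit message describes.
        func Attention(backend any, q, k, v Tensor) Tensor {
            if fused, ok := backend.(FusedAttention); ok {
                return fused.ScaledDotProductAttention(q, k, v)
            }
            scale := 1.0 / math.Sqrt(float64(q.Dim(0))) // 1/sqrt(head dim)
            scores := k.MulMat(q).Scale(scale)          // scaled K^T·Q
            return v.MulMat(scores.Softmax())           // weighted sum of V
        }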
    
    This implements a fallback path in the backend, softening these
    requirements into optimizations. In turn, this allows flash attention
    to be used for vision encoders, saving a significant amount of VRAM
    and improving performance.
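    
    A minimal sketch of that softening, assuming a hypothetical padding
    granularity of 256 and stand-in tensor types: rather than requiring
    callers (normally the KV cache) to deliver pre-padded tensors, the
    backend pads internally when shapes don't meet the kernel's
    preference.
    
        package backend
        
        // tensor is a hypothetical in-memory stand-in for a ggml tensor.
        type tensor struct {
            rows, cols int
            data       []float32
        }
        
        // padRows zero-pads t's rows up to a multiple of granularity; the
        // padded region would be masked out inside the fused kernel.
        func padRows(t *tensor, granularity int) *tensor {
            rows := ((t.rows + granularity - 1) / granularity) * granularity
            out := &tensor{rows: rows, cols: t.cols, data: make([]float32, rows*t.cols)}
            copy(out.data, t.data)
            return out
        }
        
        // fusedAttention pads K and V itself when needed, turning a hard
        // requirement on the caller into an internal optimization.
        func fusedAttention(q, k, v *tensor) *tensor {
            const granularity = 256 // assumed kernel preference
            if k.rows%granularity != 0 {
                k = padRows(k, granularity)
                v = padRows(v, granularity)
            }
            return flashAttentionKernel(q, k, v)
        }
        
        func flashAttentionKernel(q, k, v *tensor) *tensor {
            // Stand-in for the real fused kernel call.
            return q
        }
    
    With this in place, vision encoders can hit the fused path even
    though they never route their tensors through the cache.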