    attention: Remove unnecessary contiguous operations · 854a9195
    Jesse Gross authored
    Prior to performing attention, we need to permute query, key
    and value. Currently we call Contiguous after each of these
    permutations, which is correct but expensive. Avoiding the
    3 calls to Contiguous increases performance by over 20%.
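
    For context, a minimal sketch of the prior pattern, assuming an
    ml.Tensor-style API with Permute, Contiguous and Mulmat (the
    permutation axes shown are illustrative, not taken from the diff):

        // Before: each permuted view was immediately materialized,
        // paying for a full copy per tensor.
        q = q.Permute(ctx, 0, 2, 1, 3).Contiguous(ctx)
        k = k.Permute(ctx, 0, 2, 1, 3).Contiguous(ctx)
        v = v.Permute(ctx, 1, 2, 0, 3).Contiguous(ctx)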
    
    The permutations of query and key do not violate the continuity
    rules for mulmat, so their Contiguous calls can simply be removed.
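
    Under the same assumptions, the query/key side becomes:

        // After: the permuted views already satisfy mulmat's continuity
        // rules, so they are passed to it directly.
        q = q.Permute(ctx, 0, 2, 1, 3)
        k = k.Permute(ctx, 0, 2, 1, 3)
        kq := k.Mulmat(ctx, q)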
    
    Value requires a different permutation and does require Contiguous.
    However, the copy into the cache can perform this permutation without
    further overhead.
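
    A sketch of the value side; Put and Get are stand-ins for whatever
    the cache interface actually exposes:

        // The value layout change rides along with the copy the cache
        // makes anyway, instead of a separate Permute+Contiguous here.
        cache.Put(ctx, k, v)
        key, value, mask := cache.Get(ctx)
        // key and value come back in the layout attention expects.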
    
    To support this and avoid exposing unexpected tensor shapes to models,
    we need tighter integration between attention, the cache and the
    backend. Future optimizations will also likely need this structure -
    for example, flash attention has special padding requirements in the
    cache, and other backends may have their own needs.
    
    This change further contains the operations that go into attention so
    that these and other optimizations can be handled transparently. Models
    that have special requirements for attention can still implement their
    own version of it.
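
    A sketch of what such a contained entry point could look like; the
    function name, signature and cache calls are assumptions for
    illustration, not necessarily the committed API:

        // Hypothetical contained attention helper: models hand query, key,
        // value and the cache to one place, which owns the permutations,
        // the cache copies and any backend-specific layout needs.
        func Attention(ctx ml.Context, query, key, value ml.Tensor,
                scale float64, cache kvcache.Cache) ml.Tensor {
                // The copy into the cache doubles as the value permutation.
                cache.Put(ctx, key, value)
                key, value, mask := cache.Get(ctx)

                scores := key.Mulmat(ctx, query)
                scores = scores.Scale(ctx, scale)
                if mask != nil {
                        scores = scores.Add(ctx, mask)
                }
                scores = scores.Softmax(ctx)

                return value.Mulmat(ctx, scores)
        }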