    Saving some VRAM. (#2790) · b57f3703
    Nicolas Patry authored
    * Saving some VRAM.
    
    - 8B model on 4xL4 with attention=flashdecoding: 4.28GB of VRAM left
      before, 4.32GB left after, i.e. roughly 40MB saved per GPU.
    
    - The effect is less visible with attention=flashinfer and n_shard=1;
      I suspect this is linked to the torch caching allocator (see the
      sketch after this list).
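    For context, a minimal sketch (not the code from this commit) of how
    this kind of saving can be measured, and why the caching allocator can
    hide it. torch.cuda.mem_get_info and torch.cuda.empty_cache are real
    torch APIs; the scratch buffer is a hypothetical stand-in for warmup
    memory:

        import torch

        def free_vram_gb(device: int = 0) -> float:
            # Free VRAM as reported by the CUDA driver, in GB.
            free_bytes, _total = torch.cuda.mem_get_info(device)
            return free_bytes / 1024**3

        before = free_vram_gb()

        # Hypothetical temporary buffer standing in for warmup scratch memory.
        scratch = torch.empty(512 * 1024 * 1024, dtype=torch.uint8, device="cuda")
        del scratch

        # Without this, the caching allocator keeps the freed block for reuse,
        # so the driver-level free-memory number barely moves -- which may be
        # why the saving was less visible with flashinfer / n_shard=1.
        torch.cuda.empty_cache()

        after = free_vram_gb()
        print(f"before={before:.2f}GB after={after:.2f}GB saved={after - before:.2f}GB")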
    
    * Adding an assertion.
Changed file: flash_causal_lm.py