Nicolas Patry authored
* Saving some VRAM.
  - 8B on 4xL4 with attention=flashdecoding: before, 4.28GB left; after, 4.32GB left, so 400MB saved.
  - The effect is not as visible with attention=flashinfer and n_shard=1. I suspect it's linked to the torch allocator.
* Adding an assertion.
b57f3703
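The before/after figures read like free-device-memory measurements. A minimal sketch of how such a "GB left" number can be obtained with `torch.cuda.mem_get_info` — an assumption about the measurement method, not the commit's actual instrumentation:

```python
def format_free_vram(free_bytes: int) -> str:
    """Format free device memory in GB, matching figures like '4.28GB left'."""
    return f"{free_bytes / 1024**3:.2f}GB left"

def report_free_vram(device: int = 0) -> str:
    # Imported lazily so format_free_vram stays usable without CUDA.
    # torch.cuda.mem_get_info returns (free_bytes, total_bytes) for the device.
    import torch
    free_bytes, _total_bytes = torch.cuda.mem_get_info(device)
    return format_free_vram(free_bytes)

print(format_free_vram(int(4.28 * 1024**3)))  # → 4.28GB left
```

Note that "free" here is what the driver reports; memory cached by the torch allocator but not in active use still counts as allocated, which may explain why savings show up differently across attention backends.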