"vscode:/vscode.git/clone" did not exist on "462956be7b057ba1d156e9405289c39db56106bb"
Commit 2d5b2483 authored by Dan Fu

Speedup graph for A100, d128

parent 5d07483b
@@ -71,6 +71,14 @@ Memory savings are proportional to sequence length -- since standard attention has memory quadratic in sequence length, whereas FlashAttention has memory linear in sequence length.
We see 10X memory savings at sequence length 2K, and 20X at 4K.
As a result, FlashAttention can scale to much longer sequence lengths.
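As a rough illustration of why the savings grow with sequence length: standard attention materializes a seqlen x seqlen score matrix per head, so its extra memory is quadratic in sequence length, while FlashAttention's memory stays linear. The sketch below uses illustrative shapes (batch 16, 12 heads, head dimension 64, fp16), not necessarily the exact configuration behind the memory graph.

```python
# Back-of-the-envelope sketch with assumed shapes (batch 16, 12 heads,
# head dimension 64, fp16) -- illustrative values, not the graph's exact setup.
batch, heads, headdim, bytes_per_el = 16, 12, 64, 2  # fp16 = 2 bytes per element

for seqlen in (1024, 2048, 4096):
    # Standard attention stores the (seqlen x seqlen) score matrix for every head.
    score_matrix_gb = batch * heads * seqlen * seqlen * bytes_per_el / 1e9
    # Q, K, V themselves only grow linearly with sequence length.
    qkv_gb = 3 * batch * heads * seqlen * headdim * bytes_per_el / 1e9
    print(f"seqlen {seqlen}: score matrix {score_matrix_gb:.2f} GB, Q/K/V {qkv_gb:.2f} GB")
```

Doubling the sequence length quadruples the score-matrix memory but only doubles the Q/K/V memory, which is why the savings keep growing with sequence length.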
#### Head Dimension 128
![FlashAttention speedup, head dimension 128](assets/flashattn_speedup_a100_d128.jpg)
We show speedup with head dimension 128, using batch size 16 with 12 heads.
Speedup is smaller than for the smaller head dimensions, but it is still significant -- especially with a causal mask.
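For readers who want to reproduce a point on this graph, here is a minimal timing sketch, not the repository's benchmark script. It assumes a recent PyTorch whose fused FlashAttention path is reachable through `torch.nn.functional.scaled_dot_product_attention`, times the forward pass only, and matches the setting above (batch 16, 12 heads, head dimension 128, sequence length 2K); the naive `standard_attention` baseline and `time_fn` helper are ours.

```python
import torch
import torch.nn.functional as F

batch, heads, seqlen, headdim = 16, 12, 2048, 128
q, k, v = (torch.randn(batch, heads, seqlen, headdim, device="cuda", dtype=torch.float16)
           for _ in range(3))

def standard_attention(q, k, v, causal=True):
    # Materializes the full (seqlen x seqlen) score matrix -- the memory- and
    # bandwidth-heavy step that FlashAttention avoids.
    scores = (q @ k.transpose(-2, -1)) / headdim ** 0.5
    if causal:
        mask = torch.triu(torch.ones(seqlen, seqlen, device=q.device, dtype=torch.bool),
                          diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v

def time_fn(fn, iters=30):
    for _ in range(5):  # warmup
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average ms per call

t_std = time_fn(lambda: standard_attention(q, k, v))
t_fused = time_fn(lambda: F.scaled_dot_product_attention(q, k, v, is_causal=True))
print(f"standard: {t_std:.2f} ms  fused: {t_fused:.2f} ms  speedup: {t_std / t_fused:.1f}x")
```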
### RTX 3090
For the RTX 3090, we use batch size 12 with 12 attention heads.