"llm/llama.cpp/models/ggml-vocab-refact.gguf.out" did not exist on "b0135f4b9b176eab9155b660d04c9ca2a1ec2341"
  • jayhshah's avatar
    Fp8 kernel with "in-kernel" transpose of V in producer (#1100) · 5018ac6a
    jayhshah authored
    * base version
    
    * restructure pipelines, add special fp8 epilogue
    
    * add variants
    
    * add fp8 causal and modify dynamic tile scheduler
    
    * better causal schedule
    
    * maintain two schedules for non causal and causal
    
    * removing macros
    
    * fix regression
    
    * clean up unneeded methods and variants
    
    * fix mistake with NumProducerThreads
    
    * use seqlen traits
    
    * add fp8 .cu files and benchmark script
    
    * fix merge issue
    
    * fix merge issue
    
    * fix merge issue
    
    * remove duplicate code
    
    * fix regression with varseqlen
    
    * move varseqlen init in constexpr
    
    * fix test script
    
    * more constexpr on varseqlen and add max offset
    
    * add back test cases
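Not part of the commit itself: below is a minimal, hypothetical sketch of the kind of accuracy check the fp8 causal work above implies, written in plain PyTorch rather than against the repository's CUDA kernels. It emulates e4m3 quantization of Q/K/V with a per-tensor scale (the quantize_dequantize_e4m3 helper and all shapes are made up for illustration) and compares causal attention on the quantized inputs with a bf16 reference via torch.nn.functional.scaled_dot_product_attention. It assumes a recent PyTorch with torch.float8_e4m3fn support and a CUDA device.

```python
# Hypothetical sketch (not from this PR): round-trip Q/K/V through float8 e4m3
# and measure the error against a bf16 reference attention. PyTorch's SDPA is
# used as a stand-in reference; it is not the repository's fp8 kernel.
import torch
import torch.nn.functional as F


def quantize_dequantize_e4m3(x: torch.Tensor) -> torch.Tensor:
    # Per-tensor scale so values fit the e4m3 dynamic range (max ~448),
    # then cast to float8_e4m3fn and back to the original dtype.
    scale = x.abs().amax().clamp(min=1e-12) / 448.0
    return (x / scale).to(torch.float8_e4m3fn).to(x.dtype) * scale


def main() -> None:
    torch.manual_seed(0)
    # Illustrative shapes only (batch, heads, seqlen, headdim).
    batch, heads, seqlen, headdim = 2, 8, 512, 128
    q, k, v = (
        torch.randn(batch, heads, seqlen, headdim, device="cuda", dtype=torch.bfloat16)
        for _ in range(3)
    )

    # bf16 reference, causal (matching the causal variants mentioned above).
    ref = F.scaled_dot_product_attention(q, k, v, is_causal=True)

    # Same op on fp8-emulated inputs.
    out = F.scaled_dot_product_attention(
        quantize_dequantize_e4m3(q),
        quantize_dequantize_e4m3(k),
        quantize_dequantize_e4m3(v),
        is_causal=True,
    )

    err = (out - ref).abs().max().item()
    print(f"max abs error vs bf16 reference: {err:.4e}")


if __name__ == "__main__":
    main()
```

This only illustrates the precision trade-off the fp8 variants target; it does not benchmark the actual kernels (that is what the benchmark script added in this PR is for).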