[BugFix] Fix precision issue in GQA decode when block_N exceeds seqlen/num_split (#575)
* [CI] Add flash_decoding example to CI
* Add output of ref latency
* format example_gqa_decode.py
* [BugFix] Fix precision issue in GQA decode when block_N exceeds seqlen/num_split
* format example_gqa_decode.py
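
The commit message does not include the kernel change itself. As a rough illustration of the failure mode named in the title, the pure-Python sketch below models split-KV decode with an online softmax. It assumes that when `block_n` is larger than `seqlen // num_split`, an unmasked tile reads past the end of its split (neighbouring positions, or zero padding past the sequence end). The function names, the `mask_tail` flag, and the padding behaviour are hypothetical and are not taken from `example_gqa_decode.py`.

```python
import math

def reference(scores, values):
    """Plain softmax-weighted average over the whole sequence."""
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)

def split_decode(scores, values, num_split, block_n, mask_tail=True):
    """Split-KV decode sketch; assumes seqlen is divisible by num_split."""
    seqlen = len(scores)
    split_len = seqlen // num_split
    partials = []
    for s in range(num_split):
        start, end = s * split_len, (s + 1) * split_len
        m, l, acc = -math.inf, 0.0, 0.0  # running max, sum of exps, weighted sum
        for tile in range(start, end, block_n):
            for i in range(tile, tile + block_n):
                if i >= end and mask_tail:
                    continue  # masked position: contributes nothing
                # Unmasked tail (assumed behaviour): read whatever lies past
                # the split, or zero padding past the end of the sequence.
                sc = scores[i] if i < seqlen else 0.0
                va = values[i] if i < seqlen else 0.0
                m_new = max(m, sc)
                scale = math.exp(m - m_new)  # 0.0 on the first element
                p = math.exp(sc - m_new)
                l = l * scale + p
                acc = acc * scale + p * va
                m = m_new
        partials.append((m, l, acc))
    # Combine the per-split partial results with a log-sum-exp rescale.
    m_all = max(p[0] for p in partials)
    l_all = sum(l * math.exp(m - m_all) for m, l, _ in partials)
    acc_all = sum(a * math.exp(m - m_all) for m, _, a in partials)
    return acc_all / l_all

scores = [0.5, -1.0, 2.0, 0.3, 1.2, -0.7, 0.9, 1.5]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ref = reference(scores, values)
masked = split_decode(scores, values, num_split=4, block_n=4, mask_tail=True)
unmasked = split_decode(scores, values, num_split=4, block_n=4, mask_tail=False)
print(abs(masked - ref), abs(unmasked - ref))
```

Here `seqlen // num_split` is 2 and `block_n` is 4, so every split's single tile overruns its boundary. With the tail masked the result agrees with the reference; without masking the extra positions are counted and the output drifts.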