# Attention Sink

We compare against an optimized version of the official Triton implementation, available [here](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).

## Algorithm

### Forward

The only change from vanilla FlashAttention is that `sinks` must be taken into account in the softmax, which requires an extra rescaling in the epilogue stage.

### Backward

A detailed mathematical derivation shows that, interestingly, the backward computation of `dQ`, `dK`, and `dV` is almost identical to that of vanilla FlashAttention, except that the precise meaning of `lse` differs. The only additional quantity to compute is `dsinks`, given by:

$$
dsink_h = -\sum_{b}\sum_{q} P_{b, h, q}\, \Delta_{b, h, q}
$$

where $P_{b, h, q}$ is the proportion of $sink_h$ in the softmax for the $b$-th batch, $h$-th head, and $q$-th query (row). Reference PyTorch sketches of the forward softmax-with-sink and of this `dsinks` reduction are given at the end of this document.

## Benchmark of the forward pass

### Benchmark environment

- **Hardware**: NVIDIA H800
- **CUDA version**: 12.9
- **Triton version**: 3.4.0

### Results

- dtype=float16
- batch_size=1, heads=64, kv_heads=8 (the setting of GPT-OSS-120B)
- Full attention is used.

| SEQ_LEN | headdim | Triton TFLOPs | TileLang TFLOPs | Speedup |
|---------|---------|---------------|-----------------|---------|
| 2048    | 64      | 231.55        | **277.07**      | 1.20x   |
| 2048    | 128     | 313.55        | **393.98**      | 1.26x   |
| 4096    | 64      | 272.17        | **337.30**      | 1.24x   |
| 4096    | 128     | 356.35        | **461.54**      | 1.30x   |
| 8192    | 64      | 289.93        | **353.81**      | 1.22x   |
| 8192    | 128     | 392.18        | **482.50**      | 1.23x   |
| 16384   | 64      | 299.52        | **377.44**      | 1.26x   |
| 16384   | 128     | 404.64        | **519.02**      | 1.28x   |

> The backward performance will be further optimized via fine-grained manual pipelining in the style of FA3 in the TileLang kernel.
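
The following is a minimal, non-fused PyTorch sketch of the forward pass described in the Algorithm section: the per-head sink logit is appended as one extra softmax column that absorbs probability mass but contributes no value vector. The function name `attention_with_sinks_ref`, the `[batch, heads, seq, dim]` layout, and the absence of masking and GQA expansion are assumptions for illustration, not the exact conventions of the kernels in this example.

```python
import torch


def attention_with_sinks_ref(q, k, v, sinks, sm_scale):
    """Naive attention with per-head sink logits (no masking, no GQA).

    q, k, v: [batch, heads, seq, dim]; sinks: [heads] (one logit per head).
    """
    scores = torch.einsum("bhqd,bhkd->bhqk", q.float(), k.float()) * sm_scale
    # Append the sink logit as one extra "virtual" key column per head.
    b, h, s_q, _ = scores.shape
    sink_col = sinks.float().view(1, h, 1, 1).expand(b, h, s_q, 1)
    logits = torch.cat([scores, sink_col], dim=-1)
    probs = torch.softmax(logits, dim=-1)
    # The sink column soaks up probability mass but has no value vector,
    # so it is dropped before the PV product.
    out = torch.einsum("bhqk,bhkd->bhqd", probs[..., :-1], v.float())
    return out.to(q.dtype)
```

In a fused FlashAttention-style kernel, this corresponds to the epilogue rescaling mentioned above: after the online-softmax loop has produced the running row max `m` and the running sum `l` over the real keys, the output accumulator is normalized by `l + exp(sink_h - m)` instead of `l`.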
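
Similarly, here is a minimal sketch of the `dsinks` reduction from the Backward section, assuming that `lse` already includes the sink contribution (so that $P_{b, h, q} = \exp(sink_h - lse_{b, h, q})$) and that $\Delta$ is the usual FlashAttention row-sum of $dO \odot O$; the shapes below are assumptions for illustration.

```python
import torch


def dsinks_ref(sinks, lse, o, do):
    """dsink_h = -sum_{b,q} P_{b,h,q} * Delta_{b,h,q}.

    sinks: [heads]; lse: [batch, heads, seq_q] (log-sum-exp including the sink term);
    o, do: [batch, heads, seq_q, dim].
    """
    p_sink = torch.exp(sinks.float().view(1, -1, 1) - lse.float())  # P_{b,h,q}
    delta = (o.float() * do.float()).sum(dim=-1)                    # Delta_{b,h,q} = rowsum(dO * O)
    return -(p_sink * delta).sum(dim=(0, 2))                        # one scalar per head
```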