"vscode:/vscode.git/clone" did not exist on "d9d2eb77ba99e663be9fe92a74d0ec0d1fc4caca"
README.md 2.06 KB
Newer Older
1
2
# Attention Sink

We compare with an optimized version of the official Triton implementation [here](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).


## Algorithm
### Forward
The only change from vanilla FlashAttention is that the `sinks` term must be included in the softmax normalizer, which requires an extra rescaling at the epilogue stage.
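
As a sketch of this rule (not the actual TileLang or Triton kernel; the tensor layout, the name `attention_with_sinks_ref`, and the assumption that query and key/value head counts match are all hypothetical), the per-head sink logit can be appended as one extra softmax column that is dropped before multiplying with `V`:

```python
import torch

def attention_with_sinks_ref(q, k, v, sinks):
    """Naive reference: the sink logit joins the softmax normalizer only.

    Assumed shapes: q, k, v are [batch, heads, seq, headdim]; sinks is [heads].
    GQA (heads != kv_heads) is ignored here for brevity.
    """
    scale = q.shape[-1] ** -0.5
    scores = torch.einsum("bhqd,bhkd->bhqk", q, k) * scale        # [b, h, q, k]
    # Append the per-head sink as one extra "virtual" column of logits.
    sink_col = sinks.view(1, -1, 1, 1).expand(*scores.shape[:-1], 1)
    logits = torch.cat([scores, sink_col], dim=-1)
    probs = torch.softmax(logits, dim=-1)
    # The sink only enlarges the normalizer; its column carries no value,
    # so it is dropped before multiplying with V. In the fused kernel this
    # shows up as the extra rescaling in the epilogue.
    return torch.einsum("bhqk,bhkd->bhqd", probs[..., :-1], v)
```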

### Backward
A detailed derivation shows that, interestingly, the backward computation of `dQ`, `dK`, and `dV` is almost identical to that of vanilla FlashAttention; the only difference is the meaning of `lse`, whose log-sum-exp now also covers the sink term. The single extra quantity to compute is `dsinks`, given by:

$$
d\mathrm{sink}_h = -\sum_{b}\sum_{q} P_{b,h,q}\,\Delta_{b,h,q}
$$

where $P_{b,h,q}$ is the proportion of $\mathrm{sink}_h$ in the softmax for the $b$-th block, $h$-th head, and $q$-th query (row), and $\Delta_{b,h,q}$ is the same per-row quantity $\mathrm{rowsum}(dO \odot O)$ as in the vanilla FlashAttention backward pass.
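
For reference, here is a minimal PyTorch sketch of this reduction (the function name `dsinks_ref` and all shapes are assumptions; it presumes `lse` already includes the sink term, as described above):

```python
import torch

def dsinks_ref(sinks, lse, delta):
    """Reference dsinks following the formula above.

    Assumed shapes: sinks is [heads]; lse and delta are [batch, heads, seq_q],
    where lse is the log-sum-exp of the softmax including the sink term and
    delta = rowsum(dO * O) as in the standard FlashAttention backward pass.
    """
    # P[b, h, q] = exp(sink_h - lse[b, h, q]): the sink's share of the softmax.
    p_sink = torch.exp(sinks.view(1, -1, 1) - lse)
    # dsink_h = -sum over batch and queries of P * Delta.
    return -(p_sink * delta).sum(dim=(0, 2))
```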

## Forward Benchmark

### Benchmark Environment
- **Hardware**: NVIDIA H800
- **CUDA version**: 12.9
- **Triton Version**: 3.4.0

### Results

- dtype=bfloat16
- batch_size=1, heads=64, kv_heads=8 (the setting of GPT-OSS-120B)
- Full attention is used.

| SEQ_LEN | headdim | Triton TFLOPs | TileLang TFLOPs | Speedup |
|---------|---------|---------------|-----------------|---------|
| 2048    |   64    | 232.98        | **281.89**      | 1.21x   |
| 2048    |  128    | 321.55        | **417.98**      | 1.30x   |
|         |         |               |                 |         |
| 4096    |   64    | 280.70        | **349.47**      | 1.25x   |
| 4096    |  128    | 369.61        | **497.13**      | 1.35x   |
|         |         |               |                 |         |
| 8192    |   64    | 299.04        | **385.56**      | 1.29x   |
| 8192    |  128    | 399.39        | **507.93**      | 1.27x   |
|         |         |               |                 |         |
| 16384   |   64    | 309.46        | **400.62**      | 1.29x   |
| 16384   |  128    | 418.99        | **549.11**      | 1.31x   |

> The backward performance will be further optimized.