# Performance Benchmark

## Hardware & Environment

- **Hardware**: NVIDIA H100 PCIe
- **CUDA Version**: 12.8.1
- **PyTorch Version**: 2.7.1+cu128
- **Triton Version**: 3.3.1

## Performance Results

BATCH_SIZE=1, HEAD=1, DIM=64

| SEQ_LEN | VS_LIST      | Triton Time | TileLang Time | Speedup |
|---------|--------------|-------------|---------------|---------|
| 8192    | [1000, 200]  | 0.168 ms    | 0.105 ms      | 1.60x   |
| 8192    | [1000, 600]  | 0.207 ms    | 0.119 ms      | 1.74x   |
| 8192    | [800, 600]   | 0.207 ms    | 0.122 ms      | 1.70x   |
|         |              |             |               |         |
| 16384   | [1000, 200]  | 0.261 ms    | 0.167 ms      | 1.56x   |
| 16384   | [1000, 600]  | 0.419 ms    | 0.258 ms      | 1.62x   |
| 16384   | [800, 600]   | 0.422 ms    | 0.255 ms      | 1.65x   |
|         |              |             |               |         |
| 32768   | [1000, 200]  | 0.374 ms    | 0.248 ms      | 1.51x   |
| 32768   | [1000, 600]  | 0.823 ms    | 0.554 ms      | 1.49x   |
| 32768   | [800, 600]   | 0.826 ms    | 0.558 ms      | 1.48x   |
|         |              |             |               |         |
| 65536   | [1000, 200]  | 0.637 ms    | 0.524 ms      | 1.22x   |
| 65536   | [1000, 600]  | 1.758 ms    | 1.501 ms      | 1.17x   |
| 65536   | [800, 600]   | 1.783 ms    | 1.489 ms      | 1.20x   |
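
The Speedup column appears to be the Triton time divided by the TileLang time, rounded to two decimals. The sketch below recomputes it for a few rows copied from the table. The `speedup` helper and the `rows` list are illustrative names, not part of the benchmark scripts.

```python
# Illustrative helper (an assumption, not taken from the benchmark code):
# recompute the Speedup column as Triton time / TileLang time.

def speedup(triton_ms: float, tilelang_ms: float) -> str:
    """Return the ratio formatted like the table, e.g. '1.60x'."""
    return f"{triton_ms / tilelang_ms:.2f}x"

# A few rows copied from the table above:
# (SEQ_LEN, VS_LIST, Triton time in ms, TileLang time in ms)
rows = [
    (8192, [1000, 200], 0.168, 0.105),
    (32768, [1000, 600], 0.823, 0.554),
    (65536, [800, 600], 1.783, 1.489),
]

for seq_len, vs_list, triton_ms, tilelang_ms in rows:
    print(seq_len, vs_list, speedup(triton_ms, tilelang_ms))
```

Under this reading, the reported speedup narrows from roughly 1.6x to 1.7x at SEQ_LEN=8192 to roughly 1.2x at SEQ_LEN=65536.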