Commits · 0997c33310ed5e496c79a2b3c659cacc0a2aeba2 · OpenDAS / tilelang

12 Apr, 2025 1 commit

[Docs] Add AMD Flash MLA Documentation to Tutorials Section (#376) · 0997c333

Lei Wang authored Apr 12, 2025

* [Add] Introduce deepseek_mla documentation for high-performance FlashMLA with TileLang

- Added a comprehensive guide on writing high-performance kernels using TileLang, focusing on the Multi-Head Latent Attention (MLA) mechanism.
- Included benchmark results comparing FlashMLA, TileLang, Torch, Triton, and FlashInfer, highlighting TileLang's efficiency and ease of use.
- Detailed implementation strategies, including layout inference, threadblock swizzling, shared memory swizzling, and warp specialization.
- Provided examples and explanations of optimization techniques to enhance performance in GPU kernel programming.

* doc update

* [Add] Enhance AMD FlashMLA implementation and documentation

- Renamed `Q_shared` and `Q_pe_shared` to `Q_local` and `Q_pe_local` in `benchmark_mla_decode_amd_tilelang.py` to reflect that these buffers are allocated in registers.
- Added a new `README.md` detailing the high-performance FlashMLA implementation on AMD MI300X accelerators, including architectural considerations, optimization strategies, and performance evaluation.
- Introduced a performance comparison figure to illustrate the efficiency of the TileLang implementation against other frameworks.

* lint fix

* [Add] Expand deepseek_mla documentation for AMD MI300X optimization strategies

- Introduced a new section detailing architectural differences and optimization strategies for implementing FlashMLA on AMD MI300X accelerators.
- Highlighted key considerations such as instruction set variations, shared memory constraints, tile size flexibility, and swizzling to avoid memory bank conflicts.
- Included performance evaluation results demonstrating TileLang's efficiency compared to other frameworks.
- Discussed future optimization opportunities for memory bank conflict mitigation and dimension parallelization.


10 Apr, 2025 1 commit

[MLA][AMD] Add AMD MLA benchmarking (#367) · d3536d9e

Lei Wang authored Apr 10, 2025



* [Add] Introduce benchmark scripts for MLA decoding with AMD support

- Added three new benchmark scripts: `benchmark_mla_decode_amd_tilelang.py`, `benchmark_mla_decode_amd_torch.py`, and `benchmark_mla_decode_amd_triton.py` to evaluate the performance of the MLA decoding mechanism across different frameworks.
- Each script includes implementations for attention calculation, performance profiling, and output validation against reference implementations.
- Enhanced command-line argument parsing for customizable input parameters, including batch size, number of heads, and dimensions.
- Integrated performance comparison functionality to facilitate benchmarking between different implementations.

* lint fix

* lint fix

---------
Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>
