- 12 Apr, 2025 1 commit
Lei Wang authored
* [Add] Introduce deepseek_mla documentation for high-performance FlashMLA with TileLang
  - Added a comprehensive guide on writing high-performance kernels using TileLang, focusing on the Multi-Head Latent Attention (MLA) mechanism.
  - Included benchmark results comparing FlashMLA, TileLang, Torch, Triton, and FlashInfer, highlighting TileLang's efficiency and ease of use.
  - Detailed implementation strategies, including layout inference, threadblock swizzling, shared memory swizzling, and warp specialization.
  - Provided examples and explanations of optimization techniques to enhance performance in GPU kernel programming.
* doc update
* [Add] Enhance AMD FlashMLA implementation and documentation
  - Refactored variable names in `benchmark_mla_decode_amd_tilelang.py` for clarity, changing `Q_shared` and `Q_pe_shared` to `Q_local` and `Q_pe_local` to reflect their usage in register allocation.
  - Added a new `README.md` detailing the high-performance FlashMLA implementation on AMD MI300X accelerators, including architectural considerations, optimization strategies, and performance evaluation.
  - Introduced a performance comparison figure to illustrate the efficiency of the TileLang implementation against other frameworks.
* lint fix
* [Add] Expand deepseek_mla documentation for AMD MI300X optimization strategies
  - Introduced a new section detailing architectural differences and optimization strategies for implementing FlashMLA on AMD MI300X accelerators.
  - Highlighted key considerations such as instruction set variations, shared memory constraints, tile size flexibility, and memory bank conflict swizzling.
  - Included performance evaluation results demonstrating TileLang's efficiency compared to other frameworks.
  - Discussed future optimization opportunities for memory bank conflict mitigation and dimension parallelization.
- 10 Apr, 2025 1 commit
Lei Wang authored
* [Add] Introduce benchmark scripts for MLA decoding with AMD support
  - Added three new benchmark scripts: `benchmark_mla_decode_amd_tilelang.py`, `benchmark_mla_decode_amd_torch.py`, and `benchmark_mla_decode_amd_triton.py` to evaluate the performance of the MLA decoding mechanism across different frameworks.
  - Each script includes implementations for attention calculation, performance profiling, and output validation against reference implementations.
  - Enhanced command-line argument parsing for customizable input parameters, including batch size, number of heads, and dimensions.
  - Integrated performance comparison functionality to facilitate benchmarking between different implementations.
* lint fix
* lint fix

---------

Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>