1. 12 Apr, 2025 1 commit
    • [Docs] Add AMD Flash MLA Documentation to Tutorials Section (#376) · 0997c333
      Lei Wang authored
      * [Add] Introduce deepseek_mla documentation for high-performance FlashMLA with TileLang
      
      - Added a comprehensive guide on writing high-performance kernels using TileLang, focusing on the Multi-Head Latent Attention (MLA) mechanism.
      - Included benchmark results comparing FlashMLA, TileLang, Torch, Triton, and FlashInfer, highlighting TileLang's efficiency and ease of use.
      - Detailed implementation strategies, including layout inference, threadblock swizzling, shared memory swizzling, and warp specialization (a threadblock-swizzling sketch follows this list).
      - Provided examples and explanations of optimization techniques to enhance performance in GPU kernel programming.
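
      As a loose illustration of the threadblock-swizzling idea referenced above, the sketch below remaps a linear block id into a grouped visiting order so that neighbouring blocks reuse L2-resident tiles. The grouping factor, grid shape, and function name are illustrative assumptions, not parameters taken from the documented kernel.

      ```python
      # Minimal sketch of threadblock swizzling: remap a linear block id so that
      # consecutive blocks work on nearby output tiles and reuse L2-resident data.
      # group_m and the grid shape below are illustrative assumptions only.
      def swizzle_block_id(block_id: int, grid_m: int, grid_n: int, group_m: int = 8):
          """Map a linear block id to (tile_m, tile_n) in a grouped, cache-friendly order."""
          blocks_per_group = group_m * grid_n
          group = block_id // blocks_per_group
          first_m = group * group_m
          group_size = min(grid_m - first_m, group_m)   # last group may be shorter
          tile_m = first_m + (block_id % group_size)
          tile_n = (block_id % blocks_per_group) // group_size
          return tile_m, tile_n

      if __name__ == "__main__":
          grid_m, grid_n = 16, 16
          order = [swizzle_block_id(i, grid_m, grid_n) for i in range(grid_m * grid_n)]
          print(order[:10])  # first few tiles visited in the swizzled order
      ```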
      
      * doc update
      
      * [Add] Enhance AMD FlashMLA implementation and documentation
      
      - Refactored variable names in `benchmark_mla_decode_amd_tilelang.py` for clarity, renaming `Q_shared` and `Q_pe_shared` to `Q_local` and `Q_pe_local` to reflect that these tiles are staged in registers rather than shared memory (see the sketch after this list).
      - Added a new `README.md` detailing the high-performance FlashMLA implementation on AMD MI300X accelerators, including architectural considerations, optimization strategies, and performance evaluation.
      - Introduced a performance comparison figure to illustrate the efficiency of the TileLang implementation against other frameworks.
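
      A hedged sketch of the register-staging pattern behind the rename: the Q tiles are held in per-thread fragments (registers) instead of LDS. The API names (`T.prim_func`, `T.Kernel`, `T.alloc_fragment`, `T.copy`) follow TileLang's published examples, and the tile sizes and dtypes are assumptions; this is not the benchmark's actual kernel.

      ```python
      # Hedged sketch, not the benchmark's code: stage Q in register fragments
      # (what the Q_shared -> Q_local rename reflects), keeping LDS free for KV tiles.
      # API names follow TileLang's public examples; shapes/dtypes are assumptions.
      import tilelang.language as T

      BLOCK_H, D_LATENT, D_ROPE = 64, 512, 64   # assumed tile sizes, for illustration

      @T.prim_func
      def mla_q_staging(Q: T.Tensor((BLOCK_H, D_LATENT), "float16"),
                        Q_pe: T.Tensor((BLOCK_H, D_ROPE), "float16")):
          with T.Kernel(1, threads=256) as bx:
              # Register (fragment) allocation: each thread keeps its slice of Q in
              # registers instead of a shared-memory (LDS) buffer.
              Q_local = T.alloc_fragment((BLOCK_H, D_LATENT), "float16")
              Q_pe_local = T.alloc_fragment((BLOCK_H, D_ROPE), "float16")
              T.copy(Q, Q_local)
              T.copy(Q_pe, Q_pe_local)
      ```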
      
      * lint fix
      
      * [Add] Expand deepseek_mla documentation for AMD MI300X optimization strategies
      
      - Introduced a new section detailing architectural differences and optimization strategies for implementing FlashMLA on AMD MI300X accelerators.
      - Highlighted key considerations such as instruction set variations, shared memory (LDS) capacity constraints, tile size flexibility, and swizzling to avoid memory bank conflicts (a bank-conflict swizzle sketch follows this list).
      - Included performance evaluation results demonstrating TileLang's efficiency compared to other frameworks.
      - Discussed future optimization opportunities for memory bank conflict mitigation and dimension parallelization.
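
      To make the bank-conflict point concrete, here is a small plain-Python sketch of the XOR-based swizzle commonly used for this purpose; the 32 four-byte LDS banks match MI300X (CDNA3), but the exact pattern used in the kernel may differ.

      ```python
      # Hedged sketch of an XOR swizzle for LDS bank conflicts (not the kernel's
      # actual layout). MI300X LDS has 32 banks, each 4 bytes wide.
      BANKS = 32

      def xor_swizzle(row: int, col: int, cols_per_row: int = 32) -> int:
          """Swizzled column index: XOR the column with the row (mod row width) so
          that column-wise accesses are spread across all banks."""
          return col ^ (row % cols_per_row)

      # Column 0 of rows 0..31 now maps to 32 distinct columns, i.e. 32 distinct
      # banks (assuming 4-byte elements and 32 elements per row); unswizzled,
      # every one of those accesses would land in bank 0.
      assert len({xor_swizzle(r, 0) % BANKS for r in range(32)}) == BANKS
      ```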
  2. 10 Apr, 2025 1 commit
    • [MLA][AMD] Add amd mla benchmarking (#367) · d3536d9e
      Lei Wang authored
      
      
      * [Add] Introduce benchmark scripts for MLA decoding with AMD support
      
      - Added three new benchmark scripts: `benchmark_mla_decode_amd_tilelang.py`, `benchmark_mla_decode_amd_torch.py`, and `benchmark_mla_decode_amd_triton.py` to evaluate the performance of the MLA decoding mechanism across different frameworks.
      - Each script implements the attention calculation, performance profiling, and output validation against a reference implementation (a NumPy reference sketch follows this list).
      - Enhanced command-line argument parsing for customizable input parameters, including batch size, number of heads, and dimensions.
      - Integrated performance comparison functionality to facilitate benchmarking between different implementations.
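
      As an illustration of the kind of reference the scripts validate against, the following plain-NumPy decode-step computation uses the absorbed MLA formulation; the tensor names, shapes (latent dim 512, rope dim 64), and function signature are assumptions for this sketch and need not match the actual scripts.

      ```python
      import numpy as np

      def mla_decode_reference(q_nope, q_pe, ckv_cache, kpe_cache, softmax_scale):
          """Plain-NumPy MLA decode reference (single step, absorbed/latent form).

          q_nope:    [batch, heads, d_latent]   query projected into the latent space
          q_pe:      [batch, heads, d_rope]     rotary (positional) part of the query
          ckv_cache: [batch, seqlen, d_latent]  compressed latent KV cache
          kpe_cache: [batch, seqlen, d_rope]    rotary part of the cached keys
          """
          scores = (np.einsum("bhd,bsd->bhs", q_nope, ckv_cache)
                    + np.einsum("bhr,bsr->bhs", q_pe, kpe_cache)) * softmax_scale
          scores -= scores.max(axis=-1, keepdims=True)        # numerically stable softmax
          probs = np.exp(scores)
          probs /= probs.sum(axis=-1, keepdims=True)
          return np.einsum("bhs,bsd->bhd", probs, ckv_cache)  # [batch, heads, d_latent]

      if __name__ == "__main__":
          rng = np.random.default_rng(0)
          b, h, s, dl, dr = 2, 16, 128, 512, 64
          out = mla_decode_reference(rng.standard_normal((b, h, dl)),
                                     rng.standard_normal((b, h, dr)),
                                     rng.standard_normal((b, s, dl)),
                                     rng.standard_normal((b, s, dr)),
                                     softmax_scale=1.0 / np.sqrt(dl + dr))
          print(out.shape)  # expected: (2, 16, 512)
      ```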
      
      * lint fix
      
      * lint fix
      
      ---------
      Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>