"app/vscode:/vscode.git/clone" did not exist on "952abe029b550b91d3913f34600399b901391bb7"
    [Docs] Add AMD Flash MLA Documentation to Tutorials Section (#376) · 0997c333
    Lei Wang authored
    * [Add] Introduce deepseek_mla documentation for high-performance FlashMLA with TileLang
    
    - Added a comprehensive guide on writing high-performance kernels using TileLang, focusing on the Multi-Head Latent Attention (MLA) mechanism.
    - Included benchmark results comparing FlashMLA, TileLang, Torch, Triton, and FlashInfer, highlighting TileLang's efficiency and ease of use.
    - Detailed implementation strategies, including layout inference, threadblock swizzling, shared memory swizzling, and warp specialization.
    - Provided examples and explanations of optimization techniques to enhance performance in GPU kernel programming.
    
    * doc update
    
    * [Add] Enhance AMD FlashMLA implementation and documentation
    
    - Renamed `Q_shared` and `Q_pe_shared` to `Q_local` and `Q_pe_local` in `benchmark_mla_decode_amd_tilelang.py` to reflect that these buffers are allocated in registers.
    - Added a new `README.md` detailing the high-performance FlashMLA implementation on AMD MI300X accelerators, including architectural considerations, optimization strategies, and performance evaluation.
    - Introduced a performance comparison figure to illustrate the efficiency of the TileLang implementation against other frameworks.
    
    * lint fix
    
    * [Add] Expand deepseek_mla documentation for AMD MI300X optimization strategies
    
    - Introduced a new section detailing architectural differences and optimization strategies for implementing FlashMLA on AMD MI300X accelerators.
    - Highlighted key considerations such as instruction set variations, shared memory constraints, tile size flexibility, and swizzling to avoid memory bank conflicts.
    - Included performance evaluation results demonstrating TileLang's efficiency compared to other frameworks.
    - Discussed future optimization opportunities for memory bank conflict mitigation and dimension parallelization.