Commit e9d4ceda authored by Lei Wang, committed by LeiWang1999

[Doc] Update README.md for deepseek_mla on AMD (#389)

* Update README.md for deepseek_mla: Refine performance comparison details and add acknowledgment section. Adjusted performance metrics for TileLang, highlighting its efficiency over Triton and assembly kernels. Included gratitude to the AMD ROCm team for their contributions.

* Update README.md for deepseek_mla: Clarify performance metrics for TileLang, specifying the range of performance parity with hand-optimized assembly kernels. This adjustment enhances the accuracy of the comparative analysis against Triton implementations.
parent a636debb
@@ -11,6 +11,7 @@ Tile Language (**tile-lang**) is a concise domain-specific language designed to
<img src=./images/MatmulExample.png />
## Latest News
- 14/04/2025 🚀: Added high-performance FlashMLA implementation for AMD MI300X, achieving performance parity with hand-optimized assembly kernels of Aiter! See [example_mla_amd](./examples/deepseek_mla/amd/README.md) for details.
- 03/03/2025 🚀: Added high-performance MLA Decoding support using only 80 lines of Python code, achieving performance on par with FlashMLA on H100 (see [example_mla_decode.py](./examples/deepseek_mla/example_mla_decode.py))! We also provide [documentation](./examples/deepseek_mla/README.md) explaining how TileLang achieves this.
- 02/15/2025 ✨: Added WebGPU Codegen support, see [Pull Request #86](https://github.com/tile-ai/tilelang/pull/86)!
- 02/12/2025 ✨: Excited to announce the release of [v0.1.0](https://github.com/tile-ai/tilelang/releases/tag/v0.1.0)!
@@ -36,7 +36,7 @@ We conducted comparative performance analysis across multiple frameworks using f
<figcaption style="text-align: center;">Figure 1: Computational throughput comparison across frameworks (Batch sizes 64 and 128)</figcaption>
</figure>
Notably, TileLang achieves performance parity with hand-optimized assembly kernels (aiter-asm), ranging from 0.73x to 1.21x, in most test cases, while significantly outperforming Triton implementations (up to 6.5x faster). This performance is achieved through a concise 70-line Python implementation!
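The relative figures above (0.73x–1.21x against aiter-asm, up to 6.5x over Triton) are ratios of kernel latencies. A minimal sketch of that arithmetic, using purely hypothetical latency values (not measured numbers from the benchmark):

```python
# Hypothetical per-kernel latencies in microseconds, for illustration only;
# they are NOT the measured values behind Figure 1.
latencies_us = {
    "tilelang": 100.0,
    "aiter-asm": 95.0,   # hand-optimized assembly baseline
    "triton": 650.0,
}

def relative_speed(baseline: str, other: str) -> float:
    """Speed of `other` relative to `baseline`: >1 means `other` is faster."""
    return latencies_us[baseline] / latencies_us[other]

# TileLang vs. assembly: values between 0.73x and 1.21x indicate rough parity.
print(f"{relative_speed('aiter-asm', 'tilelang'):.2f}x")  # prints "0.95x"

# TileLang vs. Triton: a value of 6.5x means TileLang is 6.5x faster.
print(f"{relative_speed('triton', 'tilelang'):.2f}x")  # prints "6.50x"
```

Reading ratios this way makes the parity claim concrete: anything near 1.0x against aiter-asm means the generated kernel matches hand-written assembly.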
## Future Optimization Opportunities
@@ -46,3 +46,6 @@ Notably, TileLang achieves performance parity with hand-optimized assembly kerne
- Reduce shared memory pressure
- Improve compute-to-memory access ratios
- Enhance parallelism through dimension-wise task distribution
## Acknowledgement
We would like to express our sincere gratitude to the AMD ROCm and Composable Kernel team for their outstanding contributions. We have learned a great deal from the ROCm software stack.