Commit e9d4ceda authored by Lei Wang, committed by LeiWang1999

[Doc] Update README.md for deepseek_mla on AMD (#389)

* Update README.md for deepseek_mla: Refine performance comparison details and add acknowledgment section. Adjusted performance metrics for TileLang, highlighting its efficiency over Triton and assembly kernels. Included gratitude to the AMD ROCm team for their contributions.

* Update README.md for deepseek_mla: Clarify performance metrics for TileLang, specifying the range of performance parity with hand-optimized assembly kernels. This adjustment enhances the accuracy of the comparative analysis against Triton implementations.
parent a636debb
@@ -11,6 +11,7 @@ Tile Language (**tile-lang**) is a concise domain-specific language designed to
<img src=./images/MatmulExample.png />
## Latest News
- 14/04/2025 🚀: Added high-performance FlashMLA implementation for AMD MI300X, achieving performance parity with hand-optimized assembly kernels of Aiter! See [example_mla_amd](./examples/deepseek_mla/amd/README.md) for details.
- 03/03/2025 🚀: Added high-performance MLA Decoding support using only 80 lines of Python code, achieving performance on par with FlashMLA on H100 (see [example_mla_decode.py](./examples/deepseek_mla/example_mla_decode.py))! We also provide [documentation](./examples/deepseek_mla/README.md) explaining how TileLang achieves this.
- 02/15/2025 ✨: Added WebGPU Codegen support, see [Pull Request #86](https://github.com/tile-ai/tilelang/pull/86)!
- 02/12/2025 ✨: Excited to announce the release of [v0.1.0](https://github.com/tile-ai/tilelang/releases/tag/v0.1.0)!
@@ -36,7 +36,7 @@ We conducted comparative performance analysis across multiple frameworks using f
<figcaption style="text-align: center;">Figure 1: Computational throughput comparison across frameworks (Batch sizes 64 and 128)</figcaption>
</figure>
Notably, TileLang achieves performance parity with hand-optimized assembly kernels (aiter-asm), ranging from 0.73x to 1.21x, in most test cases, while significantly outperforming Triton implementations (up to 6.5x faster). This performance is achieved through a concise 70-line Python implementation!
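The relative figures above (0.73x–1.21x against aiter-asm, up to 6.5x over Triton) are ratios of kernel latencies. A minimal sketch of that arithmetic, using purely hypothetical latency values (not measured numbers from the benchmark):

```python
# Hypothetical per-kernel latencies in microseconds, for illustration only;
# they are NOT the measured values behind Figure 1.
latencies_us = {
    "tilelang": 100.0,
    "aiter-asm": 95.0,   # hand-optimized assembly baseline
    "triton": 650.0,
}

def relative_speed(baseline: str, other: str) -> float:
    """Speed of `other` relative to `baseline`: >1 means `other` is faster."""
    return latencies_us[baseline] / latencies_us[other]

# TileLang vs. assembly: values between 0.73x and 1.21x indicate rough parity.
print(f"{relative_speed('aiter-asm', 'tilelang'):.2f}x")  # prints "0.95x"

# TileLang vs. Triton: a value of 6.5x means TileLang is 6.5x faster.
print(f"{relative_speed('triton', 'tilelang'):.2f}x")  # prints "6.50x"
```

Reading ratios this way makes the parity claim concrete: anything near 1.0x against aiter-asm means the generated kernel matches hand-written assembly.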
## Future Optimization Opportunities
@@ -46,3 +46,6 @@ Notably, TileLang achieves performance parity with hand-optimized assembly kerne
- Reduce shared memory pressure
- Improve compute-to-memory access ratios
- Enhance parallelism through dimension-wise task distribution
## Acknowledgement
We would like to express our sincere gratitude to the AMD ROCm and Composable Kernel team for their outstanding contributions. We have learned a great deal from the ROCm software stack.