Commits · 42f3c5789db65b5ff1eadea0fe4ce3805483a8e8 · OpenDAS / FlashMLA

29 Sep, 2025 6 commits
- Rename deep dive blog · 42f3c578
  Shengyu Liu authored Sep 29, 2025
  
  42f3c578
- Add Deep-Dive Blog for the New Sparse Decoding Kernel on Hopper (#100) · 472477e8
  Shengyu Liu authored Sep 29, 2025
  
  472477e8
- Add Sparse Decoding Kernel and Sparse Prefill Kernel for Blackwell · fd249aac
  Simon Mo authored Sep 29, 2025
```
Signed-off-by: simon-mo <simon.mo@hey.com>
```
  fd249aac
- Merge pull request #98 from deepseek-ai/open-source-h · 17944550
  Shengyu Liu authored Sep 29, 2025
```
Add Sparse Attention Kernels on Hopper
```
  17944550
- Merge remote-tracking branch 'github/main' into open-source-h · 3969f20b
  Shengyu Liu authored Sep 29, 2025
  
  3969f20b
- Fill in link to DSv3.2 paper · 7232d69d
  Shengyu Liu authored Sep 29, 2025
  
  7232d69d
24 Sep, 2025 2 commits
- Add a comment · 87709cf4
  Shengyu Liu authored Sep 24, 2025
  
  87709cf4
- Reorganize files and add sparse prefill/decoding kernels on hopper · c28eca99
  Shengyu Liu authored Sep 24, 2025
  
  c28eca99
22 Sep, 2025 1 commit
- Refine handling for q/v sequence length equals zero. (#92) · ebf30641
  zhang authored Sep 22, 2025
  
  ebf30641
27 Aug, 2025 1 commit

- 261330bb
  Zeyu WANG authored Aug 27, 2025
  
  * fix calc space bug
  * use python code to allocate the buffer for backward kernel

25 Aug, 2025 2 commits
- Remove cudaMalloc and cudaFree in backward (#87) · eb758335
  Li Xiang authored Aug 25, 2025
```
* get rid of cudaMalloc and cudaFree

* minor fix

---------
Co-authored-by: Jiashi Li <js.li@high-flyer.cn>
```
  eb758335
- Remove tma padding for fwd inputs (#85) · 2d291b0c
  zhang authored Aug 25, 2025
  
  2d291b0c
14 Aug, 2025 2 commits
- Fix accuracy issue in sum_OdO kernel · c7590278
  Jiashi Li authored Aug 14, 2025
  
  c7590278
- Drop support for CUDA <12.8 · ef5b1a69
  Jiashi Li authored Aug 14, 2025
  
  ef5b1a69
01 Aug, 2025 1 commit

- Add more GPU architectures support (#76) · 41b611f7
  Zeyu WANG authored Aug 01, 2025
  
  * Add more GPU architectures support
  * Merge fmha and mla runner
  * add varlen & non varlen support, and add incontiguous tensor support
  * update readme
  * add varlen api
  
  ---------
  Co-authored-by: dianzhangc <dianzhangc@nvidia.com>
  
  41b611f7

29 Apr, 2025 2 commits
- update .gitignore · 9edee0c0
  ljss authored Apr 29, 2025
  
  9edee0c0
- update to cutlass 3.9 · 9c5dfab6
  ljss authored Apr 29, 2025
  
  9c5dfab6
28 Apr, 2025 1 commit
- Fix synchronization issues · 01a27728
  ljss authored Apr 28, 2025
  
  01a27728
23 Apr, 2025 2 commits

- Fix LaTeX render error (#74) · 70b94685
  Shengyu Liu authored Apr 23, 2025
  
  70b94685
- Minor fix to the docs to correct FlashAttention-3's paper link and typos (#73) · 6cff5a73
  ℍ𝕠𝕝𝕝𝕠𝕨 𝕄𝕒𝕟 authored Apr 23, 2025
  
  Thank you for open source FlashMLA! Just read the write up and very amazing
  work! Found some very minor mistakes regarding to typos, and the link
  to the FlashAttention-3 paper is wrong as that is the original FlashAttention
  paper, so I just send the PR here. Thanks again!
  Signed-off-by: Hollow Man <hollowman@opensuse.org>
  
  6cff5a73

22 Apr, 2025 2 commits

- Update README.md (#72) · a9444cd6
  Shengyu Liu authored Apr 22, 2025
  
  a9444cd6
- Performance Update (2025.04.22) (#71) · c2067be3
  Shengyu Liu authored Apr 22, 2025
  
  * Fix benchmark script
  * Performance optimization for compute-bound cases
  * Add new testcase (s_k = 16384)
  * Update README.md
  * Update comment
  * Update README.md
  * Add the deep-dive blog
  * Add background color for MLA Kernel Sched.drawio.svg
  * Use relative path for the schedule image
  * Move flash_mla.h to kernels/params.h
  
  c2067be3

01 Mar, 2025 2 commits
- add missing copyright · b31bfe72
  ljss authored Mar 01, 2025
  
  b31bfe72
- add community support for [AMD] · 3e123bc9
  Jiashi Li authored Mar 01, 2025
  
  3e123bc9
27 Feb, 2025 3 commits
- reformat Community Support section · 1aef31d1
  hpp authored Feb 27, 2025
  
  1aef31d1
- add Community Support of [Hygon DCU] [Intellifusion] [Iluvatar Corex] · 77d9d8d2
  hpp authored Feb 27, 2025
  
  77d9d8d2
- add Community Support of [Hygon DCU] [Intellifusion] [Iluvatar Corex] · 4430e398
  hpp authored Feb 27, 2025
  
  4430e398
26 Feb, 2025 4 commits
- fix readme · 480405ad
  Jiashi Li authored Feb 26, 2025
  
  480405ad
- Fix readme · 966eedc2
  Jiashi Li authored Feb 26, 2025
  
  966eedc2
- Merge pull request #45 from yangsijia-serena/main · 01d6d400
  Jiashi Li authored Feb 26, 2025
```
fix(benchmark): store 'compare' and 'one' perf results in csv files and visualize them
```
  01d6d400
- add Community Support of [MetaX] and [Moore Threads] · 6492cabb
  hpp authored Feb 26, 2025
  
  6492cabb
25 Feb, 2025 4 commits
- fix(benchmark): store 'compare' and 'one' perf results in csv files and visualize them · b6798030
  yangsijia.614 authored Feb 25, 2025
  
  b6798030
- cuda12.8 recommendation · 4edea86f
  ljss authored Feb 26, 2025
  
  4edea86f
- Merge pull request #32 from sijiac/fp16-support · b549289f
  Jiashi Li authored Feb 25, 2025
```
Support FP16 dtype in FlashMLA kernel
```
  b549289f
- Style fix · e1e9fa98
  ljss authored Feb 25, 2025
  
  e1e9fa98
24 Feb, 2025 5 commits
- add flag to disable FP16 compile · a3b74b85
  Sijia Chen authored Feb 24, 2025
  
  a3b74b85
- Merge pull request #35 from KnowingNothing/main · 18e32770
  Jiashi Li authored Feb 25, 2025
```
feat: add benchmark for flash_infer vs flash_mla
```
  18e32770
- Merge pull request #37 from chunyang-wen/Update-doc-string · 7d69520a
  Jiashi Li authored Feb 25, 2025
```
Update docstring
```
  7d69520a
- add gitignore for png and csv files in benchmark · 922f63bd
  zhengsize authored Feb 24, 2025
  
  922f63bd
- Update docstring · c4c5912b
  chunyang.wen authored Feb 25, 2025
  
  c4c5912b