1. 03 Jul, 2024 1 commit
  2. 01 Jul, 2024 5 commits
    • Fix typos in comments about shape. (#837) · 9486635c
      66RING authored
    • Fix KeyError handling for non-existing keys in state_dict.pop() (#898) · 0d810cfb
      JDKWangGuan authored
      Update the handling of KeyError in state_dict.pop() for non-existing keys:
      changed state_dict.pop(f"h.{d}.attn.bias") to state_dict.pop(f"h.{d}.attn.bias", None) so that a missing key no longer raises a KeyError.
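      A minimal sketch of the change, assuming a remapping loop like the one in flash_attn.models.gpt (the helper name below is hypothetical):
      ```python
      # Hypothetical remapping helper; only the pop() default is the point here.
      def strip_attn_bias(state_dict, n_layer):
          for d in range(n_layer):
              # With a default of None, pop() is a no-op when the checkpoint was
              # saved without the attention bias buffer, instead of raising KeyError.
              state_dict.pop(f"h.{d}.attn.bias", None)
          return state_dict
      ```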
      
      
      The following code reproduces the issue:
      ```python
      from transformers import GPT2Model, GPT2Config
      from flash_attn.models.gpt import GPTLMHeadModel, GPTModel

      # >>> transformers.__version__
      # '4.38.2'

      model_path = 'gpt2'
      output_model_path = 'gpt2_model'
      config = GPT2Config.from_pretrained(model_path, output_hidden_states=True)
      model = GPT2Model.from_pretrained(model_path, from_tf=False, config=config)

      # ... model fine-tuning here ...

      # dump the fine-tuned model
      model.save_pretrained(output_model_path)

      # load the fine-tuned model
      config = GPT2Config.from_pretrained(output_model_path, output_hidden_states=True)
      model = GPTModel.from_pretrained(output_model_path, config=config, strict=True)  # fails with KeyError: 'h.0.attn.bias'
      model = GPTLMHeadModel.from_pretrained(output_model_path, config=config, strict=True)  # fails with KeyError: 'h.0.attn.bias'
      ```
    • Fix typo (#974) · 6a2a16e9
      cao lei authored
    • Fixing argument checking when using `seqlenq_ngroups_swapped`. (#976) · 5bf20196
      Nicolas Patry authored
      When the user passes `out` as a parameter and the other arguments trigger
      `seqlenq_ngroups_swapped`, the CHECK_SHAPE on `out` is incorrect, because the
      shape of q has already been modified at that point. A sketch of the ordering
      issue follows.
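      A hypothetical Python rendering of the check (the real code is the C++ mha_fwd in flash-attn's csrc; all names and shapes here are illustrative assumptions):
      ```python
      # Illustrative only: shows why `out` must be validated before q is reshaped.
      # q, k, v, out are tensors of shape (batch, seqlen, num_heads, head_dim).
      def attention_forward(q, k, v, out=None):
          batch, seqlen_q, num_heads, head_dim = q.shape
          num_heads_k = k.shape[2]
          # Decode-style case: one query timestep with MQA/GQA; the kernel folds
          # the query-head groups into the sequence dimension for efficiency.
          seqlenq_ngroups_swapped = seqlen_q == 1 and num_heads > num_heads_k
          if out is not None:
              # Correct ordering: validate `out` against the shape the caller
              # passed in. Doing this after the reshape below (the bug) compares
              # against q's modified shape and rejects valid buffers.
              assert out.shape == (batch, seqlen_q, num_heads, head_dim)
          if seqlenq_ngroups_swapped:
              q = q.reshape(batch, num_heads // num_heads_k, num_heads_k, head_dim)
          ...
      ```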
    • Liang authored
  3. 27 Jun, 2024 1 commit
  4. 26 May, 2024 7 commits
  5. 23 May, 2024 1 commit
  6. 06 May, 2024 1 commit
  7. 26 Apr, 2024 3 commits
  8. 08 Apr, 2024 4 commits
  9. 05 Apr, 2024 1 commit
  10. 28 Mar, 2024 2 commits
  11. 19 Mar, 2024 1 commit
  12. 15 Mar, 2024 3 commits
  13. 14 Mar, 2024 2 commits
  14. 02 Mar, 2024 2 commits
  15. 21 Feb, 2024 4 commits
  16. 20 Feb, 2024 1 commit
  17. 18 Feb, 2024 1 commit
    • Optimize compile: avoid OOM, minimize swap usage, avoid thread starvation · f45bbb4c
      Qubitium authored
      Optimize compilation to 1) avoid OOM, 2) minimize swap usage, and 3) avoid thread starvation when letting ninja decide how many workers to spawn or when relying on a manual MAX_JOBS guess. The logic takes the minimum of the MAX_JOBS values auto-calculated from two metrics: 1) CPU cores and 2) free memory. This should allow flash-attn to compile in close to the most efficient manner under any consumer or server environment. (#832) A sketch of the heuristic follows.
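      A hedged sketch of that heuristic in Python (the real logic lives in flash-attn's setup.py; the psutil usage and the 9 GiB-per-job figure are assumptions):
      ```python
      # Hypothetical sketch of the MAX_JOBS heuristic described above; names and
      # the per-job memory figure are illustrative, not the exact setup.py code.
      import os
      import psutil  # assumption: free memory is measured via psutil

      def estimate_max_jobs(mem_gib_per_job: float = 9.0) -> int:
          # Metric 1: CPU cores. Using half the cores leaves headroom so ninja
          # workers do not starve other threads on the machine.
          by_cpu = max(1, (os.cpu_count() or 1) // 2)
          # Metric 2: free memory. Each nvcc job can peak at several GiB, so cap
          # the job count by what fits in RAM to avoid OOM and heavy swapping.
          free_gib = psutil.virtual_memory().available / 2**30
          by_mem = max(1, int(free_gib / mem_gib_per_job))
          # Take the min of the two estimates: the scarcer resource wins.
          return min(by_cpu, by_mem)

      # Respect an explicit user setting; otherwise use the auto-calculated value.
      os.environ.setdefault("MAX_JOBS", str(estimate_max_jobs()))
      ```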