- FlashAttention: fast and memory-efficient exact attention. This makes
1. FlashAttention: fast and memory-efficient exact attention. This makes
attention much faster and saves a lot of activation memory. As a result, we don't need
to use any activation checkpointing.
```sh
pip install flash-attn
```
- Fused matmul + bias (forward and backward), and fused matmul + bias + gelu
2. Fused matmul + bias (forward and backward), and fused matmul + bias + gelu
(forward and backward), adapted from Apex's
[FusedDense](https://github.com/NVIDIA/apex/tree/master/apex/fused_dense). We
make it work for bfloat16. For best performance, you should use CUDA >= 11.8. cuBLAS versions before
...
...
@@ -47,16 +71,16 @@ this doesn't have the best matmul + bias + gelu performance for bfloat16.
```sh
cd ../csrc/fused_dense_lib && pip install .
```
- Optimized cross-entropy loss, adapted from Apex's
3. Optimized cross-entropy loss, adapted from Apex's
[Xentropy](https://github.com/NVIDIA/apex/tree/master/apex/contrib/xentropy). We make it work for bfloat16 and support in-place backward to save memory.
```sh
cd ../csrc/xentropy && pip install .
```
- Fused rotary embedding:
4. Fused rotary embedding:
```sh
cd ../csrc/rotary && pip install .
```
- Fused dropout + residual + LayerNorm, adapted from Apex's
5. Fused dropout + residual + LayerNorm, adapted from Apex's
[FastLayerNorm](https://github.com/NVIDIA/apex/tree/master/apex/contrib/layer_norm). We add dropout and residual, and make it work for both pre-norm and post-norm architectures.
This only supports a limited set of dimensions; see `csrc/layer_norm/ln_fwd_cuda_kernel.cu`.
```sh
...
...
@@ -65,8 +89,9 @@ cd ../csrc/layer_norm && pip install .
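The per-extension install commands above all follow the same pattern, so they can be combined into one loop. The following is a minimal sketch, not part of the repository: `install_extensions` and the `DRY_RUN` switch are hypothetical names introduced here, and the relative path `../csrc` assumes the script is run from the same directory as the individual commands above.

```sh
# Sketch: install the optional CUDA extensions in one pass.
# install_extensions and DRY_RUN are hypothetical helpers, not repo features.
install_extensions() {
    csrc_dir="$1"
    shift
    for ext in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            # Only print the command that would be run.
            echo "cd ${csrc_dir}/${ext} && pip install ."
        else
            # Subshell keeps the caller's working directory unchanged.
            (cd "${csrc_dir}/${ext}" && pip install .) || return 1
        fi
    done
}

# Dry run over the extension directories named in the list above.
DRY_RUN=1 install_extensions ../csrc fused_dense_lib xentropy rotary layer_norm
```

Dropping `DRY_RUN=1` from the last line would run the actual builds one after another and stop at the first failure.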
## Training
Feel free to use the model in your training setup. We also provide here training
scripts to train GPT2 on Openwebtext and GPT3 on The Pile as examples.
We also provide training scripts here that train GPT-2 on OpenWebText and GPT-3 on
The Pile as examples. Feel free to use the model in your own training setup as
well.
We use [Hydra](https://hydra.cc/) for configuration,
[PyTorch Lightning](https://github.com/Lightning-AI/lightning) for training, and
...
...
@@ -75,12 +100,20 @@ We use [Hydra](https://hydra.cc/) for configuration,
We use the template from `https://github.com/ashleve/lightning-hydra-template`.
Please read the instructions there to understand the repo structure.