# Liger Kernel: Efficient Triton Kernels for LLM Training

## Key Features

- **Ease of use:** Simply patch your Hugging Face model with one line of code, or compose your own model using our Liger Kernel modules.
- **Time and memory efficient:** In the same spirit as Flash-Attn, but for layers like **RMSNorm**, **RoPE**, **SwiGLU**, and **CrossEntropy**! Increases multi-GPU training throughput by 20% and reduces memory usage by 60% with **kernel fusion**, **in-place replacement**, and **chunking** techniques.
- **Exact:** Computation is exact—no approximations! Both forward and backward passes are implemented with rigorous unit tests and undergo convergence testing against training runs without Liger Kernel to ensure accuracy.
- **Lightweight:** Liger Kernel has minimal dependencies, requiring only Torch and Triton—no extra libraries needed! Say goodbye to dependency headaches!
- **Multi-GPU supported:** Compatible with multi-GPU setups (PyTorch FSDP, DeepSpeed, DDP, etc.).
- **Trainer framework integration:** [Axolotl](https://github.com/axolotl-ai-cloud/axolotl), [LLaMA-Factory](https://github.com/hiyouga/LLaMA-Factory), [SFTTrainer](https://github.com/huggingface/trl/releases/tag/v0.10.1), [Hugging Face Trainer](https://github.com/huggingface/transformers/pull/32860), [SWIFT](https://github.com/modelscope/ms-swift), [oumi](https://github.com/oumi-ai/oumi/tree/main)

## Installation

### Dependencies

- `torch >= 2.1.2`
- `triton >= 2.3.0`

### Optional Dependencies

- `transformers >= 4.x`: Required if you plan to use the transformers model patching APIs. The specific model you are working with will dictate the minimum version of transformers.

> **Note:**
> Our kernels inherit the full spectrum of hardware compatibility offered by [Triton](https://github.com/triton-lang/triton).

To install from source:

```bash
git clone https://github.com/linkedin/Liger-Kernel.git
cd Liger-Kernel

# Install default dependencies
# setup.py will detect whether you are using AMD or NVIDIA
pip install -e .

# Set up development dependencies
pip install -e ".[dev]"

# NOTE: For AMD users only
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.3/
```

## Getting Started

There are a couple of ways to apply Liger kernels, depending on the level of customization required.

### 1. Use AutoLigerKernelForCausalLM

Using `AutoLigerKernelForCausalLM` is the simplest approach, as you don't have to import a model-specific patching API. If the model type is supported, the modeling code will be automatically patched using the default settings.

```python
from liger_kernel.transformers import AutoLigerKernelForCausalLM

# This AutoModel wrapper class automatically monkey-patches the
# model with the optimized Liger kernels if the model is supported.
model = AutoLigerKernelForCausalLM.from_pretrained("path/to/some/model")
```

### 2. Apply Model-Specific Patching APIs

Using the [patching APIs](#patching), you can swap Hugging Face models with optimized Liger kernels.

```python
import transformers
from liger_kernel.transformers import apply_liger_kernel_to_llama

# 1a. Adding this line automatically monkey-patches the model with the optimized Liger kernels
apply_liger_kernel_to_llama()

# 1b. You could alternatively specify exactly which kernels are applied
apply_liger_kernel_to_llama(
    rope=True,
    swiglu=True,
    cross_entropy=True,
    fused_linear_cross_entropy=False,
    rms_norm=False
)

# 2. Instantiate the patched model
model = transformers.AutoModelForCausalLM.from_pretrained("path/to/llama/model")
```
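Whichever patching route you choose, the resulting model trains like any other Hugging Face model. If you use the Hugging Face `Trainer`, recent `transformers` releases (see the Hugging Face Trainer pull request linked under Trainer framework integration) also expose a `use_liger_kernel` flag on `TrainingArguments` that lets the Trainer apply the patches for you. The sketch below assumes that flag is available in your `transformers` version and that a pre-tokenized `train_dataset` already exists; the output directory and batch size are placeholders.

```python
from transformers import AutoModelForCausalLM, Trainer, TrainingArguments

model = AutoModelForCausalLM.from_pretrained("path/to/some/model")

training_args = TrainingArguments(
    output_dir="liger-output",      # placeholder output path
    per_device_train_batch_size=1,  # placeholder hyperparameter
    use_liger_kernel=True,          # let the Trainer apply Liger patches to supported models
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,  # assumed: your pre-tokenized dataset
)
trainer.train()
```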
### 3. Compose Your Own Model

You can take individual [kernels](https://github.com/linkedin/Liger-Kernel?tab=readme-ov-file#model-kernels) to compose your own models.

```python
import torch
import torch.nn as nn

from liger_kernel.transformers import LigerFusedLinearCrossEntropyLoss

model = nn.Linear(128, 256).cuda()

# Fuses the linear and cross-entropy layers together and performs
# chunk-by-chunk computation to reduce memory usage.
loss_fn = LigerFusedLinearCrossEntropyLoss()

input = torch.randn(4, 128, requires_grad=True, device="cuda")
target = torch.randint(256, (4,), device="cuda")

loss = loss_fn(model.weight, input, target)
loss.backward()
```
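For intuition, the fused loss computes the same quantity as first projecting the hidden states to logits and then applying a standard cross-entropy; the fusion simply avoids materializing the full `(batch, vocab_size)` logits tensor. Below is a minimal unfused reference, continuing the example above and assuming the bias-free path used there; it should match the fused result up to floating-point differences.

```python
import torch
import torch.nn.functional as F

# Unfused reference: explicitly materializes the (4, 256) logits tensor,
# which is exactly the intermediate the fused kernel avoids storing.
logits = input @ model.weight.t()
reference_loss = F.cross_entropy(logits, target)
```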