<a name="readme-top"></a>
# Liger Kernel: Efficient Triton Kernels for LLM Training
<table style="width: 100%; text-align: center; border-collapse: collapse;">
<tr>
<th style="padding: 10px;" colspan="2">Stable</th>
<th style="padding: 10px;" colspan="2">Nightly</th>
<th style="padding: 10px;">Discord</th>
<th style="padding: 10px;">Build</th>
</tr>
<tr>
<td style="padding: 10px;">
<a href="https://pepy.tech/project/liger-kernel">
<img src="https://static.pepy.tech/badge/liger-kernel" alt="Downloads (Stable)">
</a>
</td>
<td style="padding: 10px;">
<a href="https://pypi.org/project/liger-kernel">
<img alt="PyPI - Version" src="https://img.shields.io/pypi/v/liger-kernel?color=green">
</a>
</td>
<td style="padding: 10px;">
<a href="https://pepy.tech/project/liger-kernel-nightly">
<img src="https://static.pepy.tech/badge/liger-kernel-nightly" alt="Downloads (Nightly)">
</a>
</td>
<td style="padding: 10px;">
<a href="https://pypi.org/project/liger-kernel-nightly">
<img alt="PyPI - Version" src="https://img.shields.io/pypi/v/liger-kernel-nightly?color=green">
</a>
</td>
<td style="padding: 10px;">
<a href="https://discord.gg/gpumode">
<img src="https://dcbadge.vercel.app/api/server/gpumode?style=flat" alt="Join Our Discord">
</a>
</td>
<td style="padding: 10px;">
<div style="display: block;">
<a href="https://github.com/linkedin/Liger-Kernel/actions/workflows/nvi-ci.yml">
<img src="https://github.com/linkedin/Liger-Kernel/actions/workflows/nvi-ci.yml/badge.svg?event=schedule" alt="Build">
</a>
</div>
<div style="display: block;">
<a href="https://github.com/linkedin/Liger-Kernel/actions/workflows/amd-ci.yml">
<img src="https://github.com/linkedin/Liger-Kernel/actions/workflows/amd-ci.yml/badge.svg?event=schedule" alt="Build">
</a>
</div>
</td>
</tr>
</table>
<img src="https://raw.githubusercontent.com/linkedin/Liger-Kernel/main/docs/images/logo-banner.png">
**Liger Kernel** is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU **training throughput by 20%** and reduce **memory usage by 60%**. We have implemented **Hugging Face Compatible** `RMSNorm`, `RoPE`, `SwiGLU`, `CrossEntropy`, `FusedLinearCrossEntropy`, and more to come. The kernels work out of the box with [Flash Attention](https://github.com/Dao-AILab/flash-attention), [PyTorch FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html), and [Microsoft DeepSpeed](https://github.com/microsoft/DeepSpeed). We welcome contributions from the community to gather the best kernels for LLM training.
We've also added optimized Post-Training kernels that deliver **up to 80% memory savings** for alignment and distillation tasks. We support losses like DPO, CPO, ORPO, SimPO, JSD, and many more. Check out [how we optimize the memory](https://x.com/hsu_byron/status/1866577403918917655).
## Supercharge Your Model with Liger Kernel
With one line of code, Liger Kernel can increase throughput by more than 20% and reduce memory usage by 60%, thereby enabling longer context lengths, larger batch sizes, and massive vocabularies.
| Speed Up | Memory Reduction |
|--------------------------|-------------------------|
| ![Speed up](https://raw.githubusercontent.com/linkedin/Liger-Kernel/main/docs/images/e2e-tps.png) | ![Memory](https://raw.githubusercontent.com/linkedin/Liger-Kernel/main/docs/images/e2e-memory.png) |
> **Note:**
> - Benchmark conditions: LLaMA 3-8B, Batch Size = 8, Data Type = `bf16`, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 8 A100s.
> - Hugging Face models start to OOM at a 4K context length, whereas Hugging Face + Liger Kernel scales up to 16K.
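For example, patching a Llama model before it is instantiated (a minimal sketch; `apply_liger_kernel_to_llama` is one of the model-specific patching APIs, and the checkpoint name is illustrative):

```python
import transformers
from liger_kernel.transformers import apply_liger_kernel_to_llama

# The one line: monkey patch Llama's RMSNorm, RoPE, SwiGLU, and cross entropy
# with Liger's fused Triton kernels. Call this before loading the model.
apply_liger_kernel_to_llama()

model = transformers.AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
```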
## Optimize Post Training with Liger Kernel
<p align="center">
<img src="https://raw.githubusercontent.com/linkedin/Liger-Kernel/main/docs/images/post-training.png" width="50%" alt="Post Training">
</p>
We provide optimized post-training kernels like DPO, ORPO, SimPO, and more, which can reduce memory usage by up to 80%. You can easily use them as Python modules.
```python
from liger_kernel.chunked_loss import LigerFusedLinearORPOLoss
orpo_loss = LigerFusedLinearORPOLoss()
y = orpo_loss(lm_head.weight, x, target)
```
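Under the hood, these chunked-loss modules fuse the final linear projection (`lm_head`) with the loss computation and process the logits chunk by chunk, so the full vocabulary-sized logit tensor is never materialized; that is where the memory savings come from, without any approximation.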
### Key Features
- **Ease of use:** Simply patch your Hugging Face model with one line of code, or compose your own model using our Liger Kernel modules.
- **Time and memory efficient:** In the same spirit as Flash-Attn, but for layers like **RMSNorm**, **RoPE**, **SwiGLU**, and **CrossEntropy**! Increases multi-GPU training throughput by 20% and reduces memory usage by 60% with **kernel fusion**, **in-place replacement**, and **chunking** techniques.
- **Exact:** Computation is exact—no approximations! Both forward and backward passes are implemented with rigorous unit tests and undergo convergence testing against training runs without Liger Kernel to ensure accuracy.
- **Lightweight:** Liger Kernel has minimal dependencies, requiring only Torch and Triton—no extra libraries needed! Say goodbye to dependency headaches!
- **Multi-GPU supported:** Compatible with multi-GPU setups (PyTorch FSDP, DeepSpeed, DDP, etc.).
- **Trainer Framework Integration**: [Axolotl](https://github.com/axolotl-ai-cloud/axolotl), [LLaMa-Factory](https://github.com/hiyouga/LLaMA-Factory), [SFTTrainer](https://github.com/huggingface/trl/releases/tag/v0.10.1), [Hugging Face Trainer](https://github.com/huggingface/transformers/pull/32860), [SWIFT](https://github.com/modelscope/ms-swift)
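For the patching route, there is also an `AutoModel` wrapper; a minimal sketch (the checkpoint path is a placeholder):

```python
from liger_kernel.transformers import AutoLigerKernelForCausalLM

# Automatically applies the matching Liger kernels when the model type is supported.
model = AutoLigerKernelForCausalLM.from_pretrained("path/to/some/model")
```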
### Installation
To install the stable version:
```bash
$ pip install liger-kernel
```
To install the nightly version:
```bash
$ pip install liger-kernel-nightly
```
To install from source:
```bash
git clone https://github.com/linkedin/Liger-Kernel.git
cd Liger-Kernel
# Install Default Dependencies
# Setup.py will detect whether you are using AMD or NVIDIA
pip install -e .
# Setup Development Dependencies
pip install -e ".[dev]"
```
!!! Note " Dependencies "
#### CUDA
- `torch >= 2.1.2`
- `triton >= 2.3.0`
#### ROCm
- `torch >= 2.5.0` Install according to the instruction in Pytorch official webpage.
- `triton >= 3.0.0` Install from pypi. (e.g. `pip install triton==3.0.0`)
!!!Tip "Optional Dependencies "
- `transformers >= 4.x`: Required if you plan to use the transformers models patching APIs. The specific model you are working will dictate the minimum version of transformers.
!!! Note

    Our kernels inherit the full spectrum of hardware compatibility offered by [Triton](https://github.com/triton-lang/triton).
### Sponsorship and Collaboration
- [AMD](https://www.amd.com/en.html): Providing AMD GPUs for our AMD CI.
- [Intel](https://www.intel.com/): Providing Intel GPUs for our Intel CI.
- [Modal](https://modal.com/): Free 3000 credits from GPU MODE IRL for our NVIDIA CI.
- [EmbeddedLLM](https://embeddedllm.com/): Making Liger Kernel run fast and stable on AMD.
- [HuggingFace](https://huggingface.co/): Integrating Liger Kernel into Hugging Face Transformers and TRL.
- [Lightning AI](https://lightning.ai/): Integrating Liger Kernel into Lightning Thunder.
- [Axolotl](https://axolotl.ai/): Integrating Liger Kernel into Axolotl.
- [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory): Integrating Liger Kernel into Llama-Factory.
!!! Note " Contact "
- For issues, create a Github ticket in this repository .
- For open discussion, join [our discord channel](https://discord.gg/gpumode) .
- For formal collaboration, send an email to byhsu@linkedin.com .
### Cite this work
BibTeX entry:
```bib
@inproceedings{
hsu2025ligerkernel,
title={Liger-Kernel: Efficient Triton Kernels for {LLM} Training},
author={Pin-Lun Hsu and Yun Dai and Vignesh Kothapalli and Qingquan Song and Shao Tang and Siyu Zhu and Steven Shimizu and Shivam Sahni and Haowen Ning and Yanning Chen and Zhipeng Wang},
booktitle={Championing Open-source DEvelopment in ML Workshop @ ICML25},
year={2025},
url={https://openreview.net/forum?id=36SjAIT42G}
}
```
### Star History
[![Star History Chart](https://api.star-history.com/svg?repos=linkedin/Liger-Kernel&type=Date)](https://star-history.com/#linkedin/Liger-Kernel&Date)
<p align="right" style="font-size: 14px; color: #555; margin-top: 20px;">
<a href="#readme-top" style="text-decoration: none; color: #007bff; font-weight: bold;">
↑ Back to Top ↑
</a>
</p>
This project is licensed under the [BSD 2-Clause](https://github.com/linkedin/Liger-Kernel/blob/main/LICENSE) License (see `LICENSE` for details).
It also includes components from projects licensed under:
- Apache License 2.0 (see `LICENSE-APACHE-2.0` for details).
- MIT License (see `LICENSE-MIT-AutoAWQ` for details).
- MIT License (see `LICENSE-MIT-Efficient Cross Entropy` for details).
- MIT License (see `LICENSE-MIT-llmc` for details).
- MIT License (see `LICENSE-MIT-triton` for details).
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
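# Hedged usage note (assumptions: this file is saved as fsdp_config.yaml, and the
# training script name below is a placeholder for this example's ORPO script):
#   accelerate launch --config_file fsdp_config.yaml run_orpo.py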
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from trl import ORPOConfig

from liger_kernel.transformers.trainer import LigerORPOTrainer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    max_length=512,
    padding="max_length",
)
tokenizer.pad_token = tokenizer.eos_token

train_dataset = load_dataset("trl-lib/tldr-preference", split="train")

training_args = ORPOConfig(
    output_dir="Llama3.2_1B_Instruct",
    beta=0.1,
    max_length=128,
    per_device_train_batch_size=32,
    max_steps=100,
    save_strategy="no",
)

trainer = LigerORPOTrainer(model=model, args=training_args, tokenizer=tokenizer, train_dataset=train_dataset)
trainer.train()
# Liger-Kernel Example with HuggingFace Trainer
## How to Run
### Locally on a GPU machine
You can run the example locally on a GPU machine. The default hyperparameters and configurations work on a single node with 4xA100 80GB GPUs.
```bash
pip install -r requirements.txt
sh run_{MODEL}.sh
```
### Remotely on Modal
If you do not have access to a GPU machine, you can run the example on Modal. Modal is a serverless platform that allows you to run your code on a remote GPU machine. You can sign up for a free account at [Modal](https://www.modal.com/).
```bash
pip install modal
modal setup # authenticate with Modal
modal run launch_on_modal.py --script "run_qwen2_vl.sh"
```
**Notes**
1. This example uses an optional `use_liger` flag. If `True`, it applies a one-line monkey patch to enable the Liger kernel (see the sketch after this list).
2. The example uses the Llama3 model, which requires a community license agreement and a Hugging Face Hub login. If you want to use Llama3 in this example, please make sure you have:
   * Accepted the community license agreement at https://huggingface.co/meta-llama/Meta-Llama-3-8B
   * Run `huggingface-cli login` and entered your Hugging Face token
3. The default hyperparameters and configurations work on a single node with 4xA100 80GB GPUs. For running on a device with less GPU RAM, please consider reducing the per-GPU batch size and/or enabling `CPUOffload` in FSDP.
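A hedged sketch of the flag's effect, using the model-specific patch API from `liger_kernel.transformers` (one plausible wiring; the actual `training.py` may patch differently):

```python
from liger_kernel.transformers import apply_liger_kernel_to_llama

if use_liger:
    # Monkey patch Hugging Face's Llama modeling code with Liger's fused Triton
    # kernels (RMSNorm, RoPE, SwiGLU, cross entropy) before loading the model.
    apply_liger_kernel_to_llama()
```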
## Benchmark Result
### LLaMA
Benchmark conditions: LLaMA 3-8B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.
Throughput improves by around 20%, while GPU memory usage drops by 40%. This allows you to train the model on smaller GPUs, use larger batch sizes, or handle longer sequence lengths without incurring additional costs.
![Throughput](img/llama_tps.png)
![GPU Memory Allocated](img/llama_mem_alloc.png)
### QWEN
Benchmark conditions: Qwen2-7B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.
Throughput improves by around 10%, while GPU memory usage drops by 50%.
![Throughput](img/qwen_tps.png)
![GPU Memory Allocated](img/qwen_mem_alloc.png)
### GEMMA 7B
Benchmark conditions: Gemma-7B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.
Throughput improves by around 24%, while GPU memory usage drops by 33%.
![Throughput](img/gemma_7b_tp.png)
![GPU Memory Allocated](img/gemma_7b_mem.png)
import time
from dataclasses import dataclass
import torch
import transformers
from transformers import TrainerControl
from transformers import TrainerState
from transformers import TrainingArguments
from liger_kernel.utils import infer_device
# https://simple.wikipedia.org/wiki/Byte
# For memory, we use the binary system (MiB)
M_BIN_UNIT = 2**20
# For metrics (TFLOPs), we use the decimal system
T_DEC_UNIT = 10**12
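# A quick worked example of the two unit systems (illustrative values, not part
# of the callback): bytes divided by M_BIN_UNIT give MiB, and raw FLOPs divided
# by T_DEC_UNIT give TFLOPs.
#   (3 * 2**30) / M_BIN_UNIT   == 3072.0  # 3 GiB expressed in MiB (binary)
#   (2.5 * 10**13) / T_DEC_UNIT == 25.0   # 25 trillion FLOPs in TFLOPs (decimal)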
def round_to_n_decimal(x, n):
    return round(x, n)


@dataclass
class Precision:
    """
    Precision is a dataclass to store the number of decimal points for each metric.
    """

    n_decimal_time: int
    n_decimal_memory: int
    n_decimal_TPS: int


@dataclass
class State:
    """
    State is a dataclass to store the internal state of the efficiency callback.
    """

    n_warmup_steps: int = 0
    total_peak_memory_allocated: float = float("-inf")
    total_peak_memory_reserved: float = float("-inf")
    step_start_time: float = 0.0
    elapsed_time: float = 0.0
    elapsed_step: int = 0
    step_start_tokens_seen: int = 0
    elapsed_tokens_seen: int = 0
    global_start_step: int = 0


@dataclass
class Time:
    """
    Time is a dataclass to store the time-related metrics.
    """

    step: int = 0
    step_time_sec: float = 0.0
    avg_step_time_sec: float = 0.0
    time_to_completion_sec: float = 0.0
    estimated_total_time_sec: float = 0.0


@dataclass
class Memory:
    """
    Memory is a dataclass to store the memory-related metrics.
    """

    step_peak_memory_allocated_MB: float = 0.0
    step_peak_memory_reserved_MB: float = 0.0
    total_peak_memory_allocated_MB: float = 0.0
    total_peak_memory_reserved_MB: float = 0.0


@dataclass
class TPS:
    """
    TPS is a dataclass to store the tokens-per-second metrics.
    """

    step_tokens_per_second: float = 0.0
    avg_tokens_per_second: float = 0.0


class EfficiencyCallback(transformers.TrainerCallback):
    """
    EfficiencyCallback is a callback to track the efficiency of the training process.
    The tracked stats include: step time, memory, and throughput.

    It requires including `--include_num_input_tokens_seen` and `logging_steps=1` in the training arguments.

    Args:
        n_warmup_steps: number of warmup steps.
            The stats from the first n_warmup_steps are not added into the aggregated stats,
            because the first few steps might take longer due to JIT compilation and other
            initialization overheads.
        n_decimal_time: number of decimal points for time
        n_decimal_memory: number of decimal points for memory
        n_decimal_TPS: number of decimal points for TPS
    """

    def __init__(self, n_warmup_steps=2, n_decimal_time=2, n_decimal_memory=2, n_decimal_TPS=2):
        self.state = State(
            n_warmup_steps,
        )
        self.precision = Precision(n_decimal_time, n_decimal_memory, n_decimal_TPS)
        self.time = Time()
        self.memory = Memory()
        self.tps = TPS()
        self.device = infer_device()

    def on_init_end(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        """
        Event called at the end of the initialization of the [`Trainer`].
        """
        if not args.include_num_input_tokens_seen:
            raise Exception(
                'Please pass the training argument "--include_num_input_tokens_seen" to track tokens per second'
            )
        if args.logging_steps != 1:
            raise Exception("Please set logging_steps=1 to track the efficiency metrics accurately")

    def on_train_begin(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        # If loaded from a checkpoint, global_start_step is not 1 but state.global_step
        self.state.global_start_step = state.global_step

    def on_log(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        logs: dict[str, float],
        **kwargs,
    ):
        if state.global_step < (self.state.global_start_step + self.state.n_warmup_steps):
            return
        else:
            # Spread self.time, self.memory, and self.tps into logs
            logs.update(self.time.__dict__)
            logs.update(self.memory.__dict__)
            logs.update(self.tps.__dict__)

    def on_step_begin(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        """
        Event called at the beginning of a training step. If using gradient accumulation, one training step might take
        several inputs.
        """
        # memory
        getattr(torch, self.device).reset_peak_memory_stats()
        # time
        self.state.step_start_time = time.perf_counter()

    def on_step_end(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        if state.global_step < (self.state.global_start_step + self.state.n_warmup_steps):
            # tokens: the count at the end of the current step is the starting count for the next step
            self.state.step_start_tokens_seen = state.num_input_tokens_seen
            return

        # time
        current_time = time.perf_counter()
        step_time = current_time - self.state.step_start_time
        self.state.elapsed_time += step_time

        # step
        global_step = state.global_step
        self.state.elapsed_step += 1
        avg_step_time = self.state.elapsed_time / self.state.elapsed_step

        self.time.step = global_step
        self.time.step_time_sec = round_to_n_decimal(step_time, self.precision.n_decimal_time)
        self.time.avg_step_time_sec = round_to_n_decimal(avg_step_time, self.precision.n_decimal_time)
        self.time.time_to_completion_sec = round_to_n_decimal(
            avg_step_time * (state.max_steps - global_step),
            self.precision.n_decimal_time,
        )
        self.time.estimated_total_time_sec = round_to_n_decimal(
            avg_step_time * state.max_steps, self.precision.n_decimal_time
        )

        # memory
        step_peak_memory_allocated = getattr(torch, self.device).memory.max_memory_allocated()
        step_peak_memory_reserved = getattr(torch, self.device).memory.max_memory_reserved()

        self.memory.step_peak_memory_allocated_MB = round_to_n_decimal(
            step_peak_memory_allocated / M_BIN_UNIT, self.precision.n_decimal_memory
        )
        self.state.total_peak_memory_allocated = max(self.state.total_peak_memory_allocated, step_peak_memory_allocated)
        self.memory.total_peak_memory_allocated_MB = round_to_n_decimal(
            self.state.total_peak_memory_allocated / M_BIN_UNIT,
            self.precision.n_decimal_memory,
        )

        self.memory.step_peak_memory_reserved_MB = round_to_n_decimal(
            step_peak_memory_reserved / M_BIN_UNIT, self.precision.n_decimal_memory
        )
        self.state.total_peak_memory_reserved = max(self.state.total_peak_memory_reserved, step_peak_memory_reserved)
        self.memory.total_peak_memory_reserved_MB = round_to_n_decimal(
            self.state.total_peak_memory_reserved / M_BIN_UNIT,
            self.precision.n_decimal_memory,
        )

        # tokens
        step_tokens_seen = state.num_input_tokens_seen - self.state.step_start_tokens_seen
        self.state.elapsed_tokens_seen += step_tokens_seen
        self.tps.step_tokens_per_second = round_to_n_decimal(
            step_tokens_seen / step_time,
            self.precision.n_decimal_TPS,
        )
        self.tps.avg_tokens_per_second = round_to_n_decimal(
            self.state.elapsed_tokens_seen / self.state.elapsed_time,
            self.precision.n_decimal_TPS,
        )

        # tokens: the count at the end of the current step is the starting count for the next step
        self.state.step_start_tokens_seen = state.num_input_tokens_seen
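
# Hedged usage sketch (not part of the callback): wiring EfficiencyCallback into a
# Hugging Face Trainer. `model` and `train_dataset` are placeholders; the two
# training arguments below are the ones on_init_end checks for.
#
#     args = TrainingArguments(
#         output_dir="out",
#         logging_steps=1,                     # required by EfficiencyCallback
#         include_num_input_tokens_seen=True,  # required to compute tokens/sec
#     )
#     trainer = transformers.Trainer(
#         model=model,
#         args=args,
#         train_dataset=train_dataset,
#         callbacks=[EfficiencyCallback(n_warmup_steps=2)],
#     )
#     trainer.train()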
{
    "backward_prefetch": "backward_pre",
    "forward_prefetch": true,
    "activation_checkpointing": true
}
"""
launch_on_modal.py
This tool is designed to launch scripts using Modal.
It sets up the necessary environment, including GPU resources and Python dependencies,
and executes the specified training script remotely.
### Setup and Usage
```bash
pip install modal
modal setup # authenticate with Modal
export HF_TOKEN="your_huggingface_token" # if using a gated model such as llama3
modal run launch_on_modal.py --script "run_qwen2_vl.sh"
```
### Caveats
This tool is intended as an easy on-ramp to using Liger-Kernel for fine-tuning LLMs and
VLMs - it is a reproducible way to run benchmarks and example scripts. However, it is not
the best way to develop a model on Modal, as it re-downloads the model and dataset each
time it is run. For iterative development, consider using `modal.Volume` to cache the
model and dataset between runs.
"""
import os
import modal
from modal import gpu
TWO_HOURS = 2 * 60 * 60
SIXTEEN_GB = 16 * 1024
app = modal.App("liger-example")
image = modal.Image.debian_slim().pip_install_from_requirements("requirements.txt").copy_local_dir(".", "/root")
if "HF_TOKEN" not in os.environ:
print("HF_TOKEN not found in environment variables, using an empty token.")
hf_token_secret = modal.Secret.from_dict({"HF_TOKEN": os.environ.get("HF_TOKEN", "")})
@app.function(
    gpu=gpu.A100(count=4, size="80GB"),
    image=image,
    timeout=TWO_HOURS,
    memory=SIXTEEN_GB,
    secrets=[hf_token_secret],
)
def launch_script(script: str):
    import subprocess

    script_path = f"/root/{script}"
    os.chmod(script_path, 0o755)  # make the script executable
    print(f"Running script: {script_path}")
    subprocess.run([script_path], check=True, cwd="/root", env=os.environ.copy())
@app.local_entrypoint()
def main(script: str):
    """
    Launch a script remotely on Modal.

    ```bash
    export HF_TOKEN="your_huggingface_token"  # if using a gated model such as llama3
    modal run --detach launch_on_modal.py --script "run_qwen2_vl.sh"
    ```
    """
    launch_script.remote(script=script)
transformers==4.45.2
trl
liger-kernel
triton
torch
torchvision
#!/bin/bash
## Benchmarking Script
## Runs the training script with different configurations and logs the results
MODEL_TYPE="mistral"
MODEL_PATH="mistralai/Mistral-7B-v0.1"
USE_LIGER_VALUES=("True" "False")
BATCH_SIZE_VALUES=(64 128 192)
NUM_REP=5
MAX_STEPS=20
DATASET_PATH="tatsu-lab/alpaca"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
mkdir -p "${SCRIPT_DIR}/results"
for USE_LIGER in "${USE_LIGER_VALUES[@]}"; do
for BATCH_SIZE in "${BATCH_SIZE_VALUES[@]}"; do
echo "Running with use_liger=$USE_LIGER and batch_size=$BATCH_SIZE"
for ((i=1; i<=NUM_REP; i++)); do
LOG_FILE="${SCRIPT_DIR}/results/${MODEL_TYPE}_use_liger_${USE_LIGER}_batch_size_${BATCH_SIZE}_rep_${i}.log"
torchrun --nnodes=1 --nproc-per-node=4 training.py \
--bf16 \
--num_train_epochs 1 \
--max_steps $MAX_STEPS \
--model_name $MODEL_PATH \
--dataset $DATASET_PATH \
--per_device_train_batch_size $BATCH_SIZE \
--per_device_eval_batch_size 16 \
--eval_strategy "no" \
--save_strategy "no" \
--learning_rate 6e-6 \
--weight_decay 0.05 \
--warmup_ratio 0.1 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--include_num_input_tokens_seen \
--report_to none \
--fsdp "full_shard auto_wrap" \
--fsdp_config config/fsdp_config.json \
--seed 42 \
--use_liger $USE_LIGER \
--output_dir model_output_dir \
> $LOG_FILE
sleep 5
done
done
done
#!/bin/bash
torchrun --nnodes=1 --nproc-per-node=4 training.py \
--model_name "google/gemma-7b-it" \
--bf16 \
--max_steps 20 \
--per_device_train_batch_size 24 \
--per_device_eval_batch_size 1 \
--eval_strategy "no" \
--save_strategy "no" \
--learning_rate 6e-6 \
--weight_decay 0.05 \
--warmup_ratio 0.1 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--include_num_input_tokens_seen \
--report_to none \
--fsdp "full_shard auto_wrap" \
--fsdp_config config/fsdp_config.json \
--seed 42 \
--use_liger True \
--output_dir alpaca_finetuning
#!/bin/bash
torchrun --nnodes=1 --nproc-per-node=4 training.py \
--bf16 \
--num_train_epochs 1 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--eval_strategy "no" \
--save_strategy "no" \
--learning_rate 6e-6 \
--weight_decay 0.05 \
--warmup_ratio 0.1 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--include_num_input_tokens_seen \
--report_to none \
--fsdp "full_shard auto_wrap" \
--fsdp_config config/fsdp_config.json \
--seed 42 \
--use_liger True \
--output_dir alpaca_finetuning
#!/bin/bash
torchrun --nnodes=1 --nproc-per-node=4 training.py \
--model_name "Qwen/Qwen2-7B" \
--bf16 \
--num_train_epochs 1 \
--per_device_train_batch_size 48 \
--per_device_eval_batch_size 64 \
--eval_strategy "no" \
--save_strategy "no" \
--learning_rate 6e-6 \
--weight_decay 0.05 \
--warmup_ratio 0.1 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--include_num_input_tokens_seen \
--report_to none \
--fsdp "full_shard auto_wrap" \
--fsdp_config config/fsdp_config.json \
--seed 42 \
--use_liger True \
--output_dir alpaca_finetuning