README.md

# Liger-Kernel Example with HuggingFace Trainer

## How to Run

### Locally on a GPU machine
You can run the example locally on a GPU machine. The default hyperparameters and configurations work on single node with 4xA100 80GB GPUs.

```bash
pip install -r requirements.txt
sh run_{MODEL}.sh
```

### Remotely on Modal
If you do not have access to a GPU machine, you can run the example on Modal. Modal is a serverless platform that allows you to run your code on a remote GPU machine. You can sign up for a free account at [Modal](https://www.modal.com/).

```bash
pip install modal
modal setup  # authenticate with Modal
modal run launch_on_modal.py --script "run_qwen2_vl.sh"
```

**Notes**
1. This example uses an optional `use_liger` flag. If true, it does a 1 line monkey patch to apply liger kernel.
2. The example uses Llama3 model that requires community license agreement and HuggingFace Hub login. If you want to use Llama3 in this example, please make sure you have done the followings:
    * Agree on the community license agreement https://huggingface.co/meta-llama/Meta-Llama-3-8B
    * Run `huggingface-cli login` and enter your HuggingFace token
3. The default hyperparameters and configurations work on single node with 4xA100 80GB GPUs. For running on device with less GPU RAM, please consider reducing the per-GPU batch size and/or enable `CPUOffload` in FSDP.


## Benchmark Result

### LLaMA
Benchmark conditions: LLaMA 3-8B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.

Throughput improves by around 20%, while GPU memory usage drops by 40%. This allows you to train the model on smaller GPUs, use larger batch sizes, or handle longer sequence lengths without incurring additional costs.

![Throughput](img/llama_tps.png)
![GPU Memory Allocated](img/llama_mem_alloc.png)

### QWEN
Benchmark conditions: Qwen2-7B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.

Throughput improves by around 10%, while GPU memory usage drops by 50%.

![Throughput](img/qwen_tps.png)
![GPU Memory Allocated](img/qwen_mem_alloc.png)


### GEMMA 7B
Benchmark conditions: Gemma-7B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.

Throughput improves by around 24%, while GPU memory usage drops by 33%.

![Throughput](img/gemma_7b_mem.png)
![GPU Memory Allocated](img/gemma_7b_tp.png)