<a name="readme-top"></a>
# Liger Kernel: Efficient Triton Kernels for LLM Training
<table style="width: 100%; text-align: center; border-collapse: collapse;">
<tr>
<th style="padding: 10px;" colspan="2">Stable</th>
<th style="padding: 10px;" colspan="2">Nightly</th>
<th style="padding: 10px;">Discord</th>
<th style="padding: 10px;">Build</th>
</tr>
<tr>
<td style="padding: 10px;">
<a href="https://pepy.tech/project/liger-kernel">
<img src="https://static.pepy.tech/badge/liger-kernel" alt="Downloads (Stable)">
</a>
</td>
<td style="padding: 10px;">
<a href="https://pypi.org/project/liger-kernel">
<img alt="PyPI - Version" src="https://img.shields.io/pypi/v/liger-kernel?color=green">
</a>
</td>
<td style="padding: 10px;">
<a href="https://pepy.tech/project/liger-kernel-nightly">
<img src="https://static.pepy.tech/badge/liger-kernel-nightly" alt="Downloads (Nightly)">
</a>
</td>
<td style="padding: 10px;">
<a href="https://pypi.org/project/liger-kernel-nightly">
<img alt="PyPI - Version" src="https://img.shields.io/pypi/v/liger-kernel-nightly?color=green">
</a>
</td>
<td style="padding: 10px;">
<a href="https://discord.gg/gpumode">
<img src="https://dcbadge.vercel.app/api/server/gpumode?style=flat" alt="Join Our Discord">
</a>
</td>
<td style="padding: 10px;">
<div style="display: block;">
<a href="https://github.com/linkedin/Liger-Kernel/actions/workflows/nvi-ci.yml">
<img src="https://github.com/linkedin/Liger-Kernel/actions/workflows/nvi-ci.yml/badge.svg?event=schedule" alt="Build">
</a>
</div>
<div style="display: block;">
<a href="https://github.com/linkedin/Liger-Kernel/actions/workflows/amd-ci.yml">
<img src="https://github.com/linkedin/Liger-Kernel/actions/workflows/amd-ci.yml/badge.svg?event=schedule" alt="Build">
</a>
</div>
</td>
</tr>
</table>
<img src="https://raw.githubusercontent.com/linkedin/Liger-Kernel/main/docs/images/logo-banner.png">
**Liger Kernel** is a collection of Triton kernels designed specifically for LLM training. It can effectively increase multi-GPU **training throughput by 20%** and reduce **memory usage by 60%**. We have implemented **Hugging Face Compatible** `RMSNorm`, `RoPE`, `SwiGLU`, `CrossEntropy`, `FusedLinearCrossEntropy`, and more to come. The kernels work out of the box with [Flash Attention](https://github.com/Dao-AILab/flash-attention), [PyTorch FSDP](https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html), and [Microsoft DeepSpeed](https://github.com/microsoft/DeepSpeed). We welcome contributions from the community to gather the best kernels for LLM training.
We've also added optimized Post-Training kernels that deliver **up to 80% memory savings** for alignment and distillation tasks. We support losses like DPO, CPO, ORPO, SimPO, JSD, and many more. Check out [how we optimize the memory](https://x.com/hsu_byron/status/1866577403918917655).
## Supercharge Your Model with Liger Kernel
With one line of code, Liger Kernel can increase throughput by more than 20% and reduce memory usage by 60%, thereby enabling longer context lengths, larger batch sizes, and massive vocabularies.
| Speed Up | Memory Reduction |
|--------------------------|-------------------------|
| ![Speed up](https://raw.githubusercontent.com/linkedin/Liger-Kernel/main/docs/images/e2e-tps.png) | ![Memory](https://raw.githubusercontent.com/linkedin/Liger-Kernel/main/docs/images/e2e-memory.png) |
> **Note:**
> - Benchmark conditions: LLaMA 3-8B, Batch Size = 8, Data Type = `bf16`, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 8 A100s.
> - Hugging Face models start to OOM at a 4K context length, whereas Hugging Face + Liger Kernel scales up to 16K.
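For example, patching a Llama model before it is instantiated (a minimal sketch; `apply_liger_kernel_to_llama` is one of the model-specific patching APIs, and the checkpoint name is illustrative):

```python
import transformers
from liger_kernel.transformers import apply_liger_kernel_to_llama

# The one line: monkey patch Llama's RMSNorm, RoPE, SwiGLU, and cross entropy
# with Liger's fused Triton kernels. Call this before loading the model.
apply_liger_kernel_to_llama()

model = transformers.AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
```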
## Optimize Post Training with Liger Kernel
<p align="center">
<img src="https://raw.githubusercontent.com/linkedin/Liger-Kernel/main/docs/images/post-training.png" width="50%" alt="Post Training">
</p>
We provide optimized post-training kernels like DPO, ORPO, SimPO, and more, which can reduce memory usage by up to 80%. You can easily use them as Python modules.
```python
from liger_kernel.chunked_loss import LigerFusedLinearORPOLoss
orpo_loss = LigerFusedLinearORPOLoss()
y = orpo_loss(lm_head.weight, x, target)
```
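Under the hood, these chunked-loss modules fuse the final linear projection (`lm_head`) with the loss computation and process the logits chunk by chunk, so the full vocabulary-sized logit tensor is never materialized; that is where the memory savings come from, without any approximation.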
### Key Features
- **Ease of use:** Simply patch your Hugging Face model with one line of code, or compose your own model using our Liger Kernel modules.
- **Time and memory efficient:** In the same spirit as Flash-Attn, but for layers like **RMSNorm**, **RoPE**, **SwiGLU**, and **CrossEntropy**! Increases multi-GPU training throughput by 20% and reduces memory usage by 60% with **kernel fusion**, **in-place replacement**, and **chunking** techniques.
- **Exact:** Computation is exact—no approximations! Both forward and backward passes are implemented with rigorous unit tests and undergo convergence testing against training runs without Liger Kernel to ensure accuracy.
- **Lightweight:** Liger Kernel has minimal dependencies, requiring only Torch and Triton—no extra libraries needed! Say goodbye to dependency headaches!
- **Multi-GPU supported:** Compatible with multi-GPU setups (PyTorch FSDP, DeepSpeed, DDP, etc.).
- **Trainer Framework Integration**: [Axolotl](https://github.com/axolotl-ai-cloud/axolotl), [LLaMa-Factory](https://github.com/hiyouga/LLaMA-Factory), [SFTTrainer](https://github.com/huggingface/trl/releases/tag/v0.10.1), [Hugging Face Trainer](https://github.com/huggingface/transformers/pull/32860), [SWIFT](https://github.com/modelscope/ms-swift)
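For the patching route, there is also an `AutoModel` wrapper; a minimal sketch (the checkpoint path is a placeholder):

```python
from liger_kernel.transformers import AutoLigerKernelForCausalLM

# Automatically applies the matching Liger kernels when the model type is supported.
model = AutoLigerKernelForCausalLM.from_pretrained("path/to/some/model")
```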
### Installation
To install the stable version:
```bash
$ pip install liger-kernel
```
To install the nightly version:
```bash
$ pip install liger-kernel-nightly
```
To install from source:
```bash
git clone https://github.com/linkedin/Liger-Kernel.git
cd Liger-Kernel
# Install Default Dependencies
# Setup.py will detect whether you are using AMD or NVIDIA
pip install -e .
# Setup Development Dependencies
pip install -e ".[dev]"
```
!!! Note " Dependencies "
#### CUDA
- `torch >= 2.1.2`
- `triton >= 2.3.0`
#### ROCm
- `torch >= 2.5.0` Install according to the instruction in Pytorch official webpage.
- `triton >= 3.0.0` Install from pypi. (e.g. `pip install triton==3.0.0`)
!!!Tip "Optional Dependencies "
- `transformers >= 4.x`: Required if you plan to use the transformers models patching APIs. The specific model you are working will dictate the minimum version of transformers.
!!! Note

    Our kernels inherit the full spectrum of hardware compatibility offered by [Triton](https://github.com/triton-lang/triton).
### Sponsorship and Collaboration
- [AMD](https://www.amd.com/en.html): Providing AMD GPUs for our AMD CI.
- [Intel](https://www.intel.com/): Providing Intel GPUs for our Intel CI.
- [Modal](https://modal.com/): Free 3000 credits from GPU MODE IRL for our NVIDIA CI.
- [EmbeddedLLM](https://embeddedllm.com/): Making Liger Kernel run fast and stable on AMD.
- [HuggingFace](https://huggingface.co/): Integrating Liger Kernel into Hugging Face Transformers and TRL.
- [Lightning AI](https://lightning.ai/): Integrating Liger Kernel into Lightning Thunder.
- [Axolotl](https://axolotl.ai/): Integrating Liger Kernel into Axolotl.
- [Llama-Factory](https://github.com/hiyouga/LLaMA-Factory): Integrating Liger Kernel into Llama-Factory.
!!! Note " Contact "
- For issues, create a Github ticket in this repository .
- For open discussion, join [our discord channel](https://discord.gg/gpumode) .
- For formal collaboration, send an email to byhsu@linkedin.com .
### Cite this work
BibTeX entry:
```bib
@inproceedings{
hsu2025ligerkernel,
title={Liger-Kernel: Efficient Triton Kernels for {LLM} Training},
author={Pin-Lun Hsu and Yun Dai and Vignesh Kothapalli and Qingquan Song and Shao Tang and Siyu Zhu and Steven Shimizu and Shivam Sahni and Haowen Ning and Yanning Chen and Zhipeng Wang},
booktitle={Championing Open-source DEvelopment in ML Workshop @ ICML25},
year={2025},
url={https://openreview.net/forum?id=36SjAIT42G}
}
```
### Star History
[![Star History Chart](https://api.star-history.com/svg?repos=linkedin/Liger-Kernel&type=Date)](https://star-history.com/#linkedin/Liger-Kernel&Date)
<p align="right" style="font-size: 14px; color: #555; margin-top: 20px;">
<a href="#readme-top" style="text-decoration: none; color: #007bff; font-weight: bold;">
↑ Back to Top ↑
</a>
</p>
This project is licensed under the [BSD 2-Clause](https://github.com/linkedin/Liger-Kernel/blob/main/LICENSE) License (see `LICENSE` for details).
It also includes components from projects licensed under:
- Apache License 2.0 (see `LICENSE-APACHE-2.0` for details).
- MIT License (see `LICENSE-MIT-AutoAWQ` for details).
- MIT License (see `LICENSE-MIT-Efficient Cross Entropy` for details).
- MIT License (see `LICENSE-MIT-llmc` for details).
- MIT License (see `LICENSE-MIT-triton` for details).
compute_environment: LOCAL_MACHINE
debug: false
distributed_type: FSDP
downcast_bf16: 'no'
enable_cpu_affinity: false
fsdp_config:
  fsdp_activation_checkpointing: false
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: true
machine_rank: 0
main_training_function: main
num_machines: 1
num_processes: 8
rdzv_backend: static
same_network: true
tpu_env: []
tpu_use_cluster: false
tpu_use_sudo: false
use_cpu: false
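# Hedged usage note (assumptions: this file is saved as fsdp_config.yaml, and the
# training script name below is a placeholder for this example's ORPO script):
#   accelerate launch --config_file fsdp_config.yaml run_orpo.py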
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM
from transformers import AutoTokenizer
from trl import ORPOConfig

from liger_kernel.transformers.trainer import LigerORPOTrainer

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    max_length=512,
    padding="max_length",
)
tokenizer.pad_token = tokenizer.eos_token

train_dataset = load_dataset("trl-lib/tldr-preference", split="train")

training_args = ORPOConfig(
    output_dir="Llama3.2_1B_Instruct",
    beta=0.1,
    max_length=128,
    per_device_train_batch_size=32,
    max_steps=100,
    save_strategy="no",
)

trainer = LigerORPOTrainer(model=model, args=training_args, tokenizer=tokenizer, train_dataset=train_dataset)
trainer.train()
# Liger-Kernel Example with HuggingFace Trainer
## How to Run
### Locally on a GPU machine
You can run the example locally on a GPU machine. The default hyperparameters and configurations work on a single node with 4xA100 80GB GPUs.
```bash
pip install -r requirements.txt
sh run_{MODEL}.sh
```
### Remotely on Modal
If you do not have access to a GPU machine, you can run the example on Modal. Modal is a serverless platform that allows you to run your code on a remote GPU machine. You can sign up for a free account at [Modal](https://www.modal.com/).
```bash
pip install modal
modal setup # authenticate with Modal
modal run launch_on_modal.py --script "run_qwen2_vl.sh"
```
**Notes**
1. This example uses an optional `use_liger` flag. If `True`, it applies a one-line monkey patch to enable the Liger kernel (see the sketch after this list).
2. The example uses the Llama3 model, which requires a community license agreement and a Hugging Face Hub login. If you want to use Llama3 in this example, please make sure you have:
   * Accepted the community license agreement at https://huggingface.co/meta-llama/Meta-Llama-3-8B
   * Run `huggingface-cli login` and entered your Hugging Face token
3. The default hyperparameters and configurations work on a single node with 4xA100 80GB GPUs. For running on a device with less GPU RAM, please consider reducing the per-GPU batch size and/or enabling `CPUOffload` in FSDP.
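A hedged sketch of the flag's effect, using the model-specific patch API from `liger_kernel.transformers` (one plausible wiring; the actual `training.py` may patch differently):

```python
from liger_kernel.transformers import apply_liger_kernel_to_llama

if use_liger:
    # Monkey patch Hugging Face's Llama modeling code with Liger's fused Triton
    # kernels (RMSNorm, RoPE, SwiGLU, cross entropy) before loading the model.
    apply_liger_kernel_to_llama()
```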
## Benchmark Result
### LLaMA
Benchmark conditions: LLaMA 3-8B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.
Throughput improves by around 20%, while GPU memory usage drops by 40%. This allows you to train the model on smaller GPUs, use larger batch sizes, or handle longer sequence lengths without incurring additional costs.
![Throughput](img/llama_tps.png)
![GPU Memory Allocated](img/llama_mem_alloc.png)
### QWEN
Benchmark conditions: Qwen2-7B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.
Throughput improves by around 10%, while GPU memory usage drops by 50%.
![Throughput](img/qwen_tps.png)
![GPU Memory Allocated](img/qwen_mem_alloc.png)
### GEMMA 7B
Benchmark conditions: Gemma-7B, Alpaca Dataset, Max seq len = 512, Data Type = bf16, Optimizer = AdamW, Gradient Checkpointing = True, Distributed Strategy = FSDP1 on 4 A100s.
Throughput improves by around 24%, while GPU memory usage drops by 33%.
![Throughput](img/gemma_7b_tp.png)
![GPU Memory Allocated](img/gemma_7b_mem.png)
import time
from dataclasses import dataclass
import torch
import transformers
from transformers import TrainerControl
from transformers import TrainerState
from transformers import TrainingArguments
from liger_kernel.utils import infer_device
# https://simple.wikipedia.org/wiki/Byte
# For memory, we use the binary system (MiB)
M_BIN_UNIT = 2**20
# For metrics (TFLOPs), we use the decimal system
T_DEC_UNIT = 10**12
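# A quick worked example of the two unit systems (illustrative values, not part
# of the callback): bytes divided by M_BIN_UNIT give MiB, and raw FLOPs divided
# by T_DEC_UNIT give TFLOPs.
#   (3 * 2**30) / M_BIN_UNIT   == 3072.0  # 3 GiB expressed in MiB (binary)
#   (2.5 * 10**13) / T_DEC_UNIT == 25.0   # 25 trillion FLOPs in TFLOPs (decimal)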
def round_to_n_decimal(x, n):
    return round(x, n)


@dataclass
class Precision:
    """
    Precision is a dataclass to store the number of decimal points for each metric.
    """

    n_decimal_time: int
    n_decimal_memory: int
    n_decimal_TPS: int


@dataclass
class State:
    """
    State is a dataclass to store the internal state of the efficiency callback.
    """

    n_warmup_steps: int = 0
    total_peak_memory_allocated: float = float("-inf")
    total_peak_memory_reserved: float = float("-inf")
    step_start_time: float = 0.0
    elapsed_time: float = 0.0
    elapsed_step: int = 0
    step_start_tokens_seen: int = 0
    elapsed_tokens_seen: int = 0
    global_start_step: int = 0


@dataclass
class Time:
    """
    Time is a dataclass to store the time-related metrics.
    """

    step: int = 0
    step_time_sec: float = 0.0
    avg_step_time_sec: float = 0.0
    time_to_completion_sec: float = 0.0
    estimated_total_time_sec: float = 0.0


@dataclass
class Memory:
    """
    Memory is a dataclass to store the memory-related metrics.
    """

    step_peak_memory_allocated_MB: float = 0.0
    step_peak_memory_reserved_MB: float = 0.0
    total_peak_memory_allocated_MB: float = 0.0
    total_peak_memory_reserved_MB: float = 0.0


@dataclass
class TPS:
    """
    TPS is a dataclass to store the tokens-per-second metrics.
    """

    step_tokens_per_second: float = 0.0
    avg_tokens_per_second: float = 0.0


class EfficiencyCallback(transformers.TrainerCallback):
    """
    EfficiencyCallback is a callback to track the efficiency of the training process.
    The tracked stats include: step time, memory, and throughput.

    It requires including `--include_num_input_tokens_seen` and `logging_steps=1` in the training arguments.

    Args:
        n_warmup_steps: number of warmup steps.
            The stats from the first n_warmup_steps are not added into the aggregated stats,
            because the first few steps might take longer due to JIT compilation and other
            initialization overheads.
        n_decimal_time: number of decimal points for time
        n_decimal_memory: number of decimal points for memory
        n_decimal_TPS: number of decimal points for TPS
    """

    def __init__(self, n_warmup_steps=2, n_decimal_time=2, n_decimal_memory=2, n_decimal_TPS=2):
        self.state = State(
            n_warmup_steps,
        )
        self.precision = Precision(n_decimal_time, n_decimal_memory, n_decimal_TPS)
        self.time = Time()
        self.memory = Memory()
        self.tps = TPS()
        self.device = infer_device()

    def on_init_end(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        """
        Event called at the end of the initialization of the [`Trainer`].
        """
        if not args.include_num_input_tokens_seen:
            raise Exception(
                'Please pass the training argument "--include_num_input_tokens_seen" to track tokens per second'
            )
        if args.logging_steps != 1:
            raise Exception("Please set logging_steps=1 to track the efficiency metrics accurately")

    def on_train_begin(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        # If loaded from a checkpoint, global_start_step is not 1 but state.global_step
        self.state.global_start_step = state.global_step

    def on_log(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        logs: dict[str, float],
        **kwargs,
    ):
        if state.global_step < (self.state.global_start_step + self.state.n_warmup_steps):
            return
        else:
            # Spread self.time, self.memory, and self.tps into logs
            logs.update(self.time.__dict__)
            logs.update(self.memory.__dict__)
            logs.update(self.tps.__dict__)

    def on_step_begin(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        """
        Event called at the beginning of a training step. If using gradient accumulation, one training step might take
        several inputs.
        """
        # memory
        getattr(torch, self.device).reset_peak_memory_stats()
        # time
        self.state.step_start_time = time.perf_counter()

    def on_step_end(
        self,
        args: TrainingArguments,
        state: TrainerState,
        control: TrainerControl,
        **kwargs,
    ):
        if state.global_step < (self.state.global_start_step + self.state.n_warmup_steps):
            # tokens: the count at the end of the current step is the starting count for the next step
            self.state.step_start_tokens_seen = state.num_input_tokens_seen
            return

        # time
        current_time = time.perf_counter()
        step_time = current_time - self.state.step_start_time
        self.state.elapsed_time += step_time

        # step
        global_step = state.global_step
        self.state.elapsed_step += 1
        avg_step_time = self.state.elapsed_time / self.state.elapsed_step

        self.time.step = global_step
        self.time.step_time_sec = round_to_n_decimal(step_time, self.precision.n_decimal_time)
        self.time.avg_step_time_sec = round_to_n_decimal(avg_step_time, self.precision.n_decimal_time)
        self.time.time_to_completion_sec = round_to_n_decimal(
            avg_step_time * (state.max_steps - global_step),
            self.precision.n_decimal_time,
        )
        self.time.estimated_total_time_sec = round_to_n_decimal(
            avg_step_time * state.max_steps, self.precision.n_decimal_time
        )

        # memory
        step_peak_memory_allocated = getattr(torch, self.device).memory.max_memory_allocated()
        step_peak_memory_reserved = getattr(torch, self.device).memory.max_memory_reserved()

        self.memory.step_peak_memory_allocated_MB = round_to_n_decimal(
            step_peak_memory_allocated / M_BIN_UNIT, self.precision.n_decimal_memory
        )
        self.state.total_peak_memory_allocated = max(self.state.total_peak_memory_allocated, step_peak_memory_allocated)
        self.memory.total_peak_memory_allocated_MB = round_to_n_decimal(
            self.state.total_peak_memory_allocated / M_BIN_UNIT,
            self.precision.n_decimal_memory,
        )

        self.memory.step_peak_memory_reserved_MB = round_to_n_decimal(
            step_peak_memory_reserved / M_BIN_UNIT, self.precision.n_decimal_memory
        )
        self.state.total_peak_memory_reserved = max(self.state.total_peak_memory_reserved, step_peak_memory_reserved)
        self.memory.total_peak_memory_reserved_MB = round_to_n_decimal(
            self.state.total_peak_memory_reserved / M_BIN_UNIT,
            self.precision.n_decimal_memory,
        )

        # tokens
        step_tokens_seen = state.num_input_tokens_seen - self.state.step_start_tokens_seen
        self.state.elapsed_tokens_seen += step_tokens_seen
        self.tps.step_tokens_per_second = round_to_n_decimal(
            step_tokens_seen / step_time,
            self.precision.n_decimal_TPS,
        )
        self.tps.avg_tokens_per_second = round_to_n_decimal(
            self.state.elapsed_tokens_seen / self.state.elapsed_time,
            self.precision.n_decimal_TPS,
        )

        # tokens: the count at the end of the current step is the starting count for the next step
        self.state.step_start_tokens_seen = state.num_input_tokens_seen
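
# Hedged usage sketch (not part of the callback): wiring EfficiencyCallback into a
# Hugging Face Trainer. `model` and `train_dataset` are placeholders; the two
# training arguments below are the ones on_init_end checks for.
#
#     args = TrainingArguments(
#         output_dir="out",
#         logging_steps=1,                     # required by EfficiencyCallback
#         include_num_input_tokens_seen=True,  # required to compute tokens/sec
#     )
#     trainer = transformers.Trainer(
#         model=model,
#         args=args,
#         train_dataset=train_dataset,
#         callbacks=[EfficiencyCallback(n_warmup_steps=2)],
#     )
#     trainer.train()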
{
    "backward_prefetch": "backward_pre",
    "forward_prefetch": true,
    "activation_checkpointing": true
}
"""
launch_on_modal.py
This tool is designed to launch scripts using Modal.
It sets up the necessary environment, including GPU resources and Python dependencies,
and executes the specified training script remotely.
### Setup and Usage
```bash
pip install modal
modal setup # authenticate with Modal
export HF_TOKEN="your_huggingface_token" # if using a gated model such as llama3
modal run launch_on_modal.py --script "run_qwen2_vl.sh"
```
### Caveats
This tool is intended as an easy on-ramp to using Liger-Kernel for fine-tuning LLMs and
VLMs - it is a reproducible way to run benchmarks and example scripts. However, it is not
the best way to develop a model on Modal, as it re-downloads the model and dataset each
time it is run. For iterative development, consider using `modal.Volume` to cache the
model and dataset between runs.
"""
import os
import modal
from modal import gpu
TWO_HOURS = 2 * 60 * 60
SIXTEEN_GB = 16 * 1024
app = modal.App("liger-example")
image = modal.Image.debian_slim().pip_install_from_requirements("requirements.txt").copy_local_dir(".", "/root")
if "HF_TOKEN" not in os.environ:
print("HF_TOKEN not found in environment variables, using an empty token.")
hf_token_secret = modal.Secret.from_dict({"HF_TOKEN": os.environ.get("HF_TOKEN", "")})
@app.function(
    gpu=gpu.A100(count=4, size="80GB"),
    image=image,
    timeout=TWO_HOURS,
    memory=SIXTEEN_GB,
    secrets=[hf_token_secret],
)
def launch_script(script: str):
    import subprocess

    script_path = f"/root/{script}"
    os.chmod(script_path, 0o755)  # make the script executable
    print(f"Running script: {script_path}")
    subprocess.run([script_path], check=True, cwd="/root", env=os.environ.copy())
@app.local_entrypoint()
def main(script: str):
    """
    Launch a script remotely on Modal.

    ```bash
    export HF_TOKEN="your_huggingface_token"  # if using a gated model such as llama3
    modal run --detach launch_on_modal.py --script "run_qwen2_vl.sh"
    ```
    """
    launch_script.remote(script=script)
transformers==4.45.2
trl
liger-kernel
triton
torch
torchvision
#!/bin/bash
## Benchmarking Script
## Runs the training script with different configurations and logs the results
MODEL_TYPE="mistral"
MODEL_PATH="mistralai/Mistral-7B-v0.1"
USE_LIGER_VALUES=("True" "False")
BATCH_SIZE_VALUES=(64 128 192)
NUM_REP=5
MAX_STEPS=20
DATASET_PATH="tatsu-lab/alpaca"
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
mkdir -p "${SCRIPT_DIR}/results"
for USE_LIGER in "${USE_LIGER_VALUES[@]}"; do
for BATCH_SIZE in "${BATCH_SIZE_VALUES[@]}"; do
echo "Running with use_liger=$USE_LIGER and batch_size=$BATCH_SIZE"
for ((i=1; i<=NUM_REP; i++)); do
LOG_FILE="${SCRIPT_DIR}/results/${MODEL_TYPE}_use_liger_${USE_LIGER}_batch_size_${BATCH_SIZE}_rep_${i}.log"
torchrun --nnodes=1 --nproc-per-node=4 training.py \
--bf16 \
--num_train_epochs 1 \
--max_steps $MAX_STEPS \
--model_name $MODEL_PATH \
--dataset $DATASET_PATH \
--per_device_train_batch_size $BATCH_SIZE \
--per_device_eval_batch_size 16 \
--eval_strategy "no" \
--save_strategy "no" \
--learning_rate 6e-6 \
--weight_decay 0.05 \
--warmup_ratio 0.1 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--include_num_input_tokens_seen \
--report_to none \
--fsdp "full_shard auto_wrap" \
--fsdp_config config/fsdp_config.json \
--seed 42 \
--use_liger $USE_LIGER \
--output_dir model_output_dir \
> $LOG_FILE
sleep 5
done
done
done
#!/bin/bash
torchrun --nnodes=1 --nproc-per-node=4 training.py \
--model_name "google/gemma-7b-it" \
--bf16 \
--max_steps 20 \
--per_device_train_batch_size 24 \
--per_device_eval_batch_size 1 \
--eval_strategy "no" \
--save_strategy "no" \
--learning_rate 6e-6 \
--weight_decay 0.05 \
--warmup_ratio 0.1 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--include_num_input_tokens_seen \
--report_to none \
--fsdp "full_shard auto_wrap" \
--fsdp_config config/fsdp_config.json \
--seed 42 \
--use_liger True \
--output_dir alpaca_finetuning
#!/bin/bash
torchrun --nnodes=1 --nproc-per-node=4 training.py \
--bf16 \
--num_train_epochs 1 \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 64 \
--eval_strategy "no" \
--save_strategy "no" \
--learning_rate 6e-6 \
--weight_decay 0.05 \
--warmup_ratio 0.1 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--include_num_input_tokens_seen \
--report_to none \
--fsdp "full_shard auto_wrap" \
--fsdp_config config/fsdp_config.json \
--seed 42 \
--use_liger True \
--output_dir alpaca_finetuning
#!/bin/bash
torchrun --nnodes=1 --nproc-per-node=4 training.py \
--model_name "Qwen/Qwen2-7B" \
--bf16 \
--num_train_epochs 1 \
--per_device_train_batch_size 48 \
--per_device_eval_batch_size 64 \
--eval_strategy "no" \
--save_strategy "no" \
--learning_rate 6e-6 \
--weight_decay 0.05 \
--warmup_ratio 0.1 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--include_num_input_tokens_seen \
--report_to none \
--fsdp "full_shard auto_wrap" \
--fsdp_config config/fsdp_config.json \
--seed 42 \
--use_liger True \
--output_dir alpaca_finetuning