# Llama Models
## Table of contents
- [1. Overview](#1-overview)
- [2. Prerequisites](#2-prerequisites)
- [3. Training Setup](#3-training-setup)
- [4. Configuration](#4-configuration)
- [5. Test Datasets](#5-test-datasets)
- [6. FP8 Training Considerations](#6-fp8-training-considerations)
## 1. Overview
Train Llama models using FP8 precision with Megatron-Core.
## 2. Prerequisites
```bash
# Clone repository
export HOST_MEGATRON_LM_DIR="/path/to/your/host/megatron-lm"
git clone https://github.com/NVIDIA/Megatron-LM.git "$HOST_MEGATRON_LM_DIR"
cd "$HOST_MEGATRON_LM_DIR"
git checkout "core_r0.12.0"
# Set paths (create them up front so the bind mounts and log files below resolve)
export HOST_CHECKPOINT_PATH="./checkpoints/llama3_8b_fp8"
export HOST_TENSORBOARD_LOGS_PATH="./tensorboard_logs/llama3_8b_fp8"
mkdir -p "$HOST_CHECKPOINT_PATH" "$HOST_TENSORBOARD_LOGS_PATH"
# Optional: For real data
# export HOST_TOKENIZER_MODEL_PATH="/path/to/host/tokenizer.model"
# export HOST_DATA_PREFIX="/path/to/host/mydata_prefix"
```
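FP8 training requires Hopper-, Ada-, or Blackwell-class GPUs (see section 6). On a reasonably recent driver you can confirm the GPU model and compute capability from the host before launching anything:
```bash
# FP8 requires Hopper, Ada, or Blackwell GPUs (e.g., H100 reports compute capability 9.0)
nvidia-smi --query-gpu=name,compute_cap --format=csv
```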
## 3. Training Setup
### Using Mock Data
```bash
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:25.03-py3"
docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \
-v "${HOST_MEGATRON_LM_DIR}:/workspace/megatron-lm" \
-v "${HOST_CHECKPOINT_PATH}:/workspace/checkpoints" \
-v "${HOST_TENSORBOARD_LOGS_PATH}:/workspace/tensorboard_logs" \
--workdir /workspace/megatron-lm \
$PYTORCH_IMAGE \
bash examples/llama/train_llama3_8b_h100_fp8.sh \
/workspace/checkpoints \
/workspace/tensorboard_logs \
2>&1 | tee "${HOST_TENSORBOARD_LOGS_PATH}/training_mock_$(date +'%y-%m-%d_%H-%M-%S').log"
```
### Using Custom Data and Tokenizer
```bash
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:25.03-py3"
docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \
-v "${HOST_MEGATRON_LM_DIR}:/workspace/megatron-lm" \
-v "${HOST_CHECKPOINT_PATH}:/workspace/checkpoints" \
-v "${HOST_TENSORBOARD_LOGS_PATH}:/workspace/tensorboard_logs" \
-v "${HOST_TOKENIZER_MODEL_PATH}:/workspace/tokenizer_model" \
-v "$(dirname "${HOST_DATA_PREFIX}"):/workspace/data_dir" \
--workdir /workspace/megatron-lm \
$PYTORCH_IMAGE \
bash examples/llama/train_llama3_8b_h100_fp8.sh \
/workspace/checkpoints \
/workspace/tensorboard_logs \
/workspace/tokenizer_model \
"/workspace/data_dir/$(basename "${HOST_DATA_PREFIX}")" \
2>&1 | tee "${HOST_TENSORBOARD_LOGS_PATH}/training_custom_$(date +'%y-%m-%d_%H-%M-%S').log"
```
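Training progress for either run can be monitored from the host by pointing TensorBoard at the mounted logs directory (this assumes the `tensorboard` CLI is installed on the host):
```bash
# Serve the training metrics written under the mounted logs path
tensorboard --logdir "${HOST_TENSORBOARD_LOGS_PATH}" --port 6006
```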
## 4. Configuration
Default parallelism strategy:

- Tensor Parallel: 1
- Pipeline Parallel: 1
- Context Parallel: 2

Llama-3-8B architecture:

- 32 layers
- Hidden size: 4096
- FFN hidden size: 14336
- Attention heads: 32
- Query groups: 8
- Sequence length: 8192
- RMSNorm normalization with SwiGLU and RoPE

Key training parameters:

- Micro-batch size: 1
- Global batch size: 128
- Learning rate: 1.5e-4
- Min learning rate: 1.0e-5
- Weight decay: 0.1
- FP8 format: hybrid

You can modify these parameters directly in the `train_llama3_8b_h100_fp8.sh` script.
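For reference, these values correspond to standard Megatron-LM (`pretrain_gpt.py`) command-line arguments of roughly the following shape. This is an illustrative sketch only; the actual `train_llama3_8b_h100_fp8.sh` may name, group, or derive the options differently:
```bash
# Illustrative flag names only; the script is the authoritative source
GPT_MODEL_ARGS=(
    --num-layers 32
    --hidden-size 4096
    --ffn-hidden-size 14336
    --num-attention-heads 32
    --group-query-attention
    --num-query-groups 8
    --seq-length 8192
    --normalization RMSNorm
    --swiglu
    --position-embedding-type rope
)
TRAINING_ARGS=(
    --micro-batch-size 1
    --global-batch-size 128
    --lr 1.5e-4
    --min-lr 1.0e-5
    --weight-decay 0.1
    --fp8-format hybrid
)
PARALLEL_ARGS=(
    --tensor-model-parallel-size 1
    --pipeline-model-parallel-size 1
    --context-parallel-size 2
)
```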
This configuration follows those defined in NeMo Framework's performance scripts, which can be found at [https://github.com/NVIDIA/NeMo/tree/main/scripts/performance](https://github.com/NVIDIA/NeMo/tree/main/scripts/performance).
### FP8 Performance
| Model | #-GPUs | GBS | MBS | Seq Length | TP | PP | CP | VP | EP | GA | Tokens/sec/GPU | TFLOP/sec/GPU |
|-------|--------|-----|-----|------------|----|----|----|----|----|----|----------------|---------------|
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 1 | 32 | 13812 | 800 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 8 | 1 | 5 | 1 | 64 | 1621 | 780 |
| LLAMA3-405B | 1024 | 512 | 1 | 8192 | 8 | 8 | 2 | 8 | 1 | 64 | 315 | 834 |
Legend:
- GBS: Global Batch Size
- MBS: Micro Batch Size
- TP: Tensor Parallel size
- PP: Pipeline Parallel size
- CP: Context Parallel size
- VP: Virtual Pipeline stages
- EP: Expert Parallel size
- GA: Gradient Accumulation steps (see the sketch below)
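
These columns are related: the data-parallel (DP) size is the GPU count divided by TP × PP × CP, and GA is the factor needed so that MBS × DP × GA equals GBS. A quick check of the LLAMA3-8B row:
```bash
# Derive data-parallel size (DP) and gradient-accumulation steps (GA)
# for the LLAMA3-8B row above.
NUM_GPUS=8; TP=1; PP=1; CP=2; GBS=128; MBS=1
DP=$(( NUM_GPUS / (TP * PP * CP) ))   # 8 / 2 = 4
GA=$(( GBS / (MBS * DP) ))            # 128 / 4 = 32, matching the table
echo "DP=${DP} GA=${GA}"
```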

NeMo is built on Megatron-Core; for the latest performance benchmarks, refer to the official [NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance_summary.html).
## 5. Test Datasets
Recommended datasets:
1. **WikiText-103**: https://huggingface.co/datasets/Salesforce/wikitext
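
`tools/preprocess_data.py` expects loose JSON input: one document per line, each line a JSON object with a `text` field (the field name can be changed with `--json-keys`). If your dataset arrives in another format, convert it to this layout first. The snippet below writes a tiny, hypothetical two-document file purely to smoke-test the pipeline:
```bash
# Write a tiny sample in the expected one-JSON-object-per-line layout
cat > your_dataset.json <<'EOF'
{"text": "First example document. Replace with real corpus text."}
{"text": "Second example document. One JSON object per line."}
EOF
```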
Then preprocess the dataset into Megatron's indexed binary format:
```bash
python "${HOST_MEGATRON_LM_DIR}/tools/preprocess_data.py" \
--input your_dataset.json \
--output-prefix test_dataset \
--tokenizer-type HuggingFaceTokenizer \
--tokenizer-model /path/to/tokenizer.model \
--append-eod
# Writes test_dataset_text_document.bin/.idx; point HOST_DATA_PREFIX at the
# prefix ".../test_dataset_text_document" (no extension) when launching training.
```
## 6. FP8 Training Considerations
- **Hardware**: Requires NVIDIA Hopper, Ada, or Blackwell GPUs for FP8 support
- **Troubleshooting**: If you encounter NaN losses or other instability during FP8 training, refer to [Transformer Engine](https://github.com/NVIDIA/TransformerEngine) for FP8-specific debugging guidance. A standalone sanity check is sketched below.
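
As a first isolation step, it can help to confirm that FP8 execution itself is healthy on your GPUs outside of Megatron-LM. The snippet below is a minimal, hypothetical check using Transformer Engine's public PyTorch API (`te.Linear`, `fp8_autocast`, `DelayedScaling`); it is not part of the training scripts. The `hybrid` format matches the training configuration: E4M3 for the forward pass, E5M2 for gradients.
```bash
# Standalone FP8 smoke test (run inside the same PyTorch container)
python - <<'EOF'
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

# One FP8 linear layer with the same hidden size as Llama-3-8B.
layer = te.Linear(4096, 4096, bias=True)
inp = torch.randn(16, 4096, device="cuda")

# "hybrid" = E4M3 forward, E5M2 backward, as in --fp8-format hybrid.
fp8_recipe = recipe.DelayedScaling(fp8_format=recipe.Format.HYBRID)

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = layer(inp)

out.sum().backward()
print("NaNs in output:", torch.isnan(out).any().item())
EOF
```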