# Llama Models
## Table of contents
- [1. Overview](#1-overview)
- [2. Prerequisites](#2-prerequisites)
- [3. Training Setup](#3-training-setup)
- [4. Configuration](#4-configuration)
- [5. Test Datasets](#5-test-datasets)
- [6. FP8 Training Considerations](#6-fp8-training-considerations)
## 1. Overview
<a id="overview" name="overview"></a>
Train Llama models using FP8 precision with Megatron-Core.
## 2. Prerequisites
<a id="prerequisites" name="prerequisites"></a>
```bash
# Clone repository
export HOST_MEGATRON_LM_DIR="/path/to/your/host/megatron-lm"
git clone https://github.com/NVIDIA/Megatron-LM.git "$HOST_MEGATRON_LM_DIR"
cd "$HOST_MEGATRON_LM_DIR"
git checkout "core_r0.12.0"
# Set paths
export HOST_CHECKPOINT_PATH="${PWD}/checkpoints/llama3_8b_fp8"   # Docker bind mounts require absolute host paths
export HOST_TENSORBOARD_LOGS_PATH="${PWD}/tensorboard_logs/llama3_8b_fp8"
# Optional: For real data
# export HOST_TOKENIZER_MODEL_PATH="/path/to/host/tokenizer.model"
# export HOST_DATA_PREFIX="/path/to/host/mydata_prefix"
```
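The bind mounts below expect these host directories to exist; creating them up front (a small convenience step, not part of the original instructions) avoids Docker creating them as root-owned directories:

```bash
# Create the host-side output directories before mounting them into the container.
mkdir -p "$HOST_CHECKPOINT_PATH" "$HOST_TENSORBOARD_LOGS_PATH"
```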
## 3. Training Setup
<a id="training-setup" name="training-setup"></a>
### Using Mock Data
```bash
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:25.03-py3"
docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \
-v "${HOST_MEGATRON_LM_DIR}:/workspace/megatron-lm" \
-v "${HOST_CHECKPOINT_PATH}:/workspace/checkpoints" \
-v "${HOST_TENSORBOARD_LOGS_PATH}:/workspace/tensorboard_logs" \
--workdir /workspace/megatron-lm \
$PYTORCH_IMAGE \
bash examples/llama/train_llama3_8b_h100_fp8.sh \
/workspace/checkpoints \
/workspace/tensorboard_logs \
2>&1 | tee "${HOST_TENSORBOARD_LOGS_PATH}/training_mock_$(date +'%y-%m-%d_%H-%M-%S').log"
```
### Using Custom Data and Tokenizer
```bash
PYTORCH_IMAGE="nvcr.io/nvidia/pytorch:25.03-py3"
docker run --rm --gpus all --ipc=host --ulimit memlock=-1 \
-v "${HOST_MEGATRON_LM_DIR}:/workspace/megatron-lm" \
-v "${HOST_CHECKPOINT_PATH}:/workspace/checkpoints" \
-v "${HOST_TENSORBOARD_LOGS_PATH}:/workspace/tensorboard_logs" \
-v "${HOST_TOKENIZER_MODEL_PATH}:/workspace/tokenizer_model" \
-v "$(dirname "${HOST_DATA_PREFIX}"):/workspace/data_dir" \
--workdir /workspace/megatron-lm \
$PYTORCH_IMAGE \
bash examples/llama/train_llama3_8b_h100_fp8.sh \
/workspace/checkpoints \
/workspace/tensorboard_logs \
/workspace/tokenizer_model \
"/workspace/data_dir/$(basename "${HOST_DATA_PREFIX}")" \
2>&1 | tee "${HOST_TENSORBOARD_LOGS_PATH}/training_custom_$(date +'%y-%m-%d_%H-%M-%S').log"
```
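To watch either run from the host while it is in progress, you can point TensorBoard at the mounted log directory (assuming TensorBoard is installed on the host):

```bash
# Serve training curves from the host-side log directory.
tensorboard --logdir "$HOST_TENSORBOARD_LOGS_PATH" --port 6006
```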
## 4. Configuration
<a id="configuration" name="configuration"></a>
Default parallelism strategy:
- Tensor Parallel: 1
- Pipeline Parallel: 1
- Context Parallel: 2

Llama-3-8B architecture:
- 32 layers
- Hidden size: 4096
- FFN hidden size: 14336
- Attention heads: 32
- Query groups: 8
- Sequence length: 8192
- RMSNorm normalization with SwiGLU and RoPE

Key training parameters:
- Micro-batch size: 1
- Global batch size: 128
- Learning rate: 1.5e-4
- Min learning rate: 1.0e-5
- Weight decay: 0.1
- FP8 format: hybrid

You can modify these parameters directly in the `train_llama3_8b_h100_fp8.sh` script.
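
For example, the batch sizes and sequence length appear as plain shell variables near the top of the script (excerpt from the script included in this commit):

```bash
# Tunable variables in train_llama3_8b_h100_fp8.sh
MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=128
SEQ_LENGTH=8192
```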
This configuration follows those defined in NeMo Framework's performance scripts, which can be found at [https://github.com/NVIDIA/NeMo/tree/main/scripts/performance](https://github.com/NVIDIA/NeMo/tree/main/scripts/performance).
### FP8 Performance
| Model | #-GPUs | GBS | MBS | Seq Length | TP | PP | CP | VP | EP | GA | Tokens/sec/GPU | TFLOP/sec/GPU |
|-------|--------|-----|-----|------------|----|----|----|----|----|----|----------------|---------------|
| LLAMA3-8B | 8 | 128 | 1 | 8192 | 1 | 1 | 2 | 1 | 1 | 32 | 13812 | 800 |
| LLAMA3-70B | 64 | 128 | 1 | 8192 | 4 | 8 | 1 | 5 | 1 | 64 | 1621 | 780 |
| LLAMA3-405B | 1024 | 512 | 1 | 8192 | 8 | 8 | 2 | 8 | 1 | 64 | 315 | 834 |

Legend:
- GBS: Global Batch Size
- MBS: Micro Batch Size
- TP: Tensor Parallel size
- PP: Pipeline Parallel size
- CP: Context Parallel size
- VP: Virtual Pipeline stages
- EP: Expert Parallel size
- GA: Gradient Accumulation steps
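
The GA column follows from the others: the data-parallel size is DP = #GPUs / (TP × PP × CP), and GA = GBS / (MBS × DP). A quick check of the LLAMA3-8B row:

```bash
# Gradient accumulation implied by the LLAMA3-8B row of the table above.
NUM_GPUS=8; GBS=128; MBS=1; TP=1; PP=1; CP=2
DP=$(( NUM_GPUS / (TP * PP * CP) ))   # data-parallel size = 4
GA=$(( GBS / (MBS * DP) ))            # gradient accumulation steps = 32
echo "DP=$DP GA=$GA"
```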
Since NeMo uses Megatron-Core, please refer to the official [NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance_summary.html) for the latest performance benchmarks.
## 5. Test Datasets
<a id="test-datasets" name="test-datasets"></a>
Recommended datasets:
1. **WikiText-103**: https://huggingface.co/datasets/Salesforce/wikitext
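
`preprocess_data.py` expects a JSON-lines input with one document per line; by default it reads the text from a `text` field (check `python tools/preprocess_data.py --help` in your checkout). A minimal, hypothetical input file looks like this:

```bash
# Write a tiny JSON-lines file with one {"text": ...} object per document.
cat > your_dataset.json <<'EOF'
{"text": "First training document goes here."}
{"text": "Second training document goes here."}
EOF
```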
Preprocess datasets:
```bash
python "${HOST_MEGATRON_LM_DIR}/tools/preprocess_data.py" \
--input your_dataset.json \
--output-prefix test_dataset \
--tokenizer-type HuggingFaceTokenizer \
--tokenizer-model /path/to/tokenizer.model \
--append-eod
```
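With the default `--json-keys` (`text`), preprocessing typically produces `test_dataset_text_document.bin` and `.idx`; verify the exact names in your output directory, then point the data prefix at them without the extension:

```bash
# Use the generated prefix (no .bin/.idx suffix) as the data prefix for training.
export HOST_DATA_PREFIX="$PWD/test_dataset_text_document"
```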
## 6. FP8 Training Considerations
<a id="fp8-training-considerations" name="fp8-training-considerations"></a>
- **Hardware**: Requires NVIDIA Hopper, Ada, or Blackwell GPUs for FP8 support
- **Troubleshooting**: If you encounter NaN values or instability with FP8 training, please refer to [Transformer Engine](https://github.com/NVIDIA/TransformerEngine).
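
As a quick isolation step, you can fall back to plain BF16 by editing the `DTYPE` variable in `train_llama3_8b_h100_fp8.sh`: the FP8-specific flags are only appended when `DTYPE` is `fp8`, while `--bf16` is always set.

```bash
# In train_llama3_8b_h100_fp8.sh: disables the FP8 arguments and trains in BF16.
DTYPE="bf16"
```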
#!/bin/bash
# Environment variables for performance tuning
export CUDA_DEVICE_MAX_CONNECTIONS=${CUDA_DEVICE_MAX_CONNECTIONS:-1}
#export LOG_LEVEL=${LOG_LEVEL:-INFO}
#export NCCL_IB_TIMEOUT=${NCCL_IB_TIMEOUT:-19}
#export NVTE_FWD_LAYERNORM_SM_MARGIN=${NVTE_FWD_LAYERNORM_SM_MARGIN:-16}
#export NVTE_BWD_LAYERNORM_SM_MARGIN=${NVTE_BWD_LAYERNORM_SM_MARGIN:-16}
#export NCCL_P2P_NET_CHUNKSIZE=${NCCL_P2P_NET_CHUNKSIZE:-2097152}
#export NCCL_AVOID_RECORD_STREAMS=${NCCL_AVOID_RECORD_STREAMS:-1}
CHECKPOINT_PATH=${1:-"checkpoints/llama3_8b_fp8"}
TENSORBOARD_LOGS_PATH=${2:-"tensorboard_logs/llama3_8b_fp8"}
TOKENIZER_ARG=${3:-"MOCK"} # Path to tokenizer model, or "MOCK"
DATA_ARG=${4:-"MOCK"} # Data prefix, or "MOCK"
# Create directories if they don't exist
mkdir -p "$(dirname "$CHECKPOINT_PATH")"
mkdir -p "$(dirname "$TENSORBOARD_LOGS_PATH")"
# Distributed training setup
GPUS_PER_NODE=8
NUM_NODES=1
MASTER_ADDR=${MASTER_ADDR:-localhost}
MASTER_PORT=${MASTER_PORT:-6000}
NODE_RANK=${NODE_RANK:-0}
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))
# Path to the pretrain_gpt.py script, assuming this script is run from the root of the Megatron-LM repository
PRETRAIN_SCRIPT_PATH="pretrain_gpt.py"
# Fixed model and training parameters
TP_SIZE=1
CP_SIZE=1
PP_SIZE=1
MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=128
NUM_LAYERS=32
DTYPE="fp8"
SEQ_LENGTH=8192
MAX_POSITION_EMBEDDINGS=8192
# Data cache path (useful for both mock and real data)
DATA_CACHE_PATH="${PWD}/benchmark_cache_llama3_8b_fp8"
mkdir -p "$DATA_CACHE_PATH"
DISTRIBUTED_ARGS=(
--nproc_per_node $GPUS_PER_NODE
--nnodes $NUM_NODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
)
MODEL_ARGS=(
--use-mcore-models
--num-layers $NUM_LAYERS
--hidden-size 4096
--ffn-hidden-size 14336
--num-attention-heads 32
--group-query-attention
--num-query-groups 8
--kv-channels 128
--seq-length $SEQ_LENGTH
--max-position-embeddings $MAX_POSITION_EMBEDDINGS
--position-embedding-type rope
--rotary-base 1000000
--rotary-percent 1.0
--attention-dropout 0.0
--hidden-dropout 0.0
--swiglu
--init-method-std 0.0134
--attention-backend fused
--apply-layernorm-1p
--untie-embeddings-and-output-weights
--disable-bias-linear
)
TRAINING_ARGS=(
--micro-batch-size $MICRO_BATCH_SIZE
--global-batch-size $GLOBAL_BATCH_SIZE
--train-samples 1953125000
--lr-decay-samples 1949218748
--lr-warmup-samples 3906252
--lr 0.00015
--min-lr 0.00001
--decoupled-lr 5.0e-4 # Specific to decoupled AdamW, ensure optimizer is compatible
--decoupled-min-lr 4.5e-5 # Specific to decoupled AdamW
--lr-decay-style cosine
--clip-grad 1.0
--weight-decay 0.1
--adam-beta1 0.9
--adam-beta2 0.95
--bf16
--grad-reduce-in-bf16
--cross-entropy-loss-fusion
--calculate-per-token-loss
--manual-gc
--empty-unused-memory-level 1
--exit-duration-in-mins 235
)
# Conditional arguments based on DTYPE (FP8)
DTYPE_ARGS=()
if [[ "$DTYPE" == "fp8" ]]; then
DTYPE_ARGS+=(
"--fp8-format hybrid"
"--fp8-amax-history-len 1024"
"--fp8-amax-compute-algo max"
"--fp8-param-gather"
)
fi
# Model parallelism arguments
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size $TP_SIZE
--context-parallel-size $CP_SIZE
# --pipeline-model-parallel-size $PP_SIZE # Not set explicitly; Megatron defaults to 1
--sequence-parallel # Only takes effect when TP_SIZE > 1; TP_SIZE is 1 in this config
)
# Distributed Data Parallel (DDP) arguments
# From original script's ddp_args
DDP_ARGS=(
--use-distributed-optimizer
--overlap-grad-reduce
--overlap-param-gather
)
TRAINING_ARGS+=("${DDP_ARGS[@]}")
# Data arguments (conditional for mock vs real data)
DATA_ARGS_LIST=()
if [[ "$TOKENIZER_ARG" == "MOCK" ]] || [[ "$DATA_ARG" == "MOCK" ]] || [[ -z "$TOKENIZER_ARG" ]]; then
DATA_ARGS_LIST+=(
"--mock-data"
"--tokenizer-type NullTokenizer"
"--vocab-size 128256"
"--data-cache-path ${DATA_CACHE_PATH}"
"--tiktoken-pattern v2"
"--split '99,1,0'"
"--no-create-attention-mask-in-dataloader"
"--no-mmap-bin-files"
"--num-workers 1"
)
else
# Settings for real data
DATA_ARGS_LIST+=(
"--data-path $DATA_ARG"
"--tokenizer-type HuggingFaceTokenizer"
"--tokenizer-model $TOKENIZER_ARG"
"--data-cache-path ${DATA_CACHE_PATH}"
"--split '99,1,0'"
"--no-create-attention-mask-in-dataloader"
"--no-mmap-bin-files"
"--num-workers 1"
# Note: --vocab-size might be inferred by HuggingFaceTokenizer or might need to be explicit.
"--vocab-size 128256"
)
fi
EVAL_AND_LOGGING_ARGS=(
--log-interval 1
--eval-iters 32
--eval-interval 100
--save-interval 1000
--log-throughput
--profile
--profile-step-start 4
--profile-step-end 6
--ckpt-format torch_dist
--distributed-timeout-minutes 60
--save "$CHECKPOINT_PATH"
--load "$CHECKPOINT_PATH"
--tensorboard-dir "$TENSORBOARD_LOGS_PATH"
)
# Ensure pretrain_gpt.py is found
if [ ! -f "$PRETRAIN_SCRIPT_PATH" ]; then
echo "Error: pretrain_gpt.py not found at $PRETRAIN_SCRIPT_PATH"
echo "Please ensure you are running this script from the root of the Megatron-LM repository, and pretrain_gpt.py is present."
exit 1
fi
# Run the training command
torchrun "${DISTRIBUTED_ARGS[@]}" \
"$PRETRAIN_SCRIPT_PATH" \
"${MODEL_ARGS[@]}" \
"${TRAINING_ARGS[@]}" \
"${DTYPE_ARGS[@]}" \
"${MODEL_PARALLEL_ARGS[@]}" \
"${DATA_ARGS_LIST[@]}" \
"${EVAL_AND_LOGGING_ARGS[@]}"
set +x
checkpoints/
data-cache/
tensorboard/
triton-cache/
FROM nvcr.io/nvidia/pytorch:24.01-py3
RUN pip uninstall -y triton && \
pip install triton==2.1.0 sentencepiece==0.1.99 flask-restful
# The causal-conv1d and mamba-ssm packages below are built from scratch here
# (which takes significant time) because there are no wheels available on PyPI
# for these relatively newer versions of the packages that are compatible with
# the older NGC-variant PyTorch version (e.g. version 2.2.0.dev231106) that we
# are using (in the NGC base container). Generally, if the package is not
# compatible with the PyTorch version, then it will generate a Python import
# error. The package authors tend to only release wheels for new versions of
these packages which are compatible with the versions of regular PyTorch and
# NGC-variant PyTorch that are newer at the time of release. So, to use newer
# versions of these packages with relatively older versions of the NGC PyTorch
# container, we tend to have to build the packages from scratch.
RUN cd /tmp && \
git clone https://github.com/Dao-AILab/causal-conv1d.git && \
cd causal-conv1d && \
git checkout v1.2.2.post1 && \
CAUSAL_CONV1D_FORCE_BUILD=TRUE pip install . && \
cd .. && \
rm -rf causal-conv1d
RUN cd /tmp && \
git clone https://github.com/state-spaces/mamba.git && \
cd mamba && \
git checkout v2.0.3 && \
MAMBA_FORCE_BUILD=TRUE pip install . && \
cd .. && \
rm -rf mamba
# Mamba-based Language Models
## Introduction
This document is an entrypoint into the code used for
<em>[An Empirical Study of Mamba-based Language Models](https://arxiv.org/abs/2406.07887)</em>.
We are releasing the parameters for some of the models described in that
technical report via
[HuggingFace](https://huggingface.co/collections/nvidia/ssms-666a362c5c3bb7e4a6bcfb9c).
The code in the `main` branch is no longer compatible with the `Mamba2-*`
checkpoints. You can load them using the
[fixed snapshot of the code used for the technical report](https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba).
## Installation
Create and run a Docker container using the [Dockerfile](./Dockerfile).
```bash
docker build -t your_image_name:your_tag .
docker run --gpus all -it --rm \
-v /path/to/megatron:/workspace/megatron \
-v /path/to/dataset:/workspace/dataset \
-v /path/to/checkpoints:/workspace/checkpoints \
-w /workspace/megatron/examples/mamba \
your_image_name:your_tag
```
## Train
[`train.sh`](./train.sh) is an example pretraining script, showing how to run on
a single node. Select between 800M-scale and 8B-scale models by setting the
`MODEL_SCALE` variable. The 8B-scale hybrid model architecture is the same as
the one described in the technical report.
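
For example, assuming `MODEL_SCALE` is assigned near the top of `train.sh` (or honored from the environment if the script uses a `${MODEL_SCALE:-...}` default), switching to the 8B-scale configuration would look roughly like:

```bash
# Hypothetical invocation; check how train.sh reads MODEL_SCALE in your checkout.
MODEL_SCALE="8B" bash train.sh
```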
## Text Generation
Use [`run_text_gen_server_8b.sh`](./run_text_gen_server_8b.sh) to start a text
generation server using an 8B hybrid checkpoint. This is configured to run the
8B hybrid model described in the technical report, with tensor model parallel
set to 1.
The arguments in the script will need to be changed if using a checkpoint with a
different model parallel configuration or other differences, such as model
architecture. For example, to run the 8B pure Mamba-2 model, change
`--hybrid-attention-ratio` and `--hybrid-mlp-ratio` to 0.0, or remove them.
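
One way to make that change (a sketch, assuming you run it from `examples/mamba/` and the flags appear with the values shown in the script later in this commit):

```bash
# Switch run_text_gen_server_8b.sh to the pure Mamba-2 configuration by
# zeroing both hybrid ratios (deleting the two flags by hand works too).
sed -i \
  -e 's/--hybrid-attention-ratio 0.08/--hybrid-attention-ratio 0.0/' \
  -e 's/--hybrid-mlp-ratio 0.5/--hybrid-mlp-ratio 0.0/' \
  run_text_gen_server_8b.sh
```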
Use [`run_text_gen_server_8b_gpt3.sh`](./run_text_gen_server_8b_gpt3.sh) to start
a text generation server using the 8B reference Transformer checkpoint.
## Checkpoint Formats
For inference, the model must be configured to match the checkpoint file used,
including the hybrid layer configuration and model parallel configuration.
If you need to convert a hybrid checkpoint file to a different tensor parallel
or pipeline parallel size, use
[the hybrid conversion script](../../tools/checkpoint/hybrid_conversion.py).
There is an example run command at the end of that file.
Before running that script, you will need to set `PYTHONPATH` to include the
root directory of your Megatron-LM repository clone.
```bash
export PYTHONPATH=<path-to-megatron>:$PYTHONPATH
```
## Hybrid Options
`--hybrid-attention-ratio ATT` specifies a target ratio of attention layers
to total layers. For example, 4 attention layers out of 48 total layers is
specified by `--hybrid-attention-ratio 0.08`.
`--hybrid-mlp-ratio MLP` specifies a target ratio of MLP layers to total
layers. For example, 24 MLP layers out of 48 total layers is specified by
`--hybrid-mlp-ratio 0.5`.
* (`ATT` + `MLP`) must be less than or equal to 1.0.
* (1.0 - `ATT` - `MLP`) is the hybrid Mamba ratio, the ratio of Mamba layers to
total layers.
* `ATT` = `MLP` = 0 is a pure Mamba model.
* `ATT` = `MLP` = 0.5 is a Transformer model.
* For example, with 48 total layers, `--hybrid-attention-ratio 0.08 --hybrid-mlp-ratio 0.5` gives 4 attention layers and 24 MLP layers, leaving 20 Mamba layers.

If either `ATT` or `MLP` is greater than 0.0 or if `--hybrid-override-pattern`
is specified, the logfile will include information about the hybrid layer
pattern used. `--hybrid-override-pattern` can be used to specify a different
pattern than the default, algorithmically-generated one.
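
A hypothetical sketch of an explicit pattern (the per-layer symbols are assumed here to be `M` for Mamba, `*` for attention, and `-` for MLP; confirm the exact symbol set against the hybrid layer allocation code under `megatron/core/ssm/` before relying on it):

```bash
# Hypothetical 16-layer pattern: mostly Mamba, with two attention and two MLP layers.
EXTRA_ARGS="--hybrid-override-pattern MMMM*M-MMMM*M-MM"
```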
## Mamba vs Mamba-2
This codebase currently only supports Mamba-2, and not the original version of
Mamba. However, the
[fixed snapshot of the code used for the technical report](https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba)
can be configured to run the original version of Mamba.
#!/bin/bash
# Use: ./run_text_gen_server_8b.sh <checkpoint-path> <tokenizer-path>
# To launch the client: python ../../tools/text_generation_cli.py <URL-provided-by-server>
CHECKPOINT_PATH=$1
TOKENIZER_PATH=$2
DISTRIBUTED_ARGS="--nproc_per_node 1 \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
export NCCL_IB_SL=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_IB_TIMEOUT=19
export NCCL_IB_QPS_PER_CONNECTION=4
export TRITON_CACHE_DIR="./triton-cache/"
export TRITON_CACHE_MANAGER="megatron.core.ssm.triton_cache_manager:ParallelFileCacheManager"
torchrun $DISTRIBUTED_ARGS ../../tools/run_mamba_text_generation_server.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--untie-embeddings-and-output-weights \
--num-layers 56 \
--hidden-size 4096 \
--load ${CHECKPOINT_PATH} \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 8 \
--hybrid-attention-ratio 0.08 \
--hybrid-mlp-ratio 0.5 \
--attention-dropout 0.0 \
--hidden-dropout 0.0 \
--disable-bias-linear \
--normalization RMSNorm \
--seq-length 4096 \
--max-position-embeddings 4096 \
--position-embedding-type none \
--tokenizer-type GPTSentencePieceTokenizer \
--tokenizer-model ${TOKENIZER_PATH} \
--distributed-backend nccl \
--distributed-timeout-minutes 1440 \
--bf16 \
--micro-batch-size 1 \
--use-mcore-models \
--spec megatron.core.models.mamba.mamba_layer_specs mamba_stack_spec \
--seed 42