You can modify these parameters directly in the `train_llama3_8b_h100_fp8.sh` script.
This configuration follows the ones defined in NeMo Framework's performance scripts, available at [https://github.com/NVIDIA/NeMo/tree/main/scripts/performance](https://github.com/NVIDIA/NeMo/tree/main/scripts/performance).
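The parallelism and batch-size knobs listed in the table below map onto settings in NeMo's Python recipes, which those performance scripts build on. The following is a minimal sketch of how the same knobs look through the NeMo 2.0 recipe API; the entry point (`llm.llama3_8b.pretrain_recipe`) and attribute names follow the NeMo 2.0 documentation, the values are illustrative rather than the tuned H100 FP8 settings, and this is not a drop-in replacement for editing `train_llama3_8b_h100_fp8.sh`.

```python
# Sketch (assumption): the same knobs expressed via the NeMo 2.0 recipe API.
# Names follow the NeMo 2.0 docs; values are placeholders, not tuned settings.
from nemo.collections import llm

recipe = llm.llama3_8b.pretrain_recipe(
    name="llama3_8b_h100_fp8",
    dir="/results",          # checkpoint/log directory (placeholder)
    num_nodes=1,
    num_gpus_per_node=8,
)

# Parallelism (TP / PP / CP / VP columns in the table below)
recipe.trainer.strategy.tensor_model_parallel_size = 1
recipe.trainer.strategy.pipeline_model_parallel_size = 1
recipe.trainer.strategy.context_parallel_size = 2
recipe.trainer.strategy.virtual_pipeline_model_parallel_size = None

# Batch sizes and sequence length (GBS / MBS / Seq Length columns)
recipe.data.global_batch_size = 128
recipe.data.micro_batch_size = 1
recipe.data.seq_length = 8192
```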
### FP8 Performance
| Model | #-GPUs | GBS | MBS | Seq Length | TP | PP | CP | VP | EP | GA | Tokens/sec/GPU | TFLOP/sec/GPU |
|-------|--------|-----|-----|------------|----|----|----|----|----|----|----------------|---------------|

(GBS/MBS = global/micro batch size; TP, PP, CP, VP, and EP = tensor, pipeline, context, virtual-pipeline, and expert parallel sizes; GA = gradient accumulation steps.)

Because NeMo builds on Megatron-Core, refer to the official [NeMo documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/performance/performance_summary.html) for the latest performance benchmarks in this format.
- **Hardware**: Requires NVIDIA Hopper, Ada, or Blackwell GPUs for FP8 support.
- **Troubleshooting**: If you encounter NaN values or instability with FP8 training, refer to [Transformer Engine](https://github.com/NVIDIA/TransformerEngine); a short FP8 recipe sketch follows this list.
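When FP8 training shows NaNs, a common first step is to revisit the Transformer Engine FP8 recipe (scaling margin, FP8 format, amax history). The sketch below uses Transformer Engine's documented `DelayedScaling` recipe and `fp8_autocast` context on a single layer; the specific values are illustrative and not tied to this training script.

```python
# Hedged sketch: probing FP8 numerics with Transformer Engine directly.
# The recipe knobs shown are the usual ones to adjust when FP8 training
# produces NaNs; the values here are illustrative, not recommendations.
import torch
import transformer_engine.pytorch as te
from transformer_engine.common import recipe

fp8_recipe = recipe.DelayedScaling(
    margin=1,                         # extra headroom in the scaling factor
    fp8_format=recipe.Format.HYBRID,  # E4M3 forward, E5M2 backward
    amax_history_len=1024,
    amax_compute_algo="max",
)

layer = te.Linear(4096, 4096, bias=True).cuda()
x = torch.randn(16, 4096, device="cuda")

with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    y = layer(x)

assert torch.isfinite(y).all(), "NaN/Inf in FP8 forward pass"
```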