# Megatron Model Optimization and Deployment
## Installation
We recommend that users follow TensorRT-LLM's official installation guide to build it from source
and proceed with a containerized environment (`docker.io/tensorrt_llm/release:latest`):
```sh
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout v0.10.0
make -C docker release_build
```
> **TROUBLESHOOTING:** Rather than copying each folder separately in `docker/Dockerfile.multi`,
> you may need to copy the entire directory with `COPY ./ /src/tensorrt_llm`, since `git submodule` is
> invoked later and requires `.git` to be present.
Once the container is built, install `nvidia-modelopt` and additional dependencies for sharded checkpoint support:
```sh
pip install "nvidia-modelopt[all]~=0.13.0" --extra-index-url https://pypi.nvidia.com
pip install zarr tensorstore==0.1.45
```
TensorRT-LLM quantization functionalities are currently packaged in `nvidia-modelopt`.
You can find more documentation about `nvidia-modelopt` [here](https://nvidia.github.io/TensorRT-Model-Optimizer/).
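As a rough sketch of the PTQ flow that the example script later in this document follows (assuming `model` is an MCore `GPTModel` and `calib_batches` is a hypothetical iterable of calibration inputs):
```python
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Run calibration batches through the model so the inserted quantizers
    # can collect activation statistics.
    for batch in calib_batches:  # hypothetical calibration iterable
        model(batch)

# Calibrate and insert (fake-)quantizers in place, e.g. with the FP8 recipe.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```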
## Support Matrix
The following matrix shows the current support for the PTQ + TensorRT-LLM export flow.
| model | fp16 | int8_sq | fp8 | int4_awq |
|-----------------------------|------|---------| ----| -------- |
| nextllm-2b | x | x | x | |
| nemotron3-8b | x | | x | |
| nemotron3-15b | x | | x | |
| llama2-text-7b | x | x | x | TP2 |
| llama2-chat-70b | x | x | x | TP4 |
Our PTQ + TensorRT-LLM flow has native support on MCore `GPTModel` with a mixed layer spec (native `ParallelLinear`
and Transformer-Engine `TENorm`). Note that this is not the default MCore GPT spec. You can still load the
following checkpoint formats with some remedy arguments:
| GPTModel | sharded | remedy arguments |
|-----------------------------------|---------|---------------------------------------------|
| megatron.legacy.model | | `--export-legacy-megatron` |
| TE-Fused (default mcore gpt spec) | | `--export-te-mcore-model` |
| TE-Fused (default mcore gpt spec) | x | |
> **TROUBLESHOOTING:** If you are trying to load an unpacked `.nemo` sharded checkpoint, you will typically
> need to add `additional_sharded_prefix="model."` to the `load_modelopt_checkpoint()` call, since NeMo adds an
> additional `model.` wrapper on top of the `GPTModel`.

> **NOTE:** The flag `--export-legacy-megatron` may not work on all legacy checkpoint versions.
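For example, a minimal sketch of the remedy above, using the loader from the example script later in this document (`model` is the module list returned by `get_model`):
```python
from megatron.inference.checkpointing import load_modelopt_checkpoint

# NeMo wraps GPTModel with an extra `model.` prefix; strip it while loading.
load_modelopt_checkpoint(model, additional_sharded_prefix="model.")
```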
## Examples
> **NOTE:** We only provide a simple text generation script to test the generated TensorRT-LLM engines. For
> a production-level API server or enterprise support, see [NeMo](https://github.com/NVIDIA/NeMo) and TensorRT-LLM's
> backend for [NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server).
### nemotron3-8B FP8 Quantization and TensorRT-LLM Deployment
First download the nemotron checkpoint from https://huggingface.co/nvidia/nemotron-3-8b-base-4k, extract the
sharded checkpoint from the `.nemo` tarball, and fix the tokenizer file name.

> **NOTE:** The following cloning method uses `ssh` and assumes you have registered your `ssh-key` with Hugging Face.
> If you want to clone with `https`, use `git clone https://huggingface.co/nvidia/nemotron-3-8b-base-4k` with an access token.
```sh
git lfs install
git clone git@hf.co:nvidia/nemotron-3-8b-base-4k
cd nemotron-3-8b-base-4k
tar -xvf Nemotron-3-8B-Base-4k.nemo
mv 586f3f51a9cf43bc9369bd53fa08868c_a934dc7c3e1e46a6838bb63379916563_3feba89c944047c19d5a1d0c07a85c32_mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model tokenizer.model
cd ..
```
Now launch the PTQ + TensorRT-LLM export script:
```sh
bash examples/inference/quantization/ptq_trtllm_nemotron3_8b.sh ./nemotron-3-8b-base-4k None
```
By default, `cnn_dailymail` is used for calibration. The `GPTModel` will have quantizers inserted to simulate the
quantization effect. The checkpoint can optionally be saved (with the quantizers as additional states) and
restored for further evaluation. The TensorRT-LLM checkpoint and engine are exported to `/tmp/trtllm_ckpt` and
built in `/tmp/trtllm_engine` by default.
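Assuming the default export locations above, the exported artifacts can be tested with the provided generation script (these flags are the same ones the wrapper scripts below pass):
```sh
python examples/inference/quantization/trtllm_text_generation.py \
    --tensorrt-llm-checkpoint-dir /tmp/trtllm_ckpt \
    --engine-dir /tmp/trtllm_engine \
    --tokenizer ./nemotron-3-8b-base-4k/tokenizer.model
```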
The script expects `${CHECKPOINT_DIR}` (`./nemotron-3-8b-base-4k`) to have the following structure:
```
├── model_weights
│ ├── common.pt
│ ...
├── model_config.yaml
├── mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model
```
> **NOTE:** The script uses `TP=8`. Change `$TP` in the script if your checkpoint uses a different tensor
> model parallelism.

> **KNOWN ISSUES:** The `mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model` in the checkpoint is for
> Megatron-LM's `GPTSentencePiece` tokenizer.
> For TensorRT-LLM, we load this tokenizer as a Hugging Face `T5Tokenizer` with some special tokens and the
> `encode` and `batch_decode` methods overridden. As a result, the tokenizer behavior in the TensorRT-LLM engine may
> not match Megatron-LM's exactly.
### llama2-text-7b INT8 SmoothQuant and TensorRT-LLM Deployment
> **NOTE:** Due to licensing restrictions, we do not provide an MCore checkpoint to download. Users can follow
> the instructions in `docs/llama2.md` to convert the checkpoint to the megatron legacy `GPTModel` format and
> use the `--export-legacy-megatron` flag, which will remap the checkpoint to the MCore `GPTModel` spec
> that we support.
```sh
bash examples/inference/quantization/ptq_trtllm_llama_7b.sh ${CHECKPOINT_DIR}
```
The script expects `${CHECKPOINT_DIR}` to have the following structure:
```
├── hf
│ ├── tokenizer.config
│ ├── tokenizer.model
│ ...
├── iter_0000001
│ ├── mp_rank_00
│ ...
├── latest_checkpointed_iteration.txt
```
In short, alongside the converted llama megatron checkpoint, also place the Hugging Face checkpoint inside the
directory as the source of the tokenizer.
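For example, assuming a converted Megatron checkpoint at `./llama2-text-7b-megatron` and a Hugging Face checkout at `./Llama-2-7b-hf` (both hypothetical paths), the expected layout can be assembled with:
```sh
# Place the HF checkpoint inside the Megatron checkpoint dir as the tokenizer source.
cp -r ./Llama-2-7b-hf ./llama2-text-7b-megatron/hf
bash examples/inference/quantization/ptq_trtllm_llama_7b.sh ./llama2-text-7b-megatron
```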
#!/bin/bash
set -e
DEFAULT_NAME="/checkpoints/llama2-text-7b_v0.2.0"
NAME="${1:-$DEFAULT_NAME}"
DEFAULT_QUANT_CFG="int8_sq"
QUANT_CFG="${2:-$DEFAULT_QUANT_CFG}"
# CHANGE THE FOLLOWING IF YOU MOUNT YOUR DATA AND CHECKPOINTS DIFFERENTLY IN THE CONTAINER.
TP="8"
INFERENCE_TP=${TP}
DECODER_TYPE="llama"
CHECKPOINT_LOAD_DIR="${NAME}"
TOKENIZER_MODEL="${CHECKPOINT_LOAD_DIR}/hf/tokenizer.model"
# LLaMA2 text 7b has ffn_hidden_size 11008. int4_awq requires a block_size of 128, so TP can be at most 2.
if [ "$QUANT_CFG" = "int4_awq" ]; then
    INFERENCE_TP="2"
fi
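# Note: 11008 / 128 = 86 = 2 * 43, so each TP shard's ffn dimension stays divisible
# by the 128 block size only for TP <= 2 (among the usual power-of-two TP values).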
additional_options=" \
--export-quant-cfg ${QUANT_CFG} \
--export-legacy-megatron \
--export-te-mcore-model \
--calib-batch-size 8 \
--decoder ${DECODER_TYPE} \
--export-dir /tmp/trtllm_ckpt \
--inference-tensor-parallel ${INFERENCE_TP} "
trtllm_options=" \
--tensorrt-llm-checkpoint-dir /tmp/trtllm_ckpt \
--engine-dir /tmp/trtllm_engine \
--tokenizer ${CHECKPOINT_LOAD_DIR}/hf \
--max-input-len 2048 \
--max-output-len 512 \
--max-batch-size 8 "
# DO NOT CHANGE THE SETTING BELOW UNLESS YOU KNOW WHAT YOU ARE DOING!!!
export CUDA_DEVICE_MAX_CONNECTIONS=1
options=" \
--disable-bias-linear \
--swiglu \
--no-rope-fusion \
--untie-embeddings-and-output-weights \
--use-rotary-position-embeddings \
--normalization RMSNorm \
--rotary-percent 1.0 \
--no-position-embedding \
--no-masked-softmax-fusion \
--no-bias-gelu-fusion \
--no-bias-dropout-fusion \
--no-async-tensor-model-parallel-allreduce \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size 1 \
--num-layers 32 \
--hidden-size 4096 \
--ffn-hidden-size 11008 \
--num-attention-heads 32 \
--seq-length 4096 \
--max-position-embeddings 4096 \
--micro-batch-size 1 \
--make-vocab-size-divisible-by 1 \
--tokenizer-type Llama2Tokenizer \
--tokenizer-model ${TOKENIZER_MODEL} \
--save-interval 1000000 \
--use-dist-ckpt \
--load ${CHECKPOINT_LOAD_DIR} \
--fp16"
# Precompile CUDA extensions.
python -c "import modelopt.torch.quantization.extensions as ext; print(ext.cuda_ext); print(ext.cuda_ext_fp8)"
# Set the launch configuration consumed by torchrun below.
launch_config="--nproc_per_node=${TP}"
# Launch multi-process with torchrun.
torchrun ${launch_config} examples/inference/quantization/text_generation_ptq.py ${options} ${additional_options}
# This script uses mpi4py, which will spawn multiple processes.
python examples/inference/quantization/trtllm_text_generation.py ${trtllm_options}
#!/bin/bash
set -e
DEFAULT_NAME="/checkpoints/nemotron3-8b_v0.3.0"
NAME="${1:-$DEFAULT_NAME}"
DEFAULT_QUANT_CFG="fp8"
QUANT_CFG="${2:-$DEFAULT_QUANT_CFG}"
# CHANGE THE FOLLOWING IF YOU MOUNT YOUR DATA AND CHECKPOINTS DIFFERENTLY IN THE CONTAINER.
TP="8"
INFERENCE_TP=${TP}
DECODER_TYPE="gptnext"
CHECKPOINT_LOAD_DIR="${NAME}"
TOKENIZER_MODEL="${CHECKPOINT_LOAD_DIR}/tokenizer.model"
if [ "$QUANT_CFG" = "int4_awq" ]; then
INFERENCE_TP="1"
fi
additional_options=" \
--export-quant-cfg ${QUANT_CFG} \
--export-legacy-megatron \
--export-te-mcore-model \
--calib-batch-size 8 \
--decoder ${DECODER_TYPE} \
--export-dir /tmp/trtllm_ckpt \
--inference-tensor-parallel ${INFERENCE_TP} "
trtllm_options=" \
--tensorrt-llm-checkpoint-dir /tmp/trtllm_ckpt \
--engine-dir /tmp/trtllm_engine \
--tokenizer ${TOKENIZER_MODEL} \
--max-input-len 2048 \
--max-output-len 512 \
--max-batch-size 8 "
# DO NOT CHANGE THE SETTING BELOW UNLESS YOU KNOW WHAT YOU ARE DOING!!!
export CUDA_DEVICE_MAX_CONNECTIONS=1
options=" \
--apply-layernorm-1p \
--untie-embeddings-and-output-weights \
--disable-bias-linear \
--no-rope-fusion \
--no-position-embedding \
--use-rotary-position-embeddings \
--rotary-percent 0.5 \
--squared-relu \
--attention-dropout 0.0 \
--hidden-dropout 0.0 \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size 1 \
--num-layers 32 \
--hidden-size 4096 \
--num-attention-heads 32 \
--seq-length 4096 \
--max-position-embeddings 4096 \
--micro-batch-size 1 \
--tokenizer-type GPTSentencePieceTokenizer \
--tokenizer-model ${TOKENIZER_MODEL} \
--save-interval 1000000 \
--load ${CHECKPOINT_LOAD_DIR} \
--fp16 \
--use-dist-ckpt"
# Precompile CUDA extensions.
python -c "import modelopt.torch.quantization.extensions as ext; print(ext.cuda_ext); print(ext.cuda_ext_fp8)"
# Set the launch configuration consumed by torchrun below.
launch_config="--nproc_per_node=${TP}"
# Launch multi-process with torchrun.
torchrun ${launch_config} examples/inference/quantization/text_generation_ptq.py ${options} ${additional_options}
# This script uses mpi4py, which will spawn multiple processes.
python examples/inference/quantization/trtllm_text_generation.py ${trtllm_options}
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
"""Sample Generate GPT."""
import functools
import os
import sys
from pathlib import Path
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), "../../")))
import modelopt.torch.quantization as mtq
import torch
from datasets import load_dataset
from modelopt.torch.utils.distributed import set_data_parallel_group, set_tensor_parallel_group
from tqdm import tqdm
# [ModelOpt]: changing the default model provider to the ModelOpt version
from megatron.core import mpu
from megatron.inference.arguments import add_modelopt_args
from megatron.inference.checkpointing import load_modelopt_checkpoint
from megatron.inference.gpt.model_provider import model_provider
from megatron.inference.text_generation import generate_and_post_process
from megatron.training import get_args, get_model, initialize_megatron
from megatron.training.checkpointing import save_checkpoint
from megatron.training.utils import print_rank_0, unwrap_model
QUANT_CFG_CHOICES = {
    "int8": mtq.INT8_DEFAULT_CFG,
    "int8_sq": mtq.INT8_SMOOTHQUANT_CFG,
    "fp8": mtq.FP8_DEFAULT_CFG,
    "int4_awq": mtq.INT4_AWQ_CFG,
    "w4a8_awq": mtq.W4A8_AWQ_BETA_CFG,
    "int4": mtq.INT4_BLOCKWISE_WEIGHT_ONLY_CFG,
}
def add_trtllm_ckpt_export_args(parser):
    """Add additional arguments for TensorRT-LLM."""
    group = parser.add_argument_group(title="trtllm")

    group.add_argument(
        "--export-dir", type=str, help="The output TensorRT-LLM checkpoint.",
    )
    group.add_argument(
        "--decoder", type=str, choices=["gptnext", "llama"], help="The decoder type of the model.",
    )
    group.add_argument(
        "--inference-tensor-parallel",
        type=int,
        help="Tensor parallel for the inference time, can be different from the training config.",
        default=1,
    )
def add_text_generate_ptq_args(parser):
    """Add additional arguments for ModelOpt text generation PTQ."""
    group = parser.add_argument_group(title="ModelOpt text generation ptq")
    group.add_argument(
        "--calib-dataset",
        type=str,
        default="cnn_dailymail",
        help="Calibration datasets from HuggingFace datasets.",
    )
    group.add_argument(
        "--calib-batch-size", type=int, default=4, help="Batch size to use for ptq calibration."
    )
    group.add_argument(
        "--calib-size", type=int, default=512, help="Samples to use for ptq calibration."
    )
    parser.add_argument(
        "--prompts",
        type=str,
        default=(
            "Born in north-east France, Soyer trained as a|Born in California, Soyer trained as a"
        ),
        help="Input texts. Please use | to separate different batches.",
    )
    add_modelopt_args(parser)
    add_trtllm_ckpt_export_args(parser)
    return parser
def get_calib_dataloader(
    data="cnn_dailymail", batch_size=4, calib_size=512, max_sequence_length=512
):
    if data == "pileval":
        dataset = load_dataset(
            "json", data_files="https://the-eye.eu/public/AI/pile/val.jsonl.zst", split="train"
        )
        text_column = "text"
    elif data == "wikitext":
        dataset = load_dataset("wikitext", "wikitext-103-v1", split="train")
        text_column = "text"
    elif data == "cnn_dailymail":
        dataset = load_dataset("cnn_dailymail", name="3.0.0", split="train")
        text_column = "article"

    calib_size = max(min(len(dataset), calib_size), batch_size)
    for i in range(calib_size // batch_size):
        batch = dataset[i * batch_size : (i + 1) * batch_size][text_column]
        for j in range(len(batch)):
            batch[j] = batch[j][:max_sequence_length]
        yield batch
if __name__ == "__main__":
initialize_megatron(
extra_args_provider=add_text_generate_ptq_args,
args_defaults={
'tokenizer_type': 'GPT2BPETokenizer',
'no_load_rng': True,
'no_load_optim': True,
},
)
args = get_args()
if args.num_layers_per_virtual_pipeline_stage is not None:
print_rank_0("Interleaved pipeline schedule is not yet supported for text generation.")
exit()
print_rank_0("WARNING: Forcing exit_on_missing_checkpoint to True for text generation.")
args.exit_on_missing_checkpoint = True
# Set up model and load checkpoint
# [ModelOpt]: make sure that output logits are allgathered.
text_generation_model_provider = functools.partial(model_provider, parallel_output=False)
model = get_model(text_generation_model_provider, wrap_with_ddp=False)
if args.load is not None:
load_modelopt_checkpoint(model, strict=not args.untie_embeddings_and_output_weights)
print_rank_0("Done loading checkpoint")
# Removing virtual pipeline parallel and other wrapper
assert len(model) == 1, "Above condition should have caught this"
unwrapped_model = unwrap_model(model)
all_prompts = args.prompts.split("|")
def custom_prompt_forward_loop_func(model):
for prompt in tqdm(all_prompts):
if mpu.is_pipeline_first_stage() and mpu.get_tensor_model_parallel_rank() == 0:
(
prompts_plus_generations,
prompts_plus_generations_segments,
logprobs,
_,
) = generate_and_post_process(
model,
prompts=[prompt],
tokens_to_generate=128,
return_output_log_probs=True,
temperature=1.0,
)
print_rank_0(prompts_plus_generations)
else:
generate_and_post_process(model)
def hf_dataset_forword_loop_func(model):
dataloader = get_calib_dataloader(args.calib_dataset, args.calib_batch_size, args.calib_size)
for prompts in tqdm(dataloader, total=args.calib_size//args.calib_batch_size):
if mpu.is_pipeline_first_stage() and mpu.get_tensor_model_parallel_rank() == 0:
(
prompts_plus_generations,
prompts_plus_generations_segments,
logprobs,
_,
) = generate_and_post_process(
model,
prompts=prompts,
tokens_to_generate=0,
return_output_log_probs=True,
temperature=1.0,
)
else:
generate_and_post_process(model)
ptq_forward_loop_func = custom_prompt_forward_loop_func
if args.calib_dataset is not None:
ptq_forward_loop_func = hf_dataset_forword_loop_func
# Setting data parallel and tensor parallel group
set_data_parallel_group(mpu.get_data_parallel_group())
set_tensor_parallel_group(mpu.get_tensor_model_parallel_group())
if args.export_quant_cfg in QUANT_CFG_CHOICES:
mtq_config = QUANT_CFG_CHOICES[args.export_quant_cfg]
if "*output_layer*" not in mtq_config["quant_cfg"]:
mtq_config["quant_cfg"]["*output_layer*"] = {"enable": False}
if "awq" in args.export_quant_cfg:
weight_quantizer = mtq_config["quant_cfg"]["*weight_quantizer"] # type: ignore
if isinstance(weight_quantizer, list):
weight_quantizer = weight_quantizer[0]
weight_quantizer["block_sizes"][-1] = 128
print_rank_0("Quantizing the model...")
mtq.quantize(unwrapped_model[0], mtq_config, ptq_forward_loop_func)
custom_prompt_forward_loop_func(model[0])
if args.save is not None and args.export_quant_cfg in QUANT_CFG_CHOICES:
save_checkpoint(1, unwrapped_model, None, None, 0)
print_rank_0(f"Fake Quantized Model:\n {unwrapped_model[0]}")
if args.export_dir:
assert args.decoder in ["gptnext", "llama"], f"Decoder type {args.decoder} not supported."
Path(args.export_dir).mkdir(parents=True, exist_ok=True)
print_rank_0("Exporting TensorRT-LLM checkpoints.")
from modelopt.torch.export import export_tensorrt_llm_checkpoint
# In TRT LLM, squared relu activation does not support bf16. So we use fp16 by default.
export_tensorrt_llm_checkpoint(
unwrapped_model[0],
args.decoder,
torch.bfloat16 if args.bf16 else torch.float16,
export_dir=args.export_dir,
inference_tensor_parallel=args.inference_tensor_parallel,
inference_pipeline_parallel=1,
use_nfs_workspace=True,
)
print_rank_0(f"TensorRT-LLM checkpoints saved to {args.export_dir}")
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
"""An example script to run the tensorrt_llm engine."""
import argparse
from pathlib import Path
import numpy as np
import torch
from modelopt.deploy.llm import LLM, build_tensorrt_llm
from transformers import AutoTokenizer, T5Tokenizer
class CustomSentencePieceTokenizer(T5Tokenizer):
    """This is a custom GPTSentencePiece Tokenizer modified from the T5Tokenizer.

    Note:
        The modification is kept minimal to make `encode` and `batch_decode` work
        properly (used in TensorRT-LLM engine). Other functions have not been tested.
    """

    def __init__(self, model):
        super().__init__(model, extra_ids=0, bos_token="<s>", pad_token="<pad>")

    def encode(self, text, add_special_tokens: bool = True, **kwargs):
        return torch.Tensor(self.sp_model.encode_as_ids(text))

    def batch_encode_plus(
        self, batch_text_or_text_pairs, add_special_tokens: bool = True, **kwargs
    ):
        return {"input_ids": self.sp_model.encode_as_ids(batch_text_or_text_pairs)}

    def batch_decode(self, sequences, skip_special_tokens: bool = False, **kwargs):
        if isinstance(sequences, np.ndarray) or torch.is_tensor(sequences):
            sequences = sequences.tolist()
        return self.sp_model.decode(sequences)

    def decode(self, token_ids, skip_special_tokens: bool = False, **kwargs):
        return self.sp_model.decode([token_ids])[0]
def parse_arguments():
    parser = argparse.ArgumentParser()
    parser.add_argument("--tokenizer", type=str, default="")
    parser.add_argument("--max-input-len", type=int, default=4096)
    parser.add_argument("--max-output-len", type=int, default=512)
    parser.add_argument("--max-batch-size", type=int, default=8)
    parser.add_argument("--tensorrt-llm-checkpoint-dir", type=str, default=None)
    parser.add_argument("--engine-dir", type=str, default="/tmp/trtllm_engine")
    parser.add_argument(
        "--input-texts",
        type=str,
        default=(
            "Born in north-east France, Soyer trained as a|Born in California, Soyer trained as a"
        ),
        help="Input texts. Please use | to separate different batches.",
    )
    parser.add_argument("--max-beam-width", type=int, default=1)
    parser.add_argument("--profiler-output", type=str, default="")
    return parser.parse_args()
def run(args):
    tokenizer_path = Path(args.tokenizer)

    if tokenizer_path.is_dir():
        # For llama models, use the local HF tokenizer, which is a folder.
        tokenizer = AutoTokenizer.from_pretrained(args.tokenizer, trust_remote_code=True)
    elif tokenizer_path.is_file():
        # For nextllm and nemotron models, use the local Megatron GPTSentencePiece tokenizer, which is a model file.
        tokenizer = CustomSentencePieceTokenizer(args.tokenizer)
    else:
        raise ValueError(
            "args.tokenizer must be a dir to a hf tokenizer checkpoint for llama or a SentencePiece .model file for gptnext"
        )
    print(tokenizer, tokenizer.vocab_size)

    if not hasattr(args, "profiler_output"):
        args.profiler_output = ""

    input_texts = args.input_texts.split("|")
    assert input_texts, "input_text not specified"
    print(input_texts)

    if args.tensorrt_llm_checkpoint_dir is not None:
        print("Building TensorRT-LLM engines.")
        build_tensorrt_llm(
            args.tensorrt_llm_checkpoint_dir + "/config.json",
            args.engine_dir,
            max_input_len=args.max_input_len,
            max_batch_size=args.max_batch_size,
            max_beam_width=args.max_beam_width,
            num_build_workers=1,
        )
        print(f"TensorRT-LLM engines saved to {args.engine_dir}")

    free_memory_before = torch.cuda.mem_get_info()

    # This is a ModelOpt wrapper on top of tensorrt_llm.hlapi.llm.LLM
    llm_engine = LLM(args.engine_dir, tokenizer)

    torch.cuda.cudart().cudaProfilerStart()
    # outputs = llm_engine.generate_text(input_texts, args.max_output_len, args.max_beam_width)
    outputs = llm_engine.generate(input_texts)
    torch.cuda.cudart().cudaProfilerStop()

    free_memory_after = torch.cuda.mem_get_info()
    print(
        f"Used GPU memory: {(free_memory_before[0] - free_memory_after[0]) / 1024 / 1024 / 1024} GB"
    )
    print(outputs)
if __name__ == "__main__":
args = parse_arguments()
run(args)
#!/bin/bash
# This example will start serving the 345M model.
DISTRIBUTED_ARGS="--nproc_per_node 1 \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
CHECKPOINT=<Path to checkpoint (e.g /345m)>
VOCAB_FILE=<Path to vocab.json (e.g. /gpt2-vocab.json)>
MERGE_FILE=<Path to merges.txt (e.g. /gpt2-merges.txt)>
export CUDA_DEVICE_MAX_CONNECTIONS=1
pip install flask-restful
torchrun $DISTRIBUTED_ARGS tools/run_text_generation_server.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--load ${CHECKPOINT} \
--num-attention-heads 16 \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--fp16 \
--micro-batch-size 1 \
--seq-length 1024 \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--seed 42
#!/bin/bash
# This example will start serving the 345M model that is partitioned 8 way tensor parallel
DISTRIBUTED_ARGS="--nproc_per_node 8 \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
CHECKPOINT=<Path to checkpoint (e.g /345m)>
VOCAB_FILE=<Path to vocab.json (e.g. /gpt2-vocab.json)>
MERGE_FILE=<Path to merges.txt (e.g. /gpt2-merges.txt)>
pip install flask-restful
python -m torch.distributed.launch $DISTRIBUTED_ARGS tools/run_text_generation_server.py \
--tensor-model-parallel-size 8 \
--pipeline-model-parallel-size 1 \
--num-layers 24 \
--hidden-size 1024 \
--load ${CHECKPOINT} \
--num-attention-heads 16 \
--max-position-embeddings 1024 \
--tokenizer-type GPT2BPETokenizer \
--fp16 \
--micro-batch-size 1 \
--seq-length 1024 \
--vocab-file $VOCAB_FILE \
--merge-file $MERGE_FILE \
--seed 42
checkpoints/
data-cache/
tensorboard/
triton-cache/
FROM nvcr.io/nvidia/pytorch:24.01-py3
RUN pip uninstall -y triton && \
pip install triton==2.1.0 sentencepiece==0.1.99 flask-restful
# The causal-conv1d and mamba-ssm packages below are built from scratch here
# (which takes significant time) because there are no wheels available on PyPI
# for these relatively newer versions of the packages that are compatible with
# the older NGC-variant PyTorch version (e.g. version 2.2.0.dev231106) that we
# are using (in the NGC base container). Generally, if the package is not
# compatible with the PyTorch version, then it will generate a Python import
# error. The package authors tend to only release wheels for new versions of
# these packages which are compatible with the versions of regular PyTorch and
# NGC-variant PyTorch that are newer at the time of release. So, to use newer
# versions of these packages with relatively older versions of the NGC PyTorch
# container, we tend to have to build the packages from scratch.
RUN cd /tmp && \
git clone https://github.com/Dao-AILab/causal-conv1d.git && \
cd causal-conv1d && \
git checkout v1.2.2.post1 && \
CAUSAL_CONV1D_FORCE_BUILD=TRUE pip install . && \
cd .. && \
rm -rf causal-conv1d
RUN cd /tmp && \
git clone https://github.com/state-spaces/mamba.git && \
cd mamba && \
git checkout v2.0.3 && \
MAMBA_FORCE_BUILD=TRUE pip install . && \
cd .. && \
rm -rf mamba
# Mamba-based Language Models
## Introduction
This document is an entrypoint into the code used for
<em>[An Empirical Study of Mamba-based Language Models](https://arxiv.org/abs/2406.07887)</em>.
We are releasing the parameters for some of the models described in that
technical report via
[HuggingFace](https://huggingface.co/collections/nvidia/ssms-666a362c5c3bb7e4a6bcfb9c).
The code in the `main` branch is no longer compatible with the `Mamba2-*`
checkpoints. You can load them using the
[fixed snapshot of the code used for the technical report](https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba).
## Installation
Create and run a Docker container using the [Dockerfile](./Dockerfile).
```
docker build -t your_image_name:your_tag .
docker run --gpus all -it --rm \
-v /path/to/megatron:/workspace/megatron \
-v /path/to/dataset:/workspace/dataset \
-v /path/to/checkpoints:/workspace/checkpoints \
-w /workspace/megatron/examples/mamba \
your_image_name:your_tag
```
## Train
[`train.sh`](./train.sh) is an example pretraining script, showing how to run on
a single node. Select between 800M-scale and 8B-scale models by setting the
`MODEL_SCALE` variable. The 8B-scale hybrid model architecture is the same as
the one described in the technical report.
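For example (with hypothetical paths inside the container from the Installation step, and `MODEL_SCALE` edited inside the script first):
```sh
cd /workspace/megatron/examples/mamba
./train.sh /workspace/dataset/my-corpus_text_document /workspace/checkpoints/tokenizer.model
```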
## Text Generation
Use [`run_text_gen_server_8b.sh`](./run_text_gen_server_8b.sh) to start a text
generation server using an 8B hybrid checkpoint. This is configured to run the
8B hybrid model described in the technical report, with tensor model parallel
set to 1.
The arguments in the script will need to be changed if using a checkpoint with a
different model parallel configuration or other differences, such as model
architecture. For example, to run the 8B pure Mamba-2 model, change
`--hybrid-attention-ratio` and `--hybrid-mlp-ratio` to 0.0, or remove them.
Use [`run_text_gen_server_8b_gpt3.sh`](./run_text_gen_server_8b_gpt3.sh) to start
a text generation server using the 8B reference Transformer checkpoint.
## Checkpoint Formats
For inference, the model must be configured to match the checkpoint file used,
including the hybrid layer configuration and model parallel configuration.
If you need to convert a hybrid checkpoint file to a different tensor parallel
or pipeline parallel size, use
[the hybrid conversion script](../../tools/checkpoint/hybrid_conversion.py).
There is an example run command at the end of that file.
Before running that script, you will need to set `PYTHONPATH` to include the
root directory of your Megatron-LM repository clone.
```
export PYTHONPATH=<path-to-megatron>:$PYTHONPATH
```
## Hybrid Options
`--hybrid-attention-ratio ATT` specifies a target ratio of attention layers
to total layers. For example, 4 attention layers out of 48 total layers is
specified by `--hybrid-attention-ratio 0.08`.
`--hybrid-mlp-ratio MLP` specifies a target ratio of MLP layers to total
layers. For example, 24 MLP layers out of 48 total layers is specified by
`--hybrid-mlp-ratio 0.5`.
* (`ATT` + `MLP`) must be less than or equal to 1.0.
* (1.0 - `ATT` - `MLP`) is the hybrid mamba ratio, the ratio of mamba layers to
total layers.
* `ATT` = `MLP` = 0 is a pure Mamba model.
* `ATT` = `MLP` = 0.5 is a transformer model.
If either `ATT` or `MLP` is greater than 0.0 or if `--hybrid-override-pattern`
is specified, the logfile will include information about the hybrid layer
pattern used. `--hybrid-override-pattern` can be used to specify a different
pattern than the default, algorithmically-generated one.
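As a quick sanity check on this arithmetic, a minimal standalone sketch (not the pattern-generation code Megatron itself uses):
```python
def hybrid_layer_counts(total_layers: int, att: float, mlp: float):
    """Approximate layer counts implied by the target ratios."""
    assert att + mlp <= 1.0
    n_att = round(att * total_layers)
    n_mlp = round(mlp * total_layers)
    n_mamba = total_layers - n_att - n_mlp  # the (1.0 - ATT - MLP) share
    return n_att, n_mlp, n_mamba

# 48 layers with --hybrid-attention-ratio 0.08 --hybrid-mlp-ratio 0.5:
print(hybrid_layer_counts(48, 0.08, 0.5))  # -> (4, 24, 20)
```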
## Mamba vs Mamba-2
This codebase currently only supports Mamba-2, and not the original version of
Mamba. However, the
[fixed snapshot of the code used for the technical report](https://github.com/NVIDIA/Megatron-LM/tree/ssm/examples/mamba)
can be configured to run the original version of Mamba.
#!/bin/bash
# Use: ./run_text_gen_server_8b.sh <checkpoint-path> <tokenizer-path>
# To launch the client: python ../../tools/text_generation_cli.py <URL-provided-by-server>
CHECKPOINT_PATH=$1
TOKENIZER_PATH=$2
DISTRIBUTED_ARGS="--nproc_per_node 1 \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
export NCCL_IB_SL=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_IB_TIMEOUT=19
export NCCL_IB_QPS_PER_CONNECTION=4
export TRITON_CACHE_DIR="./triton-cache/"
export TRITON_CACHE_MANAGER="megatron.core.ssm.triton_cache_manager:ParallelFileCacheManager"
torchrun $DISTRIBUTED_ARGS ../../tools/run_mamba_text_generation_server.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--untie-embeddings-and-output-weights \
--num-layers 56 \
--hidden-size 4096 \
--load ${CHECKPOINT_PATH} \
--num-attention-heads 32 \
--group-query-attention \
--num-query-groups 8 \
--hybrid-attention-ratio 0.08 \
--hybrid-mlp-ratio 0.5 \
--attention-dropout 0.0 \
--hidden-dropout 0.0 \
--disable-bias-linear \
--normalization RMSNorm \
--seq-length 4096 \
--max-position-embeddings 4096 \
--position-embedding-type none \
--tokenizer-type GPTSentencePieceTokenizer \
--tokenizer-model ${TOKENIZER_PATH} \
--distributed-backend nccl \
--distributed-timeout-minutes 1440 \
--bf16 \
--micro-batch-size 1 \
--use-mcore-models \
--spec megatron.core.models.mamba.mamba_layer_specs mamba_stack_spec \
--seed 42
#!/bin/bash
# Use: ./run_text_gen_server_8b_gpt3.sh <checkpoint-path> <tokenizer-path>
# To launch the client: python ../../tools/text_generation_cli.py <URL-provided-by-server>
CHECKPOINT_PATH=$1
TOKENIZER_PATH=$2
DISTRIBUTED_ARGS="--nproc_per_node 1 \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
export NCCL_IB_SL=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_IB_TIMEOUT=19
export NCCL_IB_QPS_PER_CONNECTION=4
torchrun $DISTRIBUTED_ARGS ../../tools/run_text_generation_server.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 1 \
--use-flash-attn \
--apply-layernorm-1p \
--untie-embeddings-and-output-weights \
--num-layers 32 \
--hidden-size 4096 \
--load ${CHECKPOINT_PATH} \
--num-attention-heads 32 \
--attention-dropout 0.0 \
--hidden-dropout 0.0 \
--disable-bias-linear \
--seq-length 4096 \
--max-position-embeddings 4096 \
--position-embedding-type rope \
--rotary-percent 0.5 \
--squared-relu \
--tokenizer-type GPTSentencePieceTokenizer \
--tokenizer-model ${TOKENIZER_PATH} \
--distributed-backend nccl \
--distributed-timeout-minutes 1440 \
--bf16 \
--micro-batch-size 1 \
--use-mcore-models \
--transformer-impl local \
--seed 42
#!/bin/bash
# Use: ./train.sh <data-path> <tokenizer-path>
MODEL_SCALE="800M" # or "8B"
case "${MODEL_SCALE}" in
"800M")
TENSOR_MODEL_PARALLEL_SIZE=1
NUM_LAYERS=48
HIDDEN_SIZE=1024
NUM_ATTENTION_HEADS=16
GLOBAL_BATCH_SIZE=32
;;
"8B")
TENSOR_MODEL_PARALLEL_SIZE=4
NUM_LAYERS=56
HIDDEN_SIZE=4096
NUM_ATTENTION_HEADS=32
GLOBAL_BATCH_SIZE=8
;;
*)
echo "Invalid version specified"
exit 1
;;
esac
DATA_PATH=$1
TOKENIZER_PATH=$2
export NCCL_IB_SL=1
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_IB_TIMEOUT=19
export NCCL_IB_QPS_PER_CONNECTION=4
CHECKPOINT_DIR="./checkpoints"
DATACACHE_DIR="./data-cache"
TENSORBOARD_DIR="./tensorboard"
mkdir -p ${CHECKPOINT_DIR}
mkdir -p ${DATACACHE_DIR}
mkdir -p ${TENSORBOARD_DIR}
export TRITON_CACHE_DIR="./triton-cache/"
export TRITON_CACHE_MANAGER="megatron.core.ssm.triton_cache_manager:ParallelFileCacheManager"
SEQ_LEN=4096
TRAIN_SAMPLES=73242188 # 300B tokens / 4096
LR_WARMUP_SAMPLES=50000
LR_DECAY_SAMPLES=73192188 # TRAIN_SAMPLES - LR_WARMUP_SAMPLES
options=" \
--tensor-model-parallel-size ${TENSOR_MODEL_PARALLEL_SIZE} \
--sequence-parallel \
--pipeline-model-parallel-size 1 \
--use-distributed-optimizer \
--overlap-param-gather \
--overlap-grad-reduce \
--untie-embeddings-and-output-weights \
--init-method-std 0.02 \
--position-embedding-type none \
--num-layers ${NUM_LAYERS} \
--hidden-size ${HIDDEN_SIZE} \
--num-attention-heads ${NUM_ATTENTION_HEADS} \
--group-query-attention \
--num-query-groups 8 \
--hybrid-attention-ratio 0.08 \
--hybrid-mlp-ratio 0.5 \
--seq-length ${SEQ_LEN} \
--max-position-embeddings ${SEQ_LEN} \
--train-samples ${TRAIN_SAMPLES} \
--lr-warmup-samples ${LR_WARMUP_SAMPLES} \
--lr-decay-samples ${LR_DECAY_SAMPLES} \
--save ${CHECKPOINT_DIR} \
--load ${CHECKPOINT_DIR} \
--data-path ${DATA_PATH} \
--data-cache-path ${DATACACHE_DIR} \
--split 99,1,0 \
--tokenizer-type GPTSentencePieceTokenizer \
--tokenizer-model ${TOKENIZER_PATH} \
--distributed-backend nccl \
--micro-batch-size 4 \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--lr 2.5e-4 \
--min-lr 2.5e-5 \
--lr-decay-style cosine \
--weight-decay 0.1 \
--clip-grad 1.0 \
--attention-dropout 0.0 \
--hidden-dropout 0.0 \
--disable-bias-linear \
--normalization RMSNorm \
--adam-beta1 0.9 \
--adam-beta2 0.95 \
--log-interval 10 \
--save-interval 2000 \
--eval-interval 2000 \
--eval-iters 32 \
--bf16 \
--use-mcore-models \
--spec megatron.core.models.mamba.mamba_layer_specs mamba_stack_spec \
--no-create-attention-mask-in-dataloader \
--tensorboard-dir ${TENSORBOARD_DIR}"
torchrun --nproc_per_node 8 ../../pretrain_mamba.py ${options}
# Mixtral 8x7B Model Inference and Finetuning
## Download Mixtral 8x7B Checkpoints
Download the Mixtral 8x7B HF format checkpoint from the [HF-hub](https://huggingface.co/mistralai/Mixtral-8x7B-v0.1/),
or simply run the following script to download Mixtral 8x7B into a specific folder:
```python
from huggingface_hub import snapshot_download
SAVED_DIR = "" # Specify the saved directory
# Download HF checkpoints
snapshot_download(repo_id="mistralai/Mixtral-8x7B-v0.1", ignore_patterns=["*.pt"], local_dir=SAVED_DIR, local_dir_use_symlinks=False)
```
## Convert Mixtral 8x7B checkpoints from HF to MCore
The HF checkpoints can be converted to Megatron format using the provided checkpoint converter for HF format.
The target model parallel sizes (e.g. TP, PP, EP) should be specified:
```
TOKENIZER_MODEL=/workspace/checkpoints/mixtral-hf/tokenizer.model
MEGATRON_PATH="/workspace/megatron-lm"
export PYTHONPATH=$MEGATRON_PATH:$PYTHONPATH
export CUDA_DEVICE_MAX_CONNECTIONS=1
TARGET_TP_SIZE=1
TARGET_PP_SIZE=4
TARGET_EP_SIZE=8
HF_FORMAT_DIR=/workspace/checkpoints/mixtral-hf
MEGATRON_FORMAT_DIR=/workspace/checkpoints/mixtral-mcore-TP${TARGET_TP_SIZE}PP${TARGET_PP_SIZE}EP${TARGET_EP_SIZE}
python tools/checkpoint/convert.py \
--model-type GPT \
--loader loader_mixtral_hf \
--saver mcore \
--target-tensor-parallel-size ${TARGET_TP_SIZE} \
--target-pipeline-parallel-size ${TARGET_PP_SIZE} \
--target-expert-parallel-size ${TARGET_EP_SIZE} \
--load-dir ${HF_FORMAT_DIR} \
--save-dir ${MEGATRON_FORMAT_DIR} \
--tokenizer-model ${TOKENIZER_MODEL}
```
## Text generation with Mixtral 8x7B
Inference with Mixtral 8x7B requires at least 2 GPUs, so a distributed checkpoint with EP>=2 or PP>=2, converted with the above script, is needed.
Megatron-LM includes a simple REST server for text generation in `tools/run_text_generation_server.py`; launch it with the following script:
```
#!/bin/bash
# This example will start serving the Mixtral 8x7B model.
DISTRIBUTED_ARGS="--nproc_per_node 2 \
--nnodes 1 \
--node_rank 0 \
--master_addr localhost \
--master_port 6000"
CHECKPOINT=<Path to checkpoint>
TOKENIZER_MODEL=<Path to tokenizer (e.g. /tokenizer.model)>
export CUDA_DEVICE_MAX_CONNECTIONS=1
pip install flask-restful
torchrun $DISTRIBUTED_ARGS tools/run_text_generation_server.py \
--tensor-model-parallel-size 1 \
--pipeline-model-parallel-size 2 \
--expert-model-parallel-size 1 \
--load ${CHECKPOINT} \
--tokenizer-type Llama2Tokenizer \
--tokenizer-model $TOKENIZER_MODEL \
--use-mcore-models \
--max-position-embeddings 32768 \
--num-layers 32 \
--hidden-size 4096 \
--ffn-hidden-size 14336 \
--num-attention-heads 32 \
--normalization RMSNorm \
--disable-bias-linear \
--position-embedding-type rope \
--no-position-embedding \
--swiglu \
--untie-embeddings-and-output-weights \
--group-query-attention \
--num-query-groups 8 \
--bf16 \
--micro-batch-size 1 \
--seq-length 1024 \
--seed 42 \
--num-experts 8 \
--moe-router-topk 2 \
--moe-token-dispatcher-type alltoall \
--mock-data \
--rotary-base 1000000
```
Once the server is running, you can use `tools/text_generation_cli.py` to query it; it takes one argument, the host the server is running on.
```
python tools/text_generation_cli.py localhost:5000
```
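You can also query the REST API directly; a sketch assuming the server's default port and the request format used by Megatron's text generation server:
```sh
curl 'http://localhost:5000/api' -X 'PUT' \
    -H 'Content-Type: application/json; charset=UTF-8' \
    -d '{"prompts": ["The capital of France is"], "tokens_to_generate": 32}'
```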
## Finetuning from pretrained Mixtral 8x7B
To finetune the pretrained Mixtral 8x7B, use the following script:
```bash
PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:24.04-py3
CHECKPOINT_PATH="" # Speicfy path to checkpoint dir
TOKENIZER_MODEL="" # Specify path to tokenizer.model
DATA_PATH="" # Specify path to data
docker run \
--gpus=all \
--ipc=host \
--workdir /workspace/megatron-lm \
-v /path/to/data:/path/to/data \
-v /path/to/megatron-lm:/workspace/megatron-lm \
$PYTORCH_IMAGE \
bash examples/mixtral/train_mixtral_8x7b_distributed.sh $CHECKPOINT_PATH $TOKENIZER_MODEL $DATA_PATH
```
#!/bin/bash
# Runs Mixtral 8x7B model
export CUDA_DEVICE_MAX_CONNECTIONS=1
GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=${MASTER_ADDR:-"localhost"}
MASTER_PORT=${MASTER_PORT:-"6000"}
NNODES=${SLURM_NNODES:-"1"}
NODE_RANK=${RANK:-"0"}
WORLD_SIZE=$(($GPUS_PER_NODE*$NNODES))
CHECKPOINT_PATH=$1
TOKENIZER_MODEL=$2
DATA_PATH=$3
DISTRIBUTED_ARGS=(
--nproc_per_node $GPUS_PER_NODE
--nnodes $NNODES
--node_rank $NODE_RANK
--master_addr $MASTER_ADDR
--master_port $MASTER_PORT
)
MODEL_ARGS=(
--use-mcore-models
--disable-bias-linear
--seq-length 4096
--max-position-embeddings 32768
--num-layers 32
--hidden-size 4096
--ffn-hidden-size 14336
--num-attention-heads 32
--init-method-std 0.01
--attention-dropout 0.0
--hidden-dropout 0.0
--normalization RMSNorm
--position-embedding-type rope
--swiglu
--untie-embeddings-and-output-weights
--group-query-attention
--num-query-groups 8
--no-masked-softmax-fusion
--no-position-embedding
--rotary-base 1000000
)
MOE_ARGS=(
--num-experts 8
--moe-router-topk 2
--moe-router-load-balancing-type aux_loss
--moe-aux-loss-coeff 1e-2
--moe-grouped-gemm
--moe-token-dispatcher-type alltoall
--overlap-param-gather
--overlap-grad-reduce
)
DATA_ARGS=(
--tokenizer-type Llama2Tokenizer
--tokenizer-model ${TOKENIZER_MODEL}
--data-path $DATA_PATH
--split 99990,8,2
)
TRAINING_ARGS=(
--micro-batch-size 1
--global-batch-size 256
--lr 1e-4
--train-iters 500000
--lr-decay-iters 320000
--lr-decay-style cosine
--min-lr 1.0e-5
--weight-decay 0.1
--lr-warmup-iters 500
--clip-grad 1.0
--bf16
)
MODEL_PARALLEL_ARGS=(
--tensor-model-parallel-size 1
--pipeline-model-parallel-size 4
--expert-model-parallel-size 8
--use-distributed-optimizer
--sequence-parallel
)
LOGGING_ARGS=(
--log-interval 1 \
--save-interval 10000 \
--eval-interval 1000 \
--eval-iters 10 \
--save $CHECKPOINT_PATH \
--load $CHECKPOINT_PATH \
--tensorboard-dir "${CHECKPOINT_PATH}/tensorboard" \
--no-load-optim \
--no-load-rng
)
if [ -n "${WANDB_API_KEY}" ]; then
LOGGING_ARGS+=(
--wandb-project ${WANDB_PROJECT:-"Mixtral"}
--wandb-exp-name ${WANDB_NAME:-"Mixtral_8x7B"}
)
fi
torchrun ${DISTRIBUTED_ARGS[@]} pretrain_gpt.py \
${MODEL_ARGS[@]} \
${MOE_ARGS[@]} \
${DATA_ARGS[@]} \
${TRAINING_ARGS[@]} \
${MODEL_PARALLEL_ARGS[@]} \
${LOGGING_ARGS[@]}
FROM nvcr.io/nvidia/pytorch:24.02-py3
RUN apt update && \
apt -y upgrade && \
apt install -y --no-install-recommends \
software-properties-common \
build-essential \
python3-pip \
python3-dev \
bash \
git \
vim \
python-is-python3 \
default-jre
RUN pip install --upgrade pip
RUN pip install einops einops-exts sentencepiece braceexpand webdataset
RUN pip install transformers datasets
RUN pip install pytest-cov pytest_mock nltk wrapt
RUN pip install zarr "tensorstore==0.1.45"
RUN pip install git+https://github.com/fanshiqing/grouped_gemm@main
RUN pip install black==19.10b0 isort click==8.0.2
RUN pip install pycocoevalcap megatron-energon
RUN pip install git+https://github.com/openai/CLIP.git
# Use --no-deps for the following to avoid outdated and unnecessary dependencies.
RUN pip install mmf --no-deps
RUN pip install open-flamingo[eval] --no-deps
# Multimodal Example
The following walks through all the steps required to pretrain and instruction tune a llava architecture vision-language model (VLM). It is important to precisely follow all steps to obtain the benchmark scores at the end.
This example has been tested on an A100 based DGX cluster. Pretraining and instruction tuning took approximately 1 day and 11 hours respectively on 64 GPUs using four way tensor parallelism (tp=4). Training speed will scale approximately linearly with number of GPUs available.
Multimodal support in megatron is still under active development. This example is not intended to produce state-of-the-art model quality (that would require more data and model refinements); it is merely intended to demonstrate the multimodal functionality in megatron. If you hit any problems, please open a GitHub issue.
## Setup
### Docker container
You can build a docker container using `examples/multimodal/Dockerfile` to run this example.
### Language model
Follow the instructions in `megatron-lm/docs/llama_mistral.md` to download weights for Mistral-7B-Instruct-v0.3 and convert them to mcore format with tensor parallel size 4.
### Vision model
This example uses the OpenAI CLIP `ViT-L/14@336px` Vision model. To download the weights from OpenAI and convert them to a format that can be loaded in megatron, please run the following:
```
python examples/multimodal/clip_converter.py --download-root /some/download/folder --output /some/output/folder --tensor-parallel-size 4
```
### Combined model checkpoint
Update the paths to point to the mcore converted CLIP and Mistral models and run the following script to combine the Mistral and CLIP models into a single multimodal checkpoint folder:
```
examples/multimodal/combine_mistral_clip.sh
```
## Training
### Pretraining
1. Download the LLaVA-Pretrain dataset from Hugging Face and unzip the images folder (NOTE: 79GB of disk space required):
```
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain
cd LLaVA-Pretrain
unzip images.zip
```
2. Run the following script to convert the data to webdataset format:
```
cd <megatron-lm dir>
python examples/multimodal/convert_llava_pretrain_to_wds.py
```
3. Run the following command to convert to megatron-energon format:
```
cd <LLaVA-Pretrain dir>/wds
energon ./
```
Select the following values for the presented options:
```
> Please enter a desired train/val/test split like "0.5, 0.2, 0.3" or "8,1,1": 9,1,0
> Do you want to create a dataset.yaml interactively? [Y/n]: Y
> Please enter a number to choose a class: 10 (VQAWebdataset)
> Do you want to set a simple field_map[Y] (or write your own sample_loader [n])? [Y/n]: Y
> Please enter a webdataset field name for 'image' (<class 'torch.Tensor'>): jpg
> Please enter a webdataset field name for 'context' (<class 'str'>): json[0][value]
> Please enter a webdataset field name for 'answers' (typing.Optional[typing.List[str]], default: None): json[1][value]
> Please enter a webdataset field name for 'answer_weights' (typing.Optional[torch.Tensor], default: None):
```
4. Update `pretrain_dataset.yaml` so that both `path` variables point to `LLaVA-Pretrain/wds`.
5. Run the following script to pretrain a llava model for image captioning:
```
cd <megatron-lm dir>
examples/multimodal/pretrain_mistral_clip.sh
```
All being well, you should observe training and validation loss curves similar to the following:
<img src="assets/pretrain_curves.png" alt="Pretraining loss curves" width="600"/>
These curves were obtained with a global batch size of 256. Changing this value will likely change the curves. For pretraining and instruction tuning llava models, we have found that loss curves are an unreliable predictor of downstream task performance. Therefore, it is necessary to run test generation and evaluation on a range of metrics to understand model quality. We intend to add training-time zero-shot evaluation in a future update.
### SFT
1. Prepare an instruction tuning dataset in the [megatron-energon format](https://nvidia.github.io/Megatron-Energon/data_prep.html#). NOTE: we do not provide instructions for this.
2. Update `sft_dataset.yaml` so that both `path` variables point to the train and val splits of your instruction tuning dataset.
Run the following script to instruction tune the pre-trained llava model:
```
examples/multimodal/sft_mistral_clip.sh
```
## Evaluation
### Generation
Run the following script:
```
examples/multimodal/text_generation_mistral_clip.sh --input-image-path /path/to/input/images --output-path /some/output/directory \
--model-path /path/to/model.pt --tokenizer-path /path/to/tokenizer.model --gt-path /path/to/groundtruth/file --task generation-task-name
```
### After pretraining
#### COCO captioning
1. Download the COCO 2014 test image set:
```wget http://images.cocodataset.org/zips/test2014.zip```
2. Download COCO test image annotations:
```https://storage.googleapis.com/sfr-vision-language-research/datasets/coco_karpathy_test.json```
3. Run text generation using `--task captioning`.
4. Run the following command:
```
python examples/multimodal/evaluate_coco.py --input-path /output/directory/from/generation --groundtruth-path /path/to/groundtruth/file
```
For the mistral-7b-instruct plus clip llava model, you should obtain a COCO CIDEr score of approximately 94.
### After SFT
#### MMMU
The official MMMU repository is not currently pip installable, so please clone their code into `examples/multimodal` by running `git clone https://github.com/MMMU-Benchmark/MMMU.git`.
The MMMU dataset is loaded from HuggingFace automatically as part of the code.
Run text generation using `--task MMMU`. Then, run the following command:
```
python examples/multimodal/evaluate_mmmu.py --input-path /output/directory/from/generation
```
For the mistral-7b-instruct plus clip instruction tuned llava model, you should obtain an MMMU score of approximately 38.
# Copyright (c) 2024, NVIDIA CORPORATION. All rights reserved.
import argparse
import os
import clip
import torch
def convert(download_root, output_path, tensor_parallel_size, use_te_layernorm_linear):
    device = "cuda"

    model, _ = clip.load("ViT-L/14@336px", device=device, download_root=download_root)

    state_dict = model.state_dict()
    new_state_dicts = [{"model": dict()} for _ in range(tensor_parallel_size)]

    # Indices from mapping pytorch multihead attention to megatron.
    kv_channels = 64
    hidden_dim = 1024
    num_heads = 16
    indices = []
    for i in range(num_heads):
        lb = i * kv_channels
        ub = (i + 1) * kv_channels
        indices.append(torch.arange(lb, ub, dtype=torch.int))
        indices.append(torch.arange(hidden_dim + lb, hidden_dim + ub, dtype=torch.int))
        indices.append(torch.arange(2 * hidden_dim + lb, 2 * hidden_dim + ub, dtype=torch.int))

    indices = torch.cat(indices)

    for name, tensor in state_dict.items():
        # Skip text model.
        if "visual" not in name:
            continue

        # Skip final layers not used in our model.
        if name == "visual.proj" or "ln_post" in name:
            continue

        # Map parameter names to ones used in megatron.
        new_name = ""
        new_tensor = tensor
        if new_tensor.dtype == torch.float16:
            new_tensor = new_tensor.to(torch.float32)

        # This is used for chunking some tensors to target tensor parallel size.
        chunk_dim = None

        if "class_embedding" in name:
            new_name = "class_token"
            # Our model uses class token that is expanded to input dimensions already.
            new_tensor = new_tensor.expand(1, 1, -1)
        elif "positional_embedding" in name:
            new_name = "position_embeddings.weight"
        elif "conv1" in name:
            new_name = "conv1.weight"
        elif "ln_pre.weight" in name:
            new_name = "ln_pre.weight"
        elif "ln_pre.bias" in name:
            new_name = "ln_pre.bias"
        elif "transformer.resblocks" in name:
            layer_idx = name.split(".")[3]
            base = f"decoder.layers.{layer_idx}"

            if "attn.in_proj_weight" in name:
                new_name = f"{base}.self_attention.linear_qkv.weight"
                new_tensor = new_tensor[indices]
                chunk_dim = 0
            elif "attn.in_proj_bias" in name:
                new_name = f"{base}.self_attention.linear_qkv.bias"
                new_tensor = new_tensor[indices]
                chunk_dim = 0
            elif "attn.out_proj.weight" in name:
                new_name = f"{base}.self_attention.linear_proj.weight"
                chunk_dim = 1
            elif "attn.out_proj.bias" in name:
                new_name = f"{base}.self_attention.linear_proj.bias"
            elif "ln_1.weight" in name:
                new_name = f"{base}.input_layernorm.weight"
                if use_te_layernorm_linear:
                    new_name = f"{base}.self_attention.linear_qkv.layer_norm_weight"
            elif "ln_1.bias" in name:
                new_name = f"{base}.input_layernorm.bias"
                if use_te_layernorm_linear:
                    new_name = f"{base}.self_attention.linear_qkv.layer_norm_bias"
            elif "mlp.c_fc.weight" in name:
                new_name = f"{base}.mlp.linear_fc1.weight"
                chunk_dim = 0
            elif "mlp.c_fc.bias" in name:
                new_name = f"{base}.mlp.linear_fc1.bias"
                chunk_dim = 0
            elif "mlp.c_proj.weight" in name:
                new_name = f"{base}.mlp.linear_fc2.weight"
                chunk_dim = 1
            elif "mlp.c_proj.bias" in name:
                new_name = f"{base}.mlp.linear_fc2.bias"
            elif "ln_2.weight" in name:
                new_name = f"{base}.pre_mlp_layernorm.weight"
                if use_te_layernorm_linear:
                    new_name = f"{base}.mlp.linear_fc1.layer_norm_weight"
            elif "ln_2.bias" in name:
                new_name = f"{base}.pre_mlp_layernorm.bias"
                if use_te_layernorm_linear:
                    new_name = f"{base}.mlp.linear_fc1.layer_norm_bias"

        assert new_name != "", f"unexpected layer name {name}"

        if chunk_dim is None:
            new_tensors = [new_tensor for _ in range(tensor_parallel_size)]
        else:
            new_tensors = torch.chunk(new_tensor, tensor_parallel_size, dim=chunk_dim)

        for i in range(tensor_parallel_size):
            # chunk() creates a view of a bigger tensor. clone() is used here to avoid excessive storage.
            new_state_dicts[i]["model"][new_name] = new_tensors[i].clone()

    for i in range(tensor_parallel_size):
        output_path_tp = os.path.join(output_path, f"state_dict_tp_{i}.pt")
        torch.save(new_state_dicts[i], output_path_tp)
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="""
Convert OpenAI CLIP VIT weights to megatron format.
Example usage:
python clip_converter.py --download-root /some/download/folder --output /some/output/folder --tensor-parallel-size 4
""",
formatter_class=argparse.RawDescriptionHelpFormatter,
)
parser.add_argument(
"--download-root", type=str, required=True, help="Download folder for OpenAI CLIP weights",
)
parser.add_argument(
"--output", type=str, required=True, help="output directory for megatron state dict file(s)"
)
parser.add_argument(
"--tensor-parallel-size", type=int, default=1, help="model tensor parallel size",
)
parser.add_argument(
"--use-te-layernorm-linear",
action="store_true",
help="Use Transformer Engine's LayerNormLinear",
)
args = parser.parse_args()
convert(
args.download_root, args.output, args.tensor_parallel_size, args.use_te_layernorm_linear
)
print("done.")
MCORE_MISTRAL=<path_to_mcore_mistral_model_folder>
MCORE_CLIP=<path_to_mcore_clip_model_folder>
OUTPUT_DIR=<path_to_output_folder_for_combined_checkpoint>
python examples/multimodal/combine_state_dicts.py \
--input \
${MCORE_MISTRAL}/iter_0000001/mp_rank_00/model_optim_rng.pt \
${MCORE_CLIP}/iter_0000001/mp_rank_00/model_optim_rng.pt \
${MCORE_MISTRAL}/iter_0000001/mp_rank_01/model_optim_rng.pt \
${MCORE_CLIP}/iter_0000001/mp_rank_01/model_optim_rng.pt \
${MCORE_MISTRAL}/iter_0000001/mp_rank_02/model_optim_rng.pt \
${MCORE_CLIP}/iter_0000001/mp_rank_02/model_optim_rng.pt \
${MCORE_MISTRAL}/iter_0000001/mp_rank_03/model_optim_rng.pt \
${MCORE_CLIP}/iter_0000001/mp_rank_03/model_optim_rng.pt \
--prefixes language_model vision_model language_model vision_model language_model vision_model language_model vision_model \
--output \
${OUTPUT_DIR}/mistral_instruct_clip336_tp4_combined_mcore/iter_0000001/mp_rank_00/model_optim_rng.pt \
${OUTPUT_DIR}/mistral_instruct_clip336_tp4_combined_mcore/iter_0000001/mp_rank_01/model_optim_rng.pt \
${OUTPUT_DIR}/mistral_instruct_clip336_tp4_combined_mcore/iter_0000001/mp_rank_02/model_optim_rng.pt \
${OUTPUT_DIR}/mistral_instruct_clip336_tp4_combined_mcore/iter_0000001/mp_rank_03/model_optim_rng.pt