wangsen / megatron-lm — Commit 7c19b3a8
Authored Sep 26, 2024 by wangsen: "Initial commit".
Pipeline #1721 failed with stages in 0 seconds.
This page shows 20 of the commit's 622 changed files, with 1459 additions and 0 deletions (+1459 / −0).
Changed files on this page (additions / deletions):

* examples/academic_paper_scripts/sc21/CONFIG.sh (+57 / −0)
* examples/academic_paper_scripts/sc21/README.md (+50 / −0)
* examples/academic_paper_scripts/sc21/SBATCH.sh (+13 / −0)
* examples/academic_paper_scripts/sc21/SRUN.sh (+18 / −0)
* examples/academic_paper_scripts/sc21/run_figure_11.sh (+46 / −0)
* examples/academic_paper_scripts/sc21/run_figure_12.sh (+54 / −0)
* examples/academic_paper_scripts/sc21/run_figure_13.sh (+46 / −0)
* examples/academic_paper_scripts/sc21/run_figure_14.sh (+47 / −0)
* examples/academic_paper_scripts/sc21/run_figure_15.sh (+47 / −0)
* examples/academic_paper_scripts/sc21/run_figure_16.sh (+43 / −0)
* examples/academic_paper_scripts/sc21/run_figure_17.sh (+54 / −0)
* examples/academic_paper_scripts/sc21/run_figure_18.sh (+54 / −0)
* examples/academic_paper_scripts/sc21/run_table_1.sh (+145 / −0)
* examples/bert/README.md (+54 / −0)
* examples/bert/train_bert_340m_distributed.sh (+77 / −0)
* examples/gpt3/README.md (+57 / −0)
* examples/gpt3/gpt_config.yaml (+303 / −0)
* examples/gpt3/train_gpt3_175b_distributed.sh (+81 / −0)
* examples/gpt3/train_gpt3_345M_distributed.sh (+85 / −0)
* examples/inference/modelopt/README.md (+128 / −0)
examples/academic_paper_scripts/sc21/CONFIG.sh (new file, mode 100755)

#!/bin/bash

# SLURM options.
export SLURM_PARTITION=<slurm partition, used to feed -p option in slurm>
export SLURM_ACCOUNT=<slurm account, used to feed -A option in slurm>

# Source code.
export MEGATRON_CODE_DIR=<megatron source code directory>

# This variable is used to mount the relevant part of the filesystem
# inside the docker container. Note that the `MEGATRON_CODE_DIR` and the
# launch directory already get mounted; this variable should be used to
# mount the directories that contain the data and tokenizer files.
export DOCKER_MOUNT_DIR=<megatron dataset and bpe tokenizer vocab path>

# Data and tokenizer files.
MEGATRON_DATA=<path to megatron processed data>
BPE_VOCAB_FILE=<path to bpe vocab file>
BPE_MERGE_FILE=<path to bpe merges file>

# Megatron input parameters.
# `MEGATRON_EXTRA_PARAMS` can be used to provide any extra parameters
# that are not listed here.
export MEGATRON_PARAMS=" ${MEGATRON_EXTRA_PARAMS} \
        --tensor-model-parallel-size ${TP} \
        --pipeline-model-parallel-size ${PP} \
        --micro-batch-size ${MBS} \
        --global-batch-size ${GBS} \
        --num-layers ${NLS} \
        --hidden-size ${HS} \
        --num-attention-heads ${NAH} \
        --DDP-impl ${DDP} \
        --data-path ${MEGATRON_DATA} \
        --vocab-file ${BPE_VOCAB_FILE} \
        --merge-file ${BPE_MERGE_FILE} \
        --log-interval 5 \
        --seq-length 2048 \
        --max-position-embeddings 2048 \
        --train-iters 500 \
        --lr-decay-iters 320 \
        --lr 0.0001 \
        --min-lr 0.00001 \
        --lr-decay-style cosine \
        --lr-warmup-fraction 0.01 \
        --split 969,30,1 \
        --eval-iters 100 \
        --eval-interval 1000 \
        --clip-grad 1.0 \
        --fp16 \
        --loss-scale 8192 "
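For concreteness, a filled-in header of `CONFIG.sh` might look like the lines below. Every value here is a placeholder invented for illustration (the partition, account, and paths are not from the repository); only the variable names come from the file above.

```sh
export SLURM_PARTITION=batch                            # hypothetical partition fed to `sbatch -p`
export SLURM_ACCOUNT=my_account                         # hypothetical account fed to `sbatch -A`
export MEGATRON_CODE_DIR=/lustre/users/me/megatron-lm   # hypothetical clone of Megatron-LM
export DOCKER_MOUNT_DIR=/lustre/users/me/gpt2_data      # hypothetical root holding data + BPE files
MEGATRON_DATA=${DOCKER_MOUNT_DIR}/my-gpt2_text_document
BPE_VOCAB_FILE=${DOCKER_MOUNT_DIR}/gpt2-vocab.json
BPE_MERGE_FILE=${DOCKER_MOUNT_DIR}/gpt2-merges.txt
```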
examples/academic_paper_scripts/sc21/README.md (new file, mode 100644)

# Reproducing Figures in SC21 Paper

This directory contains some of the scripts that were used to produce the
results in the [Megatron paper](https://arxiv.org/pdf/2104.04473.pdf) that is
to appear at [SuperComputing 2021](https://sc21.supercomputing.org/). These
scripts use [Slurm](https://slurm.schedmd.com/documentation.html) with the
[pyxis plugin](https://github.com/NVIDIA/pyxis), but can be modified for other
schedulers as well.

## Git commit

To replicate these results use Megatron-LM commit: 6985e58938d40ad91ac07b0fddcfad8132e1447e

## Setup

All the cluster-dependent variables are in [`CONFIG.sh`](./CONFIG.sh). Please
update the unspecified values (in angle brackets `<...>`) before launching any
scripts.

## Scripts

Below is a list of scripts that can be used to reproduce various figures in our
[paper](https://arxiv.org/pdf/2104.04473.pdf):

* [run_table_1.sh](./run_table_1.sh): Table 1 showing weak-scaling throughput
  for GPT models ranging from 1 billion to 1 trillion parameters.
* [run_figure_11.sh](./run_figure_11.sh): Figure 11 showing the weak-scaling
  performance of pipeline parallelism.
* [run_figure_12.sh](./run_figure_12.sh): Figure 12 showing the effect of
  the interleaved schedule on a 175B GPT model.
* [run_figure_13.sh](./run_figure_13.sh): Figure 13 showing the effect of
  different degrees of pipeline and tensor model parallelism on a model with
  162.2 billion parameters.
* [run_figure_14.sh](./run_figure_14.sh): Figure 14 showing the effect of
  different degrees of data and pipeline model parallelism on a model with
  5.9 billion parameters.
* [run_figure_15.sh](./run_figure_15.sh): Figure 15 showing the effect of
  different degrees of data and tensor model parallelism on a model with
  5.9 billion parameters.
* [run_figure_16.sh](./run_figure_16.sh): Figure 16 showing the effect of
  microbatch size.
* [run_figure_17.sh](./run_figure_17.sh): Figure 17 showing the effect of
  activation recomputation.
* [run_figure_18.sh](./run_figure_18.sh): Figure 18 showing the effect of
  the scatter-gather communication optimization.
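Each `run_*.sh` script sets the model and parallelism variables (`TP`, `PP`, `MBS`, `GBS`, `NLS`, `HS`, `NAH`, `DDP`, and optionally `MEGATRON_EXTRA_PARAMS`), sources `CONFIG.sh` to assemble `MEGATRON_PARAMS`, and then sources `SBATCH.sh`, which submits `SRUN.sh` to Slurm. A minimal launch, assuming `CONFIG.sh` has already been filled in, is therefore just:

```sh
cd examples/academic_paper_scripts/sc21
bash run_figure_11.sh    # edit PP/GBS at the top of the script first; submits one sbatch job
```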
examples/academic_paper_scripts/sc21/SBATCH.sh (new file, mode 100755)

#!/bin/bash

sbatch -p ${SLURM_PARTITION} \
       -A ${SLURM_ACCOUNT} \
       --job-name=${JOB_NAME} \
       --nodes=${NNODES} \
       --export=MEGATRON_CODE_DIR,MEGATRON_PARAMS,DOCKER_MOUNT_DIR SRUN.sh

exit 0
examples/academic_paper_scripts/sc21/SRUN.sh (new file, mode 100755)

#!/bin/bash

#SBATCH -t 0:30:00 --exclusive --mem=0 --overcommit --ntasks-per-node=8

THIS_DIR=`pwd`
DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
mkdir -p ${THIS_DIR}/logs

CMD="python -u ${MEGATRON_CODE_DIR}/pretrain_gpt.py ${MEGATRON_PARAMS}"

srun -l \
     --container-image "nvcr.io#nvidia/pytorch:20.12-py3" \
     --container-mounts "${THIS_DIR}:${THIS_DIR},${MEGATRON_CODE_DIR}:${MEGATRON_CODE_DIR},${DOCKER_MOUNT_DIR}:${DOCKER_MOUNT_DIR}" \
     --output=${THIS_DIR}/logs/%x_%j_$DATETIME.log sh -c "${CMD}"
examples/academic_paper_scripts/sc21/run_figure_11.sh (new file, mode 100755)

#!/bin/bash

# ================================
# Choose the case to run.
# ================================

# Pipeline-parallel size options = [1, 2, 4, 8].
PP=1

# Batch size (global batch size) options = [8, 128].
GBS=8

# Set pipeline-parallel size options.
NLS=$((3*PP))
NNODES=${PP}

# Other params.
TP=8
MBS=1
HS=20480
NAH=128
DDP=local
MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "

# Name of the job.
export JOB_NAME=results_figure_11_pipeline_parallel_size_${PP}_batch_size_${GBS}

# Import the configs.
. `pwd`/CONFIG.sh

# Submit the job.
. `pwd`/SBATCH.sh

exit 0
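To cover all the pipeline-parallel sizes listed at the top of the script, one option is a small sweep wrapper along the following lines. This is a sketch, not part of the repository; the temporary copies stay in the same directory so that the `` `pwd`/CONFIG.sh `` lookup still resolves.

```sh
# Hypothetical sweep over PP = 1, 2, 4, 8 for Figure 11; each copy submits its own sbatch job.
for pp in 1 2 4 8; do
    sed "s/^PP=.*/PP=${pp}/" run_figure_11.sh > run_figure_11_pp${pp}.sh
    bash run_figure_11_pp${pp}.sh
done
```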
examples/academic_paper_scripts/sc21/run_figure_12.sh (new file, mode 100755)

#!/bin/bash

# ================================
# Choose the case to run.
# ================================

# Interleaved schedule options = [YES, NO].
INTERLEAVED=YES

# Batch size (global batch size) options = [12, 24, 36, ..., 60].
GBS=12

# Set interleaved schedule options.
if [ ${INTERLEAVED} == "YES" ]; then
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform --num-layers-per-virtual-pipeline-stage 2 "
elif [ ${INTERLEAVED} == "NO" ]; then
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
else
    echo "Invalid configuration"
    exit 1
fi

# Other params.
TP=8
PP=12
MBS=1
NLS=96
HS=12288
NAH=96
DDP=local
NNODES=12

# Name of the job.
export JOB_NAME=results_figure_12_interleaved_${INTERLEAVED}_batch_size_${GBS}

# Import the configs.
. `pwd`/CONFIG.sh

# Submit the job.
. `pwd`/SBATCH.sh

exit 0
examples/academic_paper_scripts/sc21/run_figure_13.sh (new file, mode 100755)

#!/bin/bash

# ================================
# Choose the case to run.
# ================================

# Pipeline-parallel size options = [2, 4, 8, 16, 32].
PP=2

# Batch size (global batch size) options = [32, 128].
GBS=32

# Set pipeline-parallel and tensor-parallel size options.
TP=$((64/PP))

# Other params.
MBS=1
NLS=32
HS=20480
NAH=128
DDP=local
MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
NNODES=8

# Name of the job.
export JOB_NAME=results_figure_13_pipeline_parallel_size_${PP}_tensor_parallel_size_${TP}_batch_size_${GBS}

# Import the configs.
. `pwd`/CONFIG.sh

# Submit the job.
. `pwd`/SBATCH.sh

exit 0
examples/academic_paper_scripts/sc21/run_figure_14.sh (new file, mode 100755)

#!/bin/bash

# ================================
# Choose the case to run.
# ================================

# Pipeline-parallel size options = [2, 4, 8, 16, 32].
PP=2

# Batch size (global batch size) options = [32, 512].
GBS=32

# Set pipeline-parallel and data-parallel size options.
DP=$((64/PP))

# Other params.
TP=1
MBS=1
NLS=32
HS=3840
NAH=32
DDP=local
MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
NNODES=8

# Name of the job.
export JOB_NAME=results_figure_14_pipeline_parallel_size_${PP}_data_parallel_size_${DP}_batch_size_${GBS}

# Import the configs.
. `pwd`/CONFIG.sh

# Submit the job.
. `pwd`/SBATCH.sh

exit 0
examples/academic_paper_scripts/sc21/run_figure_15.sh (new file, mode 100755)

#!/bin/bash

# ================================
# Choose the case to run.
# ================================

# Tensor-parallel size options = [2, 4, 8, 16, 32].
TP=2

# Batch size (global batch size) options = [32, 128, 512].
GBS=32

# Set tensor-parallel and data-parallel size options.
DP=$((64/TP))

# Other params.
PP=1
MBS=1
NLS=32
HS=3840
NAH=32
DDP=local
MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
NNODES=8

# Name of the job.
export JOB_NAME=results_figure_15_tensor_parallel_size_${TP}_data_parallel_size_${DP}_batch_size_${GBS}

# Import the configs.
. `pwd`/CONFIG.sh

# Submit the job.
. `pwd`/SBATCH.sh

exit 0
examples/academic_paper_scripts/sc21/run_figure_16.sh (new file, mode 100755)

#!/bin/bash

# ================================
# Choose the case to run.
# ================================

# Microbatch size options = [1, 2, 4, 8].
MBS=1

# Batch size (global batch size) options = [128, 512].
GBS=128

# Other params.
TP=8
PP=8
NLS=32
HS=15360
NAH=128
DDP=local
MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
NNODES=8

# Name of the job.
export JOB_NAME=results_figure_16_microbatch_size_${MBS}_batch_size_${GBS}

# Import the configs.
. `pwd`/CONFIG.sh

# Submit the job.
. `pwd`/SBATCH.sh

exit 0
examples/academic_paper_scripts/sc21/run_figure_17.sh (new file, mode 100755)

#!/bin/bash

# ================================
# Choose the case to run.
# ================================

# Activation recomputation options = [YES, NO].
ACTIVATION_RECOMPUTATION=YES

# Batch size (global batch size) options = [1, 2, 4, ..., 256].
GBS=1

# Set activation recomputation.
if [ ${ACTIVATION_RECOMPUTATION} == "YES" ]; then
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
elif [ ${ACTIVATION_RECOMPUTATION} == "NO" ]; then
    MEGATRON_EXTRA_PARAMS=""
else
    echo "Invalid configuration"
    exit 1
fi

# Other params.
TP=8
PP=16
MBS=1
NLS=80
HS=12288
NAH=96
DDP=local
NNODES=16

# Name of the job.
export JOB_NAME=results_figure_17_activation_recomputation_${ACTIVATION_RECOMPUTATION}_batch_size_${GBS}

# Import the configs.
. `pwd`/CONFIG.sh

# Submit the job.
. `pwd`/SBATCH.sh

exit 0
examples/academic_paper_scripts/sc21/run_figure_18.sh (new file, mode 100755)

#!/bin/bash

# ================================
# Choose the case to run.
# ================================

# Scatter-gather communication optimization options = [YES, NO].
SCATTER_GATHER=YES

# Batch size (global batch size) options = [12, 24, 36, ..., 60].
GBS=12

# Set scatter-gather communication optimization options.
if [ ${SCATTER_GATHER} == "YES" ]; then
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform --num-layers-per-virtual-pipeline-stage 2 "
elif [ ${SCATTER_GATHER} == "NO" ]; then
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform --num-layers-per-virtual-pipeline-stage 2 --no-scatter-gather-tensors-in-pipeline "
else
    echo "Invalid configuration"
    exit 1
fi

# Other params.
TP=8
PP=12
MBS=1
NLS=96
HS=12288
NAH=96
DDP=local
NNODES=12

# Name of the job.
export JOB_NAME=results_figure_18_scatter_gather_${SCATTER_GATHER}_batch_size_${GBS}

# Import the configs.
. `pwd`/CONFIG.sh

# Submit the job.
. `pwd`/SBATCH.sh

exit 0
examples/academic_paper_scripts/sc21/run_table_1.sh (new file, mode 100755)

#!/bin/bash

# ================================
# Choose the case to run.
# ================================

# model size options = [1.7B, 3.6B, 7.5B, 18B, 39B, 76B, 145B, 310B, 530B, 1T]
MODEL_SIZE=1.7B

if [ ${MODEL_SIZE} == "1.7B" ]; then
    TP=1
    PP=1
    MBS=16
    GBS=512
    NLS=24
    HS=2304
    NAH=24
    DDP=torch
    NNODES=4
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
elif [ ${MODEL_SIZE} == "3.6B" ]; then
    TP=2
    PP=1
    MBS=16
    GBS=512
    NLS=30
    HS=3072
    NAH=32
    DDP=torch
    NNODES=8
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
elif [ ${MODEL_SIZE} == "7.5B" ]; then
    TP=4
    PP=1
    MBS=16
    GBS=512
    NLS=36
    HS=4096
    NAH=32
    DDP=torch
    NNODES=16
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
elif [ ${MODEL_SIZE} == "18B" ]; then
    TP=8
    PP=1
    MBS=8
    GBS=1024
    NLS=40
    HS=6144
    NAH=48
    DDP=torch
    NNODES=32
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
elif [ ${MODEL_SIZE} == "39B" ]; then
    TP=8
    PP=2
    MBS=4
    GBS=1536
    NLS=48
    HS=8192
    NAH=64
    DDP=local
    NNODES=64
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
elif [ ${MODEL_SIZE} == "76B" ]; then
    TP=8
    PP=4
    MBS=2
    GBS=1792
    NLS=60
    HS=10240
    NAH=80
    DDP=local
    NNODES=128
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform --num-layers-per-virtual-pipeline-stage 5"
elif [ ${MODEL_SIZE} == "145B" ]; then
    TP=8
    PP=8
    MBS=2
    GBS=2304
    NLS=80
    HS=12288
    NAH=96
    DDP=local
    NNODES=192
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform --num-layers-per-virtual-pipeline-stage 5 "
elif [ ${MODEL_SIZE} == "310B" ]; then
    TP=8
    PP=16
    MBS=1
    GBS=2160
    NLS=96
    HS=16384
    NAH=128
    DDP=local
    NNODES=240
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform --num-layers-per-virtual-pipeline-stage 3 "
elif [ ${MODEL_SIZE} == "530B" ]; then
    TP=8
    PP=35
    MBS=1
    GBS=2520
    NLS=105
    HS=20480
    NAH=128
    DDP=local
    NNODES=315
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform --num-layers-per-virtual-pipeline-stage 1 "
elif [ ${MODEL_SIZE} == "1T" ]; then
    TP=8
    PP=64
    MBS=1
    GBS=3072
    NLS=128
    HS=25600
    NAH=160
    DDP=local
    NNODES=384
    MEGATRON_EXTRA_PARAMS="--activations-checkpoint-method uniform "
else
    echo "Invalid configuration"
    exit 1
fi

# Name of the job
export JOB_NAME=results_table_1_model_size_${MODEL_SIZE}

# Import the configs.
. `pwd`/CONFIG.sh

# Submit the job.
. `pwd`/SBATCH.sh

exit 0
examples/bert/README.md (new file, mode 100644)

# BERT MODEL

## Table of contents
- [1. Training Setup](#1-training-setup)
- [2. Configurations](#2-configurations)

## 1. Training setup
<a id="markdown-training-setup" name="training-setup"></a>

To run the model using a docker container, run it as follows:

```
PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:24.01-py3
CHECKPOINT_PATH="" #<Specify path>
TENSORBOARD_LOGS_PATH="" #<Specify path>
VOCAB_FILE="" #<Specify path to file>/bert-vocab.txt
DATA_PATH="" #<Specify path and file prefix>_text_document

docker run \
  --gpus=all \
  --ipc=host \
  --workdir /workspace/megatron-lm \
  -v /path/to/data:/path/to/data \
  -v /path/to/megatron-lm:/workspace/megatron-lm \
  megatron-lm nvcr.io/nvidia/pytorch:24.01-py3 \
  bash examples/bert/train_bert_340m_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $DATA_PATH
```

NOTE: Depending on the environment you are running it in, the above command may look slightly different.

## 2. Configurations
<a id="markdown-configurations" name="configurations"></a>

The example in this folder shows how to run the 340M model. Other configurations you could run as well:

### 4B
```
       --num-layers 48 \
       --hidden-size 2560 \
       --num-attention-heads 32 \
       --tensor-model-parallel-size 1 \
       --pipeline-model-parallel-size 1 \
```

### 20B
```
       --num-layers 48 \
       --hidden-size 6144 \
       --num-attention-heads 96 \
       --tensor-model-parallel-size 4 \
       --pipeline-model-parallel-size 4 \
```
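For illustration, one way to try the 4B configuration listed above is to edit the corresponding argument groups in `train_bert_340m_distributed.sh` (shown next). The sketch below simply restates the README's values; keeping the 340M script's sequence length of 512 is an assumption, not something the README specifies.

```sh
# Hypothetical edit: swap the 4B settings from the README into the training script.
BERT_MODEL_ARGS=(
    --num-layers 48
    --hidden-size 2560
    --num-attention-heads 32
    --seq-length 512              # carried over from the 340M script (assumption)
    --max-position-embeddings 512
)
MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 1
    --pipeline-model-parallel-size 1
)
```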
examples/bert/train_bert_340m_distributed.sh (new file, mode 100644)

#!/bin/bash

# Runs the "340M" parameter model (Bert - Large)

export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NUM_NODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))

CHECKPOINT_PATH=$1 #<Specify path>
TENSORBOARD_LOGS_PATH=$2 #<Specify path>
VOCAB_FILE=$3 #<Specify path to file>/bert-vocab.json
DATA_PATH=$4 #<Specify path and file prefix>_text_document

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE
    --nnodes $NUM_NODES
    --master_addr $MASTER_ADDR
    --master_port $MASTER_PORT
)

BERT_MODEL_ARGS=(
    --num-layers 24
    --hidden-size 1024
    --num-attention-heads 16
    --seq-length 512
    --max-position-embeddings 512
)

TRAINING_ARGS=(
    --micro-batch-size 4
    --global-batch-size 32
    --train-iters 1000000
    --weight-decay 1e-2
    --clip-grad 1.0
    --fp16
    --lr 0.0001
    --lr-decay-iters 990000
    --lr-decay-style linear
    --min-lr 1.0e-5
    --weight-decay 1e-2
    --lr-warmup-fraction .01
    --clip-grad 1.0
)

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 8
    --pipeline-model-parallel-size 16
)

DATA_ARGS=(
    --data-path $DATA_PATH
    --vocab-file $VOCAB_FILE
    --split 949,50,1
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 100
    --save-interval 10000
    --eval-interval 1000
    --save $CHECKPOINT_PATH
    --load $CHECKPOINT_PATH
    --eval-iters 10
    --tensorboard-dir $TENSORBOARD_LOGS_PATH
)

torchrun ${DISTRIBUTED_ARGS[@]} pretrain_bert.py \
    ${BERT_MODEL_ARGS[@]} \
    ${TRAINING_ARGS[@]} \
    ${MODEL_PARALLEL_ARGS[@]} \
    ${DATA_ARGS[@]} \
    ${EVAL_AND_LOGGING_ARGS[@]}
examples/gpt3/README.md (new file, mode 100644)

# GPT3 MODEL

## Table of contents
- [1. Training Setup](#1-training-setup)
- [2. Configurations](#2-configurations)
- [3. Training Results](#3-training-results)

## 1. Training setup
<a id="markdown-training-setup" name="training-setup"></a>

To run the model using a docker container, run it as follows:

```
PYTORCH_IMAGE=nvcr.io/nvidia/pytorch:24.01-py3
CHECKPOINT_PATH="" #<Specify path>
TENSORBOARD_LOGS_PATH="" #<Specify path>
VOCAB_FILE="" #<Specify path to file>/gpt2-vocab.json
MERGE_FILE="" #<Specify path to file>/gpt2-merges.txt
DATA_PATH="" #<Specify path and file prefix>_text_document

docker run \
  --gpus=all \
  --ipc=host \
  --workdir /workspace/megatron-lm \
  -v /path/to/data:/path/to/data \
  -v /path/to/megatron-lm:/workspace/megatron-lm \
  megatron-lm nvcr.io/nvidia/pytorch:24.01-py3 \
  bash examples/gpt3/train_gpt3_175b_distributed.sh $CHECKPOINT_PATH $TENSORBOARD_LOGS_PATH $VOCAB_FILE $MERGE_FILE $DATA_PATH
```

NOTE: Depending on the environment you are running it in, the above command may look slightly different.

## 2. Configurations
<a id="markdown-configurations" name="configurations"></a>

The example in this folder shows how to run the 175B model. Other configurations you could run as well:

### 345M
```
       --num-layers 12 \
       --hidden-size 512 \
       --num-attention-heads 8 \
       --seq-length 1024 \
       --tensor-model-parallel-size 1 \
       --pipeline-model-parallel-size 1 \
```

### 857M
```
       --num-layers 24 \
       --hidden-size 1024 \
       --num-attention-heads 16 \
       --seq-length 2048 \
       --tensor-model-parallel-size 1 \
       --pipeline-model-parallel-size 1 \
```
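As with the BERT example, these alternative configurations are applied by editing the argument groups in the training script. The sketch below drops the 345M settings from the table above into the shape used by `train_gpt3_175b_distributed.sh`; matching `--max-position-embeddings` to the chosen `--seq-length` is an assumption, not something the README states.

```sh
# Hypothetical edit: the 345M settings from the README in the training script's argument groups.
GPT_MODEL_ARGS=(
    --num-layers 12
    --hidden-size 512
    --num-attention-heads 8
    --seq-length 1024
    --max-position-embeddings 1024   # set to match seq-length (assumption)
)
MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 1
    --pipeline-model-parallel-size 1
)
```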
examples/gpt3/gpt_config.yaml (new file, mode 100644)

# WARNING: Yaml configs is currently an experimental feature
language_model:
  # model architecture
  num_layers: 24
  hidden_size: 1024
  num_attention_heads: 16
  num_query_groups: null
  ffn_hidden_size: null
  kv_channels: null
  hidden_dropout: 0.0
  attention_dropout: 0.0
  fp32_residual_connection: False

  apply_residual_connection_post_layernorm: False
  layernorm_epsilon: 1.e-5
  layernorm_zero_centered_gamma: True
  add_bias_linear: False
  bias_activation_fusion: False
  add_qkv_bias: False
  gated_linear_unit: False
  activation_func: swiglu
  num_moe_experts: null
  rotary_interleaved: False
  window_size: null

  # initialization
  init_method: null
  init_method_std: 0.02
  output_layer_init_method: null

  # mixed-precision
  apply_query_key_layer_scaling: False
  attention_softmax_in_fp32: False

  # fusion
  bias_swiglu_fusion: True
  masked_softmax_fusion: True
  persist_layer_norm: False
  memory_efficient_layer_norm: False
  bias_dropout_fusion: True
  apply_rope_fusion: True

  # activation recomputation
  recompute_granularity: null
  recompute_method: null
  recompute_num_layers: null
  distribute_saved_activations: null

  # fp8 related
  fp8: null
  fp8_margin: 0
  fp8_interval: 1
  fp8_amax_history_len: 1
  fp8_amax_compute_algo: "most_recent"
  fp8_wgrad: True

  # miscellaneous
  clone_scatter_output_in_embedding: True

  normalization: "LayerNorm"  # alt value supported by TE: "RMSNorm"

  # MoE related
  moe_router_load_balancing_type: "aux_loss"
  moe_router_topk: 2
  moe_grouped_gemm: False
  moe_aux_loss_coeff: 0      # 1e-2 would be a good start value for load balance loss.
  moe_z_loss_coeff: null     # 1e-3 would be a good start value for z-loss
  moe_input_jitter_eps: null
  moe_token_dropping: False

model_parallel:
  # Model parallelism
  tensor_model_parallel_size: 1
  context_parallel_size: 1
  pipeline_model_parallel_size: 1
  virtual_pipeline_model_parallel_size: null
  sequence_parallel: True
  expert_model_parallel_size: 1

  # Initialization
  perform_initialization: True
  use_cpu_initialization: null

  # Training
  fp16: False
  bf16: True
  params_dtype: null  # Set from above arguments for core
  timers: null

  # Optimizations
  gradient_accumulation_fusion: True
  async_tensor_model_parallel_allreduce: True
  tp_comm_overlap: False

  # Debug Options
  tp_comm_split_ag: True
  tp_comm_atomic_ag: True
  tp_comm_split_rs: True
  tp_comm_atomic_rs: True
  tp_comm_bulk_wgrad: True
  tp_comm_bulk_dgrad: True

  # Parallelism
  finalize_model_grads_func: null

  # Pipeline Parallel
  pipeline_dtype: null
  grad_scale_func: null
  enable_autocast: False
  autocast_dtype: null
  variable_seq_lengths: False
  num_microbatches_with_partial_activation_checkpoints: null
  overlap_p2p_comm: False
  batch_p2p_comm: True
  batch_p2p_sync: True
  use_ring_exchange_p2p: False
  deallocate_pipeline_outputs: False
  no_sync_func: null
  grad_sync_func: null
  param_sync_func: null
  pipeline_model_parallel_split_rank: null

  # CPU Offloading
  cpu_offloading: False
  cpu_offloading_num_layers: 0
  _cpu_offloading_context: null
  cpu_offloading_weights: False
  cpu_offloading_activations: True

  # Timing
  barrier_with_L1_time: True

# training:
use_legacy_models: False
spec: null
micro_batch_size: 2
global_batch_size: 128
rampup_batch_size: [32, 32, 65324160]
check_for_nan_in_loss_and_grad: True
num_layers_per_virtual_pipeline_stage: null

encoder_num_layers: null
decoder_num_layers: null
rotary_seq_len_interpolation_factor: null
add_position_embedding: False
make_vocab_size_divisible_by: 128
group_query_attention: False

exit_signal_handler: False
exit_duration_in_mins: null
exit_interval: null

untie_embeddings_and_output_weights: True
position_embedding_type: rope
rotary_percent: 0.5
openai_gelu: False
squared_relu: False
swiglu: True
onnx_safe: null
bert_binary_head: True
max_position_embeddings: 4096

transformer_impl: local
use_flash_attn: False
seed: 1234
data_parallel_random_init: False

# Optimizer
optimizer: adam
lr: 2.5e-4
lr_decay_style: cosine
lr_decay_iters: null
lr_decay_samples: 255126953
lr_warmup_fraction: null
lr_warmup_iters: 0
lr_warmup_samples: 81381
lr_warmup_init: 0.0
min_lr: 2.5e-5
weight_decay: 0.1
start_weight_decay: null
end_weight_decay: null
weight_decay_incr_style: constant
clip_grad: 1.0
adam_beta1: 0.9
adam_beta2: 0.95
adam_eps: 1.e-08
sgd_momentum: 0.9
override_opt_param_scheduler: False
use_checkpoint_opt_param_scheduler: False

# checkpointing arguments
save: null
save_interval: 20000
no_save_optim: null
no_save_rng: null
load: null
no_load_optim: null
no_load_rng: null
finetune: False
use_checkpoint_args: False
exit_on_missing_checkpoint: False

# loss arguments
loss_scale: null
initial_loss_scale: 4294967296
min_loss_scale: 1.0
loss_scale_window: 1000
hysteresis: 2
accumulate_allreduce_grads_in_fp32: False
fp16_lm_cross_entropy: False

# distributed arguments
distributed_backend: nccl
distributed_timeout_minutes: 10
overlap_grad_reduce: False
delay_grad_reduce: True
overlap_param_gather: False
delay_param_gather: False
scatter_gather_tensors_in_pipeline: True
local_rank: null
lazy_mpu_init: null
empty_unused_memory_level: 0
standalone_embedding_stage: False
use_distributed_optimizer: False
nccl_communicator_config_path: null

train_iters: null
eval_iters: 32
eval_interval: 2000
skip_train: False

adlr_autoresume: False
adlr_autoresume_interval: 1000

# garbage collection
manual_gc: False
manual_gc_interval: 0
manual_gc_eval: True

tp_comm_overlap_cfg: null

# data
data_path: null
split: '99,1,0'
train_data_path: null
valid_data_path: null
test_data_path: null
data_cache_path: null
mock_data: False
vocab_size: null
vocab_file: null
merge_file: null
vocab_extra_ids: 0
seq_length: 4096
encoder_seq_length: null
decoder_seq_length: null
retriever_seq_length: 256
sample_rate: 1.0
mask_prob: 0.15
short_seq_prob: 0.1
num_workers: 2
tokenizer_type: GPTSentencePieceTokenizer
tokenizer_model: null
reset_position_ids: False
reset_attention_mask: False
eod_mask_loss: False
train_samples: 268554688
dataloader_type: null

# profile:
profile: False
profile_ranks: [0]
profile_step_end: 12
profile_step_start: 10

# logging:
log_params_norm: True
log_num_zeros_in_grad: True
log_throughput: False
log_progress: False
timing_log_level: 0
timing_log_option: minmax
tensorboard_log_interval: 1
tensorboard_queue_size: 1000
log_timers_to_tensorboard: False
log_batch_size_to_tensorboard: False
log_learning_rate_to_tensorboard: True
log_validation_ppl_to_tensorboard: False
log_memory_to_tensorboard: False
log_world_size_to_tensorboard: False
log_loss_scale_to_tensorboard: True
wandb_project: ''
wandb_exp_name: ''
wandb_save_dir: ''
enable_one_logger: False
one_logger_project: e2e-tracking
one_logger_entity: hwinf_dcm
one_logger_run_name: null
log_interval: 100
tensorboard_dir: null
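Since YAML configs are flagged above as experimental, a quick parse check before submitting a job can catch indentation mistakes early. This is only a sanity check and assumes PyYAML is available in the training environment:

```sh
python -c "import yaml; yaml.safe_load(open('examples/gpt3/gpt_config.yaml')); print('gpt_config.yaml parses cleanly')"
```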
examples/gpt3/train_gpt3_175b_distributed.sh (new file, mode 100755)

#!/bin/bash

# Runs the "175B" parameter model

export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NUM_NODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))

CHECKPOINT_PATH=$1 #<Specify path>
TENSORBOARD_LOGS_PATH=$2 #<Specify path>
VOCAB_FILE=$3 #<Specify path to file>/gpt2-vocab.json
MERGE_FILE=$4 #<Specify path to file>/gpt2-merges.txt
DATA_PATH=$5 #<Specify path and file prefix>_text_document

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE
    --nnodes $NUM_NODES
    --master_addr $MASTER_ADDR
    --master_port $MASTER_PORT
)

GPT_MODEL_ARGS=(
    --num-layers 96
    --hidden-size 12288
    --num-attention-heads 96
    --seq-length 2048
    --max-position-embeddings 2048
)

TRAINING_ARGS=(
    --micro-batch-size 1
    --global-batch-size 1536
    --rampup-batch-size 16 16 5859375
    --train-iters 500000
    --weight-decay 0.1
    --adam-beta1 0.9
    --adam-beta2 0.95
    --init-method-std 0.006
    --clip-grad 1.0
    --fp16
    --lr 6.0e-5
    --lr-decay-style cosine
    --min-lr 6.0e-6
    --lr-warmup-fraction .001
    --lr-decay-iters 430000
)

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 8
    --pipeline-model-parallel-size 16
)

DATA_ARGS=(
    --data-path $DATA_PATH
    --vocab-file $VOCAB_FILE
    --merge-file $MERGE_FILE
    --split 949,50,1
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 100
    --save-interval 10000
    --eval-interval 1000
    --save $CHECKPOINT_PATH
    --load $CHECKPOINT_PATH
    --eval-iters 10
    --tensorboard-dir $TENSORBOARD_LOGS_PATH
)

torchrun ${DISTRIBUTED_ARGS[@]} pretrain_gpt.py \
    ${GPT_MODEL_ARGS[@]} \
    ${TRAINING_ARGS[@]} \
    ${MODEL_PARALLEL_ARGS[@]} \
    ${DATA_ARGS[@]} \
    ${EVAL_AND_LOGGING_ARGS[@]}
examples/gpt3/train_gpt3_345M_distributed.sh (new file, mode 100755)

#!/bin/bash

# Runs the "345M" parameter model

export CUDA_DEVICE_MAX_CONNECTIONS=1

GPUS_PER_NODE=1 #8
# Change for multinode config
MASTER_ADDR=localhost
MASTER_PORT=6000
NUM_NODES=1
NODE_RANK=0
WORLD_SIZE=$(($GPUS_PER_NODE*$NUM_NODES))

CHECKPOINT_PATH=./tmp #$1 #<Specify path>
TENSORBOARD_LOGS_PATH=./tmp #$2 #<Specify path>
#VOCAB_FILE=$3 #<Specify path to file>/gpt2-vocab.json
#MERGE_FILE=$4 #<Specify path to file>/gpt2-merges.txt
DATA_PATH="/root/megatron-llama/dataset/my-llama_text_document" #<Specify path and file prefix>_text_document
TOKENIZER_PATH="/root/megatron-llama/tokenizer.model"

DISTRIBUTED_ARGS=(
    --nproc_per_node $GPUS_PER_NODE
    --nnodes $NUM_NODES
    --master_addr $MASTER_ADDR
    --master_port $MASTER_PORT
)

GPT_MODEL_ARGS=(
    --num-layers 12
    --hidden-size 512
    --num-attention-heads 8
    --seq-length 2048
    --max-position-embeddings 2048
)

TRAINING_ARGS=(
    --transformer-impl local
    --use-legacy-models
    --micro-batch-size 1
    --global-batch-size 60
    --train-iters 50
    --weight-decay 0.1
    --adam-beta1 0.9
    --adam-beta2 0.95
    --init-method-std 0.006
    --clip-grad 1.0
    --fp16
    --lr 6.0e-5
    --lr-decay-style cosine
    --min-lr 6.0e-6
    --lr-warmup-fraction .001
    --lr-decay-iters 20
)

MODEL_PARALLEL_ARGS=(
    --tensor-model-parallel-size 1
    --pipeline-model-parallel-size 1
)

DATA_ARGS=(
    --data-path $DATA_PATH
    --split 949,50,1
    --untie-embeddings-and-output-weights
    --position-embedding-type rope
    --tokenizer-model $TOKENIZER_PATH
    --tokenizer-type GPTSentencePieceTokenizer
)

EVAL_AND_LOGGING_ARGS=(
    --log-interval 1
    --save-interval 10000
    --eval-interval 1000
    --save $CHECKPOINT_PATH
    --load $CHECKPOINT_PATH
    --eval-iters 10
    --tensorboard-dir $TENSORBOARD_LOGS_PATH
)

torchrun ${DISTRIBUTED_ARGS[@]} pretrain_gpt.py \
    ${GPT_MODEL_ARGS[@]} \
    ${TRAINING_ARGS[@]} \
    ${MODEL_PARALLEL_ARGS[@]} \
    ${DATA_ARGS[@]} \
    ${EVAL_AND_LOGGING_ARGS[@]}
examples/inference/modelopt/README.md (new file, mode 100644)

# Megatron Model Optimization and Deployment

## Installation

We recommend that users follow TensorRT-LLM's official installation guide to build it from source
and proceed with a containerized environment (`docker.io/tensorrt_llm/release:latest`):

```sh
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout v0.10.0
make -C docker release_build
```

> **TROUBLESHOOTING:** rather than copying each folder separately in `docker/Dockerfile.multi`,
> you may need to copy the entire dir as `COPY ./ /src/tensorrt_llm` since a `git submodule` is
> called later which requires `.git` to continue.

Once the container is built, install `nvidia-modelopt` and additional dependencies for sharded checkpoint support:

```sh
pip install "nvidia-modelopt[all]~=0.13.0" --extra-index-url https://pypi.nvidia.com
pip install zarr tensorstore==0.1.45
```
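A quick way to confirm the install inside the container, not part of the original instructions but harmless, is to ask pip for the package metadata:

```sh
pip show nvidia-modelopt zarr tensorstore
```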
TensorRT-LLM quantization functionalities are currently packaged in `nvidia-modelopt`.
You can find more documentation about `nvidia-modelopt` [here](https://nvidia.github.io/TensorRT-Model-Optimizer/).

## Support Matrix

The following matrix shows the current support for the PTQ + TensorRT-LLM export flow.

| model            | fp16 | int8_sq | fp8 | int4_awq |
|------------------|------|---------|-----|----------|
| nextllm-2b       | x    | x       | x   |          |
| nemotron3-8b     | x    |         | x   |          |
| nemotron3-15b    | x    |         | x   |          |
| llama2-text-7b   | x    | x       | x   | TP2      |
| llama2-chat-70b  | x    | x       | x   | TP4      |

Our PTQ + TensorRT-LLM flow has native support on MCore `GPTModel` with a mixed layer spec (native ParallelLinear
and Transformer-Engine Norm (`TENorm`)). Note that this is not the default mcore gpt spec. You can still load the
following checkpoint formats with some remedy:

| GPTModel                          | sharded | remedy arguments            |
|-----------------------------------|---------|-----------------------------|
| megatron.legacy.model             |         | `--export-legacy-megatron`  |
| TE-Fused (default mcore gpt spec) |         | `--export-te-mcore-model`   |
| TE-Fused (default mcore gpt spec) | x       |                             |

> **TROUBLESHOOTING:** If you are trying to load an unpacked `.nemo` sharded checkpoint, then typically you will
> need to add `additional_sharded_prefix="model."` to `modelopt_load_checkpoint()` since NeMo has an additional
> `model.` wrapper on top of the `GPTModel`.

> **NOTE:** flag `--export-legacy-megatron` may not work on all legacy checkpoint versions.

## Examples

> **NOTE:** we only provide a simple text generation script to test the generated TensorRT-LLM engines. For
> a production-level API server or enterprise support, see [NeMo](https://github.com/NVIDIA/NeMo) and TensorRT-LLM's
> backend for [NVIDIA Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server).

### nemotron3-8B FP8 Quantization and TensorRT-LLM Deployment

First download the nemotron checkpoint from https://huggingface.co/nvidia/nemotron-3-8b-base-4k, extract the
sharded checkpoint from the `.nemo` tarball, and fix the tokenizer file name.

> **NOTE:** The following cloning method uses `ssh` and assumes you have registered the `ssh-key` in Hugging Face.
> If you want to clone with `https`, then `git clone https://huggingface.co/nvidia/nemotron-3-8b-base-4k` with an access token.

```sh
git lfs install
git clone git@hf.co:nvidia/nemotron-3-8b-base-4k
cd nemotron-3-8b-base-4k
tar -xvf Nemotron-3-8B-Base-4k.nemo
mv 586f3f51a9cf43bc9369bd53fa08868c_a934dc7c3e1e46a6838bb63379916563_3feba89c944047c19d5a1d0c07a85c32_mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model tokenizer.model
cd ..
```

Now launch the PTQ + TensorRT-LLM export script,

```sh
bash examples/inference/ptq_trtllm_nemotron3_8b ./nemotron-3-8b-base-4k None
```

By default, `cnn_dailymail` is used for calibration. The `GPTModel` will have quantizers for simulating the
quantization effect. The checkpoint will be saved optionally (with quantizers as additional states) and can
be restored for further evaluation. The TensorRT-LLM checkpoint and engine are exported to `/tmp/trtllm_ckpt` and
built in `/tmp/trtllm_engine` by default.

The script expects `${CHECKPOINT_DIR}` (`./nemotron-3-8b-base-4k`) to have the following structure:

```
├── model_weights
│   ├── common.pt
│   ...
│
├── model_config.yaml
├── mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model
```

> **NOTE:** The script uses `TP=8`. Change `$TP` in the script if your checkpoint has a different tensor
> model parallelism.

> **KNOWN ISSUES:** The `mt_nlg_plus_multilingual_ja_zh_the_stack_frac_015_256k.model` in the checkpoint is for
> Megatron-LM's `GPTSentencePiece` tokenizer.
> For TensorRT-LLM, we are trying to load this tokenizer as a Hugging Face `T5Tokenizer` by changing
> some special tokens, `encode`, and `batch_decode`. As a result, the tokenizer behavior in the TensorRT-LLM engine may
> not match exactly.

### llama2-text-7b INT8 SmoothQuant and TensorRT-LLM Deployment

> **NOTE:** Due to the LICENSE issue, we do not provide an MCore checkpoint to download. Users can follow
> the instructions in `docs/llama2.md` to convert the checkpoint to megatron legacy `GPTModel` format and
> use the `--export-legacy-megatron` flag which will remap the checkpoint to the MCore `GPTModel` spec
> that we support.

```sh
bash examples/inference/ptq_trtllm_llama_7b.sh ${CHECKPOINT_DIR}
```

The script expects `${CHECKPOINT_DIR}` to have the following structure:

```
├── hf
│   ├── tokenizer.config
│   ├── tokenizer.model
│   ...
│
├── iter_0000001
│   ├── mp_rank_00
│   ...
│
├── latest_checkpointed_iteration.txt
```

In short, other than the converted llama megatron checkpoint, also put the Hugging Face checkpoint inside as
the source of the tokenizer.
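Before launching the quantization run, a small pre-flight check along these lines can catch a missing tokenizer or checkpoint directory; the `CHECKPOINT_DIR` value is hypothetical and the listed paths mirror the structure shown above:

```sh
CHECKPOINT_DIR=./llama2-text-7b-mcore   # hypothetical location of the converted checkpoint
for p in hf/tokenizer.model iter_0000001 latest_checkpointed_iteration.txt; do
    [ -e "${CHECKPOINT_DIR}/${p}" ] || echo "missing: ${CHECKPOINT_DIR}/${p}"
done
```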