Commit 1c3a1e6a authored by Jared Casper's avatar Jared Casper
Browse files

Merge branch 'main' into t5_pipeline_parallelism

Fixed conflicts with memory optimizations in megatron/schedules.py
parents da1c96e9 8e8d7603
...@@ -127,7 +127,7 @@ Further command line arguments are described in the source file [`preprocess_dat ...@@ -127,7 +127,7 @@ Further command line arguments are described in the source file [`preprocess_dat
## BERT Pretraining ## BERT Pretraining
The `examples/pretrain_bert.sh` script runs single GPU 345M parameter BERT pretraining. Debugging is the primary use for single GPU training, as the code base and command line arguments are optimized for highly distributed training. Most of the arguments are fairly self-explanatory. By default, the learning rate decays linearly over the training iterations starting at `--lr` to a minimum set by `--min-lr` over `--lr-decay-iters` iterations. The fraction of training iterations used for warmup is set by `--lr-warmup-fraction`. While this is single GPU training, the batch size specified by `--micro-batch-size` is a single forward-backward path batch-size and the code will perform gradient accumulation steps until it reaches `global-batch-size` whcih is the batch size per iteration. The data is partitioned into a 949:50:1 ratio for training/validation/test sets (default is 969:30:1). This partitioning happens on the fly, but is consistent across runs with the same random seed (1234 by default, or specified manually with `--seed`). We use `train-iters` as the training iterations requested. Alternatively, one can provide `--train-samples` which is total number of samples to train on. If this option is present, then instead of providing `--lr-decay-iters`, one will need to provide `--lr-decay-samples`. The `examples/pretrain_bert.sh` script runs single GPU 345M parameter BERT pretraining. Debugging is the primary use for single GPU training, as the code base and command line arguments are optimized for highly distributed training. Most of the arguments are fairly self-explanatory. By default, the learning rate decays linearly over the training iterations starting at `--lr` to a minimum set by `--min-lr` over `--lr-decay-iters` iterations. The fraction of training iterations used for warmup is set by `--lr-warmup-fraction`. While this is single GPU training, the batch size specified by `--micro-batch-size` is a single forward-backward path batch-size and the code will perform gradient accumulation steps until it reaches `global-batch-size` which is the batch size per iteration. The data is partitioned into a 949:50:1 ratio for training/validation/test sets (default is 969:30:1). This partitioning happens on the fly, but is consistent across runs with the same random seed (1234 by default, or specified manually with `--seed`). We use `train-iters` as the training iterations requested. Alternatively, one can provide `--train-samples` which is total number of samples to train on. If this option is present, then instead of providing `--lr-decay-iters`, one will need to provide `--lr-decay-samples`.
The logging, checkpoint-saving, and evaluation intervals are specified. Checkpointing the activations facilitates the training of larger models and/or batches. Note that the `--data-path` now includes the additional `_text_sentence` suffix added in preprocessing, but does not include the file extensions. The logging, checkpoint-saving, and evaluation intervals are specified. Checkpointing the activations facilitates the training of larger models and/or batches. Note that the `--data-path` now includes the additional `_text_sentence` suffix added in preprocessing, but does not include the file extensions.
......
...@@ -23,6 +23,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS \ ...@@ -23,6 +23,7 @@ python -m torch.distributed.launch $DISTRIBUTED_ARGS \
--num-attention-heads 16 \ --num-attention-heads 16 \
--micro-batch-size 2 \ --micro-batch-size 2 \
--global-batch-size 16 \ --global-batch-size 16 \
--seq-length 512 \
--max-position-embeddings 512 \ --max-position-embeddings 512 \
--train-iters 1000000 \ --train-iters 1000000 \
--save $CHECKPOINT_PATH \ --save $CHECKPOINT_PATH \
......
#!/bin/bash
# SLURM options.
export SLURM_PARTITION=<slurm partition, used to feed -p option in slurm>
export SLURM_ACCOUNT=<slurm account, used to feed -A option in slurm>
# Source code.
export MEGATRON_CODE_DIR=<megatron source code directory>
# This variable is used to mount the relevant part of the filesystem
# inside the docker container. Note that the `MEGATRON_CODE_DIR` and the
# launch directory already get mounted; this variable should be used to
# mount the directories that contain the data and tokenizer files.
export DOCKER_MOUNT_DIR=<megatron dataset and bpe tokenizer vocab path>
# Data and tokenizer files.
MEGATRON_DATA=<path to megatron processed data>
BPE_VOCAB_FILE=<path to bpe vocab file>
BPE_MERGE_FILE=<path to bpe merges file>
# Megatron input parameters.
# `MEGATRON_EXTRA_PARAMS` can be used to provide any extra parameters
# that are not listed here.
export MEGATRON_PARAMS=" ${MEGATRON_EXTRA_PARAMS} \
--tensor-model-parallel-size ${TP} \
--pipeline-model-parallel-size ${PP} \
--micro-batch-size ${MBS} \
--global-batch-size ${GBS} \
--num-layers ${NLS} \
--hidden-size ${HS} \
--num-attention-heads ${NAH} \
--DDP-impl ${DDP} \
--data-path ${MEGATRON_DATA} \
--vocab-file ${BPE_VOCAB_FILE} \
--merge-file ${BPE_MERGE_FILE} \
--log-interval 5 \
--seq-length 2048 \
--max-position-embeddings 2048 \
--train-iters 500 \
--lr-decay-iters 320 \
--lr 0.0001 \
--min-lr 0.00001 \
--lr-decay-style cosine \
--lr-warmup-fraction 0.01 \
--split 969,30,1 \
--eval-iters 100 \
--eval-interval 1000 \
--clip-grad 1.0 \
--fp16 \
--loss-scale 8192 "
# Reproducing Figures in SC21 Paper
This directory contains some of the scripts that were used to produce the
results in the [Megatron paper](https://arxiv.org/pdf/2104.04473.pdf) that is
to appear at [SuperComputing 2021](https://sc21.supercomputing.org/). These
scripts use [Slurm](https://slurm.schedmd.com/documentation.html) with the
[pyxis plugin](https://github.com/NVIDIA/pyxis), but can be modified for other
schedulers as well.
## Setup
All the cluster-dependent variables are in [`CONFIG.sh`](./CONFIG.sh). Please
update the unspecified values (in angle brackets `<...>`) before launching any
scripts.
## Scripts
Below is a list of scripts that can be used to reproduce various figures in our
[paper](https://arxiv.org/pdf/2104.04473.pdf):
* [run_table_1.sh](./run_table_1.sh): Table 1 showing weak-scaling throughput
for GPT models ranging from 1 billion to 1 trillion parameters.
* [run_figure_11.sh](./run_figure_11.sh): Figure 11 showing the weak-scaling
performance of pipeline parallelism.
* [run_figure_12.sh](./run_figure_12.sh): Figure 12 showing the effect of
the interleaved schedule on a 175B GPT model.
* [run_figure_13.sh](./run_figure_13.sh): Figure 13 showing the effect of
different degrees of pipeline and tensor model parallelism on a model with
162.2 billion parameters.
* [run_figure_14.sh](./run_figure_14.sh): Figure 14 showing the effect of
different degrees of data and pipeline model parallelism on a model with
5.9 billion parameters.
* [run_figure_15.sh](./run_figure_15.sh): Figure 15 showing the effect of
different degrees of data and tensor model parallelism on a model with
5.9 billion parameters.
* [run_figure_16.sh](./run_figure_16.sh): Figure 16 showing the effect of
microbatch size.
* [run_figure_17.sh](./run_figure_17.sh): Figure 17 showing the effect of
activation recomputation.
* [run_figure_18.sh](./run_figure_18.sh): Figure 18 showing the effect of
the scatter-gather communication optimization.
#!/bin/bash
sbatch -p ${SLURM_PARTITION} \
-A ${SLURM_ACCOUNT} \
--job-name=${JOB_NAME} \
--nodes=${NNODES} \
--export=MEGATRON_CODE_DIR,MEGATRON_PARAMS,DOCKER_MOUNT_DIR SRUN.sh
exit 0
#!/bin/bash
#SBATCH -t 0:30:00 --exclusive --mem=0 --overcommit --ntasks-per-node=8
THIS_DIR=`pwd`
DATETIME=`date +'date_%y-%m-%d_time_%H-%M-%S'`
mkdir -p ${THIS_DIR}/logs
CMD="python -u ${MEGATRON_CODE_DIR}/pretrain_gpt.py ${MEGATRON_PARAMS}"
srun -l \
--container-image "nvcr.io#nvidia/pytorch:20.12-py3" \
--container-mounts "${THIS_DIR}:${THIS_DIR},${MEGATRON_CODE_DIR}:${MEGATRON_CODE_DIR},${DOCKER_MOUNT_DIR}:${DOCKER_MOUNT_DIR}" \
--output=${THIS_DIR}/logs/%x_%j_$DATETIME.log sh -c "${CMD}"
#!/bin/bash
# ================================
# Choose the case to run.
# ================================
# Pipeline-parallel size options = [1, 2, 4, 8].
PP=1
# Batch size (global batch size) options = [8, 128].
GBS=8
# Set pipeline-parallel size options.
NLS=$((3*PP))
NNODES=${PP}
# Other params.
TP=8
MBS=1
HS=20480
NAH=128
DDP=local
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
# Name of the job.
export JOB_NAME=results_figure_11_pipeline_parallel_size_${PP}_batch_size_${GBS}
# Import the configs.
. `pwd`/CONFIG.sh
# Submit the job.
. `pwd`/SBATCH.sh
exit 0
#!/bin/bash
# ================================
# Choose the case to run.
# ================================
# Interleaved schedule options = [YES, NO].
INTERLEAVED=YES
# Batch size (global batch size) options = [12, 24, 36, ..., 60].
GBS=12
# Set interleaved schedule options.
if [ ${INTERLEAVED} == "YES" ]; then
MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 2 "
elif [ ${INTERLEAVED} == "NO" ]; then
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
else
echo "Invalid configuration"
exit 1
fi
# Other params.
TP=8
PP=12
MBS=1
NLS=96
HS=12288
NAH=96
DDP=local
NNODES=12
# Name of the job.
export JOB_NAME=results_figure_12_interleaved_${INTERLEAVED}_batch_size_${GBS}
# Import the configs.
. `pwd`/CONFIG.sh
# Submit the job.
. `pwd`/SBATCH.sh
exit 0
#!/bin/bash
# ================================
# Choose the case to run.
# ================================
# Pipeline-parallel size options = [2, 4, 8, 16, 32].
PP=2
# Batch size (global batch size) options = [32, 128].
GBS=32
# Set pipeline-parallel and tensor-parallel size options.
TP=$((64/PP))
# Other params.
MBS=1
NLS=32
HS=20480
NAH=128
DDP=local
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
NNODES=8
# Name of the job.
export JOB_NAME=results_figure_13_pipeline_parallel_size_${PP}_tensor_parallel_size_${TP}_batch_size_${GBS}
# Import the configs.
. `pwd`/CONFIG.sh
# Submit the job.
. `pwd`/SBATCH.sh
exit 0
#!/bin/bash
# ================================
# Choose the case to run.
# ================================
# Pipeline-parallel size options = [2, 4, 8, 16, 32].
PP=2
# Batch size (global batch size) options = [32, 512].
GBS=32
# Set pipeline-parallel and data-parallel size options.
DP=$((64/PP))
# Other params.
TP=1
MBS=1
NLS=32
HS=3840
NAH=32
DDP=local
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
NNODES=8
# Name of the job.
export JOB_NAME=results_figure_14_pipeline_parallel_size_${PP}_data_parallel_size_${DP}_batch_size_${GBS}
# Import the configs.
. `pwd`/CONFIG.sh
# Submit the job.
. `pwd`/SBATCH.sh
exit 0
#!/bin/bash
# ================================
# Choose the case to run.
# ================================
# Tensor-parallel size options = [2, 4, 8, 16, 32].
TP=2
# Batch size (global batch size) options = [32, 128, 512].
GBS=32
# Set tensor-parallel and data-parallel size options.
DP=$((64/TP))
# Other params.
PP=1
MBS=1
NLS=32
HS=3840
NAH=32
DDP=local
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
NNODES=8
# Name of the job.
export JOB_NAME=results_figure_15_tensor_parallel_size_${TP}_data_parallel_size_${DP}_batch_size_${GBS}
# Import the configs.
. `pwd`/CONFIG.sh
# Submit the job.
. `pwd`/SBATCH.sh
exit 0
#!/bin/bash
# ================================
# Choose the case to run.
# ================================
# Microbatch size options = [1, 2, 4, 8].
MBS=1
# Batch size (global batch size) options = [128, 512].
GBS=128
# Other params.
TP=8
PP=8
NLS=32
HS=15360
NAH=128
DDP=local
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
NNODES=8
# Name of the job.
export JOB_NAME=results_figure_16_microbatch_size_${MBS}_batch_size_${GBS}
# Import the configs.
. `pwd`/CONFIG.sh
# Submit the job.
. `pwd`/SBATCH.sh
exit 0
#!/bin/bash
# ================================
# Choose the case to run.
# ================================
# Activation recomputation options = [YES, NO].
ACTIVATION_RECOMPUTATION=YES
# Batch size (global batch size) options = [1, 2, 4, ..., 256].
GBS=1
# Set activation recomputation.
if [ ${ACTIVATION_RECOMPUTATION} == "YES" ]; then
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
elif [ ${ACTIVATION_RECOMPUTATION} == "NO" ]; then
MEGATRON_EXTRA_PARAMS=""
else
echo "Invalid configuration"
exit 1
fi
# Other params.
TP=8
PP=16
MBS=1
NLS=80
HS=12288
NAH=96
DDP=local
NNODES=16
# Name of the job.
export JOB_NAME=results_figure_17_activation_recomputation_${ACTIVATION_RECOMPUTATION}_batch_size_${GBS}
# Import the configs.
. `pwd`/CONFIG.sh
# Submit the job.
. `pwd`/SBATCH.sh
exit 0
#!/bin/bash
# ================================
# Choose the case to run.
# ================================
# Scatter-gather communication optimization options = [YES, NO].
SCATTER_GATHER=YES
# Batch size (global batch size) options = [12, 24, 36, ..., 60].
GBS=12
# Set scatter-gather communication optimization options.
if [ ${SCATTER_GATHER} == "YES" ]; then
MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 2 "
elif [ ${SCATTER_GATHER} == "NO" ]; then
MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 2 --no-scatter-gather-tensors-in-pipeline "
else
echo "Invalid configuration"
exit 1
fi
# Other params.
TP=8
PP=12
MBS=1
NLS=96
HS=12288
NAH=96
DDP=local
NNODES=12
# Name of the job.
export JOB_NAME=results_figure_18_scatter_gather_${SCATTER_GATHER}_batch_size_${GBS}
# Import the configs.
. `pwd`/CONFIG.sh
# Submit the job.
. `pwd`/SBATCH.sh
exit 0
#!/bin/bash
# ================================
# Choose the case to run.
# ================================
# model size options = [1.7B, 3.6B, 7.5B, 18B, 39B, 76B, 145B, 310B, 530B, 1T]
MODEL_SIZE=1.7B
if [ ${MODEL_SIZE} == "1.7B" ]; then
TP=1
PP=1
MBS=16
GBS=512
NLS=24
HS=2304
NAH=24
DDP=torch
NNODES=4
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
elif [ ${MODEL_SIZE} == "3.6B" ]; then
TP=2
PP=1
MBS=16
GBS=512
NLS=30
HS=3072
NAH=32
DDP=torch
NNODES=8
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
elif [ ${MODEL_SIZE} == "7.5B" ]; then
TP=4
PP=1
MBS=16
GBS=512
NLS=36
HS=4096
NAH=32
DDP=torch
NNODES=16
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
elif [ ${MODEL_SIZE} == "18B" ]; then
TP=8
PP=1
MBS=8
GBS=1024
NLS=40
HS=6144
NAH=48
DDP=torch
NNODES=32
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
elif [ ${MODEL_SIZE} == "39B" ]; then
TP=8
PP=2
MBS=4
GBS=1536
NLS=48
HS=8192
NAH=64
DDP=local
NNODES=64
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
elif [ ${MODEL_SIZE} == "76B" ]; then
TP=8
PP=4
MBS=2
GBS=1792
NLS=60
HS=10240
NAH=80
DDP=local
NNODES=128
MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 5"
elif [ ${MODEL_SIZE} == "145B" ]; then
TP=8
PP=8
MBS=2
GBS=2304
NLS=80
HS=12288
NAH=96
DDP=local
NNODES=192
MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 5 "
elif [ ${MODEL_SIZE} == "310B" ]; then
TP=8
PP=16
MBS=1
GBS=2160
NLS=96
HS=16384
NAH=128
DDP=local
NNODES=240
MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 3 "
elif [ ${MODEL_SIZE} == "530B" ]; then
TP=8
PP=35
MBS=1
GBS=2520
NLS=105
HS=20480
NAH=128
DDP=local
NNODES=315
MEGATRON_EXTRA_PARAMS="--checkpoint-activations --num-layers-per-virtual-pipeline-stage 1 "
elif [ ${MODEL_SIZE} == "1T" ]; then
TP=8
PP=64
MBS=1
GBS=3072
NLS=128
HS=25600
NAH=160
DDP=local
NNODES=384
MEGATRON_EXTRA_PARAMS="--checkpoint-activations "
else
echo "Invalid configuration"
exit 1
fi
# Name of the job
export JOB_NAME=results_table_1_model_size_${MODEL_SIZE}
# Import the configs.
. `pwd`/CONFIG.sh
# Submit the job.
. `pwd`/SBATCH.sh
exit 0
...@@ -160,6 +160,11 @@ def parse_args(extra_args_provider=None, defaults={}, ...@@ -160,6 +160,11 @@ def parse_args(extra_args_provider=None, defaults={},
assert args.DDP_impl == 'local' assert args.DDP_impl == 'local'
args.use_contiguous_buffers_in_ddp = True args.use_contiguous_buffers_in_ddp = True
# If we use a contiguous buffer to hold main grads, we need to have
# local DDP.
if args.use_contiguous_buffers_in_ddp:
assert args.DDP_impl == 'local'
if args.dataloader_type is None: if args.dataloader_type is None:
args.dataloader_type = 'single' args.dataloader_type = 'single'
...@@ -605,6 +610,11 @@ def _add_distributed_args(parser): ...@@ -605,6 +610,11 @@ def _add_distributed_args(parser):
group.add_argument('--use-cpu-initialization', action='store_true', group.add_argument('--use-cpu-initialization', action='store_true',
default=None, help='If set, affine parallel weights ' default=None, help='If set, affine parallel weights '
'initialization uses CPU' ) 'initialization uses CPU' )
group.add_argument('--empty-unused-memory-level', default=0, type=int,
choices=[0, 1, 2],
help='Call torch.cuda.empty_cache() each iteration '
'(training and eval), to reduce fragmentation.'
'0=off, 1=moderate, 2=aggressive.')
return parser return parser
......
...@@ -106,6 +106,40 @@ def get_checkpoint_tracker_filename(checkpoints_path): ...@@ -106,6 +106,40 @@ def get_checkpoint_tracker_filename(checkpoints_path):
return os.path.join(checkpoints_path, 'latest_checkpointed_iteration.txt') return os.path.join(checkpoints_path, 'latest_checkpointed_iteration.txt')
def read_metadata(tracker_filename):
# Read the tracker file and either set the iteration or
# mark it as a release checkpoint.
iteration = 0
release = False
with open(tracker_filename, 'r') as f:
metastring = f.read().strip()
try:
iteration = int(metastring)
except ValueError:
release = metastring == 'release'
if not release:
print_rank_0('ERROR: Invalid metadata file {}. Exiting'.format(
tracker_filename))
sys.exit()
assert iteration > 0 or release, 'error parsing metadata file {}'.format(
tracker_filename)
# Get the max iteration retrieved across the ranks.
iters_cuda = torch.cuda.LongTensor([iteration])
torch.distributed.all_reduce(iters_cuda, op=torch.distributed.ReduceOp.MAX)
max_iter = iters_cuda[0].item()
# We should now have all the same iteration.
# If not, print a warning and chose the maximum
# iteration across all ranks.
if iteration != max_iter:
print('WARNING: on rank {} found iteration {} in the '
'metadata while max iteration across the ranks '
'is {}, replacing it with max iteration.'.format(
rank, iteration, max_iter), flush=True)
return max_iter, release
def save_checkpoint(iteration, model, optimizer, lr_scheduler): def save_checkpoint(iteration, model, optimizer, lr_scheduler):
"""Save a model checkpoint.""" """Save a model checkpoint."""
args = get_args() args = get_args()
...@@ -260,21 +294,7 @@ def load_checkpoint(model, optimizer, lr_scheduler, load_arg='load', strict=True ...@@ -260,21 +294,7 @@ def load_checkpoint(model, optimizer, lr_scheduler, load_arg='load', strict=True
# Otherwise, read the tracker file and either set the iteration or # Otherwise, read the tracker file and either set the iteration or
# mark it as a release checkpoint. # mark it as a release checkpoint.
iteration = 0 iteration, release = read_metadata(tracker_filename)
release = False
with open(tracker_filename, 'r') as f:
metastring = f.read().strip()
try:
iteration = int(metastring)
except ValueError:
release = metastring == 'release'
if not release:
print_rank_0('ERROR: Invalid metadata file {}. Exiting'.format(
tracker_filename))
sys.exit()
assert iteration > 0 or release, 'error parsing metadata file {}'.format(
tracker_filename)
# Checkpoint. # Checkpoint.
checkpoint_name = get_checkpoint_name(load_dir, iteration, release) checkpoint_name = get_checkpoint_name(load_dir, iteration, release)
......
...@@ -674,7 +674,7 @@ def get_samples_mapping(indexed_dataset, ...@@ -674,7 +674,7 @@ def get_samples_mapping(indexed_dataset,
# Build samples mapping # Build samples mapping
verbose = torch.distributed.get_rank() == 0 verbose = torch.distributed.get_rank() == 0
start_time = time.time() start_time = time.time()
print_rank_0(' > building sapmles index mapping for {} ...'.format( print_rank_0(' > building samples index mapping for {} ...'.format(
name)) name))
# First compile and then import. # First compile and then import.
from megatron.data import helpers from megatron.data import helpers
...@@ -688,7 +688,7 @@ def get_samples_mapping(indexed_dataset, ...@@ -688,7 +688,7 @@ def get_samples_mapping(indexed_dataset,
seed, seed,
verbose, verbose,
2 if binary_head else 1) 2 if binary_head else 1)
print_rank_0(' > done building sapmles index maping') print_rank_0(' > done building samples index maping')
np.save(indexmap_filename, samples_mapping, allow_pickle=True) np.save(indexmap_filename, samples_mapping, allow_pickle=True)
print_rank_0(' > saved the index mapping in {}'.format( print_rank_0(' > saved the index mapping in {}'.format(
indexmap_filename)) indexmap_filename))
......
...@@ -111,7 +111,7 @@ __global__ void scaled_masked_softmax_warp_forward( ...@@ -111,7 +111,7 @@ __global__ void scaled_masked_softmax_warp_forward(
constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE;
constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE; constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE;
constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1; constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1;
constexpr int ELEMENTS_PER_LDG_STG = 4; constexpr int ELEMENTS_PER_LDG_STG = (WARP_ITERATIONS < 4) ? 1 : 4;
// blockDim/threadIdx = (WARP_SIZE, WARPS_PER_BLOCK, ) // blockDim/threadIdx = (WARP_SIZE, WARPS_PER_BLOCK, )
// gridDim/blockIdx = (seq_len, attn_heads, batches) // gridDim/blockIdx = (seq_len, attn_heads, batches)
...@@ -230,7 +230,7 @@ __global__ void scaled_masked_softmax_warp_backward( ...@@ -230,7 +230,7 @@ __global__ void scaled_masked_softmax_warp_backward(
constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE;
constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE; constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE;
constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1; constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1;
constexpr int ELEMENTS_PER_LDG_STG = 4; constexpr int ELEMENTS_PER_LDG_STG = (WARP_ITERATIONS < 4) ? 1 : 4;
// blockDim/threadIdx = (WARP_SIZE, WARPS_PER_BLOCK, ) // blockDim/threadIdx = (WARP_SIZE, WARPS_PER_BLOCK, )
// gridDim/blockIdx = (seq_len, attn_heads, batches) // gridDim/blockIdx = (seq_len, attn_heads, batches)
......
...@@ -125,7 +125,7 @@ __global__ void scaled_upper_triang_masked_softmax_warp_forward( ...@@ -125,7 +125,7 @@ __global__ void scaled_upper_triang_masked_softmax_warp_forward(
constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE;
constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE; constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE;
constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1; constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1;
constexpr int ELEMENTS_PER_LDG_STG = 4; constexpr int ELEMENTS_PER_LDG_STG = (WARP_ITERATIONS < 4) ? 1 : 4;
int first_batch = (blockDim.y * blockIdx.y + threadIdx.y) * gridDim.x * WARP_BATCH + blockIdx.x; int first_batch = (blockDim.y * blockIdx.y + threadIdx.y) * gridDim.x * WARP_BATCH + blockIdx.x;
int local_seq = blockIdx.x + 1; int local_seq = blockIdx.x + 1;
...@@ -245,7 +245,7 @@ __global__ void scaled_upper_triang_masked_softmax_warp_backward( ...@@ -245,7 +245,7 @@ __global__ void scaled_upper_triang_masked_softmax_warp_backward(
constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE; constexpr int WARP_SIZE = (next_power_of_two < C10_WARP_SIZE) ? next_power_of_two : C10_WARP_SIZE;
constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE; constexpr int WARP_ITERATIONS = next_power_of_two / WARP_SIZE;
constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1; constexpr int WARP_BATCH = (next_power_of_two <= 128) ? 2 : 1;
constexpr int ELEMENTS_PER_LDG_STG = 4; constexpr int ELEMENTS_PER_LDG_STG = (WARP_ITERATIONS < 4) ? 1 : 4;
int first_batch = (blockDim.y * blockIdx.y + threadIdx.y) * gridDim.x * WARP_BATCH + blockIdx.x; int first_batch = (blockDim.y * blockIdx.y + threadIdx.y) * gridDim.x * WARP_BATCH + blockIdx.x;
int local_seq = blockIdx.x + 1; int local_seq = blockIdx.x + 1;
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment