"vscode:/vscode.git/clone" did not exist on "23893bb0a87b98c7193ba24c5f6b9e802eeb9dea"
Unverified commit f2ac7eaf authored by Jeff Rasley, committed by GitHub

ZeRO-2 (#217)



Updates for ZeRO stage 2 + ZeRO stage 1 with reduce-scatter (RS)
Co-authored-by: Tunji Ruwase <olruwase@microsoft.com>
Co-authored-by: Samyam Rajbhandari <samyamr@microsoft.com>
Co-authored-by: Shaden Smith <ShadenTSmith@gmail.com>
Co-authored-by: Elton Zheng <eltonz@microsoft.com>
Co-authored-by: Shaden Smith <Shaden.Smith@microsoft.com>
Co-authored-by: yuxionghe <yuxhe@microsoft.com>
Co-authored-by: Arash Ashari <arashari@microsoft.com>
parent c61e23b4
...@@ -2,6 +2,7 @@
.idea/
*~
*.swp
*.log
deepspeed/git_version_info.py
# Build + installation data
......
Subproject commit 274787a189b265814ed75dd5ddeae2dce026ea88
[![Build Status](https://dev.azure.com/DeepSpeedMSFT/DeepSpeed/_apis/build/status/microsoft.DeepSpeed?branchName=master)](https://dev.azure.com/DeepSpeedMSFT/DeepSpeed/_build/latest?definitionId=1&branchName=master)
[![License MIT](https://img.shields.io/badge/License-MIT-blue.svg)](https://github.com/Microsoft/DeepSpeed/blob/master/LICENSE)

[DeepSpeed](https://www.deepspeed.ai/) is a deep learning optimization
library that makes distributed training easy, efficient, and effective.

<p align="center"><i><b>10x Larger Models</b></i></p>
<p align="center"><i><b>10x Faster Training</b></i></p>
<p align="center"><i><b>Minimal Code Change</b></i></p>

DeepSpeed can train deep learning models with over a hundred billion parameters on the current
generation of GPU clusters, while achieving over 10x in system performance
compared to the state of the art. Early adopters of DeepSpeed have already produced
a language model (LM) with over 17B parameters called
[Turing-NLG](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft),
establishing a new SOTA in the LM category.
**_For further documentation, tutorials, and technical deep-dives please see [deepspeed.ai](https://www.deepspeed.ai/)!_**
# News
* [2020/05/19] [ZeRO-2 empowers training models as large as 170 billion parameters up to 10x faster compared to state-of-the-art](https://www.deepspeed.ai/news/2020/05/19/zero-stage2.html) <span style="color:dodgerblue">**[_NEW_]**</span>
* [2020/05/19] [DeepSpeed optimizes transformer kernels to achieve world’s fastest BERT training record: 44 minutes on 1024 NVIDIA V100 GPUs](https://www.deepspeed.ai/news/2020/05/19/bert-record.html) <span style="color:dodgerblue">**[_NEW_]**</span>
* [2020/02/13] [Turing-NLG: A 17-billion-parameter language model by Microsoft](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft/)
* [2020/02/13] [ZeRO & DeepSpeed: New system optimizations enable training models with over 100 billion parameters](https://www.microsoft.com/en-us/research/blog/zero-deepspeed-new-system-optimizations-enable-training-models-with-over-100-billion-parameters/)
# Table of Contents
...@@ -39,93 +46,6 @@ a large model easily runs out of memory with pure data parallelism and it is
difficult to use model parallelism. DeepSpeed addresses these challenges to
accelerate model development *and* training.
## Distributed, Effective, and Efficient Training with Ease
The DeepSpeed API is a lightweight wrapper on [PyTorch](https://pytorch.org/). This
means that you can use everything you love in PyTorch without learning a new
platform. In addition, DeepSpeed manages all of the boilerplate state-of-the-art
training techniques, such as distributed training, mixed precision, gradient
accumulation, and checkpoints so that you can focus on your model development. Most
importantly, you can leverage the distinctive efficiency and effectiveness benefit of
DeepSpeed to boost speed and scale with just a few lines of code changes to your PyTorch
models.
## Speed
DeepSpeed achieves high performance and fast convergence through a combination of
efficiency optimizations on compute/communication/memory/IO and effectiveness
optimizations on advanced hyperparameter tuning and optimizers. For example:
* DeepSpeed trains BERT-large to parity in 14 hours using 64 GPUs (4 DGX-2 boxes) and in
3.7 hours using 256 GPUs (16 DGX-2 boxes).
**BERT-large Training Times**
| Devices | Source | Training Time (hours) |
| ------------- | --------- | ---------------------:|
| 64 TPUs | Google | 96 |
| 64 V100 GPUs | DeepSpeed | **14** |
| 256 V100 GPUs | NVIDIA | 3.9 |
| 256 V100 GPUs | DeepSpeed | **3.7** |
*Read more*: [BERT pre-training tutorial](https://www.deepspeed.ai/tutorials/bert-pretraining/)
* DeepSpeed trains GPT2 (1.5 billion parameters) 3.75x faster than the state-of-the-art NVIDIA
Megatron on Azure GPUs.
*Read more*: [GPT tutorial](https://www.deepspeed.ai/tutorials/megatron/)
## Memory efficiency
DeepSpeed provides memory-efficient data parallelism and enables training models without
model parallelism. For example, DeepSpeed can train models with up to 6 billion parameters on
NVIDIA V100 GPUs with 32GB of device memory. In comparison, existing frameworks (e.g.,
PyTorch's Distributed Data Parallel) run out of memory with 1.5 billion parameter models.
DeepSpeed reduces the training memory footprint through a novel solution called Zero
Redundancy Optimizer (ZeRO). Unlike basic data parallelism where memory states are
replicated across data-parallel processes, ZeRO partitions model states to save
significant memory. The current implementation (stage 1 of ZeRO) reduces memory by up to
4x relative to the state-of-art. You can read more about ZeRO in our [paper](https://arxiv.org/abs/1910.02054).
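To make the 4x figure concrete, here is a rough sketch of the memory accounting from the ZeRO paper (mixed-precision Adam; the numbers are illustrative, not measurements):

```python
# Approximate per-GPU memory (GB) for mixed-precision Adam training of a model
# with `psi` parameters, following the ZeRO paper's accounting.
def per_gpu_gb(psi, dp_degree, zero_stage1=False):
    fp16_params = 2 * psi          # fp16 weights
    fp16_grads = 2 * psi           # fp16 gradients
    optim_states = 12 * psi        # fp32 weights + Adam momentum and variance
    if zero_stage1:
        optim_states /= dp_degree  # ZeRO stage 1 partitions optimizer states
    return (fp16_params + fp16_grads + optim_states) / 1024**3

print(per_gpu_gb(1.5e9, 64))                    # ~22 GB per GPU without ZeRO
print(per_gpu_gb(1.5e9, 64, zero_stage1=True))  # ~6 GB per GPU with stage 1
```

Partitioning the 12Ψ bytes of optimizer state across the data-parallel group is what yields the roughly 4x reduction quoted above.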
With this impressive memory reduction, early adopters of DeepSpeed have already
produced a language model (LM) with over 17B parameters called
[Turing-NLG](https://www.microsoft.com/en-us/research/blog/turing-nlg-a-17-billion-parameter-language-model-by-microsoft),
establishing a new SOTA in the LM category.
## Scalability
DeepSpeed supports efficient data parallelism, model parallelism, and their
combination. ZeRO boosts the scaling capability and efficiency further.
* DeepSpeed provides system support to run models up to 100 billion parameters,
10x larger than the state-of-art (8 billion NVIDIA GPT, 11 billion Google T5).
* DeepSpeed can run large models more efficiently, up to 6x faster for models of
various sizes spanning 1.5B to 100B parameters. More specifically, the data parallelism powered by ZeRO
is complementary to and can be combined with different types of model parallelism. This allows
DeepSpeed to fit models using a lower degree of model parallelism and a higher batch size, offering
significant performance gains compared to using model parallelism alone.
*Read more*: [technical report](https://arxiv.org/abs/1910.02054)
and [GPT tutorial](https://www.deepspeed.ai/tutorials/megatron/)
![DeepSpeed-vs-Megatron](./docs/assets/images/DeepSpeed-vs-Megatron.png)
<p align="center">
<em>The figure depicts system throughput improvements of DeepSpeed (combining ZeRO-powered data parallelism with model parallelism of NVIDIA Megatron-LM) over using Megatron-LM alone.</em>
</p>
## Fast convergence for effectiveness
DeepSpeed supports advanced hyperparameter tuning and large batch size
optimizers such as [LAMB](https://arxiv.org/abs/1904.00962). These improve the
effectiveness of model training and reduce the number of samples required to
converge to the desired accuracy.
*Read more*: [Tuning tutorial](https://www.deepspeed.ai/tutorials/1Cycle/) and [BERT pre-training tutorial](https://www.deepspeed.ai/tutorials/bert-pretraining/)
## Usability
Only a few lines of code changes are needed to enable a PyTorch model to use DeepSpeed and ZeRO. Compared to current model parallelism libraries, DeepSpeed does not require a code redesign or model refactoring. It also does not put limitations on model dimensions (such as number of attention heads, hidden sizes, and others), batch size, or any other training parameters. For models of up to six billion parameters, you can use ZeRO-powered data parallelism conveniently without requiring model parallelism, while in contrast, standard data parallelism will run out of memory for models with more than 1.3 billion parameters. In addition, DeepSpeed conveniently supports flexible combination of ZeRO-powered data parallelism with custom model parallelisms, such as tensor slicing of NVIDIA's Megatron-LM.
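To illustrate the scale of the change required, a minimal training loop wrapped with DeepSpeed might look like the sketch below; `args`, `net`, `trainset`, and `criterion` are placeholders, the keyword names follow the `deepspeed.initialize` API, and device placement and argument parsing are omitted for brevity:

```python
import deepspeed

# args is assumed to carry --local_rank and --deepspeed_config
model_engine, optimizer, trainloader, _ = deepspeed.initialize(
    args=args,                          # placeholder: parsed command-line arguments
    model=net,                          # placeholder: any torch.nn.Module
    model_parameters=net.parameters(),
    training_data=trainset)             # placeholder: a torch.utils.data.Dataset

for inputs, labels in trainloader:
    loss = criterion(model_engine(inputs), labels)  # forward pass as usual
    model_engine.backward(loss)                     # replaces loss.backward()
    model_engine.step()                             # replaces optimizer.step()
```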
# Features
Below we provide a brief feature list, see our detailed [feature
......
...@@ -35,11 +35,6 @@ jobs:
pre-commit run --all-files
displayName: 'Formatting checks'
- script: |
pip install --user pylint
pylint --exit-zero deepspeed/
displayName: 'Code linter'
- script: |
pytest --forked --verbose tests/unit/
displayName: 'Unit tests'
......
...@@ -6,6 +6,8 @@ from deepspeed.pt.deepspeed_light import DeepSpeedLight
from deepspeed.pt.deepspeed_light import ADAM_OPTIMIZER, LAMB_OPTIMIZER
from deepspeed.pt.deepspeed_lr_schedules import add_tuning_arguments
import deepspeed.pt.deepspeed_checkpointing as checkpointing
try:
from deepspeed.git_version_info import git_hash, git_branch
except ImportError:
...@@ -14,7 +16,7 @@ except ImportError:
# Export version information
__version_major__ = 0
__version_minor__ = 2
__version_patch__ = 0
__version__ = '.'.join(
map(str,
...@@ -33,7 +35,8 @@ def initialize(args,
lr_scheduler=None,
mpu=None,
dist_init_required=None,
collate_fn=None,
config_params=None):
"""Initialize the DeepSpeed Engine. """Initialize the DeepSpeed Engine.
Arguments: Arguments:
...@@ -91,7 +94,8 @@ def initialize(args, ...@@ -91,7 +94,8 @@ def initialize(args,
lr_scheduler=lr_scheduler, lr_scheduler=lr_scheduler,
mpu=mpu, mpu=mpu,
dist_init_required=dist_init_required, dist_init_required=dist_init_required,
collate_fn=collate_fn) collate_fn=collate_fn,
config_params=config_params)
return_items = [
engine,
......
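The `config_params` argument threaded through `initialize` above lets callers supply the DeepSpeed configuration as an in-memory dict instead of pointing `--deepspeed_config` at a JSON file (the argument sanity check in `deepspeed_light.py` is relaxed accordingly later in this diff). A hedged sketch, with `args` and `net` as placeholders:

```python
import deepspeed

ds_config = {
    "train_batch_size": 8,
    "fp16": {"enabled": True},          # ZeRO requires fp16
    "zero_optimization": {"stage": 2},
}

model_engine, optimizer, _, _ = deepspeed.initialize(
    args=args,                          # placeholder args namespace (local_rank)
    model=net,                          # placeholder torch.nn.Module
    model_parameters=net.parameters(),
    config_params=ds_config)            # no --deepspeed_config file needed
```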
"""
Copyright (c) Microsoft Corporation
Licensed under the MIT license.
"""
from deepspeed.pt.deepspeed_config_utils import get_scalar_param
#########################################
# DeepSpeed Activation Checkpointing
#########################################
# Activation checkpointing saves memory by keeping only a select few
# activations for the backward pass.
ACTIVATION_CHKPT_FORMAT = '''
Activation Checkpointing should be configured as:
"session_params": {
"activation_checkpointing": {
"partitioned_activations": [true|false],
"number_checkpoints": 100,
"contiguous_memory_optimization": [true|false],
"cpu_checkpointing": [true|false]
"profile": [true|false],
"synchronize_checkpoint_boundary": [true|false],
}
}
'''
ACT_CHKPT_PARTITION_ACTIVATIONS = 'partition_activations'
ACT_CHKPT_PARTITION_ACTIVATIONS_DEFAULT = False
ACT_CHKPT_NUMBER_CHECKPOINTS = 'number_checkpoints'
ACT_CHKPT_NUMBER_CHECKPOINTS_DEFAULT = None
ACT_CHKPT_CONTIGUOUS_MEMORY_OPTIMIZATION = 'contiguous_memory_optimization'
ACT_CHKPT_CONTIGUOUS_MEMORY_OPTIMIZATION_DEFAULT = False
ACT_CHKPT_SYNCHRONIZE_CHECKPOINT_BOUNDARY = 'synchronize_checkpoint_boundary'
ACT_CHKPT_SYNCHRONIZE_CHECKPOINT_BOUNDARY_DEFAULT = False
ACT_CHKPT_PROFILE = 'profile'
ACT_CHKPT_PROFILE_DEFAULT = False
ACT_CHKPT_CPU_CHECKPOINTING = 'cpu_checkpointing'
ACT_CHKPT_CPU_CHECKPOINTING_DEFAULT = False
ACT_CHKPT = 'activation_checkpointing'
ACT_CHKPT_DEFAULT = {
ACT_CHKPT_PARTITION_ACTIVATIONS: ACT_CHKPT_PARTITION_ACTIVATIONS_DEFAULT,
ACT_CHKPT_NUMBER_CHECKPOINTS: ACT_CHKPT_NUMBER_CHECKPOINTS_DEFAULT,
ACT_CHKPT_CONTIGUOUS_MEMORY_OPTIMIZATION:
ACT_CHKPT_CONTIGUOUS_MEMORY_OPTIMIZATION_DEFAULT,
ACT_CHKPT_SYNCHRONIZE_CHECKPOINT_BOUNDARY:
ACT_CHKPT_SYNCHRONIZE_CHECKPOINT_BOUNDARY_DEFAULT,
ACT_CHKPT_PROFILE: ACT_CHKPT_PROFILE_DEFAULT,
ACT_CHKPT_CPU_CHECKPOINTING: ACT_CHKPT_CPU_CHECKPOINTING_DEFAULT
}
class DeepSpeedActivationCheckpointingConfig(object):
def __init__(self, param_dict):
super(DeepSpeedActivationCheckpointingConfig, self).__init__()
self.partition_activations = None
self.contiguous_memory_optimization = None
self.cpu_checkpointing = None
self.number_checkpoints = None
self.synchronize_checkpoint_boundary = None
self.profile = None
if ACT_CHKPT in param_dict.keys():
act_chkpt_config_dict = param_dict[ACT_CHKPT]
else:
act_chkpt_config_dict = ACT_CHKPT_DEFAULT
self._initialize(act_chkpt_config_dict)
"""
For json serialization
"""
def repr(self):
return self.__dict__
def _initialize(self, act_chkpt_config_dict):
self.partition_activations = get_scalar_param(
act_chkpt_config_dict,
ACT_CHKPT_PARTITION_ACTIVATIONS,
ACT_CHKPT_PARTITION_ACTIVATIONS_DEFAULT)
self.contiguous_memory_optimization = get_scalar_param(
act_chkpt_config_dict,
ACT_CHKPT_CONTIGUOUS_MEMORY_OPTIMIZATION,
ACT_CHKPT_CONTIGUOUS_MEMORY_OPTIMIZATION_DEFAULT)
self.cpu_checkpointing = get_scalar_param(act_chkpt_config_dict,
ACT_CHKPT_CPU_CHECKPOINTING,
ACT_CHKPT_CPU_CHECKPOINTING_DEFAULT)
self.number_checkpoints = get_scalar_param(act_chkpt_config_dict,
ACT_CHKPT_NUMBER_CHECKPOINTS,
ACT_CHKPT_NUMBER_CHECKPOINTS_DEFAULT)
self.profile = get_scalar_param(act_chkpt_config_dict,
ACT_CHKPT_PROFILE,
ACT_CHKPT_PROFILE_DEFAULT)
self.synchronize_checkpoint_boundary = get_scalar_param(
act_chkpt_config_dict,
ACT_CHKPT_SYNCHRONIZE_CHECKPOINT_BOUNDARY,
ACT_CHKPT_SYNCHRONIZE_CHECKPOINT_BOUNDARY_DEFAULT)
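For reference, a config fragment exercising the options parsed by `DeepSpeedActivationCheckpointingConfig` above might look as follows (values are illustrative, not recommendations):

```python
from deepspeed.pt.deepspeed_checkpointing_config import DeepSpeedActivationCheckpointingConfig

ds_config = {
    "activation_checkpointing": {
        "partition_activations": True,
        "contiguous_memory_optimization": False,
        "cpu_checkpointing": False,
        "number_checkpoints": 4,
        "profile": False,
        "synchronize_checkpoint_boundary": False,
    }
}

cfg = DeepSpeedActivationCheckpointingConfig(ds_config)
print(cfg.repr())  # the parsed settings as a plain dict
```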
...@@ -8,6 +8,9 @@ import logging
import json
from deepspeed.pt.deepspeed_constants import *
from deepspeed.pt.loss_scaler import INITIAL_LOSS_SCALE, SCALE_WINDOW, DELAYED_SHIFT, MIN_LOSS_SCALE
from deepspeed.pt.deepspeed_config_utils import get_scalar_param, dict_raise_error_on_duplicate_keys
from deepspeed.pt.deepspeed_zero_config import DeepSpeedZeroConfig
from deepspeed.pt.deepspeed_checkpointing_config import DeepSpeedActivationCheckpointingConfig
TENSOR_CORE_ALIGN_SIZE = 8
ADAM_OPTIMIZER = 'adam'
...@@ -15,13 +18,6 @@ LAMB_OPTIMIZER = 'lamb'
DEEPSPEED_OPTIMIZERS = [ADAM_OPTIMIZER, LAMB_OPTIMIZER]
def get_scalar_param(param_dict, param_name, param_default_value):
if param_name in param_dict.keys():
return param_dict[param_name]
else:
return param_default_value
def get_fp16_enabled(param_dict):
if FP16 in param_dict.keys():
return get_scalar_param(param_dict[FP16], FP16_ENABLED, FP16_ENABLED_DEFAULT)
...@@ -92,10 +88,20 @@ def get_sparse_gradients_enabled(param_dict):
return get_scalar_param(param_dict, SPARSE_GRADIENTS, SPARSE_GRADIENTS_DEFAULT)
def get_zero_optimization(param_dict):
return get_scalar_param(param_dict, ZERO_OPTIMIZATION, ZERO_OPTIMIZATION_DEFAULT)
def get_zero_reduce_scatter(param_dict):
return get_scalar_param(param_dict, ZERO_REDUCE_SCATTER, ZERO_REDUCE_SCATTER_DEFAULT)
def get_zero_max_elements_per_comm(param_dict):
return get_scalar_param(param_dict,
ZERO_MAX_ELEMENTS_PER_COMM,
ZERO_MAX_ELEMENTS_PER_COMM_DEFAULT)
def get_allgather_size(param_dict):
return get_scalar_param(param_dict,
ALLGATHER_SIZE,
...@@ -204,6 +210,10 @@ def get_wall_clock_breakdown(param_dict):
WALL_CLOCK_BREAKDOWN_DEFAULT)
def get_memory_breakdown(param_dict):
return get_scalar_param(param_dict, MEMORY_BREAKDOWN, MEMORY_BREAKDOWN_DEFAULT)
def get_tensorboard_enabled(param_dict):
if TENSORBOARD in param_dict.keys():
return get_scalar_param(param_dict[TENSORBOARD],
...@@ -231,10 +241,39 @@ def get_tensorboard_job_name(param_dict):
return TENSORBOARD_JOB_NAME_DEFAULT
'''Write deepspeed config files by modifying basic templates.
Can be used for quickly changing parameters via command line parameters.'''
class DeepSpeedConfigWriter:
def __init__(self, data=None):
self.data = data if data is not None else {}
def add_config(self, key, value):
self.data[key] = value
def load_config(self, filename):
self.data = json.load(open(filename,
'r'),
object_pairs_hook=dict_raise_error_on_duplicate_keys)
def write_config(self, filename):
with open(filename, 'w') as outfile:
json.dump(self.data, outfile)
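A short sketch of how the new `DeepSpeedConfigWriter` helper above could be used to derive a per-run config from a base template (file names are placeholders):

```python
from deepspeed.pt.deepspeed_config import DeepSpeedConfigWriter

writer = DeepSpeedConfigWriter()
writer.load_config('base_ds_config.json')        # placeholder template file
writer.add_config('train_batch_size', 32)        # override a single parameter
writer.write_config('/tmp/ds_config_run.json')   # point --deepspeed_config here
```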
class DeepSpeedConfig(object):
def __init__(self, json_file, mpu=None, param_dict=None):
super(DeepSpeedConfig, self).__init__()
self._param_dict = json.load(open(json_file, 'r'))
if param_dict is None:
self._param_dict = json.load(
open(json_file,
'r'),
object_pairs_hook=dict_raise_error_on_duplicate_keys)
else:
self._param_dict = param_dict
try:
self.global_rank = torch.distributed.get_rank()
if mpu is None:
...@@ -263,7 +302,14 @@ class DeepSpeedConfig(object):
self.sparse_gradients_enabled = get_sparse_gradients_enabled(param_dict)
self.allgather_size = get_allgather_size(param_dict)
self.zero_enabled = get_zero_enabled(param_dict)
self.zero_config = DeepSpeedZeroConfig(param_dict)
self.zero_optimization_stage = self.zero_config.stage
self.zero_enabled = self.zero_optimization_stage > 0
self.activation_checkpointing_config = DeepSpeedActivationCheckpointingConfig(
param_dict)
self.gradient_clipping = get_gradient_clipping(param_dict)
self.fp16_enabled = get_fp16_enabled(param_dict)
self.loss_scale = get_loss_scale(param_dict)
...@@ -285,6 +331,7 @@ class DeepSpeedConfig(object):
self.scheduler_params = get_scheduler_params(param_dict)
self.wall_clock_breakdown = get_wall_clock_breakdown(param_dict)
self.memory_breakdown = get_memory_breakdown(param_dict)
self.tensorboard_enabled = get_tensorboard_enabled(param_dict)
self.tensorboard_output_path = get_tensorboard_output_path(param_dict)
self.tensorboard_job_name = get_tensorboard_job_name(param_dict)
...@@ -305,8 +352,8 @@ class DeepSpeedConfig(object):
f'Gradient accumulation steps: {grad_acc} has to be greater than 0'
assert train_batch == micro_batch * grad_acc * self.world_size, \
(f'Check batch related parameters. train_batch_size is not equal'
' to micro_batch_per_gpu * gradient_acc_step * world_size'
f'{train_batch} != {micro_batch} * {grad_acc} * {self.world_size}')
def _set_batch_related_parameters(self):
...@@ -387,6 +434,7 @@ class DeepSpeedConfig(object):
def _do_error_check(self):
if self.zero_enabled:
assert self.fp16_enabled, "DeepSpeedConfig: ZeRO is only supported if fp16 is enabled"
assert self.zero_optimization_stage <= MAX_STAGE_ZERO_OPTIMIZATION, "DeepSpeedConfig: Maximum supported ZeRO stage is {}".format(MAX_STAGE_ZERO_OPTIMIZATION)
assert self.train_micro_batch_size_per_gpu, "DeepSpeedConfig: {} is not defined".format(TRAIN_MICRO_BATCH_SIZE_PER_GPU)
......
"""
Copyright (c) Microsoft Corporation
Licensed under the MIT license.
"""
"""
Collection of DeepSpeed configuration utilities
"""
def get_scalar_param(param_dict, param_name, param_default_value):
if param_name in param_dict.keys():
return param_dict[param_name]
else:
return param_default_value
def dict_raise_error_on_duplicate_keys(ordered_pairs):
"""Reject duplicate keys."""
d = {}
for k, v in ordered_pairs:
if k in d:
raise ValueError("Duplicate key in DeepSpeed config: %r" % (k, ))
else:
d[k] = v
return d
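These two helpers are small but load-bearing: `get_scalar_param` gives every config reader a uniform default-handling path, and `dict_raise_error_on_duplicate_keys` is passed to `json.load` so a silently overridden key becomes a hard error. A quick sketch:

```python
import json
from deepspeed.pt.deepspeed_config_utils import (get_scalar_param,
                                                 dict_raise_error_on_duplicate_keys)

cfg = {"train_batch_size": 32}
print(get_scalar_param(cfg, "train_batch_size", 1))     # 32, key present
print(get_scalar_param(cfg, "gradient_clipping", 0.0))  # 0.0, falls back to default

# A duplicated key now raises instead of silently keeping the last value.
try:
    json.loads('{"train_batch_size": 8, "train_batch_size": 16}',
               object_pairs_hook=dict_raise_error_on_duplicate_keys)
except ValueError as err:
    print(err)  # Duplicate key in DeepSpeed config: 'train_batch_size'
```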
...@@ -15,7 +15,7 @@ ROUTE_ENCODE = "encode"
# Batch size
#############################################
TRAIN_BATCH_SIZE = "train_batch_size"
TRAIN_BATCH_SIZE_DEFAULT = None
#############################################
# Optimizer and lr scheduler
...@@ -133,14 +133,27 @@ GRADIENT_CLIPPING_DEFAULT = 0.
# ZeRO optimization
#########################################
# ZeRO optimization. By default, this optimization is not enabled.
# Users have to configure the desired optimization (0 means disabled) in params.json as below example:
ZERO_FORMAT = '''
ZeRO optimization should be enabled as:
"session_params": {
"zero_optimization": [0|1|2],
"zero_all_gather_size": 200
}
'''
ZERO_OPTIMIZATION = 'zero_optimization'
ZERO_OPTIMIZATION_DEFAULT = 0
ZERO_OPTIMIZATION_OPTIMIZER_STATES = 1
ZERO_OPTIMIZATION_GRADIENTS = 2
ZERO_OPTIMIZATION_WEIGHTS = 3
MAX_STAGE_ZERO_OPTIMIZATION = ZERO_OPTIMIZATION_GRADIENTS
ZERO_REDUCE_SCATTER = "zero_reduce_scatter"
ZERO_REDUCE_SCATTER_DEFAULT = True
ZERO_MAX_ELEMENTS_PER_COMM = "zero_max_elements_per_comm"
ZERO_MAX_ELEMENTS_PER_COMM_DEFAULT = 5e8
ALLGATHER_SIZE = 'allgather_size'
ALLGATHER_SIZE_DEFAULT = 500000000
...@@ -217,6 +230,9 @@ Wall block breakdown should be enabled as:
WALL_CLOCK_BREAKDOWN = 'wall_clock_breakdown'
WALL_CLOCK_BREAKDOWN_DEFAULT = False
MEMORY_BREAKDOWN = 'memory_breakdown'
MEMORY_BREAKDOWN_DEFAULT = False
#########################################
# Tensorboard
#########################################
......
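The new `memory_breakdown` flag sits beside `wall_clock_breakdown` at the top level of the config; when both are enabled, the timer log lines gain the CUDA memory columns added to `SynchronizedWallClockTimer.log` later in this change. Illustrative fragment:

```python
# Illustrative top-level logging keys in a DeepSpeed config dict.
ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},
    "wall_clock_breakdown": True,   # per-step timing breakdown
    "memory_breakdown": True,       # append allocated/cached CUDA memory to the log
}
```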
...@@ -8,11 +8,14 @@ import os
import warnings
import torch.distributed as dist
from torch.nn.modules import Module
from torch.distributed.distributed_c10d import _get_global_rank
from tensorboardX import SummaryWriter
from deepspeed.pt.deepspeed_timer import ThroughputTimer, SynchronizedWallClockTimer
from deepspeed.pt.deepspeed_zero_optimizer import FP16_DeepSpeedZeroOptimizer
from deepspeed.pt.zero_optimizer_stage1 import FP16_DeepSpeedZeroOptimizer_Stage1
import deepspeed.pt.deepspeed_checkpointing as deepspeed_activation_checkpointing
from deepspeed.pt.fp16_optimizer import FP16_Optimizer
from deepspeed.pt.fp16_unfused_optimizer import FP16_UnfusedOptimizer
...@@ -21,8 +24,10 @@ from deepspeed.pt.deepspeed_config import DeepSpeedConfig, \
ADAM_OPTIMIZER, LAMB_OPTIMIZER, DEEPSPEED_OPTIMIZERS
from deepspeed.pt.deepspeed_dataloader import DeepSpeedDataLoader
from deepspeed.pt.deepspeed_constants import \
ROUTE_TRAIN, ROUTE_PREDICT, ROUTE_EVAL, \
TORCH_DISTRIBUTED_DEFAULT_PORT, \
ZERO_OPTIMIZATION_OPTIMIZER_STATES, ZERO_OPTIMIZATION_GRADIENTS
import deepspeed.pt.deepspeed_lr_schedules as lr_schedules
from deepspeed.pt.deepspeed_csr_tensor import CSRTensor
...@@ -96,7 +101,8 @@ class DeepSpeedLight(Module):
lr_scheduler=None,
mpu=None,
dist_init_required=None,
collate_fn=None,
config_params=None):
super(DeepSpeedLight, self).__init__()
logging.basicConfig(level=logging.INFO,
...@@ -116,6 +122,7 @@ class DeepSpeedLight(Module):
self.gradient_predivide_factor = 1.0
self.gradient_average = True
self.warn_unscaled_loss = True
self.config_params = config_params
if dist_init_required is None:
dist_init_required = not dist.is_initialized()
...@@ -146,6 +153,9 @@ class DeepSpeedLight(Module):
# Configure distributed model
self._configure_distributed_model(model)
# Configure wall clock timer
self.timers = SynchronizedWallClockTimer()
# Throughput timer
self.tput_timer = ThroughputTimer(
batch_size=self.train_micro_batch_size_per_gpu(),
...@@ -163,9 +173,6 @@ class DeepSpeedLight(Module):
self._configure_lr_scheduler(lr_scheduler)
self._report_progress(0)
# Configure wall clock timer
self.timers = SynchronizedWallClockTimer()
# Bookkeeping for csr support
self.csr_tensor_module_names = set()
if self.sparse_gradients_enabled():
...@@ -245,6 +252,9 @@ class DeepSpeedLight(Module):
def wall_clock_breakdown(self):
return self._config.wall_clock_breakdown
def memory_breakdown(self):
return self._config.memory_breakdown
def sparse_gradients_enabled(self):
return self._config.sparse_gradients_enabled
...@@ -275,6 +285,30 @@ class DeepSpeedLight(Module):
def zero_allow_untested_optimizer(self):
return self._config.zero_allow_untested_optimizer
def zero_reduce_scatter(self):
return self._config.zero_config.reduce_scatter
def zero_overlap_comm(self):
return self._config.zero_config.overlap_comm
def zero_max_elements_per_comm(self):
return self._config.zero_max_elements_per_comm
def zero_optimization_stage(self):
return self._config.zero_optimization_stage
def zero_reduce_bucket_size(self):
return self._config.zero_config.reduce_bucket_size
def zero_allgather_bucket_size(self):
return self._config.zero_config.allgather_bucket_size
def zero_optimization_partition_gradients(self):
return self.zero_optimization_stage() >= ZERO_OPTIMIZATION_GRADIENTS
def zero_contiguous_gradients(self):
return self._config.zero_config.contiguous_gradients
def allgather_size(self):
return self._config.allgather_size
...@@ -296,8 +330,8 @@ class DeepSpeedLight(Module):
def steps_per_print(self):
return self._config.steps_per_print
def zero_allgather_partitions(self):
return self._config.zero_config.allgather_partitions
def dump_state(self):
return self._config.dump_state
...@@ -375,7 +409,9 @@ class DeepSpeedLight(Module):
# Configure based on command line arguments
def _configure_with_arguments(self, args, mpu):
self.local_rank = args.local_rank if hasattr(args, 'local_rank') else 0
self._config = DeepSpeedConfig(args.deepspeed_config,
mpu,
param_dict=self.config_params)
# Validate command line arguments
def _do_args_sanity_check(self, args):
...@@ -390,6 +426,7 @@ class DeepSpeedLight(Module):
assert hasattr(args, 'local_rank') and type(args.local_rank) == int, \
'DeepSpeed requires integer command line parameter --local_rank'
if self.config_params is None:
assert hasattr(args, 'deepspeed_config') and args.deepspeed_config is not None, \
'DeepSpeed requires --deepspeed_config to specify configuration file'
...@@ -424,7 +461,8 @@ class DeepSpeedLight(Module):
else:
self.data_parallel_group = self.mpu.get_data_parallel_group()
self.dp_world_size = self.mpu.get_data_parallel_world_size()
src_rank = _get_global_rank(self.mpu.get_data_parallel_group(), 0)
print(f"global src_rank={src_rank}")
for p in self.module.parameters():
if torch.is_tensor(p):
dist.broadcast(p, src_rank, group=self.data_parallel_group)
...@@ -518,17 +556,42 @@ class DeepSpeedLight(Module):
return optimizer
def _configure_zero_optimizer(self, optimizer):
zero_stage = self.zero_optimization_stage()
logging.info('Creating fp16 ZeRO stage {} optimizer'.format(zero_stage))
if zero_stage == ZERO_OPTIMIZATION_OPTIMIZER_STATES:
assert self.zero_reduce_scatter(), 'Stage 1 only supports reduce scatter mode'
logging.info('Creating fp16 ZeRO Optimizer Stage 1')
optimizer = FP16_DeepSpeedZeroOptimizer_Stage1(
optimizer,
static_loss_scale=self.loss_scale(),
dynamic_loss_scale=self.dynamic_loss_scale(),
dynamic_loss_args=self.dynamic_loss_scale_args(),
clip_grad=self.gradient_clipping(),
all_gather_partitions=self.zero_allgather_partitions(),
allgather_size=self.zero_allgather_bucket_size(),
max_elements_per_comm=self.zero_reduce_bucket_size(),
dp_process_group=self.data_parallel_group, dp_process_group=self.data_parallel_group,
mpu=self.mpu)
elif zero_stage == ZERO_OPTIMIZATION_GRADIENTS:
assert self.gradient_accumulation_steps() == 1, "ZeRO stage 2 does not support gradient accumulation, if you need gradient accumulation please use stage 1"
optimizer = FP16_DeepSpeedZeroOptimizer(
optimizer,
timers=self.timers,
static_loss_scale=self.loss_scale(),
dynamic_loss_scale=self.dynamic_loss_scale(),
dynamic_loss_args=self.dynamic_loss_scale_args(),
clip_grad=self.gradient_clipping(),
contiguous_gradients=self.zero_contiguous_gradients(),
reduce_bucket_size=self.zero_reduce_bucket_size(),
allgather_bucket_size=self.zero_allgather_bucket_size(),
dp_process_group=self.data_parallel_group,
reduce_scatter=self.zero_reduce_scatter(),
overlap_comm=self.zero_overlap_comm(),
mpu=self.mpu)
else:
raise NotImplementedError("ZeRO stage {} not implemented".format(zero_stage))
logging.info('Creating fp16 zero stage {} optimizer'.format(zero_stage))
return optimizer
...@@ -624,6 +687,15 @@ class DeepSpeedLight(Module):
def allreduce_gradients(self, bucket_size=MEMORY_OPT_ALLREDUCE_SIZE):
if self.is_gradient_accumulation_boundary():
if self.zero_optimization_stage() == ZERO_OPTIMIZATION_OPTIMIZER_STATES:
assert self.zero_reduce_scatter()
self.optimizer.reduce_scatter_gradients(
postscale_gradients=self.postscale_gradients(),
gradient_predivide_factor=self.gradient_predivide_factor,
gradient_average=self.gradient_average)
elif self.zero_optimization_partition_gradients():
self.optimizer.overlapping_partition_gradients_reduce_epilogue()
else:
self.buffered_allreduce_fallback(elements_per_buffer=bucket_size)
def backward(self, loss, allreduce_gradients=True):
...@@ -636,7 +708,7 @@ class DeepSpeedLight(Module):
# scale loss w.r.t. gradient accumulation if needed
if self.gradient_accumulation_steps() > 1:
loss = self._scale_loss(loss.float())
# Log training Loss
if self.tensorboard_enabled():
...@@ -765,16 +837,17 @@ class DeepSpeedLight(Module):
'backward_inner_microstep',
'backward_allreduce_microstep',
'step_microstep'
],
memory_breakdown=self.memory_breakdown())
if self.tensorboard_enabled():
if self.is_gradient_accumulation_boundary():
if self.tensorboard_enabled() and torch.distributed.get_rank(
) == 0: # this is done before the log because log resets timers
self.summary_events = [(f'Train/elapsed_time_ms_forward', self.timers('forward').elapsed(reset=False) * 1000.0, self.sample_count), \
(f'Train/elapsed_time_ms_backward', self.timers('backward').elapsed(reset=False) * 1000.0, self.sample_count), \
(f'Train/elapsed_time_ms_backward_inner', self.timers('backward_inner').elapsed(reset=False) * 1000.0, self.sample_count), \
(f'Train/elapsed_time_ms_backward_allreduce', self.timers('backward_allreduce').elapsed(reset=False) * 1000.0, self.sample_count), \
(f'Train/elapsed_time_ms_step', self.timers('step').elapsed(reset=False) * 1000.0, self.sample_count)
]
for event in self.summary_events: # write_summary_events
self.summary_writer.add_scalar(event[0], event[1], event[2])
...@@ -971,19 +1044,30 @@ class DeepSpeedLight(Module):
if not os.path.exists(dirname):
os.makedirs(dirname)
def load_checkpoint(self,
load_dir,
tag,
load_module_strict=True,
load_optimizer_states=True,
load_lr_scheduler_states=True):
r"""Load training checkpoint r"""Load training checkpoint
Arguments: Arguments:
load_dir: Required. Directory to load the checkpoint from load_dir: Required. Directory to load the checkpoint from
tag: Required. Checkpoint tag used as a unique identifier for the checkpoint. Ex. Global Step. tag: Required. Checkpoint tag used as a unique identifier for the checkpoint. Ex. Global Step.
load_module_strict: Optional. Boolean to strictly enforce that the keys in state_dict of module and checkpoint match.
load_optimizer_states: Optional. Boolean to load the training optimizer states from Checkpoint. Ex. ADAM's momentum and variance
load_lr_scheduler_states: Optional. Boolean to add the learning rate scheduler states from Checkpoint.
Return:
load_path: Path of the loaded checkpoint. None if loading the checkpoint failed
client_state: State dictionary used for loading required training states in the client code.
"""
load_path, client_states = self._load_checkpoint(load_dir,
tag,
load_module_strict=load_module_strict,
load_optimizer_states=load_optimizer_states,
load_lr_scheduler_states=load_lr_scheduler_states)
if self.zero_optimization() and load_path is not None:
self._load_zero_checkpoint(load_dir,
...@@ -992,7 +1076,12 @@ class DeepSpeedLight(Module):
return load_path, client_states
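With the new flags a client can, for example, warm-start the module weights while resetting optimizer and LR-scheduler state. A hedged sketch (`model_engine`, the directory, and the tag are placeholders):

```python
# Load module weights only: tolerate mismatched keys and skip optimizer and
# LR-scheduler state (all three keyword flags are introduced by this change).
load_path, client_state = model_engine.load_checkpoint(
    'checkpoints/',                  # placeholder load_dir
    tag='global_step1000',           # placeholder tag
    load_module_strict=False,
    load_optimizer_states=False,
    load_lr_scheduler_states=False)
if load_path is None:
    print('no checkpoint found')
```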
def _load_checkpoint(self,
load_dir,
tag,
load_module_strict=True,
load_optimizer_states=True,
load_lr_scheduler_states=True):
load_path = self._get_ckpt_name(load_dir, tag)
...@@ -1005,12 +1094,13 @@ class DeepSpeedLight(Module):
logging.info('Loading checkpoint: {}'.format(load_path))
checkpoint = torch.load(load_path, map_location=lambda storage, loc: storage)
self.load_module_state_dict(state_dict=checkpoint['module'],
strict=load_module_strict)
if not self.zero_optimization():
self.optimizer.load_state_dict(checkpoint['optimizer'],
load_optimizer_states=load_optimizer_states)
if load_lr_scheduler_states and self.lr_scheduler is not None:
self.lr_scheduler.load_state_dict(checkpoint['lr_scheduler'])
self.csr_tensor_module_names = checkpoint['csr_tensor_module_names']
...@@ -1019,6 +1109,7 @@ class DeepSpeedLight(Module):
deepspeed_states = [
'module',
'optimizer',
'lr_scheduler',
'csr_tensor_module_names',
'skipped_steps',
'global_steps'
...@@ -1058,19 +1149,15 @@ class DeepSpeedLight(Module):
#There seems to be issue creating them in parallel
self._create_checkpoint_files(save_dir, tag)
try:
if self.save_non_zero_checkpoint:
self._save_checkpoint(save_dir, tag, client_state=client_state)
if self.save_zero_checkpoint:
self._save_zero_checkpoint(save_dir, tag)
except:
logging.error(f'Failed Saving model checkpoint to {save_dir} with tag {tag}')
return False
return True
def _create_checkpoint_files(self, save_dir, tag):
#checkpoint files are created sequentially
for rank in range(self.world_size):
if rank == self.global_rank:
...@@ -1114,14 +1201,8 @@ class DeepSpeedLight(Module):
torch.save(state, save_path)
def _save_zero_checkpoint(self, save_path, tag):
try:
zero_checkpoint_name = self._get_zero_ckpt_name(save_path, tag)
#self._ensure_directory_exists(zero_checkpoint_name)
except:
logging.error(
f'Failed Saving Zero model checkpoint to {save_path} with tag {tag}')
zero_sd = {'optimizer_state_dict': self.optimizer.state_dict()}
torch.save(zero_sd, zero_checkpoint_name)
logging.info('zero checkpoint saved {}'.format(zero_checkpoint_name))
...@@ -69,13 +69,27 @@ class SynchronizedWallClockTimer:
self.timers[name] = self.Timer(name)
return self.timers[name]
@staticmethod
def memory_usage():
alloc = "mem_allocated: {:.4f} GB".format(torch.cuda.memory_allocated() /
(1024 * 1024 * 1024))
max_alloc = "max_mem_allocated: {:.4f} GB".format(
torch.cuda.max_memory_allocated() / (1024 * 1024 * 1024))
cache = "cache_allocated: {:.4f} GB".format(torch.cuda.memory_cached() /
(1024 * 1024 * 1024))
max_cache = "max_cache_allocated: {:.4f} GB".format(
torch.cuda.max_memory_cached() / (1024 * 1024 * 1024))
return " | {} | {} | {} | {}".format(alloc, max_alloc, cache, max_cache)
def log(self, names, normalizer=1.0, reset=True, memory_breakdown=False):
"""Log a group of timers.""" """Log a group of timers."""
assert normalizer > 0.0 assert normalizer > 0.0
string = 'time (ms)' string = 'time (ms)'
for name in names: for name in names:
elapsed_time = self.timers[name].elapsed(reset=reset) * 1000.0 / normalizer elapsed_time = self.timers[name].elapsed(reset=reset) * 1000.0 / normalizer
string += ' | {}: {:.2f}'.format(name, elapsed_time) string += ' | {}: {:.2f}'.format(name, elapsed_time)
if memory_breakdown:
string += self.memory_usage()
print_rank_0(string)
......
...@@ -12,9 +12,10 @@ from torch._six import inf
class CheckOverflow(object):
'''Checks for overflow in gradient across parallel process'''
def __init__(self, param_groups=None, mpu=None, zero_reduce_scatter=False):
self.mpu = mpu
self.params = [] if param_groups else None
self.zero_reduce_scatter = zero_reduce_scatter
if param_groups:
for group in param_groups:
for param in group:
...@@ -54,8 +55,8 @@ class CheckOverflow(object):
# `params` is a list / generator of torch.Variable
def has_overflow_serial(self, params):
for i, p in enumerate(params):
if p.grad is not None and self._has_inf_or_nan(p.grad.data, i):
return True
return False
...@@ -67,7 +68,11 @@ class CheckOverflow(object):
#torch.distributed.all_reduce(overflow_gpu,
# op=torch.distributed.ReduceOp.MAX,
# group=mpu.get_model_parallel_group())
if self.zero_reduce_scatter:
torch.distributed.all_reduce(overflow_gpu,
op=torch.distributed.ReduceOp.MAX,
group=torch.distributed.group.WORLD)
elif self.mpu is not None:
torch.distributed.all_reduce(overflow_gpu,
op=torch.distributed.ReduceOp.MAX,
group=self.mpu.get_model_parallel_group())
...@@ -76,7 +81,7 @@ class CheckOverflow(object):
# `x` is a torch.Tensor
@staticmethod
def _has_inf_or_nan(x, i):
try:
# if x is half, the .float() incurs an additional deep copy, but it's necessary if
# Pytorch's .sum() creates a one-element tensor of the same type as x
...@@ -93,10 +98,25 @@ class CheckOverflow(object):
return True
else:
if cpu_sum == float('inf') or cpu_sum == -float('inf') or cpu_sum != cpu_sum:
_handle_overflow(cpu_sum, x, i)
return True
return False
def _handle_overflow(cpu_sum, x, i):
import math
rank = torch.distributed.get_rank()
if rank == 0:
t_i = -1
for v_i, v in enumerate(x.data.contiguous().view(-1)):
if not math.isfinite(float(v)):
t_i = v_i
break
print(
f"rank {rank} detected overflow {cpu_sum} in tensor {i}:{t_i} shape {x.shape}"
)
def get_grad_norm(parameters, norm_type=2, mpu=None):
"""Clips gradient norm of an iterable of parameters.
...@@ -221,3 +241,33 @@ def get_weight_norm(parameters, norm_type=2, mpu=None):
total_norm = -1
return total_norm
def is_model_parallel_parameter(p):
return hasattr(p, 'model_parallel') and p.model_parallel
def see_memory_usage(message):
return  # note: this early return disables the memory reports below
if torch.distributed.is_initialized() and not torch.distributed.get_rank() == 0:
return
# Print message except when distributed but not rank 0
print(message, flush=True)
print("Memory Allocated ",
torch.cuda.memory_allocated() / (1024 * 1024 * 1024),
"GigaBytes",
flush=True)
print("Max Memory Allocated ",
torch.cuda.max_memory_allocated() / (1024 * 1024 * 1024),
"GigaBytes",
flush=True)
print("Cache Allocated ",
torch.cuda.memory_cached() / (1024 * 1024 * 1024),
"GigaBytes",
flush=True)
print("Max cache Allocated ",
torch.cuda.max_memory_cached() / (1024 * 1024 * 1024),
"GigaBytes",
flush=True)
print(" ", flush=True)
"""
Copyright (c) Microsoft Corporation
Licensed under the MIT license.
"""
import logging
#from deepspeed.pt.deepspeed_constants import *
from deepspeed.pt.deepspeed_config_utils import get_scalar_param
#########################################
# ZeRO optimization
#########################################
# ZeRO optimization. By default, this optimization is not enabled.
# Users have to configure the desired optimization (0 means disabled) in params.json as below example:
ZERO_FORMAT = '''
ZeRO optimization should be enabled as:
"session_params": {
"zero_optimization": {
"stage": [0|1|2],
"allgather_partitions": [true|false],
"allgather_bucket_size": 500000000,
"reduce_scatter": [true|false],
"contiguous_gradients" : [true|false]
"overlap_comm": [true|false],
"reduce_bucket_size": 500000000
}
}
'''
ZERO_OPTIMIZATION = 'zero_optimization'
ZERO_OPTIMIZATION_DISABLED = 0
ZERO_OPTIMIZATION_OPTIMIZER_STATES = 1
ZERO_OPTIMIZATION_GRADIENTS = 2
ZERO_OPTIMIZATION_WEIGHTS = 3
MAX_STAGE_ZERO_OPTIMIZATION = ZERO_OPTIMIZATION_GRADIENTS
ZERO_OPTIMIZATION_STAGE = 'stage'
ZERO_OPTIMIZATION_STAGE_1 = 'stage_1'
ZERO_OPTIMIZATION_STAGE_2 = 'stage_2'
ZERO_OPTIMIZATION_STAGE_3 = 'stage_3'
ZERO_OPTIMIZATION_STAGE_DEFAULT = ZERO_OPTIMIZATION_DISABLED
ZERO_OPTIMIZATION_ALLGATHER_PARTITIONS = 'allgather_partitions'
ZERO_OPTIMIZATION_ALLGATHER_PARTITIONS_DEFAULT = True
ZERO_OPTIMIZATION_REDUCE_SCATTER = 'reduce_scatter'
ZERO_OPTIMIZATION_REDUCE_SCATTER_DEFAULT = True
ZERO_OPTIMIZATION_OVERLAP_COMM = 'overlap_comm'
ZERO_OPTIMIZATION_OVERLAP_COMM_DEFAULT = False
ZERO_OPTIMIZATION_CONTIGUOUS_GRADIENTS = 'contiguous_gradients'
ZERO_OPTIMIZATION_CONTIGUOUS_GRADIENTS_DEFAULT = True
ZERO_OPTIMIZATION_REDUCE_BUCKET_SIZE = 'reduce_bucket_size'
ZERO_OPTIMIZATION_REDUCE_BUCKET_SIZE_DEFAULT = 500000000
ZERO_OPTIMIZATION_ALLGATHER_BUCKET_SIZE = 'allgather_bucket_size'
ZERO_OPTIMIZATION_ALLGATHER_BUCKET_SIZE_DEFAULT = 500000000
ZERO_OPTIMIZATION_ALLGATHER_BUCKET_SIZE_DEPRECATED = 'allgather_size'
ZERO_OPTIMIZATION_DEFAULT = {
ZERO_OPTIMIZATION_STAGE: ZERO_OPTIMIZATION_STAGE_DEFAULT,
ZERO_OPTIMIZATION_CONTIGUOUS_GRADIENTS:
ZERO_OPTIMIZATION_CONTIGUOUS_GRADIENTS_DEFAULT,
ZERO_OPTIMIZATION_REDUCE_SCATTER: ZERO_OPTIMIZATION_REDUCE_SCATTER_DEFAULT,
ZERO_OPTIMIZATION_REDUCE_BUCKET_SIZE: ZERO_OPTIMIZATION_REDUCE_BUCKET_SIZE_DEFAULT,
ZERO_OPTIMIZATION_ALLGATHER_PARTITIONS:
ZERO_OPTIMIZATION_ALLGATHER_PARTITIONS_DEFAULT,
ZERO_OPTIMIZATION_ALLGATHER_BUCKET_SIZE:
ZERO_OPTIMIZATION_ALLGATHER_BUCKET_SIZE_DEFAULT
}
class DeepSpeedZeroConfig(object):
def __init__(self, param_dict):
super(DeepSpeedZeroConfig, self).__init__()
self.stage = None
self.contiguous_gradients = None
self.reduce_scatter = None
self.reduce_bucket_size = None
self.allgather_partitions = None
self.allgather_bucket_size = None
self.overlap_comm = None
if ZERO_OPTIMIZATION in param_dict.keys():
zero_config_dict = param_dict[ZERO_OPTIMIZATION]
if type(zero_config_dict) is bool:
zero_config_dict = self.read_zero_config_deprecated(param_dict)
else:
zero_config_dict = ZERO_OPTIMIZATION_DEFAULT
self._initialize(zero_config_dict)
def read_zero_config_deprecated(self, param_dict):
zero_config_dict = {}
zero_config_dict[
ZERO_OPTIMIZATION_STAGE] = 1 if param_dict[ZERO_OPTIMIZATION] else 0
if zero_config_dict[ZERO_OPTIMIZATION_STAGE] > 0:
zero_config_dict[ZERO_OPTIMIZATION_ALLGATHER_BUCKET_SIZE] = get_scalar_param(
param_dict,
ZERO_OPTIMIZATION_ALLGATHER_BUCKET_SIZE_DEPRECATED,
ZERO_OPTIMIZATION_ALLGATHER_BUCKET_SIZE_DEFAULT)
logging.warning(
'DeepSpeedConfig: this format of ZeRO optimization setup is deprecated. Please use the following format: {}'
.format(ZERO_FORMAT))
return zero_config_dict
"""
For json serialization
"""
def repr(self):
return self.__dict__
def _initialize(self, zero_config_dict):
self.stage = get_scalar_param(zero_config_dict,
ZERO_OPTIMIZATION_STAGE,
ZERO_OPTIMIZATION_STAGE_DEFAULT)
self.contiguous_gradients = get_scalar_param(
zero_config_dict,
ZERO_OPTIMIZATION_CONTIGUOUS_GRADIENTS,
ZERO_OPTIMIZATION_CONTIGUOUS_GRADIENTS_DEFAULT)
self.reduce_bucket_size = get_scalar_param(
zero_config_dict,
ZERO_OPTIMIZATION_REDUCE_BUCKET_SIZE,
ZERO_OPTIMIZATION_REDUCE_BUCKET_SIZE_DEFAULT)
self.reduce_scatter = get_scalar_param(zero_config_dict,
ZERO_OPTIMIZATION_REDUCE_SCATTER,
ZERO_OPTIMIZATION_REDUCE_SCATTER_DEFAULT)
self.overlap_comm = get_scalar_param(zero_config_dict,
ZERO_OPTIMIZATION_OVERLAP_COMM,
ZERO_OPTIMIZATION_OVERLAP_COMM_DEFAULT)
self.allgather_partitions = get_scalar_param(
zero_config_dict,
ZERO_OPTIMIZATION_ALLGATHER_PARTITIONS,
ZERO_OPTIMIZATION_ALLGATHER_PARTITIONS_DEFAULT)
self.allgather_bucket_size = get_scalar_param(
zero_config_dict,
ZERO_OPTIMIZATION_ALLGATHER_BUCKET_SIZE,
ZERO_OPTIMIZATION_ALLGATHER_BUCKET_SIZE_DEFAULT)
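Putting the new ZeRO section together, a stage-2 configuration following `ZERO_FORMAT` above might look like the sketch below (bucket sizes are the defaults defined above, not tuned values):

```python
from deepspeed.pt.deepspeed_zero_config import DeepSpeedZeroConfig

ds_config = {
    "train_batch_size": 16,
    "fp16": {"enabled": True},            # ZeRO requires fp16 (see _do_error_check)
    "zero_optimization": {
        "stage": 2,                       # 1 = optimizer states, 2 = + gradients
        "allgather_partitions": True,
        "allgather_bucket_size": 500000000,
        "reduce_scatter": True,
        "reduce_bucket_size": 500000000,
        "overlap_comm": False,
        "contiguous_gradients": True,
    },
}

zero_cfg = DeepSpeedZeroConfig(ds_config)
print(zero_cfg.stage, zero_cfg.overlap_comm)  # 2 False
```

The old boolean form `"zero_optimization": true` is still accepted and is mapped to stage 1 by `read_zero_config_deprecated`, with a deprecation warning.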
import torch
import torch.distributed as dist
def _initialize_parameter_parallel_groups(parameter_parallel_size=None):
data_parallel_size = int(dist.get_world_size())
if parameter_parallel_size is None:
parameter_parallel_size = int(data_parallel_size)
print(data_parallel_size, parameter_parallel_size)
assert data_parallel_size % parameter_parallel_size == 0, \
'world size should be divisible by parameter parallel size'
rank = dist.get_rank()
my_group = None
for i in range(dist.get_world_size() // parameter_parallel_size):
ranks = range(i * parameter_parallel_size, (i + 1) * parameter_parallel_size)
group = torch.distributed.new_group(ranks)
if rank in ranks:
my_group = group
return my_group
def pprint(msg):
if not dist.is_initialized() or dist.get_rank() == 0:
print(msg)
...@@ -29,6 +29,14 @@ collections:
tutorials:
output: true
permalink: /:collection/:path/
order:
- getting-started.md
- azure.md
- cifar-10.md
- bert-pretraining.md
- megatron.md
- 1Cycle.md
- lrrt.md
defaults:
- scope:
......
...@@ -18,7 +18,7 @@ lnav:
children:
- title: "Installation"
url: /getting-started/#installation
- title: "Writing models"
url: /getting-started/#writing-deepspeed-models
- title: "Training"
url: /getting-started/#training
...@@ -37,19 +37,25 @@ lnav:
url: /docs/config-json/#communication-options
- title: "FP16"
url: /docs/config-json/#fp16-training-options
- title: "ZeRO optimizations"
url: /docs/config-json/#zero-optimizations-for-fp16-training
- title: "Logging" - title: "Logging"
url: /docs/config-json/#logging url: /docs/config-json/#logging
- title: "Activation checkpointing"
url: /docs/config-json/#activation-checkpointing
- title: "Tutorials" - title: "Tutorials"
url: /tutorials/ url: /tutorials/
children: children:
- title: "Getting Started on Azure" - title: "Getting started"
url: /getting-started/
- title: "Getting started on Azure"
url: /tutorials/azure/
- title: "CIFAR-10"
url: /tutorials/cifar-10/
- title: "Megatron-LM GPT2"
url: /tutorials/megatron/
- title: "BERT Pre-training" - title: "BERT Pre-training"
url: /tutorials/bert-pretraining/ url: /tutorials/bert-pretraining/
- title: "Megatron-LM GPT2"
url: /tutorials/megatron/
- title: "1-Cycle Schedule" - title: "1-Cycle Schedule"
url: /tutorials/1Cycle/ url: /tutorials/1Cycle/
- title: "Learning Rate Range Test" - title: "Learning Rate Range Test"
......
...@@ -12,12 -12,6 @@ layout: archive
{% endif %}
<h2>Features Coming Soon</h2>
{% assign soon = posts | where: "sneak_preview", "true" %}
{% for post in soon %}
{% include archive-single.html %}
{% endfor %}
<h2>{{ site.data.ui-text[site.locale].recent_posts | default: "Recent Posts" }}</h2>
{% assign news = posts | where: "sneak_preview", "false" %}
{% for post in news %}
......