Unverified Commit b9a8dff7 authored by digger-yu, committed by GitHub

[doc] Fix typos under colossalai and doc (#3618)

* Fixed several spelling errors under colossalai

* Fix spelling errors in the colossalai and docs directories

* Carefully changed spelling errors under the example folder

* Update runtime_preparation_pass.py

revert autograft to autograd

* Update search_chunk.py

change utile to until

* Update check_installation.py

change misteach to mismatch in line 91

* Update 1D_tensor_parallel.md

revert to perceptron

* Update 2D_tensor_parallel.md

revert to perceptron in line 73

* Update 2p5D_tensor_parallel.md

revert to perceptron in line 71

* Update 3D_tensor_parallel.md

revert to perceptron in line 80

* Update README.md

revert to resnet in line 42

* Update reorder_graph.py

revert to indice in line 7

* Update p2p.py

revert to megatron in line 94

* Update initialize.py

revert to torchrun in line 198

* Update routers.py

change to detailed in line 63

* Update routers.py

change to detailed in line 146

* Update README.md

revert random number in line 402
parent e1b0a78a
@@ -74,7 +74,7 @@ class ColoInitContext(InsertPostInitMethodToModuleSubClasses):
"""
Args:
device (torch.device): the device where parameters initialized are resident. Defaults to torch.device('cpu').
- dtype (torch.dtype): the dtype of parameters initialized. Defults to torch.float.
+ dtype (torch.dtype): the dtype of parameters initialized. Defaults to torch.float.
default_pg (ProcessGroup): the default process group for all initialized parameters.
default_dist_spec: the default distributed specifications.
"""
@@ -164,7 +164,7 @@ def post_process_colo_init_ctx(model: torch.nn.Module,
model (torch.nn.module): the model
device (torch.device, optional): device type of the model params. Defaults to torch.device('cpu').
dtype (torch.dtype, optional): dtype of the model params. Defaults to torch.float.
- default_pg (Optional[ProcessGroup], optional): default process group. Defaults to None. Inidicates a DP-only process group.
+ default_pg (Optional[ProcessGroup], optional): default process group. Defaults to None. Indicates a DP-only process group.
default_dist_spec (Any, optional): default dist spec of params. Defaults to None.
Raises:
...
@@ -42,7 +42,7 @@ class ZeroDDP(ColoDDP):
Args:
module (torch.nn.Module): Module to apply ZeRO-DP.
- gemini_manager (GeminiManager): Manages the chunk manager and heterogeneous momery space.
+ gemini_manager (GeminiManager): Manages the chunk manager and heterogeneous memory space.
For more details, see the API reference of ``GeminiManager``.
pin_memory (bool): Chunks on CPU Memory use pin-memory.
force_outputs_fp32 (bool): If set to True, outputs will be fp32. Otherwise, outputs will be fp16.
@@ -684,7 +684,7 @@ class GeminiDDP(ZeroDDP):
memstats: Optional[MemStats] = None,
verbose: bool = False) -> None:
"""
- A torch.Module warpper using ZeRO-DP and Genimi.
+ A torch.Module wrapper using ZeRO-DP and Gemini.
ZeRO is for parallel. Gemini is for memory management.
WARNING: The class will modify the module inline!
@@ -706,7 +706,7 @@ class GeminiDDP(ZeroDDP):
Users can provide this argument to speed up searching.
If users do not know this argument before training, it is ok. We will use a default value 1024.
min_chunk_size_mb (float, optional): the minimum chunk size in MegaByte.
- If the aggregate size of parameters is still samller than the minimum chunk size,
+ If the aggregate size of parameters is still smaller than the minimum chunk size,
all parameters will be compacted into one small chunk.
memstats (MemStats, optional) the memory statistics collector by a runtime memory tracer.
"""
...
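For orientation, the `GeminiDDP` arguments documented in this hunk could be wired together roughly as follows. This is a hedged sketch; the `device` and `placement_policy` values are assumptions, not taken from this diff.

```python
import torch

# `module` is the user's plain torch.nn.Module to be wrapped.
# hidden_dim speeds up the chunk-size search; min_chunk_size_mb bounds the smallest chunk.
model = GeminiDDP(module,
                  device=torch.cuda.current_device(),
                  placement_policy='auto',
                  hidden_dim=1024,
                  min_chunk_size_mb=32)
```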
@@ -8,7 +8,7 @@ from . import BaseOpHook
@OPHOOKS.register_module
class ShardGradMemTracerHook(BaseOpHook):
"""
- A hook to process sharded param before and afther FWD and BWD operator executing.
+ A hook to process sharded param before and after FWD and BWD operator executing.
"""
def __init__(self):
...
@@ -8,7 +8,7 @@ from . import BaseOpHook
@OPHOOKS.register_module
class ShardParamHook(BaseOpHook):
"""
- A hook to process sharded param before and afther FWD and BWD operator executing.
+ A hook to process sharded param before and after FWD and BWD operator executing.
"""
def __init__(self):
...
@@ -53,7 +53,7 @@ class StatefulTensorMgr(object):
self._evict_time = 0
def adjust_layout(self) -> None:
- """ Adjust the layout of statefuil tensor according to the information provided
+ """ Adjust the layout of stateful tensor according to the information provided
by mem_stats_collector, which should belongs to a Sharded Model.
"""
# find stateful tensor in state COMPUTE
...
@@ -97,7 +97,7 @@ class ZeroInitContext(InsertPostInitMethodToModuleSubClasses):
"""We use this function to substitute fan-in and fan-out calculation in torch.nn.init.
This can help us get correct fan-in and fan-out for sharded tensor.
"""
- assert isinstance(tensor, nn.Parameter), "Sharded tensor initilization is only allowed for paramters"
+ assert isinstance(tensor, nn.Parameter), "Sharded tensor initialization is only allowed for parameters"
# get correct shape of input tensor
if not hasattr(tensor, 'colo_attr') or not tensor.colo_attr.param_is_sharded:
...
@@ -14,7 +14,7 @@ class BucketTensorShardStrategy(TensorShardStrategy):
"""Use the same shard scheme as `TensorShardStrategy`'s, but it gathers tensors of a sub-module together,
which will fully utilize network bandwidth.
It is especially useful when sub-module contains bias,
- since we cannot utilize network bandwidth well if we only gather a bias tensor (bias is usaully small).
+ since we cannot utilize network bandwidth well if we only gather a bias tensor (bias is usually small).
"""
def gather(self, tensor_list: List[ShardedTensor], process_group: Optional[dist.ProcessGroup] = None):
...
@@ -192,7 +192,7 @@ class ShardedModelV2(nn.Module):
def dump_memory_stats(self, filename: Optional[str] = 'dump_mem_stats.log') -> None:
"""
- dummy memory tracer collected infomation to a file.
+ dummy memory tracer collected information to a file.
try:
# forward: model(inputs)
# backward: optimizer.backward()
@@ -201,7 +201,7 @@ class ShardedModelV2(nn.Module):
exit(0)
"""
if self._use_memory_tracer:
- self.logger.error(f'dump memort tracer collected infomation to a {filename}', ranks=[0])
+ self.logger.error(f'dump memort tracer collected information to a {filename}', ranks=[0])
if gpc.get_global_rank() == 0:
with open(filename, 'w+') as f:
f.write(f'cuda reserved {torch.cuda.memory_reserved(get_current_device()) / 1e9} GB\n')
@@ -293,7 +293,7 @@ class ShardedModelV2(nn.Module):
if not p.requires_grad:
continue
# Leave the gradient accumulation state (_require_backward_grad_sync) as-is if not synchronizing this pass.
- # NOTE() (no-sync)/sync pass: (not conduct)/conduct gradient allreducing between process group.
+ # NOTE() (no-sync)/sync pass: (not conduct)/conduct gradient all reducing between process group.
# If _require_backward_grad_sync is True,
# p.grad remains the accumulated unsharded gradient from prior no-sync passes.
# We also allows to interleave no-sync pass with sync passes, if desired.
@@ -385,7 +385,7 @@ class ShardedModelV2(nn.Module):
param.colo_attr.grad_payload_reset(grad.data)
# release the memory of param
# we set a false None for parameter's payload
- # so we can get paramter's device and dtype later in optimizer
+ # so we can get parameter's device and dtype later in optimizer
param.colo_attr.data_payload_reset(torch.empty(0, device=grad.device, dtype=grad.dtype))
if param.colo_attr.is_replicated:
...
@@ -67,8 +67,8 @@ class ShardedOptimizerV2(ColossalaiOptimizer):
growth_interval (float, optional): growth_interval used by DynamicGradScaler. Defaults to 1000.
hysteresis (float, optional): hysteresis used by DynamicGradScaler. Defaults to 2.
max_scale (int, optional): max_scale used by DynamicGradScaler. Defaults to 2**32.
- dp_process_group (Optional[ProcessGroup], optional): data paralle process group. Defaults to None.
- mp_process_group (Optional[ProcessGroup], optional): model paralle process group. Defaults to None.
+ dp_process_group (Optional[ProcessGroup], optional): data parallel process group. Defaults to None.
+ mp_process_group (Optional[ProcessGroup], optional): model parallel process group. Defaults to None.
.. _PatrickStar\: Parallel Training of Pre-trained Models via Chunk-based Memory Management:
https://arxiv.org/abs/2108.05818
@@ -274,7 +274,7 @@ class ShardedOptimizerV2(ColossalaiOptimizer):
assert hasattr(p, 'colo_attr'), 'The parameter must be wrapped with ShardedParam'
shard_flag = not p.colo_attr.sharded_data_tensor.is_sharded and p.colo_attr.is_replicated
if shard_flag:
- # we always shard replicated paramters
+ # we always shard replicated parameters
self.shard_strategy.shard([p.colo_attr.sharded_data_tensor], self.dp_process_group)
self.master_params[p] = StatefulTensor(cast_tensor_to_fp32(p.colo_attr.data_payload.to(self.device)))
if shard_flag:
@@ -312,7 +312,7 @@ class ShardedOptimizerV2(ColossalaiOptimizer):
# If reuse_fp16_shard, grad fp16 which wasn't be offloaded may be evicted to CPU
if not p.colo_attr.offload_grad:
colo_model_data_tensor_move_inline(p.colo_attr.saved_grad, torch.cuda.current_device())
- # FIXME(ver217): p.data here is an empty tensor on CUDA and has no useful infomation
+ # FIXME(ver217): p.data here is an empty tensor on CUDA and has no useful information
# If we change p.grad directly
# it may raise error because of different shape/dtype/device of p.data and p.grad
# We just set p.data = p.colo_attr.saved_grad.payload here
@@ -333,7 +333,7 @@ class ShardedOptimizerV2(ColossalaiOptimizer):
def _copy_master_model_to_model_fp16(self):
# Copy master param data (fp32) to payload of colo_attr (fp16)
- # TODO() improve efficiency by gathering tensors into a chunk and transfering
+ # TODO() improve efficiency by gathering tensors into a chunk and transferring
# a chunk.
for group in self.optim.param_groups:
for p in group['params']:
@@ -350,7 +350,7 @@ class ShardedOptimizerV2(ColossalaiOptimizer):
p.data = self.master_params[p].payload
- # we need to allocate new memory for keep_not_shard paramters
+ # we need to allocate new memory for keep_not_shard parameters
# in order to use copy, otherwise, the sizes of tensor is not compatible
if p.colo_attr.data_payload.numel() != p.data.numel():
p.colo_attr.data_payload_reset(
...
@@ -26,7 +26,7 @@ def zero_model_wrapper(model: nn.Module,
zero_stage (int, optional): The stage of ZeRO DDP. You can find more information in ZeRO's paper.
https://arxiv.org/abs/1910.02054
gemini_config (dict, optional): The configuration dictionary of `GeminiDDP`. `GeminiDDP` is enabled
- when the stage is set to 3. You can set the arguemnts of `GeminiDDP` in the gemini_config.
+ when the stage is set to 3. You can set the arguments of `GeminiDDP` in the gemini_config.
Here is an example where we set the device of the model, the placement policy of Gemini, and the
size of hidden dimension to help Gemini find out a unified chunk size.
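The example referred to above is elided from this hunk; a hedged sketch of what such a `gemini_config` could look like (the keys mirror the `GeminiDDP` arguments and are assumptions here, as is the pre-existing `model`):

```python
import torch

# Illustrative only: device, placement policy and hidden dimension for Gemini.
gemini_config = dict(device=torch.cuda.current_device(),
                     placement_policy='auto',
                     hidden_dim=1024)
model = zero_model_wrapper(model, zero_stage=3, gemini_config=gemini_config)
```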
@@ -78,7 +78,7 @@ def zero_optim_wrapper(model: nn.Module,
max_norm (float, optional): max_norm used for `clip_grad_norm`. You should notice that you shall not do
clip_grad_norm by yourself when using ZeRO DDP. The ZeRO optimizer will take care of clip_grad_norm.
norm_type (float, optional): norm_type used for `clip_grad_norm`.
- optim_config (dict, optinoal): The configuration used for the ZeRO optimizer.
+ optim_config (dict, optional): The configuration used for the ZeRO optimizer.
Example:
>>> zero2_config = dict(reduce_bucket_size=12 * 1024 * 1024, overlap_communication=True)
...
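Continuing the doctest above, a hedged sketch of how the documented `optim_config` and `max_norm` options fit together (the optimizer choice and `model` are illustrative assumptions):

```python
import torch

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
zero2_config = dict(reduce_bucket_size=12 * 1024 * 1024, overlap_communication=True)
# The ZeRO optimizer handles gradient clipping internally via max_norm.
optimizer = zero_optim_wrapper(model, optimizer, optim_config=zero2_config, max_norm=1.0)
```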
@@ -4,7 +4,7 @@ Colossal-Auto simplifies the process of deploying large-scale machine learning m
### 1. Basic usage
- Colossal-Auto can be used to find a hybrid SPMD parallel strategy includes data, tensor(i.e., 1D, 2D, sequencial) for each operation. You can follow the [GPT example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt/experiments/auto_parallel).
+ Colossal-Auto can be used to find a hybrid SPMD parallel strategy includes data, tensor(i.e., 1D, 2D, sequential) for each operation. You can follow the [GPT example](https://github.com/hpcaitech/ColossalAI/tree/main/examples/language/gpt/experiments/auto_parallel).
Detailed instructions can be found in its `README.md`.
### 2. Integration with activation checkpoint
...
@@ -44,7 +44,7 @@ In some solutions, the [Zero-offload](https://arxiv.org/abs/2101.06840) adopted
</figure>
- Colossal-AI designed Gemini, just like two-stars, which manages the memory space of CPU and GPU efficiently. It can make the tensor dynamically distributed in the storage space of CPU-GPU during training, so that the model training can break through the memory wall of GPU. The memory manager consists of two parts: **MemStatsCollector (MSC)** and **StatefuleTensorMgr (STM)**.
+ Colossal-AI designed Gemini, just like two-stars, which manages the memory space of CPU and GPU efficiently. It can make the tensor dynamically distributed in the storage space of CPU-GPU during training, so that the model training can break through the memory wall of GPU. The memory manager consists of two parts: **MemStatsCollector (MSC)** and **StatefulTensorMgr (STM)**.
We take advantage of the iterative characteristics of the deep learning network training process. We divide iterations into two stages: warmup and non-warmup. One or several iterative steps at the beginning belong to the warmup stage, and the other iterative steps belong to the non-warmup stage. In the warmup stage, we collect information for the MSC, while in the non-warmup stage, STM gets the information collected by the MSC to move the tensor, so as to minimize the CPU-GPU data movement volume.
...
@@ -20,7 +20,7 @@ To launch the distributed inference service quickly, you can download the OPT-12
2. Prepare a prebuilt service image
- Pull a docker image from dockerhub installed with Colossal-AI inference.
+ Pull a docker image from docker hub installed with Colossal-AI inference.
```bash
docker pull hpcaitech/energon-ai:latest
...
@@ -12,7 +12,7 @@ Author: Yuxuan Lou
## Introduction
- In this example for ViT model, Colossal-AI provides three different parallelism techniques which acclerate model training: data parallelism, pipeline parallelism and tensor parallelism.
+ In this example for ViT model, Colossal-AI provides three different parallelism techniques which accelerate model training: data parallelism, pipeline parallelism and tensor parallelism.
We will show you how to train ViT on CIFAR-10 dataset with these parallelism techniques. To run this example, you will need 2-4 GPUs.
@@ -31,7 +31,7 @@ pip install colossalai
## Data Parallelism
- Data parallism is one basic way to accelerate model training process. You can apply data parallism to training by only two steps:
+ Data parallism is one basic way to accelerate model training process. You can apply data parallelism to training by only two steps:
1. Define a configuration file
2. Change a few lines of code in train script
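For readers following along, the two steps above boil down to something like the following hedged sketch (file names and hyperparameter values are illustrative, not taken from this diff):

```python
# config.py -- step 1: a minimal data-parallel configuration
BATCH_SIZE = 256
NUM_EPOCHS = 300

# train.py -- step 2: initialize Colossal-AI from the torchrun-launched environment
import colossalai
colossalai.launch_from_torch(config='./config.py')
```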
@@ -108,7 +108,7 @@ disable_existing_loggers()
logger = get_dist_logger()
```
- After initialization, you can acess the variables in the config file by using `colossalai.core.global_context`.
+ After initialization, you can access the variables in the config file by using `colossalai.core.global_context`.
```python
#access parameters
@@ -162,7 +162,7 @@ optimizer = colossalai.nn.Lamb(model.parameters(), lr=1.8e-2, weight_decay=0.1)
# build loss
criterion = torch.nn.CrossEntropyLoss()
- # lr_scheduelr
+ # lr_scheduler
lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
```
@@ -230,10 +230,10 @@ torchrun --standalone --nproc_per_node <NUM_GPUs> train_dp.py --config ./config
## Pipeline Parallelism
- Aside from data parallelism, Colossal-AI also support pipleline parallelism. In specific, Colossal-AI uses 1F1B pipeline introduced by NVIDIA. For more details, you can view the related [documents](https://www.colossalai.org/tutorials/features/pipeline_parallel).
+ Aside from data parallelism, Colossal-AI also support pipeline parallelism. In specific, Colossal-AI uses 1F1B pipeline introduced by NVIDIA. For more details, you can view the related [documents](https://www.colossalai.org/tutorials/features/pipeline_parallel).
### Define your configuration file(`hybrid_parallel/configs/vit_pipeline.py`)
- To apply pipleline parallel on the data parallel basis, you only need to add a **parallel dict**
+ To apply pipeline parallel on the data parallel basis, you only need to add a **parallel dict**
```python
from colossalai.amp import AMP_TYPE
@@ -250,7 +250,7 @@ clip_grad_norm = 1.0
Other configs:
```python
- # hyperparameters
+ # hyper parameters
# BATCH_SIZE is as per GPU
# global batch size = BATCH_SIZE x data parallel size
BATCH_SIZE = 256
@@ -276,7 +276,7 @@ Colossal-AI provides two methods to build a pipeline model from the existing mod
- `colossalai.builder.build_pipeline_model_from_cfg`
- `colossalai.builder.build_pipeline_model`
- Besides, you can also build a pipeline model from scrath with Colossal-AI.
+ Besides, you can also build a pipeline model from scratch with Colossal-AI.
```python
import math
from typing import Callable
@@ -521,7 +521,7 @@ def build_cifar(batch_size):
return train_dataloader, test_dataloader
- # craete dataloaders
+ # create dataloaders
train_dataloader , test_dataloader = build_cifar()
# create loss function
@@ -539,7 +539,7 @@ lr_scheduler = CosineAnnealingWarmupLR(optimizer=optimizer,
#### Start Colossal-AI engine
```python
- # intiailize
+ # initialize
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(model=model,
optimizer=optimizer,
criterion=criterion,
@@ -615,7 +615,7 @@ TENSOR_SHAPE = (BATCH_SIZE // NUM_MICRO_BATCHES, SEQ_LENGTH, HIDDEN_SIZE)
Ohter configs:
```python
- # hyperparameters
+ # hyper parameters
# BATCH_SIZE is as per GPU
# global batch size = BATCH_SIZE x data parallel size
BATCH_SIZE = 256
...
@@ -42,7 +42,7 @@ Therefore, when using Distributed Spec, we only need to describe the way that th
## Compute Spec
- An instance of class [ComputeSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.compute_spec.html#colossalai.tensor.compute_spec.ComputeSpec) describes how a Coloensor be used in DNN training. Currently, we will set the correct Compute Pattern for the ColoTensor as the parameters of the module. The specific application scenarios will be shown in the next document.
+ An instance of class [ComputeSpec](https://colossalai.readthedocs.io/en/latest/colossalai/colossalai.tensor.compute_spec.html#colossalai.tensor.compute_spec.ComputeSpec) describes how a Colotensor be used in DNN training. Currently, we will set the correct Compute Pattern for the ColoTensor as the parameters of the module. The specific application scenarios will be shown in the next document.
## ColoParameter
...
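As a quick illustration of the sentence above, a compute spec is a small object attached to a ColoTensor parameter; a hedged sketch (the import path and the `TP1D` pattern are assumed from the ColoTensor docs):

```python
from colossalai.tensor import ComputePattern, ComputeSpec

# Mark a parameter for 1D tensor-parallel compute.
compute_spec = ComputeSpec(ComputePattern.TP1D)
```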
@@ -172,7 +172,7 @@ In this config file, we specify that we want to use batch size 128 per GPU and r
#### Step 2. Initialize Distributed Environment
We need to initialize the distributed training environment. This has been introduced in the tutorial on how to
- [launch Colossal-AI](./launch_colossalai.md). For this demostration, we use `launch_from_torch` and PyTorch launch utility.
+ [launch Colossal-AI](./launch_colossalai.md). For this demonstration, we use `launch_from_torch` and PyTorch launch utility.
```python
import colossalai
...
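# A hedged guess at how the truncated snippet above continues, using the
# launch_from_torch entry point named in the text (the config path is hypothetical):
colossalai.launch_from_torch(config='./config.py')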
@@ -6,18 +6,18 @@ Author: Shenggui Li, Siqi Mai
With the development of deep learning model size, it is important to shift to a new training paradigm. The traditional training method with no parallelism and optimization became a thing of the past and new training methods are the key to make training large-scale models efficient and cost-effective.
- Colossal-AI is designed to be a unfied system to provide an integrated set of training skills and utilities to the user. You can find the common training utilities such as mixed precision training and gradient accumulation. Besides, we provide an array of parallelism including data, tensor and pipeline parallelism. We optimize tensor parallelism with different multi-dimensional distributed matrix-matrix multiplication algorithm. We also provided different pipeline parallelism methods to allow the user to scale their model across nodes efficiently. More advanced features such as offloading can be found in this tutorial documentation in detail as well.
+ Colossal-AI is designed to be a unified system to provide an integrated set of training skills and utilities to the user. You can find the common training utilities such as mixed precision training and gradient accumulation. Besides, we provide an array of parallelism including data, tensor and pipeline parallelism. We optimize tensor parallelism with different multi-dimensional distributed matrix-matrix multiplication algorithm. We also provided different pipeline parallelism methods to allow the user to scale their model across nodes efficiently. More advanced features such as offloading can be found in this tutorial documentation in detail as well.
## General Usage
- We aim to make Colossal-AI easy to use and non-instrusive to user code. There is a simple general workflow if you want to use Colossal-AI.
+ We aim to make Colossal-AI easy to use and non-intrusive to user code. There is a simple general workflow if you want to use Colossal-AI.
<figure style={{textAlign: "center"}}>
<img src="https://s2.loli.net/2022/01/28/ZK7ICWzbMsVuJof.png"/>
<figcaption>Workflow</figcaption>
</figure>
- 1. Prepare a configiguration file where specifies the features you want to use and your parameters.
+ 1. Prepare a configuration file where specifies the features you want to use and your parameters.
2. Initialize distributed backend with `colossalai.launch`
3. Inject the training features into your training components (e.g. model, optimizer) with `colossalai.initialize`.
4. Run training and testing
...
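Put together, the four steps listed above correspond roughly to the following hedged sketch (argument lists abridged; `model`, `optimizer`, `criterion` and the dataloaders are assumed to exist already):

```python
import colossalai

# Steps 1-2: launch with a prepared config file (path is illustrative).
colossalai.launch_from_torch(config='./config.py')

# Step 3: inject the configured features into the training components.
engine, train_dataloader, test_dataloader, _ = colossalai.initialize(
    model=model, optimizer=optimizer, criterion=criterion,
    train_dataloader=train_dataloader, test_dataloader=test_dataloader)

# Step 4: run training and testing with the returned engine.
```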
@@ -42,7 +42,7 @@ Given $P$ processors, we present the theoretical computation and memory cost, as
## Usage
- To enable 1D tensor parallelism for our model, e.g. on 2 GPUs, we need to configure the parallism setting as below.
+ To enable 1D tensor parallelism for our model, e.g. on 2 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
data=1,
...
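# A hedged completion of the truncated CONFIG above for the 1D case
# (pipeline/tensor values are assumed, matching 2 GPUs in '1d' mode):
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=2, mode='1d'),
))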
@@ -60,7 +60,7 @@ Given $P=q\times q$ processors, we present the theoretical computation and memor
## Usage
- To enable 2D tensor parallelism for our model, e.g. on 4 GPUs, we need to configure the parallism setting as below.
+ To enable 2D tensor parallelism for our model, e.g. on 4 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
data=1,
...
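# A hedged completion of the truncated CONFIG above for the 2D case
# (pipeline/tensor values are assumed, matching 4 GPUs in '2d' mode):
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=4, mode='2d'),
))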
@@ -57,7 +57,7 @@ Given $P=q \times q \times d$ processors, we present the theoretical computation
## Usage
- To enable 2.5D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallism setting as below.
+ To enable 2.5D tensor parallelism for our model, e.g. on 8 GPUs, we need to configure the parallelism setting as below.
```python
CONFIG = dict(parallel=dict(
data=1,
...
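# A hedged completion of the truncated CONFIG above for the 2.5D case
# (pipeline/tensor values, including depth, are assumed for 8 GPUs in '2.5d' mode):
CONFIG = dict(parallel=dict(
    data=1,
    pipeline=1,
    tensor=dict(size=8, mode='2.5d', depth=2),
))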