As PyTorch is based on a dynamic computation graph, the computation flow is not known until execution. To support pipeline parallelism, you can split your model in either of the following two ways:
1. Split your model directly. Below is an example of a ResNet model split into two pipeline stages.
```python
import torch
import torch.nn as nn
from torchvision.models import resnet18
from colossalai.core import global_context as gpc
from colossalai.context import ParallelMode

model = resnet18(num_classes=10)

# Stage 0 keeps the early layers of the network
if gpc.get_local_rank(ParallelMode.PIPELINE) == 0:
    model = nn.Sequential(
        model.conv1,
        model.bn1,
        model.relu,
        model.maxpool,
        model.layer1,
        model.layer2
    )
# Stage 1 keeps the remaining layers and the classification head
elif gpc.get_local_rank(ParallelMode.PIPELINE) == 1:
    class Flatten(nn.Module):
        def forward(self, x):
            return torch.flatten(x, 1)

    model = nn.Sequential(
        model.layer3,
        model.layer4,
        model.avgpool,
        Flatten(),
        model.fc
    )
```
2. Make sure your model inherits from `colossalai.nn.model.ModelFromConfig` and is registered in the
`MODELS` registry. Define the `self.layers_cfg` attribute, pass in a dict/Config object which specifies
the parameters of your model, and use `colossalai.builder.pipeline.PipelineModelInitializer` to partition
the layers, as sketched below.
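A minimal sketch of the second approach. `MyPipelineModel` is a hypothetical model class assumed to subclass `ModelFromConfig`, to be registered in `MODELS`, and to define `self.layers_cfg`; the exact `PipelineModelInitializer` constructor arguments and build method may differ between Colossal-AI versions.
```python
from colossalai.builder.pipeline import PipelineModelInitializer

# 'MyPipelineModel' is a hypothetical class that subclasses
# colossalai.nn.model.ModelFromConfig, is registered in the MODELS registry,
# and defines self.layers_cfg describing its layers.
model_cfg = dict(
    type='MyPipelineModel',
    hidden_size=1024,   # example parameters consumed by the model
    num_layers=8
)

# Partition the layers listed in layers_cfg among the pipeline stages.
# Constructor arguments and the build method name are assumptions here.
initializer = PipelineModelInitializer(model_cfg, num_chunks=1)
model = initializer.initialize()
```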
In `ignite.engine` or `keras.engine`, the process function is always provided by users. However, it is tricky for users
to write their own process functions for pipeline parallelism. Aiming at offering accessible hybrid parallelism for
users, we provide the powerful `Engine` class. This class enables pipeline parallelism and offers a
one-forward-one-backward non-interleaving strategy. Also, you can use a pre-defined learning rate scheduler in
the `Engine` class to adjust the learning rate during training.
In order to build your engine, just set variables `model`, `criterion`, `optimizer`, `lr_scheduler` and `schedule`. The
following code block provides an example. **The engine is automatically created from the config file for you if you
start with `colossalai.initialize`.**
The engine class is a high-level wrapper around these frequently-used functions, preserving a PyTorch-like function signature while integrating with our features.
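A minimal sketch of building an engine by hand for a non-pipelined run. The keyword arguments follow the variable names listed above, but the schedule class name and the exact `Engine` constructor signature are assumptions that may differ between Colossal-AI versions.
```python
import torch
import torch.nn as nn
from torchvision.models import resnet18
from colossalai.engine import Engine
# Class name assumed; check colossalai.engine.schedule in your version.
from colossalai.engine.schedule import NoPipelineSchedule

model = resnet18(num_classes=10)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
lr_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
schedule = NoPipelineSchedule()

# The engine wraps these components behind a PyTorch-like interface and
# drives the forward/backward passes according to the given schedule.
engine = Engine(
    model=model,
    criterion=criterion,
    optimizer=optimizer,
    lr_scheduler=lr_scheduler,
    schedule=schedule
)
```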
The Zero Redundancy Optimizer (ZeRO) removes the memory redundancies across data-parallel processes by partitioning the three
model states (optimizer states, gradients, and parameters) instead of replicating them.
By doing so, memory efficiency is boosted drastically compared to classic data parallelism while the computational granularity
and communication efficiency are retained.
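For intuition, here is the per-parameter memory arithmetic from the ZeRO paper, assuming mixed-precision training with Adam; these constants come from the paper, not from Colossal-AI itself.
```latex
% fp16 parameters: 2 bytes, fp16 gradients: 2 bytes,
% fp32 optimizer states (master params, momentum, variance): 12 bytes.
% For a model with $\Psi$ parameters partitioned across $N_d$ data-parallel processes:
\underbrace{(2 + 2 + 12)\,\Psi}_{\text{classic data parallelism}}
\;\longrightarrow\;
\underbrace{\frac{(2 + 2 + 12)\,\Psi}{N_d}}_{\text{ZeRO level 3}}
```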
...
...
## Getting Started with ZeRO
If you are training models with Colossal-AI, enabling ZeRO data parallelism and offloading only takes a few extra lines in your configuration file. We support configurations for levels 2 and 3. For a level-1 optimizer, use the [PyTorch native implementation](https://pytorch.org/tutorials/recipes/zero_redundancy_optimizer.html), as sketched below.
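A minimal level-1 sketch using PyTorch's native `ZeroRedundancyOptimizer` (available since PyTorch 1.8); the model here is only a stand-in, and a distributed process group must already be initialized.
```python
import torch
import torch.nn as nn
from torch.distributed.optim import ZeroRedundancyOptimizer

# Assumes torch.distributed has been initialized and the model is replicated
# across the data-parallel ranks (e.g. wrapped in DistributedDataParallel).
model = nn.Linear(2048, 2048).cuda()

# Level 1: only the optimizer states are sharded across the ranks.
optimizer = ZeroRedundancyOptimizer(
    model.parameters(),
    optimizer_class=torch.optim.Adam,
    lr=0.001,
    weight_decay=0
)
```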
### Example of ZeRO-3 Configurations
Here we use `Adam` as the initial optimizer.
1. Use ZeRO to partition the optimizer states, gradients (level 2), and parameters (level 3).
```python
optimizer = dict(
    type='Adam',
    lr=0.001,
    weight_decay=0
)

zero = dict(
    level=3,
    dynamic_loss_scale=True,
    clip_grad=1.0
)
```
2. Additionally offload the optimizer states and computations to the CPU.
```python
zero = dict(
    level=3,
    offload_optimizer_config=dict(
        device='cpu',
        pin_memory=True,
...
...
3. Save even more memory by offloading parameters to the CPU memory.
```python
zero = dict(
    level=3,
    offload_optimizer_config=dict(
        device='cpu',
        pin_memory=True,
...
...
4. Save even MORE memory by offloading to NVMe (if available on your system):
```python
zero = dict(
    level=3,
    offload_optimizer_config=dict(
        device='nvme',
        pin_memory=True,
...
...
)
```
Note that `fp16` is automatically enabled when using ZeRO. This relies on the `AMP_TYPE.NAIVE` mode in the Colossal-AI AMP module.
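For reference, a sketch of how naive AMP is normally selected in a Colossal-AI configuration file; when ZeRO is enabled this section is implied, and the import path of `AMP_TYPE` may differ between versions.
```python
from colossalai.amp import AMP_TYPE  # import path assumed; may vary by version

# Naive AMP mode: the same mixed-precision scheme that ZeRO relies on.
fp16 = dict(
    mode=AMP_TYPE.NAIVE
)
```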