OpenDAS / ColossalAI · Commit 35813ed3 (unverified)
Authored by Frank Lee on Dec 13, 2021; committed via GitHub on Dec 13, 2021
Parent: 7d371105

update examples and sphnix docs for the new api (#63)
Showing 20 changed files with 136 additions and 94 deletions (+136 −94):
- docs/colossalai/colossalai.nn.model.vision_transformer.vision_transformer.rst (+0 −5)
- docs/colossalai/colossalai.nn.multi_tensor_apply.multi_tensor_apply.rst (+0 −5)
- docs/colossalai/colossalai.nn.optimizer.loss_scaler.rst (+0 −5)
- docs/colossalai/colossalai.nn.optimizer.rst (+4 −9)
- docs/colossalai/colossalai.nn.optimizer.zero_redundancy_optimizer_level_1.rst (+0 −5)
- docs/colossalai/colossalai.nn.optimizer.zero_redundancy_optimizer_level_2.rst (+0 −5)
- docs/colossalai/colossalai.nn.optimizer.zero_redundancy_optimizer_level_3.rst (+0 −5)
- docs/colossalai/colossalai.nn.rst (+4 −5)
- docs/colossalai/colossalai.registry.rst (+4 −4)
- docs/colossalai/colossalai.rst (+11 −9)
- docs/colossalai/colossalai.trainer.rst (+4 −3)
- docs/colossalai/colossalai.utils.data_sampler.rst (+5 −0)
- docs/colossalai/colossalai.utils.gradient_accumulation.rst (+5 −0)
- docs/colossalai/colossalai.utils.multi_tensor_apply.rst (+8 −0)
- docs/colossalai/colossalai.utils.rst (+7 −4)
- docs/colossalai/colossalai.zero.rst (+5 −0)
- docs/config.md (+9 −0)
- docs/installation.md (+5 −7)
- docs/run_demo.md (+64 −23)
- docs/zero.md (+1 −0)
docs/colossalai/colossalai.nn.model.vision_transformer.vision_transformer.rst (deleted, 100644 → 0)

```rst
colossalai.nn.model.vision\_transformer.vision\_transformer
===========================================================

.. automodule:: colossalai.nn.model.vision_transformer.vision_transformer
   :members:
```
docs/colossalai/colossalai.nn.multi_tensor_apply.multi_tensor_apply.rst (deleted, 100644 → 0)

```rst
colossalai.nn.multi\_tensor\_apply.multi\_tensor\_apply
=======================================================

.. automodule:: colossalai.nn.multi_tensor_apply.multi_tensor_apply
   :members:
```
docs/colossalai/colossalai.nn.optimizer.loss_scaler.rst (deleted, 100644 → 0)

```rst
colossalai.nn.optimizer.loss\_scaler
====================================

.. automodule:: colossalai.nn.optimizer.loss_scaler
   :members:
```
docs/colossalai/colossalai.nn.optimizer.rst (modified, +4 −9)

```diff
 colossalai.nn.optimizer
 =======================
 
-.. automodule:: colossalai.nn.optimizer
-   :members:
-
 .. toctree::
    :maxdepth: 2
 
-   colossalai.nn.optimizer.fp16_optimizer
    colossalai.nn.optimizer.fused_adam
    colossalai.nn.optimizer.fused_lamb
    colossalai.nn.optimizer.fused_sgd
    colossalai.nn.optimizer.lamb
    colossalai.nn.optimizer.lars
-   colossalai.nn.optimizer.loss_scaler
-   colossalai.nn.optimizer.zero_redundancy_optimizer_level_1
-   colossalai.nn.optimizer.zero_redundancy_optimizer_level_2
-   colossalai.nn.optimizer.zero_redundancy_optimizer_level_3
+
+.. automodule:: colossalai.nn.optimizer
+   :members:
```
docs/colossalai/colossalai.nn.optimizer.zero_redundancy_optimizer_level_1.rst (deleted, 100644 → 0)

```rst
colossalai.nn.optimizer.zero\_redundancy\_optimizer\_level\_1
=============================================================

.. automodule:: colossalai.nn.optimizer.zero_redundancy_optimizer_level_1
   :members:
```
docs/colossalai/colossalai.nn.optimizer.zero_redundancy_optimizer_level_2.rst (deleted, 100644 → 0)

```rst
colossalai.nn.optimizer.zero\_redundancy\_optimizer\_level\_2
=============================================================

.. automodule:: colossalai.nn.optimizer.zero_redundancy_optimizer_level_2
   :members:
```
docs/colossalai/colossalai.nn.optimizer.zero_redundancy_optimizer_level_3.rst (deleted, 100644 → 0)

```rst
colossalai.nn.optimizer.zero\_redundancy\_optimizer\_level\_3
=============================================================

.. automodule:: colossalai.nn.optimizer.zero_redundancy_optimizer_level_3
   :members:
```
docs/colossalai/colossalai.nn.rst (modified, +4 −5)

```diff
 colossalai.nn
 =============
 
-.. automodule:: colossalai.nn
-   :members:
-
 .. toctree::
    :maxdepth: 2
 
    colossalai.nn.data
    colossalai.nn.layer
    colossalai.nn.loss
    colossalai.nn.lr_scheduler
    colossalai.nn.model
-   colossalai.nn.multi_tensor_apply
    colossalai.nn.optimizer
+
+.. automodule:: colossalai.nn
+   :members:
```
docs/colossalai/colossalai.registry.rst (modified, +4 −4)

```diff
 colossalai.registry
 ===================
 
-.. automodule:: colossalai.registry
-   :members:
-
 .. toctree::
    :maxdepth: 2
 
    colossalai.registry.registry
+
+.. automodule:: colossalai.registry
+   :members:
```
docs/colossalai/colossalai.rst (modified, +11 −9)

```diff
 colossalai
 ==========
 
-.. automodule:: colossalai
-   :members:
-
+.. toctree::
+   :maxdepth: 2
+
+   colossalai.constants
+   colossalai.core
+   colossalai.initialize
+
 .. toctree::
    :maxdepth: 2
 
+   colossalai.amp
    colossalai.builder
    colossalai.communication
    colossalai.context
...
@@ -16,11 +22,7 @@ colossalai
    colossalai.registry
    colossalai.trainer
    colossalai.utils
+   colossalai.zero
-
-.. toctree::
-   :maxdepth: 2
-
-   colossalai.constants
-   colossalai.core
-   colossalai.initialize
+
+.. automodule:: colossalai
+   :members:
```
docs/colossalai/colossalai.trainer.rst (modified, +4 −3)

```diff
 colossalai.trainer
 ==================
 
-.. automodule:: colossalai.trainer
-   :members:
-
 .. toctree::
    :maxdepth: 2
...
@@ -14,3 +11,7 @@ colossalai.trainer
    :maxdepth: 2
 
    colossalai.trainer.metric
+
+.. automodule:: colossalai.trainer
+   :members:
```
docs/colossalai/colossalai.nn.optimizer.fp16_optimizer.rst → docs/colossalai/colossalai.utils.data_sampler.rst (renamed)

```diff
-colossalai.nn.optimizer.fp16\_optimizer
-=======================================
+colossalai.utils.data\_sampler
+==============================
 
-.. automodule:: colossalai.nn.optimizer.fp16_optimizer
+.. automodule:: colossalai.utils.data_sampler
    :members:
```
docs/colossalai/colossalai.utils.gradient_accumulation.rst (new, 0 → 100644)

```rst
colossalai.utils.gradient\_accumulation
=======================================

.. automodule:: colossalai.utils.gradient_accumulation
   :members:
```
docs/colossalai/colossalai.nn.multi_tensor_apply.rst → docs/colossalai/colossalai.utils.multi_tensor_apply.rst (renamed)

```diff
-colossalai.nn.multi\_tensor\_apply
-==================================
+colossalai.utils.multi\_tensor\_apply
+=====================================
 
-.. automodule:: colossalai.nn.multi_tensor_apply
+.. automodule:: colossalai.utils.multi_tensor_apply.multi_tensor_apply
    :members:
-
-.. toctree::
-   :maxdepth: 2
-
-   colossalai.nn.multi_tensor_apply.multi_tensor_apply
```
docs/colossalai/colossalai.utils.rst (modified, +7 −4)

```diff
 colossalai.utils
 ================
 
-.. automodule:: colossalai.utils
-   :members:
-
 .. toctree::
    :maxdepth: 2
...
@@ -12,5 +8,12 @@ colossalai.utils
    colossalai.utils.checkpointing
    colossalai.utils.common
    colossalai.utils.cuda
+   colossalai.utils.data_sampler
+   colossalai.utils.gradient_accumulation
    colossalai.utils.memory
+   colossalai.utils.multi_tensor_apply
    colossalai.utils.timer
+
+.. automodule:: colossalai.utils
+   :members:
```
docs/colossalai/colossalai.zero.rst (new, 0 → 100644)

```rst
colossalai.zero
===============

.. automodule:: colossalai.zero
   :members:
```
docs/config.md (modified, +9 −0)

```diff
@@ -18,6 +18,15 @@ fp16 = dict(
   initial_scale=2 ** 8
 )
 
+# optional
+# configuration for zero
+# you can refer to the Zero Redundancy optimizer and zero offload section for details
+# https://www.colossalai.org/zero.html
+zero = dict(
+  level=<int>,
+  ...
+)
+
 # optional
 # if you are using complex gradient handling
 # otherwise, you do not need this in your config file
```
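For illustration, a Colossal-AI config file like the one in this hunk is plain Python that defines dictionaries, which the framework then reads as a namespace. The following is a minimal sketch under that assumption; `load_config` and `CONFIG_SOURCE` are hypothetical stand-ins, not Colossal-AI's actual loader:

```python
# A sketch of a Colossal-AI-style config: plain Python variables executed
# into a namespace. load_config is a simplified stand-in for a real config
# loader, written for illustration only.
import types

CONFIG_SOURCE = """
fp16 = dict(initial_scale=2 ** 8)

# optional: ZeRO configuration (level is an int, e.g. 2 or 3)
zero = dict(level=3)
"""

def load_config(source: str) -> types.SimpleNamespace:
    # execute the config text in an isolated namespace, as a loader
    # typically would via exec() or importlib, then expose attributes
    namespace: dict = {}
    exec(source, namespace)
    return types.SimpleNamespace(
        **{k: v for k, v in namespace.items() if not k.startswith("__")}
    )

config = load_config(CONFIG_SOURCE)
print(config.fp16["initial_scale"])  # 256
print(config.zero["level"])          # 3
```

Because the config is ordinary Python, values like `initial_scale=2 ** 8` are evaluated expressions, and optional sections such as `zero` are simply absent attributes when not defined.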
docs/installation.md (modified, +5 −7)

````diff
 # Setup
 
-## Install with pip
+### PyPI
 
 ```bash
 pip install colossalai
 ```
 
-## Install from source
+### Install From Source (Recommended)
+
+> We **recommend** you to install from source as the Colossal-AI is updating frequently in the early versions. The documentation will be in line with the main branch of the repository. Feel free to raise an issue if you encounter any problem. :)
 
 ```shell
-git clone git@github.com:hpcaitech/ColossalAI.git
+git clone https://github.com/hpcaitech/ColossalAI.git
 cd ColossalAI
 # install dependency
 pip install -r requirements/requirements.txt
...
@@ -22,8 +24,4 @@ Install and enable CUDA kernel fusion (compulsory installation when using fused
 ```shell
 pip install -v --no-cache-dir --global-option="--cuda_ext" .
-
-# install with editable enabled
-pip install -v --no-cache-dir --global-option="--cuda_ext" -e .
 ```
````
docs/run_demo.md (modified, +64 −23)

````diff
@@ -7,51 +7,92 @@ can also run on systems with only one GPU. Quick demos showing how to use Coloss
 ## Single GPU
 
 Colossal-AI can be used to train deep learning models on systems with only one GPU and achieve baseline
-performances. [Here](https://colab.research.google.com/drive/1fJnqqFzPuzZ_kn1lwCpG2nh3l2ths0KE?usp=sharing#scrollTo=cQ_y7lBG09LS)
-is an example showing how to train a LeNet model on the CIFAR10 dataset using Colossal-AI.
+performances. We provide an example of training ResNet on the CIFAR10 dataset with only one GPU. You can find this
+example in `examples/resnet_cifar10_data_parallel` in the repository; detailed instructions are in its `README.md`.
 
 ## Multiple GPUs
 
 Colossal-AI can be used to train deep learning models on distributed systems with multiple GPUs and accelerate the
 training process drastically by applying efficient parallelization techniques, which will be elaborated in
-the [Parallelization](parallelization.md) section below. Run the code below on your distributed system with 4 GPUs,
-where `HOST` is the IP address of your system. Note that we use the [Slurm](https://slurm.schedmd.com/documentation.html)
-job scheduling system here.
-
-```bash
-HOST=xxx.xxx.xxx.xxx srun ./scripts/slurm_dist_train.sh ./examples/run_trainer.py ./configs/vit/vit_2d.py
-```
+the [Parallelization](parallelization.md) section below.
+
+You can turn the ResNet example mentioned above into multi-GPU training by setting `--nproc_per_node` to the number of
+GPUs on your system. We also provide a Vision Transformer example which relies on training with more GPUs; see
+`examples/vit_b16_imagenet_data_parallel`, which likewise has a detailed `README.md`.
 
 ## Sample Training Script
 
-`./configs/vit/vit_2d.py` is a config file, which is introduced in the [Config file](config.md) section below. These
-config files are used by Colossal-AI to define all kinds of training arguments, such as the model, dataset and training
-method (optimizer, lr_scheduler, epoch, etc.). Config files are highly customizable and can be modified so as to train
-different models.
-
-`./examples/run_trainer.py` contains a standard training script and is presented below, it reads the config file and
-realizes the training process.
+Below is a typical way of training a model with Colossal-AI:
 
 ```python
 import colossalai
-from colossalai.core import global_context as gpc
+from colossalai.amp import AMP_TYPE
 from colossalai.logging import get_dist_logger
-from colossalai.trainer import Trainer
+from colossalai.trainer import Trainer, hooks
+from colossalai.utils import get_dataloader
+
+CONFIG = dict(
+    parallel=dict(pipeline=1, tensor=1, mode=None),
+    fp16=dict(mode=AMP_TYPE.TORCH),
+    gradient_accumulation=4,
+    clip_grad_norm=1.0
+)
 
 def run_trainer():
-    engine, train_dataloader, test_dataloader = colossalai.initialize()
+    parser = colossalai.get_default_parser()
+    args = parser.parse_args()
+    colossalai.launch(config=CONFIG,
+                      rank=args.rank,
+                      world_size=args.world_size,
+                      host=args.host,
+                      port=args.port,
+                      backend=args.backend)
+
     logger = get_dist_logger()
-    logger.info("engine is built", ranks=[0])
+
+    # instantiate your components
+    model = MyModel()
+    optimizer = MyOptimizer(model.parameters(), ...)
+    train_dataset = TrainDataset()
+    test_dataset = TestDataset()
+    train_dataloader = get_dataloader(train_dataset, ...)
+    test_dataloader = get_dataloader(test_dataset, ...)
+    lr_scheduler = MyScheduler()
+    logger.info("components are built")
+
+    engine, train_dataloader, test_dataloader, lr_scheduler = colossalai.initialize(
+        model, optimizer, criterion, train_dataloader, test_dataloader, lr_scheduler
+    )
 
     trainer = Trainer(engine=engine, verbose=True)
     logger.info("trainer is built", ranks=[0])
 
     logger.info("start training", ranks=[0])
+    hook_list = [
+        hooks.LossHook(),
+        hooks.LRSchedulerHook(lr_scheduler=lr_scheduler, by_epoch=False),
+        hooks.AccuracyHook(),
+        hooks.TensorboardHook(log_dir='./tb_logs', ranks=[0]),
+        hooks.LogMetricByEpochHook(logger),
+        hooks.LogMemoryByEpochHook(logger),
+        hooks.SaveCheckpointHook(checkpoint_dir='./ckpt')
+    ]
+
     trainer.fit(
         train_dataloader=train_dataloader,
         test_dataloader=test_dataloader,
-        epochs=gpc.config.num_epochs,
-        hooks_cfg=gpc.config.hooks,
+        epochs=NUM_EPOCH,
+        hooks=hook_list,
         display_progress=True,
        test_interval=2
     )
 ```
````
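The hook-based training loop introduced by the updated script can be sketched in isolation. `SimpleTrainer`, `Hook`, and `LossPrintHook` below are simplified stand-ins invented for illustration, not the real `colossalai.trainer` API; they only mirror the pattern of registering hooks and firing them at fixed points inside `fit`:

```python
# Minimal illustration of a hook-based training loop. These classes are
# simplified stand-ins, NOT the actual colossalai.trainer API.
class Hook:
    # a hook exposes callbacks that the trainer fires at fixed points
    def before_epoch(self, trainer): pass
    def after_step(self, trainer, loss): pass
    def after_epoch(self, trainer): pass

class LossPrintHook(Hook):
    def after_epoch(self, trainer):
        print(f"epoch {trainer.epoch}: last loss {trainer.last_loss:.3f}")

class SimpleTrainer:
    def __init__(self, hooks):
        self.hooks = hooks
        self.epoch = 0
        self.last_loss = None

    def fit(self, batches, epochs):
        for self.epoch in range(epochs):
            for h in self.hooks:
                h.before_epoch(self)
            for batch in batches:
                # dummy "loss": mean of the batch, standing in for a real step
                self.last_loss = float(sum(batch)) / len(batch)
                for h in self.hooks:
                    h.after_step(self, self.last_loss)
            for h in self.hooks:
                h.after_epoch(self)

trainer = SimpleTrainer(hooks=[LossPrintHook()])
trainer.fit(batches=[[1.0, 2.0], [3.0, 5.0]], epochs=2)
```

The design point mirrored here is that logging, LR scheduling, and checkpointing live in hooks rather than in the loop body, so the same `fit` loop serves every experiment.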
docs/zero.md (modified, +1 −0)

```diff
@@ -19,6 +19,7 @@ Below are a few examples of ZeRO-3 configurations.
 ### Example of ZeRO-3 Configurations
 
 You can refer to the [DeepSpeed configuration](https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training) for details.
+Here we use `Adam` as the initial optimizer.
 
 1. Use ZeRO to partition the optimizer states, gradients (level 2), and parameters (level 3).
```
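As a back-of-the-envelope illustration of what the ZeRO levels mentioned above buy: assuming fp16 parameters and gradients (2 bytes each) and fp32 Adam momentum, variance, and master weights (12 bytes per parameter, the accounting used in the ZeRO paper), the per-GPU memory under each level can be sketched as follows. `zero_bytes_per_gpu` is a hypothetical helper written for this note, not part of Colossal-AI:

```python
# Rough per-GPU memory accounting for ZeRO with Adam: fp16 params/grads
# (2 bytes each) and fp32 Adam states + master weights (12 bytes per
# parameter). This mirrors the standard ZeRO-paper accounting, not
# Colossal-AI's exact implementation. (Assumed helper, for illustration.)
def zero_bytes_per_gpu(num_params: int, world_size: int, level: int) -> int:
    params, grads, optim = 2 * num_params, 2 * num_params, 12 * num_params
    if level >= 1:            # level 1: partition optimizer states
        optim //= world_size
    if level >= 2:            # level 2: also partition gradients
        grads //= world_size
    if level >= 3:            # level 3: also partition parameters
        params //= world_size
    return params + grads + optim

n, gpus = 1_000_000_000, 8    # 1B parameters across 8 GPUs
baseline = zero_bytes_per_gpu(n, gpus, level=0)
for lvl in (1, 2, 3):
    b = zero_bytes_per_gpu(n, gpus, lvl)
    print(f"level {lvl}: {b / 1e9:.2f} GB per GPU ({baseline / b:.1f}x saving)")
```

Under these assumptions a 1B-parameter Adam setup drops from 16 GB per GPU to 2 GB per GPU at level 3 on 8 GPUs; exact savings depend on the implementation and on offloading.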