## 🔥 Multi-dimensional Hybrid Parallel with Vision Transformer
1. Go to the **hybrid_parallel** folder in the **tutorial** directory.
2. Install our model zoo.
```bash
pip install titans
```
3. Run with synthetic data, which has a shape similar to CIFAR10, by passing the `-s` flag.
```bash
colossalai run --nproc_per_node 4 train.py --config config.py -s
```
4. Modify the config file to try different types of tensor parallelism. For example, change the tensor parallel size to 4 and the mode to `2d`, then run on 8 GPUs.
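
The change described in step 4 can be sketched as follows. This assumes `config.py` declares parallelism through a `parallel` dictionary, as in Colossal-AI's legacy config format. The constant names and the `pipeline=2` value are assumptions, so check them against the actual file in the tutorial folder. The arithmetic at the end is only a sanity check that the layout fits on 8 GPUs and is not part of the config itself.

```python
# Hypothetical excerpt of config.py; names are assumptions, not the tutorial's exact file.
TENSOR_PARALLEL_SIZE = 4      # number of GPUs each tensor is split across
TENSOR_PARALLEL_MODE = "2d"   # 2D tensor parallelism needs a square GPU count (4 = 2 x 2)
PIPELINE_SIZE = 2             # assumed number of pipeline stages

parallel = dict(
    pipeline=PIPELINE_SIZE,
    tensor=dict(mode=TENSOR_PARALLEL_MODE, size=TENSOR_PARALLEL_SIZE),
)

# Sanity check: world size = data parallel size x pipeline size x tensor size.
WORLD_SIZE = 8
gpus_per_replica = parallel["pipeline"] * parallel["tensor"]["size"]
assert WORLD_SIZE % gpus_per_replica == 0, "layout does not fit the GPU count"
data_parallel_size = WORLD_SIZE // gpus_per_replica
print(data_parallel_size)  # prints 1
```

With a layout like this, the launch command would presumably be the same as in step 3 with `--nproc_per_node 8`.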
## ☀️ Sequence Parallel with BERT
1. Go to the **sequence_parallel** folder in the **tutorial** directory.
2. Run with the following command:
```bash
export PYTHONPATH=$PWD
colossalai run --nproc_per_node 4 train.py -s
```
3. The default config uses sequence parallel size = 2 and pipeline size = 1. Change the pipeline size to 2 and run it again.
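
A minimal sketch of the edit in step 3, assuming the tutorial's config expresses parallelism as a `parallel` dictionary and selects sequence parallelism through a `"sequence"` tensor mode. Both the layout and the constant names are assumptions, so compare them with the config file in the folder before editing.

```python
# Hypothetical config excerpt; the dictionary layout and the "sequence" mode name are assumptions.
SEQUENCE_PARALLEL_SIZE = 2  # default stated in the tutorial
PIPELINE_SIZE = 2           # changed from the default of 1

parallel = dict(
    pipeline=PIPELINE_SIZE,
    tensor=dict(size=SEQUENCE_PARALLEL_SIZE, mode="sequence"),
)

# With --nproc_per_node 4, the remaining factor is the data parallel size.
NUM_GPUS = 4
data_parallel_size = NUM_GPUS // (PIPELINE_SIZE * SEQUENCE_PARALLEL_SIZE)
print(data_parallel_size)  # prints 1
```

With 4 GPUs, two pipeline stages and a sequence parallel size of 2 use every device, so no GPUs are left over for additional data parallel replicas.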
## 📕 Large Batch Optimization with LARS and LAMB
1. Go to the **large_batch_optimizer** folder in the **tutorial** directory.
2. Run with synthetic data.
```bash
colossalai run --nproc_per_node 4 train.py --config config.py -s
```
## 😀 Auto-Parallel Tutorial
1. Go to the **auto_parallel** folder in the **tutorial** directory.
2. Install `pulp` and `coin-or-cbc` for the solver.
```bash
pip install pulp
conda install -c conda-forge coin-or-cbc
```
3. Run the auto-parallel ResNet example on 4 GPUs with a synthetic dataset.
```bash
colossalai run --nproc_per_node 4 auto_parallel_with_resnet.py -s
```
You should expect to see a log that shows the edge cost on the computation graph as well as the sharding strategy for each operation. For example, `layer1_0_conv1 S01R = S01R X RR` means that the first dimension (batch) of the input and output is sharded while the weight is not sharded (S means sharded, R means replicated), which is equivalent to data parallel training.
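
To make the sharding notation concrete, here is a small illustrative parser. It is not part of Colossal-AI. It assumes that each `S` is followed by the device-mesh axes that a tensor dimension is split over and that `R` marks a replicated dimension. The helper names, the 2 x 2 mesh, and the two-dimensional tensor shapes are hypothetical simplifications.

```python
import re


def parse_sharding_spec(spec):
    """Return, for each tensor dimension, the tuple of mesh axes it is sharded over.

    An empty tuple means the dimension is replicated (R).
    """
    dims = []
    for token in re.findall(r"S\d+|R", spec):
        if token == "R":
            dims.append(())
        else:
            dims.append(tuple(int(ch) for ch in token[1:]))
    return dims


def local_shape(global_shape, spec, mesh_shape):
    """Shape of the shard held by one device, given the device mesh shape."""
    shape = []
    for size, axes in zip(global_shape, parse_sharding_spec(spec)):
        for axis in axes:
            size //= mesh_shape[axis]
        shape.append(size)
    return tuple(shape)


# Example: 4 GPUs arranged as a 2 x 2 device mesh (an assumed layout).
mesh_shape = (2, 2)
print(parse_sharding_spec("S01R"))                 # [(0, 1), ()]
print(local_shape((128, 64), "S01R", mesh_shape))  # (32, 64): batch split over all 4 GPUs
print(local_shape((64, 64), "RR", mesh_shape))     # (64, 64): weight fully replicated
```

Under these assumptions, `S01R = S01R X RR` splits only the batch dimension across all devices while every device keeps a full copy of the weight, which is the data parallel pattern described above.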
This shows that, given different memory budgets, the model is automatically injected with activation checkpoints, along with the time taken per iteration. You can run this benchmark for GPT as well, but it can take much longer since the model is larger.
```bash
python auto_ckpt_solver_test.py --model gpt2
```
4. Run a simple benchmark to find the optimal batch size for the checkpointed model.