Unverified Commit b9a8dff7 authored by digger-yu, committed by GitHub

[doc] Fix typo under colossalai and doc (#3618)

* Fixed several spelling errors under colossalai

* Fix the spelling errors in the colossalai and docs directories

* Cautiously changed the spelling errors under the example folder

* Update runtime_preparation_pass.py

revert autograft to autograd

* Update search_chunk.py

utile to until

* Update check_installation.py

change misteach to mismatch in line 91

* Update 1D_tensor_parallel.md

revert to perceptron

* Update 2D_tensor_parallel.md

revert to perceptron in line 73

* Update 2p5D_tensor_parallel.md

revert to perceptron in line 71

* Update 3D_tensor_parallel.md

revert to perceptron in line 80

* Update README.md

revert to resnet in line 42

* Update reorder_graph.py

revert to indice in line 7

* Update p2p.py

revert to megatron in line 94

* Update initialize.py

revert to torchrun in line 198

* Update routers.py

change to detailed in line 63

* Update routers.py

change to detailed in line 146

* Update README.md

revert to random number in line 402
parent e1b0a78a
@@ -28,7 +28,7 @@ gradient_accumulation = <int>
 ## Hands-on Practice
 We provide a [runnable example](https://github.com/hpcaitech/ColossalAI-Examples/tree/main/features/gradient_accumulation)
-to demonstrate gradient accumulation. In this example, we set the gradinet accumulation size to be 4. You can run the script using this command:
+to demonstrate gradient accumulation. In this example, we set the gradient accumulation size to be 4. You can run the script using this command:
 ```shell
 python -m torch.distributed.launch --nproc_per_node 1 --master_addr localhost --master_port 29500 run_resnet_cifar10_with_engine.py
...
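For readers following the hunk above, the `gradient_accumulation = <int>` attribute shown in the hunk context is set in the ColossalAI config file. A minimal sketch of such a config is below; the batch size and epoch count are illustrative assumptions, and only the accumulation size of 4 comes from the example referenced in the diff.

```python
# config.py -- illustrative sketch; BATCH_SIZE and NUM_EPOCHS are assumptions
BATCH_SIZE = 128
NUM_EPOCHS = 200

# accumulate gradients from 4 micro-batches before each optimizer step,
# matching the accumulation size used in the runnable example above
gradient_accumulation = 4
```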
@@ -101,7 +101,7 @@ you can use `colossalai.amp.convert_to_amp`.
 ```python
 from colossalai.amp import AMP_TYPE
-# exmaple of using torch amp
+# example of using torch amp
 model, optimizer, criterion = colossalai.amp.convert_to_amp(model,
 optimizer,
 criterion,
@@ -220,7 +220,7 @@ The default parameters of Naive AMP:
 - initial_scale(int): initial scale of gradient scaler
 - growth_factor(int): the growth rate of loss scale
 - backoff_factor(float): the decrease rate of loss scale
-- hysterisis(int): delay shift in dynamic loss scaling
+- hysteresis(int): delay shift in dynamic loss scaling
 - max_scale(int): maximum loss scale allowed
 - verbose(bool): if set to `True`, will print debug info
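To complement the parameter list above, here is a sketch of how these Naive AMP defaults could be overridden in a ColossalAI config; the `fp16 = dict(...)` style is assumed from the `AMP_TYPE` import shown earlier in this diff, and the numeric values are illustrative rather than recommended settings.

```python
# config.py -- illustrative sketch; values are assumptions, not tuned recommendations
from colossalai.amp import AMP_TYPE

fp16 = dict(
    mode=AMP_TYPE.NAIVE,
    initial_scale=2**15,   # initial scale of the gradient scaler
    growth_factor=2,       # growth rate of the loss scale
    backoff_factor=0.5,    # decrease rate of the loss scale
    hysteresis=2,          # delay shift in dynamic loss scaling
    max_scale=2**32,       # maximum loss scale allowed
    verbose=False,         # print debug info when True
)
```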
@@ -292,7 +292,7 @@ colossalai.launch_from_torch(config=args.config)
 ### Step 4. Create training components
 Build your model, optimizer, loss function, lr scheduler and dataloaders. Note that the root path of the dataset is
-obtained from the environment varialbe `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
+obtained from the environment variable `DATA`. You may `export DATA=/path/to/data` or change `Path(os.environ['DATA'])`
 to a path on your machine. Data will be automatically downloaded to the root path.
 ```python
@@ -326,7 +326,7 @@ to a path on your machine. Data will be automatically downloaded to the root pat
 # build loss
 criterion = torch.nn.CrossEntropyLoss()
-# lr_scheduelr
+# lr_scheduler
 lr_scheduler = LinearWarmupLR(optimizer, warmup_steps=50, total_steps=gpc.config.NUM_EPOCHS)
 ```
...
@@ -57,7 +57,7 @@ It's compatible with all parallel methods in ColossalAI.
 Let's start from two simple examples -- training GPT with different methods. These examples relies on `transformers`.
-We should install denpendencies first:
+We should install dependencies first:
 ```shell
 pip install psutil transformers
@@ -99,7 +99,7 @@ class GPTLMLoss(nn.Module):
 shift_labels.view(-1))
 ```
-And we define some utility functions, which generates random data, computes the number of paramters of a model and get memory usage of current process:
+And we define some utility functions, which generates random data, computes the number of parameters of a model and get memory usage of current process:
 ```python
 def get_data(batch_size: int, seq_len: int,
@@ -251,7 +251,7 @@ Time: 3.691 s
 Mem usage: 5298.344 MB
 ```
-NVME offload saves about 294 MB memory. Note that enabling `pin_memory` of Gemini can accelerate training but increase memory usage. So this result also meets our expectation. If we disable `pin_memory`, we can aslo observe a memory usage drop about 900 MB.
+NVME offload saves about 294 MB memory. Note that enabling `pin_memory` of Gemini can accelerate training but increase memory usage. So this result also meets our expectation. If we disable `pin_memory`, we can also observe a memory usage drop about 900 MB.
 ## API Reference
...
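For context on the memory figures in the hunk above, here is a hedged sketch of turning on NVMe offload for the optimizer states; the `HybridAdam` arguments `nvme_offload_fraction` and `nvme_offload_dir` are assumptions about the API and should be verified against the installed ColossalAI version.

```python
# Illustrative sketch; argument names are assumptions, check your ColossalAI version.
import torch
from colossalai.nn.optimizer import HybridAdam

model = torch.nn.Linear(1024, 1024)  # placeholder model standing in for the GPT model
optimizer = HybridAdam(
    model.parameters(),
    lr=1e-3,
    nvme_offload_fraction=1.0,       # fraction of optimizer states placed on NVMe
    nvme_offload_dir="./nvme_data",  # directory on the NVMe device
)
```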
@@ -32,11 +32,11 @@ and the first and second momentum estimates) are partitioned across the processe
 3. **Shard Parameter**: The 16-bit model parameters are partitioned across the processes of a data parallel group.
-4. **[Gemini](../advanced_tutorials/meet_gemini.md)**: Dynamic heterogeneous memory space manager for paramters, gradients and optimizer states.
+4. **[Gemini](../advanced_tutorials/meet_gemini.md)**: Dynamic heterogeneous memory space manager for parameters, gradients and optimizer states.
 Besides, this article will introduce the Zero Redundancy Optimizer with chunk-based memory management.
-When using ZeRO, we distributed the model by sharding the parameters. The advantage of this method is that the memory of each node is load balanced. But this approach has two significiant disadvantages. First, during communication, a temporary memory buffer needs to be allocated and released afterwards, leading to the memory fragmentation problem. Secondly, using tensor as the granularity for communication will cause the network bandwidth underutilized. Generally, the longer the transmitted message length, the higher the bandwidth utilization.
+When using ZeRO, we distributed the model by sharding the parameters. The advantage of this method is that the memory of each node is load balanced. But this approach has two significant disadvantages. First, during communication, a temporary memory buffer needs to be allocated and released afterwards, leading to the memory fragmentation problem. Secondly, using tensor as the granularity for communication will cause the network bandwidth underutilized. Generally, the longer the transmitted message length, the higher the bandwidth utilization.
 Using the Chunk mechanism introduced in ColossalAI v0.1.8, we can improve the efficiency of ZeRO. We store a continuous set of parameters in initialization order into a Chunk (a chunk is a continuous memory space), and each Chunk has the same size. Organizing memory in chunks can lead to efficient use of network bandwidth between PCI-e and GPU-GPU, reduce the number of communications, and avoid potential memory fragmentation.
...
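To make the Chunk idea above concrete, here is a toy sketch in plain PyTorch (not the library's actual ChunkManager) that packs parameters into equal-sized flat buffers in initialization order; the chunk size and model below are placeholders chosen for illustration.

```python
# Toy illustration of chunk-based grouping; not ColossalAI's implementation.
import torch

def pack_into_chunks(params, chunk_numel):
    """Flatten tensors into fixed-size buffers in the order they are given."""
    chunks, current, used = [], torch.zeros(chunk_numel), 0
    for p in params:
        n = p.numel()
        assert n <= chunk_numel, "toy sketch: a tensor larger than one chunk is not handled"
        if used + n > chunk_numel:          # current chunk is full, start a new one
            chunks.append(current)
            current, used = torch.zeros(chunk_numel), 0
        current[used:used + n] = p.detach().reshape(-1)
        used += n
    chunks.append(current)
    return chunks  # each chunk can be moved or communicated as one large message

# Example: parameters of a small model packed into 1 MiB (262144 float32) chunks.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.Linear(512, 512))
chunks = pack_into_chunks(model.parameters(), chunk_numel=262_144)
print(f"{len(chunks)} chunks of {chunks[0].numel()} elements each")
```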
@@ -13,7 +13,7 @@ from datasets import load_dataset
 def make_multi_folder_data(paths, caption_files=None, **kwargs):
 """Make a concat dataset from multiple folders
-Don't suport captions yet
+Don't support captions yet
 If paths is a list, that's ok, if it's a Dict interpret it as:
 k=folder v=n_times to repeat that
 """
...
@@ -40,7 +40,7 @@ class DataLoaderX(DataLoader):
 # A custom data loader class that inherits from DataLoader
 def __iter__(self):
 # Overriding the __iter__ method of DataLoader to return a BackgroundGenerator
-#This is to enable data laoding in the background to improve training performance
+#This is to enable data loading in the background to improve training performance
 return BackgroundGenerator(super().__iter__())
@@ -60,7 +60,7 @@ def get_parser(**parser_kwargs):
 # Create an ArgumentParser object with specifies kwargs
 parser = argparse.ArgumentParser(**parser_kwargs)
-# Add vairous command line arguments with their default balues and descriptions
+# Add various command line arguments with their default values and descriptions
 parser.add_argument(
 "-n",
 "--name",
@@ -162,7 +162,7 @@ def get_parser(**parser_kwargs):
 # A function that returns the non-default arguments between two objects
 def nondefault_trainer_args(opt):
-# create an argument parsser
+# create an argument parser
 parser = argparse.ArgumentParser()
 # add pytorch lightning trainer default arguments
 parser = Trainer.add_argparse_args(parser)
@@ -203,7 +203,7 @@ def worker_init_fn(_):
 else:
 return np.random.seed(np.random.get_state()[1][0] + worker_id)
-#Provide functionality for creating data loadedrs based on provided dataset configurations
+#Provide functionality for creating data loaders based on provided dataset configurations
 class DataModuleFromConfig(pl.LightningDataModule):
 def __init__(self,
@@ -255,7 +255,7 @@ class DataModuleFromConfig(pl.LightningDataModule):
 def _train_dataloader(self):
 #Check if the train dataset is iterable
 is_iterable_dataset = isinstance(self.datasets['train'], Txt2ImgIterableBaseDataset)
-#Set the worker initialization function of the dataset isiterable or use_worker_init_fn is True
+#Set the worker initialization function of the dataset is iterable or use_worker_init_fn is True
 if is_iterable_dataset or self.use_worker_init_fn:
 init_fn = worker_init_fn
 else:
@@ -310,7 +310,7 @@ class DataModuleFromConfig(pl.LightningDataModule):
 class SetupCallback(Callback):
-# I nitialize the callback with the necessary parameters
+# Initialize the callback with the necessary parameters
 def __init__(self, resume, now, logdir, ckptdir, cfgdir, config, lightning_config):
 super().__init__()
@@ -371,7 +371,7 @@ class SetupCallback(Callback):
 # trainer.save_checkpoint(ckpt_path)
-# PyTorch Lightning callback for ogging images during training and validation of a deep learning model
+# PyTorch Lightning callback for logging images during training and validation of a deep learning model
 class ImageLogger(Callback):
 def __init__(self,
@@ -379,10 +379,10 @@ class ImageLogger(Callback):
 max_images, # Maximum number of images to log
 clamp=True, # Whether to clamp pixel values to [-1,1]
 increase_log_steps=True, # Whether to increase frequency of log steps exponentially
-rescale=True, # Whetehr to rescale pixel values to [0,1]
+rescale=True, # Whether to rescale pixel values to [0,1]
 disabled=False, # Whether to disable logging
-log_on_batch_idx=False, # Whether to log on baych index instead of global step
-log_first_step=False, # Whetehr to log on the first step
+log_on_batch_idx=False, # Whether to log on batch index instead of global step
+log_first_step=False, # Whether to log on the first step
 log_images_kwargs=None): # Additional keyword arguments to pass to log_images method
 super().__init__()
 self.rescale = rescale
@@ -593,7 +593,7 @@ if __name__ == "__main__":
 parser = Trainer.add_argparse_args(parser)
 opt, unknown = parser.parse_known_args()
-# Veirfy the arguments are both specified
+# Verify the arguments are both specified
 if opt.name and opt.resume:
 raise ValueError("-n/--name and -r/--resume cannot be specified both."
 "If you want to resume training in a new log folder, "
@@ -646,7 +646,7 @@ if __name__ == "__main__":
 # Sets the seed for the random number generator to ensure reproducibility
 seed_everything(opt.seed)
-# Intinalize and save configuratioon using teh OmegaConf library.
+# Initialize and save configuration using teh OmegaConf library.
 try:
 # init and save configs
 configs = [OmegaConf.load(cfg) for cfg in opt.base]
...
@@ -61,7 +61,7 @@ torchrun --nproc_per_node 2 train_dreambooth_colossalai.py \
 - `INSTANCE_DIR` refers to personalized path to instance images, you might need to insert information here.
 - `OUTPUT_DIR` refers to local path to save the trained model, you might need to find a path with enough space.
 - `resolution` refers to the corresponding resolution number of your target model. Note: Change the `resolution` to 768 if you are using the [stable-diffusion-2](https://huggingface.co/stabilityai/stable-diffusion-2) 768x768 model.
-- `placement` refers to the training strategy supported by Colossal AI, defult = 'cuda', which refers to loading all the parameters into cuda memory. On the other hand, 'cpu' refers to 'cpu offload' strategy while 'auto' enables 'Gemini', both featured by Colossal AI.
+- `placement` refers to the training strategy supported by Colossal AI, default = 'cuda', which refers to loading all the parameters into cuda memory. On the other hand, 'cpu' refers to 'cpu offload' strategy while 'auto' enables 'Gemini', both featured by Colossal AI.
 ### Training with prior-preservation loss
...
@@ -40,7 +40,7 @@ We provide two stable solutions.
 One utilizes the Gemini to implement hybrid parallel strategies of Gemini, DDP/ZeRO, and Tensor Parallelism for a huggingface GPT model.
 The other one use [Titans](https://github.com/hpcaitech/Titans), a distributed executed model zoo maintained by ColossalAI,to implement the hybrid parallel strategies of TP + ZeRO + PP.
-We recommend using Gemini to qucikly run your model in a distributed manner.
+We recommend using Gemini to quickly run your model in a distributed manner.
 It doesn't require significant changes to the model structures, therefore you can apply it on a new model easily.
 And use Titans as an advanced weapon to pursue a more extreme performance.
 Titans has included the some typical models, such as Vit and GPT.
...
@@ -27,7 +27,7 @@ pip install transformers
 ## Dataset
-For simplicity, the input data is randonly generated here.
+For simplicity, the input data is randomly generated here.
 ## Training
...
@@ -34,7 +34,7 @@ conda install -c conda-forge coin-or-cbc
 ## Dataset
-For simplicity, the input data is randonly generated here.
+For simplicity, the input data is randomly generated here.
 ## Training
...
@@ -27,7 +27,7 @@ pip install transformers
 ## Dataset
-For simplicity, the input data is randonly generated here.
+For simplicity, the input data is randomly generated here.
 ## Training
...
@@ -163,7 +163,7 @@ def main():
 else:
 init_dev = get_current_device()
-# shard init prameters
+# shard init parameters
 if args.shardinit:
 logger.info("Sharding initialization !", ranks=[0])
 else:
@@ -192,7 +192,7 @@ def main():
 config=config,
 local_files_only=False)
-# enable graident checkpointing
+# enable gradient checkpointing
 model.gradient_checkpointing_enable()
 numel = sum([p.numel() for p in model.parameters()])
...
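Since the last hunk enables gradient checkpointing on a Hugging Face model, a brief standalone sketch of that call follows; `gpt2` is only a placeholder checkpoint, not necessarily the model used in this example.

```python
# Illustrative sketch; "gpt2" is a placeholder checkpoint.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("gpt2")
model.gradient_checkpointing_enable()  # recompute activations in backward to save memory
numel = sum(p.numel() for p in model.parameters())
print(f"model parameters: {numel / 1e6:.1f}M")
```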