# Features in LiBai

LiBai provides many features out of the box. This section shows how to configure them step by step.

## Automatic Mixed Precision Training

AMP stands for automatic mixed precision training. To enable AMP in LiBai, set `train.amp.enabled=True` in your configuration file.

### Usage

```python
# import config
from .common.train import train

# get config
from libai.config import get_config
train = get_config("common/train.py").train

# enable amp
train.amp.enabled = True

# disable amp
train.amp.enabled = False
```

## Gradient Clipping

Gradient clipping is a technique that tackles exploding gradients. The idea is simple: if the gradient gets too large, it is rescaled down. LiBai supports gradient clipping in a convenient way, so you don't have to implement it yourself; you only need to add a few settings to your configuration file.

**Note:** We do not recommend implementing gradient clipping yourself, because naive gradient clipping may fail when using tensor parallel or pipeline parallel.

### Usage

```python
# import config
from .common.optim import optim

# get config
from libai.config import get_config
optim = get_config("common/optim.py").optim

# enable gradient clipping
optim.params.clip_grad_max_norm = 1.0
optim.params.clip_grad_norm_type = 2.0

# disable gradient clipping
optim.params.clip_grad_max_norm = None
optim.params.clip_grad_norm_type = None
```

The meaning of `clip_grad_max_norm` and `clip_grad_norm_type` can be found in the [OneFlow API docs](https://oneflow.readthedocs.io/en/master/nn.html#oneflow.nn.utils.clip_grad_norm_).

## Gradient Accumulation

Gradient accumulation is a common strategy for training large-scale models when memory becomes the bottleneck. This technique splits a mini-batch into several micro-batches and runs the normal forward and backward passes on each micro-batch; the model is only updated after the gradients of all micro-batches have been accumulated. Besides, when training with pipeline parallel, gradient accumulation lets different stages execute different micro-batches in parallel, so the computation of the stages can be overlapped.

### Usage

```python
# import config
from .common.train import train

# get config
from libai.config import get_config
train = get_config("common/train.py").train

# enable grad accumulation for 4 steps
train.num_accumulation_steps = 4

# disable grad accumulation
train.num_accumulation_steps = None
```
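For intuition, here is a minimal eager-mode sketch of what gradient accumulation does conceptually. The toy model, data, and optimizer below are illustrative only, not LiBai internals; when `train.num_accumulation_steps` is set, LiBai performs the equivalent automatically inside its training graph.

```python
import oneflow as flow
from oneflow import nn

# toy model, loss, and optimizer (illustrative, not LiBai internals)
model = nn.Linear(16, 1)
loss_fn = nn.MSELoss()
optimizer = flow.optim.SGD(model.parameters(), lr=0.1)

num_accumulation_steps = 4
mini_batch_x = flow.randn(8, 16)  # one mini-batch of 8 samples
mini_batch_y = flow.randn(8, 1)

optimizer.zero_grad()
# split the mini-batch into micro-batches and accumulate their gradients
for micro_x, micro_y in zip(mini_batch_x.chunk(num_accumulation_steps),
                            mini_batch_y.chunk(num_accumulation_steps)):
    loss = loss_fn(model(micro_x), micro_y)
    # scale the loss so the accumulated gradient matches the full mini-batch gradient
    (loss / num_accumulation_steps).backward()

optimizer.step()       # single parameter update after all micro-batches
optimizer.zero_grad()
```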
## Activation Checkpointing

To reduce GPU memory usage when deploying a large model to a training system, LiBai supports [activation checkpointing](https://arxiv.org/abs/1604.06174). LiBai uses a Transformer layer as the unit of checkpointing: the activation size bloats in the middle of a Transformer layer, so checkpointing only the input of each Transformer layer is storage-efficient.

Activation checkpointing is implemented by `set_activation_checkpoint` in `GraphBase`, so models built with `libai.layers.TransformerLayer` support it by default. If you want to enable activation checkpointing for customized layers, override this function:

```python
# `nn` here is `oneflow.nn`; `TransformerLayer` is `libai.layers.TransformerLayer`
def set_activation_checkpoint(self):
    for module_block in self.model.modules():
        # checkpoint the input of every TransformerLayer in the graph
        if isinstance(module_block.to(nn.Module), TransformerLayer):
            module_block.to(nn.graph.GraphModule).activation_checkpointing = True
```

### Usage

```python
# import config
from .common.train import train

# get config
from libai.config import get_config
train = get_config("common/train.py").train

# enable activation checkpointing
train.activation_checkpoint.enabled = True

# disable activation checkpointing
train.activation_checkpoint.enabled = False
```

## ZeRO

Unlike normal data parallelism, where model states and gradients are replicated across data-parallel processes, the Zero Redundancy Optimizer (ZeRO) partitions them across data-parallel processes, which can significantly reduce the memory footprint.

- Level 1: The optimizer states (e.g., for the Adam optimizer, the 32-bit weights and the first and second moment estimates) are partitioned across the processes, so that each process only updates its own partition.
- Level 2: The reduced 32-bit gradients for updating the model weights are also partitioned, so that each process retains only the gradients corresponding to its portion of the optimizer states.

> **Note:** ZeRO only supports data parallel and pipeline parallel, or the combination of the two. If you use tensor parallel in your training, make sure ZeRO is disabled.

### Usage

```python
# import config
from .common.train import train

# get config
from libai.config import get_config
train = get_config("common/train.py").train

# enable zero
train.zero_optimization.enabled = True

# enable zero for level-1
train.zero_optimization.stage = 1

# enable zero for level-2
train.zero_optimization.stage = 2

# disable zero
train.zero_optimization.enabled = False
```
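The settings above can be combined in a single configuration file. The following is a minimal sketch that puts several of them together, using only the options shown in this section; the concrete values are examples, not tuned recommendations.

```python
from libai.config import get_config

train = get_config("common/train.py").train
optim = get_config("common/optim.py").optim

train.amp.enabled = True                    # mixed precision
train.activation_checkpoint.enabled = True  # activation checkpointing
train.num_accumulation_steps = 4            # gradient accumulation
train.zero_optimization.enabled = True      # ZeRO (data/pipeline parallel only)
train.zero_optimization.stage = 1

optim.params.clip_grad_max_norm = 1.0       # gradient clipping
optim.params.clip_grad_norm_type = 2.0
```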