Execute the forward and backward passes when using pipeline parallelism.
Returns the loss and/or Huggingface-style model outputs if requested.

Warning: This function is tailored to pipeline parallelism. Do not run the forward/backward pass
in the conventional way (model(input) / loss.backward()) when doing pipeline parallel training
with the booster, as this will cause unexpected errors.
Args:
data_iter (Iterator): The iterator that yields the next batch of data. There are usually two ways to obtain this argument:
1. wrap the dataloader into an iterator: iter(dataloader)
2. get the next batch from the dataloader and wrap that single batch into an iterator: iter([batch])
model (nn.Module): The model on which to execute forward/backward; it should be a model wrapped by a plugin that supports pipeline parallelism.
criterion (Callable[[Any, Any], torch.Tensor]): Criterion to be used. It should take two arguments, the model outputs and the inputs, and return a loss tensor.
'lambda y, x: loss_fn(y)' can turn a normal loss function into a valid two-argument criterion here.
optimizer (Optimizer, optional): The optimizer used for the backward pass. Can be None when only doing forward passes (i.e. evaluation). Defaults to None.
return_loss (bool, optional): Whether to return loss in the dict returned by this method. Defaults to True.
return_output (bool, optional): Whether to return Huggingface style model outputs in the dict returned by this method. Defaults to False.
Returns:
Dict[str, Any]: Output dict in the form of {'loss': ..., 'outputs': ...}.
ret_dict['loss'] is the loss of forward if return_loss is set to True, else None.
ret_dict['outputs'] is the Huggingface style model outputs during forward if return_output is set to True, else None.
"""
assert isinstance(
    self.plugin, PipelinePluginBase
), f"The plugin {self.plugin.__class__.__name__} does not support pipeline."
model (nn.Module or ModelWrapper): A model boosted by Booster.
checkpoint (str): Path to the checkpoint. It must be a local path.
It is a file path if ``shard=False``. Otherwise, it is a directory path.
shard (bool, optional): Whether to save the checkpoint in a sharded way.
If true, the checkpoint will be a folder in the same format as a Huggingface transformers checkpoint; otherwise, it will be a single file. Defaults to False.
gather_dtensor (bool, optional): whether to gather the distributed tensor to the first device. Default: True.
prefix (str, optional): A prefix added to parameter and buffer
names to compose the keys in state_dict. Defaults to None.
size_per_shard (int, optional): Maximum size of checkpoint shard file in MB. This is useful only when ``shard=True``. Defaults to 1024.
use_safetensors (bool, optional): whether to use safe tensors. Default: False. If set to True, the checkpoint will be saved in the safetensors format.
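As a quick, hedged illustration of these options (the checkpoint path is a placeholder):

# Save a sharded, Huggingface-style checkpoint; each shard is capped at size_per_shard MB.
booster.save_model(
    model,
    "checkpoints/my_model",  # a directory path, because shard=True
    shard=True,
    size_per_shard=1024,
    use_safetensors=True,
)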
Precision for mixed precision training in FP16 using apex AMP.
Args:
opt_level (str, optional, default="O1"): Pure or mixed precision optimization level. Accepted values are "O0", "O1", "O2", and "O3", explained in detail in the Apex AMP documentation.
cast_model_type (torch.dtype, optional, default=None): Casts your model’s parameters and buffers to the desired type.
patch_torch_functions (bool, optional, default=None): Patch all Torch functions and Tensor methods to perform Tensor Core-friendly ops like GEMMs and convolutions in FP16, and any ops that benefit from FP32 precision in FP32.
keep_batchnorm_fp32 (bool or str, optional, default=None): To enhance precision and enable cudnn batchnorm (which improves performance), it’s often beneficial to keep batchnorm weights in FP32 even if the rest of the model is FP16.
master_weights (bool, optional, default=None): Maintain FP32 master weights to accompany any FP16 model weights. FP32 master weights are stepped by the optimizer to enhance precision and capture small gradients.
loss_scale (float or str, optional, default=None): If loss_scale is a float value, use this value as the static (fixed) loss scale. If loss_scale is the string "dynamic", adaptively adjust the loss scale over time. Dynamic loss scale adjustments are performed by Amp automatically.
cast_model_outputs (torch.dtype, optional, default=None): Option to ensure that the outputs of your model(s) are always cast to a particular type regardless of opt_level.
num_losses(int, optional, default=1): Option to tell AMP in advance how many losses/backward passes you plan to use. When used in conjunction with the loss_id argument to `amp.scale_loss`, enables Amp to use a different loss scale per loss/backward pass, which can improve stability. If num_losses is left to 1, Amp will still support multiple losses/backward passes, but use a single global loss scale for all of them.
verbosity(int, default=1): Set to 0 to suppress Amp-related output.
min_loss_scale(float, default=None): Sets a floor for the loss scale values that can be chosen by dynamic loss scaling. The default value of None means that no floor is imposed. If dynamic loss scaling is not used, min_loss_scale is ignored.
max_loss_scale(float, default=2.**24 ): Sets a ceiling for the loss scale values that can be chosen by dynamic loss scaling. If dynamic loss scaling is not used, max_loss_scale is ignored.
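A hedged configuration sketch of these apex AMP options; the class name FP16ApexMixedPrecision and its import path follow ColossalAI's mixed-precision module, but should be verified against your installed version:

from colossalai.booster import Booster
from colossalai.booster.mixed_precision import FP16ApexMixedPrecision  # import path assumed

mixed_precision = FP16ApexMixedPrecision(
    opt_level="O1",        # patch torch functions, keep FP32 master weights
    loss_scale="dynamic",  # let apex adjust the loss scale adaptively
    min_loss_scale=None,   # no floor on the dynamic loss scale
    max_loss_scale=2.0**24,
)
booster = Booster(mixed_precision=mixed_precision)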
chunk_init_device (torch.device, optional): device to initialize the chunk.
placement_policy (str, optional): Either "static" or "auto". Defaults to "static".
shard_param_frac (float, optional): fraction of parameters to be sharded. Only for "static" placement.
If `shard_param_frac` is 1.0, it is equivalent to ZeRO-3; if it is 0.0, it is equivalent to ZeRO-2. Defaults to 1.0.
offload_optim_frac (float, optional): fraction of optimizer states to be offloaded. Only for "static" placement.
If `shard_param_frac` is 1.0 and `offload_optim_frac` is 0.0, it is equivalent to the old "cuda" placement. Defaults to 0.0.
offload_param_frac (float, optional): fraction of parameters to be offloaded. Only for "static" placement.
For efficiency, this argument is useful only when `shard_param_frac` is 1.0 and `offload_optim_frac` is 1.0.
If `shard_param_frac`, `offload_optim_frac`, and `offload_param_frac` are all 1.0, it is equivalent to the old "cpu" placement.
When using static placement, we recommend tuning `shard_param_frac` first and then `offload_optim_frac` (a configuration sketch follows this argument list).
Defaults to 0.0.
warmup_non_model_data_ratio (float, optional): ratio of expected non-model data memory during warmup. Only for "auto" placement. Defaults to 0.8.
steady_cuda_cap_ratio (float, optional): ratio of the CUDA memory capacity allowed for model data during the steady state. Only for "auto" placement. Defaults to 0.9.
precision (str, optional): training precision; supports 'fp16' and 'bf16'. Defaults to 'fp16'.
pin_memory (bool, optional): whether to use pinned memory on the CPU. Defaults to False.
force_outputs_fp32 (bool, optional): whether to force model outputs to fp32. Defaults to False.
strict_ddp_mode (bool, optional): whether to use strict DDP mode (data parallelism only, without other forms of parallelism). Defaults to False.
search_range_m (int, optional): chunk size searching range divided by 2^20. Defaults to 32.
hidden_dim (int, optional): the hidden dimension of the model.
Providing this argument speeds up the chunk-size search.
If it is not known before training, that is fine; a default value of 1024 will be used.
min_chunk_size_m (float, optional): the minimum chunk size divided by 2^20.
If the aggregate size of parameters is still smaller than the minimum chunk size,
all parameters will be compacted into one small chunk.
memstats (MemStats, optional): the memory statistics collected by a runtime memory tracer.
gpu_margin_mem_ratio (float, optional): the ratio of GPU memory remaining after the first forward-backward pass
that will be used by the hybrid CPU optimizer.
This argument is meaningless when the `placement_policy` of `GeminiManager` is not "auto".
Defaults to 0.0.
initial_scale (float, optional): Initial scale used by DynamicGradScaler. Defaults to 2**16.
min_scale (float, optional): Min scale used by DynamicGradScaler. Defaults to 1.
growth_factor (float, optional): growth_factor used by DynamicGradScaler. Defaults to 2.
backoff_factor (float, optional): backoff_factor used by DynamicGradScaler. Defaults to 0.5.
...
...
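To make the placement-related arguments above concrete, here is a hedged configuration sketch (argument names follow the docstring above; check them against your installed version):

from colossalai.booster import Booster
from colossalai.booster.plugin import GeminiPlugin

# ZeRO-3-like sharding with half of the optimizer states offloaded to CPU.
plugin = GeminiPlugin(
    placement_policy="static",
    shard_param_frac=1.0,   # 1.0 behaves like ZeRO-3, 0.0 like ZeRO-2
    offload_optim_frac=0.5,
    offload_param_frac=0.0,
    precision="bf16",
    pin_memory=True,
)
booster = Booster(plugin=plugin)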
@@ -152,17 +287,24 @@ class GeminiPlugin(Plugin):
     def __init__(
         self,
-        device: Optional[torch.device] = None,
-        placement_policy: str = "cpu",
+        chunk_config_dict: Optional[dict] = None,
+        chunk_init_device: Optional[torch.device] = None,
+        placement_policy: str = "static",
+        shard_param_frac: float = 1.0,  # only for static placement
+        offload_optim_frac: float = 0.0,  # only for static placement
+        offload_param_frac: float = 0.0,  # only for static placement
+        warmup_non_model_data_ratio: float = 0.8,  # only for auto placement
+        steady_cuda_cap_ratio: float = 0.9,  # only for auto placement
         precision: str = "fp16",
         pin_memory: bool = False,
         force_outputs_fp32: bool = False,
         strict_ddp_mode: bool = False,
-        search_range_mb: int = 32,
+        search_range_m: int = 32,
         hidden_dim: Optional[int] = None,
-        min_chunk_size_mb: float = 32,
+        min_chunk_size_m: float = 32,
         memstats: Optional[MemStats] = None,
         gpu_margin_mem_ratio: float = 0.0,
-        initial_scale: float = 2**32,
+        initial_scale: float = 2**16,
         min_scale: float = 1,
         growth_factor: float = 2,
         backoff_factor: float = 0.5,
...
...
@@ -173,24 +315,31 @@ class GeminiPlugin(Plugin):
        norm_type: float = 2.0,
        verbose: bool = False,
    ) -> None:
        assert dist.is_initialized(
        ), 'torch.distributed is not initialized, please use colossalai.launch to create the distributed environment'
        self.rank = dist.get_rank()
        self.world_size = dist.get_world_size()
        super().__init__()
        assert precision in SUPPORTED_PRECISION, f"precision {precision} is not supported"
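The assertion above requires torch.distributed to be initialized before the plugin is constructed; a minimal launch sketch under torchrun (colossalai.launch_from_torch is the usual entry point, though its exact signature varies across versions):

import colossalai
from colossalai.booster.plugin import GeminiPlugin

# Initialize torch.distributed from the environment variables set by torchrun,
# which satisfies the dist.is_initialized() check in GeminiPlugin.__init__.
colossalai.launch_from_torch()  # older versions require a `config` argument

plugin = GeminiPlugin(placement_policy="auto", precision="fp16")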