"""Create a gradient accumulator that will accumulate gradients in fp32.
Args:
named_parameters: The parameters that will be updated by the optimizer. In case of Zero 1, this is the parameters that will be updated in this DP rank.
grad_buckets_named_params: The parameters to accumulate gradients for. If None it defaults to `named_parameters`. In case of Zero 1, this should be all the parameters in the model.
Note: We use `grad_buckets_named_params` to keep grad buffers for all parameters even when Zero 1 is used. This is because we need to accumulate gradients for all parameters without having to reduce in every accumulation step.
Note: We make a fp32 copy of parameters during initialization. Therefore parameters need to be initialized or loaded from a checkpoint before constructing this gradient accumulator
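# NOTE: the sketch below is illustrative only (`make_fp32_master_copies` and `fp32_grad_buffers`
# are made-up names, not the accumulator's actual attributes). It shows the core idea the
# docstring describes: keep an fp32 master copy of each parameter plus an fp32 buffer in which
# gradients are accumulated across micro-batches before the optimizer step.
import torch

def make_fp32_master_copies(named_parameters):
    fp32_params = {}
    fp32_grad_buffers = {}
    for name, param in named_parameters:
        # fp32 master weights: assumes `param` already holds its final (initialized/loaded) values
        fp32_params[name] = param.detach().clone().float()
        # gradients are accumulated here in fp32, independently of the param's dtype
        fp32_grad_buffers[name] = torch.zeros_like(fp32_params[name])
    return fp32_params, fp32_grad_buffers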
),f"Elements don't match:\n - Elements in `self.id_to_name` that aren't in the other one: {set(self.id_to_name.values())-set(state_dict['names'].values())}\n - Elements in `state_dict[\"names\"]` that aren't in the other one: {set(state_dict['names'].values())-set(self.id_to_name.values())}"
assertlen(state_dict["state"])==len(
state_dict["names"]
),f"Number of params in loaded state dict ({len(state_dict['state'])}) doesn't match number of names ({len(state_dict['names'])})"
assertlen(state_dict["state"])>0,"Loading empty state dict"
"""Optimizer that handles partitioning of optimizer's states across DP ranks. See ZeRO Stage 1 in the paper https://arxiv.org/abs/1910.02054v3 for more details."""
# maps each model param to the optimizer DP rank that is responsible for updating it
# We assume that parameters can be sharded across DP, i.e. we can "split" a parameter across different DP ranks. This does break some optimizers, such as Adafactor.
# `param_name_to_dp_rank_offsets[name]` is a `Dict[int, Tuple[int, int]]`: keys are dp_rank, and each `Tuple[int, int]` holds the offsets of the param slice belonging to that DP rank
param_name_to_dp_rank_offsets = {}
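# NOTE: illustrative example (made-up values): a parameter with 10 elements, flattened and split
# across 2 DP ranks, could end up recorded as `(start, end)` offsets into `param.view(-1)`:
# param_name_to_dp_rank_offsets["model.dense.weight"] = {
#     0: (0, 5),   # dp_rank 0 owns (and updates) elements [0, 5)
#     1: (5, 10),  # dp_rank 1 owns (and updates) elements [5, 10)
# }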
# NOTE: save the original shapes before flattening the params
# so that later on, we can reshape the params to their original shapes
# for topology-agnostic optimizer states loading
self._orig_param_shapes = {}
for name, param in named_params:
    self._orig_param_shapes[name] = param.shape
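# NOTE: minimal sketch (`flat_state` is a hypothetical flattened fp32 tensor, not the actual
# loading code) of why the original shapes are saved: flattened optimizer states can be reshaped
# back to the parameter's original shape when loading under a different topology, e.g.
# restored = flat_state.view(self._orig_param_shapes[name])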
for name, param in named_params:
    # We assume the parameter is contiguous in order to have an easy way of sharding it.
    assert param.is_contiguous(), f"Parameter {name} is not contiguous"
f"[ZeRO sharding] DP Rank {dp_rank} has {human_format(acc_numel)} out of {human_format(all_numel)} ({0ifall_numel==0elseacc_numel/all_numel*100:.2f}%) params' optimizer states",
"""Base class for all parameters in Nanotronmodels
A NanotronParameter can have specific properties:
- sharded: the parameter is considered to be `sharded` across multiple devices
- tied: the parameter is considered to be `tied` with other parameters. We sum gradients over those.
.. note::
Notes about tied weights:
- Tied weights means weights that need to be synced only within the same DP rank, regardless if they are part of TP strategy or just shared weights between two layers.
- Syncing tied weights usually require to sum gradients.
- Some weights are synced without needing to reduce grads over ranks. They can be in the same device (ex: enc/dec embeds in the same PP stage) or they can be duplicated across TP and duplicate the workload across TP ranks (ex: LN using traditional TP)
- Even if some weights don't need their grads to be reduced, it's still useful for them to be marked as tied. For example, current serialization format requires to mark them correctly.
// TODO @thomasw21: How do we extrapolate this notion to a tree. Not sure exactly, but topological ordering should be fine
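# NOTE: a simplified, hypothetical sketch of the pattern described above (this is NOT the actual
# NanotronParameter implementation): attach "tied"/"sharded" metadata to a torch Parameter so
# later stages (grad sync, serialization) can act on it.
import torch

class ParameterWithMetadata(torch.nn.Parameter):
    def mark_as_tied(self, name: str):
        # tied: gradients for this weight are summed across the ranks that share it
        if not hasattr(self, "_metadata"):
            self._metadata = {}
        self._metadata["tied"] = {"name": name}

    def mark_as_sharded(self, global_ranks: tuple):
        # sharded: this tensor holds only one slice of the logical parameter
        if not hasattr(self, "_metadata"):
            self._metadata = {}
        self._metadata["sharded"] = {"global_ranks": global_ranks}

    @property
    def is_tied(self) -> bool:
        return "tied" in getattr(self, "_metadata", {})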
# TODOs:
- [ ] passing activations that don't require backward breaks things, since 1f1b only works because you have the same number of forwards and the same number of backwards (in the stage sense)
"""Most granular pipeline block, ie within this module, everything will be part of a single rank, ie the entire computation within this block will happen on a specific device.
Current limitations:
- PipelineBlocks have to wrap a method/function/module that outputs a Dict[str, torch.Tensor]
Some considerations:
- In the literature, authors often refer to pipeline stages as a granularity block. Our notion is more granular. A pipeline stage is list of contiguous (in the forward sense) of pipeline blocks.
All PipelineBlock definition exist in each rank, they are just instantiated/built on a single rank per pipeline parallel process group.
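# NOTE: hypothetical usage sketch (module names are made up, and the block's constructor is not
# shown since its exact signature isn't given here): the wrapped module must return a
# Dict[str, torch.Tensor] so the block knows which named activations to ship across ranks.
import torch
from torch import nn

class MLPBlock(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.linear = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> dict:
        # must output a Dict[str, torch.Tensor]: the keys name the activations produced here
        return {"hidden_states": torch.relu(self.linear(x))}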
# Send activations held on the current rank to the rank that runs this block's compute
for name, tensor in sorted_kwargs:
    if isinstance(tensor, TensorPointer):
        # Current rank is neither the rank holding the data nor the rank responsible for computing block
        continue
    else:
        assert isinstance(tensor, torch.Tensor)
        # We need to send the tensor to the rank that actually runs the compute
        if self.pipeline_state is not None:
            send_to_pipeline_state_buffer(
                tensor,
                to_rank=self.rank,
                p2p=self.p2p,
                pipeline_state=self.pipeline_state,
            )
            continue
        if tensor.requires_grad is True:
            raise ValueError(
                f"Pipeline engine is None and tensor requires grad. Tried sending a tensor to {self.rank}. Usually that means that your model is pipeline sharded and you haven't chosen a specific pipeline engine."
# This assumes that prior communication was already done
# In case of interleaved 1f1b, if this is the second model chunk, then we need to send the previous activations before receiving the current activations
f"Pipeline engine is None and tensor requires grad. Tried receiving a tensor to {self.rank}. Usually that means that your model is pipeline sharded and you haven't chosen a specific pipeline engine."
"""If model returns tensor, we use it as a loss to backpropagate. If model returns a dict, we assume that the key "loss" is the loss to backpropagate."""