scaling helps overcome gradient overflow issues. When used in conjunction with
mixed precision, it enables training larger models and makes the training
process more stable, especially in deep networks. [#879]
- FSDP: Added state_dict_on_rank_0_only flag to let users return the full
  state dict on rank 0 and an empty dict on all other ranks to prevent OOM [#844]
- FSDP: Added process_group_reduce_scatter parameter to allow users to pass in the process group used for the reduce-scatter operation. [#897]