* New TE wrapper for PyTorch FullyShardedDataParallel (FSDP) to make TE modules shard their activations after the forward pass and gather them back before the backward pass (see the usage sketch after this list)
* Simplified TE module setup for FSDP comms
* FSDP scatter/gather for tensors saved into the autograd ctx now working for base TE modules
* Activation recompute now disables FSDP scatter/gather
* Fp8 weight buffers are sharded at the end of the backward pass and gathered before the forward pass
* Fixed typo in attribute name
* Fixed bug in finding FSDP-wrapped TE modules
* Fixed typo in fp8 weight tensor name
* Fixed incorrect number of gradients
* Added fp8 amax gradient hook tensor to the parameter reset
* Removed erroneous dummy tensor left over from an incorrect rebase
* Linting fixes
* Fixed git snafu and removed debug statements
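A minimal usage sketch of the scenario this change targets: TE modules wrapped in PyTorch FSDP, run under fp8_autocast, with activations and Fp8 weight buffers scattered after forward and gathered before backward. The `prepare_te_modules_for_fsdp` setup call reflects the "simplified TE module setup for FSDP comms" item above, but its exact name and signature are an assumption; everything else uses standard PyTorch FSDP and TE APIs.

```python
# Hypothetical launch: torchrun --nproc_per_node=<num_gpus> fsdp_sketch.py
import os
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# One process group spanning all ranks (assumption: a single data-parallel group).
torch.distributed.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# A plain stack of TE modules; FSDP shards their parameters across ranks, and with
# this change their saved activations and Fp8 weight buffers are also scattered
# after the forward pass and gathered again before the backward pass.
model = torch.nn.Sequential(
    te.Linear(1024, 1024),
    te.LayerNormMLP(1024, 4096),
).cuda()
model = FSDP(model, device_id=local_rank)

# Setup helper for TE <-> FSDP comms described in this change; the name/signature
# used here is an assumption based on the commit description.
te.distributed.prepare_te_modules_for_fsdp(model)

fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)
x = torch.randn(32, 1024, device="cuda")

# Requires an Fp8-capable GPU; activations/Fp8 weights are re-gathered for backward.
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(x)
out.sum().backward()
```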
---------

Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>