    Updates to support Accelerate and multigpu training (#37) · aa8b03f3
    Nikhila Ravi authored
    Summary:
    ## Changes:
    - Added the Accelerate library and refactored experiment.py to use it
    - Needed to move `init_optimizer` and `ExperimentConfig` to a separate file to be compatible with submitit/hydra
    - Needed to make some modifications to the data loaders etc. to work well with the Accelerate DDP wrappers
    - Loading/saving checkpoints incorporates an unwrapping step to remove the DDP wrapper from the model (see the sketch after this list)
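
    A minimal sketch of the Accelerate pattern referenced above, using a stand-in model and dataset rather than the actual Implicitron model/loaders from experiment.py:

    ```python
    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from accelerate import Accelerator

    # Stand-in model and data; the real code builds the Implicitron model and loaders.
    accelerator = Accelerator()
    model = torch.nn.Linear(8, 1)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loader = DataLoader(TensorDataset(torch.randn(64, 8), torch.randn(64, 1)), batch_size=8)

    # prepare() wraps the model in DDP and shards the data loader across processes.
    model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

    for x, y in loader:
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x), y)
        accelerator.backward(loss)  # replaces loss.backward() under Accelerate
        optimizer.step()

    # Saving: unwrap the DDP wrapper so the checkpoint stores the plain module's weights.
    if accelerator.is_main_process:
        torch.save(accelerator.unwrap_model(model).state_dict(), "checkpoint.pth")

    # Loading: restore into the unwrapped module (the DDP wrapper shares its parameters).
    accelerator.wait_for_everyone()
    accelerator.unwrap_model(model).load_state_dict(torch.load("checkpoint.pth"))
    ```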
    
    ## Tests
    
    Tested with both `torchrun` and `submitit/hydra` on two GPUs locally. Here are the commands:
    
    **Torchrun**
    
    Modules loaded:
    ```sh
    1) anaconda3/2021.05   2) cuda/11.3   3) NCCL/2.9.8-3-cuda.11.3   4) gcc/5.2.0 (unload gcc when using submitit)
    ```
    
    ```sh
    torchrun --nnodes=1 --nproc_per_node=2 experiment.py --config-path ./configs --config-name repro_singleseq_nerf_test
    ```
    
    **Submitit/Hydra Local test**
    
    ```sh
    ~/pytorch3d/projects/implicitron_trainer$ HYDRA_FULL_ERROR=1 python3.9 experiment.py --config-name repro_singleseq_nerf_test --multirun --config-path ./configs  hydra/launcher=submitit_local hydra.launcher.gpus_per_node=2 hydra.launcher.tasks_per_node=2 hydra.launcher.nodes=1
    ```
    
    **Submitit/Hydra distributed test**
    
    ```sh
    ~/implicitron/pytorch3d$ python3.9 experiment.py --config-name repro_singleseq_nerf_test --multirun --config-path ./configs  hydra/launcher=submitit_slurm hydra.launcher.gpus_per_node=8 hydra.launcher.tasks_per_node=8 hydra.launcher.nodes=1 hydra.launcher.partition=learnlab hydra.launcher.timeout_min=4320
    ```
    
    ## TODOS:
    - Fix distributed evaluation: currently this does not work because the input format to the evaluation function is not suitable for gathering across GPUs (it needs to be nested lists/tuples/dicts of objects that satisfy `is_torch_tensor`), and `frame_data` currently contains objects of `Cameras` type (see the sketch after this list).
    - Refactor the `accelerator` object to be accessible by all functions instead of needing to pass it around everywhere? Maybe have a `Trainer` class and store it there as an attribute?
    - Update the README with installation instructions for Accelerate, plus commands for running jobs with torchrun and submitit/hydra.
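
    To illustrate the gathering constraint in the first TODO, here is a hedged sketch: `Accelerator.gather` handles nested lists/tuples/dicts of tensors, but not arbitrary objects such as `Cameras`. The metric names below are hypothetical, not the real evaluation outputs.

    ```python
    import torch
    from accelerate import Accelerator

    accelerator = Accelerator()

    # A dict of tensors satisfies the nested list/tuple/dict-of-tensors requirement,
    # so it can be gathered across GPUs. The metric names here are hypothetical.
    per_rank_metrics = {
        "psnr": torch.tensor([31.2], device=accelerator.device),
        "depth_abs_err": torch.tensor([0.05], device=accelerator.device),
    }
    all_metrics = accelerator.gather(per_rank_metrics)  # each value now has one entry per process

    # By contrast, frame_data cannot be gathered directly because it contains Cameras
    # objects; one option would be to gather their tensor attributes (e.g. R, T,
    # focal_length) and rebuild the cameras on the main process afterwards.
    ```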
    
    X-link: https://github.com/fairinternal/pytorch3d/pull/37
    
    Reviewed By: davnov134, kjchalup
    
    Differential Revision: D37543870
    
    Pulled By: bottler
    
    fbshipit-source-id: be9eb4e91244d4fe3740d87dafec622ae1e0cf76