1. 12 Jul, 2022 1 commit
    • Updates to support Accelerate and multigpu training (#37) · aa8b03f3
      Nikhila Ravi authored
      Summary:
      ## Changes:
      - Added the Accelerate library and refactored experiment.py to use it
      - Needed to move `init_optimizer` and `ExperimentConfig` to a separate file to be compatible with submitit/hydra
      - Needed to make some modifications to data loaders etc. to work well with the Accelerate DDP wrappers
      - Loading/saving checkpoints now incorporates an unwrapping step to remove the DDP wrapper from the model
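
The unwrapping step in the last bullet can be illustrated with a minimal, dependency-free sketch. It is not the code from this commit: `TinyModel`, `FakeDDP`, and `unwrap` are hypothetical stand-ins that only mimic how a DDP-style wrapper keeps the real model in a `.module` attribute and prefixes its state keys. In real training code the Accelerate call for this is `accelerator.unwrap_model(model)`.

```python
class TinyModel:
    """Hypothetical stand-in for a model that exposes a state_dict()."""

    def __init__(self):
        self.weights = {"layer.weight": [0.0, 1.0]}

    def state_dict(self):
        return dict(self.weights)


class FakeDDP:
    """Hypothetical stand-in for a DDP-style wrapper.

    It keeps the real model in `.module` and prefixes every key with
    "module.", which is what makes wrapped checkpoints awkward to load
    into an unwrapped model.
    """

    def __init__(self, module):
        self.module = module

    def state_dict(self):
        inner = self.module.state_dict()
        return {f"module.{key}": value for key, value in inner.items()}


def unwrap(model):
    """Strip any number of `.module` wrapper layers (assumed convention)."""
    while hasattr(model, "module"):
        model = model.module
    return model


wrapped = FakeDDP(TinyModel())
wrapped_keys = list(wrapped.state_dict())            # keys carry the wrapper prefix
checkpoint_keys = list(unwrap(wrapped).state_dict())  # keys of the bare model
print(wrapped_keys, checkpoint_keys)
```

Saving the unwrapped state keeps checkpoint keys identical between single-process and multi-GPU runs, so either kind of run can load the other's checkpoint.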
      
      ## Tests
      
      Tested with both `torchrun` and `submitit/hydra` on two GPUs locally. Here are the commands:
      
      **Torchrun**
      
      Modules loaded (unload gcc when using submitit):
      ```sh
      1) anaconda3/2021.05   2) cuda/11.3   3) NCCL/2.9.8-3-cuda.11.3   4) gcc/5.2.0
      ```
      
      ```sh
      torchrun --nnodes=1 --nproc_per_node=2 experiment.py --config-path ./configs --config-name repro_singleseq_nerf_test
      ```
      
      **Submitit/Hydra Local test**
      
      ```sh
      ~/pytorch3d/projects/implicitron_trainer$ HYDRA_FULL_ERROR=1 python3.9 experiment.py --config-name repro_singleseq_nerf_test --multirun --config-path ./configs  hydra/launcher=submitit_local hydra.launcher.gpus_per_node=2 hydra.launcher.tasks_per_node=2 hydra.launcher.nodes=1
      ```
      
      **Submitit/Hydra distributed test**
      
      ```sh
      ~/implicitron/pytorch3d$ python3.9 experiment.py --config-name repro_singleseq_nerf_test --multirun --config-path ./configs  hydra/launcher=submitit_slurm hydra.launcher.gpus_per_node=8 hydra.launcher.tasks_per_node=8 hydra.launcher.nodes=1 hydra.launcher.partition=learnlab hydra.launcher.timeout_min=4320
      ```
      
      ## TODOS:
      - Fix distributed evaluation: this currently doesn't work because the input to the evaluation function is not suitable for gathering across GPUs. The input needs to be a nested list/tuple/dict of objects that satisfy `is_torch_tensor`, but `frame_data` currently contains a `Cameras` type.
      - Refactor the `accelerator` object to be accessible by all functions instead of needing to pass it around everywhere? Maybe have a `Trainer` class and add it as a method?
      - Update the README with installation instructions for Accelerate and with commands for running jobs with torchrun and submitit/hydra
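
The gathering constraint in the first TODO can be made concrete with a small sketch. It is not part of this commit: `non_gatherable_paths`, `FakeTensor`, and `FakeCameras` are hypothetical names, and a plain dict stands in for `frame_data`. The tensor predicate is passed in so the example runs without PyTorch; in real code it would presumably be something like `torch.is_tensor`.

```python
from typing import Any, Callable, List


def non_gatherable_paths(
    obj: Any, is_tensor: Callable[[Any], bool], path: str = "root"
) -> List[str]:
    """Return the paths of leaves that are neither tensors nor containers.

    Only nested lists, tuples, and dicts of tensors are treated as
    gatherable, which mirrors the constraint described in the TODO.
    """
    if is_tensor(obj):
        return []
    if isinstance(obj, dict):
        children = [(f"{path}[{key!r}]", value) for key, value in obj.items()]
    elif isinstance(obj, (list, tuple)):
        children = [(f"{path}[{index}]", value) for index, value in enumerate(obj)]
    else:
        return [path]  # a non-tensor leaf, e.g. a cameras object
    bad: List[str] = []
    for child_path, child in children:
        bad.extend(non_gatherable_paths(child, is_tensor, child_path))
    return bad


class FakeTensor:
    """Hypothetical stand-in for a tensor."""


class FakeCameras:
    """Hypothetical stand-in for a non-tensor `Cameras` object."""


batch = {
    "image_rgb": FakeTensor(),
    "camera": FakeCameras(),
    "masks": [FakeTensor(), FakeTensor()],
}
offending = non_gatherable_paths(batch, lambda x: isinstance(x, FakeTensor))
print(offending)  # ["root['camera']"]
```

A check like this could run before gathering to report which fields would need converting to tensors or excluding first.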
      
      X-link: https://github.com/fairinternal/pytorch3d/pull/37
      
      Reviewed By: davnov134, kjchalup
      
      Differential Revision: D37543870
      
      Pulled By: bottler
      
      fbshipit-source-id: be9eb4e91244d4fe3740d87dafec622ae1e0cf76
  2. 06 Jul, 2022 1 commit
    • typing for trainer · 40fb189c
      Jeremy Reizenstein authored
      Summary: Enable pyre checking of the trainer code.
      
      Reviewed By: shapovalov
      
      Differential Revision: D36545438
      
      fbshipit-source-id: db1ea8d1ade2da79a2956964eb0c7ba302fa40d1
  3. 25 May, 2022 1 commit
  4. 20 May, 2022 3 commits
    • dataset_map_provider · 79c61a2d
      Jeremy Reizenstein authored
      Summary: Replace dataset_zoo with a pluggable DatasetMapProvider. The logic is now in annotated_file_dataset_map_provider.
      
      Reviewed By: shapovalov
      
      Differential Revision: D36443965
      
      fbshipit-source-id: 9087649802810055e150b2fbfcc3c197a761f28a
    • New file for ImplicitronDatasetBase · 69c6d06e
      Jeremy Reizenstein authored
      Summary: Separate ImplicitronDatasetBase and FrameData (to be used by all data sources) from ImplicitronDataset (which is specific).
      
      Reviewed By: shapovalov
      
      Differential Revision: D36413111
      
      fbshipit-source-id: 3725744cde2e08baa11aff4048237ba10c7efbc6
    • data_source · 73dc109d
      Jeremy Reizenstein authored
      Summary:
      Move dataset_args and dataloader_args from ExperimentConfig into a new member called datasource so that it can contain replaceables.
      
      Also add enum Task for task type.
      
      Reviewed By: shapovalov
      
      Differential Revision: D36201719
      
      fbshipit-source-id: 47d6967bfea3b7b146b6bbd1572e0457c9365871
  5. 13 May, 2022 1 commit
  6. 09 May, 2022 1 commit
  7. 24 Mar, 2022 1 commit
  8. 21 Mar, 2022 1 commit