Commits · aa8b03f31dc2a178f8d7da457df28f19b5917009 · OpenDAS / pytorch3d

12 Jul, 2022 1 commit

Updates to support Accelerate and multigpu training (#37) · aa8b03f3

Nikhila Ravi authored Jul 11, 2022

Summary:
## Changes:
- Added Accelerate Library and refactored experiment.py to use it
- Needed to move `init_optimizer` and `ExperimentConfig` to a separate file to be compatible with submitit/hydra
- Needed to make some modifications to data loaders etc to work well with the accelerate ddp wrappers
- Loading/saving checkpoints incorporates an unwrapping step so remove the ddp wrapped model

## Tests

Tested with both `torchrun` and `submitit/hydra` on two gpus locally. Here are the commands:

**Torchrun**

Modules loaded:
```sh
1) anaconda3/2021.05   2) cuda/11.3   3) NCCL/2.9.8-3-cuda.11.3   4) gcc/5.2.0. (but unload gcc when using submit)
```

```sh
torchrun --nnodes=1 --nproc_per_node=2 experiment.py --config-path ./configs --config-name repro_singleseq_nerf_test
```

**Submitit/Hydra Local test**

```sh
~/pytorch3d/projects/implicitron_trainer$ HYDRA_FULL_ERROR=1 python3.9 experiment.py --config-name repro_singleseq_nerf_test --multirun --config-path ./configs  hydra/launcher=submitit_local hydra.launcher.gpus_per_node=2 hydra.launcher.tasks_per_node=2 hydra.launcher.nodes=1
```

**Submitit/Hydra distributed test**

```sh
~/implicitron/pytorch3d$ python3.9 experiment.py --config-name repro_singleseq_nerf_test --multirun --config-path ./configs  hydra/launcher=submitit_slurm hydra.launcher.gpus_per_node=8 hydra.launcher.tasks_per_node=8 hydra.launcher.nodes=1 hydra.launcher.partition=learnlab hydra.launcher.timeout_min=4320
```

## TODOS:
- Fix distributed evaluation: currently this doesn't work as the input format to the evaluation function is not suitable for gathering across gpus (needs to be nested list/tuple/dicts of objects that satisfy `is_torch_tensor`) and currently `frame_data`  contains `Cameras` type.
- Refactor the `accelerator` object to be accessible by all functions instead of needing to pass it around everywhere? Maybe have a `Trainer` class and add it as a method?
- Update readme with installation instructions for accelerate and also commands for running jobs with torchrun and submitit/hydra

X-link: https://github.com/fairinternal/pytorch3d/pull/37

Reviewed By: davnov134, kjchalup

Differential Revision: D37543870

Pulled By: bottler

fbshipit-source-id: be9eb4e91244d4fe3740d87dafec622ae1e0cf76

aa8b03f3

06 Jul, 2022 1 commit

typing for trainer · 40fb189c

Jeremy Reizenstein authored Jul 06, 2022

Summary: Enable pyre checking of the trainer code.

Reviewed By: shapovalov

Differential Revision: D36545438

fbshipit-source-id: db1ea8d1ade2da79a2956964eb0c7ba302fa40d1

40fb189c

25 May, 2022 1 commit

rename ImplicitronDataset to JsonIndexDataset · fbd3c679

Jeremy Reizenstein authored May 25, 2022

Summary: The ImplicitronDataset class corresponds to JsonIndexDatasetMapProvider

Reviewed By: shapovalov

Differential Revision: D36661396

fbshipit-source-id: 80ca2ff81ef9ecc2e3d1f4e1cd14b6f66a7ec34d

fbd3c679

20 May, 2022 3 commits

dataset_map_provider · 79c61a2d

Jeremy Reizenstein authored May 20, 2022

Summary: replace dataset_zoo with a pluggable DatasetMapProvider. The logic is now in annotated_file_dataset_map_provider.

Reviewed By: shapovalov

Differential Revision: D36443965

fbshipit-source-id: 9087649802810055e150b2fbfcc3c197a761f28a

79c61a2d

New file for ImplicitronDatasetBase · 69c6d06e

Jeremy Reizenstein authored May 20, 2022

Summary: Separate ImplicitronDatasetBase and FrameData (to be used by all data sources) from ImplicitronDataset (which is specific).

Reviewed By: shapovalov

Differential Revision: D36413111

fbshipit-source-id: 3725744cde2e08baa11aff4048237ba10c7efbc6

69c6d06e

data_source · 73dc109d

Jeremy Reizenstein authored May 20, 2022

Summary:
Move dataset_args and dataloader_args from ExperimentConfig into a new member called datasource so that it can contain replaceables.

Also add enum Task for task type.

Reviewed By: shapovalov

Differential Revision: D36201719

fbshipit-source-id: 47d6967bfea3b7b146b6bbd1572e0457c9365871

73dc109d

13 May, 2022 1 commit

return types for dataset_zoo, dataloader_zoo · 2c190152

Jeremy Reizenstein authored May 13, 2022

Summary: Stronger typing for these functions

Reviewed By: shapovalov

Differential Revision: D36170489

fbshipit-source-id: a2104b29dbbbcfcf91ae1d076cd6b0e3d2030c0b

2c190152

09 May, 2022 1 commit

Extracted ImplicitronModelBase and unified API for GenericModel and ModelDBIR · a6dada39

Roman Shapovalov authored May 09, 2022

Summary:
To avoid model_zoo, we need to make GenericModel pluggable.
I also align creation APIs for convenience.

Reviewed By: bottler, davnov134

Differential Revision: D35933093

fbshipit-source-id: 8228926528eb41a795fbfbe32304b8019197e2b1

a6dada39

24 Mar, 2022 1 commit

Using the new dataset idx API everywhere. · e2622d79

Roman Shapovalov authored Mar 24, 2022

Summary: Using the API from D35012121 everywhere.

Reviewed By: bottler

Differential Revision: D35045870

fbshipit-source-id: dab112b5e04160334859bbe8fa2366344b6e0f70

e2622d79

21 Mar, 2022 1 commit

implicitron v0 (#1133) · cdd2142d

Jeremy Reizenstein authored Mar 21, 2022


Co-authored-by: Jeremy Francis Reizenstein <bottler@users.noreply.github.com>

cdd2142d