Commits · c83ec3555dc1399789f8157475cb5fef7e691575 · OpenDAS / pytorch3d

09 Aug, 2022 1 commit

Mods and bugfixes for LLFF and Blender repros · c83ec355

Krzysztof Chalupka authored Aug 09, 2022

Summary:
LLFF (and most/all non-synth datasets) will have no background/foreground distinction. Add support for data with no fg mask.

Also, we had a bug in stats loading, like this:
  * Load stats
  * One of the stats has a history of length 0
  * That's fine, e.g. maybe it's fg_error but the dataset has no notion of fg/bg. So leave it as len 0
  * Check whether all the stats have the same history length as an arbitrarily chosen "reference-stat"
  * Oops, the reference-stat happened to be the stat with length 0
  * assert (legit_stat_len == reference_stat_len (=0)) ---> failed assert
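
The failure mode in the list above can be reproduced in a few lines. This is a minimal sketch, not the Implicitron `Stats` code: `check_history_lengths` and the stat names are hypothetical, and it assumes histories are plain lists keyed by stat name. It picks the reference length only from non-empty histories, so an empty stat such as `fg_error` can neither become the reference nor trip the check.

```python
from typing import Dict, List


def check_history_lengths(histories: Dict[str, List[float]]) -> int:
    """Return the common history length, ignoring stats with empty histories."""
    # Empty histories are legitimate (e.g. a foreground metric on a dataset
    # without masks), so they are excluded before choosing a reference.
    non_empty = {name: h for name, h in histories.items() if len(h) > 0}
    if not non_empty:
        return 0
    reference_name, reference = next(iter(non_empty.items()))
    for name, history in non_empty.items():
        if len(history) != len(reference):
            raise ValueError(
                f"{name} has {len(history)} entries, "
                f"but {reference_name} has {len(reference)}"
            )
    return len(reference)


# fg_error comes first, so a naive "first stat is the reference" check
# would compare every other stat against length 0 and fail.
stats = {"fg_error": [], "loss": [0.9, 0.7, 0.5], "psnr": [18.0, 21.0, 23.5]}
common_length = check_history_lengths(stats)
print(common_length)
```

Skipping empty histories is one possible policy; another would be to pick the longest history as the reference.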

Also some minor fixes (from Jeremy's other diff) to support LLFF

Reviewed By: davnov134

Differential Revision: D38475272

fbshipit-source-id: 5b35ac86d1d5239759f537621f41a3aa4eb3bd68


02 Aug, 2022 3 commits

Move load_stats to TrainingLoop · c3f8dad5

David Novotny authored Aug 02, 2022

Summary:
Stats are logically connected to the training loop, not to the model. Hence, moving to the training loop.

Also removing resume_epoch from OptimizerFactory in favor of a single place - ModelFactory. This removes the need for config consistency checks etc.

Reviewed By: kjchalup

Differential Revision: D38313475

fbshipit-source-id: a1d188a63e28459df381ff98ad8acdcdb14887b7


Fix train_stats.pdf: they now work by default · b7b188bf

Krzysztof Chalupka authored Aug 02, 2022

Summary: Before this diff, train_stats.pdf would not be created by default, except when resuming training. With this diff, it is created from the start of training.

Reviewed By: shapovalov

Differential Revision: D38320341

fbshipit-source-id: 8ea5b99ec81c377ae129f58e78dc2eaff94821ad


remove get_task · f8bf5280

Jeremy Reizenstein authored Aug 02, 2022

Summary: Remove the dataset's need to provide the task type.

Reviewed By: davnov134, kjchalup

Differential Revision: D38314000

fbshipit-source-id: 3805d885b5d4528abdc78c0da03247edb9abf3f7


01 Aug, 2022 1 commit

Better seeding of random engines · 80fc0ee0

David Novotny authored Aug 01, 2022

Summary: Currently, seeds are set only inside the train loop. This does not ensure that the model weights are initialized the same way everywhere, which makes experiments irreproducible. This diff fixes that.
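
As an illustration of why the seeding point matters, here is a minimal sketch, assuming a hypothetical helper `seed_all_random_engines` that is called before anything random is constructed. It is not the code from this diff. NumPy and PyTorch are seeded only if they are importable, and `build_weights` is a stand-in for model initialization.

```python
import random


def seed_all_random_engines(seed: int) -> None:
    """Seed every random engine that is importable in this environment."""
    random.seed(seed)
    try:
        import numpy as np

        np.random.seed(seed)
    except ImportError:
        pass  # NumPy is not installed; nothing to seed.
    try:
        import torch

        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch is not installed; nothing to seed.


def build_weights(n: int) -> list:
    # Stand-in for model construction: "weights" drawn from the seeded engine.
    return [random.gauss(0.0, 1.0) for _ in range(n)]


# Seeding before construction makes the initial weights repeatable.
seed_all_random_engines(42)
first_run = build_weights(4)
seed_all_random_engines(42)
second_run = build_weights(4)
print(first_run == second_run)
```

If the seed were set only after `build_weights` ran, the two runs would start from whatever state the engine happened to be in, and the initial weights would generally differ.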

Reviewed By: bottler

Differential Revision: D38315840

fbshipit-source-id: 3d2ecebbc36072c2b68dd3cd8c5e30708e7dd808


30 Jul, 2022 1 commit

Replace pluggable components to create a proper Configurable hierarchy. · 1b0584f7

Krzysztof Chalupka authored Jul 29, 2022

Summary:
This large diff rewrites a significant portion of Implicitron's config hierarchy. The new hierarchy, and some of the default implementation classes, are as follows:
```
Experiment
    data_source: ImplicitronDataSource
        dataset_map_provider
        data_loader_map_provider
    model_factory: ImplicitronModelFactory
        model: GenericModel
    optimizer_factory: ImplicitronOptimizerFactory
    training_loop: ImplicitronTrainingLoop
        evaluator: ImplicitronEvaluator
```

1) Experiment (used to be ExperimentConfig) is now a top-level Configurable and contains as members mainly (mostly new) high-level factory Configurables.
2) Experiment's job is to run factories, do some accelerate setup and then pass the results to the main training loop.
3) ImplicitronOptimizerFactory and ImplicitronModelFactory are new high-level factories that create the optimizer, scheduler, model, and stats objects.
4) TrainingLoop is a new configurable that runs the main training loop and the inner train-validate step.
5) Evaluator is a new configurable that TrainingLoop uses to run validation/test steps.
6) GenericModel is not the only model choice anymore. Instead, ImplicitronModelBase (by default instantiated with GenericModel) is a member of Experiment and can be easily replaced by a custom implementation by the user.

All the new Configurables are children of ReplaceableBase, and can be easily replaced with custom implementations.

In addition, I added support for the exponential LR schedule, updated the config files and the test, and added a config file that reproduces NeRF results along with a test that runs the repro experiment.
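
For readers unfamiliar with exponential schedules, a common form multiplies the base learning rate by `gamma` once per `step_size` epochs, with smooth decay in between. The sketch below assumes that form. The function and parameter names are hypothetical, and the exact formula used by this diff is not stated in the summary.

```python
def exponential_lr(base_lr: float, gamma: float, epoch: int, step_size: int) -> float:
    """Learning rate after `epoch` epochs under smooth exponential decay."""
    # The rate shrinks by a factor of `gamma` every `step_size` epochs.
    return base_lr * gamma ** (epoch / step_size)


# With gamma=0.5 and step_size=10 the rate halves every 10 epochs.
schedule = [exponential_lr(0.1, 0.5, epoch, 10) for epoch in (0, 10, 20)]
print(schedule)
```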

Reviewed By: bottler

Differential Revision: D37723227

fbshipit-source-id: b36bee880d6aa53efdd2abfaae4489d8ab1e8a27


15 Jul, 2022 1 commit

Fixed typing to have compatibility with OmegaConf 2.2.2 in Pytorch3D · 0f966217

Iurii Makarov authored Jul 15, 2022

Summary:
I tried to run `experiment.py` and `pytorch3d_implicitron_runner` and hit a failure with this traceback: https://www.internalfb.com/phabricator/paste/view/P515734086

It seems to be due to the new release of OmegaConf (version=2.2.2) which requires different typing. This fix helped to overcome it.

Reviewed By: bottler

Differential Revision: D37881644

fbshipit-source-id: be0cd4ced0526f8382cea5bdca9b340e93a2fba2


12 Jul, 2022 1 commit

Updates to support Accelerate and multigpu training (#37) · aa8b03f3

Nikhila Ravi authored Jul 11, 2022

Summary:
## Changes:
- Added Accelerate Library and refactored experiment.py to use it
- Needed to move `init_optimizer` and `ExperimentConfig` to a separate file to be compatible with submitit/hydra
- Needed to make some modifications to data loaders etc to work well with the accelerate ddp wrappers
- Loading/saving checkpoints now incorporates an unwrapping step to remove the DDP wrapper from the model
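
The unwrapping step in the last bullet can be sketched without any distributed setup. PyTorch's `DistributedDataParallel` exposes the wrapped network as `.module`, and Accelerate provides `Accelerator.unwrap_model` for this purpose. The code below is a plain-Python illustration of the same idea, where `Wrapper`, `TinyModel`, and `unwrap_model` are hypothetical stand-ins rather than the code from this change.

```python
class Wrapper:
    """Hypothetical stand-in for a DDP-style wrapper exposing `.module`."""

    def __init__(self, module):
        self.module = module


class TinyModel:
    """Hypothetical model with a state_dict-like accessor."""

    def __init__(self):
        self.weights = {"w": 1.0}

    def state_dict(self):
        return dict(self.weights)


def unwrap_model(model):
    # Peel off wrappers until the object no longer has a `.module` attribute.
    while hasattr(model, "module"):
        model = model.module
    return model


# Saving from the unwrapped model keeps wrapper details out of the checkpoint.
wrapped = Wrapper(Wrapper(TinyModel()))
checkpoint = unwrap_model(wrapped).state_dict()
print(checkpoint)
```

Saving the unwrapped state means the same checkpoint can be loaded in both single-process and multi-GPU runs.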

## Tests

Tested with both `torchrun` and `submitit/hydra` on two gpus locally. Here are the commands:

**Torchrun**

Modules loaded:
```sh
1) anaconda3/2021.05   2) cuda/11.3   3) NCCL/2.9.8-3-cuda.11.3   4) gcc/5.2.0 (but unload gcc when using submitit)
```

```sh
torchrun --nnodes=1 --nproc_per_node=2 experiment.py --config-path ./configs --config-name repro_singleseq_nerf_test
```

**Submitit/Hydra Local test**

```sh
~/pytorch3d/projects/implicitron_trainer$ HYDRA_FULL_ERROR=1 python3.9 experiment.py --config-name repro_singleseq_nerf_test --multirun --config-path ./configs  hydra/launcher=submitit_local hydra.launcher.gpus_per_node=2 hydra.launcher.tasks_per_node=2 hydra.launcher.nodes=1
```

**Submitit/Hydra distributed test**

```sh
~/implicitron/pytorch3d$ python3.9 experiment.py --config-name repro_singleseq_nerf_test --multirun --config-path ./configs  hydra/launcher=submitit_slurm hydra.launcher.gpus_per_node=8 hydra.launcher.tasks_per_node=8 hydra.launcher.nodes=1 hydra.launcher.partition=learnlab hydra.launcher.timeout_min=4320
```

## TODOS:
- Fix distributed evaluation: this currently doesn't work because the input to the evaluation function is not suitable for gathering across GPUs. It needs to be a nested list/tuple/dict of objects that satisfy `is_torch_tensor`, but `frame_data` contains a `Cameras` type.
- Refactor the `accelerator` object to be accessible by all functions instead of needing to pass it around everywhere? Maybe have a `Trainer` class and add it as a method?
- Update readme with installation instructions for accelerate and also commands for running jobs with torchrun and submitit/hydra

X-link: https://github.com/fairinternal/pytorch3d/pull/37

Reviewed By: davnov134, kjchalup

Differential Revision: D37543870

Pulled By: bottler

fbshipit-source-id: be9eb4e91244d4fe3740d87dafec622ae1e0cf76
