1. 05 Apr, 2023 1 commit
    • Setup root logger once & on import time · abdeafb0
      Mik Vyatskov authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/523
      
      To avoid setting it up multiple times, add a run_once() decorator.
      
      Additionally, make sure logging is configured for dataloading workers, which have a different entry point, by moving the logging setup to import time. Currently, when a dataloader worker is created using the spawn method of the multiprocessing module, a new Python interpreter is started, all modules are imported anew, and the entry point is set to the specified method. This means the training framework's entry point is skipped, together with the logging setup.
      
      With this change, logging is configured at import time. When a dataloading process is created, the training main is not invoked, but train_net is still imported in the child process, so logging gets configured anyway.
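      A minimal sketch of the mechanism, assuming a decorator-factory form of run_once(); the setup_logging helper, its formatting, and the decorator internals are illustrative assumptions, not the actual d2go code:

      ```python
      import functools
      import logging
      import sys


      def run_once():
          """Decorator factory: the wrapped function becomes a no-op after its first call."""

          def decorator(fn):
              has_run = False

              @functools.wraps(fn)
              def wrapper(*args, **kwargs):
                  nonlocal has_run
                  if not has_run:
                      has_run = True
                      return fn(*args, **kwargs)

              return wrapper

          return decorator


      @run_once()
      def setup_logging():
          """Configure the root logger; repeated calls are no-ops thanks to run_once()."""
          handler = logging.StreamHandler(sys.stdout)
          handler.setFormatter(
              logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
          )
          logging.root.addHandler(handler)
          logging.root.setLevel(logging.INFO)


      # Calling this at import time means a spawned dataloader worker, which
      # re-imports the module but never runs the training entry point, still
      # ends up with logging configured.
      setup_logging()
      ```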
      
      Reviewed By: miqueljubert
      
      Differential Revision: D44641142
      
      fbshipit-source-id: 06ea85363d965b31d7f9ade3c2615ed9db67470b
  2. 16 Feb, 2023 1 commit
    • Add reply files to d2go training processes · f0f55cdc
      Sudarshan Raghunathan authored
      Summary:
      This diff contains a minimal set of changes to support returning reply files to MAST.
      
      There are three parts:
      1. First, we have a try..except in the main function to catch all the "catchable" Python exceptions. Exceptions from C++ code or segfaults will not be handled here.
      2. Each exception is then written to a per-process JSON reply file.
      3. At the end, all per-process files are stat-ed and the earliest one is copied to a location specified by MAST (a minimal sketch of this flow follows below).
      
      # Limitations
      1. This only works when local processes are launched using multiprocessing (which is the default)
      2. If any error happens in C++ code, it will likely not be caught in Python and the reply file might not have the correct logs.
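      A minimal sketch of the per-process write and the final "earliest file wins" copy; the helper names, the JSON payload fields, and the file naming scheme are assumptions for illustration, not the actual diff:

      ```python
      import glob
      import json
      import os
      import shutil
      import traceback


      def write_reply_file(exc: BaseException, reply_dir: str, rank: int) -> None:
          """Serialize a caught exception to a per-process JSON reply file."""
          payload = {
              "rank": rank,
              "error_type": type(exc).__name__,
              "message": str(exc),
              "traceback": traceback.format_exc(),
          }
          with open(os.path.join(reply_dir, f"reply_rank{rank}.json"), "w") as f:
              json.dump(payload, f)


      def copy_earliest_reply_file(reply_dir: str, dest_path: str) -> None:
          """Stat all per-process reply files and copy the earliest one to the final location."""
          files = glob.glob(os.path.join(reply_dir, "reply_rank*.json"))
          if files:
              earliest = min(files, key=lambda p: os.stat(p).st_mtime)
              shutil.copyfile(earliest, dest_path)
      ```

      In the main function, write_reply_file would be called from the except branch before re-raising, and copy_earliest_reply_file from the parent process after all local workers exit.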
      
      Differential Revision: D43097683
      
      fbshipit-source-id: 0eaf4f19f6199a9c77f2ce4c7d2bbc2a2078be99
  3. 13 Jan, 2023 1 commit
    • Rewrite FSDP wrapping as modeling hook · dc6fac12
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/440
      
      Move FSDP wrapping to runner.build_model by rewriting it as a modeling hook
      
      **Motivation**
      When a model is too large to run inference on a single GPU, it requires using FSDP with local checkpointing mode to save peak GPU memory. However, in the eval_pytorch workflow (train_net with eval-only), models are evaluated without being wrapped by FSDP, which may cause OOM errors for the reasons above. Thus, it is better practice to wrap the model with FSDP during `runner.build_model(cfg)`, so evaluation can run in the same FSDP setting as training.
      
      This diff moves FSDP wrapping to `runner.build_model(cfg)` by rewriting it as a modeling hook.
      
      **API changes**
      * Users need to append `"FSDPModelingHook"` to `MODEL.MODELING_HOOKS` to enable FSDP.
      * `FSDP.ALGORITHM` can only be `full` or `grad_optim`.
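      A minimal sketch of what enabling the hook might look like, assuming `cfg` is the usual yacs-style CfgNode produced by the d2go runner; only the two keys named above come from this summary, everything else is illustrative:

      ```python
      def enable_fsdp_hook(cfg):
          """Turn on FSDP through the modeling-hook API described above (sketch)."""
          hooks = list(cfg.MODEL.MODELING_HOOKS)
          if "FSDPModelingHook" not in hooks:
              hooks.append("FSDPModelingHook")
          cfg.MODEL.MODELING_HOOKS = hooks

          # "full" shards params + gradients + optimizer state;
          # "grad_optim" shards only gradients and optimizer state.
          cfg.FSDP.ALGORITHM = "grad_optim"
          return cfg
      ```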
      
      **Note**
      It is not possible to unwrap an FSDP model back to the original model, so `FSDPModelingHook.unapply()` can't be implemented.
      
      Reviewed By: wat3rBro
      
      Differential Revision: D41416917
      
      fbshipit-source-id: f3fc72d574cc6ccbe0d238e48c575926ba5b4d06
  4. 19 Dec, 2022 1 commit
  5. 17 Nov, 2022 1 commit
    • Integrate PyTorch Fully Sharded Data Parallel (FSDP) · 02625ff8
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/396
      
      Integrate PyTorch FSDP, which supports two sharding modes: 1. gradient + optimizer sharding; 2. full model sharding (params + gradient + optimizer). This feature is enabled in the train_net.py code path.
      
      Sources
      * Integration follows this tutorial: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
      
      API changes
      * Add new config keys to support the new feature. Refer to mobile-vision/d2go/d2go/trainer/fsdp.py for the full list of config options.
      * Add `FSDPCheckpointer` as a subclass of `QATCheckpointer` to support the special loading/saving logic for FSDP models.
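      A minimal sketch of the two sharding modes using the public PyTorch FSDP API from the linked tutorial; the helper name and the mapping of the config strings to ShardingStrategy values are assumptions, not the d2go implementation:

      ```python
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
      from torch.distributed.fsdp import ShardingStrategy


      def wrap_with_fsdp(model: nn.Module, algorithm: str = "grad_optim") -> FSDP:
          """Wrap a model with FSDP; assumes torch.distributed is already initialized."""
          strategy = {
              # params + gradients + optimizer state are sharded
              "full": ShardingStrategy.FULL_SHARD,
              # only gradients + optimizer state are sharded
              "grad_optim": ShardingStrategy.SHARD_GRAD_OP,
          }[algorithm]
          return FSDP(model, sharding_strategy=strategy)
      ```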
      
      Reviewed By: wat3rBro
      
      Differential Revision: D39228316
      
      fbshipit-source-id: 342ecb3bcbce748453c3fba2d6e1b7b7e478473c
  6. 14 Nov, 2022 1 commit
  7. 23 Oct, 2022 1 commit
  8. 09 Aug, 2022 2 commits
  9. 28 Jul, 2022 1 commit
  10. 27 Jul, 2022 1 commit
  11. 25 Jul, 2022 1 commit
  12. 22 Jul, 2022 1 commit
  13. 30 Jun, 2022 1 commit
  14. 24 Jun, 2022 2 commits
    • Only save results to file from rank 0 · f0297b81
      Mik Vyatskov authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/309
      
      Right now, multiple machines can try to write to the same output file, since they all
      receive the same argument. Additionally, several outputs can be saved on the same
      machine, which requires unnecessary unpacking. This change makes train_net write
      output only from the rank 0 trainer.
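      The gist of the change, sketched with detectron2's distributed helpers; the function name and JSON output are illustrative assumptions:

      ```python
      import json

      from detectron2.utils import comm


      def maybe_save_results(results: dict, output_path: str) -> None:
          """Write evaluation results to disk only from the rank 0 (main) process."""
          if comm.is_main_process():
              with open(output_path, "w") as f:
                  json.dump(results, f)
      ```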
      
      Reviewed By: wat3rBro
      
      Differential Revision: D37310084
      
      fbshipit-source-id: 9d5352a274e8fb1d2043393b12896d402333c17b
    • use runner class instead of instance outside of main · 8051775c
      Yanghan Wang authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/312
      
      As discussed, we decided not to use a runner instance outside of `main`. Previous diffs already addressed the prerequisites; this diff mainly does the renaming.
      - Use runner name (str) in the fblearner, ML pipeline.
      - Use runner name (str) in FBL operator, MAST and binary operator.
      - Use the runner class as the interface of `main`; it can be either the name of the class (str) or the actual class. The main usage should be the `str` form, so that the class import happens inside `main`. But it is also a common use case to import the runner class and call `main` directly for things like ad-hoc scripts or tests; supporting the actual class makes it easier to modify code for those cases (e.g. a local test class doesn't have an importable name, so a runner name can't be used). A minimal resolution sketch follows below.
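      A hypothetical helper showing how `main` could accept either form; this is not d2go's actual code, just an illustration of the str-or-class interface described above:

      ```python
      import importlib
      from typing import Type, Union


      def resolve_runner_class(runner: Union[str, Type]) -> Type:
          """Accept either a fully-qualified runner name or the class itself."""
          if isinstance(runner, str):
              # Keep the import inside main: only resolve the module when needed.
              module_name, _, class_name = runner.rpartition(".")
              return getattr(importlib.import_module(module_name), class_name)
          # Actual class passed directly, e.g. by an ad-hoc script or test.
          return runner
      ```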
      
      Reviewed By: newstzpz
      
      Differential Revision: D37060338
      
      fbshipit-source-id: 879852d41902b87d6db6cb9d7b3e8dc55dc4b976
  15. 18 Jun, 2022 2 commits
  16. 16 Jun, 2022 1 commit
  17. 15 Jun, 2022 1 commit
  18. 15 May, 2022 1 commit
    • apply import merging for fbcode (7 of 11) · b3a9204c
      John Reese authored
      Summary:
      Applies new import merging and sorting from µsort v1.0.
      
      When merging imports, µsort makes a best effort to move associated
      comments to match merged elements, but there are known limitations due to
      the dynamic nature of Python and developer tooling. These changes should
      not produce any dangerous runtime changes, but may require touch-ups to
      satisfy linters and other tooling.
      
      Note that µsort uses case-insensitive, lexicographical sorting, which
      results in a different ordering compared to isort. This provides a more
      consistent sorting order, matching the case-insensitive order used when
      sorting import statements by module name, and ensures that "frog", "FROG",
      and "Frog" always sort next to each other.
      
      For details on µsort's sorting and merging semantics, see the user guide:
      https://usort.readthedocs.io/en/stable/guide.html#sorting
      
      Reviewed By: lisroach
      
      Differential Revision: D36402205
      
      fbshipit-source-id: a4efc688d02da80c6e96685aa8eb00411615a366
  19. 05 Mar, 2022 1 commit
  20. 03 Mar, 2022 1 commit
  21. 22 May, 2021 1 commit
    • support FP16 gradient compression · 57809b0f
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/70
      
      DDP supports an fp16_compress_hook that compresses gradients to FP16 before communication, which can result in a significant speedup.
      
      Add one argument `_C.MODEL.DDP_FP16_GRAD_COMPRESS` to trigger it.
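      A minimal sketch of how the flag could be wired up using PyTorch's built-in communication hook; the helper name and call site are assumptions, only the hook and the config key come from this summary:

      ```python
      from torch.distributed.algorithms.ddp_comm_hooks import default_hooks
      from torch.nn.parallel import DistributedDataParallel as DDP


      def maybe_enable_fp16_grad_compression(ddp_model: DDP, cfg) -> None:
          """Register the built-in FP16 gradient compression hook when the flag is set."""
          if cfg.MODEL.DDP_FP16_GRAD_COMPRESS:
              # Gradients are cast to FP16 before all-reduce and restored to the
              # original dtype afterwards, roughly halving communication volume.
              ddp_model.register_comm_hook(
                  state=None, hook=default_hooks.fp16_compress_hook
              )
      ```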
      
      Reviewed By: zhanghang1989
      
      Differential Revision: D28467701
      
      fbshipit-source-id: 3c80865222f48eb8fe6947ea972448c445ee3ef3
  22. 03 Mar, 2021 1 commit