1. 03 Mar, 2024 1 commit
    • apply Black 2024 style in fbcode (7/16) · 2256bdb7
      Amethyst Reese authored
      Summary:
      Formats the covered files with pyfmt.
      
      paintitblack
      
      Reviewed By: aleivag
      
      Differential Revision: D54447732
      
      fbshipit-source-id: e21fbbe27882c8af183d021f4ac27029cbe93e8e
  2. 08 Jan, 2024 1 commit
  3. 07 Dec, 2023 1 commit
  4. 05 Nov, 2023 1 commit
    • allow to skip loading model weights in build_model() · f2a0c52c
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/630
      
      Currently, in the runner's **build_model()** method, when **eval_only=True**, we always try to load model weights.
      This is quite restrictive in some cases. For example, we may just want to build a model in eval mode to profile its efficiency, before the model has been trained or its weights saved to a checkpoint file.
      
      Thus, this diff adds a **skip_model_weights** argument that allows users to skip loading model weights.
      Note: this diff is entirely backward-compatible and is NOT expected to break existing implementations.
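      
      A minimal sketch of the intended control flow (illustrative only, not the actual d2go code; `build_meta_arch` stands in for whatever the runner actually constructs):
      
      ```python
      from detectron2.checkpoint import DetectionCheckpointer
      from detectron2.modeling import build_model as build_meta_arch
      
      # Sketch: skip_model_weights lets callers build an untrained eval-mode
      # model, e.g. to profile efficiency without a checkpoint file.
      def build_model(cfg, eval_only=False, skip_model_weights=False):
          model = build_meta_arch(cfg)
          if eval_only:
              if not skip_model_weights:
                  # Previous behavior: weights were always loaded here.
                  DetectionCheckpointer(model).load(cfg.MODEL.WEIGHTS)
              model.eval()
          return model
      ```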
      
      Reviewed By: navsud, wat3rBro
      
      Differential Revision: D50623772
      
      fbshipit-source-id: 282dc6f19e17a4dd9eb0048e068c5299bb3d47c2
  5. 27 Sep, 2023 1 commit
  6. 24 Aug, 2023 1 commit
  7. 19 Jul, 2023 1 commit
  8. 12 Jul, 2023 1 commit
    • Extend reply files to all binaries · e4fa6d63
      Francisc Bungiu authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/591
      
      We previously added reply files for train_net, but not for the other relevant binaries with MAST support: evaluator and lightning.
      This diff adds support by extracting the common bits into a separate module and wrapping the entry-point functions to reuse the functionality, as sketched below.
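      
      Conceptually, the shared module might expose a wrapper like this (names are made up for illustration, not the actual module's API):
      
      ```python
      import functools
      import json
      import os
      import traceback
      
      def with_reply_file(main_fn):
          """Hypothetical wrapper: run a binary's main() and record any
          Python exception to a per-process JSON reply file."""
          @functools.wraps(main_fn)
          def wrapped(*args, **kwargs):
              try:
                  return main_fn(*args, **kwargs)
              except Exception as e:
                  reply = {
                      "error": type(e).__name__,
                      "message": str(e),
                      "traceback": traceback.format_exc(),
                  }
                  with open(f"/tmp/reply_{os.getpid()}.json", "w") as f:
                      json.dump(reply, f)
                  raise
          return wrapped
      
      # The same wrapper can now cover train_net, evaluator, and lightning:
      #   main = with_reply_file(main)
      ```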
      
      Differential Revision: D47293689
      
      fbshipit-source-id: 70630a471c0cf037d180c9edfb57a4db4fdf7bde
  9. 22 Jun, 2023 1 commit
  10. 19 Jun, 2023 1 commit
  11. 02 Jun, 2023 1 commit
  12. 11 Apr, 2023 1 commit
  13. 05 Apr, 2023 1 commit
    • Setup root logger once & on import time · abdeafb0
      Mik Vyatskov authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/523
      
      To avoid setting it up multiple times, add a run_once() decorator.
      
      Additionally, make sure logging is configured for dataloading workers, which have a different entry point, by moving the logging setup to import time. Currently, when a dataloader worker is created using the spawn method from the multiprocessing module, a new Python interpreter is created, with all the modules imported anew and with the entry point set to the specified method. This means the entry point of the training framework is skipped, together with the logging setup.
      
      With this change, logging is configured at import time: even though a dataloading process never invokes the training main, train_net is still imported in the child process, so logging still gets configured.
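      
      A minimal sketch of the idea, assuming this shape for the decorator (the actual d2go implementation may differ):
      
      ```python
      import functools
      import logging
      
      def run_once():
          """Make the decorated function execute on its first call only;
          subsequent calls become no-ops."""
          def decorator(fn):
              @functools.wraps(fn)
              def wrapped(*args, **kwargs):
                  if not wrapped._has_run:
                      wrapped._has_run = True
                      return fn(*args, **kwargs)
              wrapped._has_run = False
              return wrapped
          return decorator
      
      @run_once()
      def setup_root_logger():
          logging.basicConfig(level=logging.INFO)
      
      # Runs at import time, so a spawned dataloader worker that merely
      # imports this module still gets logging configured.
      setup_root_logger()
      ```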
      
      Reviewed By: miqueljubert
      
      Differential Revision: D44641142
      
      fbshipit-source-id: 06ea85363d965b31d7f9ade3c2615ed9db67470b
  14. 16 Feb, 2023 1 commit
    • Add reply files to d2go training processes · f0f55cdc
      Sudarshan Raghunathan authored
      Summary:
      This diff contains a minimal set of changes to support returning reply files to MAST.
      
      There are three parts:
      1. First, we have a try/except in the main function to catch all the "catchable" Python exceptions. Exceptions from C++ code or segfaults will not be handled here.
      2. Each exception is then written to a per-process JSON reply file.
      3. At the end, all per-process files are stat-ed and the earliest one is copied to a location specified by MAST (see the sketch after the limitations below).
      
      # Limitations
      1. This only works when local processes are launched using multiprocessing (which is the default).
      2. If any error happens in C++ code, it will likely not be caught in Python and the reply file may not contain the correct logs.
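      
      A sketch of part 3, with illustrative names (the real paths and file layout come from MAST):
      
      ```python
      import glob
      import os
      import shutil
      
      def copy_earliest_reply(reply_dir: str, mast_reply_path: str) -> None:
          """Among all per-process reply files, copy the earliest-written
          one to the location MAST expects."""
          files = glob.glob(os.path.join(reply_dir, "reply_*.json"))
          if not files:
              return  # no process recorded an exception
          earliest = min(files, key=lambda p: os.stat(p).st_mtime)
          shutil.copyfile(earliest, mast_reply_path)
      ```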
      
      Differential Revision: D43097683
      
      fbshipit-source-id: 0eaf4f19f6199a9c77f2ce4c7d2bbc2a2078be99
  15. 01 Feb, 2023 1 commit
    • Allow specifying extra lightning trainer params via `_DEFAULTS_` in yaml · 6940fa9c
      Yanghan Wang authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/461
      
      There is a need for trainer parameters that are not in (or conflict with) the base d2go config. This diff adds a way to inject those configs without touching the base d2go config (see the sketch after this list):
      - In `get_trainer_params`, it simply checks `LIGHTNING_TRAINER` and uses whatever configs are under it.
      - Adds `GeneralizedRCNNTaskNoDefaultConfig`, which allows specifying the default config via a yaml file for `GeneralizedRCNNTask` (also makes some prerequisite changes).
      - (next diff) Users can add their own config updater by registering it in `CONFIG_UPDATER_REGISTRY`.
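      
      The injection could look roughly like this (placeholder keys; the params derived from the base config are more involved in the real code):
      
      ```python
      def get_trainer_params(cfg):
          # Params derived from the base d2go config (placeholders here).
          params = {"max_epochs": -1, "logger": True}
          # Everything under LIGHTNING_TRAINER is applied on top, allowing
          # trainer options that don't exist in (or would conflict with)
          # the base config, without touching the base config itself.
          extra = getattr(cfg, "LIGHTNING_TRAINER", None)
          if extra is not None:
              params.update(dict(extra))
          return params
      ```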
      
      Differential Revision: D42928992
      
      fbshipit-source-id: f2a1d8a3f2bec9908bb1af03928611d963b92c0e
  16. 13 Jan, 2023 1 commit
    • Rewrite FSDP wrapping as modeling hook · dc6fac12
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/440
      
      Move FSDP wrapping to runner.build_model by rewriting it as a modeling hook
      
      **Motivation**
      When a model is too large to run inference on a single GPU, it requires using FSDP with local checkpointing mode to reduce peak GPU memory. However, in the eval_pytorch workflow (train_net with eval-only), models are evaluated without being wrapped by FSDP, which may cause OOM errors for the reason above. Thus, it is better practice to wrap the model with FSDP during `runner.build_model(cfg)`, so that evaluation runs in the same FSDP setting as training.
      
      This diff moves FSDP wrapping to `runner.build_model(cfg)` by rewriting it as a modeling hook.
      
      **API changes**
      * Users need to append `"FSDPModelingHook"` to `MODEL.MODELING_HOOKS` to enable FSDP.
      * `FSDP.ALGORITHM` can only be `full` or `grad_optim`.
      
      **Note**
      It's not possible to unwrap an FSDP model back to the normal model, so FSDPModelingHook.unapply() can't be implemented.
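      
      Sketched as a hook (interface and names are paraphrased, not the exact d2go code; FSDP construction assumes torch.distributed is already initialized):
      
      ```python
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
      
      class FSDPModelingHookSketch:
          """Illustrative only: apply() wraps the model with FSDP inside
          runner.build_model(cfg); unapply() cannot exist because an
          FSDP-wrapped model can't be unwrapped back."""
      
          def apply(self, model: nn.Module) -> nn.Module:
              return FSDP(model)
      
          def unapply(self, model: nn.Module) -> nn.Module:
              raise NotImplementedError("FSDP wrapping is not reversible")
      ```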
      
      Reviewed By: wat3rBro
      
      Differential Revision: D41416917
      
      fbshipit-source-id: f3fc72d574cc6ccbe0d238e48c575926ba5b4d06
  17. 19 Dec, 2022 1 commit
  18. 17 Nov, 2022 1 commit
    • Integrate PyTorch Fully Sharded Data Parallel (FSDP) · 02625ff8
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/396
      
      Integrate PyTorch FSDP, which supports two sharding modes: 1. gradient + optimizer sharding; 2. full model sharding (params + gradient + optimizer). This feature is enabled in the train_net.py code path (a sketch of the two modes follows the API changes below).
      
      Sources
      * Integration follows this tutorial: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
      
      API changes
      * Add new config keys to support the new feature; refer to mobile-vision/d2go/d2go/trainer/fsdp.py for the full list of config options.
      * Add `FSDPCheckpointer`, a subclass of `QATCheckpointer`, to support the special loading/saving logic for FSDP models.
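      
      A sketch of the two modes in plain PyTorch terms (assumes torch.distributed is initialized; the real, config-driven wrapping lives in d2go/trainer/fsdp.py):
      
      ```python
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
      from torch.distributed.fsdp import ShardingStrategy
      
      def wrap_with_fsdp(model: nn.Module, algorithm: str) -> nn.Module:
          strategy = {
              # gradient + optimizer sharding
              "grad_optim": ShardingStrategy.SHARD_GRAD_OP,
              # full model sharding (params + gradient + optimizer)
              "full": ShardingStrategy.FULL_SHARD,
          }[algorithm]
          return FSDP(model, sharding_strategy=strategy)
      ```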
      
      Reviewed By: wat3rBro
      
      Differential Revision: D39228316
      
      fbshipit-source-id: 342ecb3bcbce748453c3fba2d6e1b7b7e478473c
  19. 14 Nov, 2022 1 commit
  20. 11 Nov, 2022 1 commit
  21. 27 Oct, 2022 1 commit
  22. 23 Oct, 2022 1 commit
  23. 05 Oct, 2022 1 commit
  24. 28 Sep, 2022 1 commit
  25. 10 Sep, 2022 1 commit
  26. 09 Aug, 2022 2 commits
  27. 28 Jul, 2022 1 commit
  28. 27 Jul, 2022 1 commit
  29. 25 Jul, 2022 1 commit
  30. 22 Jul, 2022 1 commit
  31. 30 Jun, 2022 2 commits
  32. 29 Jun, 2022 1 commit
  33. 24 Jun, 2022 2 commits
    • Only save results to file from rank 0 · f0297b81
      Mik Vyatskov authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/309
      
      Right now multiple machines can try to write to the same output file, since they all receive the same argument. Additionally, on the same machine, several outputs can be saved, which requires unnecessary unpacking. This change makes train_net write output only from the rank 0 trainer.
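      
      The guard might look like this (sketch only; d2go's own comm utilities could be used instead of raw torch.distributed):
      
      ```python
      import torch.distributed as dist
      
      def save_results(results, output_path: str) -> None:
          """Only the global rank 0 process writes the shared output file;
          every other rank returns without writing."""
          if dist.is_available() and dist.is_initialized() and dist.get_rank() != 0:
              return
          with open(output_path, "w") as f:
              f.write(repr(results))  # placeholder for the real serialization
      ```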
      
      Reviewed By: wat3rBro
      
      Differential Revision: D37310084
      
      fbshipit-source-id: 9d5352a274e8fb1d2043393b12896d402333c17b
    • use runner class instead of instance outside of main · 8051775c
      Yanghan Wang authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/312
      
      As discussed, we decided not to use a runner instance outside of `main`. Previous diffs already solved the prerequisites; this diff mainly does the renaming (see the sketch after this list):
      - Use the runner name (str) in fblearner, ML pipeline.
      - Use the runner name (str) in the FBL operator, MAST and binary operator.
      - Use the runner class as the interface of `main`; it can be either the name of the class (str) or the actual class. The main usage should be a `str`, so that importing the class happens inside `main`. But it's also a common use case to import a runner class and call `main` for things like ad-hoc scripts or tests; supporting the actual class makes it easier to modify code for those cases (e.g. some local test class doesn't have a name, so it's not feasible to use a runner name).
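      
      The str-or-class interface could be resolved roughly like this (illustrative helper, not the exact d2go code):
      
      ```python
      import importlib
      
      def resolve_runner_class(runner):
          """Accept either a fully-qualified runner name (str), so the
          import happens inside main, or an actual class, for ad-hoc
          scripts and tests."""
          if isinstance(runner, str):
              module_name, _, class_name = runner.rpartition(".")
              return getattr(importlib.import_module(module_name), class_name)
          return runner  # already a class
      
      # e.g. resolve_runner_class("d2go.runner.GeneralizedRCNNRunner")
      ```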
      
      Reviewed By: newstzpz
      
      Differential Revision: D37060338
      
      fbshipit-source-id: 879852d41902b87d6db6cb9d7b3e8dc55dc4b976
  34. 18 Jun, 2022 2 commits
  35. 16 Jun, 2022 2 commits