- 26 May, 2023 1 commit
-
Ajinkya Deogade authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/549 `iterate_module_named_parameters` is used by both the `optimizer` and `quantization` modules. Let's move it to a shared location, `utils`, to break the circular dependencies for the following diffs in the stack. Reviewed By: tglik Differential Revision: D45912066 fbshipit-source-id: bce5c5db3bbc1866f4da8662f7bd5908bfe30aad
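For context, a hedged sketch of what a helper like `iterate_module_named_parameters` typically looks like — the actual d2go signature and yielded tuple may differ:

```python
from typing import Iterator, Tuple

import torch.nn as nn

def iterate_module_named_parameters(
    model: nn.Module, check_requires_grad: bool = True
) -> Iterator[Tuple[nn.Module, str, str, nn.Parameter]]:
    """Yield (module, module_name, param_name, param), visiting each
    parameter exactly once by iterating with recurse=False per module."""
    seen_ids = set()
    for module_name, module in model.named_modules():
        for param_name, param in module.named_parameters(recurse=False):
            if id(param) in seen_ids:
                continue
            seen_ids.add(id(param))
            if check_requires_grad and not param.requires_grad:
                continue
            yield module, module_name, param_name, param
```

Keeping such a helper in `utils` lets both the optimizer and quantization modules import it without importing each other.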
-
- 25 May, 2023 4 commits
-
Jiaxu Zhu authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/548 As the title says, by setting

```
SOLVER.DETERMINISTIC = True
SEED = 42  # or other values
```

training results become reproducible. Reviewed By: wat3rBro, rkaarimi Differential Revision: D46174626 fbshipit-source-id: d6665b777376a176bd46a1286c3199ed0da26ae6
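For reference, a hedged sketch of what a deterministic-mode switch like this typically does under the hood in PyTorch (the exact d2go wiring may differ):

```python
import os
import random

import numpy as np
import torch

def set_deterministic(seed: int = 42) -> None:
    # Seed every RNG that can influence training.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Force deterministic kernels; ops without a deterministic
    # implementation raise instead of silently diverging.
    torch.use_deterministic_algorithms(True)
    torch.backends.cudnn.benchmark = False
    # Required by cuBLAS for deterministic matmuls on CUDA >= 10.2.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"
```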
-
Ajinkya Deogade authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/546 Here we start modularizing the targets. I had to introduce some temporary hacks to break the circular dependency while keeping the diff atomic. There are some TODOs left at the end of the stack that are still WIP. Reviewed By: tglik Differential Revision: D45912076 fbshipit-source-id: 375f579fe749dd4a588908cdca7b76ba68f1048f
-
Ajinkya Deogade authored
Summary: There is an issue with the relative import in the `__init__` file of modeldef that causes tests on GitHub CI to fail. Specifically, the `FBNetV2ModelArch` is not correctly populated. The internal CI does not detect such failures because we use the buck build system. This diff fixes it. Pull Request resolved: https://github.com/facebookresearch/d2go/pull/547 Reviewed By: patricksnape Differential Revision: D46177424 fbshipit-source-id: 06b23b9b221c990cd15a2debff6def8cfb99743b
-
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/544 The previous memory profiler diff, D45673764, didn't pick up a config key name change, causing an attribute-not-found error. This diff fixes it and adds two unit tests (one with GPU, one without) for using the memory profiler in the runner. Reviewed By: wat3rBro Differential Revision: D46114730 fbshipit-source-id: d066d435021983d90f4a75e0c88798a3aedcaf92
-
- 24 May, 2023 1 commit
-
Ajinkya Deogade authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/545 Expanding the relative imports to absolute ones helps the autodeps down the stack. Reviewed By: tglik Differential Revision: D45912074 fbshipit-source-id: d42c9756dde731504ee6fd0f93cf549d71157489
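An illustrative before/after (the module path here is hypothetical):

```python
# Before: relative import, which tooling like autodeps resolves less easily:
#   from .modeling import build_model
# After: absolute import spelling out the full package path:
from d2go.modeling import build_model
```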
-
- 22 May, 2023 1 commit
-
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/542

## Overview

Add an option to enable the GPU memory snapshot profiler in d2go. The profiler is natively supported by PyTorch and records stack traces for all CUDA memory allocation/free events, allowing users to understand which parts of the code contribute to the memory bottleneck. It also provides a powerful interactive web tool to visualize memory utilization over time: {F978609840} Each colored block represents an allocated CUDA memory block. Users can click on a block to see the Python stack trace that allocated it.

## d2go integration

This diff integrates the profiler as a hook controlled by the config key `USE_MEMORY_PROFILER`. The profiler logs snapshots and the web tool to the output directory. Logging can happen in three places: at the start of training, during training, and on OOM. Please read the docstring of `D2GoGpuMemorySnapshot` for more information. Reviewed By: tglik, jaconey Differential Revision: D45673764 fbshipit-source-id: 8900484a2266d94421fe3ee7a85a4dea3a9f6b72
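A hedged sketch of driving PyTorch's underlying snapshot API directly — the d2go hook wraps something along these lines; `_record_memory_history`/`_dump_snapshot` are semi-private and their signatures vary across PyTorch versions:

```python
import torch

# Start recording stack traces for all CUDA allocation/free events.
torch.cuda.memory._record_memory_history(max_entries=100_000)

model = torch.nn.Linear(1024, 1024).cuda()
out = model(torch.randn(64, 1024, device="cuda"))
out.sum().backward()

# Dump a snapshot; load it in PyTorch's interactive memory_viz web tool
# to inspect the per-block stack traces described above.
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```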
-
- 19 May, 2023 1 commit
-
Yanghan Wang authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/543 The previous implementation:

> the problem is the ContextDecorator somehow swallows the exception in the wrapped function and just returns None.

This diff adds a test that the previous implementation would fail:

```
======================================================================
FAIL: test_log_interval_error_prop (d2go.tests.fb.test_utils_logging.TestUtilsLogging)
Make sure the log_interval can handle error propagation.
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/sandcastle/boxes/fbsource/buck-out/v2/gen/fbcode/ef4169ac7f95fb74/mobile-vision/d2go/tests/__init_tests__/init_tests#link-tree/d2go/tests/fb/test_utils_logging.py", line 152, in test_log_interval_error_prop
    foo(-1)
AssertionError: ValueError not raised
----------------------------------------------------------------------
Ran 1 test in 0.098s
```

The new version is easier to understand and doesn't swallow errors. Reviewed By: jaconey Differential Revision: D46009938 fbshipit-source-id: 6b632deb513ab47c4d760f796bf49fc45eae3005
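For context, a minimal illustration of the failure mode (class and function names hypothetical): any context manager whose `__exit__` returns a truthy value suppresses exceptions raised in the body, so a decorated function silently returns None:

```python
import time
from contextlib import ContextDecorator

class swallowing_log_interval(ContextDecorator):
    def __enter__(self):
        self.start = time.perf_counter()
        return self

    def __exit__(self, exc_type, exc, tb):
        print(f"took {time.perf_counter() - self.start:.3f}s")
        return True  # bug: truthy return suppresses any in-flight exception

@swallowing_log_interval()
def foo(x):
    if x < 0:
        raise ValueError("negative input")
    return x

print(foo(-1))  # prints None instead of raising ValueError
```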
-
- 18 May, 2023 1 commit
-
Jiaxu Zhu authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/541 Issue post: https://fb.workplace.com/groups/277527419809135/permalink/1303604910534709/ The fix was suggested by the MV folks. Reviewed By: dilinwang820, wat3rBro Differential Revision: D45881863 fbshipit-source-id: b33345c4230067b78f27e7deb038c095d55f1360
-
- 16 May, 2023 1 commit
-
Jiaxu Zhu authored
Summary: X-link: https://github.com/facebookresearch/detectron2/pull/4955 Pull Request resolved: https://github.com/facebookresearch/d2go/pull/540 Allow users to launch deterministic training jobs. That is, using the same training config, users can get identical training results. Reviewed By: dilinwang820 Differential Revision: D45370627 fbshipit-source-id: 88db388c992500b0d789b8341952502cd1f8f995
-
- 12 May, 2023 1 commit
-
Jack Zhang authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/538 We want a `log_interval` helper to measure the execution time of a function. Reviewed By: wat3rBro Differential Revision: D45751279 fbshipit-source-id: fe25d3fedd32f61b64e978881b6547d3bc1acb22
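A minimal sketch of such a helper, assuming a context-manager/decorator design (the actual d2go implementation may differ); the `try/finally` with no exception suppression is what the later fix in D46009938 guarantees:

```python
import logging
import time
from contextlib import contextmanager

logger = logging.getLogger(__name__)

@contextmanager
def log_interval(name: str):
    """Log how long the wrapped block (or decorated function) takes."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # No exception handling here, so errors propagate to the caller.
        logger.info("%s took %.3fs", name, time.perf_counter() - start)

@log_interval("foo")  # @contextmanager objects also work as decorators
def foo(x):
    return x * 2
```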
-
- 10 May, 2023 1 commit
-
Mik Vyatskov authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/537 For some reason numba cannot handle `print` being overwritten by a local variable; however, when the override is a module attribute, it seems to work. Reviewed By: navsud Differential Revision: D45730776 fbshipit-source-id: fee1288b1adb43f69fe7c4e43f4a8a750f0b98b4
-
- 08 May, 2023 1 commit
-
Jiaxu Zhu authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/531 As the title says, enable mixed-precision FX quantization for the FBS model. This diff:

1. Adds `custom_prepare_fx` to the FBS d2go model to enable FX quantization.
2. Adds two new d2go config params, `QUANTIZATION.ACT_BITS`/`QUANTIZATION.WEIGHTS`.
3. Adds `backend_config`/`qconfig_mapping` to d2go convert function calls.
4. Adds an example FBS FX QAT config.

Reviewed By: ayushidalmia Differential Revision: D45252545 fbshipit-source-id: 813b192fcdd66c17629490b8908ce8cd8534506a
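A hedged sketch of FX-graph-mode QAT prepare/convert in PyTorch, which hooks like `custom_prepare_fx` typically wrap (the toy model and backend choice are illustrative):

```python
import torch
from torch.ao.quantization import get_default_qat_qconfig_mapping
from torch.ao.quantization.quantize_fx import convert_fx, prepare_qat_fx

model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU()).train()
qconfig_mapping = get_default_qat_qconfig_mapping("qnnpack")
example_inputs = (torch.randn(1, 3, 32, 32),)

# Insert fake-quant observers into the traced graph for QAT.
prepared = prepare_qat_fx(model, qconfig_mapping, example_inputs)
# ... run the QAT training loop on `prepared` ...
# A custom backend_config/qconfig_mapping can be passed to control
# per-op quantization (e.g. activation/weight bit widths).
quantized = convert_fx(prepared)
```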
-
- 07 May, 2023 1 commit
-
John Lee authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/536 This diff instruments checkpointing with signposts for FSDPCheckpointer, using D44278485 as a reference. Reviewed By: miqueljubert Differential Revision: D45524792 fbshipit-source-id: 9b7e004e6853141ee26d65ae11f79b1f5f5db0e6
-
- 02 May, 2023 1 commit
-
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/535 Use `FSDP.STATE_DICT_TYPE = SHARDED_STATE_DICT` for FSDP checkpointing by default. `FSDP.USE_LOCAL_STATE_DICT` will be deprecated in the future.

# Note

After this change, config usage of `FSDP.USE_LOCAL_STATE_DICT` is no longer picked up by the code: it is superseded by the default value of `FSDP.STATE_DICT_TYPE`.

Reviewed By: tglik Differential Revision: D45413143 fbshipit-source-id: e7bc2d5dc04ac09004cb89353333be020a9c80b5
-
- 01 May, 2023 3 commits
-
Richard Barnes authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/533 The pattern

```
X.Y if hasattr(X, "Y") else Z
```

can be replaced with

```
getattr(X, "Y", Z)
```

The [getattr](https://www.w3schools.com/python/ref_func_getattr.asp) function gives more succinct code than the [hasattr](https://www.w3schools.com/python/ref_func_hasattr.asp) function. Please use it when appropriate. **This diff is very low risk. Green tests indicate that you can safely Accept & Ship.** Differential Revision: D44886687 fbshipit-source-id: f3f0265251bf8008ae927b767da5749bf6828c2c
-
Zhicheng Yan authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/532 Enable the visualization of panoptic segmentation. Reviewed By: tglik Differential Revision: D45334039 fbshipit-source-id: eebd9316d56d8132a5d3c166058ae18a0e88e928
-
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/534 Currently, d2go supports 2 checkpointers, 2 distributed modes, and 3 checkpointing modes. The many options make it hard to maintain and manage all use cases. For example, after the recent migration to FSDP sharded_state_dict, it's hard to understand and trace down usage of the deprecated version. Per crassirostris and wat3rBro's advice, this diff adds API logging to better keep track of checkpointer usage in d2go.

## Appendix

- 2 checkpointers: FSDPCheckpointer, AIInfraCheckpointer
- 2 distributed modes: ddp, fsdp
- 3 checkpointing modes (fsdp only): local_state_dict, sharded_state_dict, full_state_dict

Reviewed By: tglik Differential Revision: D45385021 fbshipit-source-id: 5d2cb115ed0fdada254b819793e376e410ecd97d
-
- 21 Apr, 2023 1 commit
-
Tao Xu authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/527

- Add `model.reset_generation_counter()` to enable the diffusion visualization evaluators to run on multiple test datasets.
- Before this fix, the visualization evaluators would only run on the 1st test dataset: `self.generation_counter` drops below 0 after running on the 1st test dataset, so the visualization evaluators skip all the other test sets.
- Use DDIM for the upsampler by default, for better results.

Reviewed By: zechenghe Differential Revision: D45058672 fbshipit-source-id: 2f7919bf6ecd2e5f6f242ce3e7891cb3dc8d6af4
-
- 20 Apr, 2023 2 commits
-
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/530 Add options to include/exclude model buffers and frozen parameters in EMA state via two new config keys `MODEL_EMA.INCLUDE_FROZEN` and `MODEL_EMA.INCLUDE_BUFFER` Reviewed By: tglik Differential Revision: D45129625 fbshipit-source-id: 895ebe7e4f8e15566c3c3bddd852dd98c40a27b1
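A hedged sketch of what such include/exclude options might control when building the EMA state (helper name and exact semantics are illustrative, not d2go's actual implementation):

```python
import torch.nn as nn

def build_ema_state(
    model: nn.Module,
    include_frozen: bool = True,   # cf. MODEL_EMA.INCLUDE_FROZEN
    include_buffers: bool = True,  # cf. MODEL_EMA.INCLUDE_BUFFER
) -> dict:
    state = {}
    for name, param in model.named_parameters():
        if include_frozen or param.requires_grad:
            state[name] = param.detach().clone()
    if include_buffers:
        # Buffers such as BatchNorm running stats.
        for name, buf in model.named_buffers():
            state[name] = buf.detach().clone()
    return state
```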
-
Tsahi Glik authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/529 Set a config param for enabling the async metrics writing added in D44305165. Use it in the LDM Pokemon config as the first use case. Reviewed By: sf-wind Differential Revision: D44335491 fbshipit-source-id: b000502e6ed0e19a10d6fe3a7470bcd3045e7717
-
- 18 Apr, 2023 1 commit
-
Chien-Chin Huang authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/528 Not passing an optimizer object to shard_full_optim_state_dict() is being deprecated. This diff passes the optimizer to shard_full_optim_state_dict(). Reviewed By: YanjunChen329 Differential Revision: D45065185 fbshipit-source-id: 0abec3eeff6e7c626eefc432c73e38779a6f02d9
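A hedged sketch of the API change (wrapped in a function since the model, optimizer, and full optimizer state dict come from a distributed setup not shown here):

```python
from typing import Any, Dict

import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def reshard_optim_state(
    full_osd: Dict[str, Any],
    fsdp_model: FSDP,
    optimizer: torch.optim.Optimizer,
) -> Dict[str, Any]:
    # Deprecated: FSDP.shard_full_optim_state_dict(full_osd, fsdp_model)
    # Preferred: pass the optimizer object explicitly.
    return FSDP.shard_full_optim_state_dict(full_osd, fsdp_model, optim=optimizer)
```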
-
- 11 Apr, 2023 2 commits
-
Fei Sun authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/526 Add a config variable, DDP_GRADIENT_AS_BUCKET_VIEW, and pass it to DDP. Enabling it reduces the memory consumption of the model. Reviewed By: tglik Differential Revision: D44273339 fbshipit-source-id: 272e2ffbea89532a55df0ebdb3bd49f0df7d78a5
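For reference, the corresponding PyTorch DDP flag (the wrapping helper is illustrative):

```python
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_ddp(module: nn.Module, local_rank: int) -> DDP:
    return DDP(
        module,
        device_ids=[local_rank],
        # Make .grad tensors views into the communication buckets,
        # avoiding a second full copy of all gradients.
        gradient_as_bucket_view=True,
    )
```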
-
Fei Sun authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/525 In d2go, pass the argument ZERO_GRAD_BEFORE_FORWARD to the detectron runtime. Reviewed By: tglik Differential Revision: D44267319 fbshipit-source-id: 3bd5874bea96ac381fb49972a2dfe9bb52005a7d
-
- 05 Apr, 2023 2 commits
-
Mik Vyatskov authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/523 To avoid setting it up multiple times, add a run_once() decorator. Additionally, make sure logging is configured for dataloading workers, which have a different entry point, by moving the logging setup to import time. Currently, when a dataloader worker is created with the multiprocessing module's spawn method, a new Python interpreter starts, all modules are imported anew, and the entry point is set to the specified method. This means the training framework's entry point is skipped, together with the logging setup. With this change, logging is configured at import time: even though the training main (train_net) is not invoked as the entry point in a dataloading process, it is still imported in the child process, so logging still gets configured. Reviewed By: miqueljubert Differential Revision: D44641142 fbshipit-source-id: 06ea85363d965b31d7f9ade3c2615ed9db67470b
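A minimal sketch of a `run_once()` decorator and the import-time call pattern described above (the actual d2go helper may differ):

```python
import functools
import logging

def run_once():
    """Make the decorated function a no-op after its first call
    within a given process."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if not wrapper._has_run:
                wrapper._has_run = True
                return fn(*args, **kwargs)
        wrapper._has_run = False
        return wrapper
    return deco

@run_once()
def initial_logging_setup():
    logging.basicConfig(level=logging.INFO)

# Called at import time: spawned dataloader workers re-import this module
# even though they never run the trainer entry point, so logging is still
# configured in every child process.
initial_logging_setup()
```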
-
Anthony Chen authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/522 Change d2go's default FSDP sharding strategy to grad_optim, which corresponds to ShardingStrategy.SHARD_GRAD_OP in the FSDP API, or ZeRO-2 in the literature. grad_optim has been shown to offer the best tradeoff between memory utilization and training speed for mid-sized models. `FSDP.ALGORITHM = ""` was part of the previous design, indicating that no FSDP is used; it no longer works. Reviewed By: tglik Differential Revision: D44657184 fbshipit-source-id: 3888eea5f2b5042269e69453f3cdd8db7cf1581c
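For reference, a hedged sketch of selecting this strategy with the PyTorch FSDP API directly (distributed initialization omitted):

```python
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

def wrap_fsdp_grad_optim(module: nn.Module) -> FSDP:
    # SHARD_GRAD_OP (ZeRO-2): shard gradients and optimizer state across
    # ranks, but keep full parameters on each rank between forward/backward.
    return FSDP(module, sharding_strategy=ShardingStrategy.SHARD_GRAD_OP)
```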
-
- 03 Apr, 2023 1 commit
-
Grisha Temchenko authored
Summary: Correction in docs. Related issue: https://github.com/facebookresearch/d2go/issues/514 Pull Request resolved: https://github.com/facebookresearch/d2go/pull/515 Reviewed By: crassirostris Differential Revision: D44546569 fbshipit-source-id: fec3797bad15b55833d9278c19978ff9c312d963
-
- 31 Mar, 2023 2 commits
-
Mik Vyatskov authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/521 Further along in the setup, D2Go loggers will have their logging level set to DEBUG anyway; setting the level to DEBUG for every process introduces unnecessary logs. Reviewed By: miqueljubert Differential Revision: D44561105 fbshipit-source-id: 536f75bb886aec644207933e9baeb91a862a7ca7
-
Mik Vyatskov authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/510 This change allows the initial logging setup to be configured more granularly, as part of a separate module. Reviewed By: tglik Differential Revision: D44278485 fbshipit-source-id: 2f421ee4e7f9017ef8ebccb9ff51f4177b8628b9
-
- 30 Mar, 2023 4 commits
-
David Yan authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/520

- Move the gather/scatter functions into their own util module
- Onboard AIInfraCheckpointer to the gather/scatter functions for optimizer and EMA state
- Add a test for the FSDP checkpointer and AI Infra checkpointer

Reviewed By: YanjunChen329 Differential Revision: D44400633 fbshipit-source-id: bcfe3e0a4fbf53f91a83e88f74c4538699a50293
-
David Yan authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/519 Prior to this, the FSDP checkpointer did not save an EMA state that matched the model state when the model used a sharded state dict. This diff adds that functionality. Reviewed By: YanjunChen329 Differential Revision: D44270790 fbshipit-source-id: f522765ad56e8279f355c43a19f26c3b6bcf01e3
-
Mircea Cimpoi authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/518 Enable profiling for eval step only, not on every eval (which can be called during training) Reviewed By: frabu6 Differential Revision: D44535915 fbshipit-source-id: 4497a3f74f5d751277df9ed41bc9bf21056341c4
-
Anton Rigner authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/516

# Context

D2go allows training with more than one dataset, and as long as the categories are consistent, the IDs do not necessarily have to correspond to each other between the annotations of two different datasets. The data is still loaded correctly into the data loader, and training works as expected.

# Problem

However, I observed strange mis-labeling issues in the Visualizer for Tensorboard. Originally I thought this was a data/conversion issue, but upon inspecting the logs I saw that the data is loaded correctly. See the example below. {F924075931} "Plant" labelled as "Refrigerator", "Floor" labelled as "Lamp" {F924078113} ... but the loaded annotations don't actually contain any samples of "Refrigerator". The reason is that the Visualizer always loads the metadata (and thus the labels) from the first train dataset, but the order of the categories between the datasets may not be consistent while still being a valid training run.

# Fix

If there is a dataset name associated with the data to visualize, use it to fetch the metadata, and thus the correct labels; otherwise, default to the first dataset (the current behavior).

Reviewed By: wat3rBro Differential Revision: D44495363 Privacy Context Container: L1127277 fbshipit-source-id: 37b940d393aa794cd2f39aabdc66c6d23abd8000
-
- 26 Mar, 2023 1 commit
-
Peizhao Zhang authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/513 Support specifying the backend for the testing helper. Reviewed By: tglik Differential Revision: D44401470 fbshipit-source-id: 9c7962cf40d3c677f9a3c7bfa9cdf5dcecae2ba9
-
- 24 Mar, 2023 2 commits
-
David Yan authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/511 Add tests for sharded_state_dict integration in the AIF Checkpointer. Fix compatibility problems, including:

1. small API errors with flatten_sharded_optim_state_dict
2. deprecating model.use_local_state_dict and model.load_local_state_dict
3. fixing auto conversion for local_state_dict
4. fixing T148056077: add metadata to differentiate between local_state_dict and sharded_state_dict when loading a directory with FSDPCheckpointer

Reviewed By: YanjunChen329 Differential Revision: D44160045 fbshipit-source-id: f607b7076d0e49b9407f9adfbc8ecfe439c3b0c9
-
David Yan authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/512 Currently, when saving and loading checkpoints for FSDP-wrapped modules, we use `StateDictType.LOCAL_STATE_DICT`, where the state_dict becomes essentially a single flat tensor under the `_flat_param` key (or some other layer-specific key for flat weights). This means that:

1. It's impossible to load weights directly from checkpoints, for example in notebooks.
2. Converting from a local to a global checkpoint requires running a special workflow (https://fburl.com/code/6yqa4ldb) that occupies the same number of GPUs as was used during training.

This diff adds an option, `FSDP.STATE_DICT_TYPE`, which allows selecting the type of state dict to save (local, sharded, full). In sharded mode with AIF checkpointing, we get the benefit of loading state dicts locally within minutes, with any number of GPUs, in notebooks and elsewhere. Note: for backwards compatibility, `CFG.FSDP.use_local_state_dict` and `CFG.FSDP.load_local_state_dict` still need to work when the new config parameter (`CFG.FSDP.state_dict_type`) is not set. They are also used to signify that local/sharded state dicts need to be converted to a full state dict when loading. This functionality can be deprecated once everyone migrates to AIF checkpointing with sharded dicts. Reviewed By: YanjunChen329 Differential Revision: D43840887 fbshipit-source-id: d112f7b7ad97ba82fd5bf1da986b95ad7fc61c42
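For reference, a hedged sketch of how the state-dict flavor is selected with the PyTorch FSDP API (distributed setup omitted):

```python
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

def get_sharded_state_dict(model: FSDP) -> dict:
    # Within this context, state_dict() returns sharded tensors that each
    # rank can save and load independently, instead of LOCAL_STATE_DICT's
    # opaque _flat_param blobs or FULL_STATE_DICT's rank-0 gather.
    with FSDP.state_dict_type(model, StateDictType.SHARDED_STATE_DICT):
        return model.state_dict()
```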
-
- 23 Mar, 2023 1 commit
-
Mik Vyatskov authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/509 The print function is used all over the place, and it's not realistic to stop everyone from using print. So this diff attempts to improve the debuggability of code written using prints by redirecting prints to the logging module. Additionally, call the logger setup from `setup_after_launch` to make sure logging settings are applied in every spawned process. Reviewed By: frabu6, wat3rBro Differential Revision: D44280241 fbshipit-source-id: 713400ac2b2edacef3c7a99067cbb1e684c3c5ad
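A hedged sketch of such a print-to-logging redirect (function names illustrative; the later numba-related fix in D45730776 is why the override must live at module scope):

```python
import builtins
import logging

logger = logging.getLogger("print")
_builtin_print = builtins.print

def _print_to_logger(*args, sep=" ", **kwargs):
    # Route print() output through logging so it picks up timestamps,
    # process/rank info, and whatever handlers are configured.
    logger.info(sep.join(str(a) for a in args))

def redirect_print_to_logging() -> None:
    builtins.print = _print_to_logger

def restore_print() -> None:
    builtins.print = _builtin_print
```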
-
- 22 Mar, 2023 2 commits
-
Mircea Cimpoi authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/508 Avoid unnecessary restriction to base class Trainer. Subclasses of `SimpleTrainer` would work as well. Reviewed By: wat3rBro Differential Revision: D44221069 fbshipit-source-id: a666977b2073b4525b4c6940c121f6b05466e5d7
-
Yanghan Wang authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/507 Reviewed By: crassirostris Differential Revision: D44269996 fbshipit-source-id: 91b313aeb820ec39e60c29c4c1bd9e669e1f7a6b
-
- 21 Mar, 2023 1 commit
-
Denis Savenkov authored
Summary: Pull Request resolved: https://github.com/facebookresearch/d2go/pull/505 Fused optimizers can only run on CUDA, so this diff makes the changes necessary to enable remote execution for GPU tests, following: https://www.internalfb.com/intern/wiki/Pytorch_Ecosystem_Foundation_(EcoF)/PyTorch_Training/PyTorch_Lightning/Getting_Started/Testing/Adding_GPU_Unit_tests_using_RE/ Reviewed By: ertrue Differential Revision: D44113380 fbshipit-source-id: 34a06813a894f4de6e5731f78ef7f2cf11f18a06
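For context, a short example of requesting a fused optimizer, which at the time required parameters on a CUDA device — hence the need for GPU unit tests:

```python
import torch

model = torch.nn.Linear(16, 16).cuda()
# fused=True runs the parameter update as a single fused CUDA kernel;
# on the PyTorch versions contemporary with this diff, constructing it
# with CPU parameters raises an error.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, fused=True)
```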
-