1. 02 May, 2024 1 commit
  2. 24 Apr, 2024 1 commit
  3. 03 Apr, 2024 1 commit
  4. 02 Apr, 2024 1 commit
  5. 27 Mar, 2024 1 commit
  6. 19 Mar, 2024 1 commit
    • distributed FSDP model initialization · abdad994
      Geet Sethi authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/656
      
      Enable distributed FSDP model initialization. This iteratively moves and shards the model onto GPUs, allowing training of models that exceed single-GPU HBM capacity and that cannot be instantiated multiple times on a single host.
      
      The flow is as follows:
      1. Rank 0 initializes the whole model on CPU using the existing code paths, while all other ranks initialize an 'empty' model using fake tensors.
      2. Once this is complete and initialization moves to FSDP, distributed init traverses the model 'bottom-up', transferring all params/buffers from rank 0 to all other ranks while simultaneously wrapping modules in FSDP whenever possible (based on the specified config). Thus modules are sharded (and memory usage distributed) at the earliest possible point using the existing FSDP API/implementation.
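
      For context, the flow above builds on PyTorch's standard rank-0-materialize / broadcast pattern for FSDP. Below is a minimal sketch of that pattern only, not the bottom-up implementation in this diff; it uses meta tensors in place of the fake tensors mentioned above, and `build_fn` is an illustrative placeholder, not a d2go API. It assumes `torch.distributed` is already initialized with a CUDA backend.

      ```python
      import torch
      import torch.distributed as dist
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      def build_fsdp_model(build_fn):
          if dist.get_rank() == 0:
              # Rank 0 materializes the full model on CPU via the normal code path.
              model = build_fn()
          else:
              # Other ranks build a weightless placeholder model on the meta device.
              with torch.device("meta"):
                  model = build_fn()
          # sync_module_states broadcasts rank 0's params/buffers to all ranks;
          # param_init_fn materializes meta tensors before the broadcast copies in.
          return FSDP(
              model,
              device_id=torch.cuda.current_device(),
              sync_module_states=True,
              param_init_fn=lambda m: m.to_empty(device="cuda", recurse=False),
          )
      ```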
      
      Reviewed By: XiaoliangDai
      
      Differential Revision: D54287718
      
      fbshipit-source-id: 16d63d78065d1fca0c6baf7a385f666a4e1b2a5f
  7. 14 Mar, 2024 1 commit
  8. 10 Mar, 2024 1 commit
    • ensure metadata thing_classes consistency with multiple datasets and category filtering · 1216c225
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/653
      
      # Changes
      In Mask2Former RC4 training, we need to use a particular weighted category training sampler where `DATALOADER.SAMPLER_TRAIN = "WeightedCategoryTrainingSampler"`.
      
      Also, multiple datasets are used, and their sets of categories are not exactly identical. Some datasets have more categories (e.g. Exo-body) than other datasets that do not have exo-body annotations.
      
      We also use category filtering by setting `D2GO_DATA.DATASETS.TRAIN_CATEGORIES` to a subset of the full categories.
      
      In this setup, D2GO currently complains that metadata.thing_classes is NOT consistent across datasets (https://fburl.com/code/k8xbvyfd).
      
      The reason is that when category filtering is used, D2GO writes a temporary dataset json file (https://fburl.com/code/slb5z6mc).
      This tmp json file is then loaded when we get the dataset dicts from DatasetCatalog (https://fburl.com/code/5k4ynyhc). Meanwhile, the metadata in MetadataCatalog for this category-filtered dataset is also updated based on the categories stored in this tmp file.
      
      Therefore, we must ensure that the categories stored in the tmp files are consistent across the category-filtered datasets.
      
      In this diff, we update the logic for writing these tmp dataset json files (see the sketch below).
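
      A hedged sketch of the invariant this diff enforces (the helper name is illustrative, not the actual d2go code): after filtering, every category-filtered dataset must end up with identical thing_classes, which is why the category lists written into the tmp json files have to agree.

      ```python
      # Illustrative check of the invariant, not the d2go implementation.
      def check_thing_classes_consistent(metadata_by_dataset):
          reference = None
          for name, metadata in metadata_by_dataset.items():
              classes = list(metadata["thing_classes"])
              if reference is None:
                  reference = classes
              elif classes != reference:
                  raise ValueError(
                      f"thing_classes for dataset '{name}' differ from the "
                      f"first dataset: {classes} vs {reference}"
                  )
      ```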
      
      # Github CI test
      Note that **CI / python-unittest-cpu** is shown as failed with the error below. I do not think it is related to the changes in this diff: the error concerns an observer during QAT model training, whereas the changes in this diff concern dataset preparation.
      
      ```
      Traceback (most recent call last):
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 155, in train
          self.run_step()
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 310, in run_step
          loss_dict = self.model(data)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1536, in _call_impl
          return forward_call(*args, **kwargs)
        File "/home/runner/work/d2go/d2go/tests/runner/test_runner_default_runner.py", line 44, in forward
          ret = self.conv(images.tensor)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1590, in _call_impl
          hook_result = hook(self, args, result)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/quantize.py", line 131, in _observer_forward_hook
          return self.activation_post_process(output)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1536, in _call_impl
          return forward_call(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/fake_quantize.py", line 199, in forward
          _scale, _zero_point = self.calculate_qparams()
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/fake_quantize.py", line 194, in calculate_qparams
          return self.activation_post_process.calculate_qparams()
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/observer.py", line 529, in calculate_qparams
          return self._calculate_qparams(self.min_val, self.max_val)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/observer.py", line 328, in _calculate_qparams
          if not check_min_max_valid(min_val, max_val):
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/utils.py", line 346, in check_min_max_valid
          assert min_val <= max_val, f"min {min_val} should be less than max {max_val}"
      AssertionError: min 3.8139522075653076e-05 should be less than max -3.8139522075653076e-05
      ```
      
      Reviewed By: ayushidalmia
      
      Differential Revision: D54665936
      
      Privacy Context Container: L1243674
      
      fbshipit-source-id: 322ab4a84a710b03fa39b39fa81117752d369ba5
  9. 03 Mar, 2024 1 commit
    • apply Black 2024 style in fbcode (7/16) · 2256bdb7
      Amethyst Reese authored
      Summary:
      Formats the covered files with pyfmt.
      
      paintitblack
      
      Reviewed By: aleivag
      
      Differential Revision: D54447732
      
      fbshipit-source-id: e21fbbe27882c8af183d021f4ac27029cbe93e8e
  10. 23 Feb, 2024 1 commit
    • pt2e quantization support in D2Go · 09bd2869
      Naveen Suda authored
      Summary: Add pt2e quantization support in D2Go.
      
      Reviewed By: chakriu
      
      Differential Revision: D54132092
      
      fbshipit-source-id: 34a9ba79a5eb49ed27a3f33454078b0df37cf2f0
  11. 17 Feb, 2024 1 commit
  12. 08 Feb, 2024 1 commit
  13. 04 Feb, 2024 1 commit
  14. 17 Jan, 2024 1 commit
    • expose example_input argument in setup_qat_model() · 3c6f71b4
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/647
      
      Major changes
      - The **example_input** argument of **prepare_fake_quant_model()** is useful in certain cases. For example, in the Argos model's **custom_prepare_fx()** method under the FX graph + QAT setup (D52760682), it is used to prepare example inputs for individual sub-modules by running one forward pass and bookkeeping each sub-module's inputs. Therefore, we expose the **example_input** argument in the **setup_qat_model()** function.
      - For a QAT model, we currently assert that the number of state dict keys (excluding observers) equals the number of state dict keys in the original model. However, when the assertion fails, it does not log useful information for debugging. We change it to report the keys that are unique to each state dict (a sketch of such reporting follows this list).
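
      A minimal sketch of the kind of reporting described in the second bullet (the helper name is illustrative, not the d2go implementation):

      ```python
      def report_state_dict_key_mismatch(original_keys, qat_keys):
          # Show exactly which keys each state dict has that the other lacks,
          # instead of only asserting that the key counts are equal.
          original_keys, qat_keys = set(original_keys), set(qat_keys)
          only_in_original = sorted(original_keys - qat_keys)
          only_in_qat = sorted(qat_keys - original_keys)
          raise AssertionError(
              "State dict key mismatch.\n"
              f"Keys only in the original model: {only_in_original}\n"
              f"Keys only in the QAT model: {only_in_qat}"
          )
      ```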
      
      Reviewed By: navsud
      
      Differential Revision: D52760688
      
      fbshipit-source-id: 27535a0324ebe6513f198acb839918a0346720d0
  15. 16 Jan, 2024 1 commit
  16. 12 Jan, 2024 1 commit
    • consolidate deterministic settings · 573bd454
      Kapil Krishnakumar authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/644
      
      This diff consolidates deterministic settings in D2Go. In `default_runner.py`, `torch.set_float32_matmul_precision("highest")` is called to set matrix multiplication to the highest available precision. In `setup.py`, `torch.backends.cudnn.deterministic` is set to `True` and `torch.backends.cudnn.allow_tf32` is set to `False` to avoid non-deterministic PyTorch and CUDA algorithms during training. `torch.backends.cuda.matmul.allow_tf32` is also set to `False` to avoid non-deterministic matrix multiplication algorithms. Additionally, the `seed` function is used to set the seed for reproducibility.
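
      A condensed sketch of these settings gathered in one place (the helper name and default seed are illustrative, and `torch.manual_seed` stands in for the seeding helper; the real changes live in `default_runner.py` and `setup.py`):

      ```python
      import torch

      def make_deterministic(seed: int = 42) -> None:
          torch.set_float32_matmul_precision("highest")  # no reduced-precision matmul
          torch.backends.cudnn.deterministic = True      # deterministic cuDNN kernels
          torch.backends.cudnn.allow_tf32 = False        # no TF32 in cuDNN convolutions
          torch.backends.cuda.matmul.allow_tf32 = False  # no TF32 in CUDA matmuls
          torch.manual_seed(seed)                        # seed for reproducibility
      ```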
      
      Reviewed By: wat3rBro
      
      Differential Revision: D51796739
      
      fbshipit-source-id: 50e44ea50b0311b56a885db9f633491ac3002bd4
  17. 08 Jan, 2024 1 commit
  18. 04 Jan, 2024 1 commit
  19. 15 Dec, 2023 2 commits
    • allow to ignore state dict keys in QAT model · c2256758
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/642
      
      When we build a QAT model using the FX graph mode APIs **prepare_qat_fx** and **convert_fx**, they run symbolic tracing over **module.forward()**.
      
      In certain cases, such as when a module takes a constant tensor input, symbolic tracing adds new tensor attributes with the name prefix **_tensor_constant** (https://fburl.com/code/msc4ch4o), which become new keys in the QAT model state dict.
      
      The current implementation of **_setup_non_qat_to_qat_state_dict_map** asserts that the number of keys in the state dicts of the original and QAT models is the same.
      
      Thus, we extend the **qat_state_dict_keys_to_ignore** method with an argument that allows ignoring specified state dict keys in the QAT model.
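
      An illustrative sketch of the idea only (function and argument names are hypothetical, not the d2go signature): collect the keys to skip, including any caller-specified ones, before comparing the two state dicts.

      ```python
      def collect_keys_to_ignore(qat_state_dict, extra_keys_to_ignore=()):
          # Observer keys (named "activation_post_process*" by FX quantization)
          # are skipped as before; the new argument lets callers also skip keys
          # such as the "_tensor_constant*" attributes added by symbolic tracing.
          ignored = {k for k in qat_state_dict if "activation_post_process" in k}
          ignored.update(extra_keys_to_ignore)
          return ignored
      ```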
      
      Reviewed By: wat3rBro
      
      Differential Revision: D52152706
      
      fbshipit-source-id: 92219feae43bf8841b0a3a71adfbfcb84d8e8f95
    • do not fuse model again for a QAT model · 8f130231
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/643
      
      A QAT model contains observers. After QAT training, those observers hold updated statistics, such as min_val and max_val.
      
      When we want to export the FP32 QAT model for a sanity check, calling **fuse_utils.fuse_model()** again (it is often already called when we build the QAT model before QAT training) would remove the statistics in the observers.
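
      A hedged sketch of the guard described above (the flag and the fuse callable are illustrative; `fuse_utils.fuse_model` is referenced only as in the summary):

      ```python
      def maybe_fuse(model, fuse_fn, already_qat_prepared: bool):
          if already_qat_prepared:
              # Fusing again would rebuild the fused modules and drop the observer
              # statistics (min_val / max_val) accumulated during QAT training.
              return model
          return fuse_fn(model)  # e.g. fuse_utils.fuse_model
      ```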
      
      Reviewed By: wat3rBro
      
      Differential Revision: D52152688
      
      fbshipit-source-id: 08aa16f2aa72b3809e0ba2d346f1b806c0e6ede7
  20. 07 Dec, 2023 2 commits
  21. 30 Nov, 2023 1 commit
  22. 17 Nov, 2023 1 commit
    • Use the consolidated snapshot API in Unitrace to support Zoomer · 87649f4f
      Wei Sun authored
      Summary: Similar to D48210543. Update the training_hooks to use the Unitrace memory snapshot APIs. This allows us to maintain a single path for memory snapshot APIs, and also to collect important details such as the snapshot location for Zoomer.
      
      Pulled By: HugeEngine
      
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/636
      
      Reviewed By: frabu6, aaronenyeshi, jackiexu1992, mengluy0125
      
      Differential Revision: D48368150
      
      fbshipit-source-id: b279adfa29d390e615d2c32a7ab9e05d95b4f164
  23. 10 Nov, 2023 1 commit
  24. 09 Nov, 2023 1 commit
  25. 05 Nov, 2023 1 commit
    • allow to skip loading model weights in build_model() · f2a0c52c
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/630
      
      Currently, in the runner's **build_model()** method, when **eval_only=True** we always try to load model weights.
      This is quite restrictive in some cases. For example, we may just want to build a model in eval mode to profile its efficiency, before we have trained the model or written its weights to a checkpoint file.
      
      Thus, this diff adds an argument **skip_model_weights** that allows users to skip loading the model weights.
      Note that this diff is fully backward-compatible and is NOT expected to break existing implementations.
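
      An illustrative usage sketch of the intent (the exact call site and keyword form may differ; the wrapper function is hypothetical):

      ```python
      def build_eval_model_for_profiling(runner, cfg):
          # Build in eval mode but skip checkpoint loading, e.g. when no trained
          # weights exist yet and we only want to profile the architecture.
          return runner.build_model(cfg, eval_only=True, skip_model_weights=True)
      ```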
      
      Reviewed By: navsud, wat3rBro
      
      Differential Revision: D50623772
      
      fbshipit-source-id: 282dc6f19e17a4dd9eb0048e068c5299bb3d47c2
  26. 01 Nov, 2023 1 commit
  27. 23 Oct, 2023 1 commit
  28. 20 Oct, 2023 1 commit
  29. 12 Oct, 2023 1 commit
    • Enable training for fraction of total steps; enable early stopping from trial 0 · 3c724416
      Igor Fedorov authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/627
      
      Enable training for a fraction of the total steps: when doing HPO, users may want to train for a fraction of the number of training steps of a regular (baseline) training run. It is not enough to just change SOLVER.MAX_ITER, because that also changes the learning rate schedule. We introduce a multiplier applied on top of SOLVER.MAX_ITER when deciding how many steps to train for; this multiplier does not scale the number of steps over which the learning rate schedule is defined.
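
      A minimal sketch of the idea (the helper name is illustrative, not the config key added in this diff):

      ```python
      def effective_train_iters(max_iter: int, fraction: float) -> int:
          # The LR schedule remains defined over max_iter (SOLVER.MAX_ITER);
          # only the number of executed training steps is scaled.
          return int(max_iter * fraction)

      # e.g. with MAX_ITER = 90000 and fraction = 0.1, train for 9000 steps while
      # the LR schedule still anneals as if training ran the full 90000 steps.
      ```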
      
      Reviewed By: raghuramank100
      
      Differential Revision: D48699087
      
      fbshipit-source-id: 903f7c957ee471f36365c1449e9cd6a919fd260a
  30. 11 Oct, 2023 1 commit
  31. 05 Oct, 2023 1 commit
  32. 03 Oct, 2023 1 commit
  33. 27 Sep, 2023 2 commits
    • Make EMA checkpointing with FSDP more robust · 477629d0
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/615
      
      The previous FSDP EMA checkpointing logic directly handles `EMAState`: it manually calls `FSDP.summon_full_params()` to gather the full model params and reconstructs/loads an `EMAState` for checkpointing. This logic has two drawbacks:
      
      1. `FSDP.summon_full_params()` gathers all model weights at the same time, which could cause OOM issues if the model can't fit on a single GPU. This is quite common for FSDP workloads.
      2. Directly saving and loading `EMAState` is error-prone. The EMA state dict has different semantics and behavior than `model.state_dict()`, yet users often expect it to function seamlessly like the model state dict.
      
      This diff modifies the EMA save/load logic to use `model.state_dict()` directly, solving the above two pain points.
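
      For context, a hedged sketch of the safer gathering pattern that avoids summoning all full params at once (standard PyTorch FSDP API, not necessarily the exact d2go code path):

      ```python
      import torch.distributed.fsdp as fsdp
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      def gather_full_state_dict(fsdp_model):
          # Stream the full state dict out shard by shard, offloaded to CPU and
          # kept only on rank 0, instead of materializing every parameter on
          # each GPU as summon_full_params() would.
          cfg = fsdp.FullStateDictConfig(offload_to_cpu=True, rank0_only=True)
          with FSDP.state_dict_type(
              fsdp_model, fsdp.StateDictType.FULL_STATE_DICT, cfg
          ):
              return fsdp_model.state_dict()
      ```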
      
      Reviewed By: wat3rBro
      
      Differential Revision: D48813697
      
      fbshipit-source-id: be53c2677d2e493ba923508bbd82d9d295397941
    • add damit uri support in train_net local run · c668ed4e
      Min Xu authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/622
      
      as title
      
      Reviewed By: jiaxuzhu92, ywwwer
      
      Differential Revision: D49672980
      
      fbshipit-source-id: f34ffe944c25c948fe1abd492ea0b96e47dc5b06
  34. 25 Sep, 2023 1 commit
    • Propagate include_frozen/buffers to EMAState in FSDP FULL_STATE_DICT checkpoints · 206a05c6
      Ed Pizzi authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/620
      
      EMA can be configured to exclude frozen (`requires_grad=False`) parameters and buffers, reducing memory use and checkpoint size.
      
      However, `FULL_STATE_DICT` FSDP + EMA checkpoints construct an inner `EMAState` after unsharding the FSDP parameters. This inner `EMAState` uses the default `include_frozen` and `include_buffers` settings, resulting in checkpoints that contain frozen parameters and buffers regardless of the configured settings.
      
      Propagate `include_frozen` and `include_buffers` settings to the inner `EMAState` when gathering `FULL_STATE_DICT` FSDP EMA state.
      
      For frozen parameters, this change only takes effect together with a parallel fix to PyTorch FSDP that propagates `requires_grad` across parameter sharding/unsharding: https://github.com/pytorch/pytorch/pull/109892.
      
      Reviewed By: daveboat
      
      Differential Revision: D49517178
      
      fbshipit-source-id: 0fe159dcec9ec1f2c456ae2ee7798681e7536249
  35. 21 Sep, 2023 1 commit
  36. 11 Sep, 2023 1 commit
  37. 06 Sep, 2023 1 commit
    • Add check for empty bboxes · 66f626dd
      Karla Brkic authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/616
      
      The check for valid bboxes doesn't verify that the bbox list has exactly 4 elements, so training crashes instead of marking empty bboxes as invalid (see f472700454).
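
      A minimal sketch of the kind of check described (the helper name is illustrative, not the d2go code):

      ```python
      def is_valid_bbox(bbox) -> bool:
          # A bbox must be a sequence of exactly 4 coordinates; empty or
          # malformed lists are marked invalid instead of crashing training.
          return isinstance(bbox, (list, tuple)) and len(bbox) == 4
      ```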
      
      Reviewed By: tglik
      
      Differential Revision: D48653084
      
      fbshipit-source-id: 2d47fb267c5e51ab27798662ae739014f3d310e4