1. 22 Jun, 2024 1 commit
    • Ahmed Gheith's avatar
      adhere to lazy import rules · 040a7167
      Ahmed Gheith authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/668
      
      Lazy Imports change `Python` import semantics, specifically when it comes to the initialization of packages/modules: https://www.internalfb.com/intern/wiki/Python/Cinder/Onboarding/Tutorial/Lazy_Imports/Troubleshooting/
      
      For example, this pattern is not guaranteed to work:
      
      ```
      import torch.optim
      ...
      torch.optim._multi_tensor.Adam   # may fail to resolve _multi_tensor
      ```
      
      And this is guaranteed to work:
      
      ```
      import torch.optim._multi_tensor
      ...
      torch.optim._multi_tensor.Adam   # will always work
      ```
      
      A recent change to `PyTorch` changed module initialization logic in a way that exposed this issue.
      
      But the code has been working for years? That is the nature of undefined behavior: any change in the environment (in this case the `PyTorch` code base) can make it fail.
      
      Reviewed By: wat3rBro
      
      Differential Revision: D58876582
      
      fbshipit-source-id: c8f3f53605822517d646e57ddbf4359af54dba0d
      040a7167
  2. 19 Jun, 2024 1 commit
  3. 11 Jun, 2024 1 commit
  4. 08 May, 2024 1 commit
  5. 02 May, 2024 1 commit
  6. 24 Apr, 2024 1 commit
  7. 03 Apr, 2024 1 commit
  8. 02 Apr, 2024 1 commit
  9. 27 Mar, 2024 1 commit
  10. 19 Mar, 2024 1 commit
    • Geet Sethi's avatar
      distributed FSDP model initialization · abdad994
      Geet Sethi authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/656
      
      Enable distributed FSDP model initialization. This iteratively moves and shards the model onto GPUs, allowing the training of models that exceed a single GPU's HBM capacity and cannot be instantiated multiple times on a single host.
      
      The flow is as follows:
      1. Rank 0 initializes the whole model on CPU using the existing code paths, while all other ranks initialize an 'empty' model using fake tensors.
      2. Once this is complete and initialization moves to FSDP, distributed init traverses the model bottom-up, transferring all params/buffers from rank 0 to all other ranks while simultaneously wrapping modules in FSDP whenever possible (based on the specified config). Modules are thus sharded (and memory usage distributed) at the earliest possible point using the existing FSDP API/implementation; a minimal sketch of the same pattern is shown below.
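
      A rough sketch of the same rank-dependent initialization using stock PyTorch FSDP (an illustration only, not the D2Go code path); `build_model` is a hypothetical constructor, the process group is assumed to be initialized, and a recent PyTorch with meta-device support is assumed:

      ```
      import torch
      import torch.distributed as dist
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      def init_distributed_fsdp(build_model):
          rank = dist.get_rank()
          if rank == 0:
              model = build_model()              # full weights on CPU, rank 0 only
          else:
              with torch.device("meta"):
                  model = build_model()          # parameter shell without real storage
          return FSDP(
              model,
              device_id=torch.cuda.current_device(),
              sync_module_states=True,           # broadcast rank 0's params/buffers while sharding
              param_init_fn=lambda m: m.to_empty(device="cuda", recurse=False),
              # plug the config-driven auto_wrap_policy in here
          )
      ```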
      
      Reviewed By: XiaoliangDai
      
      Differential Revision: D54287718
      
      fbshipit-source-id: 16d63d78065d1fca0c6baf7a385f666a4e1b2a5f
      abdad994
  11. 14 Mar, 2024 1 commit
  12. 10 Mar, 2024 1 commit
    • Zhicheng Yan's avatar
      ensure metadata thing_classes consistency with multiple datasets and category filtering · 1216c225
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/653
      
      # Changes
      In Mask2Former RC4 training, we need to use a particular weighted category training sampler where `DATALOADER.SAMPLER_TRAIN = "WeightedCategoryTrainingSampler"`.
      
      Also, multiple datasets are used, and their category sets are not exactly identical: some datasets have more categories (e.g. Exo-body) than others that do not have exo-body annotations.
      
      We also use category filtering by setting `D2GO_DATA.DATASETS.TRAIN_CATEGORIES` to a subset of the full category set.
      
      In this setup, D2GO currently complains that metadata.thing_classes is NOT consistent across datasets (https://fburl.com/code/k8xbvyfd).
      
      The reason is that when category filtering is used, D2GO writes a temporary dataset json file (https://fburl.com/code/slb5z6mc).
      This tmp json file is loaded when we get the dataset dicts from DatasetCatalog (https://fburl.com/code/5k4ynyhc). Meanwhile, the metadata in MetadataCatalog for the category-filtered dataset is also updated based on the categories stored in this tmp file.
      
      Therefore, we must ensure that the categories stored in the tmp file are consistent across the category-filtered datasets.
      
      In this diff, we update the logic for writing such a tmp dataset json file; a minimal sketch of the idea is shown below.
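
      A minimal sketch (assuming a COCO-style dataset json; not the actual D2Go implementation) of keeping the `categories` block identical across filtered datasets by always writing the shared, filtered category list in a fixed order:

      ```
      import json

      def write_filtered_json(src_json_path, dst_json_path, keep_category_names):
          with open(src_json_path) as f:
              ds = json.load(f)

          # Dataset-independent category block: sorted names, contiguous ids.
          names = sorted(keep_category_names)
          new_id = {name: i + 1 for i, name in enumerate(names)}
          old_id_to_name = {c["id"]: c["name"] for c in ds["categories"]}

          ds["categories"] = [{"id": new_id[n], "name": n} for n in names]
          ds["annotations"] = [
              {**a, "category_id": new_id[old_id_to_name[a["category_id"]]]}
              for a in ds["annotations"]
              if old_id_to_name.get(a["category_id"]) in new_id
          ]
          with open(dst_json_path, "w") as f:
              json.dump(ds, f)
      ```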
      
      # Github CI test
      Note that **CI / python-unittest-cpu** is shown as failed with the error below. I do not think it is related to the changes in this diff: the error comes from an observer during QAT model training, while the changes here only touch dataset preparation.
      
      ```
      Traceback (most recent call last):
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 155, in train
          self.run_step()
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 310, in run_step
          loss_dict = self.model(data)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1536, in _call_impl
          return forward_call(*args, **kwargs)
        File "/home/runner/work/d2go/d2go/tests/runner/test_runner_default_runner.py", line 44, in forward
          ret = self.conv(images.tensor)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1590, in _call_impl
          hook_result = hook(self, args, result)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/quantize.py", line 131, in _observer_forward_hook
          return self.activation_post_process(output)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1536, in _call_impl
          return forward_call(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/fake_quantize.py", line 199, in forward
          _scale, _zero_point = self.calculate_qparams()
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/fake_quantize.py", line 194, in calculate_qparams
          return self.activation_post_process.calculate_qparams()
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/observer.py", line 529, in calculate_qparams
          return self._calculate_qparams(self.min_val, self.max_val)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/observer.py", line 328, in _calculate_qparams
          if not check_min_max_valid(min_val, max_val):
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/utils.py", line 346, in check_min_max_valid
          assert min_val <= max_val, f"min {min_val} should be less than max {max_val}"
      AssertionError: min 3.8139522075653076e-05 should be less than max -3.8139522075653076e-05
      ```
      
      Reviewed By: ayushidalmia
      
      Differential Revision: D54665936
      
      Privacy Context Container: L1243674
      
      fbshipit-source-id: 322ab4a84a710b03fa39b39fa81117752d369ba5
      1216c225
  13. 03 Mar, 2024 1 commit
    • Amethyst Reese's avatar
      apply Black 2024 style in fbcode (7/16) · 2256bdb7
      Amethyst Reese authored
      Summary:
      Formats the covered files with pyfmt.
      
      paintitblack
      
      Reviewed By: aleivag
      
      Differential Revision: D54447732
      
      fbshipit-source-id: e21fbbe27882c8af183d021f4ac27029cbe93e8e
      2256bdb7
  14. 23 Feb, 2024 1 commit
    • Naveen Suda's avatar
      pt2e quantization support in D2Go · 09bd2869
      Naveen Suda authored
      Summary: Add pt2e quantization support in D2Go.
      
      Reviewed By: chakriu
      
      Differential Revision: D54132092
      
      fbshipit-source-id: 34a9ba79a5eb49ed27a3f33454078b0df37cf2f0
      09bd2869
  15. 17 Feb, 2024 1 commit
  16. 08 Feb, 2024 1 commit
  17. 04 Feb, 2024 1 commit
  18. 17 Jan, 2024 1 commit
    • Zhicheng Yan's avatar
      expose example_input argument in setup_qat_model() · 3c6f71b4
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/647
      
      Major changes
      - The **example_input** argument of **prepare_fake_quant_model()** is useful in certain cases. For example, in the Argos model's **custom_prepare_fx()** method under the FX graph + QAT setup (D52760682), it is used to prepare example inputs for individual sub-modules by running one forward pass and bookkeeping the inputs to each sub-module. Therefore, we expose the **example_input** argument in the **setup_qat_model()** function (a minimal sketch follows this list).
      - For a QAT model, we currently assert that the number of state dict keys (excluding observers) equals the number of state dict keys in the original model. However, when the assertion fails, it does not log useful information for debugging. We change it to report which keys are unique to each state dict.
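
      As a rough illustration (a hypothetical wrapper, not the D2Go signature), an example_input is what FX-mode QAT preparation ultimately needs for symbolic tracing:

      ```
      import torch
      from torch.ao.quantization import get_default_qat_qconfig_mapping
      from torch.ao.quantization.quantize_fx import prepare_qat_fx

      def setup_qat_model_sketch(model: torch.nn.Module, example_input: torch.Tensor):
          model.train()                                  # QAT preparation expects train mode
          qconfig_mapping = get_default_qat_qconfig_mapping("fbgemm")
          # FX tracing of model.forward() uses representative example inputs.
          return prepare_qat_fx(model, qconfig_mapping, (example_input,))
      ```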
      
      Reviewed By: navsud
      
      Differential Revision: D52760688
      
      fbshipit-source-id: 27535a0324ebe6513f198acb839918a0346720d0
      3c6f71b4
  19. 16 Jan, 2024 1 commit
  20. 12 Jan, 2024 1 commit
    • Kapil Krishnakumar's avatar
      consolidate deterministic settings · 573bd454
      Kapil Krishnakumar authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/644
      
      This diff consolidates the deterministic settings in D2Go. In the `default_runner.py` file, `torch.set_float32_matmul_precision("highest")` is added to set the matrix multiplication precision to the highest possible value. In the `setup.py` file, `torch.backends.cudnn.deterministic` is set to `True` and `torch.backends.cudnn.allow_tf32` is set to `False` to avoid nondeterministic PyTorch and CUDA algorithms during training. `torch.backends.cuda.matmul.allow_tf32` is also set to `False` to avoid nondeterministic matrix multiplication results. Additionally, the `seed` function is used to set the seed for reproducibility. The combined settings are sketched below.
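
      A minimal sketch combining the settings named above (not the exact D2Go code):

      ```
      import random

      import numpy as np
      import torch

      def make_deterministic(seed: int = 0) -> None:
          torch.set_float32_matmul_precision("highest")   # no TF32 shortcuts in float32 matmul
          torch.backends.cudnn.deterministic = True       # deterministic cuDNN kernels
          torch.backends.cudnn.allow_tf32 = False
          torch.backends.cuda.matmul.allow_tf32 = False
          random.seed(seed)                               # seed Python, NumPy, and torch RNGs
          np.random.seed(seed)
          torch.manual_seed(seed)
      ```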
      
      Reviewed By: wat3rBro
      
      Differential Revision: D51796739
      
      fbshipit-source-id: 50e44ea50b0311b56a885db9f633491ac3002bd4
      573bd454
  21. 08 Jan, 2024 1 commit
  22. 04 Jan, 2024 1 commit
  23. 15 Dec, 2023 2 commits
    • Zhicheng Yan's avatar
      allow to ignore state dict keys in QAT model · c2256758
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/642
      
      When we build a QAT model using the FX graph mode APIs **prepare_qat_fx** and **convert_fx**, they run symbolic tracing over **module.forward()**.
      
      In certain cases, such as when a module takes a constant tensor input, symbolic tracing adds new tensor attributes with the name prefix **_tensor_constant** (https://fburl.com/code/msc4ch4o), which become new keys in the QAT model state dict.
      
      The current implementation of **_setup_non_qat_to_qat_state_dict_map** asserts that the state dicts of the original and QAT models have the same number of keys.
      
      Thus, we extend the **qat_state_dict_keys_to_ignore** method with an argument that allows ignoring specified state dict keys in the QAT model, as in the sketch below.
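
      A minimal sketch (hypothetical helper, not the D2Go method) of collecting QAT-only keys to ignore, i.e. tracing artifacts plus caller-specified extras:

      ```
      def keys_to_ignore_sketch(qat_state_dict, extra_keys_to_ignore=()):
          # Keys that exist only in the QAT model and should not be mapped back
          # to the original model's state dict.
          ignored = set(extra_keys_to_ignore)
          for key in qat_state_dict:
              if key.split(".")[-1].startswith("_tensor_constant"):
                  ignored.add(key)
          return ignored
      ```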
      
      Reviewed By: wat3rBro
      
      Differential Revision: D52152706
      
      fbshipit-source-id: 92219feae43bf8841b0a3a71adfbfcb84d8e8f95
      c2256758
    • Zhicheng Yan's avatar
      do not fuse model again for a QAT model · 8f130231
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/643
      
      A QAT model contains observers. After QAT training, those observers already hold updated statistics such as min_val and max_val.
      
      When we want to export the FP32 QAT model for a sanity check, calling **fuse_utils.fuse_model()** again (it is typically already called when building the QAT model before QAT training) would wipe the statistics in those observers, so we skip the second fuse; a guard of this kind is sketched below.
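
      A minimal sketch (hypothetical guard, not the D2Go code) of skipping the second fuse when the model already carries observers:

      ```
      from torch.ao.quantization.observer import ObserverBase

      def maybe_fuse(model, fuse_fn):
          # A QAT model already contains observers with trained min_val/max_val;
          # fusing again would rebuild modules and drop those statistics.
          has_observers = any(isinstance(m, ObserverBase) for m in model.modules())
          return model if has_observers else fuse_fn(model)
      ```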
      
      Reviewed By: wat3rBro
      
      Differential Revision: D52152688
      
      fbshipit-source-id: 08aa16f2aa72b3809e0ba2d346f1b806c0e6ede7
      8f130231
  24. 07 Dec, 2023 2 commits
  25. 30 Nov, 2023 1 commit
  26. 17 Nov, 2023 1 commit
    • Wei Sun's avatar
      Use the consolidated snapshot API in Unitrace to support Zoomer · 87649f4f
      Wei Sun authored
      Summary: Similar to D48210543. Update the training_hooks to use the Unitrace memory snapshot APIs. This allows us to maintain a single path for the memory snapshot APIs and also collect important details, such as the snapshot location, for Zoomer.
      
      Pulled By: HugeEngine
      
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/636
      
      Reviewed By: frabu6, aaronenyeshi, jackiexu1992, mengluy0125
      
      Differential Revision: D48368150
      
      fbshipit-source-id: b279adfa29d390e615d2c32a7ab9e05d95b4f164
      87649f4f
  27. 10 Nov, 2023 1 commit
  28. 09 Nov, 2023 1 commit
  29. 05 Nov, 2023 1 commit
    • Zhicheng Yan's avatar
      allow to skip loading model weights in build_model() · f2a0c52c
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/630
      
      Currently, in the runner's **build_model()** method, when **eval_only=True** we always try to load model weights.
      This is overly restrictive in some cases. For example, we may just want to build a model in eval mode to profile its efficiency before the model has been trained or any checkpoint file has been generated.
      
      Thus, this diff adds an argument **skip_model_weights** that allows users to skip loading the model weights, as sketched below.
      Note that this diff is fully backward-compatible and is NOT expected to break existing implementations.
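
      A minimal sketch of the intended behavior (a hypothetical stand-in, not the D2Go runner; the `torch.nn.Linear` model and `cfg` dict are placeholders):

      ```
      import torch

      def build_model_sketch(cfg: dict, eval_only: bool = False, skip_model_weights: bool = False):
          model = torch.nn.Linear(4, 2)        # stand-in for the real model builder
          if eval_only:
              model.eval()
              if not skip_model_weights:       # new opt-out: build without a checkpoint
                  state = torch.load(cfg["weights_path"], map_location="cpu")
                  model.load_state_dict(state)
          return model
      ```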
      
      Reviewed By: navsud, wat3rBro
      
      Differential Revision: D50623772
      
      fbshipit-source-id: 282dc6f19e17a4dd9eb0048e068c5299bb3d47c2
      f2a0c52c
  30. 01 Nov, 2023 1 commit
  31. 23 Oct, 2023 1 commit
  32. 20 Oct, 2023 1 commit
  33. 12 Oct, 2023 1 commit
    • Igor Fedorov's avatar
      Enable training for fraction of total steps; enable early stopping from trial 0 · 3c724416
      Igor Fedorov authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/627
      
      Enable training for a fraction of the total steps: when doing HPO, users may want to train for a fraction of the number of training steps of a regular (baseline) training run. In this case it is not enough to just change SOLVER.MAX_ITER, because that also changes the learning rate schedule. We introduce a multiplier applied on top of SOLVER.MAX_ITER when deciding how many steps to train for; this multiplier does not scale the number of steps over which the learning rate schedule is defined. The intended semantics are sketched below.
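
      A minimal sketch of the intended semantics (hypothetical names, not the actual config keys):

      ```
      def resolve_iters(solver_max_iter: int, train_fraction: float = 1.0):
          # The LR schedule is still defined over solver_max_iter; only the
          # number of executed training steps is scaled.
          schedule_iters = solver_max_iter
          train_iters = int(round(train_fraction * solver_max_iter))
          return train_iters, schedule_iters

      # e.g. resolve_iters(90000, 0.1) -> (9000, 90000): train 9k steps on the
      # unchanged 90k-step LR schedule.
      ```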
      
      Reviewed By: raghuramank100
      
      Differential Revision: D48699087
      
      fbshipit-source-id: 903f7c957ee471f36365c1449e9cd6a919fd260a
      3c724416
  34. 11 Oct, 2023 1 commit
  35. 05 Oct, 2023 1 commit
  36. 03 Oct, 2023 1 commit
  37. 27 Sep, 2023 2 commits
    • Anthony Chen's avatar
      Make EMA checkpointing with FSDP more robust · 477629d0
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/615
      
      The previous FSDP EMA checkpointing logic directly handled `EMAState`: it manually called `FSDP.summon_full_params()` to gather the full model params and reconstructed/loaded an `EMAState` for checkpointing. This logic has two drawbacks:
      
      1. `FSDP.summon_full_params()` gathers all model weights at the same time, which can cause OOM issues if the model does not fit on a single GPU. This is quite common for FSDP workloads.
      2. Directly saving and loading `EMAState` is error-prone. The EMA state dict has different semantics and behavior from `model.state_dict()`, yet users often expect it to function seamlessly like the model state dict.
      
      This diff modifies the EMA save/load logic to use `model.state_dict()` directly, solving the above two pain points; a minimal sketch of the approach is shown below.
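
      A rough sketch (not the D2Go implementation) of checkpointing an FSDP-wrapped EMA model through `model.state_dict()` with a sharded state-dict type, so that no rank ever gathers the full parameters:

      ```
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType

      def ema_checkpoint_state(ema_model: FSDP):
          # Each rank returns only its own shard, with regular state_dict() semantics.
          with FSDP.state_dict_type(ema_model, StateDictType.SHARDED_STATE_DICT):
              return ema_model.state_dict()
      ```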
      
      Reviewed By: wat3rBro
      
      Differential Revision: D48813697
      
      fbshipit-source-id: be53c2677d2e493ba923508bbd82d9d295397941
      477629d0
    • Min Xu's avatar
      add damit uri support in train_net local run · c668ed4e
      Min Xu authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/622
      
      as title
      
      Reviewed By: jiaxuzhu92, ywwwer
      
      Differential Revision: D49672980
      
      fbshipit-source-id: f34ffe944c25c948fe1abd492ea0b96e47dc5b06
      c668ed4e