1. 04 Oct, 2024 2 commits
  2. 26 Sep, 2024 1 commit
    • Deterministic D2GO Trainer Params · 5b856252
      Victor Bourgin authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/677
      
      Previously, cfg.SOLVER.DETERMINISTIC was not taken into account for the lightning `Trainer` in d2go:
      - Nested checks like `hasattr(cfg, "SOLVER.DETERMINISTIC")` do not work as expected, since `hasattr` does not resolve dotted attribute paths (see the sketch below)
      - When SOLVER.DETERMINISTIC does exist, we should check that it is set to `True`
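      A minimal sketch of the first point, using a `SimpleNamespace` as a hypothetical stand-in for the d2go config object:

      ```
      from types import SimpleNamespace

      # Hypothetical config layout standing in for the d2go CfgNode.
      cfg = SimpleNamespace(SOLVER=SimpleNamespace(DETERMINISTIC=True))

      # hasattr() treats "SOLVER.DETERMINISTIC" as one literal attribute name,
      # so this check is always False regardless of the config contents:
      print(hasattr(cfg, "SOLVER.DETERMINISTIC"))  # False

      # Checking each level, and the value itself, behaves as intended:
      deterministic = hasattr(cfg, "SOLVER") and getattr(cfg.SOLVER, "DETERMINISTIC", False)
      print(deterministic)  # True
      ```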
      
      Reviewed By: ayushidalmia, rbasch
      
      Differential Revision: D63426319
      
      fbshipit-source-id: 8caf0af53e7b97a49392df09153e26ee3628231f
  3. 13 Aug, 2024 1 commit
    • Hipify various dependencies to enable AMD Face Enhancer · 7739077a
      Josh Fromm authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/675
      
      This diff extends several targets to be hip compatible and fixes a few silly hipification issues with those targets.
      
      After these changes, all dependencies needed for the face enhancer can compile with AMD.
      
      A few silly issues I had to hack around; maybe we could improve hipification to avoid similar issues in the future:
      * Some of the dependencies used sources in `src/cuda/**.cu`. Hipification tried to rename "cuda" to "hip" and broke the paths. I'm not sure where that rename happens, so I just changed the directory from "cuda" to "gpu" to avoid the issue.
      * One header import, `THCAtomics.cuh`, was incorrectly being renamed to `THHAtomics.cuh`, which doesn't exist. Fortunately, an equivalent header without the naming issue was available.
      
      We also might want to consider graduating the cpp_library_hip bazel helper out of fbgemm since it seems pretty generally useful.
      
      For some of the targets, we needed to build a Python cpp extension, which as far as I can tell we didn't have good hipification support for yet. I added a new buck rule, very similar to our standard cpp_library_hip rule, that creates an extension instead. It's a little copy-pasted, so let me know if there are cleaner ways to work around this requirement.
      
      Reviewed By: houseroad
      
      Differential Revision: D61080247
      
      fbshipit-source-id: dc6f101eb3eadfd43ef5610c651b1639e4c78ae6
  4. 30 Jul, 2024 2 commits
  5. 11 Jul, 2024 1 commit
  6. 01 Jul, 2024 1 commit
  7. 22 Jun, 2024 1 commit
    • adhere to lazy import rules · 040a7167
      Ahmed Gheith authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/668
      
      Lazy Imports change `Python` import semantics, specifically when it comes to the initialization of packages/modules: https://www.internalfb.com/intern/wiki/Python/Cinder/Onboarding/Tutorial/Lazy_Imports/Troubleshooting/
      
      For example, this pattern is not guaranteed to work:
      
      ```
      import torch.optim
      ...
      torch.optim._multi_tensor.Adam   # may fail to resolve _multi_tensor
      ```
      
      And this is guaranteed to work:
      
      ```
      import torch.optim._multi_tensor
      ...
      torch.optim._multi_tensor.Adam   # will always work
      ```
      
      A recent change to `PyTorch` changed module initialization logic in a way that exposed this issue.
      
      But the code has been working for years? That is the nature of undefined behavior: any change in the environment (in this case, the `PyTorch` code base) can make it fail.
      
      Reviewed By: wat3rBro
      
      Differential Revision: D58876582
      
      fbshipit-source-id: c8f3f53605822517d646e57ddbf4359af54dba0d
  8. 19 Jun, 2024 1 commit
  9. 11 Jun, 2024 1 commit
  10. 08 May, 2024 1 commit
  11. 02 May, 2024 1 commit
  12. 24 Apr, 2024 1 commit
  13. 03 Apr, 2024 1 commit
  14. 02 Apr, 2024 1 commit
  15. 27 Mar, 2024 1 commit
  16. 19 Mar, 2024 1 commit
    • distributed FSDP model initialization · abdad994
      Geet Sethi authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/656
      
      Enable distributed FSDP model initialization. This iteratively moves and shards the model onto GPUs, allowing training of models that exceed single-GPU HBM capacity and that cannot be instantiated multiple times on a single host.
      
      The flow is as follows:
      1. Rank 0 will init the whole model on CPU using existing code paths, while all other ranks init an 'empty' model using fake tensors.
      2. Once this is complete and initialization moves to FSDP, distributed init traverses the model 'bottom-up', transferring all params/buffers from rank 0 to all other ranks while simultaneously wrapping modules in FSDP whenever possible (based on the specified config). Thus modules are sharded (and memory usage distributed) at the earliest possible point using the existing FSDP API/implementation (see the sketch below).
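      A simplified sketch of the rank-0 / empty-model pattern, using the meta device and FSDP's built-in `sync_module_states` rather than the custom bottom-up transfer described above (function names are hypothetical):

      ```
      import torch
      import torch.distributed as dist
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

      def build_fsdp_model(build_fn):
          # build_fn is a hypothetical callable that constructs the unwrapped model;
          # assumes the default process group is already initialized.
          if dist.get_rank() == 0:
              model = build_fn()  # real weights on CPU, via the existing code path
          else:
              with torch.device("meta"):  # weightless skeleton, no memory allocated
                  model = build_fn()
          return FSDP(
              model,
              device_id=torch.cuda.current_device(),
              # materialize meta params as empty GPU tensors on non-zero ranks
              param_init_fn=lambda m: m.to_empty(
                  device=torch.cuda.current_device(), recurse=False
              ),
              # broadcast rank 0's params/buffers so every rank ends up with real weights
              sync_module_states=True,
          )
      ```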
      
      Reviewed By: XiaoliangDai
      
      Differential Revision: D54287718
      
      fbshipit-source-id: 16d63d78065d1fca0c6baf7a385f666a4e1b2a5f
  17. 14 Mar, 2024 1 commit
  18. 10 Mar, 2024 1 commit
    • ensure metadata thing_classes consistency with multiple datasets and category filtering · 1216c225
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/653
      
      # Changes
      In Mask2Former RC4 training, we need to use a particular weighted category training sampler where `DATALOADER.SAMPLER_TRAIN = "WeightedCategoryTrainingSampler"`.
      
      Also, multiple datasets are used, and their category sets are not identical: some datasets have more categories (e.g. Exo-body) than other datasets that do not have exo-body annotations.

      We also use category filtering by setting `D2GO_DATA.DATASETS.TRAIN_CATEGORIES` to a subset of the full categories.
      
      In this setup, D2GO currently complains that metadata.thing_classes is NOT consistent across datasets (https://fburl.com/code/k8xbvyfd).
      
      The reason is that when category filtering is used, D2GO writes a temporary dataset json file (https://fburl.com/code/slb5z6mc).
      This tmp json file is then loaded when we get the dataset dicts from DatasetCatalog (https://fburl.com/code/5k4ynyhc). Meanwhile, the metadata in MetadataCatalog for the category-filtered dataset is also updated based on the categories stored in this tmp file.

      Therefore, we must ensure that the categories stored in the tmp files are consistent across multiple category-filtered datasets.

      In this diff, we update the logic for writing such tmp dataset json files (see the sketch below).
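      A hypothetical sketch of the consistency requirement: when each dataset's COCO-style json is filtered to TRAIN_CATEGORIES, the surviving categories must be written in the same shared order so that the thing_classes derived from each tmp file agree (helper name and layout are assumptions, not the actual d2go code):

      ```
      def filter_categories(coco_dict, keep_names):
          """Restrict a COCO-style dataset dict to `keep_names`, ordering the
          surviving categories by the shared `keep_names` list rather than by
          each dataset's own (possibly different) category order."""
          name_to_cat = {c["name"]: c for c in coco_dict["categories"]}
          kept = [name_to_cat[n] for n in keep_names if n in name_to_cat]
          kept_ids = {c["id"] for c in kept}
          return {
              **coco_dict,
              "categories": kept,
              "annotations": [
                  a for a in coco_dict["annotations"] if a["category_id"] in kept_ids
              ],
          }
      ```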
      
      # Github CI test
      Note that **CI / python-unittest-cpu** is shown as failed with the error below. I do not think it is related to this diff, since the error concerns an observer during QAT model training, while the changes here only touch dataset preparation.
      
      ```
      Traceback (most recent call last):
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 155, in train
          self.run_step()
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/detectron2/engine/train_loop.py", line 310, in run_step
          loss_dict = self.model(data)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1536, in _call_impl
          return forward_call(*args, **kwargs)
        File "/home/runner/work/d2go/d2go/tests/runner/test_runner_default_runner.py", line 44, in forward
          ret = self.conv(images.tensor)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1590, in _call_impl
          hook_result = hook(self, args, result)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/quantize.py", line 131, in _observer_forward_hook
          return self.activation_post_process(output)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _wrapped_call_impl
          return self._call_impl(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1536, in _call_impl
          return forward_call(*args, **kwargs)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/fake_quantize.py", line 199, in forward
          _scale, _zero_point = self.calculate_qparams()
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/fake_quantize.py", line 194, in calculate_qparams
          return self.activation_post_process.calculate_qparams()
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/observer.py", line 529, in calculate_qparams
          return self._calculate_qparams(self.min_val, self.max_val)
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/observer.py", line 328, in _calculate_qparams
          if not check_min_max_valid(min_val, max_val):
        File "/usr/share/miniconda/envs/__setup_conda/lib/python3.8/site-packages/torch/ao/quantization/utils.py", line 346, in check_min_max_valid
          assert min_val <= max_val, f"min {min_val} should be less than max {max_val}"
      AssertionError: min 3.8139522075653076e-05 should be less than max -3.8139522075653076e-05
      ```
      
      Reviewed By: ayushidalmia
      
      Differential Revision: D54665936
      
      Privacy Context Container: L1243674
      
      fbshipit-source-id: 322ab4a84a710b03fa39b39fa81117752d369ba5
  19. 03 Mar, 2024 1 commit
    • apply Black 2024 style in fbcode (7/16) · 2256bdb7
      Amethyst Reese authored
      Summary:
      Formats the covered files with pyfmt.
      
      paintitblack
      
      Reviewed By: aleivag
      
      Differential Revision: D54447732
      
      fbshipit-source-id: e21fbbe27882c8af183d021f4ac27029cbe93e8e
  20. 23 Feb, 2024 1 commit
    • pt2e quantization support in D2Go · 09bd2869
      Naveen Suda authored
      Summary: Add pt2e quantization support in D2Go.
      
      Reviewed By: chakriu
      
      Differential Revision: D54132092
      
      fbshipit-source-id: 34a9ba79a5eb49ed27a3f33454078b0df37cf2f0
  21. 17 Feb, 2024 1 commit
  22. 08 Feb, 2024 1 commit
  23. 04 Feb, 2024 1 commit
  24. 17 Jan, 2024 1 commit
    • expose example_input argument in setup_qat_model() · 3c6f71b4
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/647
      
      Major changes
      - The **example_input** argument in **prepare_fake_quant_model()** is useful in certain cases. For example, in the Argos model's **custom_prepare_fx()** method under the FX graph + QAT setup (D52760682), it is used to prepare example inputs for individual sub-modules by running one forward pass and recording the inputs to each sub-module. Therefore, we expose the **example_input** argument in the **setup_qat_model()** function.
      - For a QAT model, we currently assert that the number of state dict keys (excluding observers) equals the number of state dict keys in the original model. However, when the assertion fails, it does not log useful information for debugging. We change it to report which keys are unique to each state dict (see the sketch below).
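      A minimal sketch of the improved check from the second point (the helper name and arguments are assumptions, not the actual d2go code):

      ```
      from typing import Dict, Set

      import torch

      def check_state_dict_keys(
          orig_sd: Dict[str, torch.Tensor],
          qat_sd: Dict[str, torch.Tensor],
          observer_keys: Set[str],
      ) -> None:
          """Assert key parity between the original and QAT state dicts and, on
          failure, report the keys that are unique to each side."""
          qat_keys = set(qat_sd) - observer_keys
          orig_keys = set(orig_sd)
          assert qat_keys == orig_keys, (
              f"only in QAT model: {sorted(qat_keys - orig_keys)}; "
              f"only in original model: {sorted(orig_keys - qat_keys)}"
          )
      ```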
      
      Reviewed By: navsud
      
      Differential Revision: D52760688
      
      fbshipit-source-id: 27535a0324ebe6513f198acb839918a0346720d0
  25. 16 Jan, 2024 1 commit
  26. 12 Jan, 2024 1 commit
    • consolidate deterministic settings · 573bd454
      Kapil Krishnakumar authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/644
      
      This diff consolidates the deterministic settings in D2Go. In `default_runner.py`, `torch.set_float32_matmul_precision("highest")` is added to use the highest available precision for float32 matrix multiplications. In `setup.py`, `torch.backends.cudnn.deterministic` is set to `True` and `torch.backends.cudnn.allow_tf32` is set to `False` to avoid nondeterministic PyTorch and CUDA algorithms during training. `torch.backends.cuda.matmul.allow_tf32` is also set to `False` to avoid nondeterministic matrix multiplication algorithms. Additionally, the `seed` function is used to set the seed for reproducibility.
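      Gathered in one place, the settings above amount to roughly the following (a sketch only; the actual code splits them between `default_runner.py` and `setup.py`, and seeding goes through D2Go's own helper):

      ```
      import random

      import numpy as np
      import torch

      def setup_deterministic(seed: int) -> None:
          torch.set_float32_matmul_precision("highest")  # full-precision fp32 matmuls
          torch.backends.cudnn.deterministic = True      # deterministic cuDNN kernels
          torch.backends.cudnn.allow_tf32 = False        # no TF32 in cuDNN convolutions
          torch.backends.cuda.matmul.allow_tf32 = False  # no TF32 in CUDA matmuls
          # seed the Python, NumPy, and PyTorch RNGs for reproducibility
          random.seed(seed)
          np.random.seed(seed)
          torch.manual_seed(seed)
      ```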
      
      Reviewed By: wat3rBro
      
      Differential Revision: D51796739
      
      fbshipit-source-id: 50e44ea50b0311b56a885db9f633491ac3002bd4
  27. 08 Jan, 2024 1 commit
  28. 04 Jan, 2024 1 commit
  29. 15 Dec, 2023 2 commits
    • allow to ignore state dict keys in QAT model · c2256758
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/642
      
      When we build a QAT model using the FX graph mode APIs **prepare_qat_fx** and **convert_fx**, they run symbolic tracing over **module.forward()**.

      In certain cases, such as when a module takes a constant tensor input, symbolic tracing adds new tensor attributes with the name prefix **_tensor_constant** (https://fburl.com/code/msc4ch4o), which become new keys in the QAT model state dict.
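      A minimal sketch of how such keys appear, using plain `torch.fx.symbolic_trace` (the quantization APIs trace in a similar way):

      ```
      import torch
      import torch.fx

      class AddConstant(torch.nn.Module):
          def forward(self, x):
              # a constant tensor created inside forward(); symbolic tracing
              # lifts it onto the traced module as a `_tensor_constant*` attribute
              return x + torch.tensor([1.0, 2.0])

      traced = torch.fx.symbolic_trace(AddConstant())
      print([k for k in traced.state_dict() if k.startswith("_tensor_constant")])
      # expected: ['_tensor_constant0']
      ```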
      
      The current implementation of **_setup_non_qat_to_qat_state_dict_map** asserts that the number of keys in the state dict of the original model and in that of the QAT model are the same.

      Thus, we extend the **qat_state_dict_keys_to_ignore** method with an additional argument that allows specified state dict keys in the QAT model to be ignored.
      
      Reviewed By: wat3rBro
      
      Differential Revision: D52152706
      
      fbshipit-source-id: 92219feae43bf8841b0a3a71adfbfcb84d8e8f95
    • do not fuse model again for a QAT model · 8f130231
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/643
      
      A QAT model contains observers. After QAT training, those observers already hold updated statistics, such as min_val and max_val.

      When we want to export the FP32 QAT model for a sanity check, calling **fuse_utils.fuse_model()** again (it is often already called when we build the QAT model before QAT training) would wipe out the statistics in the observers.
      
      Reviewed By: wat3rBro
      
      Differential Revision: D52152688
      
      fbshipit-source-id: 08aa16f2aa72b3809e0ba2d346f1b806c0e6ede7
  30. 07 Dec, 2023 2 commits
  31. 30 Nov, 2023 1 commit
  32. 17 Nov, 2023 1 commit
    • Use the consolidated snapshot API in Unitrace to support Zoomer · 87649f4f
      Wei Sun authored
      Summary: Similar to D48210543. Update the training_hooks to use the Unitrace memory snapshot APIs. This allows us to maintain a single path for the memory snapshot APIs, and also to collect important details, such as the snapshot location, for Zoomer.
      
      Pulled By: HugeEngine
      
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/636
      
      Reviewed By: frabu6, aaronenyeshi, jackiexu1992, mengluy0125
      
      Differential Revision: D48368150
      
      fbshipit-source-id: b279adfa29d390e615d2c32a7ab9e05d95b4f164
  33. 10 Nov, 2023 1 commit
  34. 09 Nov, 2023 1 commit
  35. 05 Nov, 2023 1 commit
    • allow to skip loading model weights in build_model() · f2a0c52c
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/630
      
      Currently, in the runner's **build_model()** method, when **eval_only=True**, we always try to load model weights.
      This is quite restrictive in some cases. For example, we may just want to build a model in eval mode to profile its efficiency before we have trained the model or generated model weights in a checkpoint file.

      Thus, this diff adds a **skip_model_weights** argument that allows users to skip loading the model weights (see the usage sketch below).
      Note that this diff is fully backward-compatible and is NOT expected to break existing implementations.
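      Hypothetical usage, assuming the default runner class and an already-populated cfg (the flag name comes from this diff; the surrounding calls are only an illustration):

      ```
      from d2go.runner import create_runner

      # assumed runner class; any runner exposing build_model() would do
      runner = create_runner("d2go.runner.GeneralizedRCNNRunner")
      cfg = runner.get_default_cfg()

      # Build an eval-mode model for profiling without loading checkpoint weights.
      model = runner.build_model(cfg, eval_only=True, skip_model_weights=True)
      ```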
      
      Reviewed By: navsud, wat3rBro
      
      Differential Revision: D50623772
      
      fbshipit-source-id: 282dc6f19e17a4dd9eb0048e068c5299bb3d47c2
  36. 01 Nov, 2023 1 commit