1. 30 Mar, 2023 1 commit
  2. 11 Mar, 2023 1 commit
  3. 09 Mar, 2023 1 commit
  4. 25 Feb, 2023 1 commit
  5. 23 Feb, 2023 1 commit
  6. 16 Feb, 2023 2 commits
  7. 14 Feb, 2023 1 commit
    • Add NUMA binding · 07ddd262
      Fei Sun authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/472
      
      Add NUMA binding to d2go. It distributes the GPUs evenly across the CPU sockets so that CPU traffic and GPU-to-CPU traffic are balanced. It helps diffusion model training, but it is a general technique that can be applied to all models. We still want to enable it manually in each case until we are confident that it improves performance, at which point we can make it the default.
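
      To illustrate the general idea (a minimal sketch, not the d2go implementation): bind each training process to the CPUs of one NUMA node, spreading GPUs evenly across nodes. The sysfs paths are Linux-specific, and `local_rank`/`num_gpus` are assumed to come from the launcher:

      ```
      import os

      def _parse_cpulist(text):
          # "0-23,48-71" -> {0, 1, ..., 23, 48, ..., 71}
          cpus = set()
          for part in text.strip().split(","):
              lo, _, hi = part.partition("-")
              cpus.update(range(int(lo), int(hi or lo) + 1))
          return cpus

      def bind_to_numa_node(local_rank, num_gpus):
          node_dir = "/sys/devices/system/node"
          nodes = sorted(d for d in os.listdir(node_dir) if d.startswith("node") and d[4:].isdigit())
          # spread GPUs evenly across NUMA nodes, e.g. 8 GPUs over 2 sockets -> 4 GPUs per socket
          node = nodes[local_rank * len(nodes) // num_gpus]
          with open(os.path.join(node_dir, node, "cpulist")) as f:
              cpus = _parse_cpulist(f.read())
          os.sched_setaffinity(0, cpus)  # pin this process to the CPUs of that NUMA node
      ```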
      
      NUMA binding is based on jspark1105's work D42827082. Full credit goes to him.
      
      This diff does not enable the feature.
      
      Reviewed By: newstzpz
      
      Differential Revision: D43036817
      
      fbshipit-source-id: fe67fd656ed3980f04bc81909cae7ba2527346fd
  8. 13 Jan, 2023 1 commit
    • Rewrite FSDP wrapping as modeling hook · dc6fac12
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/440
      
      Move FSDP wrapping to runner.build_model by rewriting it as a modeling hook
      
      **Motivation**
      When a model is too large to run inference on a single GPU, it requires FSDP with local checkpointing mode to reduce peak GPU memory. However, in the eval_pytorch workflow (train_net with eval-only), models are evaluated without being wrapped by FSDP, which can cause OOM errors for the reason above. Thus, it is better practice to wrap the model with FSDP during `runner.build_model(cfg)`, so evaluation runs in the same FSDP setting as training.
      
      This diff moves FSDP wrapping to `runner.build_model(cfg)` by rewriting it as a modeling hook.
      
      **API changes**
      * Users need to append `"FSDPModelingHook"` to `MODEL.MODELING_HOOKS` to enable FSDP.
      * `FSDP.ALGORITHM` can only be `full` or `grad_optim` (see the config sketch below).
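
      A minimal sketch of these config changes (`cfg` is assumed to be the yacs `CfgNode` returned by the runner's `get_default_cfg()`):

      ```
      # enable the FSDP modeling hook and pick a sharding algorithm
      cfg.MODEL.MODELING_HOOKS = list(cfg.MODEL.MODELING_HOOKS) + ["FSDPModelingHook"]
      cfg.FSDP.ALGORITHM = "grad_optim"  # or "full"
      ```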
      
      **Note**
      It's not possible to unwrap an FSDP model back to the original model, so `FSDPModelingHook.unapply()` can't be implemented.
      
      Reviewed By: wat3rBro
      
      Differential Revision: D41416917
      
      fbshipit-source-id: f3fc72d574cc6ccbe0d238e48c575926ba5b4d06
  9. 05 Jan, 2023 1 commit
  10. 09 Dec, 2022 1 commit
  11. 28 Nov, 2022 1 commit
  12. 17 Nov, 2022 1 commit
    • Integrate PyTorch Fully Sharded Data Parallel (FSDP) · 02625ff8
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/396
      
      Integrate PyTorch FSDP, which supports two sharding modes: 1. gradient + optimizer state sharding; 2. full model sharding (parameters + gradients + optimizer state). This feature is enabled in the `train_net.py` code path.
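
      A rough illustration of the two modes with the plain PyTorch FSDP API (not the d2go wrapping code; assumes the process group is already initialized and the model sits on the local GPU):

      ```
      import torch.nn as nn
      from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

      model = nn.Linear(1024, 1024).cuda()  # stand-in model on the local GPU

      # 2. full model sharding: parameters + gradients + optimizer state
      strategy = ShardingStrategy.FULL_SHARD
      # 1. gradient + optimizer state sharding only (parameters stay replicated):
      # strategy = ShardingStrategy.SHARD_GRAD_OP
      wrapped = FSDP(model, sharding_strategy=strategy)
      ```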
      
      Sources
      * Integration follows this tutorial: https://pytorch.org/tutorials/intermediate/FSDP_tutorial.html
      
      API changes
      * Add new config keys to support the new feature. Refer to mobile-vision/d2go/d2go/trainer/fsdp.py for the full list of config options.
      * Add `FSDPCheckpointer` as a subclass of `QATCheckpointer` to support the special loading/saving logic for FSDP models.
      
      Reviewed By: wat3rBro
      
      Differential Revision: D39228316
      
      fbshipit-source-id: 342ecb3bcbce748453c3fba2d6e1b7b7e478473c
  13. 11 Nov, 2022 1 commit
  14. 03 Nov, 2022 1 commit
    • use SharedList as offload backend of DatasetFromList by default · 01c351bc
      Yanghan Wang authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/405
      
      - Use the non-hacky way (added in D40818736, https://github.com/facebookresearch/detectron2/pull/4626) to customize the offload backend for `DatasetFromList`.
      - In D2Go, switch to `SharedList` (added in D40789062, https://github.com/facebookresearch/mobile-vision/pull/120) by default to save RAM, and optionally use `DiskCachedList` to save even more RAM.
      
      Local benchmark results in dev mode, using a ~2.4 GiB dataset:
      | RAM usage (RES, SHR) | No-dataset | Naive | NumpySerializedList | SharedList | DiskCachedList |
      | -- | -- | -- | -- | -- | -- |
      | Master GPU worker | 8.0g, 2.8g | 21.4g, 2.8g | 11.6g, 2.8g | 11.5g, 5.2g | -- |
      | Non-master GPU worker | 7.5g, 2.8g | 21.0g, 2.8g | 11.5g, 2.8g | 8.0g, 2.8g | -- |
      | Per data loader worker | 2.0g, 1.0g | 14.0g, 1.0g | 4.4g, 1.0g | 2.1g, 1.0g | -- |
      
      - The memory usage (RES, SHR) comes from the `top` command: `RES` is the total memory used by a process; `SHR` is how much of `RES` can be shared with other processes.
      - Experiments use 2 GPUs and 2 data loader workers per GPU, so there are 6 processes in total; the **numbers are per-process**.
      - `No-dataset`: the same job run with a tiny dataset (only 4.47 MiB after serialization); since its RAM usage is negligible, it shows the floor RAM usage.
      - The other experiments use a dataset of **2413.57 MiB** after serialization.
        - `Naive`: the vanilla version, where the dataset is not offloaded to other storage.
        - `NumpySerializedList`: this optimization was added long ago in D19896490. I recall that the RAM was indeed shared with the data loader workers, but there seems to have been a regression; now basically every process holds its own copy of the data.
        - `SharedList`: enabled in this diff. It shows that only the master GPU worker needs extra RAM. Interestingly, it uses 3.5 GB more RAM than the other ranks while the data itself is 2.4 GB; I'm not sure whether this is overhead of the storage itself or of sharing it with other processes. Since a non-master GPU worker using `NumpySerializedList` also uses 11.5 GB, we probably don't need to worry too much about it.
        - `DiskCachedList`: not benchmarked; it should have no extra RAM usage.
      
      Using the numbers above for a typical 8-GPU, 4-worker training, and assuming the OS and other programs take 20-30 GB of RAM, the current training uses `11.6g * 8 + 4.4g * 8 * 4 = 233.6g` of RAM, on the edge of causing OOM on a 256 GB machine. This aligns with our experience that it supports a ~2 GB dataset. After the change, the training uses only `(11.5g + 8.0g * 7) + 2.1g * 8 * 4 = 134.7g` of RAM, which gives much more headroom; we can thus train with a much larger dataset (e.g. 20 GB) or use more data loader workers (e.g. 8).
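
      The same estimate as a quick sanity check, plugging the per-process RES numbers from the table into the 8-GPU, 4-worker setup:

      ```
      gpus, workers_per_gpu = 8, 4
      # before: every GPU worker and every data loader worker holds its own copy
      numpy_serialized = 11.6 * gpus + 4.4 * gpus * workers_per_gpu           # ~233.6 GB
      # after: only the master GPU worker pays the extra RAM for SharedList
      shared_list = 11.5 + 8.0 * (gpus - 1) + 2.1 * gpus * workers_per_gpu    # ~134.7 GB
      print(numpy_serialized, shared_list)
      ```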
      
      Reviewed By: sstsai-adl
      
      Differential Revision: D40819959
      
      fbshipit-source-id: fbdc9d2d1d440e14ae8496be65979a09f3ed3638
  15. 31 Oct, 2022 1 commit
  16. 26 Oct, 2022 1 commit
    • swap the order of qat and layer freezing to preserve checkpoint values · 13b2fe71
      Matthew Yu authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/399
      
      Freezing the model before running quantization causes an issue when loading a saved checkpoint, because fusing does not support `FrozenBatchNorm2d` (which means the checkpoint could have a fused weight `conv.bn.weight` whereas the model would have an unfused weight `bn.weight`). The longer-term solution is to add `FrozenBatchNorm2d` to the fusing support, but there are some subtle issues there that will take some time to fix:
      * We need to move `FrozenBatchNorm2d` out of D2 and into the mobile_cv lib.
      * The current fuser has options to add new BN ops (e.g., `FrozenBatchNorm2d`), which we use with ops like SyncBN, but this is currently only tested for inference, so we need to write additional checks for training.
      
      The swap makes freezing compatible with QAT and should still work with standard models. One subtle potential issue is that the current BN swap assumes BN is a leaf node. If a user runs QAT without fusing BN, the BN is no longer a leaf node, since it gains an activation_post_process module to record its output; the result is that BN will not be frozen in this specific case. This should not normally happen because BN is usually fused. A small adjustment would be to swap the BN regardless of whether it is a leaf node (though we would have to check that the activation_post_process module is retained). Another long-term consideration is moving both freezing and quantization to modeling hooks so the user can decide the order.
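
      A conceptual sketch of the intended order with plain PyTorch (not the d2go hooks): fuse Conv+BN first, then freeze, so the parameter names line up with a checkpoint saved from the fused model:

      ```
      import torch.nn as nn
      from torch.ao.quantization import fuse_modules

      model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
      model.eval()
      fused = fuse_modules(model, [["0", "1", "2"]])  # Conv+BN+ReLU folded into module "0"
      for p in fused.parameters():  # freeze only after fusing, so names match the fused checkpoint
          p.requires_grad = False
      ```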
      
      Reviewed By: wat3rBro
      
      Differential Revision: D40496052
      
      fbshipit-source-id: 0d7e467b833821f7952cd2fce459ae1f76e1fa3b
  17. 23 Oct, 2022 1 commit
  18. 03 Oct, 2022 1 commit
  19. 29 Sep, 2022 1 commit
  20. 31 Aug, 2022 1 commit
  21. 20 Aug, 2022 1 commit
  22. 27 Jul, 2022 2 commits
  23. 29 Jun, 2022 1 commit
  24. 24 Jun, 2022 1 commit
  25. 20 Jun, 2022 1 commit
  26. 14 Jun, 2022 1 commit
    • make get_default_cfg a classmethod · 65dad512
      Yanghan Wang authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/293
      
      In order to pass the runner around the workflow by "runner name" instead of as a runner instance, we need to make sure `get_default_cfg` is not an instance method. It could be either a staticmethod or a classmethod; I chose classmethod for better inheritance.
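
      The change in a nutshell (a sketch; real runners build a much larger config, and the dict below is just a stand-in for the CfgNode):

      ```
      class MyRunner:
          # before: `def get_default_cfg(self)`, which required constructing a runner instance first
          @classmethod
          def get_default_cfg(cls):
              return {"SOLVER": {"MAX_ITER": 90000}}  # stand-in for the real default CfgNode

      cfg = MyRunner.get_default_cfg()  # usable from the class (or runner name) alone, no instance needed
      ```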
      
      Codemod done using the following script:
      ```
      #!/usr/bin/env python3
      
      import json
      import os
      import subprocess
      
      result = subprocess.check_output("fbgs --json 'def get_default_cfg('", shell=True)
      fbgs = json.loads(result)
      fbsource_root = os.path.expanduser("~")
      
      def _indent(s):
          return len(s) - len(s.lstrip())
      
      def resolve_instance_method(content):
          lines = content.split("\n")
          for idx, line in enumerate(lines):
              if "def get_default_cfg(self" in line:
                  indent = _indent(line)
                  # find the class
                  for j in range(idx, 0, -1):
                      if lines[j].startswith(" " * (indent - 4) + "class "):
                          class_line = lines[j]
                          break
                  else:
                      raise RuntimeError("Can't find class")
                  print("class_line: ", class_line)
                  if "Runner" in class_line:
                      # check self if not used
                      for j in range(idx + 1, len(lines)):
                          if _indent(lines[j]) < indent:
                              break
                          assert "self" not in lines[j], (j, lines[j])
                      # update the content
                      assert "def get_default_cfg(self)" in line
                      lines[idx] = lines[idx].replace(
                          "def get_default_cfg(self)", "def get_default_cfg(cls)"
                      )
                      lines.insert(idx, " " * indent + "@classmethod")
                      return "\n".join(lines)
          return content
      
      def resolve_static_method(content):
          lines = content.split("\n")
          for idx, line in enumerate(lines):
              if "def get_default_cfg()" in line:
                  indent = _indent(line)
                  # find the class
                  for j in range(idx, 0, -1):
                      if "class " in lines[j]:
                          class_line = lines[j]
                          break
                  else:
                      print("[WARNING] Can't find class!!!")
                      continue
                  if "Runner" in class_line:
                      # check staticmethod is used
                      for j in range(idx, 0, -1):
                          if lines[j] == " " * indent + "@staticmethod":
                              staticmethod_line_idx = j
                              break
                      else:
                          raise RuntimeError("Can't find staticmethod")
                      # update the content
                      lines[idx] = lines[idx].replace(
                          "def get_default_cfg()", "def get_default_cfg(cls)"
                      )
                      lines[staticmethod_line_idx] = " " * indent + "@classmethod"
                      return "\n".join(lines)
          return content
      
      for result in fbgs["results"]:
          filename = os.path.join(fbsource_root, result["file_name"])
          print(f"processing: {filename}")
          with open(filename) as f:
              content = f.read()
          orig_content = content
          while True:
              old_content = content
              content = resolve_instance_method(content)
              content = resolve_static_method(content)
              if content == old_content:
                  break
          if content != orig_content:
              print("Updating ...")
              with open(filename, "w") as f:
                  f.write(content)
      ```
      
      Reviewed By: tglik
      
      Differential Revision: D37059264
      
      fbshipit-source-id: b09d5518f4232de95d8313621468905cf10a731c
  27. 26 May, 2022 1 commit
  28. 25 May, 2022 2 commits
  29. 21 May, 2022 1 commit
  30. 20 May, 2022 1 commit
  31. 17 May, 2022 1 commit
  32. 15 May, 2022 1 commit
    • apply import merging for fbcode (7 of 11) · b3a9204c
      John Reese authored
      Summary:
      Applies new import merging and sorting from µsort v1.0.
      
      When merging imports, µsort will make a best-effort to move associated
      comments to match merged elements, but there are known limitations due to
      the dynamic nature of Python and developer tooling. These changes should
      not produce any dangerous runtime changes, but may require touch-ups to
      satisfy linters and other tooling.
      
      Note that µsort uses case-insensitive, lexicographical sorting, which
      results in a different ordering compared to isort. This provides a more
      consistent sorting order, matching the case-insensitive order used when
      sorting import statements by module name, and ensures that "frog", "FROG",
      and "Frog" always sort next to each other.
      
      For details on µsort's sorting and merging semantics, see the user guide:
      https://usort.readthedocs.io/en/stable/guide.html#sorting
      
      Reviewed By: lisroach
      
      Differential Revision: D36402205
      
      fbshipit-source-id: a4efc688d02da80c6e96685aa8eb00411615a366
  33. 26 Apr, 2022 1 commit
  34. 12 Apr, 2022 1 commit
  35. 05 Apr, 2022 1 commit
    • support do_postprocess when tracing rcnn model in D2 style · 647a3fdf
      Yanghan Wang authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/200
      
      Currently, when exporting the RCNN model, we call it with `self.model.inference(inputs, do_postprocess=False)[0]`, so the output of the exported model is not post-processed, e.g. the mask is in its squared shape. This diff adds an option to include the post-processing in the exported model.
      
      Worth noting: since the input is a single tensor, the post-processing doesn't resize the output to the original resolution, and we can't apply the post-processing a second time in the Predictor's PostProcessFunc to do that resizing, so an assertion is added to raise an error in this case. This is fine for most production use cases, where the input is not resized.
      
      Set `RCNN_EXPORT.INCLUDE_POSTPROCESS` to `True` to enable this.
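
      A minimal sketch of enabling it (`cfg` is assumed to be the d2go config used for export):

      ```
      # include the post-processing step in the exported model's graph
      cfg.RCNN_EXPORT.INCLUDE_POSTPROCESS = True
      ```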
      
      Reviewed By: tglik
      
      Differential Revision: D34904058
      
      fbshipit-source-id: 65f120eadc9747e9918d26ce0bd7dd265931cfb5
  36. 31 Mar, 2022 1 commit
  37. 01 Mar, 2022 1 commit
    • Allow Users to Disable the Evaluation after the Last Training Iteration · f16cc060
      Tong Xiao authored
      Summary:
      `Detectron2GoRunner` triggers an evaluation right after the last iteration of `runner.do_train` by default. This is sometimes unnecessary, because there is a `runner.do_test` at the end of training anyway.
      
      It can also have side effects. For example, it causes the training and test data loaders to be present at the same time, which led to an OOM issue in our use case.
      
      In this diff, we add an `eval_after_train` option to the `EvalHook` to allow users to disable the evaluation after the last training iteration.
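
      A hedged sketch of the new option (the hook is normally constructed inside the runner; the import path and the eval callable here are assumptions for illustration):

      ```
      from detectron2.engine.hooks import EvalHook  # assumed hook class; d2go builds this inside the runner

      hook = EvalHook(
          cfg.TEST.EVAL_PERIOD,                 # assumed eval period from the config
          lambda: runner.do_test(cfg, model),   # hypothetical eval callable
          eval_after_train=False,               # skip the extra eval right after the last iteration
      )
      ```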
      
      Reviewed By: wat3rBro
      
      Differential Revision: D34295685
      
      fbshipit-source-id: 3612eb649bb50145346c56c072ae9ca91cb199f5