1. 15 Dec, 2023 2 commits
      allow to ignore state dict keys in QAT model · c2256758
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/642
      
      When we build a QAT model using FX graph mode API **prepare_qat_fx** and **convert_fx**, they will run symbolic tracing following **module.forward()**.
      
      In certain cases, such as when a module takes a constant tensor as input, the symbolic tracing adds new tensor attributes with the name prefix **_tensor_constant** (https://fburl.com/code/msc4ch4o), which become new keys in the QAT model state dict.
      
      The current implementation of **_setup_non_qat_to_qat_state_dict_map** asserts that the number of keys in the state dicts of the original model and the QAT model is the same.
      
      Thus, we extend the **qat_state_dict_keys_to_ignore** method with an argument that allows specified state dict keys in the QAT model to be ignored.
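
      The idea can be sketched in plain Python as follows. This is a minimal illustration, not the d2go implementation: the helper names `filter_qat_keys` and `map_non_qat_to_qat_keys`, the substring-based matching, and the positional pairing of keys are all assumptions made for the example.

```python
from typing import Dict, Iterable, List


def filter_qat_keys(qat_keys: Iterable[str], ignore_substrings: Iterable[str]) -> List[str]:
    """Drop QAT state dict keys that contain any of the given substrings."""
    ignore = tuple(ignore_substrings)
    return [k for k in qat_keys if not any(s in k for s in ignore)]


def map_non_qat_to_qat_keys(
    original_keys: List[str],
    qat_keys: List[str],
    ignore_substrings: Iterable[str] = ("_tensor_constant",),
) -> Dict[str, str]:
    """Pair original keys with QAT keys by position (an assumption) after filtering."""
    kept = filter_qat_keys(qat_keys, ignore_substrings)
    # The key-count check only holds once the keys added by tracing are removed.
    assert len(original_keys) == len(kept), (len(original_keys), len(kept))
    return dict(zip(original_keys, kept))


# Hypothetical key names for illustration only.
original = ["conv.weight", "conv.bias"]
qat = ["conv.weight", "_tensor_constant0", "conv.bias"]
mapping = map_non_qat_to_qat_keys(original, qat)
```

      Without the filtering step, the assertion would fail for this input because the QAT key list has one extra entry.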
      
      Reviewed By: wat3rBro
      
      Differential Revision: D52152706
      
      fbshipit-source-id: 92219feae43bf8841b0a3a71adfbfcb84d8e8f95
      do not fuse model again for a QAT model · 8f130231
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/643
      
      A QAT model contains observers. After QAT training, those observers already contain updated statistics, such as min_val and max_val.
      
      When we want to export an FP32 QAT model for a sanity check, calling **fuse_utils.fuse_model()** again (it is often already called when we build the QAT model before QAT training) will remove the statistics in the observers.
      
      Reviewed By: wat3rBro
      
      Differential Revision: D52152688
      
      fbshipit-source-id: 08aa16f2aa72b3809e0ba2d346f1b806c0e6ede7
  2. 07 Dec, 2023 2 commits
  3. 30 Nov, 2023 1 commit
  4. 17 Nov, 2023 1 commit
      Use the consolidated snapshot API in Unitrace to support Zoomer · 87649f4f
      Wei Sun authored
      Summary: Similar to D48210543. Update the training_hooks to use the Unitrace memory snapshot APIs. This allows us to maintain a single path for memory snapshot APIs, and also to collect important details such as the snapshot location for Zoomer.
      
      Pulled By: HugeEngine
      
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/636
      
      Reviewed By: frabu6, aaronenyeshi, jackiexu1992, mengluy0125
      
      Differential Revision: D48368150
      
      fbshipit-source-id: b279adfa29d390e615d2c32a7ab9e05d95b4f164
  5. 10 Nov, 2023 1 commit
  6. 09 Nov, 2023 1 commit
  7. 05 Nov, 2023 1 commit
      allow to skip loading model weights in build_model() · f2a0c52c
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/630
      
      Currently, in the runner's **build_model()** method, when **eval_only=True**, we always try to load model weights.
      This is quite restrictive in some cases. For example, we may just want to build a model in eval mode to profile its efficiency when we have not yet trained the model or generated the model weights in a checkpoint file.
      
      Thus, this diff adds an argument **skip_model_weights** to allow users to skip the loading of model weights.
      Note that this diff is entirely backward-compatible and is NOT expected to break existing implementations.
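
      The control flow described above might look roughly like the sketch below. Only the argument names **eval_only** and **skip_model_weights** come from the summary; the `make_model` and `load_weights` callables and the dict-based "model" are hypothetical stand-ins, and the real runner method has a different signature.

```python
from typing import Callable, Dict, List


def build_model(
    make_model: Callable[[], Dict[str, float]],
    load_weights: Callable[[Dict[str, float]], None],
    eval_only: bool = False,
    skip_model_weights: bool = False,
) -> Dict[str, float]:
    """Build a model and, in eval mode, optionally skip loading its weights."""
    model = make_model()
    # The default skip_model_weights=False preserves the previous behavior.
    if eval_only and not skip_model_weights:
        load_weights(model)
    return model


calls: List[str] = []


def fake_loader(model: Dict[str, float]) -> None:
    # Hypothetical loader that records the call and fills in a weight.
    calls.append("loaded")
    model["w"] = 1.0


loaded = build_model(dict, fake_loader, eval_only=True)
profiled = build_model(dict, fake_loader, eval_only=True, skip_model_weights=True)
```

      Because the new argument defaults to `False`, existing callers that pass only **eval_only** see no change.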
      
      Reviewed By: navsud, wat3rBro
      
      Differential Revision: D50623772
      
      fbshipit-source-id: 282dc6f19e17a4dd9eb0048e068c5299bb3d47c2
  8. 01 Nov, 2023 1 commit
  9. 23 Oct, 2023 1 commit
  10. 20 Oct, 2023 1 commit
  11. 12 Oct, 2023 1 commit
      Enable training for fraction of total steps; enable early stopping from trial 0 · 3c724416
      Igor Fedorov authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/627
      
      Enable training for fraction of total steps: when doing HPO, users may want to train for a fraction of the number of training steps of a regular (baseline) training run. In this case, it is not enough to just change SOLVER.MAX_ITER because that also changes the learning rate schedule. We introduce a multiplier to be used on top of SOLVER.MAX_ITER when deciding how many steps to train for. This multiplier does not scale the number of steps over which the learning rate schedule is defined.
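
      A small sketch of why a separate multiplier differs from lowering SOLVER.MAX_ITER. The cosine schedule, the function names, and the concrete numbers are assumptions chosen for illustration; the summary does not specify the schedule or the name of the multiplier.

```python
import math
from typing import List


def lr_at(step: int, base_lr: float, max_iter: int) -> float:
    """Cosine learning rate schedule defined over max_iter steps."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / max_iter))


def steps_to_run(max_iter: int, multiplier: float) -> int:
    """Number of steps actually trained; the schedule length is left untouched."""
    return int(max_iter * multiplier)


max_iter, base_lr, multiplier = 1000, 0.1, 0.25
n_steps = steps_to_run(max_iter, multiplier)

# With the multiplier, the run follows the first quarter of the baseline schedule.
lrs: List[float] = [lr_at(s, base_lr, max_iter) for s in range(n_steps)]

# Shrinking max_iter instead would compress the whole schedule into n_steps.
compressed: List[float] = [lr_at(s, base_lr, n_steps) for s in range(n_steps)]
```

      In the first case the learning rate at each step matches the baseline run; in the second it decays to nearly zero within the shortened run.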
      
      Reviewed By: raghuramank100
      
      Differential Revision: D48699087
      
      fbshipit-source-id: 903f7c957ee471f36365c1449e9cd6a919fd260a
  12. 11 Oct, 2023 1 commit
  13. 05 Oct, 2023 1 commit
  14. 03 Oct, 2023 1 commit
  15. 27 Sep, 2023 2 commits
      Make EMA checkpointing with FSDP more robust · 477629d0
      Anthony Chen authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/615
      
      The previous FSDP EMA checkpointing logic directly handles `EMAState`: it manually calls `FSDP.summon_full_params()` to gather the full model params, and reconstructs/loads an `EMAState` for checkpointing. This logic has two drawbacks:
      
      1. `FSDP.summon_full_params()` gathers all model weights at the same time, which could cause OOM issues if the model can't fit into a single GPU. This is quite common for FSDP workloads.
      2. Directly saving and loading `EMAState` is error-prone. The EMA state dict has different semantics and behaviors than `model.state_dict()`. However, users often expect it to function seamlessly like the model state dict.
      
      This diff modifies the save/load logic of EMA to directly use `model.state_dict()`, which addresses the two pain points above.
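
      One way to route EMA checkpointing through `model.state_dict()` is to temporarily swap the EMA values into the model, read the state dict, and then restore the original weights. The sketch below shows that pattern with a toy model and no FSDP; `ToyModel` and `ema_weights_applied` are hypothetical names, and this is an assumption about the general approach rather than a description of the actual diff.

```python
from contextlib import contextmanager
from typing import Dict, Iterator


class ToyModel:
    """Stand-in for a model; exposes only state_dict() and load_state_dict()."""

    def __init__(self, params: Dict[str, float]) -> None:
        self.params = dict(params)

    def state_dict(self) -> Dict[str, float]:
        return dict(self.params)

    def load_state_dict(self, state: Dict[str, float]) -> None:
        self.params = dict(state)


@contextmanager
def ema_weights_applied(model: ToyModel, ema: Dict[str, float]) -> Iterator[ToyModel]:
    """Temporarily load EMA values into the model, then restore the originals."""
    backup = model.state_dict()
    model.load_state_dict(ema)
    try:
        yield model
    finally:
        model.load_state_dict(backup)


model = ToyModel({"w": 1.0})
ema_values = {"w": 0.9}
with ema_weights_applied(model, ema_values) as m:
    # The EMA weights are saved through the normal state_dict() path.
    ema_checkpoint = m.state_dict()
```

      Saving this way means the EMA checkpoint has the same keys and semantics as a regular model checkpoint.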
      
      Reviewed By: wat3rBro
      
      Differential Revision: D48813697
      
      fbshipit-source-id: be53c2677d2e493ba923508bbd82d9d295397941
      add damit uri support in train_net local run · c668ed4e
      Min Xu authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/622
      
      as title
      
      Reviewed By: jiaxuzhu92, ywwwer
      
      Differential Revision: D49672980
      
      fbshipit-source-id: f34ffe944c25c948fe1abd492ea0b96e47dc5b06
  16. 25 Sep, 2023 1 commit
      Propagate include_frozen/buffers to EMAState in FSDP FULL_STATE_DICT checkpoints · 206a05c6
      Ed Pizzi authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/620
      
      EMA can be configured to exclude frozen (`requires_grad=False`) parameters and buffers, reducing memory use and checkpoint size.
      
      However, `FULL_STATE_DICT` FSDP + EMA checkpoints construct an inner `EMAState` after unsharding FSDP parameters. This inner `EMAState` uses the default `include_frozen` and `include_buffers` settings, resulting in checkpoints that contain frozen parameters and buffers regardless of the configured settings.
      
      Propagate `include_frozen` and `include_buffers` settings to the inner `EMAState` when gathering `FULL_STATE_DICT` FSDP EMA state.
      
      This change only takes effect for frozen parameters in combination with a parallel fix to PyTorch FSDP that propagates `requires_grad` across parameter sharding/unsharding: https://github.com/pytorch/pytorch/pull/109892.
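
      The bug and the fix can be illustrated with toy classes. `ToyEMAState`, `Entry`, and the two helper functions are hypothetical; only the `include_frozen` and `include_buffers` setting names, and the fact that their defaults include everything, are taken from the summary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    """A named tensor: either a parameter or a buffer."""
    name: str
    requires_grad: bool = True
    is_buffer: bool = False


@dataclass
class ToyEMAState:
    include_frozen: bool = True
    include_buffers: bool = True

    def tracked_names(self, entries: List[Entry]) -> List[str]:
        names = []
        for e in entries:
            if e.is_buffer and not self.include_buffers:
                continue
            if not e.is_buffer and not e.requires_grad and not self.include_frozen:
                continue
            names.append(e.name)
        return names


def inner_state_with_defaults(outer: ToyEMAState) -> ToyEMAState:
    # Old behavior: the outer settings are silently dropped.
    return ToyEMAState()


def inner_state_propagated(outer: ToyEMAState) -> ToyEMAState:
    # Fixed behavior: copy the outer settings to the inner state.
    return ToyEMAState(outer.include_frozen, outer.include_buffers)


entries = [
    Entry("head.weight"),
    Entry("backbone.weight", requires_grad=False),
    Entry("bn.running_mean", requires_grad=False, is_buffer=True),
]
outer = ToyEMAState(include_frozen=False, include_buffers=False)
before = inner_state_with_defaults(outer).tracked_names(entries)
after = inner_state_propagated(outer).tracked_names(entries)
```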
      
      Reviewed By: daveboat
      
      Differential Revision: D49517178
      
      fbshipit-source-id: 0fe159dcec9ec1f2c456ae2ee7798681e7536249
  17. 21 Sep, 2023 1 commit
  18. 11 Sep, 2023 1 commit
  19. 06 Sep, 2023 1 commit
      Add check for empty bboxes · 66f626dd
      Karla Brkic authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/616
      
      The check for valid bboxes doesn't verify that the bbox list has exactly 4 elements, and crashes the training instead of marking empty bboxes as invalid (see f472700454).
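
      A minimal sketch of a length-checked validity test, assuming XYWH boxes where (x, y) is the top-left corner. The function signature and the exact bounds conditions are assumptions for illustration and may differ from the d2go code.

```python
from typing import Optional, Sequence


def valid_bbox(bbox: Optional[Sequence[float]], img_w: float, img_h: float) -> bool:
    """Return False for empty or malformed boxes instead of raising."""
    # Checking the length first avoids an unpacking error on empty lists.
    if bbox is None or len(bbox) != 4:
        return False
    x, y, w, h = bbox
    if w <= 0 or h <= 0:
        return False
    return 0 <= x <= img_w - w and 0 <= y <= img_h - h


empty_result = valid_bbox([], 100, 100)
good_result = valid_bbox([10, 10, 20, 20], 100, 100)
overflow_result = valid_bbox([90, 10, 20, 20], 100, 100)
```

      With the length check in place, an empty bbox list is marked invalid and training can continue.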
      
      Reviewed By: tglik
      
      Differential Revision: D48653084
      
      fbshipit-source-id: 2d47fb267c5e51ab27798662ae739014f3d310e4
  20. 24 Aug, 2023 1 commit
  21. 22 Aug, 2023 2 commits
  22. 19 Aug, 2023 1 commit
  23. 12 Aug, 2023 1 commit
      Summary: · f7e1b47e
      Yichao Lu authored
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/609
      
      In the previous code, the valid_bbox function was designed only for XYWH horizontal bboxes. This caused XYWHA rotated bboxes to be marked invalid when they were large or close to the right edge of the image. This diff therefore adds a separate valid_bbox_rotated for XYWHA-format bboxes.
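
      The sketch below shows how a corner-based XYWH check can reject a rotated box. It assumes the XYWHA convention in which the first two values are the box center; the specific validity rule for rotated boxes (positive size, center inside the image) and both signatures are assumptions, not the rule used in d2go.

```python
from typing import Optional, Sequence


def valid_bbox_xywh(bbox: Optional[Sequence[float]], img_w: float, img_h: float) -> bool:
    """Horizontal box: (x, y) is the top-left corner."""
    if bbox is None or len(bbox) != 4:
        return False
    x, y, w, h = bbox
    return w > 0 and h > 0 and 0 <= x <= img_w - w and 0 <= y <= img_h - h


def valid_bbox_rotated(bbox: Optional[Sequence[float]], img_w: float, img_h: float) -> bool:
    """Rotated box: (xc, yc) is the center; the angle is not range-checked here."""
    if bbox is None or len(bbox) != 5:
        return False
    xc, yc, w, h, _angle = bbox
    return w > 0 and h > 0 and 0 <= xc <= img_w and 0 <= yc <= img_h


# A rotated box whose center is near the right edge of a 100x100 image.
rotated = [90.0, 50.0, 30.0, 20.0, 15.0]
as_horizontal = valid_bbox_xywh(rotated[:4], 100, 100)
as_rotated = valid_bbox_rotated(rotated, 100, 100)
```

      Treating the center coordinate as a top-left corner makes the box appear to extend past the image, which is why the horizontal check fails here.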
      
      Reviewed By: debowin
      
      Differential Revision: D48138234
      
      fbshipit-source-id: d09d209afde9843624169af04f2e1692180bca0d
  24. 08 Aug, 2023 1 commit
  25. 07 Aug, 2023 2 commits
  26. 04 Aug, 2023 1 commit
      only select pth files with prefix "model" as model checkpoint file · 94c7f647
      Zhicheng Yan authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/605
      
      The D2GO workflow's async validation monitors the model checkpoint files (*.pth) in the **e2e_train** folder (such as **model_0004999.pth** and **model_final.pth**) and launches the async val operator as needed.
      All model checkpoint files have the prefix **"model"**. In some cases, there are also non-model-checkpoint files with the pth file extension.
      To exclude them, add a filter that checks whether the file name starts with "model".
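
      Such a filter could be sketched as follows. The function name `select_model_checkpoints` and the non-checkpoint file names in the example are hypothetical.

```python
import os
from typing import Iterable, List


def select_model_checkpoints(paths: Iterable[str]) -> List[str]:
    """Keep only *.pth files whose base name starts with "model"."""
    selected = []
    for path in paths:
        name = os.path.basename(path)
        if name.startswith("model") and name.endswith(".pth"):
            selected.append(path)
    return selected


files = [
    "e2e_train/model_0004999.pth",
    "e2e_train/model_final.pth",
    "e2e_train/ema_state.pth",  # hypothetical non-checkpoint .pth file
    "e2e_train/metrics.json",
]
checkpoints = select_model_checkpoints(files)
```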
      
      Reviewed By: ayushidalmia
      
      Differential Revision: D48021972
      
      fbshipit-source-id: 54d9c14117192809ea76d812ebd4240b44166637
  27. 25 Jul, 2023 2 commits
  28. 21 Jul, 2023 2 commits
  29. 19 Jul, 2023 2 commits
  30. 18 Jul, 2023 1 commit
  31. 14 Jul, 2023 1 commit
  32. 12 Jul, 2023 1 commit
      Extend reply files to all binaries · e4fa6d63
      Francisc Bungiu authored
      Summary:
      Pull Request resolved: https://github.com/facebookresearch/d2go/pull/591
      
      We previously added reply files for train_net, but not the other relevant binaries with MAST support: evaluator and lightning.
      Adding support here by extracting the common bits into a separate module and wrapping the functions to reuse the functionality.
      
      Differential Revision: D47293689
      
      fbshipit-source-id: 70630a471c0cf037d180c9edfb57a4db4fdf7bde