tests/benchmarks/model_benchmarks/test_pytorch_base.py · b5b1c3dac7831f3568fb6b574f6ade2c7dc5b575 · tsoc / superbenchmark

Bug - Fix bug of duration feature for model benchmarks in distributed mode. (#347) · b5b1c3da

user4543 authored Apr 25, 2022

**Description**
Fix a bug in the duration feature for model benchmarks in distributed mode.

**Major Revision**
- Add all_reduce to sync the result of is_finished (the function that decides whether the model benchmark should stop) across ranks in each step
  - this avoids ranks disagreeing on when the duration has ended (otherwise some rank may enter one more step and never finish)
- Add torch.cuda.synchronize() before and after the step time measurement in train_step() for all model benchmarks
  - some operations in train_step() may be asynchronous, resulting in incorrect step time records (for example, lstm)
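
To illustrate why the first change is needed, here is a minimal pure-Python simulation, not the code from this commit. The function name `steps_per_rank` and its parameters are hypothetical, and the sketch assumes the per-rank flags are combined so that every rank stops as soon as any rank reports it is finished. In real PyTorch code that agreement would typically come from `torch.distributed.all_reduce` on a small flag tensor, for example with `ReduceOp.MAX`.

```python
from typing import List


def steps_per_rank(step_times: List[float], duration: float, sync: bool) -> List[int]:
    """Simulate how many steps each rank runs before it decides to stop.

    step_times[r] is the (hypothetical) wall time one step takes on rank r.
    A rank's local check is "elapsed >= duration", standing in for is_finished.
    """
    n = len(step_times)
    elapsed = [0.0] * n
    steps = [0] * n
    active = [True] * n

    while any(active):
        # Every rank that has not stopped yet runs one more step.
        for r in range(n):
            if active[r]:
                elapsed[r] += step_times[r]
                steps[r] += 1

        # Each rank evaluates its own stop condition.
        finished = [elapsed[r] >= duration for r in range(n)]

        if sync:
            # Stand-in for an all_reduce with a MAX reduction:
            # every rank receives the same combined flag.
            agreed = any(finished)
            finished = [agreed] * n

        for r in range(n):
            if finished[r]:
                active[r] = False

    return steps


if __name__ == "__main__":
    # Without syncing, the ranks run different numbers of steps; in a real
    # job the rank with extra steps would block in a collective forever.
    print(steps_per_rank([2.0, 3.0], 12.0, sync=False))
    # With syncing, all ranks stop after the same step.
    print(steps_per_rank([2.0, 3.0], 12.0, sync=True))
```

In the unsynced run the step counts diverge, which in a real distributed job means one rank waits on a collective operation that its peers never join. Sharing the flag costs one small reduction per step.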
