    Bug - Fix bug of duration feature for model benchmarks in distributed mode. (#347) · b5b1c3da
    user4543 authored
    **Description**
    Fix a bug in the duration feature for model benchmarks in distributed mode.
    
    **Major Revision**
    - Add all_reduce to sync the result of is_finished (the function that decides whether the model benchmark should stop) at each step
      - to keep ranks from disagreeing on when the duration ends (otherwise a rank may enter one extra step and never finish)
    - Add torch.cuda.synchronize() before and after the step time measurement in train_step() for all model benchmarks
      - some operations in train_step() may be asynchronous, which results in incorrect step time records (for example, LSTM)
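
The two revisions above can be sketched roughly as follows. This is a minimal illustration and not the code from this commit. The helper names `sync_is_finished`, `timed_step`, and `run_benchmark` are hypothetical. Using `ReduceOp.MAX`, so that every rank stops as soon as any rank reaches its duration limit, is an assumption. The sketch falls back to single-process behavior when PyTorch or a process group is not available.

```python
import time

try:
    import torch
    import torch.distributed as dist
except ImportError:  # lets the sketch run without PyTorch installed
    torch = None
    dist = None


def _distributed_ready():
    """True only when a torch.distributed process group is initialized."""
    return dist is not None and dist.is_available() and dist.is_initialized()


def sync_is_finished(local_finished, device="cpu"):
    """Return the same stop decision on every rank (hypothetical helper).

    Assumption: MAX is used, so one rank reaching its limit stops all ranks.
    With the NCCL backend, `device` must be a CUDA device.
    """
    if not _distributed_ready():
        return bool(local_finished)
    flag = torch.tensor([int(local_finished)], dtype=torch.int32, device=device)
    dist.all_reduce(flag, op=dist.ReduceOp.MAX)
    return bool(flag.item())


def timed_step(step_fn):
    """Run step_fn once and return the elapsed time in milliseconds.

    CUDA kernels launch asynchronously, so the device is synchronized
    before starting and before stopping the timer.
    """
    use_cuda = torch is not None and torch.cuda.is_available()
    if use_cuda:
        torch.cuda.synchronize()
    start = time.perf_counter()
    step_fn()
    if use_cuda:
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0


def run_benchmark(step_fn, duration_s, device="cpu"):
    """Repeat step_fn until the ranks agree that the duration has elapsed."""
    step_times_ms = []
    begin = time.perf_counter()
    while True:
        step_times_ms.append(timed_step(step_fn))
        local_finished = (time.perf_counter() - begin) >= duration_s
        # Every rank calls this at the same step, so all ranks leave together.
        if sync_is_finished(local_finished, device):
            break
    return step_times_ms
```

Without the shared decision, each rank compares its own clock against the duration, so one rank can start a step that the others never join.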
test_pytorch_base.py 9.75 KB