    Improve memory efficiency of FP16 optimization (#404) · 03a57dec
    Myle Ott authored
    Summary:
    Previously, when training with --fp16, we stored a separate FP32 copy of the model parameters for optimization, which consumed a lot of memory. An alternative is to do the conversions to FP32 on the fly, as the optimizer needs them, which allows the caching allocator to reuse that memory.
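    
    A minimal sketch of the two strategies, assuming a plain SGD update (the function names and the update rule are illustrative placeholders, not fairseq's actual FP16 optimizer code):
    
        import torch
    
        def step_with_fp32_copy(params_fp16, master_fp32, lr=1e-3):
            # Old approach: a persistent FP32 master copy of every parameter
            # lives for the whole run (an extra 4 bytes per parameter).
            for p16, p32 in zip(params_fp16, master_fp32):
                p32.add_(p16.grad.float(), alpha=-lr)  # update in FP32
                p16.data.copy_(p32)                    # write back as FP16
    
        def step_on_the_fly(params_fp16, lr=1e-3):
            # New approach: upcast to FP32 only for the duration of the update.
            # The temporaries are freed at the end of each loop iteration, so
            # the caching allocator can reuse that memory elsewhere.
            for p16 in params_fp16:
                p32 = p16.data.float()    # temporary FP32 weights
                g32 = p16.grad.float()    # temporary FP32 gradient
                p32.add_(g32, alpha=-lr)  # update in FP32 for accuracy
                p16.data.copy_(p32)       # write back as FP16
                # p32 and g32 go out of scope here; their memory is reused
    
        # Toy usage: one FP16 parameter with a dummy gradient.
        p = torch.zeros(4, dtype=torch.float16)
        p.grad = torch.ones(4, dtype=torch.float16)
        step_on_the_fly([p])
    
    The on-the-fly version trades a small amount of extra conversion work per step for not holding a second full-precision copy of the model for the whole run.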
    
    This reduces peak memory usage by ~20%, with a negligible reduction in training speed (~2% slower), when training a big Transformer on 8 GPUs on WMT En-De with --update-freq=16.
    
    This does not affect convergence, i.e., models will train exactly as they did before.
    Pull Request resolved: https://github.com/pytorch/fairseq/pull/404
    
    Differential Revision: D13394376
    
    Pulled By: myleott
    
    fbshipit-source-id: 2b9f808548df4782110513c9cfc9f7c6159bcbbf