    Fix of MHA for TPUs (#636) · ee8bcb17
    Sergey Edunov authored
    Summary:
    Multi-head attention is currently not TPU-friendly; in particular, .data_ptr() is not supported on TPUs and should not be used. There are also potential correctness issues in the existing code (e.g. data_ptr() can point to the same storage for different tensors). Rather than relying on data_ptr(), we should explicitly set the self_attention or encoder_decoder_attention flags.
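    
    For illustration, below is a minimal sketch of the flag-based approach. SimplifiedMultiheadAttention is a hypothetical stand-in, not fairseq's actual MultiheadAttention; only the self_attention / encoder_decoder_attention constructor flags mirror what this change adds.
    
        import torch
        import torch.nn as nn
    
        class SimplifiedMultiheadAttention(nn.Module):
            # Illustrative sketch: dispatch on explicit flags set at construction
            # time instead of comparing tensor .data_ptr() values at forward time.
            def __init__(self, embed_dim, num_heads,
                         self_attention=False, encoder_decoder_attention=False):
                super().__init__()
                assert not (self_attention and encoder_decoder_attention)
                self.self_attention = self_attention
                self.encoder_decoder_attention = encoder_decoder_attention
                self.attn = nn.MultiheadAttention(embed_dim, num_heads)
    
            def forward(self, query, key=None, value=None):
                # Old, TPU-unfriendly check (and potentially wrong when distinct
                # tensors happen to share storage):
                #   qkv_same = query.data_ptr() == key.data_ptr() == value.data_ptr()
                # New approach: trust the flags the caller declared explicitly.
                if self.self_attention:
                    key = value = query
                elif self.encoder_decoder_attention:
                    # key and value both come from the encoder output
                    value = key
                attn_out, _ = self.attn(query, key, value)
                return attn_out
    
        # Usage: a decoder self-attention layer declares its role up front.
        self_attn = SimplifiedMultiheadAttention(512, 8, self_attention=True)
        x = torch.randn(10, 2, 512)   # (seq_len, batch, embed_dim)
        out = self_attn(x)            # no .data_ptr() comparisons needed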
    Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/636
    
    Reviewed By: myleott
    
    Differential Revision: D15709898
    
    Pulled By: edunov
    
    fbshipit-source-id: f931713193c51be848a5de20da730ac3a3ce0187
transformer.py 35.3 KB