    Implement sparse transformer fixed attention pattern (#804) · a03fe6fa
    Sara Hanson authored
    Summary:
    Pull Request resolved: https://github.com/facebookresearch/pytext/pull/804
    
    Pull Request resolved: https://github.com/fairinternal/fairseq-py/pull/746
    
    Pull Request resolved: https://github.com/pytorch/fairseq/pull/894
    
    Adding an implementation of the sparse transformer to multi-head attention, using the fixed attention pattern specified in https://arxiv.org/pdf/1904.10509.pdf. The sparse_mask masks out words using -inf; after softmax, those -inf entries become 0. Thus, the mask does not need to be re-calculated and re-applied when multiplying attn_weights and values.
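
    A minimal sketch (tensor shapes and mask contents are illustrative, not the actual fairseq code) of why the -inf mask only has to be applied once, before the softmax:

        import torch
        import torch.nn.functional as F

        # Illustrative shapes: (batch * heads, tgt_len, src_len) attention scores.
        attn_weights = torch.randn(2, 4, 4)

        # Hypothetical sparse_mask: 0 where attention is allowed, -inf where it is blocked.
        sparse_mask = torch.zeros(4, 4)
        sparse_mask[0, 2:] = float("-inf")

        # Adding -inf before the softmax drives the masked weights to exactly 0 ...
        attn_probs = F.softmax(attn_weights + sparse_mask, dim=-1)

        # ... so no further masking is needed when combining with the values.
        values = torch.randn(2, 4, 8)
        attn_output = torch.bmm(attn_probs, values)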
    
    Four inputs are added to the config: sparse, is_bidirectional, stride, and expressivity. When the sparse transformer is enabled, is_bidirectional, stride, and expressivity control the attention pattern (defaults are provided). If is_bidirectional is False, values are masked using the fixed attention pattern described in the paper. If is_bidirectional is True, subset one includes all values in the current stride window plus a summary from every stride window; all other values are masked. Stride (L in the paper) controls the window size, and expressivity (c in the paper) controls the size of the summary.
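
    For reference, a hypothetical reconstruction of the fixed pattern described above (the function name and the loop-based construction are illustrative, not the shipped implementation):

        import torch

        def fixed_attention_mask(seq_len, stride, expressivity, is_bidirectional):
            """Sketch of the fixed sparse pattern: True = attend, False = mask with -inf."""
            mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
            for i in range(seq_len):
                window_start = (i // stride) * stride
                for j in range(seq_len):
                    in_window = window_start <= j < window_start + stride
                    # Summary positions: the last `expressivity` columns of each stride window.
                    is_summary = (j % stride) >= stride - expressivity
                    if is_bidirectional:
                        # Subset one: the current stride window plus every window's summary.
                        allowed = in_window or is_summary
                    else:
                        # Unidirectional: only look backwards, within the window or at past summaries.
                        allowed = j <= i and (in_window or is_summary)
                    mask[i, j] = allowed
            return mask

        # Small example so the pattern is easy to inspect (stride = L, expressivity = c).
        print(fixed_attention_mask(seq_len=8, stride=4, expressivity=1, is_bidirectional=False).int())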
    
    Reviewed By: borguz
    
    Differential Revision: D16042988
    
    fbshipit-source-id: c59166dc7cfe89187a256e4076000c2458842fd5