"...git@developer.sourcefind.cn:chenpangpang/transformers.git" did not exist on "0603564e9323bd424217581e5297da6cd202817b"
adapt attention masks for the decoder case
Introducing a decoder requires two changes to the attention masks:
- The cross-attention needs a separate mask so that decoder positions do not attend to padding tokens in the encoder hidden states.
- The self-attention in the decoder must be causal, in addition to not attending to padding tokens.
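A minimal sketch of the two masks described above, assuming PyTorch-style additive attention masks (0 where attention is allowed, a large negative value where it is masked). This is an illustration, not the actual transformers implementation; the function and variable names are made up for the example.

```python
import torch

def make_decoder_masks(decoder_input_ids, encoder_input_ids, pad_token_id,
                       dtype=torch.float32):
    """Build the decoder self-attention mask and the cross-attention mask."""
    bsz, tgt_len = decoder_input_ids.shape
    src_len = encoder_input_ids.shape[1]
    neg_inf = torch.finfo(dtype).min

    # Cross-attention: hide encoder padding positions from every decoder query.
    enc_pad = encoder_input_ids.eq(pad_token_id)                    # (bsz, src_len)
    cross_attn_mask = enc_pad[:, None, None, :].expand(bsz, 1, tgt_len, src_len)

    # Decoder self-attention: causal mask combined with the decoder padding mask.
    causal = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)
    dec_pad = decoder_input_ids.eq(pad_token_id)                    # (bsz, tgt_len)
    self_attn_mask = causal[None, None, :, :] | dec_pad[:, None, None, :]

    # Convert boolean masks to additive masks added to the attention scores.
    to_additive = lambda m: m.to(dtype) * neg_inf
    return to_additive(self_attn_mask), to_additive(cross_attn_mask)

# Example: padded encoder and decoder inputs (pad_token_id assumed to be 1).
encoder_input_ids = torch.tensor([[5, 6, 7, 1, 1]])
decoder_input_ids = torch.tensor([[2, 8, 9, 1]])
self_mask, cross_mask = make_decoder_masks(decoder_input_ids, encoder_input_ids, pad_token_id=1)
print(self_mask.shape)   # (1, 1, 4, 4)
print(cross_mask.shape)  # (1, 1, 4, 5)
```

Both masks are broadcast over attention heads and added to the raw attention scores before the softmax, so masked positions receive effectively zero attention weight.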