[PyTorch/Jax] Fix attention mask definition, and sliding window for decoder (#818)

* fix inconsistency for attn mask; now True means participating in attn
* fix sliding window window_size for decoder+padding combination
* revert paddle changes regarding mask
* revert softmax to 1-mask; 0-keep
* enforce 1-mask out; 0-keep rule for jax masks
* fix jax lint
* revert pytorch mask changes; some kept in tests
* revert to jax fused attn on main
* inverse mask logic for get_cu_seqlens/_and_indices in PyTorch implementation and mask generation in unit tests
* temporarily disable update_weight_scale_inv
* enforce window_size for decoder
* add docstring for mask definition 1-mask out; 0-keep
* add aux_ctx_tensors to save_for_backward
* tweak make_decoder_mask and make_mask in jax tests
* skip dBias for shapes other than 1HSS; otherwise dq/dk/dv NaNs
* expand attn_biases from list to variables in save_for_backward
* fix use of variable before assignment in jax dact_lu
* remove window size definition for decoder
* add change notes in README for padding mask in PyTorch
* tweak padding mask notes in README
* expand list to tensors for save_for_backwards

---------

Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>
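
The "1-mask out; 0-keep" rule named in the notes above can be illustrated with a small sketch. It uses plain Python lists rather than framework tensors, and the helper name `cu_seqlens_from_padding_mask` is hypothetical: it is not the `get_cu_seqlens` function touched by this commit, only an assumed illustration of how cumulative sequence lengths might be derived from a padding mask under that convention.

```python
from itertools import accumulate
from typing import List


def cu_seqlens_from_padding_mask(mask: List[List[bool]]) -> List[int]:
    """Cumulative sequence lengths for a [batch, seq] padding mask.

    Assumed convention (hypothetical helper, not the library API):
    True (1) marks a position that is masked out, False (0) one that is kept.
    """
    # Count the kept positions in each row, i.e. the entries that are not masked.
    seqlens = [sum(1 for masked in row if not masked) for row in mask]
    # Prefix sums with a leading 0: entry i is the offset where sequence i starts.
    return [0] + list(accumulate(seqlens))


# Two sequences padded to length 4: the first keeps 3 tokens, the second keeps 2.
padding_mask = [
    [False, False, False, True],
    [False, False, True, True],
]
cu_seqlens = cu_seqlens_from_padding_mask(padding_mask)
```

Under the opposite convention (True means participating in attention), the count would be taken over the mask itself instead of its inverse, which is why a consistent definition matters when building masks.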