[PyTorch] Adjusted the logic of MHA and DPA to enable speculative decoding (#668)
* Modified MHA and DPA logic to use causal softmax and FA for inference Signed-off-by:Oleg Goncharov <ogoncharov@nvidia.com> * Adjusted unfused attention and softmax logic for inference Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Cleaned up the code per pylint Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Added test cases to evaluate numerics of incremental decoding Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Apply suggestions from code review Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> * Apply suggestions from code review [sequence start-end] Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> * Apply suggestions from code review [inference_params offset update]] Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> * Fixed bug in KV-cache indices and updated test suite Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Added inference_params description and applied suggestions from the code review Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Adjusted absolute tolerances in numerics tests Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Cleaned up the files per pylint Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> --------- Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptredak@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Showing
Please register or sign in to comment