- 26 Jun, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
* Add example of formatting files locally Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 14 Jun, 2024 2 commits
-
-
Kirthi Shankar Sivamani authored
* Apply formatting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Apply formatting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
* Initial config test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * remove linters, fix clang-format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix clang-format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix clang-format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Remove lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Adjust config Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * use config file Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * adjust pylintrc Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * pre-format fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Python only Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add FA module Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update CI configs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * CRLF -> LF Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * revert accidental formatting changes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * try with sudo Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * cpp formatting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix pylint error properly Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * some review comments Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * lint fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * add fp8 attn include in the correct file Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * autofix PRs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 07 Jun, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 06 Jun, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
Cleanup Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 30 May, 2024 2 commits
-
-
Xin Yao authored
* add multi-tensor kernels Signed-off-by:
Xin Yao <xiny@nvidia.com> * add FusedAdam Signed-off-by:
Xin Yao <xiny@nvidia.com> * add test to qa Signed-off-by:
Xin Yao <xiny@nvidia.com> * add FusedSGD Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix lint Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Tim Moon authored
* Initial refactor of FP8 workspaces in Linear module Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove extra kernel launch Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Minor perf optimizations Tensor base class functions in Float8Tensor have significant overhead. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug FP8 recipe test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Refactor FP8 workspaces in LayerNormLinear and LayerNormMLP Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Document FP8 workspace function Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Revert changes to FP8 recipe tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add support for lazy FP8 transpose caching Previous caching behavior (always fill cache) incorrectly filled cache during CUDA graph warmup steps. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix Pylint warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug ONNX export ONNX FP8 cast ops assumed that FP8 scales were created during model export (i.e. not initialized during training). Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug fused attention tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make sure Float8Tensor.transpose_2d is backward compatible Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Revert changes to ONNX export operations Work around ONNX test failures by filling FP8 scale tensors instead of copying into them. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug scale factor update in Float8Tensor transpose_2d Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 25 May, 2024 1 commit
-
-
Paweł Gadziński authored
* Fixed Llama tutorial. Changed batch size and added fused=True. Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Tutorial updated but not complete yet. Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Tutorial notebook reseted - removed fuse=true Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Removed fused=true Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Batch size back to 8 Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Typo and commented out line Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> * fixed whitespace Signed-off-by:
root <root@ipp2-0037.nvidia.com> * fixed whitespace Signed-off-by:
root <root@ipp2-0037.nvidia.com> * Added comment to attention line. Fixed potential bug with loading weights - now loading works correctly, confirmed by the generation code. Signed-off-by:
root <root@ipp2-1661.nvidia.com> * Comments Signed-off-by:
root <root@ipp2-1661.nvidia.com> * Models cast added again Signed-off-by:
root <root@ipp2-1661.nvidia.com> * Weight download info Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Moved parameter gate_proj_size to config Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * gate_proj_size removed and put immediate_size instead Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Llama 3 added to tutorial Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Typos fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Typos fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Fixed model loading Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Loading fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Different dim for attention Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Reversed other commit Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Changed name to kv_channels Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Fixed typo Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Back to kv_channels in transformer layer Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Back to kv_channels in transformer layer Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Small bug fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Small bug fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Test fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * changed file modes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * lint fix and resolved conflict Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * lint fix and resolved conflict Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Lint fix, hopefully last Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
root <root@ipp2-0037.nvidia.com> Signed-off-by:
root <root@ipp2-1661.nvidia.com> Co-authored-by:
root <root@ipp2-2373.nvidia.com> Co-authored-by:
root <root@ipp2-1588.nvidia.com> Co-authored-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
root <root@ipp2-0037.nvidia.com> Co-authored-by:
root <root@ipp2-1661.nvidia.com> Co-authored-by:
root <root@ipp2-2371.nvidia.com> Co-authored-by:
root <root@ipp2-1589.nvidia.com> Co-authored-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 20 May, 2024 1 commit
-
-
Paweł Gadziński authored
* Calibration fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Lint fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
Pawel Gadzinski <pgadzinski@nvidia.com>
-
- 13 May, 2024 1 commit
-
-
Kunlun Li authored
Signed-off-by:
kunlunl <kunlunl@nvidia.com> Co-authored-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
- 01 May, 2024 1 commit
-
-
Jinze Xue authored
* Handle the scaling factor when amax is too tiny that leads to an infinite scale Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * revert formatting changes Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * fix comments Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * Apply review suggestion Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> * Apply review suggestion Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> * Apply review suggestion Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> * apply review suggestion Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * add test_recipe.py to qa/L0_pytorch_unittest/test.sh; fix unittest for is_first_microbatch=False Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * revert changes to update_weight_scale_inv Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * Debug test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Jinze Xue <jinzex@nvidia.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Jinze Xue <jinzex@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
- 18 Apr, 2024 1 commit
-
-
Alp Dener authored
* fixed static argnums for jax.jit in single gpu encoder test, changed warning filtering for pytest Signed-off-by:
Alp Dener <adener@nvidia.com> * propagating the fix to the JAX mnist example Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed missing space ibetween flags i QAA scripts Signed-off-by:
Alp Dener <adener@nvidia.com> * added TE warnings into the ignore list Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com>
-
- 16 Apr, 2024 1 commit
-
-
cyanguwa authored
* WIP: fp8 v1 fprop integration Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add more debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fprop working for h1; w/ debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: add bprop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * cleanup; bprop running but has mismatches Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add gitlab frontend as submodule Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up and add back v0.9.2 FE support; fprop/bprop passing with 5e-2 tols Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix after merge; add bias_b/h to caching descriptor Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * distinguish fwd/bwd tensor types for bprop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix for F16 cases; include added dqkv_type and d_scale_dp Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * adjust out shape for bwd in test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add casting from/to FP8 to DPA module Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: bshd_bshd_bshd layout Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: support all sbhd/bshd layouts Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add qkvpacked and kvpacked support in both FusedAttnFunc and C levels Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove qkvpacked/kvpacked calls in DPA module (used for testing) Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove tp setup; add allow_non_contiguous; update FE; revert to sbh3d in tests; clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add NVTE_FP8_DPA_BWD to control whether to use FP8 bwd or F16 bwd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix MQA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix MQA/GQA in FP8 v1 API Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update FE to 705d8e3, with API change Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * test causal mask Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * restrict mha_fill for THD format Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fused attn with CP and comment out is_alibi code Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up FE0.9 vs FE1.0 FP8 implementations, and related unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * change NVTE_FP8_DPA_BWD default to 1, and fix its use in qkvpacked/kvpacked APIs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint and self.tp_size/group in FusedAttention() Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update FE to 6902c94 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add FP8 MHA support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update to FE v1.3.0 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes for FP8 MHA with different configs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * emit stats regardless of is_training Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix linear when input is not Float8Tensor Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix d_out type when f16 bprop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix user buffer for layernorm_linear/linear and revert two FP8 casts in MHA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add docstring for fp8_dpa/mha in recipe Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix backend selection to avoid FA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace transpose with transpose_2d Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * use RMSE for FP8 unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace two more transpose with transpose_2d Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add FP8 initialization to FusedAttention Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rm docs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Revert "add FP8 initialization to FusedAttention" This reverts commit 15fffd825d6f23f31ea709b16ba01dfd61efabf8. Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change order of ctxs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add back docs and mark as beta Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes for tests and docs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 12 Apr, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
* FP8 cuda graphs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Charlene Yang <charleney@nvidia.com> * Fix numerics Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * exclude torch compile from numerics tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * More numerics fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix CI Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * rm fusion from unfused path Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Charlene Yang <charleney@nvidia.com>
-
- 20 Feb, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 01 Feb, 2024 1 commit
-
-
zlsh80826 authored
* Fix unfused GQA perf Signed-off-by:
Reese Wang <rewang@nvidia.com> * Remove WAR for Check failed: reduction_kind.has_value() Signed-off-by:
Reese Wang <rewang@nvidia.com> --------- Signed-off-by:
Reese Wang <rewang@nvidia.com>
-
- 30 Jan, 2024 1 commit
-
-
Tim Moon authored
* Replace paddle.fluid imports with paddle.base Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove paddle.fluid usage from tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 25 Jan, 2024 1 commit
-
-
Xin Yao authored
* fused apply rope Signed-off-by:
Xin Yao <xiny@nvidia.com> * Apply suggestions from code review Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Xin Yao <yaox12@outlook.com> * resolve comments Signed-off-by:
Xin Yao <xiny@nvidia.com> * make rotary_percent optional Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix ci Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix test Signed-off-by:
Xin Yao <xiny@nvidia.com> * add rope test to qa Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix linting Signed-off-by:
Xin Yao <xiny@nvidia.com> * sync apex: add transpose_output_memory Signed-off-by:
Xin Yao <xiny@nvidia.com> * small fix Signed-off-by:
Xin Yao <xiny@nvidia.com> * sync apex: fuse sin/cos Signed-off-by:
Xin Yao <xiny@nvidia.com> * sync apex: fused rope for thd format Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix lint Signed-off-by:
Xin Yao <xiny@nvidia.com> * Fix license headers Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * add support for bshd format Signed-off-by:
Xin Yao <xiny@nvidia.com> * support different seq length Signed-off-by:
Xin Yao <xiny@nvidia.com> * update Signed-off-by:
Xin Yao <xiny@nvidia.com> * update copyright Signed-off-by:
Xin Yao <xiny@nvidia.com> * remove transpose_output_memory Signed-off-by:
Xin Yao <xiny@nvidia.com> * Make outputs contiguous in SBHD case Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
Xin Yao <yaox12@outlook.com> Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptredak@nvidia.com>
-
- 23 Jan, 2024 1 commit
-
-
Alp Dener authored
* added missing parameter materialization on real device for LayerNorm and RMSNorm Signed-off-by:
Alp Dener <adener@nvidia.com> * added new unittest for deferred initialization and modified parameter materialization to support standalone execution outside of FSDP Signed-off-by:
Alp Dener <adener@nvidia.com> * restored tensor parallel attributes that were being wiped out by the parameter reset Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed incorrect order of fp8 metadata initialization Signed-off-by:
Alp Dener <adener@nvidia.com> * added deferred init unittest to the QA script Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com>
-
- 16 Jan, 2024 1 commit
-
-
zlsh80826 authored
* Support num_gqa_groups arguments Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add GQA support on the JAX bridge code Signed-off-by:
Reese Wang <rewang@nvidia.com> * Fix the kv stride of the arbitrary backend Signed-off-by:
Reese Wang <rewang@nvidia.com> * Complete rewrite fused attention tests and add GQA coverage Signed-off-by:
Reese Wang <rewang@nvidia.com> * Support unfused GQA Signed-off-by:
Reese Wang <rewang@nvidia.com> * Calculate seqlen before the primitive for the better perf Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add GQA layer tests Signed-off-by:
Reese Wang <rewang@nvidia.com> * Apply code style checks for te_jax Signed-off-by:
Reese Wang <rewang@nvidia.com> * Apply code style checks for tests Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add num_gqa_groups doc Signed-off-by:
Reese Wang <rewang@nvidia.com> * Refine the qkv_type Signed-off-by:
Reese Wang <rewang@nvidia.com> * Correct the variable naming Signed-off-by:
Reese Wang <rewang@nvidia.com> * Handle Max512 CAUSAL Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add WAR for the latest jax image Signed-off-by:
Reese Wang <rewang@nvidia.com> --------- Signed-off-by:
Reese Wang <rewang@nvidia.com>
-
- 10 Jan, 2024 1 commit
-
-
Xiaowei Ren authored
* try to use cuDNN fused attention for context parallelism Signed-off-by:
xren <xren@nvidia.com> * assert CP is only supported with NVTE_F16_arbitrary_seqlen Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * port fused attn api to context parallelism Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add one more assert Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert CP does not support padded tokens Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add qkv_format into CP implementation Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove qkv_format from CP function Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix qkv_for,at Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix bwd error with FA v2 Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make cp implementation support non-causal masking Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * remove redundant asserts for CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor assert information change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert core attn bias has not been supported with CP yet Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make CP work with window_sizes of [-1, -1] and [-1, 0] Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add draft code for fa test with cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * move fused attn test to a specific folder Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add assert_close to flash attn cp test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add more tests for CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add optional arguments for FA v2.4+ Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * minor change Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add skip condition for CP test Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * class and function naming fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * docstring fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * do not use fused attn if backend does not work with CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * create a separate folder for CP test as it needs multi-GPUs Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add attn_mask_type check in attn_forwrad_func_with_cp Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * code format fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
xren <xren@nvidia.com> Signed-off-by:
Xiaowei Ren <xren@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
- 03 Jan, 2024 1 commit
-
-
Przemyslaw Tredak authored
Signed-off-by:Przemek Tredak <ptredak@nvidia.com>
-
- 07 Dec, 2023 1 commit
-
-
cyanguwa authored
* Integrate cuDNN frontend v1 to fused attention and miscellaneous fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix jax/paddle for unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix jax/pytorch lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * simplify stride generation Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix and/or logic in get_backend Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix flag_max512 and test_numerics Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove v.contiguous() since get_qkv_layout covers it Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * skip fp8 tests for sm89 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * further fix jax CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix jax CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert mask type to comma-separated list Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix last two commits Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * integrate v1/pre-release-5 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * cleanup prerelease5 integration and fix FA2.1 commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * force dropout to 0 if not training Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix Jax CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * testing bias/alibi and padding+causal; add alibi to unfused DPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * set flag_arb to false when non determinism is not allowed Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * followup on prev commit; remove redundant python env var setting Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: minor tweaks for tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * prepare for tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix determinism logic for fused attn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add bias to bwd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix gpt_checkpointing/dpa_accuracy problem Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix some seg fault issues Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add failure notes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove use of non-deter var for backend selection Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix for lint and CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix workspace size in bwd and uncomment bias test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix get_alibi and remove check_support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update tests status Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove workspace_opt from FADescriptor_v1 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable arbitrary backend + post scale bias in Jax; waiting on PR 525 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up bhsd order Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * swap bias/rng_state order in aux_ctx_tensor and add bias to aux_ctx_tensor in _qkvpacked/_kvpacked API Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove support for padding_causal + cross for max512 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * change alibi bias to float32 for bias_1_4/5 tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * further clean up tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix thd fwd output shape for FlashAttention and add backend info for DPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix definition of workspace limit when dbias is present Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * further tweak DP_WORKSPACE_LIMIT definition Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disallow alibi+no_mask for sdpa flash and update alibi tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update jax/paddle after PR525 and fix DP_WORKSPACE_LIMIT for dbias Jax tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable dbias for non-hopper archs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix layernorm lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remode unused arg for lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove build dir in setup.py Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * change selection logic to prefer fused attn on sm90 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix distributed jax test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix h and s order in header Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update to cudnn fe v1 branch Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove manual setting of workopt path due to dbias after v1 update Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix paddle CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add post_scale_bias and alibi to sdpa flash support matrix Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix support matrix in header files Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * move headers back to .cu and change seed/offset to int64 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update Megatron commit in L1 test and remove all prints in fused attn test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix L1 Megatron test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fp8 arg in L1 Megatron script Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * print only when debug flag is on Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove checkpointing loading to avoid loading other tests results Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
- 20 Nov, 2023 1 commit
-
-
zlsh80826 authored
* Remove assertion for NO_MASK Signed-off-by:
Reese Wang <rewang@nvidia.com> * Fix JAX distributed unit tests name Signed-off-by:
Reese Wang <rewang@nvidia.com> --------- Signed-off-by:
Reese Wang <rewang@nvidia.com>
-
- 03 Nov, 2023 1 commit
-
-
Alp Dener authored
[JAX] Regression tests for custom ops sharding with both xmap and custom_partitioning. Coverage: - layernorm: fwd/grad, zero_centered_gamma, DP, TP_COL, DP_TP_COL - rmsnorm: fwd/grad, DP, TP_COL, DP_TP_COL - softmax: fwd/grad, SCALED, SCALED_MASKED, SCALED_UPPER_TRIANG_MASKED, DP, TP_COL, TP_ROW, DP_TP_COL, DP_TP_ROW - self_fused_attn: fwd/grad, NO_BIAS, PRE_SCALE_BIAS, POST_SCALE_BIAS, NO_MASK, CAUSAL_MASK, PADDING_MASK, DP, TP_COL, DP_TP_COL - cross_fused_attn: fwd/grad, NO_BIAS, NO_MASK, PADDING_MASK, DP, TP_COL, DP_TP_COL Signed-off-by:Alp Dener <adener@nvidia.com>
-
- 31 Oct, 2023 1 commit
-
-
Tim Moon authored
* Experimental FP8 tensor Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add fp8 tensor to ci test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * review comments and tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Minor changes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Default to FP8 usage Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix docs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Naming changes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * minor fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix transpose caching Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Debug transpose caching Handle case where transpose cache is updated externally. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Rename FP8GlobalStateManager.with_fp8_parameters Signed-off-by:
Tim Moon <tmoon@nvidia.com> * remove Float8Tensor from import API Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Avoid caching FP8 transposes if not required Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix import error in FP8 tensor tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix tranpose caching and checkpointing bug Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Improve caching and fix distopt case Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update transformer_engine/pytorch/float8_tensor.py Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Remove recursive logic Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix cache reset bug Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Store FP8 attributes in dict Easier for multiple tensors to share, e.g. detached tensors. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make sure scale_inv is 1D tensor Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make sure scale_inv is 1D tensor Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fixes and detach recipe Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Set default fp8 data type Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com>
-
- 26 Oct, 2023 1 commit
-
-
zlsh80826 authored
Inherit the existing XLA_FLAGS Signed-off-by:Reese Wang <rewang@nvidia.com>
-
- 13 Oct, 2023 1 commit
-
-
Tim Moon authored
Signed-off-by:Tim Moon <tmoon@nvidia.com>
-
- 11 Oct, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 25 Sep, 2023 1 commit
-
-
cyanguwa authored
* add flexible layout support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add support for flexible qkv layout Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add more changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fixes for compiling Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove redudant file Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix options device error Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix typos Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * more changes; WIP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * more changes; WIP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fixes and tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fixes and wrong results Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * sb3hd/bs3hd working on top of 3xsbhd/bshd/thd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix dQ, dK, dV Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add nvtx Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove qkvso_strides on torch side; cover it in generateQKVStrides Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * all 15 layouts pass Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add workspace optimization Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes and test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * removed most debug info/clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add note to deprecate some qkv layouts Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix code for unit tests in test_fused_attn.py Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * further remove debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove a couple more comments Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix numerics tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fixes for lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fp8 tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix onnx for core attn; not fixed Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove nvtx and add env var for workspace opt Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove testing for env var Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace zeros/zeros_like with empty/empty_like Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix nvtx marker name for _q_k_v API Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove sm80 when compiling for h100 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add mapping from qkv layout to layout group and qkv format Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up enums mapping and remove trailing spaces Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * simplify workspace opt control logic; only need env var Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fp8 test, and minor modifications for other tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * avoid overwriting model configs in unit test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * random fixes/improvements: get_qkv_format/etc, default values, docstrings, comments Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix minor issues: invalid syntax Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * change workspace opt logic back to FORCE_WORKSPACE_OPT Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix FP8 tests and generateStrides function Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix get_backend logic for max512/arbitrary Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix unit tests; need cleanup Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up unit tests for layouts, and fix minor lint issue Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor tweaks for CI testing: onnx string issue and test fused attn first Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove one unsupported layout from max512 and add a check to qkvpacked API Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix te layer test; reduce test time Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert compiler option changes; add back sm80 for even h100 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove some unit tests or make them optional to reduce CI time Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove more unit tests temporarily Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove _q_k_v in naming and add NVTE_ERROR for FP8 Aux_CTX_Tensors size checks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add more deprecation notes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove temp tests from last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace with te::getenv Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove prints from last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove redundant contiguous() Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove thd->bs3hd user warning to avoid GPU sync Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * adjust fused attn bs in tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * temporary fix for onnx issue; more fixes in PR 437 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove unused variables Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by: Charlene Yang Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 07 Sep, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
* Initial setup Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix testfile Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix commit Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Test script Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add logs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add perf summary Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Reviews and improvements Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Generalize GPU count Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * add plots Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Better plot Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * get default file name with time Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 16 Aug, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
* Initial refactor Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Reorder methods by purpose Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Save full global state Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * More fixes to test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 29 Jul, 2023 1 commit
-
-
cyanguwa authored
* add support for multi-query/grouped-query attention Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert to flash-attn 1.0.6 and build 2.0.0.post1 manually in CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add keyword name for DPA input Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fused attn tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix skipif for pytest Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Update transformer_engine/pytorch/attention.py Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update tests/pytorch/test_fused_attn.py Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix TP and SP case Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * add skipifs for pytest Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove higher limit for flash-attn version Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 19 Jul, 2023 1 commit
-
-
Tian Zheng authored
* Add Linear layer (FP16) Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> - Add BF16 training example - Add fp8_autocast (only supports non-fp8 for now) Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Remove FP8 stuff Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Simplify Linear layer forward Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Add LayerNorm layer (BF16) Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Add LayerNormLinear layer Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Store weights in BF16 Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Add LayerNormMLP layer Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Add BF16 MNIST example Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Remove in-place cast for compatibility with Paddle AMP mechanism Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * README correction Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Add Paddle op as a backend option Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Fix code format Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Fix dtype change between iterations Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Minor fixes Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Move forward function out of base layer Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Use Paddle nvtx bindings Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> --------- Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com>
-
- 15 Jul, 2023 1 commit
-
-
Tim Moon authored
* Disable TorchDynamo optimizations in PyTorch modules Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add test for Torch Dynamo Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add torch.dynamo test to qa Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Skip torch.compile test for <v2.0 Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 20 Jun, 2023 1 commit
-
-
zlsh80826 authored
* Enable fused attention dropout Signed-off-by:
Reese Wang <rewang@nvidia.com> * Cast the uint32 key/counter to int64 Signed-off-by:
Reese Wang <rewang@nvidia.com> * Update dropout support in fused attention docs Signed-off-by:
Reese Wang <rewang@nvidia.com> * Revise devPtrCuSeqlen* to align the naming Signed-off-by:
Reese Wang <rewang@nvidia.com> * Support different Jax PRNG impls Signed-off-by:
Reese Wang <rewang@nvidia.com> * Revert CastAsync since it is not used Signed-off-by:
Reese Wang <rewang@nvidia.com> * Implement is_training for 16-bit fused attn Signed-off-by:
Reese Wang <rewang@nvidia.com> * Add fused attn with dropout sanity unit tests Signed-off-by:
Reese Wang <rewang@nvidia.com> * Enhance the comments readability and rng_state checker Signed-off-by:
Reese Wang <rewang@nvidia.com> * Change the attention dropout shape to align other frameworks Signed-off-by:
Reese Wang <rewang@nvidia.com> * Make encoder tests deterministic Signed-off-by:
Reese Wang <rewang@nvidia.com> * Change the default seed for the jax encoder tests Signed-off-by:
Reese Wang <rewang@nvidia.com> * Maintain offset in TE Signed-off-by:
Reese Wang <rewang@nvidia.com> * Enhance the resource safety Signed-off-by:
Reese Wang <rewang@nvidia.com> * Revert rng_state type to allow only i64 Signed-off-by:
Reese Wang <rewang@nvidia.com> * Handle the corner case for elts_per_threads calculation Signed-off-by:
Reese Wang <rewang@nvidia.com> * Populate rng state by kernels Signed-off-by:
Reese Wang <rewang@nvidia.com> * Rename rng_state as seed in cpp_extensions Signed-off-by:
Reese Wang <rewang@nvidia.com> * Update the attention dropout comment Signed-off-by:
Reese Wang <rewang@nvidia.com> --------- Signed-off-by:
Reese Wang <rewang@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 07 Jun, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
* Use torch.compile for version 2.0 and higher Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Address review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Remove unused import Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * use torch.__version__ Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Use NVFuser for dropout fusions Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix onnx tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 06 Jun, 2023 1 commit
-
-
Tian Zheng authored
* First step of PaddlePaddle integration - Add build option for paddle - Add basic test framework - Add 3 basic operators: cast_from_fp8, cast_to_fp8, gemm Signed-off-by:
Tian Zheng <tizheng@nvidia.com> Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Fix review comments Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Support paddle build Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Add paddle build support for new building framework Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Fix review comments Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Clean up build process for Paddle stub file Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Minor fixes Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Fix pylint "wrong-import-order" warning Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Fix review comments Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> * Skip BF16 GEMM tests for unsupported arch Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> --------- Signed-off-by:
Tian Zheng <tizheng@nvidia.com> Signed-off-by:
Tian Zheng (Engrg-Hardware 1) <tizheng@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 09 May, 2023 1 commit
-
-
Jeng Bai-Cheng authored
* add mp example Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * refactor Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * update doc-string Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * better FP8 checker Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * update readme Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * replace te.* with te.flax* to remove deprecated warning Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * remove nouse os.environ Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * remove nouse Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix typo Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/test_multiprocessing_encoder.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * remove cuda-python Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * adjust readme Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * Update examples/jax/encoder/README.md Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix cpp lint fix issue of "Could not find a newline character at the end of the file." Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix AssertionError: 1 GPU per process Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * replace tfds with datasets The Flax application crash if it use TensorFlow Dataset (tfds) in NVIDIA JAX container. The tfds is very useful for downloading well-knwon dataset (e.g., MNIST, GLUE) and commonly used by TF/JAX community. However, it seems like that it is NOT compatible with NVIDIA TensorFlow in NVIDIA JAX container and somehow affects JAX. It triggers random errors at JAX initialization depending on different versions, and make CI unstable. Thus, this commit replaces tfds with "huggingface datasets" to download needed datasets. See "nvbugs 4039266" for more details. Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix input sharding Unlike SPMD mode, in multiprocessing mode, the input tensor must be sharded manually. Using DP=4, TP=2 as an example, the device mesh looks like: mesh.device_ids = [[0, 1], [2, 3], [4, 5], [6, 7]] Assume that the process ID is mapped to GPU ID. The process 0 and process 1 are grouped for model parallelism, process 2 and process 3 are grouped together too, and so on. The process 0 and process 1 need to share the same micro-batch in the training step, process 0 and process 2, 4, and 6 have different micro-batch. Thus, `shard_array_wrapper` partitions inputs to 4 parts (and setup needed arguments for jax.make_array_from_single_device_arrays). The process 0 and process 1 take the first quarter, process 2 and process 3 take the second quarter, and so on. Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * refactor UT for multiprocess example Use Python `multiprocessing` to test the multiprocessing example, if the system has multiple GPU. 1 GPU per process. Because `jax.distributed.initialize` must be called before any other JAX or Flax API, GPU info cannot be queried by calling jax.local_devices() in TestEncoder. Thus, `unittest_query_gpu()` forks another process to query number of GPUs and FP8 capability. Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * remove nouse arg `--num-gpu` JAX doesn't have an API to setup number of GPU used in SPMD mode. The only way is to use `CUDA_VISIBLE_DEVICES` for now. Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix typo Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix ut Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * simplify the mask setting Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * increase batch-size for multigpu example The batch-size 64 is too small to be partitioned for 8xH100. If batch-size is 64, the GEMM shape is 256x8192x8 per GPU. The 8 is too small for FP8 GEMM kernel, and cuBLASLt will throw "Failed to query heuristics". Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> * fix downloading mnist error To download MNIST via `huggingface datasets`, it requires Pillow. Otherwise, it throws `An error occurred while generating the dataset` Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> --------- Signed-off-by:
Ryan Jeng <rjeng@nvidia.com> Signed-off-by:
Jeng Bai-Cheng <jeng1220@users.noreply.github.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 21 Apr, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
* Initial refactor; linker error Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix linking issue and make mpi conditional Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix TF/JAX build Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Use max SMs at the last RS chunk in pipelined overlap Co-authored-by:
Sangkug Lym <slym@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Make userbuffers support opt-in Decouple userbuffers from MPI. Refactor MPI handling in build system. Standardize names to "userbuffers". Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Lint Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Sangkug Lym <slym@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-