- 17 May, 2024 1 commit
-
-
Charlene Yang authored
* fix inconsistency for attn mask; now True means participating in attn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix sliding window window_size for decoder+padding combination Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert paddle changes regarding mask Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert softmax to 1-mask;0-keep Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * enforce 1-mask out; 0-keep rule for jax masks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix jax lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert pytorch mask changes; some kept in tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert to jax fused attn on main Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * inverse mask logic for get_cu_seqlens/_and_indices in PyTorch implementation and mask generation in unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * temporarily disable update_weight_scale_inv Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * enforce window_size for decoder Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add docstring for mask definition 1-mask out;0-keep Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add aux_ctx_tensors to save_for_backward Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * tweak make_decoder_mask and make_mask in jax tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * skip dBias for shapes other than 1HSS; otherwise dq/dk/dv NaNs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * expand attn_biases from list to variables in save_for_backward Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix use of variable before assignment in jax dact_lu Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove window size definition for decoder Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add change notes in README for padding mask in PyTorch Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * tweak padding mask notes in README Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * expand list to tensors for save_for_backward Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
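
The mask convention named in this commit (1 or True masks a position out, 0 or False keeps it) can be illustrated with a minimal pure-Python sketch of how cumulative sequence lengths follow from a padding mask. The function name and the list-of-lists mask layout are hypothetical and chosen for illustration only; this is not the library's get_cu_seqlens implementation.

```python
from itertools import accumulate


def cu_seqlens_from_padding_mask(mask):
    """Return cumulative sequence lengths for a batch of padded sequences.

    `mask` is a list of per-sequence lists of bools (hypothetical layout).
    Convention assumed here: True = masked out (padding), False = keep.
    """
    # Count only the positions that are kept (False) in each sequence.
    seqlens = [sum(1 for masked in row if not masked) for row in mask]
    # Prefix sums with a leading 0, so sequence i spans [cu[i], cu[i + 1]).
    return [0] + list(accumulate(seqlens))


padding_mask = [
    [False, False, False, True],   # 3 real tokens, 1 padded
    [False, True, True, True],     # 1 real token, 3 padded
    [False, False, False, False],  # 4 real tokens, no padding
]
cu_seqlens = cu_seqlens_from_padding_mask(padding_mask)
```

Under the opposite convention (True means the position participates), the count would be taken over True entries instead, which is the kind of inversion the mask-logic items above describe.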
-
- 16 May, 2024 1 commit
-
-
Phuong Nguyen authored
* added squared relu in te-torch Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 13 May, 2024 1 commit
-
-
Kunlun Li authored
Signed-off-by:
kunlunl <kunlunl@nvidia.com> Co-authored-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
- 09 May, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
Bump FA version to 2.5.8 Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 02 May, 2024 1 commit
-
-
cyanguwa authored
* initialize tp_group for FP8 DPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix cuDNN version in unit tests for cuDNN v9 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add hook to ignore missing fused_attn._extra_states if training from old checkpoints Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove test and redundant implementation from last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove warning message and replace with docstring Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove tp_size/tp_group in FusedAttention; amax reduction is handled with fp8_group Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * move core_attention.fused_attention._extra_state to core_attention._extra_state Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * simplify post_state_dict_hooks between FU and DPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add temporary test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove previous attempts to move core_attention.fused_attention to core_attention; keep the test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove the test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable pylint self arg for hook which is required by hook Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
- 01 May, 2024 1 commit
-
-
Jinze Xue authored
* Handle the scaling factor when amax is so tiny that it leads to an infinite scale Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * revert formatting changes Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * fix comments Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * Apply review suggestion Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> * Apply review suggestion Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> * Apply review suggestion Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> * apply review suggestion Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * add test_recipe.py to qa/L0_pytorch_unittest/test.sh; fix unittest for is_first_microbatch=False Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * revert changes to update_weight_scale_inv Signed-off-by:
Jinze Xue <jinzex@nvidia.com> * Debug test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Jinze Xue <jinzex@nvidia.com> Signed-off-by:
Jinze Xue <155670984+jinzex@users.noreply.github.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Jinze Xue <jinzex@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
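
The failure mode this commit addresses can be sketched in plain Python: when amax is extremely small, dividing the format's maximum by it overflows to infinity. The sketch below is a scalar illustration only. The function name, the formula with a power-of-two margin, the choice to keep the previous scale for zero or non-finite amax, and the cap value are all assumptions, not the library's actual code (which operates on tensors).

```python
import math
import sys

FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def update_scale(amax, prev_scale, margin=0, fp8_max=FP8_E4M3_MAX,
                 scale_cap=sys.float_info.max):
    """Hypothetical scalar sketch of a guarded scaling-factor update."""
    # Assumption: a zero or non-finite amax carries no usable information,
    # so the previous scale is kept unchanged.
    if not math.isfinite(amax) or amax <= 0.0:
        return prev_scale
    # Assumed formula: map amax to the top of the FP8 range, minus a margin.
    scale = (fp8_max / amax) / (2.0 ** margin)
    # A tiny amax makes the division overflow to inf; clamp it to the
    # largest finite value instead of propagating an infinite scale.
    if math.isinf(scale):
        return scale_cap
    return scale


normal = update_scale(2.0, prev_scale=1.0)       # ordinary case
clamped = update_scale(1e-320, prev_scale=1.0)   # would overflow without the guard
kept = update_scale(0.0, prev_scale=7.0)         # previous scale is reused
```

Clamping keeps later multiplications by the scale finite; whether to clamp or to fall back to the previous scale in the overflow case is a design choice this sketch does not settle.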
-
- 30 Apr, 2024 2 commits
-
-
vasunvidia authored
Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Tim Moon authored
* Fix linter warnings from unused args Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update .gitignore Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 29 Apr, 2024 2 commits
-
-
cyanguwa authored
remove tp_size/tp_group as amax reduction is handled by fp8_group() Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Zhenhuan Liu authored
* Add support for MoE with FP8. Signed-off-by:
Dennis Liu <denliu@nvidia.com> * Fix unittest. Signed-off-by:
Dennis Liu <denliu@nvidia.com> * Fix error in linear backward. Signed-off-by:
Dennis Liu <denliu@nvidia.com> --------- Signed-off-by:
Dennis Liu <denliu@nvidia.com> Co-authored-by:
Przemyslaw Tredak <ptredak@nvidia.com>
-
- 26 Apr, 2024 1 commit
-
-
Xiaowei Ren authored
* make FusedAttn with CP support bias Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * assert Alibi cannot work with CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * syntax fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix variable name Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix tensor shapes Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * a typo fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix bias indexing for CP Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * bug fix Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * add attn bias tests Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change dbias update location Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix CP test model configs Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change CP test sequence length Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make AttnFuncWithCP support qkv format of sbhd Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * make sure qkv are contiguous for CP in cuDNN fused attn Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * change assert message Signed-off-by:
Xiaowei Ren <xren@nvidia.com> * fix code format Signed-off-by:
Xiaowei Ren <xren@nvidia.com> --------- Signed-off-by:
Xiaowei Ren <xren@nvidia.com> Co-authored-by:
cyanguwa <8636796+cyanguwa@users.noreply.github.com>
-
- 24 Apr, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
* Try using global buffer for cu_seqlens Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Avoid using functools.lru_cache Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com>
-
- 22 Apr, 2024 1 commit
-
-
Tim Moon authored
* Remove unnecessary Pylint overrides Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fixes to lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 19 Apr, 2024 1 commit
-
-
Tim Moon authored
* Support noop concat without providing full tensor. Stop storing fused buffers in linear modules. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug noop cat func Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Construct TE modules in tests with correct dtypes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add tolerances to numerical tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use plain PyTorch concat when exporting to ONNX Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 18 Apr, 2024 1 commit
-
-
Alp Dener authored
fix type checking in checkpointing to assume that there must be TE modules in custom callables Signed-off-by: Alp Dener <adener@nvidia.com>
-
- 17 Apr, 2024 2 commits
-
-
Pavel Shamis (Pasha) authored
[UB] Adding configurable timeout for userbuffer and improving error reporting for potential hangs (#757) * Improving error reporting and hang detection logic * Adding verbose error reporting in case of UB hang * Adding CE hang detector * Replacing hard-coded timeout with configurable one Signed-off-by:
Pasha (Pavel) Shamis <pasharesearch@gmail.com> * Cleaning up warnings in the code Signed-off-by:
Pasha (Pavel) Shamis <pasharesearch@gmail.com> * Removing unused code Signed-off-by:
Pasha (Pavel) Shamis <pasharesearch@gmail.com> * Fixing styling issues reported on github Signed-off-by:
Pasha (Pavel) Shamis <pasharesearch@gmail.com> * Addressing lint new line and casting warnings Signed-off-by:
Pasha (Pavel) Shamis <pasharesearch@gmail.com> * Addressing lint warning about the usage of `unsigned long long` Signed-off-by:
Pasha (Pavel) Shamis <pasharesearch@gmail.com> * Removing unused case causing build issues on multi-arch setup Signed-off-by:
Pasha (Pavel) Shamis <pasharesearch@gmail.com> * Post GRDCOPY removal cleanup * Remove cmake check * Remove unused includes Signed-off-by:
Pasha (Pavel) Shamis <pasharesearch@gmail.com> --------- Signed-off-by:
Pasha (Pavel) Shamis <pasharesearch@gmail.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
* fixes; docs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Check for FP8 Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix LoRA-like use cases Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Reviews Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 16 Apr, 2024 3 commits
-
-
Alp Dener authored
* changed TE checkpoint passthrough logic to also recursively look for TE submodules Signed-off-by:
Alp Dener <adener@nvidia.com> * simplified search for TE modules in the checkpointed network Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com>
-
Kirthi Shankar Sivamani authored
Use torch function as a class method Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
cyanguwa authored
* WIP: fp8 v1 fprop integration Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add more debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fprop working for h1; w/ debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: add bprop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * cleanup; bprop running but has mismatches Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add gitlab frontend as submodule Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up and add back v0.9.2 FE support; fprop/bprop passing with 5e-2 tols Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix after merge; add bias_b/h to caching descriptor Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * distinguish fwd/bwd tensor types for bprop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix for F16 cases; include added dqkv_type and d_scale_dp Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * adjust out shape for bwd in test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add casting from/to FP8 to DPA module Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: bshd_bshd_bshd layout Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: support all sbhd/bshd layouts Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add qkvpacked and kvpacked support in both FusedAttnFunc and C levels Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove qkvpacked/kvpacked calls in DPA module (used for testing) Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove tp setup; add allow_non_contiguous; update FE; revert to sbh3d in tests; clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add NVTE_FP8_DPA_BWD to control whether to use FP8 bwd or F16 bwd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix MQA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix MQA/GQA in FP8 v1 API Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update FE to 705d8e3, with API change Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * test causal mask Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * restrict mha_fill for THD format Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fused attn with CP and comment out is_alibi code Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up FE0.9 vs FE1.0 FP8 implementations, and related unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * change NVTE_FP8_DPA_BWD default to 1, and fix its use in qkvpacked/kvpacked APIs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint and self.tp_size/group in FusedAttention() Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update FE to 6902c94 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add FP8 MHA support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update to FE v1.3.0 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes for FP8 MHA with different configs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * emit stats regardless of is_training Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix linear when input is not Float8Tensor Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix d_out type when f16 bprop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix user buffer for layernorm_linear/linear and revert two FP8 casts in MHA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add docstring for fp8_dpa/mha in recipe Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix backend selection to avoid FA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace transpose with transpose_2d Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * use RMSE for FP8 unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace two more transpose with transpose_2d Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add FP8 initialization to FusedAttention Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rm docs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Revert "add FP8 initialization to FusedAttention" This reverts commit 15fffd825d6f23f31ea709b16ba01dfd61efabf8. Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change order of ctxs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add back docs and mark as beta Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes for tests and docs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 15 Apr, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
Don't use autograd hook for bwd reduction Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 12 Apr, 2024 2 commits
-
-
Sangkug Lym authored
* Add LN margin to inference Signed-off-by:
Sangkug Lym <slym@nvidia.com> * cleanup Signed-off-by:
Sangkug Lym <slym@nvidia.com> * Fix symbolic func registration Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix grads Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Sangkug Lym <slym@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
* FP8 cuda graphs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Charlene Yang <charleney@nvidia.com> * Fix numerics Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * exclude torch compile from numerics tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * More numerics fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix CI Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * rm fusion from unfused path Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Charlene Yang <charleney@nvidia.com>
-
- 06 Apr, 2024 2 commits
-
-
Sangkug Lym authored
fix the default userbuffer communicator init settings Signed-off-by: Sangkug Lym <slym@nvidia.com>
-
Jaemin Choi authored
* Enable DGRAD RS overlap Signed-off-by:
Jaemin Choi <jaeminc@nvidia.com> * fix lint; apply suggestions Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Jaemin Choi <jaeminc@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 04 Apr, 2024 3 commits
-
-
Sangkug Lym authored
* userbuffer fp8 reduction support for individual overlap Signed-off-by:
Sangkug Lym <slym@nvidia.com> * cleanup dict ub_cfg dict value load Signed-off-by:
Sangkug Lym <slym@nvidia.com> * cleanup Signed-off-by:
Sangkug Lym <slym@nvidia.com> * Remove unnecessary fence from producer From @erhoo82 Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Sangkug Lym <slym@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
* Args can be None Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix other arg types Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Pavel Shamis (Pasha) authored
* Fixing potential integer overflow on sequence counter. Current implementation may potentially cause hangs or data corruption. Signed-off-by:
Pasha (Pavel) Shamis <pasharesearch@gmail.com> * Fixing typo in comments. Addressing reviewers' comments. Signed-off-by:
Pasha (Pavel) Shamis <pasharesearch@gmail.com> --------- Signed-off-by:
Pasha (Pavel) Shamis <pasharesearch@gmail.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 03 Apr, 2024 4 commits
-
-
Sangkug Lym authored
* Atomic gemm for TP-AR and TP-RS overlap with P2P exchanges Signed-off-by:
Sangkug Lym <slym@nvidia.com> * FP8 reduction for atomic TP-RS with p2p exchange Signed-off-by:
Sangkug Lym <slym@nvidia.com> * Fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Sangkug Lym <slym@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Sangkug Lym authored
* Do not store input activations when not computing weight gradients Signed-off-by:
Sangkug Lym <slym@nvidia.com> * fix userbuffer tp comm overlap case Signed-off-by:
Sangkug Lym <slym@nvidia.com> --------- Signed-off-by:
Sangkug Lym <slym@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
vasunvidia authored
Fix license, and sign off everything Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com>
-
Kirthi Shankar Sivamani authored
This reverts commit 965803c9.
-
- 29 Mar, 2024 2 commits
-
-
Kirthi Shankar Sivamani authored
* Fix backward compatibility with checkpoint API Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * review comments and fix lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Tim Moon authored
Perform FP8 cast on gathered layernorm output in LayerNormLinear Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
- 22 Mar, 2024 1 commit
-
-
Jaemin Choi authored
* Enable TP-AG overlap with return_layernorm_output Signed-off-by:
Jaemin Choi <jaeminc@nvidia.com> * Use ub_overlap_ag Signed-off-by:
Jaemin Choi <jaeminc@nvidia.com> --------- Signed-off-by:
Jaemin Choi <jaeminc@nvidia.com> Co-authored-by:
Jaemin Choi <jaeminc@nvidia.com>
-
- 21 Mar, 2024 2 commits
-
-
Sangkug Lym authored
* TP-RS overlap with send/recv Atomic GEMM based TP-RS overlap with send/recv Signed-off-by:
Sangkug Lym <slym@nvidia.com> Specify userbuffer overlap method of each overlap instance Signed-off-by:
Sangkug Lym <slym@nvidia.com> P2P TP-RS overlap with fp8 GEMM outputs Signed-off-by:
Sangkug Lym <slym@nvidia.com> Fix TP-RS overlap with send/recv Signed-off-by:
Sangkug Lym <slym@nvidia.com> * cleanup Signed-off-by:
Sangkug Lym <slym@nvidia.com> * cleanup Signed-off-by:
Sangkug Lym <slym@nvidia.com> * linting Signed-off-by:
Sangkug Lym <slym@nvidia.com> * fix typo Signed-off-by:
Sangkug Lym <slym@nvidia.com> --------- Signed-off-by:
Sangkug Lym <slym@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Kite0011 authored
[PyTorch] Update context parallel softmax lse correction func. Signed-off-by:
kitefang <kitefang@tencent.com> Co-authored-by:
kitefang <kitefang@tencent.com>
-
- 20 Mar, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 15 Mar, 2024 1 commit
-
-
Rachit Garg authored
* fix the perf regression caused by constantly polling device properties Signed-off-by:
Rachit Garg <rachitg@nvidia.com> * Fix lint error Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Rachit Garg <rachitg@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Rachit Garg <rachitg@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
- 13 Mar, 2024 1 commit
-
-
Rachit Garg authored
Add envvar for SM margin in GEMM Signed-off-by:
Rachit Garg <rachitg@nvidia.com> Co-authored-by:
Rachit Garg <rachitg@nvidia.com>
-