- 28 Apr, 2025 1 commit
-
-
Kshitij Lakhani authored
* Move MultiHeadAttention into its own file. Modify tests and files in t_e/pytorch to import from the new MHA module Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Resolving lost MHA changes from PR 1614 as a result of rebase Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Move context parallelism code into it's own file. Modify test and local imports of cp code accordingly Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Move softmax.py frm pytorch/ to pytorch/d_p_a Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Move Unfused and Fused attention to backends.py and some utils functions to pytorch/utils.py Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Resolving lost mark_activation_offload changes from PR 1678 as a result of rebase Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactor attention dir Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Refactor dir structure. Make relevant symbols public in __init__ for attention and d_p_a dirs Move FA package imports to backends.py Code cleanup Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Modify tests to import attention modules correctly Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Lint fixes Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Code clean up and fix typo Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Allowing InferenceParams and RoPE imports from attention module and pytorch module Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Allow InferenceParams and RoPE imports via transformer_engine.pytorch and transformer_engine.pytorch.attention modules Remove unnecessary checks for check_set_window_size in MHA and TL Reorder backends such that smaller classes at the start and larger ones at the end Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Reinstating changes from PR 1478 for rope.py lost during rebase conflict resolution Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix lint issues Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * nit: Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Make imports leaner Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 25 Apr, 2025 1 commit
-
-
Nicolas Castet authored
Fixes #1692 Signed-off-by:Nicolas Castet <26874160+nvcastet@users.noreply.github.com>
-
- 24 Apr, 2025 1 commit
-
-
jberchtold-nvidia authored
Introduce nvte_memset to provide a fill kernel that is faster than cudaMemsetAsync for small sizes (#1716) * nvte_memset fills single float value Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Support larger sizes than a single value and add tests Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 22 Apr, 2025 3 commits
-
-
Kirthi Shankar Sivamani authored
* Move radix sort to core Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix; change fused_attn to include C header Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Review comments Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix args Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Sudhakar Singh authored
* add support for `sb1d` freqs tensor in Fused RoPE Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add `start_positions` variable to `apply_rotary_pos_emb` function to make staggered rope application faster Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add pytorch path for `start_positions` and corresponding tests Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add tests for start_positions with thd Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes from feedback Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove start_positions from backward pass Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * from feedback Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make notes shorter Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> --------- Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
jberchtold-nvidia authored
* [JAX-Q] Single GPU current scaling for JAX Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix scale check dtype for MXFP8 scales affecting tests using assert_bitwise_scaled_tensors Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Address comments Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Remove cast to fp32 for norm primitives now that zero-centered gamma dtype issue is fixed Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix lint issue Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Remove unnecessary cast to fp32 Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 21 Apr, 2025 2 commits
-
-
jberchtold-nvidia authored
Check CuDNN version and apply unfused norm if below a version with the fix Signed-off-by:Jeremy Berchtold <jberchtold@nvidia.com>
-
Sudhakar Singh authored
* rtx5090 arch fix support Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * apprend `nvte` to the function name so that its visible in framework specific dirs Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fix typo Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add filter for nvte_is_supported_nontn_fp8_gemm Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * properly expose the api Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * feedback from PR Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * move the function to apt header/c files Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add more info Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> --------- Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 19 Apr, 2025 1 commit
-
-
Tim Moon authored
Revert "Allow NVTEShape to own data. (#1674)" This reverts commit e61ce77c . Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 18 Apr, 2025 4 commits
-
-
Kunlun Li authored
* Add fp8_primary_weights support for blockwise scaling Signed-off-by:
kunlunl <kunlunl@nvidia.com> custom fsdp Signed-off-by:
kunlunl <kunlunl@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Add view to blockwise fp8 tensor Signed-off-by:
kunlunl <kunlunl@nvidia.com> * Fix columnwise_shape in backward of view() Signed-off-by:
kunlunl <kunlunl@nvidia.com> * Add comments to the unit of start_offset Signed-off-by:
kunlunl <kunlunl@nvidia.com> * Add test for view and reshape for blockwise fp8 tensor Signed-off-by:
kunlunl <kunlunl@nvidia.com> * Add implementation for self._columnwise_scale_inv is not existed Signed-off-by:
kunlunl <kunlunl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Track down checks for _columnwise_data is None and adding checks for _columnwise_invalid Signed-off-by:
kunlunl <kunlunl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add assertion to check whether ._quantizer is None Signed-off-by:
kunlunl <kunlunl@nvidia.com> * rename partial_cast.cu -> fp8_block_scaling_partial_cast.cu Signed-off-by:
kunlunl <kunlunl@nvidia.com> * rename partial_cast kernel to fp8_block_scaling_partial_cast kernel Signed-off-by:
kunlunl <kunlunl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add shfl_sync in partial cast kernel Signed-off-by:
kunlunl <kunlunl@nvidia.com> * Remove columnwise_invalid flag Signed-off-by:
kunlunl <kunlunl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add comments about out-of-bounds write Signed-off-by:
kunlunl <kunlunl@nvidia.com> --------- Signed-off-by:
kunlunl <kunlunl@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Kirthi Shankar Sivamani authored
* Move jaxx cuda kernels to core Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Hongbin Liu authored
* split wgrad for GroupedLinear Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * support wgrad split for linear and ln_linear Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * add comments and fix WeightGradStore Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * support bias and fix unit tests Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * minor fix Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * support fuse_grad_accumulation=false Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add wgrad split for layernorm_mlp Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * minor fix Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix unittest Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add unittest for distributed interface apply Dener's suggestion Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fix Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * replace split_bw with delay_wgrad_compute Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/pytorch/module/layernorm_mlp.py Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update transformer_engine/pytorch/module/linear.py Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update transformer_engine/pytorch/module/layernorm_linear.py Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * remove comments Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> --------- Signed-off-by:
Hongbin Liu <hongbinl@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Hongbin Liu <hongbinl@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Phuong Nguyen authored
rm pax/praxis Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 17 Apr, 2025 5 commits
-
-
wdykas authored
* re merge request Signed-off-by:
Peter Dykas <wdykas@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add docstring Signed-off-by:
Peter Dykas <wdykas@nvidia.com> --------- Signed-off-by:
Peter Dykas <wdykas@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Xin Yao authored
* move swizzle scaling factor to cpp Signed-off-by:
Xin Yao <xiny@nvidia.com> * resolve comments Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
kwyss-nvidia authored
* Allow NVTEShape to own data. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Convert repeated copy paths to nvte_make_shape calls. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Apply suggestions from code review Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Build fixes. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * MR feedback. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> --------- Signed-off-by:
Keith Wyss <kwyss@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
jberchtold-nvidia authored
* Add a flag to support computing zero-centered gamma in weight dtype or compute dtype for CuDNN Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Address comments Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
Paweł Gadziński authored
* drop Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 16 Apr, 2025 2 commits
-
-
Paweł Gadziński authored
* add Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * weight workspace fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * docs fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * file i forgot Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * lint fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Update transformer_engine/debug/pytorch/utils.py Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> * setup fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * setup fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Update transformer_engine/pytorch/tensor/_internal/float8_tensor_base.py Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> * all tensor types Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * removed check Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * move error Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * _reset Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Update transformer_engine/pytorch/module/linear.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> * name documentation Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * added blockwise quantizer Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make debug option optional Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Update transformer_engine/pytorch/tensor/quantized_tensor.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> * names fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Kshitij Lakhani authored
* Add test cases for full coverage in jax/test_layer.py - causal and window size None - causal and window size default (-1,1) - no_mask and window size default (-1,1) - no_mask and window size default (2,2) - padding and window size None - padding_causal and window_size (2,2) Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Correct the condition where padding_causal_mask was being mapped to scaled upper triangle Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Fix Issue #1524 Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Add a runner and test cases for jax.flax.module.Softmax class for fwd pass only Segregate runner classes for Softmax module and softmax primitives Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Simplify logic when picking softmax primitives and softmax jax framework calls Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Simplify the logic for performing jax based softmax Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Code clean up Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add support table for mask, SWA and Softmax type. Code linting Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Explicit SWA conditons in comments. Fix Typo Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Resolve typo to remove None in SWA comments section Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 15 Apr, 2025 2 commits
-
-
Li Tao authored
* support adam bf16 state Signed-off-by:
XiaobingSuper <xiaobingzhangupc@gmail.com> * use fp32 kernel but keep bf16 optimizer states to save memory Signed-off-by:
lit <lit@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
XiaobingSuper <xiaobingzhangupc@gmail.com> Signed-off-by:
lit <lit@nvidia.com> Co-authored-by:
XiaobingSuper <xiaobingzhangupc@gmail.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Paweł Gadziński authored
* fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * added test Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * test change Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * changed the test Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 14 Apr, 2025 7 commits
-
-
Jianbin Chang authored
* Add fp8 weight transpose cache check in backward, and regenerated it if it does not exist Signed-off-by:
jianbinc <shjwudp@gmail.com> * Properly handle fsdp shard model weight input. Signed-off-by:
jianbinc <shjwudp@gmail.com> * move Float8Tensor to QuantizedTensor in cast_master_weights_to_fp8 UT Signed-off-by:
jianbinc <shjwudp@gmail.com> * handle Float8TensorBase issue Signed-off-by:
jianbinc <shjwudp@gmail.com> * fix bug in activation recompute Signed-off-by:
jianbinc <shjwudp@gmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
jianbinc <shjwudp@gmail.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Tim Moon authored
* Avoid unnecessary tensor usages when caching for linear op backward Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug test failure Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
Xin Yao authored
* Enable MXFP8 and Per-Tensor Current Scaling for Grouped Linear Signed-off-by:
Xin Yao <xiny@nvidia.com> * enable float8blockwise Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update Signed-off-by:
Xin Yao <xiny@nvidia.com> * remove grouped linear parallel mode test Signed-off-by:
Xin Yao <xiny@nvidia.com> * update test Signed-off-by:
Xin Yao <xiny@nvidia.com> * resolve comments Signed-off-by:
Xin Yao <xiny@nvidia.com> * internal=False for now Signed-off-by:
Xin Yao <xiny@nvidia.com> * remove unused import Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Johannes Reifferscheid authored
* Add experimental Shardy support. Production use is not yet recommended. --------- Signed-off-by:Johannes Reifferscheid <jreiffers@nvidia.com>
-
Hua Huang authored
* New GroupedGemmPrimitive using variadic args * Remove squeeze() to reduce D2D memcpy * Revert to the list append fashion to simplify code --------- Signed-off-by:
Hua Huang <huah@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
Autumn1998 authored
* add support for new recipe on permute_fusion, rm fp unpermute Signed-off-by:
tongliu <tongliu@nvidia.com> * fix lint Signed-off-by:
Xin Yao <xiny@nvidia.com> * remove fp8 from index map Signed-off-by:
Xin Yao <xiny@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * skip unsupported tests Signed-off-by:
Xin Yao <xiny@nvidia.com> --------- Signed-off-by:
tongliu <tongliu@nvidia.com> Signed-off-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
tongliu <tongliu@nvidia.com> Co-authored-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Selvaraj Anandaraj authored
* Added attention activation offloading support for TE v2.0 Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com> Co-authored-by:
Selvaraj Anandaraj <selvaraja@login-ptyche02.ptyche.clusters.nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
-
- 11 Apr, 2025 2 commits
-
-
Tim Moon authored
* Add option to cache activation input in FP8 Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid casting to FP8 transpose Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Skip input caching if device is not supported Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add documentation that FP8 input caching is experimental Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
kwyss-nvidia authored
Repeated calls to nvte_shape should not invalidate previous data pointers. It would be possible to avoid unnecessary comparisons by duplicating some of the logic from shape() so that the cache is only relevant when columnwise shapes are involved. Whether this code duplication is preferable to the comparisons that arise from by value semantics of reusing shape is a judgment call. Signed-off-by:Keith Wyss <kwyss@nvidia.com>
-
- 10 Apr, 2025 1 commit
-
-
kwyss-nvidia authored
* Add GEMM logic for blockwise quantized tensors. GEMM test cases included in pytorch integration. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update NVTE_BLOCK_SCALING for GEMM. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Gate feature on CUDA 12.9 Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Gemm typo. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Remove unecessary type converter change. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Reflect epilogue availability and test supported epilogues. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * GEMM simplifications from recipe branch. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Format py code. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update GEMM DGelu tests to match support depending on output dtype. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Force pow2Scales in GEMM Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Add GEMM test to pytorch test suite. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Add copyright to GEMM test. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update import for GEMM test. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Add license. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update test gemm supported predicate. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Use sgemm like interfaces and naming. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Rewrite GEMM comment. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * MR Feedback. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Recipe setup for Linear modules. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Use 12.9 feature test. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Run against tensor dumps from internal library. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update FIXME to TODO with linked issue. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update full recompute feature to save recipe. The recompute context uses the same recipe and fp8 settings as the original fwd pass. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * MR Feedback. Avoid reusing quantizer objects. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update logic in module. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Format py. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update for PP bug. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update test numerics. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update force_power_of_2 scales in the recipe. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update usage method to satisfy upstream changes. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * fix subchannel recipe in distributed test with bf16 gather Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Edit and cleanup BF16 gather code. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update test import. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * support columnwise only mode to 1D quantize kernel Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Format and move enum Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Skip alloc. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * try async bf16 gather Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Format python code. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Document and type code. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update pytorch lint errors. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Dont set high precision dtype. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Add test for sanity and CG; fix CG for sequential? Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Keep make_quantizers API stable Update num_quantizers instead to pass cuda_graph tests. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Fix import name. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Rename recipe method. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Skip grouped linear sanity test. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Set usage before BF16 gather. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * refactor for nvte_quantize_v2 Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Format code. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Cleanup nvte_quantize_v2 Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Test fp32 scales. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Disable CUDA graph. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Simplify layernorm linear Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Cleanup layernorm linear. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * LayerNorm linear bwd gather logic. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Communication updates. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update transformer_engine/pytorch/ops/op.py Apply MR comment change. Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
kwyss-nvidia <kwyss@nvidia.com> * Lint fix. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * MR feedback. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Enable cuda graph tests. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Reduce chance of spurious failure and reword. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Review suggestions from @timmoon10 Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update CPP tests. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update common.h Signed-off-by:
Xin Yao <yaox12@outlook.com> * Update test_float8blockwisetensor.py Signed-off-by:
Xin Yao <yaox12@outlook.com> --------- Signed-off-by:
Keith Wyss <kwyss@nvidia.com> Signed-off-by:
zhongboz <zhongboz@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
kwyss-nvidia <kwyss@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Xin Yao <yaox12@outlook.com> Co-authored-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Xin Yao <yaox12@outlook.com>
-
- 09 Apr, 2025 3 commits
-
-
Tim Moon authored
* Debug checkpointing with te.Sequential Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Tim Moon authored
Explicitly specify quantized tensor usages needed for linear op backward Signed-off-by:Tim Moon <tmoon@nvidia.com>
-
Phuong Nguyen authored
* scaling enum abstract * rm NVTE_ from ScalingMode names * rework scaling mode enum in grouped gemm * fix norm sharding --------- Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
- 08 Apr, 2025 2 commits
-
-
Tim Moon authored
* Minor stylistic tweaks and typo fixes Review suggestions from @ptrendx Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix incorrect col strides for MXFP8 matrices Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
vasunvidia authored
* Use dummy wgrads for lower memory consumption Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Bug fix to avoid sharing gradients. Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Disable automatic use of batch_p2p_comm for CP2 Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Change weight to origin_weight for LN_LINEAR Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 07 Apr, 2025 3 commits
-
-
kwyss-nvidia authored
* Add GEMM logic for blockwise quantized tensors. GEMM test cases included in pytorch integration. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update NVTE_BLOCK_SCALING for GEMM. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Gate feature on CUDA 12.9 Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Gemm typo. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Remove unecessary type converter change. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Reflect epilogue availability and test supported epilogues. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * GEMM simplifications from recipe branch. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Format py code. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update GEMM DGelu tests to match support depending on output dtype. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Force pow2Scales in GEMM Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Add GEMM test to pytorch test suite. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Add copyright to GEMM test. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update import for GEMM test. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Add license. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update test gemm supported predicate. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Use sgemm like interfaces and naming. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Rewrite GEMM comment. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * MR Feedback. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Refactor GEMM param canonicalization Configure A and B matrices separately. Have separate code path for each scaling mode. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Prune number of tests. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> --------- Signed-off-by:
Keith Wyss <kwyss@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Phuong Nguyen authored
* rm no scaling enum Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * update jax enum Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
Jianbin Chang authored
Support fp8 primary weight in fsdp training Signed-off-by:
jianbinc <shjwudp@gmail.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-