- 11 Dec, 2025 1 commit
-
-
Kshitij Lakhani authored
* Unset NVTE_FUSED_RING_ATTENTION_USE_SCAN by default Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add TODO Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Change the warning check in P2P helper to warn against using scan loop Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 10 Dec, 2025 1 commit
-
-
jberchtold-nvidia authored
* Make softmax_type in FFI optional Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add warn message Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 09 Dec, 2025 1 commit
-
-
Teddy Do authored
* branch off of initial permutation jax-triton PR Signed-off-by:
tdophung <tdophung@nvidia.com> * Set 0 as the size of dummy tensors to reduce memory usage. Signed-off-by:
tdophung <tdophung@nvidia.com> * Correct setting of permuted_probs_stride_token, unpermuted_probs_stride_token and unpermuted_probs_stride_expert in unpermutation Signed-off-by:
tdophung <tdophung@nvidia.com> * Implement primitives, wrapper, test for wrapper, edit trit on binding to accomodate scalars Signed-off-by:
tdophung <tdophung@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Change implemementation of VJP functions to match correct pattern. Deduce some static scalar args from shapes of inputs. Accept B, S instead of num_tokens. Change test to use value_and_grad to test vjp funcs properly Signed-off-by:
tdophung <tdophung@nvidia.com> * formatting Signed-off-by:
tdophung <tdophung@nvidia.com> * fix pylint Signed-off-by:
tdophung <tdophung@nvidia.com> * fix test to compare to the correct reference impl. relax 1 tol for grad compare, fix lint the rightway Signed-off-by:
tdophung <tdophung@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix test_permutation to use value_and_grad for reference impl, tighten tols, and add unpermute with probs for token combine bwd rule Signed-off-by:
tdophung <tdophung@nvidia.com> * added forgotten file in prev commit Signed-off-by:
tdophung <tdophung@nvidia.com> * format Signed-off-by:
tdophung <tdophung@nvidia.com> * merge with_probs to without_probs Signed-off-by:
tdophung <tdophung@nvidia.com> * add aserts and fix lint Signed-off-by:
tdophung <tdophung@nvidia.com> --------- Signed-off-by:
tdophung <tdophung@nvidia.com> Co-authored-by:
Ming Huang <mingh@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 06 Dec, 2025 1 commit
-
-
Kshitij Lakhani authored
* Add generic stripe_height support for load balancing Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix imports in test for deprecated jax.experimental.pjit Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add test case for stripe_height greater than 1. Add stripe_height arg to reordering methods Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Add Striped 1 and 4 test cases. Refactor the Load Balancing test case. Fix the incorrect shape in striping inverser reordering Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Modify test code for CP + AG + THD + stripe height greater than 1 Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Add stripe_height arg to fused attn and fused attn fwd API. Add appropriate mask checks for AG+THD+CP and pick BRCM to be executed per rank. Add Fused Attn Primitive for CP + THD +AG + Striping. Add a method to reorder and all gather segment ids and offsets for kv Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * TMP: Throwaway testing commit Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Add comments in primitive registration process Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * TMP: Throwaway test commit Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Undoing incorrect rebase/merge leftovers Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * TMP: Throwaway test commits Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Add support for calculating q and kv seqlens and offsets per rank for CP+THD+AG+SW+Striped>1 primitive Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Augment jax primitive register code comments Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Fix the array sizes and padding values returned for seqlens and offsets to fit what the fused attn primitive non cp computation Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add support in new primitive for softmax_offset related changes. Put in missing primitive registering line in again. Increase the seqoffsets arrays lengths by 1 Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Add new set of helper functions for seqlens and seqoffsets fo AG+THD+CP+Stripe>1 which accounts for batching and seq offsets size b+1 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add backward primitive for CP+THD+AG+Striped>1 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Modify tests for backward primitive for CP+THD+AG+Striped>1 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Move stripe_height along with other static args in fused_attn_bwd rule. Fix typo in CP+AG+TH+Striped>1 primitive Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Code clean up: remove older version for calculating seqlens and offsets for CP+AG+THD+striped>1 primitive Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add test for CP+THD+AG+Striped>1 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix missing var Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add SWA tests for AG+Striped>1+CP+THD+SWA Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Restoring test code Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Remove assert preventing SWA code path in CP+AG+Striped primitive Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Parametrize num_segments_per_seq in tests Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Clean up test code Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Clean up test code in TE common Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Clean up debug statements Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Rename stripe_height to stripe_size Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Code clean up and add additional comments Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> nit: Apply suggestions from code review Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com> Fix type on fused attn tests Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Fix seqoffsets length to be passed onto FusedAttn primitive as it is b and not b+1 needed by cuDNN Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Remove commented code Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com> Fix linting issues Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Fix incorrect greptile change Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Skip THD test cases for CP + AG + Dual chunk. Skip BSHD cases for CP + AG + Striped>1. Correct the layout and shapr parameters passed to the tests Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Pass stripe_size explicitly for ring attn tests for THD cases Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Remove TODO Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Explicitly fail if THD + AG is being used with a non padding causal mask Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * nit: Correct the ID for the test dist fused attn tests to account for cp*2 which is done under the hood Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Set num_segments_per_seq defaults to None instead of 0 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Augment comments. Add ValueError for stripe_size=0 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Test only 1 num_segments_per_seq combination for CP+AG+THD+Striped>1+SWA instead of 2. Modify the num segments and window size to easily to debug values Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Default stripe_size to None instead of 0. Modify stripe_size check for <=0 instead of ==0 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove incorrectly added file Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Explicitly pass zero sized arrays for seg ids and pos in the CP + AG + Striped primitive rather than using the seqlens or the offsets as placeholders Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix linting errors Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add a deep dive doc for CP+THD+AG+Stripe>1+SWA regarding design considerations and decisions Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Put docs and pngs into it's separate dir Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Replace png screenshots with markdown coe blocks for the attention patterns. Remove unecessary pngs Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Add doc file to index.rst. Fix grammatical errors Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Co-authored-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> Co-authored-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 02 Dec, 2025 1 commit
-
-
Phuong Nguyen authored
* init triton binding with test case/example * added Triton as TE-JAX test dependency * grid with blocksize from autotune Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 26 Nov, 2025 1 commit
-
-
Paweł Gadziński authored
* init Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * lines lenght Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * subtitle --- fix in many files: Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * cross entropy _input -> input rename Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * cross entropy _input -> input rename Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * a lot of small fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * torch_version() change Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add missing module and fix warnings Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * removed training whitespace: Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Update docs/api/pytorch.rst Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> * Fix import Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix more imports Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix NumPy docstring parameter spacing and indentation - Standardize parameter documentation to use 'param : type' format (space before and after colon) per NumPy style guide - Fix inconsistent indentation in cpu_offload.py docstring - Modified 51 Python files across transformer_engine/pytorch Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 25 Nov, 2025 1 commit
-
-
Phuong Nguyen authored
* allow dp + fsdp and fixed sr_rng_state partitioning Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * cleanup for lint test Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 21 Nov, 2025 2 commits
-
-
Kshitij Lakhani authored
* Remove unnecessary SWA calculation from _segment_ids_pos_to_seqlens_offsets Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Kshitij Lakhani authored
* Make BSHD default for Unfused DPA, DPA and MHA in TE JAX Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Remove explicit transpose_batch set for BSHD for DPA in JAX quickstart Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Add warnings in DPA and MHA to warn users of change defaults to BSHD instead of SBHD Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Minimize the scope of when to trigger warnings for changed defaults for transpose_batch_sequence Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 18 Nov, 2025 1 commit
-
-
Paweł Gadziński authored
* fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * removed packed versions Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * jax Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix: Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * sofmtax_fusion -> softmax_fusion_type Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 14 Nov, 2025 3 commits
-
-
jberchtold-nvidia authored
* Use TE quant if TE fused act is disabled Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Keep existing precision Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
jberchtold-nvidia authored
* Refactor to avoid storing a global quantization config so direct recipe passing works as intended Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix use_split_accumulator for current scaling recipe Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix tests that pass direct recipe and were missing quantize meta set Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Revert "fix use_split_accumulator for current scaling recipe" This reverts commit a74ab7df812ec0a069b1bdd208debb93ec25a900. Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix ci failures Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix amax_history post_init Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> * Update transformer_engine/jax/quantize/quantizer.py Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix ci failures Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix ci issue Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * address comments Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * make recipe assertion classes in test_recipe_characteristics not inherit from unittest.TestCase Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Signed-off-by:
jberchtold-nvidia <158520091+jberchtold-nvidia@users.noreply.github.com> Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Paweł Gadziński authored
* fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * add notes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * small fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 13 Nov, 2025 3 commits
-
-
jberchtold-nvidia authored
* Support for checkpointing quantizations Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Add jaxpr test for quant checkpoint name Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Revert "Support for checkpointing quantizations" This reverts commit f7b784940369d0da2a77c57fa6ea744e883c5832. Signed-off-by:
JAX Toolbox <jax@nvidia.com> * Checkpoint quantizations Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * revert other files Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * move checkpointing to VJPs Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix ci failure Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Signed-off-by:
JAX Toolbox <jax@nvidia.com> Co-authored-by:
JAX Toolbox <jax@nvidia.com>
-
Phuong Nguyen authored
* shardy + quantize_layout rework Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * add assertion for NVFP4 in fused act and fused norm primitive Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * add assertions Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
Phuong Nguyen authored
* swizzle via nvte Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 10 Nov, 2025 1 commit
-
-
Teddy Do authored
* Changing default activations in MLP, TransformerLayer, dropout rate after FC1 to 0, and return_layernorm_output to False Signed-off-by:
tdophung <tdophung@nvidia.com> * Fixing the failing tests by hard coding arguments to the previous values instead of relying on newer default values Signed-off-by:
tdophung <tdophung@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
tdophung <tdophung@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 07 Nov, 2025 3 commits
-
-
Kshitij Lakhani authored
* Default to fused attention in JAX DPA Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Consolidate documentation for DPA in JAX Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com> * Correctly update the documentation for defaults in JAX DPA Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com> --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Signed-off-by:
Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com> Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
-
Kirthi Shankar Sivamani authored
* Fix cuDNN backend selection for more case. Add CG as a option as well Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix logic Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix cuDNN checks Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add more checks Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix cuddn version Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix error message Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add check for window size Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Michael Goldfarb authored
-
- 05 Nov, 2025 1 commit
-
-
Paweł Gadziński authored
* fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com>
-
- 31 Oct, 2025 1 commit
-
-
jberchtold-nvidia authored
* Fix mesh resource requirement when no mesh Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * do not require meshresource if all axes are manual axes Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * remove abstract_mesh is None check Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 30 Oct, 2025 1 commit
-
-
Kshitij Lakhani authored
* Fix: Skip determinism tests for bprop for all sm >=100 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add username to TODO Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Assert in fused attn bwd pass for sm100+ Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 28 Oct, 2025 1 commit
-
-
Phuong Nguyen authored
* jax norm + te quant Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 25 Oct, 2025 1 commit
-
-
Charlene Yang authored
* add max_score for fused/unfused F16 non-CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * calculate max per head instead of max over all heads Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fused attn max_score shape Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert FE to github Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FE to 1.15.0-rc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix merge Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * reduce ew kernels; fix causal masks; add more tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fix to tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove logic for flash-attn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: add CP support for p2p/a2a/all_gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor improvements of implementation/tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: add thd support Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add thd to UnfusedDPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * more fixes for lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update to FE 1.15 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove unneeded changes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable unfused for thd + pad_between_seqs Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable thd for unfused until bug is fixed Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix all_gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix all gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * rename max_score to max_logit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix all_gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix all_gather Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * disable fused attn + thd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 23 Oct, 2025 1 commit
-
-
jberchtold-nvidia authored
* Make SR rng state always 2D (num_devices, 4) Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix pure-jax impl Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix test shape Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 22 Oct, 2025 2 commits
-
-
jberchtold-nvidia authored
Defer cublas check on fp8 gemms until lowering Signed-off-by:Jeremy Berchtold <jberchtold@nvidia.com>
-
jberchtold-nvidia authored
* [JAX] Support recipe flags for disabling SR, RHT, and 2D quantization Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix issue with SR state being erased due to pytree handling of NVFP4Quantizer Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Add test for SR state preservation across VJP boundaries Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix sharding of SR rng state Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * update tolerances slightly now that SR is enabled Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Use hashlib for deterministic hashes across runs for SR Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * rename uses_rht on scaled tensors to has_applied_rht Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * add assert Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Move decision of whether to use RHT into helper.py and add dedicated RHT tests Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix use_rht attr usage Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * fix pure-jax rht usage criteria Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Adjust tolerances after rebase Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 18 Oct, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Support wheel build for cuda 13 Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes for cu13 runtime, format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add documentation Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Better error handling Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix jax sdist Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Modify function names Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 14 Oct, 2025 2 commits
-
-
Kirthi Shankar Sivamani authored
* Initial API change Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change all imports and api Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix typo Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix recipe tets Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix more tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix docs, tests, and make Jax change as well Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change internal uses of fp8_autocast Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Address nits Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * rename file Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * CG function, and small test fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Change instances of make_graphed_callables internally Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix distributed tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix test and add more docs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Cleanup test imports and minimize internal file imports Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Make is_bf16_available public Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Better docs and better api Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * format Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Apply suggestions from code review Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * fix nvfp4 test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
Kshitij Lakhani authored
* Add BRCM support when creating a test mask for fused attn Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add support for BRCM to correctly generate the mask needed for calculating the seqlens and offsets for THD Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Skip drop=0 and no_bias case for BRCM as cuDNN does not suport this Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Skip BRCM test cases where max_seqlen_q > max_seqlen_kv Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Refactor the segment id run length code for BRCM seqoffset and seqlens calculations Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix the drop inequality skip condition in fused attn Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * nit: Adjust the BRCM id name in the test to make it consistent Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix the brcm mask condition. Fix the condition for cross atnn type pattern to only apply for brcm Change the num segments per sequence to 3 instead of 2 Reduce one test pattern data size and make it such that it triggers brcm Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix lint errors Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix incorrectly changed dtype to numpy bool_ rather than native python bool Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Restore the numsegments to earlier value Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add example for THD BRCM Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 13 Oct, 2025 2 commits
-
-
jberchtold-nvidia authored
assertion check Signed-off-by:Jeremy Berchtold <jberchtold@nvidia.com>
-
jberchtold-nvidia authored
* Improve error message for cublas fp8 gemm with incorrect shape Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Removed unnecessary non-contracting size check Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * rename inner dim -> leading dim Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 09 Oct, 2025 2 commits
-
-
jberchtold-nvidia authored
Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
Kirthi Shankar Sivamani authored
* Update minimum python version to 3.10 and update CI Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 08 Oct, 2025 1 commit
-
-
Hua Huang authored
* Try async copy of grouped GEMM group_sizes data Signed-off-by:
Hua Huang <huah@nvidia.com> --------- Signed-off-by:
Hua Huang <huah@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 07 Oct, 2025 1 commit
-
-
Phuong Nguyen authored
* reuse amax for current scaling Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 06 Oct, 2025 1 commit
-
-
Phuong Nguyen authored
* not fuse bias for output all reduction case + unit tests Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * norm to reduce dgamma along tpsp as well Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * clean up tests Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fix test_distributed_layernorm byte counts Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * increase tols for jax_gemm Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 03 Oct, 2025 1 commit
-
-
vthumbe1503 authored
Signed-off-by:Varun Thumbe <vthumbe@nvidia.com> *Jax integration for clamped swiglu. This is the continuation of PR which added Clamped Swiglu(used in GPT OSS) support in TE along with Pytorch integration. This PR hooks up the clamped swiglu and dswiglu's nvte APIs to TE Jax.
-
- 02 Oct, 2025 1 commit
-
-
jberchtold-nvidia authored
Fix shard map issue Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-