- 19 Dec, 2025 2 commits
-
-
jberchtold-nvidia authored
* Handle meshes set with jax.set_mesh Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Sudhakar Singh authored
* add early return back (removed in 2427) Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Make sure Float8Tensor.contiguous supports autograd. Expand quantized tensor tests to check identity ops. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
- 18 Dec, 2025 1 commit
-
-
LucienXian authored
* Fix meta device check failure when passing torch.device objects Signed-off-by:
LucienXian <fl.xian@foxmail.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
LucienXian <fl.xian@foxmail.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 17 Dec, 2025 2 commits
-
-
jberchtold-nvidia authored
* Tutorial for integrating te/jax quantization into an existing framework Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * add todos Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * support nvfp4 sr rng key, move wrapper module into TE itself, fix bfloat16 cast Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * update docstrings Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix QKV proj and out proj in Flax example transformer layer Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Use fused attention in quickstart_jax example Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remat policy Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * add tutorial to docs Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * update title Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * remove unused dtype from TE DPA module Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix notebook title Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Add explanation of flax module wrapper Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Jinhang Choi authored
reset weight ws cache for NVFP4TensorStorage Signed-off-by: Jinhang Choi <jinhangc@nvidia.com>
-
- 15 Dec, 2025 2 commits
-
-
kwyss-nvidia authored
* Check calling convention for amax switch. Wgrad gemms with colwise x colwise require rowwise data via general_gemm. Since dy has both for dgrad and wgrad, the brittleness has likely not affected results. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Clear rowwise data when applicable. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update test with columnwise cases. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Check enum value rather than implicit cast. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> --------- Signed-off-by:
Keith Wyss <kwyss@nvidia.com>
-
Yashaswi Karnati authored
* fix ce loss with ignore idx Signed-off-by:
ykarnati <ykarnati@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
ykarnati <ykarnati@nvidia.com> * remove fix comments Signed-off-by:
ykarnati <ykarnati@nvidia.com> * fallback divisor to 1 Signed-off-by:
ykarnati <ykarnati@nvidia.com> * have arg for n_rows and n_non_ignore Signed-off-by:
ykarnati <ykarnati@nvidia.com> * fuse n_non_ignore to softmax kernel Signed-off-by:
ykarnati <ykarnati@nvidia.com> * fix incorrect arg Signed-off-by:
ykarnati <ykarnati@nvidia.com> --------- Signed-off-by:
ykarnati <ykarnati@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 11 Dec, 2025 4 commits
-
-
Kshitij Lakhani authored
* Unset NVTE_FUSED_RING_ATTENTION_USE_SCAN by default Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add TODO Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Change the warning check in P2P helper to warn against using scan loop Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Robin Zhang authored
Convert sample tuple to list in reuse Signed-off-by:
Robin Zhang <robinz@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Robin Zhang authored
set_all_rng_states in set_states Signed-off-by:
Robin Zhang <robinz@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Evgeny Tsykunov authored
* Add separate RNG states for columnwise quantization with Stochastic Rounding Signed-off-by:
Evgeny <etsykunov@nvidia.com> * Fix single tensor path Signed-off-by:
Evgeny <etsykunov@nvidia.com> --------- Signed-off-by:
Evgeny <etsykunov@nvidia.com>
-
- 10 Dec, 2025 2 commits
-
-
Charlene Yang authored
* update FE; initial pass at thd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * produce Stats+Max instead of Max+Sum_Exp Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Revert "produce Stats+Max instead of Max+Sum_Exp" This reverts commit c7d2b77b2da9ff3f68344097284187ac427eeb6a. Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
jberchtold-nvidia authored
* Make softmax_type in FFI optional Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add warn message Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 09 Dec, 2025 4 commits
-
-
Teddy Do authored
* branch off of initial permutation jax-triton PR Signed-off-by:
tdophung <tdophung@nvidia.com> * Set 0 as the size of dummy tensors to reduce memory usage. Signed-off-by:
tdophung <tdophung@nvidia.com> * Correct setting of permuted_probs_stride_token, unpermuted_probs_stride_token and unpermuted_probs_stride_expert in unpermutation Signed-off-by:
tdophung <tdophung@nvidia.com> * Implement primitives, wrapper, test for wrapper, edit triton binding to accommodate scalars Signed-off-by:
tdophung <tdophung@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Change implementation of VJP functions to match correct pattern. Deduce some static scalar args from shapes of inputs. Accept B, S instead of num_tokens. Change test to use value_and_grad to test vjp funcs properly Signed-off-by:
tdophung <tdophung@nvidia.com> * formatting Signed-off-by:
tdophung <tdophung@nvidia.com> * fix pylint Signed-off-by:
tdophung <tdophung@nvidia.com> * fix test to compare to the correct reference impl. relax 1 tol for grad compare, fix lint the right way Signed-off-by:
tdophung <tdophung@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix test_permutation to use value_and_grad for reference impl, tighten tols, and add unpermute with probs for token combine bwd rule Signed-off-by:
tdophung <tdophung@nvidia.com> * added forgotten file in prev commit Signed-off-by:
tdophung <tdophung@nvidia.com> * format Signed-off-by:
tdophung <tdophung@nvidia.com> * merge with_probs to without_probs Signed-off-by:
tdophung <tdophung@nvidia.com> * add asserts and fix lint Signed-off-by:
tdophung <tdophung@nvidia.com> --------- Signed-off-by:
tdophung <tdophung@nvidia.com> Co-authored-by:
Ming Huang <mingh@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Przemyslaw Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
Teddy Do authored
change order Signed-off-by: tdophung <tdophung@nvidia.com>
-
Kirthi Shankar Sivamani authored
Fixes to runtime loading logic and add missing deps Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 08 Dec, 2025 1 commit
-
-
vthumbe1503 authored
* bug fixed, test added Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * fix contiguous Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * revert unnecessary change Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * revert another change Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address review comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * Update transformer_engine/pytorch/tensor/mxfp8_tensor.py Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> * address review comments Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * missed adding renamed file Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix minor issue Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * fix ci issue Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix the test for bfloat16 Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
-
- 06 Dec, 2025 1 commit
-
-
Kshitij Lakhani authored
* Add generic stripe_height support for load balancing Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix imports in test for deprecated jax.experimental.pjit Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add test case for stripe_height greater than 1. Add stripe_height arg to reordering methods Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Add Striped 1 and 4 test cases. Refactor the Load Balancing test case. Fix the incorrect shape in striping inverse reordering Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Modify test code for CP + AG + THD + stripe height greater than 1 Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Add stripe_height arg to fused attn and fused attn fwd API. Add appropriate mask checks for AG+THD+CP and pick BRCM to be executed per rank. Add Fused Attn Primitive for CP + THD +AG + Striping. Add a method to reorder and all gather segment ids and offsets for kv Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * TMP: Throwaway testing commit Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Add comments in primitive registration process Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * TMP: Throwaway test commit Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Undoing incorrect rebase/merge leftovers Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * TMP: Throwaway test commits Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Add support for calculating q and kv seqlens and offsets per rank for CP+THD+AG+SW+Striped>1 primitive Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Augment jax primitive register code comments Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> * Fix the array sizes and padding values returned for seqlens and offsets to fit what the fused attn primitive non cp computation Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add support in new primitive for softmax_offset related changes. Put in missing primitive registering line in again. Increase the seqoffsets arrays lengths by 1 Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Add new set of helper functions for seqlens and seqoffsets for AG+THD+CP+Stripe>1 which accounts for batching and seq offsets size b+1 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add backward primitive for CP+THD+AG+Striped>1 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Modify tests for backward primitive for CP+THD+AG+Striped>1 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Move stripe_height along with other static args in fused_attn_bwd rule. Fix typo in CP+AG+TH+Striped>1 primitive Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Code clean up: remove older version for calculating seqlens and offsets for CP+AG+THD+striped>1 primitive Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Add test for CP+THD+AG+Striped>1 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix missing var Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add SWA tests for AG+Striped>1+CP+THD+SWA Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Restoring test code Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Remove assert preventing SWA code path in CP+AG+Striped primitive Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Parametrize num_segments_per_seq in tests Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Clean up test code Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Clean up test code in TE common Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Clean up debug statements Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Rename stripe_height to stripe_size Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Code clean up and add additional comments Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> nit: Apply suggestions from code review Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com> Fix type on fused attn tests Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Fix seqoffsets length to be passed onto FusedAttn primitive as it is b and not b+1 needed by cuDNN Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Remove commented code Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Kshitij Lakhani <33047503+KshitijLakhani@users.noreply.github.com> Fix linting issues Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Fix incorrect greptile change Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Skip THD test cases for CP + AG + Dual chunk. Skip BSHD cases for CP + AG + Striped>1. Correct the layout and shape parameters passed to the tests Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Pass stripe_size explicitly for ring attn tests for THD cases Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Remove TODO Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * Explicitly fail if THD + AG is being used with a non padding causal mask Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * nit: Correct the ID for the test dist fused attn tests to account for cp*2 which is done under the hood Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Set num_segments_per_seq defaults to None instead of 0 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Augment comments. Add ValueError for stripe_size=0 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Test only 1 num_segments_per_seq combination for CP+AG+THD+Striped>1+SWA instead of 2. Modify the num segments and window size to easy-to-debug values Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Default stripe_size to None instead of 0. Modify stripe_size check for <=0 instead of ==0 Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove incorrectly added file Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Explicitly pass zero sized arrays for seg ids and pos in the CP + AG + Striped primitive rather than using the seqlens or the offsets as placeholders Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * Fix linting errors Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add a deep dive doc for CP+THD+AG+Stripe>1+SWA regarding design considerations and decisions Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Put docs and pngs into its separate dir Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Replace png screenshots with markdown code blocks for the attention patterns. Remove unnecessary pngs Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Add doc file to index.rst. Fix grammatical errors Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> Signed-off-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Co-authored-by:
Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com> Co-authored-by:
Kshitij Janardan Lakhani <klakhani@login-preos01.a51.clusters.nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 05 Dec, 2025 1 commit
-
-
Tim Moon authored
Remove assumption in quantize/activation kernels that data buffer is initialized Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
- 04 Dec, 2025 1 commit
-
-
Tim Moon authored
* Initialize empty tensors with shape=[0] instead of shape=[]. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix runtime crash in LayerNorm. Still seeing correctness issues. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make sure norm workspace sizes are not zero Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove assumption in swizzle kernel that data is available. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove assumption in multi-swizzle kernel that data is available. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Remove unnecessary explicit call to default constructor Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid accessing tensor data pointer if tensor has no entries Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Apply suggestions from code review Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/common/swizzle/swizzle.cu Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Review suggestions from @ptrendx and @greptile-apps Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Prefer using row-wise/col-wise shape based on which has data Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix merge conflict, expand docs, fix inconsistency in dim function Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Change Tensor::has_data to check whether tensor is initialized, not whether pointer is valid. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Review suggestion from @greptile-apps Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Debug incorrect tensor initialization in tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Clarify comments that has_data does not guarantee safe pointer accesses Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug test failure when computing amaxes Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
-
- 02 Dec, 2025 3 commits
-
-
Kunlun Li authored
* Add primary weights fp8 support for mxfp8 Signed-off-by:
kunlunl <kunlunl@nvidia.com> * Fix unit test and add better error log to unit test Signed-off-by:
kunlunl <kunlunl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Move post all-gather processing out of for loop Signed-off-by:
kunlunl <kunlunl@nvidia.com> * Add descriptions and ASCII diagrams for partial cast and partial amax functions Signed-off-by:
kunlunl <kunlunl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Minor fix based on greptile bot Signed-off-by:
kunlunl <kunlunl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix compilation errors due to arch-specific PTX instructions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove unused noop flag from C API Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Expose test_partial_cast Signed-off-by:
kunlunl <kunlunl@nvidia.com> * Skip mxfp8 partial cast test if mxfp8 is not available Signed-off-by:
kunlunl <kunlunl@nvidia.com> * Fix pytest error Signed-off-by:
kunlunl <kunlunl@nvidia.com> * pylint ignore unused manual_post_all_gather_processing Signed-off-by:
kunlunl <kunlunl@nvidia.com> * Fix error when using is_mxfp8_available Signed-off-by:
kunlunl <kunlunl@nvidia.com> --------- Signed-off-by:
kunlunl <kunlunl@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
Phuong Nguyen authored
* add grouped_tensor classes and helpers Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * rm non-contiguous option and dptrs Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address comments + rework CheckIn/OutputGroupedTensor Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix for compilation Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * make first_dims/last_dims optional + data.shape 2d Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * added assertion Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * rs conflicts Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * add data.shape info Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * added logical shape field Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * compilation fix Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fixed issues raised by greptile Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * return default dtype when grouped_tensor is empty Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * use has_data() for dim queries Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update comments Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fix index bound Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * Update transformer_engine/common/transformer_engine.cpp Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * Update transformer_engine/common/transformer_engine.cpp Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * restore Tensor.has_data() + add experimental marks Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * restore Tensor::has_columnwise_data Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * cleanup Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
-
Phuong Nguyen authored
* init triton binding with test case/example * added Triton as TE-JAX test dependency * grid with blocksize from autotune Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 26 Nov, 2025 2 commits
-
-
Paweł Gadziński authored
* init Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * lines length Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * subtitle --- fix in many files: Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * cross entropy _input -> input rename Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * cross entropy _input -> input rename Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * a lot of small fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * torch_version() change Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add missing module and fix warnings Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * removed trailing whitespace: Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * Update docs/api/pytorch.rst Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> * Fix import Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix more imports Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix NumPy docstring parameter spacing and indentation - Standardize parameter documentation to use 'param : type' format (space before and after colon) per NumPy style guide - Fix inconsistent indentation in cpu_offload.py docstring - Modified 51 Python files across transformer_engine/pytorch Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Signed-off-by:
Paweł Gadziński <62263673+pggPL@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Tim Moon authored
Do not initialize recipe state in base op class. Op attrs may not be set. Move recipe state initialization to linear op constructor. Signed-off-by: Tim Moon <tmoon@nvidia.com>
-
- 25 Nov, 2025 7 commits
-
-
Paweł Gadziński authored
* main Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * docs Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * add Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * test fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixes Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Pingtian Li authored
* fix backward_dw cuda graph order Signed-off-by:
Pingtian Li <pingtianl@nvidia.com> * add validation for num_layers_per_chunk Signed-off-by:
Pingtian Li <pingtianl@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Pingtian Li <pingtianl@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
vthumbe1503 authored
* fix ci issue Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * revert back testing changes Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * remove quantizer copy + fused adam working Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix test Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix mxfp8 bug, god knows who created it Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Update transformer_engine/pytorch/optimizers/fused_adam.py Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> * Update comment Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Varun Thumbe <vthumbe@nvidia.com> Signed-off-by:
vthumbe1503 <vthumbe@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
Zhongbo Zhu authored
* minor fix of torch view dtype Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * multi-tensor RHT amax, compiles Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * setup multi_tensor_quantize_nvfp4_impl Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * wire things up and run without crash Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * numerical test Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * unit test passing Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * finish unit test of split quantize api Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * bump up padding to 64 for nvfp4 grouped quantize Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * fix stochastic rounding Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * lint Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * change error message Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * clean up Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * enable multi-amax without RHT Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * fix col-only quantize mode Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * improve benchmark script Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * add NCU example script Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * add larger test case Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * add contiguous_data_and_scale check to bulk allocator Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * unified naming and differentiate between group_ and multi_ Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * move regular amax into multi_tensor.h Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> * Disentangle logic for split-quantize and general multi-tensor quantize Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use size_t for split sections Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Suggestions from @greptile-apps Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Zhongbo Zhu <zhongboz@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
vthumbe1503 authored
remove linear redundant check Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
-
Phuong Nguyen authored
* allow dp + fsdp and fixed sr_rng_state partitioning Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * cleanup for lint test Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Teddy Do authored
* Change order of arguments to make jax work Signed-off-by:
tdophung <tdophung@nvidia.com> * make num_experts a tl.constexpr again Signed-off-by:
tdophung <tdophung@nvidia.com> --------- Signed-off-by:
tdophung <tdophung@nvidia.com>
-
- 21 Nov, 2025 7 commits
-
-
Sudhakar Singh authored
* Add support for THD+CP+SWA through A2A comms Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * unblock the `padding`+`THD`+`CP(A2A)` with SWA case in A2A forward Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add proper support for thd Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * bug fix Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * enable thd+cp tests as essential Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add cp+thd+a2a test to essential Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fix comments from greptile Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add proper skip for flash attention Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fix the test to create separate tensors for flash and fused attention backend scenarios Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * remove redundant compare Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * simplify code Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * add note for cu_seqlens_kv and cu_seqlens_kv_padded Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update tests/pytorch/attention/test_attention_with_cp.py Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * Update transformer_engine/pytorch/attention/dot_product_attention/context_parallel.py Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com> Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fixo Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fix docs Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * fix the argument name Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
-
Kshitij Lakhani authored
* Remove unnecessary SWA calculation from _segment_ids_pos_to_seqlens_offsets Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Kshitij Lakhani authored
* Make BSHD default for Unfused DPA, DPA and MHA in TE JAX Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Remove explicit transpose_batch set for BSHD for DPA in JAX quickstart Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Add warnings in DPA and MHA to warn users that the default changed to BSHD instead of SBHD Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * Minimize the scope of when to trigger warnings for changed defaults for transpose_batch_sequence Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Kshitij Janardan Lakhani <klakhani@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Jan Bernlöhr authored
Signed-off-by:
janbernloehr <jan@bernloehrs.de> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
oliver könig authored
* ci: Build and attach bdist wheels to release page Signed-off-by:
oliver könig <okoenig@nvidia.com> * free up space Signed-off-by:
oliver könig <okoenig@nvidia.com> * cleanup Signed-off-by:
oliver könig <okoenig@nvidia.com> * test Signed-off-by:
oliver könig <okoenig@nvidia.com> * test Signed-off-by:
oliver könig <okoenig@nvidia.com> * test Signed-off-by:
oliver könig <okoenig@nvidia.com> * fix Signed-off-by:
oliver könig <okoenig@nvidia.com> * test Signed-off-by:
oliver könig <okoenig@nvidia.com> * fix Signed-off-by:
oliver könig <okoenig@nvidia.com> * fix Signed-off-by:
oliver könig <okoenig@nvidia.com> * fix Signed-off-by:
oliver könig <okoenig@nvidia.com> * fix Signed-off-by:
oliver könig <okoenig@nvidia.com> * c28619d8999a147d5e09c1199f84ff6af6ad5794 Signed-off-by:
oliver könig <okoenig@nvidia.com> * c28619d8999a147d5e09c1199f84ff6af6ad5794 Signed-off-by:
oliver könig <okoenig@nvidia.com> * Reduce months to check from 7 to 5 Signed-off-by:
oliver könig <okoenig@nvidia.com> * Update .github/scripts/check_for_ngc_images.sh Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Update .github/actions/build-pytorch-wheel/build.sh Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
oliver könig <okoenig@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Jack authored
Signed-off-by: Jack <lityangweiguang@163.com>
-
Paweł Gadziński authored
* fix Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Pawel Gadzinski <pgadzinski@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-