- 07 Jan, 2026 1 commit
-
-
Teddy Do authored
* force initialization to int32
* address greptile comment
Signed-off-by: tdophung <tdophung@nvidia.com>
-
- 06 Jan, 2026 3 commits
-
-
jberchtold-nvidia authored
[JAX] Fix test_layer to support fused attention and adjust test encoder tolerance to account for minor diff (#2563)
Fix failing unit tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
jberchtold-nvidia authored
* Fix long compile time in padding.cu
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Paweł Gadziński authored
* docs: Add comprehensive Getting Started guide with benchmarks
  - Add new Getting Started documentation with PyTorch and JAX tutorials
  - Include benchmark scripts demonstrating TE performance benefits
  - Add CSS styling for code output and tabs
  - Replace old quickstart notebooks with improved documentation
  - Add transformer layer diagram (SVG)
  - Update docs configuration and workflow
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* fix
* fix
* 2026 in copyright
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 05 Jan, 2026 2 commits
-
-
Peter St. John authored
* Add tests for 2528 and 2529
* Update tests/pytorch/test_deferred_init.py
* Update tests/pytorch/test_deferred_init.py
Signed-off-by: Peter St. John <pstjohn@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kirthi Shankar Sivamani authored
Fix barrier ID
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 02 Jan, 2026 3 commits
-
-
xiaoxi-wangfj authored
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Co-authored-by: Teddy Do <tdophung@nvidia.com>
-
Kirthi Shankar Sivamani authored
* Document envvars
* Add remaining envvars
* More missing ones
* Update docs/envvars.rst
* Update docs/envvars.rst
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
-
Kirthi Shankar Sivamani authored
Update copyright to include 2026
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 31 Dec, 2025 3 commits
-
-
Robin Zhang authored
* replace autograd.grad with autograd.backward
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* get/set graphable rng state
* fix lint
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
刘俊 authored
Signed-off-by: fuyue.lj <fuyue.lj@antgroup.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Kshitij Lakhani authored
* Fix incorrect calculation of segment pos from segment ids for THD cases and load-balanced cases in from_segment_ids_and_pos. Enforce passing of segment_pos for THD cases and load-balanced cases
* Correct the assert condition
* Modify fused attn tests to pass new args to from_segment_ids_and_pos()
* Calculate seg ids before pos
* 1. Change the signature for from_segment_ids_and_pos() 2. Add support for THD in from_segment_ids_and_pos() 3. Assert if load-balanced segment_ids is passed to generate a segment_pos
* Pass keyword-only args by name
* nit: Fix typo to use seg_ids instead of segment_ids
* nit: Fix comments
* Modify the function call to differentiate between load balancing and actually reordered segment_ids and segment_pos
* Fix is_segment_ids_reordered to be set only when CP and load balancing
* Fix comments for from_segment_ids_and_pos()
* Code clean up; fix lint errors
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Signed-off-by: Kshitij Janardan Lakhani <klakhani@login-eos01.eos.clusters.nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
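The commit above is about deriving segment positions from segment ids in packed (THD-style) sequences. As a rough illustration of why the naive derivation only holds for contiguous segments, here is a minimal pure-Python sketch (a hypothetical helper, not TE's actual `from_segment_ids_and_pos`): position restarts at 0 whenever the segment id changes, which is exactly the assumption that breaks once load balancing reorders tokens, hence the enforced explicit `segment_pos` in that case.

```python
def segment_pos_from_ids(segment_ids):
    """Derive per-token positions from segment ids, assuming each
    segment occupies a contiguous run of tokens (no reordering)."""
    pos, prev, p = [], None, 0
    for sid in segment_ids:
        if sid != prev:   # new segment starts: reset the position counter
            p = 0
        pos.append(p)
        p += 1
        prev = sid
    return pos

# Two packed segments [1,1,1] and [2,2] followed by one padding token (id 0):
print(segment_pos_from_ids([1, 1, 1, 2, 2, 0]))  # [0, 1, 2, 0, 1, 0]
```

If the ids were reordered for context-parallel load balancing (e.g. `[1, 2, 1, 2, 1, 0]`), this derivation would produce wrong positions, which is why reordered inputs must carry their own `segment_pos`.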
-
- 27 Dec, 2025 1 commit
-
-
xiaoxi-wangfj authored
* [PyTorch] Fuse permute+pad and unpermute+unpad ops for FP8 optimization
  1. Fused `moe_permute_with_probs` + `Fp8Padding` and fused `moe_unpermute` + `Fp8Unpadding`, which removes the explicit padding/unpadding of MoE experts, improving performance and reducing peak GPU memory usage.
  2. Add tests of fused permute/pad and unpermute/unpad.
* [PyTorch/Common] Fuse permute+pad and unpermute+unpad support with_merging_probs
* [PyTorch] Format code
* [Common] Perf: load expert_idx once
* fix: pad_offsets can be None
* Add padding + merging probs bwd support (not tested)
* Fix garbage-initialized act grad
* All tests passing for JAX permutation + pad
* Change tokens_per_experts APIs to num_out_tokens with conservative allocation of worst-case padding for the output buffer
* Change test permutation to reduce test time
* Trigger PR refresh
* Format code
* Remove some test cases from the PyTorch side. Add a separate token_dispatch test for sanity in case combine accidentally undoes an error on dispatch in the roundtrip test. Add distinction between L0 and L2 in test cases in JAX
* Remove chance for inefficiency in moving between CPU and GPU, remove redundant primitive using a new static bool for padding, add assert for align size
* Fix lint in JAX
* Account for JAX both newer and older than version 0.8.2; adjust GPU Triton binding accordingly
* Fix typo
Signed-off-by: xiaoxi-wangfj <690912414@qq.com>
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: tdophung <tdophung@nvidia.com>
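One bullet above mentions "conservative allocation of worst-case padding for the output buffer". The arithmetic behind that idea can be sketched as follows (an assumption about the sizing logic, not TE's actual allocator): if each of `num_experts` token groups is independently padded up to a multiple of `align`, each group can grow by at most `align - 1` tokens, so a buffer of that worst-case size is always large enough regardless of how tokens are routed.

```python
def worst_case_padded_size(num_out_tokens, num_experts, align):
    """Upper bound on the padded token count when each expert's group
    is rounded up to a multiple of `align` (illustrative sketch)."""
    return num_out_tokens + num_experts * (align - 1)

# 1000 routed tokens, 8 experts, FP8-friendly alignment of 16:
print(worst_case_padded_size(1000, 8, 16))  # 1120
```

Allocating for the worst case up front avoids a device-to-host sync to read the exact per-expert counts before sizing the output.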
-
- 22 Dec, 2025 1 commit
-
-
Teddy Do authored
* Add Triton ptxas path for GB300 so it can be found, to avoid compilation errors
* Add these flags in advance to prevent future breaks when ops are extended to multi-GPU
* Add this also to L1
Signed-off-by: tdophung <tdophung@nvidia.com>
-
- 20 Dec, 2025 2 commits
-
-
Zhongbo Zhu authored
* rowwise colwise RHT group quant v1
* remove local array RW
* change wait_barrier
* fast math options
* use mult to replace div
* format
* bulk move random states
* greptile
* lint
* revert to use divides
* avoid fp32 bf16 round-trip in RHT cast fusion
* trigger fastmath by toggling NVTE_RHT_CAST_FUSION_USE_FAST_MATH
* integrate row col RHT fusion, functional
* numerics aligned
* style
* remove device sync
* 128 padding
* revert colwise rng state creation because of row-col fused kernel
* fix CI, linter
* refactor RS for generating two random values
* Avoid invalid configs with templated kernel
* fix acc pipeline init with 0 arrival count
* restore rowwise-only mode
* switch to dynamic atomic scheduler
* Avoid instantiating group RHT+cast kernel without row-wise or col-wise output
* Include fast math option in quantization config
* Fix linter warnings and review nits
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Use TE license
* Fix bug where kernel is always launched on stream
* Restore BF16 intermediate downcast in fused RHT-cast kernels
* fix numerical test of grouped kernel
* Make sure row-wise and col-wise quantization use different RNG seeds
* Restore autoformatter
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
jberchtold-nvidia authored
[JAX] Remove unused TE DPA module dtype which fixes cuDNN backend detection to properly use input dtypes (#2485)
* Remove unused TE DPA module dtype which fixes cuDNN backend detection to properly use input dtypes
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Warning fallback
* Adjust test tolerances slightly for encoder tests due to change in backend
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 19 Dec, 2025 2 commits
-
-
jberchtold-nvidia authored
* Handle meshes set with jax.set_mesh
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Sudhakar Singh authored
* Add early return back (removed in 2427)
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Make sure Float8Tensor.contiguous supports autograd. Expand quantized tensor tests to check identity ops.
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
-
- 18 Dec, 2025 2 commits
-
-
oliver könig authored
Signed-off-by: oliver könig <okoenig@nvidia.com>
-
LucienXian authored
* Fix meta device check failure when passing torch.device objects
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: LucienXian <fl.xian@foxmail.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
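The fix above is the classic pitfall of comparing a device argument as if it were always a string. A simplified stand-in sketch (hypothetical helper, not TE's code; `FakeDevice` stands in for `torch.device`, which exposes a `.type` attribute) shows a check that accepts both forms:

```python
from collections import namedtuple

# Stand-in for torch.device, which has a `.type` attribute ("meta", "cuda", ...).
FakeDevice = namedtuple("FakeDevice", ["type"])

def is_meta_device(device):
    """Accept either a string like "meta" / "meta:0" or a device object."""
    if isinstance(device, str):
        return device.split(":")[0] == "meta"
    return getattr(device, "type", None) == "meta"

print(is_meta_device("meta"))              # True
print(is_meta_device(FakeDevice("meta")))  # True
print(is_meta_device(FakeDevice("cuda")))  # False
```

A naive `device == "meta"` comparison would return False for a device object even when it is in fact the meta device, which matches the failure mode the commit describes.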
-
- 17 Dec, 2025 3 commits
-
-
jberchtold-nvidia authored
* Tutorial for integrating TE/JAX quantization into an existing framework
* Add TODOs
* Support NVFP4 SR RNG key, move wrapper module into TE itself, fix bfloat16 cast
* Update docstrings
* Fix QKV proj and out proj in Flax example transformer layer
* Use fused attention in quickstart_jax example
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Remat policy
* Add tutorial to docs
* Update title
* Remove unused dtype from TE DPA module
* Fix notebook title
* Fix lint
* Add explanation of Flax module wrapper
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Przemyslaw Tredak authored
* Add ccache support to TE and use it in GitHub Actions
* Move to allowed action with sccache
* Properly handle sccache
* Fix typo
* Remove ccache from the custom Docker workflows where we can't run the action in the container
* JAX already uses the same CMake options to build the extension, so there is no need to set CXX too
* Remove the unnecessary env variables
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
Jinhang Choi authored
Reset weight ws cache for NVFP4TensorStorage
Signed-off-by: Jinhang Choi <jinhangc@nvidia.com>
-
- 16 Dec, 2025 1 commit
-
-
vcherepanov-nv authored
* Use GEMM-AR fallback on newer cuBLASMp
* Remove test skip logic completely
Signed-off-by: Vladimir Cherepanov <vcherepanov@nvidia.com>
-
- 15 Dec, 2025 3 commits
-
-
Paweł Gadziński authored
* Skip delayed wgrad tests in distributed numerics when debug mode is enabled
* fix
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
kwyss-nvidia authored
* Check calling convention for amax switch. Wgrad GEMMs with colwise x colwise require rowwise data via general_gemm. Since dy has both for dgrad and wgrad, the brittleness has likely not affected results.
* Clear rowwise data when applicable.
* Update test with columnwise cases.
* Check enum value rather than implicit cast.
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
-
Yashaswi Karnati authored
* Fix CE loss with ignore idx
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Remove fix comments
* Fallback divisor to 1
* Add args for n_rows and n_non_ignore
* Fuse n_non_ignore into softmax kernel
* Fix incorrect arg
Signed-off-by: ykarnati <ykarnati@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
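The "fallback divisor to 1" bullet above addresses a well-known edge case in mean-reduced cross entropy with an ignore index. A hedged pure-Python sketch of the pattern (illustrative only, not the fused Triton kernel): losses at ignored positions are dropped from both the sum and the count, and when every target is ignored the divisor falls back to 1 so the result is 0.0 rather than a 0/0 NaN.

```python
import math

def ce_mean(logprobs, targets, ignore_index=-100):
    """Mean cross entropy over non-ignored targets.
    `logprobs[i][t]` is the log-probability of class t for token i."""
    total, n_non_ignore = 0.0, 0
    for lp, t in zip(logprobs, targets):
        if t == ignore_index:
            continue  # ignored tokens contribute neither loss nor count
        total += -lp[t]
        n_non_ignore += 1
    return total / max(n_non_ignore, 1)  # fallback divisor of 1

lp = [[math.log(0.5), math.log(0.5)]] * 2
print(ce_mean(lp, [0, -100]))     # mean over the one valid token
print(ce_mean(lp, [-100, -100]))  # 0.0, not NaN
```

Without the fallback, an all-ignored batch divides by zero, which is exactly the failure the commit fixes.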
-
- 12 Dec, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Add triton dep
* Fix
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Teddy Do <tdophung@nvidia.com>
-
- 11 Dec, 2025 4 commits
-
-
Kshitij Lakhani authored
* Unset NVTE_FUSED_RING_ATTENTION_USE_SCAN by default
* Add TODO
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Change the warning check in the P2P helper to warn against using the scan loop
Signed-off-by: Kshitij Lakhani <klakhani@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Robin Zhang authored
Convert sample tuple to list in reuse
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
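The tuple-to-list conversion above comes down to mutability: reusing a cached sample means overwriting its entries in place, which tuples forbid. A minimal illustration of the pattern (assumed behavior of the fix, not the TE code itself):

```python
sample = (1, 2, 3)  # a cached sample arriving as an immutable tuple

try:
    sample[0] = 10
except TypeError:
    pass  # tuples reject in-place assignment

sample = list(sample)  # convert once when reusing
sample[0] = 10         # in-place update now works
print(sample)  # [10, 2, 3]
```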
-
Robin Zhang authored
set_all_rng_states in set_states
Signed-off-by: Robin Zhang <robinz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Evgeny Tsykunov authored
* Add separate RNG states for columnwise quantization with Stochastic Rounding
* Fix single tensor path
Signed-off-by: Evgeny <etsykunov@nvidia.com>
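For readers unfamiliar with the technique named above: stochastic rounding rounds a value up or down with probability proportional to its distance to the two neighboring grid points, so rounding is unbiased in expectation. A conceptual pure-Python sketch (not the CUDA kernel; grid step and seeds are made up), with independent RNG states so that row-wise and column-wise quantization draw different random numbers, as the commit adds:

```python
import random

def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, up with probability equal to the
    fractional distance past the lower grid point."""
    lo = (x // step) * step
    frac = (x - lo) / step
    return lo + step if rng.random() < frac else lo

# Separate RNG states: row-wise and col-wise passes must not reuse
# the same random sequence (the point of the fix above).
rng_rowwise = random.Random(0)
rng_colwise = random.Random(1)

x = 0.3
r = stochastic_round(x, 0.25, rng_rowwise)
c = stochastic_round(x, 0.25, rng_colwise)
assert r in (0.25, 0.5) and c in (0.25, 0.5)
```

Sharing one RNG state between the two quantization directions would correlate their rounding errors; distinct states keep the two quantized copies statistically independent.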
-
- 10 Dec, 2025 3 commits
-
-
Charlene Yang authored
* Update FE; initial pass at THD
* Produce Stats+Max instead of Max+Sum_Exp
* Revert "produce Stats+Max instead of Max+Sum_Exp" (this reverts commit c7d2b77b2da9ff3f68344097284187ac427eeb6a)
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
-
Paweł Gadziński authored
* Code drop
* [pre-commit.ci] auto fixes from pre-commit.com hooks
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
jberchtold-nvidia authored
* Make softmax_type in FFI optional
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Add warn message
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 09 Dec, 2025 4 commits
-
-
Teddy Do authored
* Branch off of initial permutation jax-triton PR
* Set 0 as the size of dummy tensors to reduce memory usage
* Correct setting of permuted_probs_stride_token, unpermuted_probs_stride_token and unpermuted_probs_stride_expert in unpermutation
* Implement primitives, wrapper, test for wrapper; edit Triton binding to accommodate scalars
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Change implementation of VJP functions to match the correct pattern. Deduce some static scalar args from shapes of inputs. Accept B, S instead of num_tokens. Change test to use value_and_grad to test VJP funcs properly
* Formatting
* Fix pylint
* Fix test to compare to the correct reference impl; relax one tolerance for grad compare; fix lint the right way
* Fix test_permutation to use value_and_grad for reference impl, tighten tols, and add unpermute with probs for token combine bwd rule
* Add forgotten file from prev commit
* Format
* Merge with_probs into without_probs
* Add asserts and fix lint
Signed-off-by: tdophung <tdophung@nvidia.com>
Co-authored-by: Ming Huang <mingh@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Przemyslaw Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-
Teddy Do authored
Change order
Signed-off-by: tdophung <tdophung@nvidia.com>
-
Kirthi Shankar Sivamani authored
Fixes to runtime loading logic and add missing deps
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 08 Dec, 2025 1 commit
-
-
vthumbe1503 authored
* Bug fixed, test added
* Fix contiguous
* Revert unnecessary change
* Revert another change
* [pre-commit.ci] auto fixes from pre-commit.com hooks
* Address review comments
* Update transformer_engine/pytorch/tensor/mxfp8_tensor.py
* Address review comments
* Missed adding renamed file
* Fix minor issue
* Fix CI issue
* Fix the test for bfloat16
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Signed-off-by: vthumbe1503 <vthumbe@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
-