- 11 Feb, 2026 1 commit
-
-
Faradawn Yang authored
* fix broken link of quickstart guide
  Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
* Update README.rst
  Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
  Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
* moved getting started guide to first and moved jax out of pytorch section
  Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
* Update README.rst
  Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
  Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
---------
Signed-off-by: Faradawn Yang <73060648+faradawn@users.noreply.github.com>
Co-authored-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
-
- 27 Jan, 2026 1 commit
-
-
jberchtold-nvidia authored
* Use "nyu-mll/glue" instead of "glue" for encoder datasets to fix 404 error
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* rename mnist dataset path
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* add dataset manifest
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 15 Jan, 2026 1 commit
-
-
jberchtold-nvidia authored
disable fused attention in encoder tests for determinism
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 06 Jan, 2026 1 commit
-
-
jberchtold-nvidia authored
[JAX] Fix test_layer to support fused attention and adjust test encoder tolerance to account for minor diff (#2563)
Fix failing unit tests
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 02 Jan, 2026 1 commit
-
-
Kirthi Shankar Sivamani authored
Update copyright to include 2026
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 20 Dec, 2025 1 commit
-
-
jberchtold-nvidia authored
[JAX] Remove unused TE DPA module dtype which fixes cuDNN backend detection to properly use input dtypes (#2485)
* Remove unused TE DPA module dtype which fixes cuDNN backend detection to properly use input dtypes
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Warning fallback
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* adjust test tolerances slightly for encoder tests due to change in backend
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 12 Nov, 2025 1 commit
-
-
Phuong Nguyen authored
relax tol
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 06 Nov, 2025 1 commit
-
-
jberchtold-nvidia authored
* Try to use pre-downloaded dataset artifacts first
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Set HF_HUB_OFFLINE to disable any network calls to HF when the pre-downloaded dataset is available
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 22 Oct, 2025 1 commit
-
-
jberchtold-nvidia authored
* [JAX] Support recipe flags for disabling SR, RHT, and 2D quantization
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* lint
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Fix issue with SR state being erased due to pytree handling of NVFP4Quantizer
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Add test for SR state preservation across VJP boundaries
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Fix sharding of SR rng state
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* lint
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* update tolerances slightly now that SR is enabled
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* lint
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Use hashlib for deterministic hashes across runs for SR
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* rename uses_rht on scaled tensors to has_applied_rht
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* add assert
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Move decision of whether to use RHT into helper.py and add dedicated RHT tests
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* lint
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* fix use_rht attr usage
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* fix pure-jax rht usage criteria
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Adjust tolerances after rebase
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 21 Oct, 2025 1 commit
-
-
jberchtold-nvidia authored
HF login in JAX examples
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 14 Oct, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Initial API change
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Change all imports and api
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* format
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fixes
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix typo
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix recipe tests
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix more tests
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix docs, tests, and make Jax change as well
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Change internal uses of fp8_autocast
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Address nits
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* rename file
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* CG function, and small test fixes
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Change instances of make_graphed_callables internally
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix distributed tests
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Review
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Review
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix test and add more docs
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Cleanup test imports and minimize internal file imports
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Make is_bf16_available public
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fixes
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* fix tests
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Better docs and better api
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* format
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Apply suggestions from code review
  Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* fix nvfp4 test
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 09 Oct, 2025 1 commit
-
-
jberchtold-nvidia authored
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 29 Sep, 2025 1 commit
-
-
Phuong Nguyen authored
* add xml export for test_multiprocessing_encoder and test_cgemm
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 27 Sep, 2025 1 commit
-
-
Phuong Nguyen authored
* init cgemm + unit tests
* UB bootstrap with NCCL, no MPI dependency
* add NVLINK-P2P check + error message
* skip tests if no NVLINK available
* use std::vector to store ncclComm_t
* update misuse of TP warning
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 03 Sep, 2025 1 commit
-
-
Daniel Stokes authored
Signed-off-by: djns99 <40156487+djns99@users.noreply.github.com>
-
- 29 Aug, 2025 1 commit
-
-
Daniel Stokes authored
-
- 26 Aug, 2025 1 commit
-
-
Phuong Nguyen authored
* clean up sharding
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* added tpsp_resource
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* update tests
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* rework test for MeshResource
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* add mesh_resource into fp8_autocast in test_helper.py
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 20 Aug, 2025 1 commit
-
-
jberchtold-nvidia authored
[JAX] Error checking for mesh resource and update GemmPrimitive to use global_mesh_resource().fsdp_resource (#2088)
* Enforce global MeshResource is set
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Use global_mesh_resource().fsdp_resource in gemm primitive
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Update tests
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Update gemm.py
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Update test_layer.py
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 17 Jul, 2025 1 commit
-
-
Phuong Nguyen authored
tighten encoder test tols
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 16 Jul, 2025 1 commit
-
-
jberchtold-nvidia authored
* Support flax sharding constraints
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Add warning for deprecated TE logical axes
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Update examples
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 14 Jul, 2025 1 commit
-
-
Alp Dener authored
* added XLA FFI custom op for TE/common nvte_cublas_gemm
  Signed-off-by: Alp Dener <adener@nvidia.com>
  started GemmPrimitive, abstract done
  Signed-off-by: Alp Dener <adener@nvidia.com>
  gemm custom op working with BF16, needs testing for FP8/MXFP8
  Signed-off-by: Alp Dener <adener@nvidia.com>
  converted TE GEMM API to use ScaledTensor and added os ENV flag to use TE GEMM under general gemm() call
  Signed-off-by: Alp Dener <adener@nvidia.com>
  BF16 tests passing, FP8 tests should be passing but contracting_dims has a scoping issue
  Signed-off-by: Alp Dener <adener@nvidia.com>
  fp8 tests passing for E4M3, getting CUBLAS_STATUS_NOT_SUPPORTED for E5M2
  Signed-off-by: Alp Dener <adener@nvidia.com>
  updated GEMM API to use separate LHS and RHS quantizers instead of a QuantizerSet
  Signed-off-by: Alp Dener <adener@nvidia.com>
  new GemmPrimitive passing all Dense tests
  Signed-off-by: Alp Dener <adener@nvidia.com>
  import cleanup and reverted code chunk movement
  Signed-off-by: Alp Dener <adener@nvidia.com>
  removed unused .transpose() implementations from ScaledTensors
  Signed-off-by: Alp Dener <adener@nvidia.com>
  all custom call tests passing on Hopper, GEMM-related tests cover both GemmPrimitive and native JAX impl
  Signed-off-by: Alp Dener <adener@nvidia.com>
  removed direct calls to GemmPrimitive.enabled() from outside of cpp_extensions
  Signed-off-by: Alp Dener <adener@nvidia.com>
  removed unused changes to ScaledTensor classes and debug prints
  Signed-off-by: Alp Dener <adener@nvidia.com>
* minor unit test cleanup
  Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* FP8 tests passing on Blackwell but MXFP8 outputs NaN
  Signed-off-by: Alp Dener <adener@nvidia.com>
* reverted dense and fuseddense changes, FP8 test passing on Hopper and Blackwell, MXFP8 has issues with E5M2
  Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* MXFP8 issue traced to scale factor padding with NaNs instead of zeros
  Signed-off-by: Alp Dener <adener@nvidia.com>
* padding scale with 2^-127 instead of nans
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* fix bug on rhs_scale_inv usage
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* cleanup E8M0 type converter use it in gemm.cpp
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* segfault fixed, passing all unittests on Blackwell
  Signed-off-by: Alp Dener <adener@nvidia.com>
* fix for fuseddense tests
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* fix workspace alignment
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fixed GemmPrimitive custom partitioning to match jax.nn.scaled_matmul
  Signed-off-by: Alp Dener <adener@nvidia.com>
  all unit tests passing on H100x8 node
  Signed-off-by: Alp Dener <adener@nvidia.com>
  [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
  linting fixes
  Signed-off-by: Alp Dener <adener@nvidia.com>
  fixed batch dimension numbers
  Signed-off-by: Alp Dener <adener@nvidia.com>
  fixed FP8 scale sharding rule when there are no FP8 scales
  Signed-off-by: Alp Dener <adener@nvidia.com>
  added error message for unsupported Shardy partitioner
  Signed-off-by: Alp Dener <adener@nvidia.com>
  fixed test tolerances for FP8 cases
  Signed-off-by: Alp Dener <adener@nvidia.com>
  fixed shardy test skip cases
  Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* moved reshape of encoder output in encoder examples to make custom partitioning rules work correctly
  Signed-off-by: Alp Dener <adener@nvidia.com>
* added helper functions for padding and unpadding block scales, changed GemmPrimitive to accept unpadded scales and pad them after sharding
  Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* updated shardy rules for all custom ops to decouple block scale rules from their tensors
  Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fixed linting errors
  Signed-off-by: Alp Dener <adener@nvidia.com>
* changed unit test use_jax_gemm option to be a context to preserve external custom op settings, tightened multi-GPU encoder test tolerances, changed gemm() API to use contracting_dims and batched_dims separately instead of dimension_numbers
  Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fixed typo in test utils
  Signed-off-by: Alp Dener <adener@nvidia.com>
* added sequence-first input warnings
  Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* fixed datasets version for JAX examples
  Signed-off-by: Alp Dener <adener@nvidia.com>
* reverting modification to force_1x_quantization decision
  Signed-off-by: Alp Dener <adener@nvidia.com>
* corrected gemm function syntax in unit tests
  Signed-off-by: Alp Dener <adener@nvidia.com>
---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 11 Jul, 2025 1 commit
-
-
Alp Dener authored
* capped JAX encoder example datasets version at below 4.0
  Signed-off-by: Alp Dener <adener@nvidia.com>
---------
Signed-off-by: Alp Dener <adener@nvidia.com>
-
- 26 Jun, 2025 1 commit
-
-
jberchtold-nvidia authored
Use keyword args for jit in_shardings and out_shardings
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 17 Jun, 2025 1 commit
-
-
Phuong Nguyen authored
* include previously accidentally excluded tests
* Execute run_test_multiprocessing_encoder with nested bash + exit code for inner bash shell
* Adapt run_test_multiprocessing to handle segfault
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 05 Jun, 2025 1 commit
-
-
Phuong Nguyen authored
* fix otype for fp8 gemm
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 21 May, 2025 1 commit
-
-
Sudhakar Singh authored
* fix model parallel encoder to be properly sharded
  Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Signed-off-by: Sudhakar Singh <sudhakars@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 16 May, 2025 1 commit
-
-
jberchtold-nvidia authored
* [JAX] Update flax module param initialization to support logical partitioning axes
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Fix ffn1 intermediate result being replicated
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Lint
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Add documentation and assert when logical_axes=None
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Fix bias in LayerNormMLP flax module
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Fix layer tests to not use nn_partitioning and instead use nn.with_logical_axes
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 29 Apr, 2025 1 commit
-
-
jberchtold-nvidia authored
* Update test_helper.py and add QuantizeConfig class for CurrentScaling
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* WIP distributed current scaling
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Distributed Current Scaling (debugging). Distributed implementation with replicated scale_inv works for layernorm_mlp but feels like a hack.
  It has different per-device scale_inv values, but jax.debug.print only shows one of them. Since we're telling JAX/XLA that this scale is replicated, I think it assumes all the values are equal. However, it doesn't actually check this, so it seems we are able to get away with per-device scales for current scaling. I am not sure how stable this will be; it may randomly fail if we or the user change the partitioning at all, or if XLA decides to actually act on the assumption that all these scale_invs are the same.
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Implement distributed current scaling by computing a global amax and scale before quantization
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Add encoder and mnist tests for current scaling
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Add primitive prefix to shardy unique_vars to prevent factor conflicts when performing unfused primitives for current scaling
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Remove scale_shape primitive arg that is no longer used
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Format
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Fix expected result on multiprocessing encoder test
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Lint fix
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Update multiprocessing current scaling tolerances
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Uncomment test case that was disabled for testing
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
* Remove commented out debug line
  Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
---------
Signed-off-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 25 Apr, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
Update FSDP example instructions
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 15 Apr, 2025 1 commit
-
-
Phuong Nguyen authored
* script improvement
* add wait
* add return code back
* relax tols for FP8 test in test_multiprocessing_ by 0.001
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 14 Apr, 2025 1 commit
-
-
Johannes Reifferscheid authored
* Add experimental Shardy support. Production use is not yet recommended.
---------
Signed-off-by: Johannes Reifferscheid <jreiffers@nvidia.com>
-
- 09 Apr, 2025 1 commit
-
-
Phuong Nguyen authored
* scaling enum abstract
* rm NVTE_ from ScalingMode names
* rework scaling mode enum in grouped gemm
* fix norm sharding
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 04 Apr, 2025 1 commit
-
-
Phuong Nguyen authored
* rename QuantizeAxis to QuantizeLayout, get_layout to get_data_layout, q_axis to q_layout
* add flatten_axis option
* added gated act to test encoder
* sharding constraint fixes
* fix padding when flattening first dim needs to be padded
* update test sizes so that padding is tested
* rm output sharding as it can be done in the flax module
* sharding scale_inv for mxfp8
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 01 Apr, 2025 1 commit
-
-
Phuong Nguyen authored
* refactor + mxfp8
* added grouped gemm
* rename linear to dense
* added cublas init phase for groupedGemm
* relax the tol of test encoder multiprocessing mxfp8 by 0.001
  Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
Co-authored-by: Hua Huang <huah@nvidia.com>
Co-authored-by: Jeremy Berchtold <jberchtold@nvidia.com>
-
- 25 Mar, 2025 1 commit
-
-
Phuong Nguyen authored
import te before te_jax
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 05 Mar, 2025 2 commits
-
-
Kirthi Shankar Sivamani authored
* Fix wheel install after src install
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix JAX imports
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* switch order of dirs for finding so
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Use existing dir src build
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix lint
  Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Nicolas Castet authored
* Add support for UB MNNVL
  Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
* Address review comments
  Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
* Fix lint
  Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
* Dlopen nvml lib since it comes with the cuda driver
  Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
* Add initial copyright date
  Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
---------
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
-
- 18 Feb, 2025 1 commit
-
-
Phuong Nguyen authored
flax module with compute dtype inferred from the inputs
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 14 Feb, 2025 1 commit
-
-
Phuong Nguyen authored
* fixes L1 test
* fix test_multigpu_encoder
* fixes for other multi-encoder tests
* jax.extend.ffi to jax.ffi
* initialization with float32
* add init_dtype as an optional arg to all modules
* update use_scan query from xla flags
* relax threshold for test_encoder fp8
* relax the tols
---------
Signed-off-by: Phuong Nguyen <phuonguyen@nvidia.com>
-
- 07 Feb, 2025 1 commit
-
-
Przemek Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>
-