- 27 Sep, 2025 1 commit
-
-
Phuong Nguyen authored
* init cgemm + unit tests * UB bootstrap with NCCL, no MPI dependency * add NVLINK-P2P check + error message * skip tests if no NVLINK available * use std::vector to store ncclComm_t * update misuse of TP warning Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 03 Sep, 2025 1 commit
-
-
Daniel Stokes authored
Signed-off-by:djns99 <40156487+djns99@users.noreply.github.com>
-
- 29 Aug, 2025 1 commit
-
-
Daniel Stokes authored
-
- 26 Aug, 2025 1 commit
-
-
Phuong Nguyen authored
* clean up sharding Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * added tpsp_resource Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * update tests Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * rework test for MeshResource Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * add mesh_resource into fp8_autocast in test_helper.py Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 20 Aug, 2025 1 commit
-
-
jberchtold-nvidia authored
[JAX] Error checking for mesh resource and update GemmPrimitive to use global_mesh_resource().fsdp_resource (#2088) * Enforce global MeshResource is set Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Use global_mesh_resource().fsdp_resource in gemm primitive Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Update tests Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Update gemm.py Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Update test_layer.py Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 17 Jul, 2025 1 commit
-
-
Phuong Nguyen authored
tighten encoder test tols Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
- 16 Jul, 2025 1 commit
-
-
jberchtold-nvidia authored
* Support flax sharding constraints Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Add warning for deprecated TE logical axes Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Update examples Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 14 Jul, 2025 1 commit
-
-
Alp Dener authored
* added XLA FFI custom op for TE/common nvte_cublas_gemm Signed-off-by:
Alp Dener <adener@nvidia.com> started GemmPrimitive, abstract done Signed-off-by:
Alp Dener <adener@nvidia.com> gemm custom op working with BF16, needs testing for FP8/MXFP8 Signed-off-by:
Alp Dener <adener@nvidia.com> converted TE GEMM API to use ScaledTensor and added os ENV flag to use TE GEMM under general gemm() call Signed-off-by:
Alp Dener <adener@nvidia.com> BF16 tests passing, FP8 tests should be passing but contracting_dims has a scoping issue Signed-off-by:
Alp Dener <adener@nvidia.com> fp8 tests passing for E4M3, getting CUBLAS_STATUS_NOT_SUPPORTED for E5M2 Signed-off-by:
Alp Dener <adener@nvidia.com> updated GEMM API to use separate LHS and RHS quantizers instead of a QuantizerSet Signed-off-by:
Alp Dener <adener@nvidia.com> new GemmPrimitive passing all Dense tests Signed-off-by:
Alp Dener <adener@nvidia.com> import cleanup and reverted code chunk movement Signed-off-by:
Alp Dener <adener@nvidia.com> removed unused .transpose() implementations from ScaledTensors Signed-off-by:
Alp Dener <adener@nvidia.com> all custom call tests passing on Hopper, GEMM-related tests cover both GemmPrimitive and native JAX impl Signed-off-by:
Alp Dener <adener@nvidia.com> removed direct calls to GemmPrimitive.enabled() from outside of cpp_extensions Signed-off-by:
Alp Dener <adener@nvidia.com> removed unused changes to ScaledTensor classes and debug prints Signed-off-by:
Alp Dener <adener@nvidia.com> * minor unit test cleanup Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * FP8 tests passing on Blackwell but MXFP8 outputs NaN Signed-off-by:
Alp Dener <adener@nvidia.com> * reverted dense and fuseddense changes, FP8 test passing on Hopper and Blackwell, MXFP8 has issues with E5M2 Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * MXFP8 issue traced to scale factor padding with NaNs instead of zeros Signed-off-by:
Alp Dener <adener@nvidia.com> * padding scale with 2^-127 instead of nans Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fix bug on rhs_scale_inv usage Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * cleanup E8M0 type converter use it in gemm.cpp Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * segfault fixed, passing all unittests on Blackwell Signed-off-by:
Alp Dener <adener@nvidia.com> * fix for fuseddense tests Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * fix workspace alignment Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed GemmPrimitive custom partitioning to match jax.nn.scaled_matmul Signed-off-by:
Alp Dener <adener@nvidia.com> all unit tests passing on H100x8 node Signed-off-by:
Alp Dener <adener@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci linting fixes Signed-off-by:
Alp Dener <adener@nvidia.com> fixed batch dimension numbers Signed-off-by:
Alp Dener <adener@nvidia.com> fixed FP8 scale sharding rule when there are no FP8 scales Signed-off-by:
Alp Dener <adener@nvidia.com> added error message for unsupported Shardy partitioner Signed-off-by:
Alp Dener <adener@nvidia.com> fixed test tolerances for FP8 cases Signed-off-by:
Alp Dener <adener@nvidia.com> fixed shardy test skip cases Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * moved reshape of encoder output in encoder examples to make custom partitioning rules work correctly Signed-off-by:
Alp Dener <adener@nvidia.com> * added helper functions for padding and unpadding block scales, changed GemmPrimitive to accept unpadded scales and pad them after sharding Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updated shardy rules for all custom ops to decouple block scale rules from their tensors Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed linting errors Signed-off-by:
Alp Dener <adener@nvidia.com> * changed unit test use_jax_gemm option to be a context to preserve external custom op settings, tightened multi-GPU encoder test tolerances, changed gemm() API to use contracting_dims and batched_dims separately instead of dimension_numbers Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed typo in test utils Signed-off-by:
Alp Dener <adener@nvidia.com> * added sequence-first input warnings Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed datasets version for JAX examples Signed-off-by:
Alp Dener <adener@nvidia.com> * reverting modification to force_1x_quantization decision Signed-off-by:
Alp Dener <adener@nvidia.com> * corrected gemm function syntax in unit tests Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 11 Jul, 2025 1 commit
-
-
Alp Dener authored
* capped JAX encoder example datasets version at below 4.0 Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com>
-
- 26 Jun, 2025 1 commit
-
-
jberchtold-nvidia authored
Use keyword args for jit in_shardings and out_shardings Signed-off-by:Jeremy Berchtold <jberchtold@nvidia.com>
-
- 17 Jun, 2025 1 commit
-
-
Phuong Nguyen authored
* include previously accidentally excluded tests * Execute run_test_multiprocessing_encoder with nested bash + exit code for inner bash shell * Adapt run_test_multiprocessing to handle segfault Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 05 Jun, 2025 1 commit
-
-
Phuong Nguyen authored
* fix otype for fp8 gemm Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 21 May, 2025 1 commit
-
-
Sudhakar Singh authored
* fix model parallel encoder to be properly sharded Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Sudhakar Singh <sudhakars@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 16 May, 2025 1 commit
-
-
jberchtold-nvidia authored
* [JAX] Update flax module param initialization to support logical partitioning axes Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix ffn1 intermediate result being replicated Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Lint Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Add documentation and assert when logical_axes=None Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix bias in LayerNormMLP flax module Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix layer tests to not use nn_partitioning and instead use nn.with_logical_axes Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 29 Apr, 2025 1 commit
-
-
jberchtold-nvidia authored
* Update test_helper.py and add QuantizeConfig class for CurrentScaling Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * WIP distributed current scaling Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Distributed Current Scaling (debugging). Distributed implementation with replicated scale_inv works for layernorm_mlp but feels like a hack Has different per-device scale_inv values, but jax.debug.print only shows one of them. Since we're telling JAX/XLA that this scale is replicated, I think it assumes all the values are equal. However, it doesn't actually check this, so it seems we are able to get away with per-device scales for current scaling but I am not sure how stable this will be and may randomly fail if us or the user changes partitioning at all or if XLA decides to actually act on the assumption that all these scale_invs are the same. Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Implement distributed current scaling by computing a global amax and scale before quantization Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Add encoder and mnist tests for current scaling Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Add primitive prefix to shardy unique_vars to prevent factor conflicts when performing unfused primitives for current scaling Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Remove scale_shape primitive arg that is no longer used Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Format Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Fix expected result on multiprocessing encoder test Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Lint fix Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Update multiprocessing current scaling tolerances Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Uncomment test case that was disabled for testing Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> * Remove commented out debug line Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com> --------- Signed-off-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 25 Apr, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
Update FSDP example instructions Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 15 Apr, 2025 1 commit
-
-
Phuong Nguyen authored
* script improvement * add wait * add return code back * relax tols for FP8 test in test_multiprocessing_ by 0.001 --------- Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
- 14 Apr, 2025 1 commit
-
-
Johannes Reifferscheid authored
* Add experimental Shardy support. Production use is not yet recommended. --------- Signed-off-by:Johannes Reifferscheid <jreiffers@nvidia.com>
-
- 09 Apr, 2025 1 commit
-
-
Phuong Nguyen authored
* scaling enum abstract * rm NVTE_ from ScalingMode names * rework scaling mode enum in grouped gemm * fix norm sharding --------- Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
- 04 Apr, 2025 1 commit
-
-
Phuong Nguyen authored
* rename QuantizeAxis to QuantizeLayout, get_layout to get_data_layout, q_axis to q_layout * add fatten_axis option * added gated act to test encoder * sharding constraint fixes * fix padding when flattening first dim needs to be padded * update test sizes so that padding is tested * rm output sharding as it can be done in the flax module * sharding scale_inv for mxfp8 --------- Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
- 01 Apr, 2025 1 commit
-
-
Phuong Nguyen authored
* refactor + mxfp8 * added grouped gemm * rename linear to dense * added cublas init phase for groupedGemm * relax the tol of test encoder multiprocessing mxfp8 by 0.001 Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Hua Huang <huah@nvidia.com> Co-authored-by:
Jeremy Berchtold <jberchtold@nvidia.com>
-
- 25 Mar, 2025 1 commit
-
-
Phuong Nguyen authored
import te before te_jax Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
- 05 Mar, 2025 2 commits
-
-
Kirthi Shankar Sivamani authored
* Fix wheel install after src install Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix JAX imports Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * switch order of dirs for finding so Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Use existing dir src build Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
Nicolas Castet authored
* Add support for UB MNNVL Signed-off-by:
Nicolas Castet <ncastet@nvidia.com> * Address review comments Signed-off-by:
Nicolas Castet <ncastet@nvidia.com> * Fix lint Signed-off-by:
Nicolas Castet <ncastet@nvidia.com> * Dlopen nvml lib since it comes with the cuda driver Signed-off-by:
Nicolas Castet <ncastet@nvidia.com> * Add initial copyright date Signed-off-by:
Nicolas Castet <ncastet@nvidia.com> --------- Signed-off-by:
Nicolas Castet <ncastet@nvidia.com>
-
- 18 Feb, 2025 1 commit
-
-
Phuong Nguyen authored
flax module with compute dtype inferred from the inputs Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
- 14 Feb, 2025 1 commit
-
-
Phuong Nguyen authored
* fixes L1 test * fix test_multigpu_encoder * fixes for other multi-encoder tests * jax.extend.ffi to jax.ffi * initialization with float32 * add init_dtype as an optional arg to all modules * update use_scan query from xla flags * relax threshold for test_encoder fp8 * relax the tols --------- Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
- 07 Feb, 2025 1 commit
-
-
Przemek Tredak authored
Signed-off-by:Przemek Tredak <ptredak@nvidia.com>
-
- 11 Jan, 2025 1 commit
-
-
Phuong Nguyen authored
* add test_multiprocessing_encoder with processing spawning in bash --------- Signed-off-by:Phuong Nguyen <phuonguyen@nvidia.com>
-
- 02 Jan, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 14 Dec, 2024 1 commit
-
-
Jingyue Wu authored
-
- 22 Oct, 2024 1 commit
-
-
Reese Wang authored
* Skip encoder tests on V100 * Fix mulitprocessing jax.distributed.init * Remove XLA xla_gpu_deterministic_ops which causes segfault --------- Signed-off-by:Reese Wang <rewang@nvidia.com>
-
- 15 Oct, 2024 1 commit
-
-
Santosh Bhavani authored
* Create README.md added all PyT examples Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> * Update README.md - Added JAX, PaddlePaddle, and third-party examples - Fixed DL framework links - Removed issue request for new PRs Signed-off-by:
Santosh Bhavani <sbhavani@nvidia.com> --------- Signed-off-by:
Santosh Bhavani <santosh.bhavani@live.com> Signed-off-by:
Santosh Bhavani <sbhavani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 23 Aug, 2024 1 commit
-
-
Frédéric Bastien authored
* Use jit instead of pjit --------- Signed-off-by:
Frederic Bastien <fbastien@nvidia.com> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-
- 14 Aug, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
* Switch to nltk>3.8.1 and new data Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix nltk install Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 09 Aug, 2024 1 commit
-
-
Alp Dener authored
[C/PyTorch] Fixed incorrect use of `torch.distributed.new_group()` when creating intra-node group in `initialize_ub()` (#1087) * updated initialize_ub() to use new_subgroups_by_enumeration() to generate intra-node groups, added new unit tests for TE layers with comm overlap Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 23 Jul, 2024 1 commit
-
-
Alp Dener authored
[PyTorch] Fixing hang in `initialize_ub()` for multi-node runs after PR901 removal of MPI-dependence (#986) * Re-implementing PR901 (removing MPI-dependence in Userbuffers) with multi-node fixes * passing data-parallel rank/size info from torch.distributed to userbuffers Signed-off-by:
Alp Dener <adener@nvidia.com> * multi-node example working with UB_SKIPMC=1 but not with multicast Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed multi-node hang in initialize_ub(), updated comm+GEMM overlap example to support multi-node mixed tensor/data parallelism, added README Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed use case when Userbuffers is asked to allocate the TP overlap buffer with UB_SKIPMC=1 Signed-off-by:
Alp Dener <adener@nvidia.com> * corrected example problem to set device by local ordinal instead of global process rank Signed-off-by:
Alp Dener <adener@nvidia.com> * double-free fix in userbuffers destructor Signed-off-by:
Alp Dener <adener@nvidia.com> * removed unnecessary and incorrect torch.cuda.set_device(...) Signed-off-by:
Alp Dener <adener@nvidia.com> * corrected inter-node ranks logic Signed-off-by:
Alp Dener <adener@nvidia.com> * generalized node ID logic in initialize_ub to handle arbitrary world rank layouts within node Signed-off-by:
Alp Dener <adener@nvidia.com> * added single-node comm+GEMM overlap unit tests Signed-off-by:
Alp Dener <adener@nvidia.com> * LayerNormMLP example confirmed working with 2 nodes on Eos Signed-off-by:
Alp Dener <adener@nvidia.com> * unit test cleanup Signed-off-by:
Alp Dener <adener@nvidia.com> * corrected DP group ranks logic in LNMLP comm+GEMM overlap example Signed-off-by:
Alp Dener <adener@nvidia.com> * corrected enums in unit test Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed incorrect Ubuf object init signature Signed-off-by:
Alp Dener <adener@nvidia.com> * switched default backend for Userbuffer bootstrapping to Gloo with MPI and NCCL fallbacks, and initialize_ub option to manually select backend Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed all comm+GEMM overlap unit tests Signed-off-by:
Alp Dener <adener@nvidia.com> * corrected all_gather use for Gloo backend Signed-off-by:
Alp Dener <adener@nvidia.com> * changed userbuffers allgather callback to always use all_gather() instead of all_gather_into_tensor() Signed-off-by:
Alp Dener <adener@nvidia.com> * restored and verified old MPI-based bootstrapping via NVTE_UB_WITH_MPI=1 option at compile time Signed-off-by:
Alp Dener <adener@nvidia.com> * disabled scoped GIL release for comm+GEMM overlap algorithms Signed-off-by:
Alp Dener <adener@nvidia.com> * avoid dist.init_device_mesh in comm+GEMM overlap example to support older PyTorch versions Signed-off-by:
Alp Dener <adener@nvidia.com> * applied RS overlap FP8 fix from PR1004 Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed segfault in Userbuffers destructor Signed-off-by:
Alp Dener <adener@nvidia.com> * corrected comm+GEMM overlap unit test arguments Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed unit test run command for when Userbuffers is compiled with MPI Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Refactored torch.distributed collectives into pure C++ callbacks Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 15 Jun, 2024 1 commit
-
-
Phuong Nguyen authored
* rm tensor check if the workspace is empty Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * add trust_remote=true for load_dataset() in the mnist test Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 14 Jun, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
* Apply formatting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Apply formatting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 13 Jun, 2024 1 commit
-
-
Alp Dener authored
* added DL framework callbacks for bootstrapping userbuffers without MPI Signed-off-by:
Alp Dener <adener@nvidia.com> * removed userbuffers availability check in TE modules since userbuffers is now always compiled Signed-off-by:
Alp Dener <adener@nvidia.com> * added comm+GEMM overlap example with LayerNormMLP Signed-off-by:
Alp Dener <adener@nvidia.com> * lintin and review fixes Signed-off-by:
Alp Dener <adener@nvidia.com> * linting and review fixes Signed-off-by:
Alp Dener <adener@nvidia.com> * added header guards Signed-off-by:
Alp Dener <adener@nvidia.com> * removed defunct userbuffers checks in build_utils and setup.py Signed-off-by:
Alp Dener <adener@nvidia.com> * added exposed API in modules/base.py to __all__ Signed-off-by:
Alp Dener <adener@nvidia.com> * removed transformer_engine/CMakeLists.txt and shifted all TE/common compile into transformer_engine/common/CmakeLists.txt Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 12 Jun, 2024 1 commit
-
-
Ming-Xu Huang authored
* Reformat FP8 Meta 1. Reformat FP8 meta to be one-set-per-tensor. 2. Remove fp8_max and scale_inv. 3. Remove unused functions in fp8.py Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fix unit-tests Signed-off-by:
Ming Huang <mingh@nvidia.com> * Remove ShardingType and MajorShardingType Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fix lint errors Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fixed unittests. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Rename few variables. Signed-off-by:
Ming Huang <mingh@nvidia.com> * Add jit to update_amax_list Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fixed naming error in LayernormMLP Signed-off-by:
Ming Huang <mingh@nvidia.com> * Fixed bugs in test_distributed_layernorm_mlp.py Signed-off-by:
Ming Huang <mingh@nvidia.com> --------- Signed-off-by:
Ming Huang <mingh@nvidia.com>
-