- 10 Apr, 2025 1 commit

kwyss-nvidia authored
* Add GEMM logic for blockwise quantized tensors. GEMM test cases included in PyTorch integration. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update NVTE_BLOCK_SCALING for GEMM. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Gate feature on CUDA 12.9. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Fix GEMM typo. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Remove unnecessary type-converter change. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Reflect epilogue availability and test supported epilogues. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* GEMM simplifications from recipe branch. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Format Python code. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update GEMM DGelu tests to match support depending on output dtype. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Force pow2Scales in GEMM. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Add GEMM test to PyTorch test suite. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Add copyright to GEMM test. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update import for GEMM test. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Add license. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update test GEMM supported predicate. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Use sgemm-like interfaces and naming. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Rewrite GEMM comment. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* MR feedback. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Recipe setup for Linear modules. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Use CUDA 12.9 feature test. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Run against tensor dumps from internal library. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update FIXME to TODO with linked issue. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update full recompute feature to save the recipe. The recompute context uses the same recipe and FP8 settings as the original forward pass. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* MR feedback: avoid reusing quantizer objects. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update logic in module. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Format Python code. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update for PP bug. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update test numerics. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update force_power_of_2 scales in the recipe. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update usage method to satisfy upstream changes. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Fix subchannel recipe in distributed test with BF16 gather. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* Edit and clean up BF16 gather code. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update test import. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Support columnwise-only mode in the 1D quantize kernel. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* Format and move enum. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Skip alloc. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Try async BF16 gather. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* Format Python code. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Document and type code. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Fix PyTorch lint errors. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Don't set high-precision dtype. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Add test for sanity and CUDA graphs; fix CUDA graphs for Sequential? Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Keep the make_quantizers API stable; update num_quantizers instead to pass cuda_graph tests. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Fix import name. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Rename recipe method. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Skip grouped linear sanity test. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Set usage before BF16 gather. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Refactor for nvte_quantize_v2. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* Format code. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Clean up nvte_quantize_v2. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Test FP32 scales. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Disable CUDA graph. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Simplify LayerNorm linear. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Clean up LayerNorm linear. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* LayerNorm linear bwd gather logic. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Communication updates. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update transformer_engine/pytorch/ops/op.py; apply MR comment change. Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by: kwyss-nvidia <kwyss@nvidia.com>
* Lint fix. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* MR feedback. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Enable CUDA graph tests. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Reduce chance of spurious failure and reword. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Review suggestions from @timmoon10. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Update C++ tests. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update common.h. Signed-off-by: Xin Yao <yaox12@outlook.com>
* Update test_float8blockwisetensor.py. Signed-off-by: Xin Yao <yaox12@outlook.com>
---------
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Signed-off-by: kwyss-nvidia <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Xin Yao <yaox12@outlook.com>
Co-authored-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Xin Yao <yaox12@outlook.com>
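
The blockwise-quantized GEMM path above is exposed through TE's recipe system. A minimal sketch of exercising it, assuming the `Float8BlockScaling` recipe name; note the CUDA 12.9 gate and the forced power-of-2 scales from the commits above:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8BlockScaling

# Blockwise FP8 recipe (class name assumed); the underlying GEMM is gated
# on CUDA 12.9 and uses power-of-2 scale factors.
recipe = Float8BlockScaling()

linear = te.Linear(1024, 1024, bias=True).cuda()
x = torch.randn(32, 1024, device="cuda", requires_grad=True)

# All GEMMs inside the autocast region run with blockwise-quantized operands.
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = linear(x)
y.sum().backward()
```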

- 07 Apr, 2025 1 commit

Jianbin Chang authored
Support FP8 primary weights in FSDP training. Signed-off-by: jianbinc <shjwudp@gmail.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
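
A hedged sketch of what this enables, assuming TE's `fp8_model_init` entry point for FP8 primary weights: wrapping with FSDP then shards the FP8 weight storage rather than a high-precision copy.

```python
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# Assumes launch via torchrun so the process group env vars are set.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Weights are created and kept in FP8 ("primary weights"), so FSDP shards
# the FP8 storage directly instead of a higher-precision master copy.
with te.fp8_model_init(enabled=True):
    model = te.Linear(1024, 1024)
model = FSDP(model.cuda())
```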

- 31 Mar, 2025 1 commit

Tim Moon authored
* Handle case where the FP8 current-scaling quantizer gets the default process group. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix linter warning. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Avoid canonicalizing the TP group since it may not be initialized. Signed-off-by: Tim Moon <tmoon@nvidia.com>
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
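
The fix concerns which process group the current-scaling quantizer uses for amax reductions. As a sketch (recipe and argument names per TE's public API, to the best of my knowledge), that group is the one handed to `fp8_autocast`:

```python
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8CurrentScaling

dist.init_process_group(backend="nccl")
layer = te.Linear(512, 512).cuda()
x = torch.randn(8, 512, device="cuda")

# Current scaling reduces per-tensor amaxes across fp8_group; if omitted,
# the quantizer falls back to the default (world) process group.
amax_group = dist.new_group(ranks=list(range(dist.get_world_size())))
with te.fp8_autocast(enabled=True, fp8_recipe=Float8CurrentScaling(),
                     fp8_group=amax_group):
    y = layer(x)
```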

- 25 Mar, 2025 1 commit

Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

- 22 Mar, 2025 1 commit

Kunlun Li authored
* Enable fp8_primary_weights for current scaling. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Use different cast_master_weights_to_fp8 functions depending on the type of quantizer. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* All amaxes of model_weights should participate in the reduce-max. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* Clear _high_precision_init_val automatically in the cast_master_weights_to_fp8 function. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* Merge all all-reduces on amaxes into one NCCL kernel. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add unit tests for multi_tensor_compute_scale_and_scale_inv and preserve_high_precision_init_val. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* Fix conflicts. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add unit test for cast_master_weights_to_fp8. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Use a mock group to initialize fp8_autocast to avoid reduction of amax_history by fp8_autocast_exit. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* Remove with_computing_amax and with_computing_scale. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* Move replace_raw_data from QuantizedTensor to utils.py. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* Remove the allow_empty_output argument from nvte_compute_amax and set it to always be true. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* Rename the import guard of recipe_common.cuh to align with the other import guards. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* Add unit test for replace_raw_data. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add test_replace_raw_data to qa/L0_pytorch_unittest/test.sh. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* Minor changes in comments. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* Add randomness to the unit test of replace_raw_data. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* (Maybe needs revert) Add tex.quantize_to_fragment. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* (Maybe needs revert) Use nvte_quantize_noop in quantize_to_fragment. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix lint error. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* Move the high_precision_init_val test and replace_raw_data test to test_sanity.py. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Remove test_fp8_model_init.py and test_replace_raw_data.py. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* Remove cast_master_weights_to_fp8 and replace_raw_data from __all__ of tensor.__init__.py. Signed-off-by: kunlunl <kunlunl@nvidia.com>
* Move FP8 casting logic back from C++ tex funcs to Python. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Remove unimplemented function from header. Signed-off-by: Tim Moon <tmoon@nvidia.com>
---------
Signed-off-by: kunlunl <kunlunl@nvidia.com>
Signed-off-by: Kunlun Li <94586211+kunlunl@users.noreply.github.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
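
Several of these commits touch `fp8_model_init` and the preserved high-precision initial weight values. A sketch of the user-facing flow, assuming the `preserve_high_precision_init_val` keyword and the accessors exercised by the tests above (treat the exact names as assumptions):

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8CurrentScaling

# FP8 primary weights, keeping a high-precision copy of the initial values
# (used, e.g., to seed FP32 master weights in the optimizer).
with te.fp8_model_init(enabled=True, preserve_high_precision_init_val=True):
    layer = te.Linear(768, 768)

weight = layer.weight
master = weight.get_high_precision_init_val()  # accessor name assumed
weight.clear_high_precision_init_val()         # freed once master weights exist

with te.fp8_autocast(enabled=True, fp8_recipe=Float8CurrentScaling()):
    out = layer(torch.randn(8, 768, device="cuda"))
```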

- 15 Mar, 2025 1 commit

Li Tao authored
* Support TP comm overlap in the current-scaling recipe. Signed-off-by: Li Tao <lit@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Clean up. Signed-off-by: Li Tao <lit@nvidia.com>
* Fix test recipe argument to generalize to MXFP8. Signed-off-by: Li Tao <lit@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Reduce duplicated transpose in certain cases. Signed-off-by: Li Tao <lit@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Use per_tensor_scaling() to distinguish delayed scaling (DS) from current scaling (CS). Signed-off-by: Li Tao <lit@nvidia.com>
* Minor fixes. Signed-off-by: Li Tao <lit@nvidia.com>
* Change comment description. Signed-off-by: Li Tao <lit@nvidia.com>
* Add multi-layer unit test for TP overlap. Signed-off-by: Li Tao <lit@nvidia.com>
* Support test case that runs several times. Signed-off-by: Li Tao <lit@nvidia.com>
* Avoid saving the UB tensor in prepare_for_saving. Signed-off-by: Li Tao <lit@nvidia.com>
* Fix. Signed-off-by: Li Tao <lit@nvidia.com>
* Switch to a simple fix. Signed-off-by: Li Tao <lit@nvidia.com>
* Formatting. Signed-off-by: Li Tao <lit@nvidia.com>
* Simplify test cases; avoid an additional clone(). Signed-off-by: Li Tao <lit@nvidia.com>
* Fall back to get_buffer in LayerNormMLP. Signed-off-by: Li Tao <lit@nvidia.com>
* Use 2 layers for the FP8 TP-overlap multi-layer test for better tolerance; limit max GPUs for the test. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
---------
Signed-off-by: Li Tao <lit@nvidia.com>
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: zhongboz <zhongboz@nvidia.com>
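
A rough sketch of the feature being wired up here: TP communication overlap (Userbuffers) combined with the current-scaling recipe. The `initialize_ub` arguments are abbreviated and the overlap flag names are per TE's module API as I understand it; treat the setup as illustrative.

```python
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8CurrentScaling

seq, batch, hidden, tp_size = 2048, 2, 4096, 8
dist.init_process_group(backend="nccl")
tp_group = dist.new_group(ranks=list(range(tp_size)))

# Allocate the Userbuffers workspaces used for overlapped AG/RS.
te.initialize_ub(shape=[seq * batch, hidden], tp_size=tp_size, use_fp8=True)

layer = te.LayerNormMLP(
    hidden, 4 * hidden,
    tp_group=tp_group, set_parallel_mode=True, sequence_parallel=True,
    ub_overlap_ag=True, ub_overlap_rs=True,  # overlap flag names assumed
).cuda()

x = torch.randn(seq // tp_size, batch, hidden, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=Float8CurrentScaling()):
    y = layer(x)
```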

- 14 Mar, 2025 1 commit

vasunvidia authored
* Add options to comm overlap tests. Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
* Fix typo. Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
* Update tests/pytorch/distributed/run_layer_with_overlap.py. Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
---------
Signed-off-by: Vasudevan Rengasamy <vrengasamy@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

- 13 Mar, 2025 1 commit

Tim Moon authored
* Explicitly use python3 and pip3. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Run pre-commit as a Python module. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Replace some missed references to "python" or "pip". Signed-off-by: Tim Moon <tmoon@nvidia.com>
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

- 10 Mar, 2025 1 commit

Tim Moon authored
Remove Megatron-LM convergence test. Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

- 08 Mar, 2025 1 commit

Zhongbo Zhu authored
* Check in per-tensor current scaling full recipe. Signed-off-by: zhongboz <zhongboz@nvidia.com>
  [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci Signed-off-by: zhongboz <zhongboz@nvidia.com>
  Set up basics of the current-scaling quantizer at the Python level. Signed-off-by: zhongboz <zhongboz@nvidia.com>
  Add test case for current-scaling dequantize. Signed-off-by: zhongboz <zhongboz@nvidia.com>
  [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci Signed-off-by: zhongboz <zhongboz@nvidia.com>
  Finish linear layer fwd/bwd test; determined error with BF16. Signed-off-by: zhongboz <zhongboz@nvidia.com>
  [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci Signed-off-by: zhongboz <zhongboz@nvidia.com>
  Achieved zero tolerance for Linear by specifying the GEMM use_split_accumulator config. Signed-off-by: zhongboz <zhongboz@nvidia.com>
  Enable LayerNormLinear with current scaling; pass bitwise test. Signed-off-by: zhongboz <zhongboz@nvidia.com>
  Refactor test case code. Signed-off-by: zhongboz <zhongboz@nvidia.com>
  Make current-scaling quantizers distributed; pass distributed Linear and LayerNormLinear tests. Signed-off-by: zhongboz <zhongboz@nvidia.com>
  Bug fix: use the cached FP8 recipe in backward. Signed-off-by: zhongboz <zhongboz@nvidia.com>
  Fix layernorm_mlp with current scaling; fix activation_helper with current scaling. Signed-off-by: zhongboz <zhongboz@nvidia.com>
  Support detailed numerical settings from recipe to quantization kernel. Signed-off-by: zhongboz <zhongboz@nvidia.com>
  Resolve MR comments. Signed-off-by: zhongboz <zhongboz@nvidia.com>
  Recipe naming. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Resolve MR comments; remove the IS_CURRENT_SCALING template from kernels. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Resolve MR comments; make current-scaling C++ test cases. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* Add current scaling to test_numerics.py; skip activation recompute and grouped linear. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add benchmark for quantizer. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add benchmarks for linear layer. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Bug fix, typo. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* Resolve more MR comments. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* Avoid a potential race condition by not using from_blob to construct the amax tensor in C++. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* Resolve more comments. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* Debug linter warnings and license check. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Debug import error in FP8 tensor test. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Debug compilation error with CUDA 12.1 for Turing. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Resolve MR comments; fix activation cast fusion. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* Resolve comments; add NVTEQuantizationParams for compute scale. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Remove the is_current_scaling check entirely from the common folder. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* Remove benchmarks; will contribute them in another repo. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* Adjust the current-scaling default recipe config. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Adjust comments in test. Signed-off-by: zhongboz <zhongboz@nvidia.com>
* Remove current-scaling mode from the core lib. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Refactor current-scaling-specific logic in the core C++ lib: move amax and scale update functions out of the casting functions into a dedicated current-scaling source file; add a general API for accessing the quantization config object. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add missing header in C++ tests. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Disable test config with FP8 transpose on Blackwell. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fix compilation error in C++ test. Signed-off-by: Tim Moon <tmoon@nvidia.com>
---------
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: zhongboz <zhongboz@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
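
The recipe checked in here recomputes each tensor's scale from its current amax at quantization time, instead of carrying scales forward from an amax history as DelayedScaling does. A minimal usage sketch:

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import Float8CurrentScaling, Format

# Per-tensor current scaling: scale = fp8_max / amax(tensor), computed fresh
# for every quantization rather than from a history window.
recipe = Float8CurrentScaling(fp8_format=Format.HYBRID)  # E4M3 fwd, E5M2 bwd

layer = te.LayerNormLinear(1024, 1024).cuda()
x = torch.randn(16, 1024, device="cuda", requires_grad=True)
with te.fp8_autocast(enabled=True, fp8_recipe=recipe):
    y = layer(x)
y.sum().backward()
```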

- 05 Mar, 2025 1 commit

Nicolas Castet authored
* Add support for UB MNNVL. Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
* Address review comments. Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
* Fix lint. Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
* dlopen the NVML library, since it ships with the CUDA driver. Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
* Add initial copyright date. Signed-off-by: Nicolas Castet <ncastet@nvidia.com>
---------
Signed-off-by: Nicolas Castet <ncastet@nvidia.com>

- 28 Feb, 2025 1 commit

Kirthi Shankar Sivamani authored
* Enforce torch 2.0 and run attention tests with torch.compile. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Replace torch.compile with jit_fuser. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fixes. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
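
The jit_fuser swap referenced above follows a common dispatch pattern; the sketch below shows the idea with illustrative names, not TE's actual internals:

```python
import torch

# Prefer torch.compile when the installed PyTorch provides it; otherwise fall
# back to TorchScript. TE's jit_fuser plays a similar role internally.
if hasattr(torch, "compile"):
    jit_fuser = torch.compile
else:
    jit_fuser = torch.jit.script

@jit_fuser
def bias_gelu(bias: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Tanh-approximation GELU applied to the biased input, a typical
    # pointwise candidate for fusion.
    x = bias + y
    return 0.5 * x * (1.0 + torch.tanh(0.79788456 * x * (1.0 + 0.044715 * x * x)))
```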

- 20 Feb, 2025 1 commit

Kirthi Shankar Sivamani authored
* Fix te.Sequential for older PyTorch versions. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fixes. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

- 07 Feb, 2025 1 commit

Przemek Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

- 13 Jan, 2025 1 commit

Alp Dener authored
* Support AG overlap in sequence-parallel Linear forward and RS overlap in sequence-parallel Linear backward. Signed-off-by: Alp Dener <adener@nvidia.com>
* Implemented TP overlap support for column-parallel te.Linear. Signed-off-by: Alp Dener <adener@nvidia.com>
* Fixed the backward pass for column-parallel te.Linear with TP overlap; updated unit tests. Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Improved error messages for internal failure to infer TP overlap options in te.Linear. Signed-off-by: Alp Dener <adener@nvidia.com>
* Fixed linting errors. Signed-off-by: Alp Dener <adener@nvidia.com>
* Fixed incorrect TP overlap option asserts. Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

- 02 Jan, 2025 1 commit

Kirthi Shankar Sivamani authored
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

- 16 Dec, 2024 1 commit

Youngeun Kwon authored
* Draft implementation of FSDP2 FP8 all-gather. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* Fix the convergence issue. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* Add warning. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Disable lint error. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix the lint error. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* Fix lint error. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix lint error. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix lint error. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* Add comments. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* Add ref. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* Add related tests. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
---------
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
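
A sketch of the FSDP2 flow targeted here, assuming PyTorch's composable `fully_shard` API together with TE's FP8 primary weights; the point of the feature is that the all-gather can move FP8 bytes instead of BF16/FP32 weights:

```python
import torch
import torch.distributed as dist
import transformer_engine.pytorch as te
from torch.distributed._composable.fsdp import fully_shard  # FSDP2 (PyTorch >= 2.4)

dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# FP8 primary weights: FSDP2's all-gather can then communicate the FP8
# weight storage directly.
with te.fp8_model_init(enabled=True):
    model = te.TransformerLayer(
        hidden_size=1024, ffn_hidden_size=4096, num_attention_heads=16
    )
fully_shard(model.cuda())
```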

- 02 Dec, 2024 1 commit

Youngeun Kwon authored
* Draft implementation. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* Compile error fix. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* Fix compile error. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* Remove print. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Edit comments. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Edit the bulk-overlap test case. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add version guard. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Add runtime version guard. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
* Fix the version guard. Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
---------
Signed-off-by: Youngeun Kwon <youngeunk@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

- 06 Nov, 2024 1 commit

Tim Moon authored
* Add Userbuffers support for column-TP linear layer. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Add Userbuffers support for row-TP linear layer. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Interpret linear+RS as row-TP linear. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Add Userbuffers support for FP8 row-TP linear layer. Assumes FP8 RS, which is not a good assumption. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Debug bug with incorrect bias pointers in UB GEMM. Bias pointers were not properly offset for different data chunks. Also removed logic for FP8 RS. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Add Userbuffers support for linear dgrad. Test passes with row TP, fails with column TP. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Add Userbuffers support for linear wgrad. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Add support for grad bias. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fused cast-transpose-dbias. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Support case where wgrad is optional. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Expand documentation. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix linter warnings. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Use recently added convenience functions in Float8Tensor. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Respect autograd dtype. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fix missing imports. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Respect PyTorch autocast dtype in bprop. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fix linter warnings. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Debug merge conflicts. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

- 29 Oct, 2024 1 commit

Alp Dener authored
* Moved Userbuffers code to TE/common. Signed-off-by: Alp Dener <adener@nvidia.com>
* Moved comm+GEMM overlap code to TE/common. Signed-off-by: Alp Dener <adener@nvidia.com>
* Removed PyTorch dependency from comm+GEMM overlap in TE/common. Signed-off-by: Alp Dener <adener@nvidia.com>
* Added TE/PyTorch wrappers for refactored comm+GEMM overlap code in TE/common. Signed-off-by: Alp Dener <adener@nvidia.com>
* Updated the TE/PyTorch Python API to match the refactored comm+GEMM overlap code. Signed-off-by: Alp Dener <adener@nvidia.com>
* Updated unit tests to work with the refactored comm+GEMM overlap code. Signed-off-by: Alp Dener <adener@nvidia.com>
* Added a pylint exception to the comm+GEMM overlap test runner. Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fixed linting errors. Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Added documentation for te.initialize_ub. Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fixed compile errors when building with NVTE_UB_WITH_MPI=1. Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fixed default bootstrap backend. Signed-off-by: Alp Dener <adener@nvidia.com>
* Switched default bootstrap backend priority to MPI > Gloo > NCCL. Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Updated bootstrap backend documentation. Signed-off-by: Alp Dener <adener@nvidia.com>
* Close the UB bootstrap socket to avoid interfering with the CUDA Multicast shareable file handle send/recv. Signed-off-by: Alp Dener <adener@nvidia.com>
* Added torch::Tensor wrappers for the communication buffer and atomic counters so PyTorch can factor externally allocated memory into its garbage collection threshold. Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Automated handling of world, local, and node ranks/sizes within the C++ CommOverlapHelper to simplify Python function signatures. Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fixed incorrect read of environment variables. Signed-off-by: Alp Dener <adener@nvidia.com>
* Corrected priority for _SOCKET_IFNAME environment variables in UB bootstrapping. Signed-off-by: Alp Dener <adener@nvidia.com>
* Moved the multicast support check to cuda_runtime.h and replaced the cudaDeviceGetProp call with the cached sm_count(). Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Removed commented-out old code and replaced external collective function type defines with aliases. Signed-off-by: Alp Dener <adener@nvidia.com>
* Compile-time CUDA version guard for the CUDA Driver Multicast attribute. Signed-off-by: Alp Dener <adener@nvidia.com>
* Added compile-time CUDA version guards to Multicast code in Userbuffers. Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Condensed UB docs; corrected const violations. Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fixed autodoc RST for UB calls; added CUDA version guard on Multicast UB kernels. Signed-off-by: Alp Dener <adener@nvidia.com>
* Fixed incorrect UB type reporting for P2P overlaps; comment reformatting. Signed-off-by: Alp Dener <adener@nvidia.com>
* Add docstring to tex.ubuf_built_with_mpi(). Signed-off-by: Alp Dener <adener@nvidia.com>
---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
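
The bootstrap-backend commits above concern how Userbuffers exchanges its out-of-band handles before the overlap buffers are set up. A sketch of selecting the backend explicitly (keyword name per `te.initialize_ub` as I understand it, with MPI > Gloo > NCCL being the default priority noted above):

```python
import torch
import transformer_engine.pytorch as te

# Userbuffers bootstraps over a torch.distributed backend; pick one explicitly
# instead of relying on the MPI > Gloo > NCCL fallback order.
te.initialize_ub(
    shape=[8192, 4096],          # [tokens, hidden] workspace shape
    tp_size=8,
    use_fp8=True,
    dtype=torch.bfloat16,
    bootstrap_backend="gloo",    # or "mpi" / "nccl"
)
```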

- 18 Oct, 2024 1 commit

Tim Moon authored
* Reorganize PyTorch L1 tests. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Move ONNX tests to L1. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Move FA version test to L3. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Limit parallel build jobs in FA version test. Signed-off-by: Tim Moon <tmoon@nvidia.com>
---------
Signed-off-by: Tim Moon <tmoon@nvidia.com>

- 07 Oct, 2024 1 commit

Paweł Gadziński authored
* Tests for distributed. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Added the test to the QA script. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Changed QA. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* Fix to test_numerics file. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* PR fixes. Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Fixes. Signed-off-by: Pawel Gadzinski <pgadzinski@pgadzinski-mlt.client.nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Update tests/pytorch/distributed/run_numerics.py. Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
---------
Signed-off-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Signed-off-by: Pawel Gadzinski <pgadzinski@pgadzinski-mlt.client.nvidia.com>
Signed-off-by: Paweł Gadziński <62263673+pggPL@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Pawel Gadzinski <pgadzinski@pgadzinski-mlt.client.nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>

- 09 Aug, 2024 1 commit

Alp Dener authored
[C/PyTorch] Fixed incorrect use of `torch.distributed.new_group()` when creating the intra-node group in `initialize_ub()` (#1087)
* Updated initialize_ub() to use new_subgroups_by_enumeration() to generate intra-node groups; added new unit tests for TE layers with comm overlap. Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
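
The bug here is a classic `new_group()` pitfall: every rank must call it with identical arguments for every group being created, not just its own. `new_subgroups_by_enumeration()` does that collectively and hands back the calling rank's subgroup. A sketch of building per-node groups this way:

```python
import torch.distributed as dist

# Each inner list is one intra-node group; ALL ranks enumerate ALL groups,
# and new_subgroups_by_enumeration returns the group this rank belongs to.
world_size = dist.get_world_size()
gpus_per_node = 8  # illustrative
ranks_per_node = [
    list(range(start, start + gpus_per_node))
    for start in range(0, world_size, gpus_per_node)
]
intra_node_group, _all_groups = dist.new_subgroups_by_enumeration(ranks_per_node)
```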

- 23 Jul, 2024 1 commit

Alp Dener authored
[PyTorch] Fixing hang in `initialize_ub()` for multi-node runs after PR901's removal of the MPI dependence (#986)
* Re-implementing PR901 (removing the MPI dependence in Userbuffers) with multi-node fixes.
* Passing data-parallel rank/size info from torch.distributed to Userbuffers. Signed-off-by: Alp Dener <adener@nvidia.com>
* Multi-node example working with UB_SKIPMC=1 but not with multicast. Signed-off-by: Alp Dener <adener@nvidia.com>
* Fixed multi-node hang in initialize_ub(); updated the comm+GEMM overlap example to support multi-node mixed tensor/data parallelism; added README. Signed-off-by: Alp Dener <adener@nvidia.com>
* Fixed the use case where Userbuffers is asked to allocate the TP overlap buffer with UB_SKIPMC=1. Signed-off-by: Alp Dener <adener@nvidia.com>
* Corrected example problem to set the device by local ordinal instead of global process rank. Signed-off-by: Alp Dener <adener@nvidia.com>
* Double-free fix in the Userbuffers destructor. Signed-off-by: Alp Dener <adener@nvidia.com>
* Removed unnecessary and incorrect torch.cuda.set_device(...). Signed-off-by: Alp Dener <adener@nvidia.com>
* Corrected inter-node ranks logic. Signed-off-by: Alp Dener <adener@nvidia.com>
* Generalized node ID logic in initialize_ub to handle arbitrary world rank layouts within a node. Signed-off-by: Alp Dener <adener@nvidia.com>
* Added single-node comm+GEMM overlap unit tests. Signed-off-by: Alp Dener <adener@nvidia.com>
* LayerNormMLP example confirmed working with 2 nodes on Eos. Signed-off-by: Alp Dener <adener@nvidia.com>
* Unit test cleanup. Signed-off-by: Alp Dener <adener@nvidia.com>
* Corrected DP group ranks logic in the LNMLP comm+GEMM overlap example. Signed-off-by: Alp Dener <adener@nvidia.com>
* Corrected enums in unit test. Signed-off-by: Alp Dener <adener@nvidia.com>
* Fixed incorrect Ubuf object init signature. Signed-off-by: Alp Dener <adener@nvidia.com>
* Switched the default backend for Userbuffers bootstrapping to Gloo with MPI and NCCL fallbacks, and added an initialize_ub option to manually select the backend. Signed-off-by: Alp Dener <adener@nvidia.com>
* Fixed all comm+GEMM overlap unit tests. Signed-off-by: Alp Dener <adener@nvidia.com>
* Corrected all_gather use for the Gloo backend. Signed-off-by: Alp Dener <adener@nvidia.com>
* Changed the Userbuffers allgather callback to always use all_gather() instead of all_gather_into_tensor(). Signed-off-by: Alp Dener <adener@nvidia.com>
* Restored and verified old MPI-based bootstrapping via the NVTE_UB_WITH_MPI=1 compile-time option. Signed-off-by: Alp Dener <adener@nvidia.com>
* Disabled scoped GIL release for comm+GEMM overlap algorithms. Signed-off-by: Alp Dener <adener@nvidia.com>
* Avoid dist.init_device_mesh in the comm+GEMM overlap example to support older PyTorch versions. Signed-off-by: Alp Dener <adener@nvidia.com>
* Applied RS overlap FP8 fix from PR1004. Signed-off-by: Alp Dener <adener@nvidia.com>
* Fixed segfault in the Userbuffers destructor. Signed-off-by: Alp Dener <adener@nvidia.com>
* Corrected comm+GEMM overlap unit test arguments. Signed-off-by: Alp Dener <adener@nvidia.com>
* Fixed unit test run command for when Userbuffers is compiled with MPI. Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
* Refactored torch.distributed collectives into pure C++ callbacks. Signed-off-by: Alp Dener <adener@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks; for more information, see https://pre-commit.ci
---------
Signed-off-by: Alp Dener <adener@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

- 26 Jun, 2024 1 commit

Tim Moon authored
`functools.cache` was added in Python 3.9. Signed-off-by: Tim Moon <tmoon@nvidia.com>
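
For context, `functools.cache` is just `lru_cache(maxsize=None)`; on Python 3.8 the latter spelling is the drop-in replacement:

```python
import functools
import sys

# functools.cache exists only on Python >= 3.9; lru_cache(maxsize=None) is
# the equivalent, 3.8-compatible spelling.
if sys.version_info >= (3, 9):
    cache = functools.cache
else:
    cache = functools.lru_cache(maxsize=None)

@cache
def expensive_lookup(x: int) -> int:  # illustrative function
    return x * x
```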

- 14 Jun, 2024 1 commit

Kirthi Shankar Sivamani authored
* Apply formatting. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Apply formatting. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

- 03 Jan, 2024 1 commit

Przemyslaw Tredak authored
Signed-off-by: Przemek Tredak <ptredak@nvidia.com>

- 07 Dec, 2023 1 commit

cyanguwa authored
* Integrate cuDNN frontend v1 into fused attention, plus miscellaneous fixes. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix lint. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix JAX/Paddle for unit tests. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix JAX/PyTorch lint. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Simplify stride generation. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix and/or logic in get_backend. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix flag_max512 and test_numerics. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Remove v.contiguous() since get_qkv_layout covers it. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Skip FP8 tests for sm89. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Further fix JAX CI. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix JAX CI. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Revert mask type to a comma-separated list. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix lint. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix last two commits. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Integrate v1/pre-release-5. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Clean up pre-release-5 integration and fix FA 2.1 commit. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Force dropout to 0 if not training. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix JAX CI. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Test bias/ALiBi and padding+causal; add ALiBi to unfused DPA. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Set flag_arb to false when non-determinism is not allowed. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Follow up on previous commit; remove redundant Python env var setting. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* WIP: minor tweaks for tests. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Prepare for tests. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix determinism logic for fused attention. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Add bias to bwd. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix gpt_checkpointing/dpa_accuracy problem. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix some segfault issues. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Add failure notes. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Remove use of the non-determinism var for backend selection. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Minor fix for lint and CI. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix workspace size in bwd and uncomment bias test. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix get_alibi and remove check_support. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Update tests status. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Remove workspace_opt from FADescriptor_v1. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Disable arbitrary backend + post-scale bias in JAX; waiting on PR 525. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Clean up bhsd order. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Swap bias/rng_state order in aux_ctx_tensor and add bias to aux_ctx_tensor in the _qkvpacked/_kvpacked API. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Remove support for padding_causal + cross for max512. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Change ALiBi bias to float32 for bias_1_4/5 tests. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Further clean up tests. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix thd fwd output shape for FlashAttention and add backend info for DPA. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix definition of the workspace limit when dbias is present. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Further tweak the DP_WORKSPACE_LIMIT definition. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Disallow ALiBi+no_mask for SDPA flash and update ALiBi tests. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Update JAX/Paddle after PR 525 and fix DP_WORKSPACE_LIMIT for dbias JAX tests. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Disable dbias for non-Hopper archs. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix layernorm lint. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Remove unused arg for lint. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Remove build dir in setup.py. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Change selection logic to prefer fused attention on sm90. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix distributed JAX test. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix h and s order in header. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Update to the cuDNN FE v1 branch. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Remove manual setting of the workopt path due to dbias after the v1 update. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix Paddle CI. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Add post_scale_bias and ALiBi to the SDPA flash support matrix. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix support matrix in header files. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Move headers back to .cu and change seed/offset to int64. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Update the Megatron commit in the L1 test and remove all prints in the fused attention test. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix L1 Megatron test. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Fix fp8 arg in L1 Megatron script. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Print only when the debug flag is on. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
* Remove checkpoint loading to avoid loading other tests' results. Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
---------
Signed-off-by: Charlene Yang <8636796+cyanguwa@users.noreply.github.com>
Signed-off-by: cyanguwa <8636796+cyanguwa@users.noreply.github.com>
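
A sketch of exercising the fused-attention path this PR integrates. The `NVTE_FUSED_ATTN` / `NVTE_FLASH_ATTN` environment variables toggle backend availability and should be set before TE selects a backend; shapes below use the default `sbhd` layout:

```python
import os
os.environ["NVTE_FUSED_ATTN"] = "1"   # allow cuDNN fused attention
os.environ["NVTE_FLASH_ATTN"] = "0"   # disable flash-attention for comparison

import torch
import transformer_engine.pytorch as te

attn = te.DotProductAttention(num_attention_heads=16, kv_channels=64)
# Default qkv_format is "sbhd": [seq, batch, heads, head_dim].
q = torch.randn(128, 2, 16, 64, dtype=torch.bfloat16, device="cuda")
k = torch.randn_like(q)
v = torch.randn_like(q)
out = attn(q, k, v)
```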

- 07 Sep, 2023 1 commit

Kirthi Shankar Sivamani authored
* Initial setup. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix test file. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fix commit. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Test script. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fixes. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Fixes. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Add logs. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Add perf summary. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Reviews and improvements. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Generalize GPU count. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Add plots. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Better plot. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Get default file name with time. Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>