- 30 Sep, 2025 1 commit
-
-
Charlene Yang authored
* debug existing usage Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fp8_dpa Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * reimplement fp8_dpa Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * more clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update FE develop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * redesign CS; need cleanup Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up s/dP quantizers Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * return dP to DS Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * improve quantizer_helper; tweak dP DS/CS logic Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * debug CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FE commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up non-CP; debug dq/dk mismatches Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor success with CP; need to remove debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove debug info Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * disable fp8 output for fp8_mha + CS Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add output_tensor_type to FADescriptor Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes for CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove print Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * more fixes for non-CP and CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable non-determinism for blackwell Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix indent; remove print Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * switch from create_tensor_from_data to make_like Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable a2a+p2p for CS CP and require additional cp_group_global Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * condense tests; only create dist groups once Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * consolidate CP P2P per-tile calls for fwd/bwd and fused/flash Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix flash-attn from last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes for previous commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix attn_mask_type in f16 causal Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert bb6a0a59 temporarily Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * reenable comparison for some tensors in CP tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix dbias for fused attn CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clean up prints/comments and add back NVTE_CS_dP_SCALE Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * first attempt at mixed DS/CS reduction Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fix for last commit for mixed DS/CS reduction Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove prints from 69639024 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix DS recipe for dP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add NVTE_DPA_FORCE_DS to force DS for all DPA tensors, not just dP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix NVTE_DPA_FORCE_DS and add NVTE_PRINT Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * modify DS recipe for MLPerf Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * reduce only over TP group; need to think about CP group later Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * streamline fake_recipe/quantizer generation; allow NVTE_DPA_Fixed_Scales or DS-update S/dP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add more print: NVTE_LAYER_NUMBER Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * split S/dP in env vars: NVTE_DPA_Fix_S_Scale and NVTE_DPA_Fix_dP_Scale Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix autocast_key for DS Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add NVTE_REPEAT_in_F16 to repeat FP8 fwd/bwd passes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add FP8 CS to UnfusedDPA; unsuccessful; does not affect other backends Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * temporary: print min/max and save tensors for debugging Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * emulate q/dq+bf16 with NVTE_Emulate_in_F16; add NVTE_DPA_FORCE_MXFP8 for MXFP8 q/dq Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add RHT to BMM1 with NVTE_RHT_BMM1 for the size Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * re-enable fused attn in dpa_fp8_vs_f16 test; changed during unfused attn implementation Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add NVTE_FP8_CS_POWER_OF_2, NVTE_DPA_FORCE_BLOCKFP8, NVTE_Emulate_QDQ_QKV, NVTE_Emulate_QDQ_O, NVTE_Emulate_QDQ_dO Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add F16 O support for FP8 kernels Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert to TE FE commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * return to FE develop Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * tidy up; untested Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fix for last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes and improvements for last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * more minor fixes and improvements Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * more small fixes/improvements; mostly for CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix CS/DS recipe switch in DPA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * avoid quantizing/saving of O when CS bwd uses F16 O Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * move fp8_autocast(fp8_recipe) print to utils.py Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add debug logging to unit tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add back prints of quantizers/layer_number for debugging Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * enable amax reduction for both CS and DS tensors Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix NVTE_FP8_DPA_BWD=0 for CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix last commit for F16 fwd/bwd a2a+p2p Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * small fixes for float8_current_scaling(), nominal types, and unruly d_out types Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fp8_output in MHA and some CP tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fixes to CP tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes for CP A2A Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * clamp input data in tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove rmse and tighten atol/rtol for tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * restructure fp8_recipes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix linter Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * Revert "remove rmse and tighten atol/rtol for tests" This reverts commit 15dba6a59a5323d414f02cf22f099cb00d880532. Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * more fixes for linter Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fp8 recipe changes for F16 code path Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert to FE on main to help with merges Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * switch back to FE develop after merge Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update FE develop commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix last merge Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert to GitHub FE 1.14.1 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update FE to its latest main Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix for A2A Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix last commit for A2A DS Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove memset for BSHD/SBHD FP8 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove concat for qkv quantization in CS Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * improve/simplify the logic for last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add nominal_type for UnfusedDPA FP8 EmuFunc Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: update env vars for DPA recipes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix typo in last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix DS recipe creation for NVFP4 global recipe Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace python max with torch.maximum Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix linter Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix CP A2A for FA Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * reduce prints in print_quantizers Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add FP8 env vars to NVTE_DEBUG prints Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add reduce_amax to DS repr Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * separate fp8_dpa/fp8_mha in CP tests; fix A2A for them; add f16_O tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * address some reciews Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make data optional in create_hp_tensor_with_amax Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fix for comments in bwd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * print cudnn version in attn tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * disable CS for Hopper Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * alternative tests to reduce CI time Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make NVTE_DPA_FP8CS_O_in_F16 default to 1 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove _fp8 variables to avoid confusion Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * return to requiring two cp_groups for a2a+p2p Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace NVTE_PRINT with NVTE_DEBUG/_LEVEL for quantizer prints Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * provide a basic set of tests for CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix the last merge with nvfp4 PR Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * disable for Hopper Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fp8 backend selection for Hopper Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * reduce CP CI to essential tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fix to CP test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix recipe logic in tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * revert to concat for qkv quantization Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove cudnn version in qa scripts Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 29 Sep, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Add NVFP4 recipe Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Frank Sun <frsun@nvidia.com> Co-authored-by:
Oleg Goncharov <ogoncharov@nvidia.com> Co-authored-by:
Zhongbo Zhu <zhongboz@nvidia.com> Co-authored-by:
Evgeny Tsykunov <etsykunov@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Teddy Do <tdophung@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add MathDx dependency to GitHub builds Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Suggestions from GitHub Copilot Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Move 2x shape logic from core to PyTorch Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix compilation errors with CUDA 12.1 Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * SM 70 is not supported in CUDA 13 Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Typo Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> * Revert "Move 2x shape logic from core to PyTorch" This reverts commit f8b2a2d0111d9af690b43bb98ae448d9a430a185. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Added dequantize kernel for FP4 Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix linter warning Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add NVFP4 support with fusible ops Use logical tensor dims for PyTorch NVFP4 tensors. Temporarily add unfused dequantize impl. Fix bug where NVFP4 recipe was not configurable. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix logic for 2x shapes and move to PyTorch Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix CG test model config Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Debug NVFP4 tensor size function Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Proper handling of the RNG state Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Test SR properly Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix workspace size for GEMM heuristic. Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix compile error in C++ NVFP4 test Some some numeric errors when blocks are all zero. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * fix distrbuted test problem shape Signed-off-by:
zhongboz <zhongboz@nvidia.com> * proper assert dim for low precision AG TP Signed-off-by:
zhongboz <zhongboz@nvidia.com> * clean up duplicated code in nvfp4_utils.cuh Signed-off-by:
zhongboz <zhongboz@nvidia.com> * lint Signed-off-by:
zhongboz <zhongboz@nvidia.com> * pylint: disable=unused-argument Signed-off-by:
zhongboz <zhongboz@nvidia.com> * `nvte_cublas_gemm_v2` to take alpha pointer (#12) * make nvte_cublas_gemm_v2 to take alpha/beta pointers Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * users are expected to pass a valid C_tensor Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * typos Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * API to have const float* alpha Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> * Minor tweaks Support arbitrary beta scales. Increase workspace to be aligned to 128 bytes. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug IMA with alpha pointer Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Support fused amax kernels with NVFP4 quantization Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Disable fused amax with cuDNN LayerNorm kernel Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add NVFP4 cases to distributed tests for TE ops Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Change assert to NVTE_CHECK in the hadamard cast fusion Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix compile error Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use global thread IDs for Philox subsequences Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add shape checks for NVFP4 cast kernels Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Do not fuse amax if cuDNN normalization is forced by envvar Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
zhongboz <zhongboz@nvidia.com> Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Frank Sun <frsun@nvidia.com> Co-authored-by:
Oleg Goncharov <ogoncharov@nvidia.com> Co-authored-by:
Zhongbo Zhu <zhongboz@nvidia.com> Co-authored-by:
Evgeny Tsykunov <etsykunov@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Teddy Do <tdophung@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Przemek Tredak <ptredak@nvidia.com> Co-authored-by:
Phuong Nguyen <phuonguyen@nvidia.com>
-
- 07 Aug, 2025 1 commit
-
-
ldl authored
[PyTorch] Fix numeric overflow caused by int-type parameters and return value in the roundup function (#2034) Signed-off-by:
lvdunlin <lvdunlin@xiaomi.com> Co-authored-by:
lvdunlin <lvdunlin@xiaomi.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 29 Jul, 2025 2 commits
-
-
Jan Bielak authored
* Compute amax in activation kernels when the output pointer is provided, even for non-fp8 outputs Signed-off-by:
Jan Bielak <jbielak@nvidia.com> (cherry picked from commit 9f13fe2fefc58cae93bc467d87d01ecf792a0381) * Initialize metatensor values Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * Fuse computation of amax into the activation kernel for fp8 current scaling Signed-off-by:
Jan Bielak <jbielak@nvidia.com> (cherry picked from commit 2b54327ac9c931a5340983a79e99de5caa0399dd) Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Zero out amax in `create_hp_tensor_with_amax` instead of relying on `Float8CurrentScalingQuantizer.__init__` to zero-initialize it Signed-off-by:
Jan Bielak <jbielak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Jan Bielak <jbielak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Tim Moon authored
* remove reciprocal op Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Refactor Quantizer::create_tensor function Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix bug when constructing FP8 tensor Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add quantize function to C++ quantizers Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Prototype function to coerce Python quantized tensors to match quantizer Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use quantizer class in tex.quantize Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add FP8 current scaling support for activation backward Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Disable quantized GEMM output with FP8 current scaling Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add coerce_tensor functions for MXFP8 and DSv3 Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Avoid quantizing empty tensors Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use consistent shapes for FP8 transposes Signed-off-by:
Tim Moon <tmoon@nvidia.com> * In attention impl, construct FP8 tensors with pre-initialized scale-invs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Initialize MXFP8 scales to zero Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Store copy of quantizer when creating quantized tensors Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make sure quantized tensors have private quantizer Avoid problems with in-place ops after quantizer usages are changed externally. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Rename "coerce_tensor" to "convert_and_update_tensor" Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Make sure CUDA context is available when launching NVRTC kernel Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Expose CUDA context creation function externally Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
zhongboz <zhongboz@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 10 Jul, 2025 1 commit
-
-
Autumn1998 authored
* add router fusion Signed-off-by:
tongliu <tongliu@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix ci Signed-off-by:
tongliu <tongliu@nvidia.com> * fix ci with cuda 12.3 Signed-off-by:
tongliu <tongliu@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Review suggestions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix CI sm89/80 Signed-off-by:
tongliu <tongliu@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
tongliu <tongliu@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
tongliu <tongliu@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xin Yao <xiny@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
- 09 Jul, 2025 1 commit
-
-
Zhongbo Zhu authored
* functional passed Signed-off-by:
zhongboz <zhongboz@nvidia.com> * before zero padding in mxfp8 swizzle, use torch zeros to malloc for now Signed-off-by:
zhongboz <zhongboz@nvidia.com> * format Signed-off-by:
zhongboz <zhongboz@nvidia.com> * lint Signed-off-by:
zhongboz <zhongboz@nvidia.com> --------- Signed-off-by:
zhongboz <zhongboz@nvidia.com>
-
- 26 Jun, 2025 1 commit
-
-
Zhongbo Zhu authored
* finish python ref impl for bulk alloc Signed-off-by:
zhongboz <zhongboz@nvidia.com> * c++ bulk alloc worked, still draft version Signed-off-by:
zhongboz <zhongboz@nvidia.com> * clean up Signed-off-by:
zhongboz <zhongboz@nvidia.com> * resolve rebase conflict Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add license Signed-off-by:
zhongboz <zhongboz@nvidia.com> * use shared_ptr to auto manage reference count Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * attempt to fix misc training error Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * attempt to handle case where experts get zero token Signed-off-by:
zhongboz <zhongboz@nvidia.com> * updated with fused C++ function calls Signed-off-by:
zhongboz <zhongboz@nvidia.com> * clean up Signed-off-by:
zhongboz <zhongboz@nvidia.com> * experiment with reducing py object construction time Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix seg fault bug in inference mode Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix lint Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fuse torch split into bulk alloc Signed-off-by:
zhongboz <zhongboz@nvidia.com> * clean up Signed-off-by:
zhongboz <zhongboz@nvidia.com> * rebase to latest main Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix unit test failure Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix lint error Signed-off-by:
zhongboz <zhongboz@nvidia.com> * refactor create_tensor to use get_scale_shape Signed-off-by:
zhongboz <zhongboz@nvidia.com> * refactor quantize to call quantize_cpp Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Implement separate functions for multi-tensor quantize and split + multi-tensor quantize Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update grouped linear module with fused split+quantize func Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Move multi-tensor quantize func to cast.cpp Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Do not expose quantizer helper function externally Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter warnings Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Revert cuDNN frontend commit Signed-off-by:
Tim Moon <tmoon@nvidia.com> * fix corner cases with zero tokens Signed-off-by:
zhongboz <zhongboz@nvidia.com> * add comments Signed-off-by:
zhongboz <zhongboz@nvidia.com> --------- Signed-off-by:
zhongboz <zhongboz@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
- 13 Jun, 2025 1 commit
-
-
Oleg Goncharov authored
* Added support of FP4 data type Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Refactoring to BitsNum in progress Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Fixed compilation errors. All C++ tests passed Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Fixed a typo Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Added FP4 guard to TMA tensor descriptor data type Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fixed errors in JAX C++ extensions Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removed dummy NVFP4 C++ test file Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Make pytorch changes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Refactored the code per the review notes. Fixed JAX build error. Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Removed unnecessary static casts Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> * Typo fix Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> * Pass correct num bits to create_2D_tensor_map; fixes CI Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * inline funcs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Oleg Goncharov <ogoncharov@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 06 Jun, 2025 2 commits
-
-
Phuong Nguyen authored
* refactor the multi_stream utils + implement nvte_multi_tensor_quantize in TE/Common * implement GroupedQuantizer and grouped_quantize in jaxx * fix logical_axes_names for transpose tensor in ScaledTensor Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Hua Huang <huah@nvidia.com> Co-authored-by:
Ming Huang <mingh@nvidia.com>
-
Zhongbo Zhu authored
[PyTorch] FP8 Subchannel Recipe With FP8 Gather And Configurable Scaling Factor Tensor Swizzling (#1707) * functional kernel for columnwise + no-transpose option, still hacky Signed-off-by:
zhongboz <zhongboz@nvidia.com> * pass all quantizer unit tests Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * refactor, add gemm ready api Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * make format options private members, simplify api Signed-off-by:
zhongboz <zhongboz@nvidia.com> * swizzle scales right before gemm Signed-off-by:
zhongboz <zhongboz@nvidia.com> * bug fix of single layer test Signed-off-by:
zhongboz <zhongboz@nvidia.com> * attempt to fix lint issue Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fp8 gather pass, need minor refine Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix return_layernorm_output_gathered case Signed-off-by:
zhongboz <zhongboz@nvidia.com> * remove special cases, add sanity check before gemm Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix lint Signed-off-by:
zhongboz <zhongboz@nvidia.com> * lint ungrouped imports Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Implement dequantize for compact 1D blocks. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * add more unit test with dequantize compact supported Signed-off-by:
zhongboz <zhongboz@nvidia.com> * lint again Signed-off-by:
zhongboz <zhongboz@nvidia.com> * make ag for subchannel respect async Signed-off-by:
zhongboz <zhongboz@nvidia.com> * zero tolerance in distributed test Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix zero tolerance test Signed-off-by:
zhongboz <zhongboz@nvidia.com> * resolve rebase issues Signed-off-by:
zhongboz <zhongboz@nvidia.com> * lint & format Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix lint Signed-off-by:
zhongboz <zhongboz@nvidia.com> * clean up Signed-off-by:
zhongboz <zhongboz@nvidia.com> * bug fix Signed-off-by:
zhongboz <zhongboz@nvidia.com> * relax rtol for fp32 distributed test Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix some ci issue Signed-off-by:
zhongboz <zhongboz@nvidia.com> * fix ci test failure in debug mode Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Force row-wise and column-wise data to have same data format Prototype "all-gather usage" in quantizer. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Remove dead logic for high-precision AGs Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Debug FP8 block-wise tensor tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug distributed test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Handle case where LayerNormLinear returns gathered norm output Signed-off-by:
Tim Moon <tmoon@nvidia.com> * fix debug mode Signed-off-by:
zhongboz <zhongboz@nvidia.com> --------- Signed-off-by:
zhongboz <zhongboz@nvidia.com> Signed-off-by:
Keith Wyss <kwyss@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Keith Wyss <kwyss@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 29 May, 2025 1 commit
-
-
Przemyslaw Tredak authored
* Changed the Tensor allocation strategy Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fixes Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Disable debug flag Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix the double free error Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fixed pyTorch recipe extension Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Hide TensorAllocator and fix the usage in LayerNorm Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Cleaning Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix permutation Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 11 May, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* First pass refactor Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * first pass Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * core compiles Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Include cuda dirs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Compiles Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Move grad outside autocast Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix kv cache Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Address review comments Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Change src file name in cmake Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * move the kernels too Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Move comment Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Move comments around Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * more movement Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * move Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 05 May, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
* Move multi tensors kernels from PyTorch extensions to core Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Add int16 type to core (for storing fp32 param remainders) Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix core build Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * same fix to scale Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix perf, memory, vars Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Re-add device guard for multi-device Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix junk output dtype for non-per tensor Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes for test and upgrade mcore version Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix core tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 10 Apr, 2025 1 commit
-
-
kwyss-nvidia authored
* Add GEMM logic for blockwise quantized tensors. GEMM test cases included in pytorch integration. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update NVTE_BLOCK_SCALING for GEMM. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Gate feature on CUDA 12.9 Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Gemm typo. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Remove unecessary type converter change. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Reflect epilogue availability and test supported epilogues. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * GEMM simplifications from recipe branch. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Format py code. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update GEMM DGelu tests to match support depending on output dtype. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Force pow2Scales in GEMM Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Add GEMM test to pytorch test suite. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Add copyright to GEMM test. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update import for GEMM test. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Add license. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update test gemm supported predicate. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Use sgemm like interfaces and naming. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Rewrite GEMM comment. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * MR Feedback. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Recipe setup for Linear modules. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Use 12.9 feature test. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Run against tensor dumps from internal library. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update FIXME to TODO with linked issue. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update full recompute feature to save recipe. The recompute context uses the same recipe and fp8 settings as the original fwd pass. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * MR Feedback. Avoid reusing quantizer objects. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update logic in module. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Format py. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update for PP bug. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update test numerics. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update force_power_of_2 scales in the recipe. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update usage method to satisfy upstream changes. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * fix subchannel recipe in distributed test with bf16 gather Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Edit and cleanup BF16 gather code. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update test import. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * support columnwise only mode to 1D quantize kernel Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Format and move enum Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Skip alloc. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * try async bf16 gather Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Format python code. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Document and type code. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update pytorch lint errors. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Dont set high precision dtype. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Add test for sanity and CG; fix CG for sequential? Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Keep make_quantizers API stable Update num_quantizers instead to pass cuda_graph tests. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Fix import name. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Rename recipe method. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Skip grouped linear sanity test. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Set usage before BF16 gather. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * refactor for nvte_quantize_v2 Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Format code. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Cleanup nvte_quantize_v2 Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Test fp32 scales. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Disable CUDA graph. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Simplify layernorm linear Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Cleanup layernorm linear. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * LayerNorm linear bwd gather logic. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Communication updates. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update transformer_engine/pytorch/ops/op.py Apply MR comment change. Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
kwyss-nvidia <kwyss@nvidia.com> * Lint fix. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * MR feedback. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Enable cuda graph tests. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Reduce chance of spurious failure and reword. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Review suggestions from @timmoon10 Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update CPP tests. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update common.h Signed-off-by:
Xin Yao <yaox12@outlook.com> * Update test_float8blockwisetensor.py Signed-off-by:
Xin Yao <yaox12@outlook.com> --------- Signed-off-by:
Keith Wyss <kwyss@nvidia.com> Signed-off-by:
zhongboz <zhongboz@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
kwyss-nvidia <kwyss@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Xin Yao <yaox12@outlook.com> Co-authored-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Xin Yao <yaox12@outlook.com>
-
- 04 Apr, 2025 1 commit
-
-
kwyss-nvidia authored
* Blockwise float8 quantizer and quantized tensor class. The classes are configurable for 128x128 blocksize and 1x128 blocksize via setting block_scaling_dim == 2,1 respectively. Scale tensors are stored in a format emenable for matrix multiplication, however the integration of matmul is deferred as a separate story. Fusions of quantization and DBIAS or activation functions are not yet implemented, and the dequantization is currently implemented in torch. Tests for quantization are included in C++ and pytorch layers, with exact comparison to reference quantizer behavior as well as an attempt to hit interesting branches through the API such as tensor creation in pytorch and CPP and dequantization of row and columnwise usage. Two CUDA kernels for quantization are included, and are direct ports of equivalents in the kitchen repository, where a subchannel recipe has been used for end to end training. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Apply linting changes. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Alignment for 1D scaling for GEMM edge case. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * MR feedback. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Change API name. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Fix merge conflict with name change. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Use common tensor map API. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Change API to use two scaling mode enums. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Fix typo. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update some call sites. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Tests for torch tensor API surface. Since the quantized tensor is a tensor subclass, these tests exercise torch hooks. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Reuse scale calculation between quantizer refs. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Save memory by dropping reference to saved tensors. Issues previously observed are solved. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Remove constexpr parameters from kernel. Code size is reduced with fewer constexpr params. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Merge conflict from rebase. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Add shape implementations for block scaling. nvte_shape was added upstream. Logic added for block scaled fp8. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Move benchmark to te_playground Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Remove amax_epsilon and pow_2_scales from tensor. Hardcodes the default values. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Lint changes. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Fixup MR changes that broke. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Safer ifdef in kernel. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Documentation prose. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Reuse compute_scale function from Current Scaling. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Bugfix on inf_value scale refactor. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Remove qopt calls from test. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update pytest list. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Add copyright to reference scale calc. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Use ptx.cuh functions instead of cde. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update shape logic with allocation and reuse shape. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Usage defaults MR feedback. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Copyright and header guard. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Updating torch dispatch code. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Fix exception type. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Use TypeInfo Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * MR feedback. Signed-off-by:
Keith Wyss <kwyss@nvidia.com> * Update CS scale update test to use updated ref impl Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update JAX scaling mode enum Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Skip tests on Lovelace Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Keith Wyss <kwyss@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 31 Mar, 2025 1 commit
-
-
Tim Moon authored
* Handle case where FP8 current scaling quantizer gets default process group Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Fix linter warning Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid canonicalizing TP group since it may not be initialized Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 18 Mar, 2025 1 commit
-
-
Charlene Yang authored
* add paged attention; test_kv_cache_accuray and test_paged_attn pass Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove unnecessary change from last commit Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * test_fused_attn pass Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove unnecessary import in test_numerics Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add license for test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add to L0 test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update license for test_paged_attn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update kv_cache_manager license Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix build issue from previous merge Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: minor fix/preparation for inference/cuda graph Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: non-paged Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: non-paged, bshd/sbhd Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: non-paged, thd, no CG Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: non-paged, thd, CG Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: non-paged, CG Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: non-paged, using paged kernel Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: restructure kernels Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: paged, CG Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: padding + BRCM Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: restructure IP, clean up Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: fix non-CG, fused Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: fix last commit Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: unfused, non-CG Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: flash-attn, non-CG Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: flash_attn_with_kvcache Signed-off-by:
Charlene Yang <charleney@nvidia.com> * commit two files missed by bcef6b34 Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: thd_bshd_bshd Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: fix last commit Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: fix 1c31b68d Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: add bshd_2sbhd, sbhd_2bshd Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: some cleanup Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: all qkv_format combinations and merge CM files Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: some lint fixes Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: add docstring for IP Signed-off-by:
Charlene Yang <charleney@nvidia.com> * fix sequences_pre Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: minor fixes for multi-layer Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: initial multi-layer test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: minor clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: switch to flash_attn_varlen_func Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: fix unfused for separate q/kv format Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: fix fused for separate q/kv formats Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: flash attn + TELayer + 2 layers Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: unfused + TL + 2layers Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: all modules/backend Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: minor cleanup Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: FlashAttention on Hopper with 2.7.3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: FlashAttention + v3 from 39e7179 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: FlashAttention + v3 + FP8 + WIP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: add backend support table Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: clean up Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: separate use_flash_attention_2 and _3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * WIP: tweaks to paged attn script Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * WIP: enable/disable certain cases for fused attn Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: small fixes for lint and cg Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: minor fixes for attn/infer Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: fix CP Signed-off-by:
Charlene Yang <charleney@nvidia.com> * WIP: readd page info to FADescriptor_v1 Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor tweak to test_numerics.py Signed-off-by:
Charlene Yang <charleney@nvidia.com> * fix 9.5/9.7 sq/skv + mask logic Signed-off-by:
Charlene Yang <charleney@nvidia.com> * clean up Signed-off-by:
Charlene Yang <charleney@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor fix for FA3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * more minor fixes for FA3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * test page_size=1 for FA3 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix t3hd/th3d strides Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix ckpt recompute and fa3 k_scale Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * raise dynamo recompile limit for test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove thunder test from L0 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix FA selection logic Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix FA3 q_descale shape Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove page_table from IP.step() returns Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix FP8 FlashAttn DPA fp8_dpa tests Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * minor tweaks Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FA3 note and L3 test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove redundant import in test Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * adopt new FA3 APIs from FA2.7.3+/hopper for CP and non-CP Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix lint Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * relax tols for TransformerLayers Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fix merge Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix merge 2 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix FA import comments Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * relax tols for Ampere Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix fa3 version and reduce messaging Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * update FA3 to its latest commit on main Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add default values to IP and assertion to graph.py Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add more comments in attention Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * use custom_cache_manager instead of cache_manager Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
Charlene Yang <charleney@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 08 Mar, 2025 1 commit
-
-
Zhongbo Zhu authored
* check in per-tensor current scaling full recipe Signed-off-by:
zhongboz <zhongboz@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
zhongboz <zhongboz@nvidia.com> setup basics of current scaling quantizer in python level Signed-off-by:
zhongboz <zhongboz@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
zhongboz <zhongboz@nvidia.com> add test case for current scaling dequantize Signed-off-by:
zhongboz <zhongboz@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
zhongboz <zhongboz@nvidia.com> finish linear layer fwd bwd test, determined error with bf16 Signed-off-by:
zhongboz <zhongboz@nvidia.com> [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci Signed-off-by:
zhongboz <zhongboz@nvidia.com> achieved zero tolerance for Linear by specify gemm use_split_accumulator config Signed-off-by:
zhongboz <zhongboz@nvidia.com> enable layernormlinear with current scaling, pass bitwise test Signed-off-by:
zhongboz <zhongboz@nvidia.com> refactor test case code Signed-off-by:
zhongboz <zhongboz@nvidia.com> make current scaling quantizers distrbuted, pass distributed linear&layernormlinear tests Signed-off-by:
zhongboz <zhongboz@nvidia.com> bug fix: use cached fp8 recipe in backward Signed-off-by:
zhongboz <zhongboz@nvidia.com> fix layernorm_mlp with current scaling, fix activation_helper with current scaling Signed-off-by:
zhongboz <zhongboz@nvidia.com> support detailed numerical settings from recipe to quantization kernel Signed-off-by:
zhongboz <zhongboz@nvidia.com> resolving MR comments Signed-off-by:
zhongboz <zhongboz@nvidia.com> recipe naming Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolve mr comments, remove IS_CURRENT_SCALING template from kernels Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolve mr comments, make current scaling c++ test cases Signed-off-by:
zhongboz <zhongboz@nvidia.com> * add current scaling to test_numerics.py, skip act recomp and grouped linear Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add benchmark for quantizer Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * add benchmarks for linear layer Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * bug fix, typo Signed-off-by:
zhongboz <zhongboz@nvidia.com> * resolve more mr comments Signed-off-by:
zhongboz <zhongboz@nvidia.com> * avoid potential race condition by not using from_blob to construct amax tensor in C++ Signed-off-by:
zhongboz <zhongboz@nvidia.com> * resolve more comments Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Debug linter warnings and license check Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Debug import error in FP8 tensor test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug compilation error with CUDA 12.1 for Turing Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * resolve mr comments, fix activation cast fusion Signed-off-by:
zhongboz <zhongboz@nvidia.com> * resolve comments, add NVTEQuantizationParams for compute scale Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * remove is_current_scaling check totally from common folder Signed-off-by:
zhongboz <zhongboz@nvidia.com> * remove benchmarks, will contribute in another repo Signed-off-by:
zhongboz <zhongboz@nvidia.com> * adjust cs default recipe config Signed-off-by:
zhongboz <zhongboz@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * adjust comments in test Signed-off-by:
zhongboz <zhongboz@nvidia.com> * Remove current scaling mode from core lib Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Refactor current-scaling-specific logic in core C++ lib Move amax and scale update functions out of casting functions, and put into dedicated current-scaling source file. Add general API for accessing quantization config object. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * Add missing header in C++ tests Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Disable test config with FP8 transpose on Blackwell Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix compilation error in C++ test Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
zhongboz <zhongboz@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
zhongboz <zhongboz@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <tmoon@nvidia.com>
-
- 07 Feb, 2025 1 commit
-
-
Przemek Tredak authored
Signed-off-by:Przemek Tredak <ptredak@nvidia.com>
-
- 02 Jan, 2025 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by:Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 06 Dec, 2024 1 commit
-
-
Phuong Nguyen authored
* cuDNN normalization integration * TE Norm refactor * TE Norm APIs changes. --------- Signed-off-by:
Phuong Nguyen <phuonguyen@nvidia.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 29 Oct, 2024 1 commit
-
-
Alp Dener authored
* moved userbuffers code to TE/common Signed-off-by:
Alp Dener <adener@nvidia.com> * moved comm+GEMM overlap code to TE/common Signed-off-by:
Alp Dener <adener@nvidia.com> * removed PyTorch depdency from comm+GEMM overlap in TE/common Signed-off-by:
Alp Dener <adener@nvidia.com> * added TE/PyTorch wrappers for refactored comm+GEMM overlap code in TE/common Signed-off-by:
Alp Dener <adener@nvidia.com> * updated TE/PyTorch Python API to match the refactored comm+GEMM overlap code Signed-off-by:
Alp Dener <adener@nvidia.com> * updated unit tests to work with refactored comm+GEMM overlap code Signed-off-by:
Alp Dener <adener@nvidia.com> * added a pylint exception to comm+GEMM overlap test runner Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixing linting errors Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * added documentation for te.initialize_ub Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed compile errors when building with NVTE_UB_WITH_MPI=1 Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed default bootstrap backend Signed-off-by:
Alp Dener <adener@nvidia.com> * switched default bootstrap backend priority to MPI > Gloo > NCCL Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * updated bootstrap backend documentation Signed-off-by:
Alp Dener <adener@nvidia.com> * close UB bootstrap socket to avoid interfering with CUDA Multicast shareable file handle send/recv Signed-off-by:
Alp Dener <adener@nvidia.com> * added torch::Tensor wrappers for communication buffer and atomic counters so PyTorch can factor externally allocated memory into its garbage collection threshold Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * automated handling of world, local and node ranks/sizes within C++ CommOverlapHelper to simplify Python function signatures Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed incorrect read of environment variables Signed-off-by:
Alp Dener <adener@nvidia.com> * corrected priority for _SOCKET_IFNAME environment variables in UB bootstrapping Signed-off-by:
Alp Dener <adener@nvidia.com> * moved multicast support check to cuda_runtime.h and replaced cudaDeviceGetProp call with cached sm_count() Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * removed commented out old code and replaced external collective function type defines with aliases Signed-off-by:
Alp Dener <adener@nvidia.com> * compile-time CUDA version guard for CUDA Driver Multicast attribute Signed-off-by:
Alp Dener <adener@nvidia.com> * added compile-time CUDA version guards to Multicast code in Userbuffers Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * condensed UB docs, corrected const violations Signed-off-by:
Alp Dener <adener@nvidia.com> * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * fixed autodoc rst for UB calls, added CUDA version guard on Multicast UB kernels Signed-off-by:
Alp Dener <adener@nvidia.com> * fixed incorrect UB type reporting for P2P overlaps, comment reformatting Signed-off-by:
Alp Dener <adener@nvidia.com> * add docstring to tex.ubuf_built_with_mpi() Signed-off-by:
Alp Dener <adener@nvidia.com> --------- Signed-off-by:
Alp Dener <adener@nvidia.com> Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 05 Sep, 2024 1 commit
-
-
Ruibin Cheung authored
* [TE/PyTorch][MoE] Add FP8 padding and unpadding module 1. Add multi-tensor padding kernel for FP8 with padding size = 16. 2. Add FP8Padding and Fp8Unpadding module 3. Add Padded GroupedLinear unit tests --------- Signed-off-by:
beinggod <zhangruibin@01.ai> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-
- 22 Aug, 2024 1 commit
-
-
NVJiangShao authored
* Add permutation functions * Add permutation ops * Remove the dependency on cutlass * Move permutation.py out of module dir * Rewrite the unit test and enable skipping if FP8 is unavailable * Rename exposed C++ API and reorder its parameters + take NVTETensor as inputs * Use Float8Tensor for FP8 input * Move dtype to ctx --------- Signed-off-by:
Jiang Shao <jiangs@nvidia.com> Co-authored-by:
Qi Zhang <qizhang@nvidia.com> Co-authored-by:
Phuong Nguyen <36155692+phu0ngng@users.noreply.github.com>
-
- 14 Jun, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
* Apply formatting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Apply formatting Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 10 Jun, 2024 1 commit
-
-
Tim Moon authored
* Avoid select operation in cast-transpose extension Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid select operation in cast-transpose-dbias extensions Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Avoid select op in LayerNorm and RMSNorm Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix linter errors Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com>
-
- 12 Apr, 2024 1 commit
-
-
Kirthi Shankar Sivamani authored
* FP8 cuda graphs Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Charlene Yang <charleney@nvidia.com> * Fix numerics Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * exclude torch compile from numerics tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * More numerics fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix CI Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * rm fusion from unfused path Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> Co-authored-by:
Charlene Yang <charleney@nvidia.com>
-
- 08 Feb, 2024 1 commit
-
-
Tim Moon authored
* Implement fused kernel for FP8 scale update Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Add fused kernel for amax and scale update Add unit test. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Replace paddle.fluid imports with paddle.base Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Move fused kernel to core library Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug test Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use FP8 update kernel in Paddle Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug FP8 scale update in Paddle Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix lint errors Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug Paddle test failures Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Make update kernel in-place for PyTorch Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Revert cudnn-frontend commit Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 25 Jan, 2024 1 commit
-
-
Xin Yao authored
* fused apply rope Signed-off-by:
Xin Yao <xiny@nvidia.com> * Apply suggestions from code review Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Signed-off-by:
Xin Yao <yaox12@outlook.com> * resolve comments Signed-off-by:
Xin Yao <xiny@nvidia.com> * make rotary_percent optional Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix ci Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix test Signed-off-by:
Xin Yao <xiny@nvidia.com> * add rope test to qa Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix linting Signed-off-by:
Xin Yao <xiny@nvidia.com> * sync apex: add transpose_output_memory Signed-off-by:
Xin Yao <xiny@nvidia.com> * small fix Signed-off-by:
Xin Yao <xiny@nvidia.com> * sync apex: fuse sin/cos Signed-off-by:
Xin Yao <xiny@nvidia.com> * sync apex: fused rope for thd format Signed-off-by:
Xin Yao <xiny@nvidia.com> * fix lint Signed-off-by:
Xin Yao <xiny@nvidia.com> * Fix license headers Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * add support for bshd format Signed-off-by:
Xin Yao <xiny@nvidia.com> * support different seq length Signed-off-by:
Xin Yao <xiny@nvidia.com> * update Signed-off-by:
Xin Yao <xiny@nvidia.com> * update copyright Signed-off-by:
Xin Yao <xiny@nvidia.com> * remove transpose_output_memory Signed-off-by:
Xin Yao <xiny@nvidia.com> * Make outputs contiguous in SBHD case Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> --------- Signed-off-by:
Xin Yao <xiny@nvidia.com> Signed-off-by:
Xin Yao <yaox12@outlook.com> Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Tim Moon <4406448+timmoon10@users.noreply.github.com> Co-authored-by:
Przemyslaw Tredak <ptredak@nvidia.com>
-
- 03 Jan, 2024 1 commit
-
-
Przemyslaw Tredak authored
Signed-off-by:Przemek Tredak <ptredak@nvidia.com>
-
- 24 Oct, 2023 1 commit
-
-
Tim Moon authored
* Do not include logging macros in installed C headers Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug logging macros Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Debug C++ tests Use Google style for header includes. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Update CUDA driver macros Incorporating changes from #389. Co-authored-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Jan Bielak <jbielak@nvidia.com> Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Use core error checking macros in PyTorch extensions Hack to get around macro redefinition warning. Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Fix missing arg when getting CUDA driver error string Signed-off-by:
Tim Moon <tmoon@nvidia.com> * Reuse logging header in frameworks Signed-off-by:
Tim Moon <tmoon@nvidia.com> --------- Signed-off-by:
Tim Moon <tmoon@nvidia.com> Co-authored-by:
Jan Bielak <jbielak@nvidia.com>
-
- 20 Oct, 2023 1 commit
-
-
Przemyslaw Tredak authored
* Ability to check cuDNN version from Python Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Modify the fused attention test to not use the CUDNN_VERSION env variable which is specific to NGC containers Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com>
-
- 27 Jul, 2023 1 commit
-
-
Przemyslaw Tredak authored
* Exposing RMSNorm in pyTorch extensions Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * First pass at the Python API Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Small fixes Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Added numerics tests and fixed issues Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Lint fixes Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Added RMSNorm to LayerNormMLP Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Added ONNX export and tests for RMSNorm Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix python lint Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix BERT case Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Added normalization option to the TransformerLayer Added tests Fixed test failures Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * Fix documentation Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix kwarg bug Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix IMA and invalid type error Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Increase RMSNorm threshold for bf16 case Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix ONNX tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 22 Jun, 2023 1 commit
-
-
cyanguwa authored
* add long sequence support and unify three backends for fused attention Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * update cudnn-frontend to v0.9.1 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace cpu_float2half_rn with __float2half_rn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix backend selection and NVTEDType Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * minor fixes Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix ci Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * make cudnn plan caches thread_local Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix CI Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * replace cuDNN throw with NVTE_CHECK Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix replacement of cuDNN throw with NVTE_CHECK Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * force dropout probablity to 0 in inference mode Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * change negInfinity to be consistent with m512 fused attn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove float2half conversion for scale_dropout Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add back runtime api for sm detection Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * add gemm3 to enums FP8Fwd/BwdTensors Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * change dropout from no to yes for fmha_v1 Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove output_rng_state in m512 kernels Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * fix elts_per_thread calculation in kvpacked fwd Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove dropout=0.0 restriction for m512 fused attn Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> * remove output_rng_state completely from m512 kernels Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> --------- Signed-off-by:
Charlene Yang <8636796+cyanguwa@users.noreply.github.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 13 Jun, 2023 1 commit
-
-
Przemyslaw Tredak authored
* Added ReLU and GLU variants to common Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * pyTorch changes Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> * PyTorch C++ lint Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Bug fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * More fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix storage errors Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Compute bgrad Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix numerical tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix ONNX export tests Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Review comments Co-authored-by:
Przemyslaw Tredak <ptrendx@gmail.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Przemek Tredak <ptredak@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 21 Apr, 2023 1 commit
-
-
cyanguwa authored
* Add FP8 fused attention to TE for PyTorch Signed-off-by:
Charlene Yang <charleney@nvidia.com> * add license for cudnn-frontend, modify installation requirements, and refactor some headers for aesthetics Signed-off-by:
Charlene Yang <charleney@nvidia.com> * add c api docs for fused attention Signed-off-by:
Charlene Yang <charleney@nvidia.com> * add exception for unsupported precision/sequence length combinations Signed-off-by:
Charlene Yang <charleney@nvidia.com> * fix installation requirement for non fused attn use cases Signed-off-by:
Charlene Yang <charleney@nvidia.com> * fix docs for fused-attn Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * prefix enums with NVTE_ and replace old MHA_Matrix with NVTE_QKV_Matrix Signed-off-by:
Charlene Yang <charleney@nvidia.com> * minor fixes based on PR comments Signed-off-by:
Charlene Yang <charleney@nvidia.com> * fix description for kvpacked fwd Signed-off-by:
Charlene Yang <charleney@nvidia.com> * fix description of Bias in C api Signed-off-by:
Charlene Yang <charleney@nvidia.com> * minor fixes for cudnn requirement and description for QKV tensors Signed-off-by:
Charlene Yang <charleney@nvidia.com> * fix QKV layout description and support matrix for C api Signed-off-by:
Charlene Yang <charleney@nvidia.com> * add asserts to cpp_extensions for qkv layout/bias type/attn mask type Signed-off-by:
Charlene Yang <charleney@nvidia.com> * fix typo precision Signed-off-by:
Charlene Yang <charleney@nvidia.com> --------- Signed-off-by:
Charlene Yang <charleney@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Co-authored-by:
Charlene Yang <charleney@nvidia.com> Co-authored-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 04 Apr, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
* Add FP8 support for Ada Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * better message Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * lint fixes Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Address review comments Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * better message for no fp8 Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * same thing for onnx test Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * fix Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> * Fix CI and review Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> --------- Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-
- 03 Feb, 2023 1 commit
-
-
vasunvidia authored
* Increase number of FP8 tensors per GEMM Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Enable FP8 output tensor for fp8_gemm Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * [BERT FP8] Initial TE review comments Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Temporary fix for cuda graph non convergence Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Address review comments-2 Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Review comments-3 Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Cleanup Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Change for New API Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Remove unnecessary clone for D_scale, D_amax Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Avoid Roll for AMAX history size = 1 Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Update onnx_te_gemm API Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> * Fix Lint errors Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com> --------- Signed-off-by:
Vasudevan Rengasamy <vrengasamy@nvidia.com>
-
- 20 Jan, 2023 1 commit
-
-
Kirthi Shankar Sivamani authored
Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com> Signed-off-by:
Kirthi Shankar Sivamani <ksivamani@nvidia.com>
-