- 20 Dec, 2025 1 commit
Zhongbo Zhu authored
* rowwise colwise RHT group quant v1
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* remove local array RW
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* change wait_barrier
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* fast math options
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* use mult to replace div
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* format
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* bulk move random states
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* greptile
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* lint
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* revert to use divides
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* avoid fp32 bf16 round-trip in RHT cast fusion
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* trigger fastmath by toggle NVTE_RHT_CAST_FUSION_USE_FAST_MATH
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* integrate row col rht fusion, functional
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* numerics aligned
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* style
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* remove device sync
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* 128 padding
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* revert colwise rng state creation because of row-col fused kernel
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* fix CI, linter
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* refactor RS for generating two random values
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* Avoid invalid configs with templated kernel
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* fix acc pipeline init with 0 arrival count
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* restore rowwise-only mode
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* switch to dynamic atomic scheduler
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* Avoid instantiating group RHT+cast kernel without row-wise or col-wise output
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Include fast math option in quantization config
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fix linter warnings and review nits
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Use TE license
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fix bug where kernel is always launched on stream
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Restore BF16 intermediate downcast in fused RHT-cast kernels
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* fix numerical test of grouped kernel
  Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* Make sure row-wise and col-wise quantization use different RNG seeds
  Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Restore autoformatter
  Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
---------
Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
- 30 Oct, 2025 1 commit
Oleg Goncharov authored
* Separated gated and dequantize kernels
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Separated quantize, dequantize and gated functions
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Fixed lint issues
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Fixed persistent lint issues
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Added missing compute capability 10.0 check for Quantize FP8 TMA kernels
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Fixed the issue which was added again by autofix
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Changed files description. Completely removed non-identity activations from the NVFP4 transpose test suite
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Removed unsupported template arguments in NVFP4 quantize
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Fixed undefined symbol error
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Fixed condition
  Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
* Fixed CUDA version check
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Changed arch conditions order
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Fix
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Clean up
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Small fix
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Small fix
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Fixes per the PR review
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Fix
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Split quantize helper into two (FWD and BWD) functions
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Moved activation functions from cast.cu. Removed cast.cu from the fast-math compilation list
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Enabled fast math for activations by default
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
* Disabled fast math for activations by default
  Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
---------
Signed-off-by: Oleg Goncharov <ogoncharov@nvidia.com>
Signed-off-by: Oleg Goncharov <64355998+Oleg-Goncharov@users.noreply.github.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>