[PyTorch][NVFP4][MOE] NVFP4 Grouped Quantize with Hadamard Transform (#2411)
* rowwise colwise RHT group quant v1 Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* remove local array RW Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* change wait_barrier Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* fast math options Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* use mult to replace div Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* format Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* bulk move random states Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* greptile Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* lint Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* revert to use divides Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* avoid fp32 bf16 round-trip in RHT cast fusion Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* trigger fastmath by toggle NVTE_RHT_CAST_FUSION_USE_FAST_MATH Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* integrate row col rht fusion, functional Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* numerics aligned Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* style Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* remove device sync Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* 128 padding Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* revert colwise rng state creation because of row-col fused kernel Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* fix CI, linter Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* refactor RS for generating two random values Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* Avoid invalid configs with templated kernel Signed-off-by: Tim Moon <tmoon@nvidia.com>
* fix acc pipeline init with 0 arrival count Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* restore rowwise-only mode Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* switch to dynamic atomic scheduler Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* Avoid instantiating group RHT+cast kernel without row-wise or col-wise output Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Include fast math option in quantization config Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fix linter warnings and review nits Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Use TE license Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Fix bug where kernel is always launched on stream Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks (for more information, see https://pre-commit.ci)
* Restore BF16 intermediate downcast in fused RHT-cast kernels Signed-off-by: Tim Moon <tmoon@nvidia.com>
* fix numerical test of grouped kernel Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
* Make sure row-wise and col-wise quantization use different RNG seeds Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
* Restore autoformatter Signed-off-by: Tim Moon <tmoon@nvidia.com>

---------

Signed-off-by: Zhongbo Zhu <zhongboz@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Signed-off-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
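For context on what the commit log above refers to: the PR fuses a randomized Hadamard transform (RHT) with grouped NVFP4 quantization, producing row-wise and col-wise outputs in one kernel (with distinct RNG seeds for each), and gates fast math behind the `NVTE_RHT_CAST_FUSION_USE_FAST_MATH` toggle. The sketch below is a minimal pure-PyTorch rendition of the underlying math only, not Transformer Engine's API; the function names and the dequantize-to-measure-error demo are illustrative assumptions, and the real implementation is a fused CUDA kernel.

```python
import torch

def hadamard(n: int) -> torch.Tensor:
    """Orthonormal n x n Hadamard matrix via Sylvester's construction."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of 2"
    h = torch.ones(1, 1)
    while h.shape[0] < n:
        h = torch.cat([torch.cat([h, h], 1), torch.cat([h, -h], 1)], 0)
    return h / n**0.5

# Non-negative magnitudes representable in FP4 E2M1.
FP4_LEVELS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def rht_fp4_fake_quantize(x: torch.Tensor, block: int = 16, seed: int = 0) -> torch.Tensor:
    """Illustrative sketch: randomized Hadamard transform over `block`-element
    groups along the last dim, then per-block FP4 quantization. Returns the
    fake-quantized tensor, still in the RHT domain."""
    assert x.shape[-1] % block == 0
    # Seeded random sign flips: the "randomized" part of the RHT. Per the
    # commit log, row-wise and col-wise quantization use different seeds.
    g = torch.Generator().manual_seed(seed)
    signs = (torch.randint(0, 2, (x.shape[-1],), generator=g) * 2 - 1).to(x.dtype)
    xb = (x * signs).reshape(*x.shape[:-1], -1, block)
    xt = xb @ hadamard(block).to(x.dtype)  # per-block transform
    # One scale per block, chosen so the block amax maps to 6.0, the
    # largest FP4 magnitude.
    scale = xt.abs().amax(dim=-1, keepdim=True) / FP4_LEVELS[-1]
    scale = torch.where(scale == 0, torch.ones_like(scale), scale)
    # Round each element to the nearest representable FP4 magnitude.
    idx = (xt.abs() / scale).unsqueeze(-1).sub(FP4_LEVELS).abs().argmin(dim=-1)
    return (FP4_LEVELS[idx] * xt.sign() * scale).reshape(x.shape)

x = torch.randn(4, 64)
q = rht_fp4_fake_quantize(x)  # still in the RHT domain
# Undo the (orthogonal) transform to measure end-to-end quantization error.
g = torch.Generator().manual_seed(0)
signs = (torch.randint(0, 2, (64,), generator=g) * 2 - 1).float()
x_rec = (q.reshape(4, -1, 16) @ hadamard(16).T).reshape(4, 64) * signs
print((x_rec - x).abs().mean())
```

Because the Hadamard matrix is orthogonal, applying the same transform to both GEMM operands leaves the product unchanged while spreading outliers across each block before the FP4 cast, which is what makes the per-block scales effective.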