- 19 Jun, 2025 1 commit
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
- 18 Jun, 2025 1 commit
wenjh authored
Signed-off-by: wenjh <wenjh@sugon.com>
-
- 12 Jun, 2025 1 commit
wenjh authored
Same intention as commit 3e38a2ea. This commit is to improve acc.
Signed-off-by: wenjh <wenjh@sugon.com>
-
- 06 Jun, 2025 1 commit
Zhongbo Zhu authored
[PyTorch] FP8 Subchannel Recipe With FP8 Gather And Configurable Scaling Factor Tensor Swizzling (#1707)
* functional kernel for columnwise + no-transpose option, still hacky Signed-off-by: zhongboz <zhongboz@nvidia.com>
* pass all quantizer unit tests Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* refactor, add gemm ready api Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* make format options private members, simplify api Signed-off-by: zhongboz <zhongboz@nvidia.com>
* swizzle scales right before gemm Signed-off-by: zhongboz <zhongboz@nvidia.com>
* bug fix of single layer test Signed-off-by: zhongboz <zhongboz@nvidia.com>
* attempt to fix lint issue Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fp8 gather pass, need minor refine Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* fix return_layernorm_output_gathered case Signed-off-by: zhongboz <zhongboz@nvidia.com>
* remove special cases, add sanity check before gemm Signed-off-by: zhongboz <zhongboz@nvidia.com>
* fix lint Signed-off-by: zhongboz <zhongboz@nvidia.com>
* lint ungrouped imports Signed-off-by: zhongboz <zhongboz@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Implement dequantize for compact 1D blocks. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* add more unit test with dequantize compact supported Signed-off-by: zhongboz <zhongboz@nvidia.com>
* lint again Signed-off-by: zhongboz <zhongboz@nvidia.com>
* make ag for subchannel respect async Signed-off-by: zhongboz <zhongboz@nvidia.com>
* zero tolerance in distributed test Signed-off-by: zhongboz <zhongboz@nvidia.com>
* fix zero tolerance test Signed-off-by: zhongboz <zhongboz@nvidia.com>
* resolve rebase issues Signed-off-by: zhongboz <zhongboz@nvidia.com>
* lint & format Signed-off-by: zhongboz <zhongboz@nvidia.com>
* fix lint Signed-off-by: zhongboz <zhongboz@nvidia.com>
* clean up Signed-off-by: zhongboz <zhongboz@nvidia.com>
* bug fix Signed-off-by: zhongboz <zhongboz@nvidia.com>
* relax rtol for fp32 distributed test Signed-off-by: zhongboz <zhongboz@nvidia.com>
* fix some ci issue Signed-off-by: zhongboz <zhongboz@nvidia.com>
* fix ci test failure in debug mode Signed-off-by: zhongboz <zhongboz@nvidia.com>
* Force row-wise and column-wise data to have same data format. Prototype "all-gather usage" in quantizer. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Remove dead logic for high-precision AGs Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Debug FP8 block-wise tensor tests Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Debug distributed test Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Handle case where LayerNormLinear returns gathered norm output Signed-off-by: Tim Moon <tmoon@nvidia.com>
* fix debug mode Signed-off-by: zhongboz <zhongboz@nvidia.com>
---------
Signed-off-by: zhongboz <zhongboz@nvidia.com>
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Keith Wyss <kwyss@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 23 May, 2025 1 commit
yuguo authored
-
- 07 Apr, 2025 1 commit
kwyss-nvidia authored
* Add GEMM logic for blockwise quantized tensors. GEMM test cases included in pytorch integration. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update NVTE_BLOCK_SCALING for GEMM. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Gate feature on CUDA 12.9 Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Gemm typo. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Remove unnecessary type converter change. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Reflect epilogue availability and test supported epilogues. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* GEMM simplifications from recipe branch. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Format py code. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update GEMM DGelu tests to match support depending on output dtype. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Force pow2Scales in GEMM Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Add GEMM test to pytorch test suite. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Add copyright to GEMM test. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update import for GEMM test. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Add license. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update test gemm supported predicate. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Use sgemm like interfaces and naming. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Rewrite GEMM comment. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* MR Feedback. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Refactor GEMM param canonicalization. Configure A and B matrices separately. Have separate code path for each scaling mode. Signed-off-by: Tim Moon <tmoon@nvidia.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci
* Prune number of tests. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
---------
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
-
- 04 Apr, 2025 1 commit
kwyss-nvidia authored
* Blockwise float8 quantizer and quantized tensor class. The classes are configurable for 128x128 and 1x128 block sizes by setting block_scaling_dim to 2 or 1, respectively. Scale tensors are stored in a format amenable to matrix multiplication; however, the integration of matmul is deferred as a separate story. Fusions of quantization and DBIAS or activation functions are not yet implemented, and the dequantization is currently implemented in torch. Tests for quantization are included in C++ and pytorch layers, with exact comparison to reference quantizer behavior as well as an attempt to hit interesting branches through the API, such as tensor creation in pytorch and CPP and dequantization of row and columnwise usage. Two CUDA kernels for quantization are included, and are direct ports of equivalents in the kitchen repository, where a subchannel recipe has been used for end-to-end training. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Apply linting changes. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Alignment for 1D scaling for GEMM edge case. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* MR feedback. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Change API name. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Fix merge conflict with name change. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Use common tensor map API. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Change API to use two scaling mode enums. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Fix typo. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update some call sites. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Tests for torch tensor API surface. Since the quantized tensor is a tensor subclass, these tests exercise torch hooks. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Reuse scale calculation between quantizer refs. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Save memory by dropping reference to saved tensors. Issues previously observed are solved. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Remove constexpr parameters from kernel. Code size is reduced with fewer constexpr params. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Merge conflict from rebase. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Add shape implementations for block scaling. nvte_shape was added upstream. Logic added for block scaled fp8. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Move benchmark to te_playground Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Remove amax_epsilon and pow_2_scales from tensor. Hardcodes the default values. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Lint changes. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Fixup MR changes that broke. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Safer ifdef in kernel. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Documentation prose. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Reuse compute_scale function from Current Scaling. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Bugfix on inf_value scale refactor. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Remove qopt calls from test. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update pytest list. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Add copyright to reference scale calc. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Use ptx.cuh functions instead of cde. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update shape logic with allocation and reuse shape. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Usage defaults MR feedback. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Copyright and header guard. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Updating torch dispatch code. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Fix exception type. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Use TypeInfo Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* MR feedback. Signed-off-by: Keith Wyss <kwyss@nvidia.com>
* Update CS scale update test to use updated ref impl Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Update JAX scaling mode enum Signed-off-by: Tim Moon <tmoon@nvidia.com>
* Skip tests on Lovelace Signed-off-by: Tim Moon <tmoon@nvidia.com>
---------
Signed-off-by: Keith Wyss <kwyss@nvidia.com>
Signed-off-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <tmoon@nvidia.com>
Co-authored-by: Tim Moon <4406448+timmoon10@users.noreply.github.com>
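Note on the 04 Apr commit above: the two block shapes it describes (128x128 when block_scaling_dim is 2, 1x128 when it is 1) can be illustrated with a small pure-Python sketch of per-block scale computation. This is not the code from the commit. The function name `block_scales`, the `block` parameter, the scale of 1.0 for an all-zero block, and rounding scales down to a power of two are assumptions made for illustration; 448 is the largest finite FP8 E4M3 magnitude.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3


def block_scales(matrix, block_scaling_dim, block=128, pow_2_scales=True):
    """Hypothetical helper: return one scale per block of `matrix`.

    block_scaling_dim == 2 -> block x block tiles (128x128 by default)
    block_scaling_dim == 1 -> 1 x block row segments (1x128 by default)
    """
    if block_scaling_dim not in (1, 2):
        raise ValueError("block_scaling_dim must be 1 or 2")
    rows, cols = len(matrix), len(matrix[0])
    block_rows = block if block_scaling_dim == 2 else 1
    scales = []
    for r0 in range(0, rows, block_rows):
        scale_row = []
        for c0 in range(0, cols, block):
            # Largest absolute value inside this block (edge blocks may be partial).
            amax = max(
                abs(matrix[r][c])
                for r in range(r0, min(r0 + block_rows, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            if amax == 0.0:
                scale = 1.0  # assumption: an all-zero block gets a neutral scale
            else:
                scale = FP8_E4M3_MAX / amax
                if pow_2_scales:
                    # Assumption: round the scale down to a power of two.
                    scale = 2.0 ** math.floor(math.log2(scale))
            scale_row.append(scale)
        scales.append(scale_row)
    return scales
```

With a small block size for readability, `block_scales(m, 2, block=2)` yields one scale per 2x2 tile, while `block_scales(m, 1, block=2)` yields one scale per 1x2 row segment, so the 1D layout has as many scale rows as the input has rows.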
-