examples/dequantize_gemm/example_dequant_gemm_fine_grained.py · 8554cb01ae59e62b0f99d6c1d43c553f08958d2d · OpenDAS / tilelang

"git@developer.sourcefind.cn:wangsen/mineru.git" did not exist on "d1af45662157c6bce4a04757a332c89dd9d98b65"

[Enhancement] Add a MXFP4 grouped GEMM example for FusedMoE (#811) · 8554cb01

Tong WU authored Sep 17, 2025



* [Enhancement] Enhance dequantization examples and utilities

- Added a new example for grouped matrix multiplication with experts in `example_dequant_groupgemm_bf16_mxfp4_hopper.py`.
- Improved dequantization logic in existing examples by replacing nested loops with vectorized operations for better performance.
- Updated `torch_convert_bit_twiddling` function in `utils.py` to utilize parallel processing, enhancing efficiency and clarity in the conversion process.
Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>

* fix typos in docstrings

* remove redundant code

* [Format] Unreproducible debug with T.print

* [BugFix] Correct dtype in ref dequantize; larger data distribution

* [Format]

* [Refactor] Clean up and optimize example_dequant_groupgemm_bf16_mxfp4_hopper.py and utils.py

- Removed unnecessary cache disabling and manual seed setting in the example.
- Simplified nested loops into parallelized operations for better readability and performance.
- Updated the assertion function in utils.py to print detailed error messages.
- Adjusted tensor sizes in examples

* [Refactor] Update import path in example_dequant_gemm_fine_grained.py

- Changed the import statement for `_tir_packed_to_unsigned_convert` from `bitblas.quantization` to `tilelang.quantize` to reflect the new module structure.

* lint

* rename and add test

* lint

* [Feature] Enhance autotuning and configuration generation in example_dequant_groupedgemm_bf16_mxfp4_hopper.py

- Added a new function `get_configs()` to generate hyperparameter configurations for tuning.
- Updated the `matmul` function to utilize autotuning with the new configurations.
- Improve kernel performance via vectorization and threadblock swizzle.
- Enhanced the main function to support the new autotuning inputs and updated parameters for better performance.

* lint

* fix typo

* fix typo and lint

* make ci format check happy

* fix ci

---------
Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>

8554cb01

example_dequant_gemm_fine_grained.py 15.1 KB

Replace example_dequant_gemm_fine_grained.py