[Enhancement] Add an MXFP4 grouped GEMM example for FusedMoE (#811)
* [Enhancement] Enhance dequantization examples and utilities
- Added a new example for grouped matrix multiplication with experts in `example_dequant_groupgemm_bf16_mxfp4_hopper.py`.
- Improved dequantization logic in existing examples by replacing nested loops with vectorized operations for better performance.
- Updated the `torch_convert_bit_twiddling` function in `utils.py` to use parallel processing, improving the efficiency and clarity of the conversion.
Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
* fix typos in docstrings
* remove redundant code
* [Format] Unreproducible debug with T.print
* [BugFix] Correct dtype in ref dequantize; larger data distribution
* [Format]
* [Refactor] Clean up and optimize example_dequant_groupgemm_bf16_mxfp4_hopper.py and utils.py
- Removed unnecessary cache disabling and manual seed setting in the example.
- Simplified nested loops into parallelized operations ...
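
For readers unfamiliar with the loop-to-vectorized change described above, the following is a minimal sketch of MXFP4 dequantization, not the code from this commit. It uses NumPy rather than the torch/TileLang code in the repository, and the function names (`dequant_mxfp4`, `dequant_mxfp4_loops`), the nibble order (low nibble first), and the tensor layout are assumptions made for illustration. The E2M1 value table and the E8M0 scale with bias 127 follow the OCP Microscaling format.

```python
import numpy as np

# Magnitudes encoded by the 3 low bits of an FP4 (E2M1) code; bit 3 is the sign.
FP4_E2M1_MAGNITUDES = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32
)


def dequant_mxfp4_loops(packed, scales, block_size=32):
    """Nested-loop reference: one output element per iteration."""
    rows, half_cols = packed.shape
    out = np.zeros((rows, half_cols * 2), dtype=np.float32)
    for i in range(rows):
        for j in range(half_cols * 2):
            byte = int(packed[i, j // 2])
            # Assumption: even columns live in the low nibble, odd in the high.
            code = (byte >> (4 * (j % 2))) & 0xF
            magnitude = float(FP4_E2M1_MAGNITUDES[code & 0x7])
            sign = -1.0 if code & 0x8 else 1.0
            # E8M0 scale: a power of two with exponent bias 127.
            scale = 2.0 ** (int(scales[i, j // block_size]) - 127)
            out[i, j] = sign * magnitude * scale
    return out


def dequant_mxfp4(packed, scales, block_size=32):
    """Vectorized equivalent: whole-array operations, no Python loops.

    packed: uint8, shape (rows, cols // 2), two FP4 codes per byte.
    scales: uint8, shape (rows, cols // block_size), E8M0 exponents.
    """
    packed = np.asarray(packed, dtype=np.uint8)
    scales = np.asarray(scales, dtype=np.uint8)
    # Split each byte into two 4-bit codes and interleave them column-wise.
    codes = np.stack([packed & 0x0F, packed >> 4], axis=-1)
    codes = codes.reshape(packed.shape[0], -1)
    magnitude = FP4_E2M1_MAGNITUDES[codes & 0x7]  # table lookup per element
    sign = np.where((codes & 0x8) != 0, -1.0, 1.0).astype(np.float32)
    # Expand one scale per block to one scale per element.
    exponent = scales.astype(np.int32) - 127
    block_scale = np.repeat(np.exp2(exponent).astype(np.float32), block_size, axis=1)
    return sign * magnitude * block_scale
```

Both functions return a float32 array of shape `(rows, cols)`; the example in this commit targets bf16, so a final cast would be needed there.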