  • [Enhancement] Add a MXFP4 grouped GEMM example for FusedMoE (#811) · 8554cb01
    Tong WU authored
    
    
    * [Enhancement] Enhance dequantization examples and utilities
    
    - Added a new example for grouped matrix multiplication with experts in `example_dequant_groupgemm_bf16_mxfp4_hopper.py`.
    - Improved dequantization logic in existing examples by replacing nested loops with vectorized operations for better performance.
    - Updated `torch_convert_bit_twiddling` function in `utils.py` to utilize parallel processing, enhancing efficiency and clarity in the conversion process.
    Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
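
To illustrate the kind of change the bullets above describe (nested loops replaced by vectorized operations), here is a minimal NumPy sketch of MXFP4 dequantization. It is not the code from this commit: the function names `dequant_loops` and `dequant_vectorized` are hypothetical, and the layout is an assumption (two E2M1 values per byte with the low nibble first, one E8M0 scale with bias 127 per group of 32 values).

```python
import math

import numpy as np

# E2M1 (FP4) values indexed by the 4-bit code; bit 3 is the sign bit.
FP4_E2M1_LUT = np.array(
    [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0,
     -0.0, -0.5, -1.0, -1.5, -2.0, -3.0, -4.0, -6.0],
    dtype=np.float32,
)


def dequant_loops(packed, scales, group=32):
    """Reference version: one Python-level iteration per output element."""
    rows, nbytes = packed.shape
    out = np.empty((rows, nbytes * 2), dtype=np.float32)
    for i in range(rows):
        for j in range(nbytes * 2):
            byte = int(packed[i, j // 2])
            nibble = (byte >> (4 * (j % 2))) & 0xF  # assumed: low nibble first
            exp = int(scales[i, j // group]) - 127  # assumed: E8M0 scale, bias 127
            out[i, j] = math.ldexp(float(FP4_E2M1_LUT[nibble]), exp)
    return out


def dequant_vectorized(packed, scales, group=32):
    """Same result, computed with whole-array operations."""
    lo = packed & 0xF
    hi = packed >> 4
    # Interleave as lo0, hi0, lo1, hi1, ... to match the loop version.
    nibbles = np.stack([lo, hi], axis=-1).reshape(packed.shape[0], -1)
    values = FP4_E2M1_LUT[nibbles]
    # Broadcast each group's exponent over its `group` elements.
    exps = np.repeat(scales.astype(np.int32) - 127, group, axis=1)
    return np.ldexp(values, exps)
```

Both functions take a `uint8` array of packed codes with shape `(rows, n_bytes)` and a `uint8` array of scales with shape `(rows, 2 * n_bytes // group)`; the vectorized form produces the same values without any Python-level loop.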
    
    * fix typos in docstrings
    
    * remove redundant code
    
    * [Format] Unreproducible debug with T.print
    
    * [BugFix] Correct dtype in ref dequantize; larger data distribution
    
    * [Format]
    
    * [Refactor] Clean up and optimize example_dequant_groupgemm_bf16_mxfp4_hopper.py and utils.py
    
    - Removed unnecessary cache disabling and manual seed setting in the example.
    - Simplified nested loops into parallelized operations for better readability and performance.
    - Updated the assertion function in utils.py to print detailed error messages.
    - Adjusted tensor sizes in the examples.
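
As an illustration of an assertion helper that prints detailed error messages, the sketch below reports how many elements mismatch and where the largest error occurs. The name `assert_close_verbose` and its tolerances are hypothetical and are not taken from `utils.py`.

```python
import numpy as np


def assert_close_verbose(actual, expected, rtol=1e-2, atol=1e-2):
    """Raise AssertionError with mismatch count and worst-case location."""
    actual = np.asarray(actual, dtype=np.float64)
    expected = np.asarray(expected, dtype=np.float64)
    if actual.shape != expected.shape:
        raise AssertionError(f"shape mismatch: {actual.shape} vs {expected.shape}")
    abs_err = np.abs(actual - expected)
    # Element-wise tolerance check in the style of numpy.isclose.
    bad = abs_err > atol + rtol * np.abs(expected)
    if bad.any():
        worst = np.unravel_index(np.argmax(abs_err), abs_err.shape)
        index = tuple(int(i) for i in worst)
        raise AssertionError(
            f"{int(bad.sum())}/{bad.size} elements mismatched; "
            f"max abs error {abs_err[worst]:.6g} at index {index} "
            f"(actual={actual[worst]:.6g}, expected={expected[worst]:.6g})"
        )
```

Reporting the index and both values of the worst element makes it easier to tell a systematic dtype or scaling bug from a single out-of-tolerance value.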
    
    * [Refactor] Update import path in example_dequant_gemm_fine_grained.py
    
    - Changed the import statement for `_tir_packed_to_unsigned_convert` from `bitblas.quantization` to `tilelang.quantize` to reflect the new module structure.
    
    * lint
    
    * rename and add test
    
    * lint
    
    * [Feature] Enhance autotuning and configuration generation in example_dequant_groupedgemm_bf16_mxfp4_hopper.py
    
    - Added a new function `get_configs()` to generate hyperparameter configurations for tuning.
    - Updated the `matmul` function to utilize autotuning with the new configurations.
    - Improved kernel performance via vectorization and threadblock swizzle.
    - Enhanced the main function to support the new autotuning inputs and updated parameters for better performance.
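
A configuration generator of the kind described above is commonly written as a Cartesian product over a small search space. The sketch below shows that pattern only; the parameter names (`block_M`, `block_N`, `block_K`, `num_stages`, `threads`) and their candidate values are assumptions for illustration, not the values used in the example file.

```python
import itertools


def get_configs():
    """Return one dict per combination of the candidate hyperparameters."""
    # Hypothetical search space; real tile sizes depend on the kernel and GPU.
    search_space = {
        "block_M": [64, 128],
        "block_N": [64, 128],
        "block_K": [64, 128],
        "num_stages": [2, 3],
        "threads": [128, 256],
    }
    keys = list(search_space)
    return [
        dict(zip(keys, values))
        for values in itertools.product(*search_space.values())
    ]
```

Each returned dict can be passed to an autotuner as one candidate; the list grows multiplicatively with every added parameter, so the search space is usually kept small.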
    
    * lint
    
    * fix typo
    
    * fix typo and lint
    
    * make ci format check happy
    
    * fix ci
    
    ---------
    Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
    Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
example_dequant_gemm_fine_grained.py 15.1 KB