    [Feature] Low-bit twiddling dequantization and FP4 GEMM (#725) · 24603e4a
    Zhengju Tang authored
    
    
    * [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16
    
    * [Dequant] Add extern call and serial dequantization
    
    * [Dequant] Parallel Dequant wait for fence debug.
    
    * [Scale] Add scale matrix to mxfp4 gemm
    
    * [Remove] Remove fence-buggy example and some generated source cuda code
    
    * [MXFP4] Update initial version of MXFP4 GEMM
    
    * [Scale] Add scale to latest mxfp4 gemm
    
    * [Lint]
    
    * [BugFix] Load Scale, disable TMA to recover performance
    
    * [Lint]
    
    * [Lint]
    
    * [Scale] Use L2 to hold Scale; enabling TMA slightly boosts performance
    
    * [Lint]
    
    * Update example_dequant_gemm_bf16_fp4_hopper_serial.py
    
    * Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory.
    
    * Refactor dequantization examples for improved readability and consistency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding.
    
    * Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency.
    
    * lint fix
    
    * ci fix
    
    * Remove non-existent example
    
    * [BugFix] Add smem swizzle to recover performance of TMA
    
    * [BugFix] Enough reg for producer when threads=512
    
    ---------
    Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
utils.py 2.54 KB