- 13 Sep, 2025 1 commit
-
-
Yichen Yan authored
* update lint config * Remove spaces for blank line * update
-
- 19 Aug, 2025 2 commits
-
-
coderabbitai[bot] authored
*
📝 Add docstrings to `mxfp4` Docstrings generation was requested by @LeiWang1999. * https://github.com/tile-ai/tilelang/pull/725#issuecomment-3191656561 The following files were modified: * `examples/bitnet-1.58b/kernel_benchmark/tilelang_bitnet_158_int8xint2_prefill.py` * `examples/dequantize_gemm/example_dequant_gemm_bf16_fp4_hopper.py` * `examples/dequantize_gemm/example_dequant_gemm_bf16_mxfp4_hopper.py` * `examples/dequantize_gemm/utils.py` * `examples/gemm/example_gemm_autotune.py` * `tilelang/intrinsics/utils.py` * `tilelang/language/__init__.py` * `tilelang/language/utils.py` * `tilelang/quantize/mxfp.py` * `tilelang/quantize/quantization.py` * [Lint] More accurate docstring * [Lint] --------- Co-authored-by:coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Co-authored-by:
tzj-fxz <tzjfxz@gmail.com>
-
Zhengju Tang authored
* [Dequant] Add bit-twiddling dequantize cuda for fp4-->bf16 * [Dequant] Add extern call and serial dequantization * [Dequant] Parallel Dequant wait for fence debug. * [Scale] Add scale matrix to mxfp4 gemm * [Remove] Remove fence-buggy example and some generated source cuda code * [MXFP4] Update initial version of MXFP4 GEMM * [Scale] Add scale to latest mxfp4 gemm * [Lint] * [BugFix] Load Scale, disabe TMA to recover performance * [Lint] * [Lint] * [Scale] Use L2 to hold Scale and enable TMA will slightly boost performance * [Lint] * Update example_dequant_gemm_bf16_fp4_hopper_serial.py * Remove deprecated dequantization examples for BF16 and MXFP4 in the dequantize_gemm directory. * Refactor dequantization examples for improved readability and consistency. Adjusted formatting in matmul function and added spacing for clarity. Updated function signatures and comments for better understanding. * Refactor index_to_coordinates usage in bitnet example and update dequantization example configurations. Removed the custom index_to_coordinates function and replaced it with the built-in version. Adjusted block_K parameter in dequantization example for consistency. * lint fix * ci fix * Remove non-existent example * [BugFix] Add smem swizzle to recover performance of TMA * [BugFix] Enough reg for producer when threads=512 --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com>
-