- 25 Jun, 2025 1 commit
-
-
Cunxiao Ni authored
* [Example] Update kernel compilation in examples to use @tilelang.jit - Refactored multiple examples to eliminate the use of `tilelang.compile` for kernel creation, directly invoking the functions instead. - Added `@tilelang.jit` decorators with appropriate output indices to enhance performance and maintainability. - Improved code clarity by simplifying the kernel invocation process across various examples, ensuring consistency in how kernels are defined and executed. * format * Update example_tilelang_sparse_gqa_decode_varlen_indice.py * Update example_dequant_gemm_fine_grained.py * Update example_gemm_autotune.py --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 09 May, 2025 1 commit
-
-
Zhengju Tang authored
* [Refactor] Enhance TMA barrier validation and support for additional architectures (#463) * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`. * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic. * [CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI. * Lint --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 03 Apr, 2025 1 commit
-
-
Yu Cheng authored
* [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support * Added `example_per_token_cast_to_fp8.py` in examples/cast, providing token-wise FP8 quantization implementation. * Added `example_triton_cast_to_fp8.py` in examples/cast, providing Triton-based FP8 quantization implementation. * Added support for absolute maximum (absmax) reduction operation in reduce.cc and reduce.h. * Implemented `reduce_absmax` function in reduce.py, allowing absolute maximum reduction on input buffers. * Updated tilelang.language module to include the new `reduce_absmax` function. These changes enhance FP8 quantization capabilities and extend reduction operation support. * [Enhancement] Update per_token_cast_to_fp8 for improved FP8 quantization * Modified the `per_token_cast_to_fp8` function to support variable block sizes and improved memory layout annotations. * Adjusted the handling of absolute maximum values and scaling factors for better performance and accuracy. * Updated the main execution block to allow for larger matrix dimensions and refined the profiler setup for benchmarking. These changes enhance the flexibility and efficiency of the FP8 quantization process. * lint * [Dev] Update per_token_cast_fp8.py
-