- 22 Mar, 2025 1 commit
Lei Wang authored
* Add GPU kernel for 2D continuous cumulative sum in TileLang example
  - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
  - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
  - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
  - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.
* Refactor TileLang examples and enhance kernel compilation
  - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
  - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
  - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
  - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
  - Cleaned up import statements in `__init__.py` for better organization and clarity.
* Enhance GPU kernel for 2D continuous cumulative sum in TileLang example
  - Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
  - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.
* Refactor CUDA post-processing callback registration in TileLang
  - Introduced a new decorator `register_cuda_postproc_callback` for registering CUDA post-processing functions, enhancing usability and flexibility.
  - Updated existing callback implementations to utilize the new decorator, improving code clarity and maintainability.
  - Added debug prints to the CUDA code generation process for better traceability during development.
  - Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation.
  - Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling.
* lint fix
* Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters
  - Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio.
  - Implemented a function to generate various kernel configurations for autotuning.
  - Refactored the main execution block to support both autotuned and default configurations.
  - Improved the block mask generation to accommodate specified sparsity levels.
  - Updated the kernel compilation process to utilize the new configurations and ensure accurate results verification.
- 13 Feb, 2025 1 commit
Wenhao Xie authored
* [CI] Clean up target repository before publishing documentation.
* [Doc] Convert docs from rst format to Markdown format.
- 09 Feb, 2025 1 commit
Lei Wang authored
* [Enhancement] Add VectorizeLoop function and update imports for compatibility
* [CI][Test] Improve test cases for vectorization and fix typos in parser comments
* lint fix
* Fix incorrect module reference for VectorizeLoop transformation
* Refactor vectorize_loop transformation by removing unused extent mutation logic
* [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
* Fix formatting in CUDA FP8 header file for consistency
* Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
* Update submodule 'tvm' to latest commit for improved functionality
* Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
* Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
* Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
* Add CUDA requirements to FP8 test cases and update references for clarity
* Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
* Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
* Add CUDA requirements and FP8 test cases for matmul and gemv simulations
* Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
* Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
* Add BF16 support to matrix multiplication and introduce corresponding test cases
* Add a blank line for improved readability in BF16 GEMM test
* Update acknowledgements in README to include supervision by Zhi Yang at Peking University
* enhance acknowledgement
* Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives
* Update subproject commit for TVM dependency
* Update subproject commit for TVM dependency
* Add int4_t type and functions for packing char values in CUDA common header
* Add plot_layout example and implement GetForwardVars method in layout classes
* Refactor code for improved readability by adjusting line breaks and formatting in layout and test files
* Fix formatting by removing unnecessary line break in layout.h
* Refactor make_int4 function for improved readability by adjusting parameter formatting
- 26 Jan, 2025 1 commit
Lei Wang authored
* implement jit test case
* [Dev] implement auto tune test case for matrix multiplication
* Implement test for legalize memory access and vectorized loop
* lint fix
* introduce run_once
* Refactor callback function names for consistency and improve code readability
* enhance documentations
* lint fix
* lint fix
* lint fix
* lint fix
* fix formatting issues in rt_mod_hip.cc
* add random seed initialization for deterministic testing