    [Refactor] Refactor CUDA post-processing callback registration in TileLang (#259) · f47b43c5
    Lei Wang authored
    * Add GPU kernel for 2D continuous cumulative sum in TileLang example
    
    - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
    - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
    - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
    - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function (see the sketch after this list).
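
    A minimal sketch of the power-of-two guard and the PyTorch-based validation described above (`assert_power_of_two` and `validate` are illustrative names, not the example's actual helpers):

    ```python
    import torch

    def assert_power_of_two(n: int) -> None:
        # hypothetical guard: the inclusive scan assumes power-of-two extents
        if n <= 0 or (n & (n - 1)) != 0:
            raise ValueError(f"extent {n} must be a power of two")

    def validate(out: torch.Tensor, x: torch.Tensor) -> None:
        # compare the kernel output against PyTorch's built-in row-wise cumsum
        torch.testing.assert_close(out, torch.cumsum(x, dim=1))
    ```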
    
    * Refactor TileLang examples and enhance kernel compilation
    
    - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
    - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
    - Modified `kernel_cache.py` to skip saving kernels to disk when using the DLPack backend, keeping cache management correct (this guard is sketched after the list).
    - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
    - Cleaned up import statements in `__init__.py` for better organization and clarity.
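
    A minimal sketch of the `kernel_cache.py` guard, assuming a hypothetical `execution_backend` attribute and save helper (the actual names in the PR may differ):

    ```python
    class KernelCache:
        def _save_kernel_to_disk(self, kernel, path: str) -> None:
            # Assumed logic: DLPack-backed kernels wrap in-process handles
            # that cannot be serialized meaningfully, so skip persisting them.
            if getattr(kernel, "execution_backend", None) == "dlpack":
                return
            with open(path, "wb") as f:
                f.write(kernel.artifact)  # hypothetical serialized artifact
    ```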
    
    * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example
    
    - Added spacing in `example_tilelang_cumsum.py` to improve readability.
    - Refined the kernel structure so the generated cumulative sum kernel is clearer and easier to maintain.
    
    * Refactor CUDA post-processing callback registration in TileLang
    
    - Introduced a new decorator, `register_cuda_postproc_callback`, for registering CUDA post-processing functions, making registration simpler and more flexible (usage sketched after this list).
    - Updated existing callback implementations to use the new decorator, improving code clarity and maintainability.
    - Added debug prints to the CUDA code generation process for better traceability during development.
    - Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation.
    - Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling.
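
    A minimal usage sketch of the new decorator, assuming it is importable from the top-level `tilelang` package and that callbacks receive the generated CUDA source and the target (both assumptions; see the PR for the actual import path and signature):

    ```python
    from tilelang import register_cuda_postproc_callback  # import path assumed

    @register_cuda_postproc_callback
    def annotate_cuda(code: str, target) -> str:
        # Inspect or rewrite the generated CUDA source before compilation;
        # here we simply prepend a marker comment.
        return "// post-processed by callback\n" + code
    ```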
    
    * Lint fix
    
    * Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters
    
    - Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio.
    - Implemented a function to generate candidate kernel configurations for autotuning (sketched, together with the mask generation, after this list).
    - Refactored the main execution block to support both autotuned and default configurations.
    - Improved the block mask generation to accommodate specified sparsity levels.
    - Updated the kernel compilation process to use the new configurations and verify that results remain correct.
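
    A minimal sketch of the configuration generator and the sparsity-aware block mask (parameter names, tuning ranges, and the Bernoulli keep/drop scheme are assumptions, not the example's exact code):

    ```python
    import itertools
    import torch

    def gen_configs():
        # hypothetical tuning space; names and ranges are illustrative
        for block_M, block_N, block_K, num_stages in itertools.product(
                [64, 128], [64, 128], [32, 64], [1, 2]):
            yield dict(block_M=block_M, block_N=block_N,
                       block_K=block_K, num_stages=num_stages)

    def make_block_mask(grid_m: int, grid_n: int, sparsity: float,
                        device: str = "cuda") -> torch.Tensor:
        # keep a block with probability (1 - sparsity); True marks a dense tile
        return torch.rand(grid_m, grid_n, device=device) >= sparsity
    ```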