    [Example] Implement Kernel Example cumsum (#258) · cd9ec62e
    Lei Wang authored
    * Add GPU kernel for 2D continuous cumulative sum in TileLang example
    
    - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
    - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
    - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
    - Enhanced the example with error handling that enforces the power-of-two size requirement and with validation of the results against PyTorch's built-in cumulative sum (`torch.cumsum`); a sketch of this validation approach follows this list.
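
    The validation approach described above can be sketched in plain PyTorch (this is an illustration, not the TileLang kernel itself): an inclusive log-step scan of the kind such a GPU kernel typically performs, a power-of-two check on the scanned dimension, and a comparison against `torch.cumsum` as the reference. The function name and shapes below are illustrative only.

    ```python
    import torch


    def inclusive_scan_rows(x: torch.Tensor) -> torch.Tensor:
        """Hillis-Steele style inclusive scan along the last dimension.

        Mirrors the log-step structure a GPU scan kernel typically uses and
        enforces the power-of-two row length the example requires.
        """
        n = x.shape[-1]
        if n & (n - 1) != 0:
            raise ValueError(f"row length must be a power of two, got {n}")
        out = x.clone()
        offset = 1
        while offset < n:
            # Each step adds the value `offset` positions to the left.
            shifted = torch.zeros_like(out)
            shifted[..., offset:] = out[..., :-offset]
            out = out + shifted
            offset *= 2
        return out


    if __name__ == "__main__":
        x = torch.randn(8, 256)
        ref = torch.cumsum(x, dim=-1)  # PyTorch's built-in cumulative sum
        torch.testing.assert_close(inclusive_scan_rows(x), ref, rtol=1e-3, atol=1e-3)
    ```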
    
    * Refactor TileLang examples and enhance kernel compilation
    
    - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
    - Refactored `example_mha_bwd.py` to improve the readability and maintainability of its kernel compilation code.
    - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management (see the sketch after this list).
    - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
    - Cleaned up import statements in `__init__.py` for better organization and clarity.
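
    The cache-management change for the DLPack backend can be illustrated with a small, entirely hypothetical sketch (the class and method names below are invented for illustration and are not taken from `kernel_cache.py`): compiled kernels always go into the in-memory cache, but the disk-persistence step is skipped when the execution backend is DLPack.

    ```python
    import os
    import pickle


    class KernelCache:
        """Hypothetical stand-in for a kernel cache; not TileLang's actual class."""

        def __init__(self, cache_dir: str = "~/.kernel_cache"):
            self.cache_dir = os.path.expanduser(cache_dir)
            self.memory_cache = {}

        def put(self, key: str, kernel, execution_backend: str = "cython") -> None:
            # Kernels are always kept in the in-memory cache.
            self.memory_cache[key] = kernel
            # Skip disk persistence for DLPack-backed kernels.
            if execution_backend == "dlpack":
                return
            os.makedirs(self.cache_dir, exist_ok=True)
            with open(os.path.join(self.cache_dir, f"{key}.pkl"), "wb") as f:
                pickle.dump(kernel, f)

        def get(self, key: str):
            return self.memory_cache.get(key)
    ```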
    
    * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example
    
    - Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
    - Refined the kernel structure to improve the clarity and maintainability of the generated cumulative-sum GPU kernel.