"...include/git@developer.sourcefind.cn:gaoqiong/migraphx.git" did not exist on "231b7edd6e793c59c64da729f14ef7971fc6f03c"
-
Lei Wang authored
* Add GPU kernel for 2D continuous cumulative sum in TileLang example - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum. - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations. - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations. - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function. * Refactor TileLang examples and enhance kernel compilation - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking. - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability. - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management. - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames. - Cleaned up import statements in `__init__.py` for better organization and clarity. * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example - Added additional spacing for improved readability in `example_tilelang_cumsum.py`. - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations. * Refactor CUDA post-processing callback registration in TileLang - Introduced a new decorator `register_cuda_postproc_callback` for registering CUDA post-processing functions, enhancing usability and flexibility. - Updated existing callback implementations to utilize the new decorator, improving code clarity and maintainability. - Added debug prints to the CUDA code generation process for better traceability during development. - Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation. - Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling. * lint fix * Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters - Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio. - Implemented a function to generate various kernel configurations for autotuning. - Refactored the main execution block to support both autotuned and default configurations. - Improved the block mask generation to accommodate specified sparsity levels. - Updated the kernel compilation process to utilize the new configurations and ensure accurate results verification.
f47b43c5