- 26 Mar, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Improve flash attention example and layout comparison logic - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code. - Updated the handling of `lse_local_split` to utilize parallel processing for better performance. - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example. - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons. * lint fix * [Enhancement] Add support for shared memory scope in Fill operation - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation. - Implemented parallel operation and layout inference for improved performance in shared memory scenarios. - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling. * [Refactor] Remove deprecated decorator and enhance Cython kernel handling - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization. - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution. - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments. - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced. - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs. * [Feature] Add matrix multiplication test and kernel implementation - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives. - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types. - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation. - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs. - Minor formatting improvements in `deprecated.py` for better readability. * lint fix * [Refactor] Update tensor creation in matrix multiplication test - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency. - Updated imports in `__init__.py` to include `make_tensor`. - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers. * [Refactor] Update tensor definitions across multiple files - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity. - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations. - Improved documentation in README and example files to reflect changes in tensor usage. * lint fix * [Refactor] Update tensor types in attention and matrix multiplication examples - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity. - Adjusted tensor definitions in benchmark and example files to align with the new tensor types. - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files. * lint fix * [Refactor] Update tensor types in GEMM example and test files - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity. - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions. * [Refactor] Update tensor usage in customize.py - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions. - Improved code clarity by standardizing buffer usage across the file. * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions. - Improved code clarity by standardizing buffer usage across the test file. * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions. - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions. * [Refactor] Introduce Tensor alias for Buffer in proxy.py - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`. - This change enhances clarity and consistency in tensor usage across the codebase.
-
- 22 Mar, 2025 1 commit
-
-
Lei Wang authored
* Add GPU kernel for 2D continuous cumulative sum in TileLang example - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum. - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations. - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations. - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function. * Refactor TileLang examples and enhance kernel compilation - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking. - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability. - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management. - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames. - Cleaned up import statements in `__init__.py` for better organization and clarity. * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example - Added additional spacing for improved readability in `example_tilelang_cumsum.py`. - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations. * Refactor CUDA post-processing callback registration in TileLang - Introduced a new decorator `register_cuda_postproc_callback` for registering CUDA post-processing functions, enhancing usability and flexibility. - Updated existing callback implementations to utilize the new decorator, improving code clarity and maintainability. - Added debug prints to the CUDA code generation process for better traceability during development. - Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation. - Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling. * lint fix * Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters - Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio. - Implemented a function to generate various kernel configurations for autotuning. - Refactored the main execution block to support both autotuned and default configurations. - Improved the block mask generation to accommodate specified sparsity levels. - Updated the kernel compilation process to utilize the new configurations and ensure accurate results verification.
-
- 13 Feb, 2025 1 commit
-
-
Wenhao Xie authored
* [CI] Clean up target repository before publishing documentation. * [Doc] Convert docs from rst format to Markdown format.
-