1. 26 Mar, 2025 2 commits
    • Lei Wang's avatar
      [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      bf8a6fc1
    • yyttt6's avatar
      add autotune to example_gemm.py (#285) · 73d2c62e
      yyttt6 authored
      73d2c62e
  2. 22 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Refactor CUDA post-processing callback registration in TileLang (#259) · f47b43c5
      Lei Wang authored
      * Add GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
      - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
      - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
      - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.
      
      * Refactor TileLang examples and enhance kernel compilation
      
      - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
      - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
      - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
      - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
      - Cleaned up import statements in `__init__.py` for better organization and clarity.
      
      * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
      - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.
      
      * Refactor CUDA post-processing callback registration in TileLang
      
      - Introduced a new decorator `register_cuda_postproc_callback` for registering CUDA post-processing functions, enhancing usability and flexibility.
      - Updated existing callback implementations to utilize the new decorator, improving code clarity and maintainability.
      - Added debug prints to the CUDA code generation process for better traceability during development.
      - Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation.
      - Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling.
      
      * lint fix
      
      * Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters
      
      - Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio.
      - Implemented a function to generate various kernel configurations for autotuning.
      - Refactored the main execution block to support both autotuned and default configurations.
      - Improved the block mask generation to accommodate specified sparsity levels.
      - Updated the kernel compilation process to utilize the new configurations and ensure accurate results verification.
      f47b43c5
  3. 13 Feb, 2025 1 commit
  4. 09 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Tools] Introduce `plot_layout` to visualize the fragment layout (#68) · f9b6a92e
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      
      * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
      
      * Fix formatting in CUDA FP8 header file for consistency
      
      * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
      
      * Update submodule 'tvm' to latest commit for improved functionality
      
      * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
      
      * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
      
      * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
      
      * Add CUDA requirements to FP8 test cases and update references for clarity
      
      * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
      
      * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Add CUDA requirements and FP8 test cases for matmul and gemv simulations
      
      * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
      
      * Add BF16 support to matrix multiplication and introduce corresponding test cases
      
      * Add a blank line for improved readability in BF16 GEMM test
      
      * Update acknowledgements in README to include supervision by Zhi Yang at Peking University
      
      * enhance acknowledgement
      
      * Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives
      
      * Update subproject commit for TVM dependency
      
      * Update subproject commit for TVM dependency
      
      * Add int4_t type and functions for packing char values in CUDA common header
      
      * Add plot_layout example and implement GetForwardVars method in layout classes
      
      * Refactor code for improved readability by adjusting line breaks and formatting in layout and test files
      
      * Fix formatting by removing unnecessary line break in layout.h
      
      * Refactor make_int4 function for improved readability by adjusting parameter formatting
      f9b6a92e
  5. 26 Jan, 2025 1 commit
    • Lei Wang's avatar
      [Doc] Addd debug relevant testing and documentations (#58) · 5e259239
      Lei Wang authored
      * implement jit test case
      
      * [Dev] implement auto tune test case for matrix multiplication
      
      * Implement test for legalize memory access and vectorized loop
      
      * lint fix
      
      * introduce run_once
      
      * Refactor callback function names for consistency and improve code readability
      
      * enhance documentations
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix formatting issues in rt_mod_hip.cc
      
      * add random seed initialization for deterministic testing
      5e259239