1. 30 Mar, 2025 1 commit
  2. 29 Mar, 2025 1 commit
  3. 28 Mar, 2025 5 commits
    • NaOHCC's avatar
    • Lei Wang's avatar
      [Refactor] Improve documentation and add detailed docstrings across multiple modules (#298) · 3f294650
      Lei Wang authored
      * [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h
      
      - Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
      - Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.
      
      * [Enhancement] Improve documentation and add detailed docstrings across multiple modules
      
      - Updated the `__init__.py` file to enhance module documentation, providing clarity on auto-tuning functionalities.
      - Added comprehensive docstrings to the `JITContext`, `AutotuneResult`, and `AutoTuner` classes, detailing their attributes and methods.
      - Enhanced memory allocation utilities in `allocate.py` with detailed descriptions for each allocation function.
      - Improved documentation for various intrinsic operations in `builtin.py`, `copy.py`, `customize.py`, `frame.py`, `gemm.py`, `memscope.py`, and `reduce.py`, ensuring clear explanations of parameters and return values.
      - Refactored the `KernelCache` class to improve clarity and maintainability, including detailed comments and docstrings for methods.
      - Overall, these changes aim to enhance code readability and provide better guidance for future developers and users of the Tile-AI framework.
      3f294650
    • Lei Wang's avatar
      [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h (#297) · 9ad9d9cd
      Lei Wang authored
      - Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
      - Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.
      9ad9d9cd
    • Lei Wang's avatar
      [Feature] Implement ParallelLoopTransformer for enhanced loop analysis (#295) · 5c8de061
      Lei Wang authored
      * [Feature] Implement ParallelLoopTransformer for enhanced loop analysis
      
      - Introduced the ParallelLoopTransformer class to improve the handling of parallel loops in layout inference.
      - Enhanced the analysis of loop variables and their extents, allowing for more accurate index range calculations.
      - Added a BufferAccessCollector to gather buffer access information, ensuring correct index mapping and condition handling.
      - Updated the LayoutInference pass to utilize the new transformer, improving overall performance and accuracy in loop transformations.
      
      * test fix
      
      * Fix typo in buffer variable documentation and enhance loop variable handling in layout inference. Added checks for related loop variables and improved condition handling for index mapping.
      
      * Refactor loop variable handling in layout inference. Updated loop index variable from `i` to `j` for clarity and improved condition handling for index mapping by replacing `indices[i]` with `index` in predicate construction.
      5c8de061
    • botbw's avatar
      [doc/example] add gemv doc and examples (#293) · ff3cfa59
      botbw authored
      * [doc/example] init gemv doc and examples
      
      * [example] add vectorized read
      
      * [example] use local register instead of smem
      
      * [example] add bench
      
      * [doc] update doc
      
      * [doc] refine doc
      
      * [lint] format code
      
      * [doc] add tips
      
      * [doc/example] fix typo
      
      * [example] use tmv_all_reduce
      
      * [doc] update doc accordingly
      
      * [doc] add benchmark table
      
      * [lint] format code
      ff3cfa59
  4. 27 Mar, 2025 5 commits
    • penguin_wwy's avatar
      [Dev] Correcting cxx compiler (#294) · 304b4465
      penguin_wwy authored
      304b4465
    • Lei Wang's avatar
      Remove citation page (#292) · 5079e2a5
      Lei Wang authored
      5079e2a5
    • Wenhao Xie's avatar
      [Doc] Python API docs generation (#278) · 5501b31c
      Wenhao Xie authored
      * fix bug
      
      * update performance.py
      
      * update python api docs
      
      * test workflow
      
      * fix dependency
      
      * fix bug
      
      * fix
      
      * update correct git config
      
      * test workflow
      
      * clear cache
      
      * lint fix
      
      * fix exclude path
      5501b31c
    • Lei Wang's avatar
      [Bugfix] Enable bfloat16 atomic operations only for CUDA architectures greater than 7.5 (#291) · 83412458
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      
      * [Refactor] Revamp cache management and enhance documentation in env.py and proxy.py
      
      - Replaced global cache functions with a CacheState class to improve encapsulation and management of kernel caching.
      - Updated the `from_ptr` method in BufferProxy and BaseTensorProxy classes to include detailed docstrings for better clarity on parameters and return values.
      - Enhanced class docstrings across various proxy classes to provide clearer descriptions of their purpose and functionality, improving overall code documentation.
      
      * [Refactor] Update imports in __init__.py for tir compatibility
      
      - Added imports for `prim_func` and `tir.op` to enhance compatibility with the upstream tir script.
      - Marked imports with `# noqa: F401` to suppress linting warnings for unused imports, indicating future removal once compatibility is achieved.
      
      * lint fix
      
      * [Refactor] Update imports in tir.ir.py for improved compatibility
      
      - Removed unused import of `PrimExpr` from `tvm.script.ir_builder.tir` and replaced it with the correct import from `tvm.tir`.
      - Added import for `tir.ir` in `__init__.py` to enhance module accessibility and maintain compatibility with upstream changes.
      
      * [Refactor] Update function calls in tir.ir.py to return values
      
      - Modified the `serial`, `parallel`, `vectorized`, `unroll`, `thread_binding`, and `grid` functions to return the results of their respective calls to `_ir` methods, enhancing clarity and ensuring proper value propagation.
      
      * bugfix
      
      * [Enhancement] Add support for uint16 data type in TLCUDASourceWrapper
      
      - Introduced the "uint16" mapping to the type dictionary in the TLCUDASourceWrapper class, expanding the range of supported data types for CUDA operations.
      
      * bugfix
      
      * [Update] Sync subproject commit and modify CUDA atomic add functions
      
      - Updated the subproject commit for TVM to edd35139a0481e9359aa269e3e50450b95ba2f5a.
      - Commented out the CUDA capability check in the example convolution script to prevent execution errors.
      - Refactored atomic add functions for BFLOAT16 in common.h to include a conditional compilation directive for improved compatibility with CUDA architectures.
      83412458
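The squashed message above mentions replacing global cache functions with a `CacheState` class. A minimal sketch of that pattern follows, assuming the class only tracks an on/off flag. The method names `enable`, `disable`, and `is_enabled` are illustrative guesses, not taken from the commit.

```python
class CacheState:
    """Holds the kernel-cache switch in one place instead of module-level globals.

    Hypothetical sketch: the method names below are assumptions.
    """

    _enabled = True  # caching is assumed to be on by default

    @classmethod
    def enable(cls) -> None:
        """Turn kernel caching on."""
        cls._enabled = True

    @classmethod
    def disable(cls) -> None:
        """Turn kernel caching off."""
        cls._enabled = False

    @classmethod
    def is_enabled(cls) -> bool:
        """Report whether kernel caching is currently on."""
        return cls._enabled


# Callers query the class rather than reading a bare global variable.
CacheState.disable()
state_after_disable = CacheState.is_enabled()
CacheState.enable()
state_after_enable = CacheState.is_enabled()
```

Keeping the flag behind class methods gives a single place to extend later (for example with logging or validation) without touching call sites.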
    • Lei Wang's avatar
      [Language] Proxy tvm ir to make linter happy (#287) · be0bf36d
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      
      * [Refactor] Revamp cache management and enhance documentation in env.py and proxy.py
      
      - Replaced global cache functions with a CacheState class to improve encapsulation and management of kernel caching.
      - Updated the `from_ptr` method in BufferProxy and BaseTensorProxy classes to include detailed docstrings for better clarity on parameters and return values.
      - Enhanced class docstrings across various proxy classes to provide clearer descriptions of their purpose and functionality, improving overall code documentation.
      
      * [Refactor] Update imports in __init__.py for tir compatibility
      
      - Added imports for `prim_func` and `tir.op` to enhance compatibility with the upstream tir script.
      - Marked imports with `# noqa: F401` to suppress linting warnings for unused imports, indicating future removal once compatibility is achieved.
      
      * lint fix
      
      * [Refactor] Update imports in tir.ir.py for improved compatibility
      
      - Removed unused import of `PrimExpr` from `tvm.script.ir_builder.tir` and replaced it with the correct import from `tvm.tir`.
      - Added import for `tir.ir` in `__init__.py` to enhance module accessibility and maintain compatibility with upstream changes.
      
      * [Refactor] Update function calls in tir.ir.py to return values
      
      - Modified the `serial`, `parallel`, `vectorized`, `unroll`, `thread_binding`, and `grid` functions to return the results of their respective calls to `_ir` methods, enhancing clarity and ensuring proper value propagation.
      
      * bugfix
      
      * [Enhancement] Add support for uint16 data type in TLCUDASourceWrapper
      
      - Introduced the "uint16" mapping to the type dictionary in the TLCUDASourceWrapper class, expanding the range of supported data types for CUDA operations.
      
      * bugfix
      
      * Uncomment main function call
      be0bf36d
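The message above notes that the proxied `serial`, `parallel`, `vectorized`, `unroll`, `thread_binding`, and `grid` functions were changed to return the results of the calls they forward to. The sketch below shows why a forwarding wrapper needs that `return`. The name `_upstream_serial` is a hypothetical stand-in for the upstream builder, not the real TVM function.

```python
def _upstream_serial(start, stop):
    """Hypothetical stand-in for the upstream loop builder being proxied."""
    return range(start, stop)


def serial_without_return(start, stop):
    # Forwards the call but drops the result, so callers receive None.
    _upstream_serial(start, stop)


def serial(start, stop):
    # Forwards the call and propagates the result to the caller.
    return _upstream_serial(start, stop)


dropped = serial_without_return(0, 3)
indices = list(serial(0, 3))
```

With the result dropped, a statement such as `for i in serial_without_return(0, 3)` fails because `None` is not iterable. Returning the forwarded value avoids that.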
  5. 26 Mar, 2025 4 commits
    • Yu Cheng's avatar
      [Feature] Introduce NoSetMaxNReg for warp specialization (#289) · 76435ca8
      Yu Cheng authored
      - Added NoSetMaxNReg as a new TIR built-in to indicate no register hint for warp-specialized branches.
      - Updated the warp specialization rewriter to handle the new NoSetMaxNReg operation, allowing for improved register management.
      - Enhanced the Python interface to include NoSetMaxNReg for consistency with TIR operations.
      76435ca8
    • Yu Cheng's avatar
    • Lei Wang's avatar
      [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      bf8a6fc1
    • yyttt6's avatar
      add autotune to example_gemm.py (#285) · 73d2c62e
      yyttt6 authored
      73d2c62e
  6. 25 Mar, 2025 5 commits
    • Lei Wang's avatar
      [Refactor] Update cache key generation in KernelCache (#283) · 7bd59f21
      Lei Wang authored
      - Changed the cache key generation to use the serialized script of the function instead of the function object itself, improving the uniqueness of cache keys.
      7bd59f21
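The change described above keys the cache on the function's serialized script rather than on the function object. A rough sketch of the idea follows, assuming the script is available as a string. `make_cache_key` and its `target` parameter are hypothetical and are not the `KernelCache` API.

```python
import hashlib
import json


def make_cache_key(script: str, target: str = "cuda") -> str:
    """Hash the serialized function text together with a compile option.

    Hypothetical helper: the name and the `target` parameter are assumptions.
    """
    payload = json.dumps({"script": script, "target": target}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_kernel():
    # Each call creates a new function object with identical source text.
    def kernel(a, b):
        return a + b

    return kernel


f1, f2 = build_kernel(), build_kernel()

# Keying on the object would treat f1 and f2 as different entries,
# because function objects compare by identity.
same_object = f1 is f2

# Keying on the serialized text gives both the same key.
script = "def kernel(a, b):\n    return a + b\n"
key_a = make_cache_key(script)
key_b = make_cache_key(script)
key_other = make_cache_key(script.replace("+", "-"))
```

Two functions built from the same source then share a cache entry, while any edit to the source text produces a different key.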
    • yyttt6's avatar
      [Refactor] Enhance Autotune (#266) · 541e1685
      yyttt6 authored
      * add autotune to example_gemm.py
      
      * format init.py
      541e1685
    • Lei Wang's avatar
      [Language] Introduce `T.ptr` and `T.Tensor` (#276) · 8ad53855
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      8ad53855
    • Wenhao Xie's avatar
      [CI] Add gemm performance test (#274) · 18f29277
      Wenhao Xie authored
      * [Typo] Fix formatting in installation instructions in README.md
      
      * [Enhancement] Improve CUDA path detection and update configuration handling
      
      * fix typo
      
      * remove IS_WINDOWS constant
      
      * lint fix
      
      * Improve error messages for CUDA detection failure
      
      * lint fix
      
      * lint fix
      
      * Fix .gitignore to correctly include venv directory
      
      * [Doc] Add instructions for installing nightly version of TileLang
      
      * update installation instructions
      
      * update install instruction
      
      * update performance ci
      
      * update
      
      * update
      
      * update
      
      * update ci workflow
      
      * delete test.yml
      
      * lint fix
      
      * update bot.yml
      
      * update bot.yml
      
      * remove changes in ci.yml
      18f29277
    • Xiaochuan Ye's avatar
      [Bugfix] Add CUDA availability check in CtypesKernelAdapter (#267) · 29b7d374
      Xiaochuan Ye authored
      * fix: Add CUDA availability check in CtypesKernelAdapter
      
      * fix: Add CUDA availability check in CythonKernelWrapper
      29b7d374
  7. 24 Mar, 2025 3 commits
    • Lei Wang's avatar
      [Refactor] Improve flash attention example and layout comparison logic (#270) · 5f5bf53c
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      5f5bf53c
    • Yu Cheng's avatar
      [Bugfix] Add TMA and Producer Buffer Analysis in Warp Specialized Rewriter (#269) · 2abd6ab7
      Yu Cheng authored
      - Introduced TMAFinder and ProducerUsedBufferFinder classes to analyze TMA loads and identify buffers used in producer conditions.
      - Enhanced WarpSpecializedRoleMarker to prepare and utilize the identified buffers during role marking.
      - Updated VisitStmt methods to incorporate new analysis logic for IfThenElse and For nodes, improving the handling of TMA loads in the warp specialization process.
      2abd6ab7
    • Lei Wang's avatar
      [Bugfix] Support `T.clear` for let binding (#268) · 47caf219
      Lei Wang authored
      * Fix indentation in JIT adapter wrapper to ensure consistent formatting of return statement in generated C code.
      
      * Enhance Fill Operation in TileLang
      
      - Updated the Fill constructor to support BufferLoad instances, adding checks for ramp indices and ensuring only stride 1 ramps are processed.
      - Introduced a region array to manage the bounds of the fill operation, improving error checking for static regions.
      - Modified the MakeSIMTLoop method to utilize the new region array for loop variable bounds, enhancing flexibility in kernel generation.
      - Updated the fill and clear functions in fill.py to accept both tir.Buffer and tir.BufferRegion types, improving usability and type handling.
      
      * Refactor Fill Operation and Improve Readability
      
      - Simplified the Fill constructor by enhancing the handling of BufferLoad instances and ensuring proper checks for ramp indices.
      - Improved error messages for region size checks to enhance clarity.
      - Cleaned up formatting in the Fill method for better readability.
      - Added a blank line in the matmul function test to improve code organization.
      - Introduced a blank line in the fill function to enhance readability in fill.py.
      
      * Add matrix multiplication functionality and test in TileLang
      
      - Introduced a new test file `test_tilelang_language_clear.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `__init__.py` in the utils module to include `map_torch_type`, enhancing type handling for tensor operations.
      
      * lint fix
      47caf219
  8. 23 Mar, 2025 3 commits
    • Lei Wang's avatar
      [Release] Bump version to 0.1.3 (#264) · 9981ac59
      Lei Wang authored
      * Bump version to 0.1.3
      
      * Refactor Docker script to streamline installation commands
      
      - Removed the installation of the Python environment and CMake from the Docker run command, simplifying the execution process.
      - Updated the command to focus on pip installation and running tox for testing across multiple Python versions.
      9981ac59
    • Lei Wang's avatar
      Refactor matrix multiplication benchmark and autotuner logging (#263) · 8c94de32
      Lei Wang authored
      - Updated `ref_program` in `benchmark_matmul.py` to remove the unused parameter `C`, simplifying the function signature.
      - Changed logging level in `autotuner/__init__.py` from `INFO` to `DEBUG` for more detailed logging during autotuning.
      - Modified the error handling in the autotuner to provide clearer messages and log errors at the debug level.
      - Enhanced error reporting in the JIT adapter by adding detailed context to error messages in `cython_wrapper.pyx` when kernel calls fail.
      8c94de32
    • Lei Wang's avatar
      [Language] Enhance alias to support blockwise memory load (#261) · 927e50d9
      Lei Wang authored
      * [Enhancement] Introduce caching control and frame management in TileLang
      
      - Added cache control functions (`enable_cache`, `disable_cache`, `is_cache_enabled`) in `env.py` to manage kernel caching behavior.
      - Updated `kernel_cache.py` to utilize the cache state, preventing unnecessary kernel compilation when caching is disabled.
      - Introduced a new `frame.py` module to manage LetFrame instances, including a stack for variable-value mapping and enhanced frame management.
      - Updated imports in various modules to accommodate new caching and frame functionalities, improving overall organization and clarity.
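
The cache switches listed above can be pictured with a small sketch. Only the names `enable_cache`, `disable_cache`, and `is_cache_enabled` come from the commit message; the module-level flag, the `get_or_compile` helper, and the choice to bypass the cache entirely when it is disabled are assumptions made for illustration, not TileLang's actual `env.py` or `kernel_cache.py`.

```python
# Illustrative stand-in for the cache switches; not TileLang's real code.
_cache_enabled = True
_kernel_cache = {}


def enable_cache():
    """Turn kernel caching on."""
    global _cache_enabled
    _cache_enabled = True


def disable_cache():
    """Turn kernel caching off."""
    global _cache_enabled
    _cache_enabled = False


def is_cache_enabled() -> bool:
    """Report the current cache state."""
    return _cache_enabled


def get_or_compile(key, compile_fn):
    """Hypothetical helper: consult the cache only while it is enabled."""
    if not is_cache_enabled():
        # Assumption: a disabled cache means every call compiles afresh.
        return compile_fn()
    if key not in _kernel_cache:
        _kernel_cache[key] = compile_fn()
    return _kernel_cache[key]
```

Keeping the flag behind three small functions lets callers toggle caching without touching the cache dictionary itself.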
      
      * [Refactor] Clean up and enhance caching and frame management in TileLang
      
      - Added spacing for improved readability in `env.py` and `frame.py`.
      - Refactored `LetFrame` class to enhance clarity in buffer region assignment.
      - Ensured consistent formatting and organization across caching control and frame management functions.
      
      * [Feature] Add matrix multiplication functionality in TileLang
      
      - Introduced a new test file `test_tilelang_language_alias.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul` function defines a kernel for performing tile-level GEMM operations, with support for customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated `gemm.py` to allow `tir.Buffer` or `tir.Var` as valid argument types for the `gemm` function, enhancing flexibility in argument handling.
      
      * [Refactor] Improve formatting and readability in test_tilelang_language_alias.py
      
      - Adjusted spacing and alignment in the `matmul` and `run_matmul` functions for better readability.
      - Cleaned up unnecessary blank lines and ensured consistent formatting throughout the file.
      - Enhanced overall code clarity without altering functionality.
      927e50d9
  9. 22 Mar, 2025 5 commits
    • Chaofan Lin's avatar
      [Bugfix] Fix Benchmark/Example Code for Autotuning (#254) · 0430cfe7
      Chaofan Lin authored
      
      
      * fix tune args
      
      * lint
      
      * Refactor gemm example and autotuner logging
      
      - Updated `ref_program` in `example_gemm.py` to return the result of matrix multiplication instead of modifying an input parameter.
      - Changed logging filename in `__init__.py` from 'out.log' to 'autotuner.log' for better clarity.
      - Modified JIT kernel compilation process to include `out_idx` directly in the adapter creation, enhancing flexibility.
      - Improved validation of `result_idx` in `BaseKernelAdapter` to ensure it falls within valid bounds.
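
A bounds check of the kind described for `result_idx` might look like the sketch below. The function name `normalize_result_idx`, the acceptance of negative indices, and the error type are assumptions for illustration; the real `BaseKernelAdapter` may behave differently.

```python
from typing import List, Optional, Sequence, Union


def normalize_result_idx(
    result_idx: Optional[Union[int, Sequence[int]]], num_params: int
) -> List[int]:
    """Return output indices as non-negative ints, rejecting out-of-range ones."""
    if result_idx is None:
        return []
    if isinstance(result_idx, int):
        result_idx = [result_idx]
    normalized = []
    for idx in result_idx:
        # Assumption: negative indices count from the end, as with Python lists.
        if not -num_params <= idx < num_params:
            raise ValueError(
                f"result_idx {idx} is out of range for {num_params} parameters"
            )
        normalized.append(idx % num_params)
    return normalized
```

For a kernel with three parameters, `normalize_result_idx(-1, 3)` maps the last parameter to index 2, while an index of 3 raises `ValueError`.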
      
      * Refactor `ref_program` in `benchmark_matmul_intrinsic.py` to use the `@` operator for matrix multiplication instead of `torch.matmul`, simplifying the implementation by removing the unused parameter `C`.
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      0430cfe7
    • Yichen Yan's avatar
      [CI] Use auditwheel to generate manylinux wheels (#251) · 60923344
      Yichen Yan authored
      
      
      * use auditwheel to get correct manylinux wheels
      
      * fix
      
      * make py3.8 happy
      
      * trivial updates
      
      * Add typing.Tuple import and update annotations
      
      * fmt
      
      * Remove unused import and update type hints
      
      * lint fix
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      60923344
    • You Jiacheng's avatar
      [Refactor] Move compilation outside critical section (#260) · 001e7b2a
      You Jiacheng authored
      
      
      * move compilation outside critical section
      
      * lint fix
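
The general pattern of moving slow work out of a critical section can be sketched as follows. This is a generic illustration, not the code from the commit: `lookup_or_build`, the dictionary cache, and the decision to tolerate a duplicate build when two threads race are all assumptions.

```python
import threading

_lock = threading.Lock()
_cache = {}


def lookup_or_build(key, build_fn):
    """Hold the lock only for dictionary access, never while building."""
    with _lock:
        if key in _cache:
            return _cache[key]
    # Slow step (for example, a compilation) runs without the lock held,
    # so other threads can keep reading and writing unrelated keys.
    built = build_fn()
    with _lock:
        # If another thread stored a result first, keep that one.
        return _cache.setdefault(key, built)
```

The trade-off is that two threads asking for the same missing key may both build it; the second result is simply discarded.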
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      001e7b2a
    • Lei Wang's avatar
      [Refactor] Refactor CUDA post-processing callback registration in TileLang (#259) · f47b43c5
      Lei Wang authored
      * Add GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
      - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
      - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
      - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.
      
      * Refactor TileLang examples and enhance kernel compilation
      
      - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
      - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
      - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
      - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
      - Cleaned up import statements in `__init__.py` for better organization and clarity.
      
      * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
      - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.
      
      * Refactor CUDA post-processing callback registration in TileLang
      
      - Introduced a new decorator `register_cuda_postproc_callback` for registering CUDA post-processing functions, enhancing usability and flexibility.
      - Updated existing callback implementations to utilize the new decorator, improving code clarity and maintainability.
      - Added debug prints to the CUDA code generation process for better traceability during development.
      - Refactored the `OptimizeForTarget` function to streamline conditional statement handling in the pipeline transformation.
      - Cleaned up the `inject_pipeline.cc` file by removing redundant code related to statement grouping and condition handling.
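
Decorator-based callback registration, as described above, generally follows the shape below. The registry dictionary, the `register_postproc_callback` and `run_postproc` names, and the `(code, target)` callback signature are hypothetical; the actual `register_cuda_postproc_callback` hooks into the compiler and its signature is not shown in this log.

```python
# Generic registry sketch; not TileLang's implementation.
_CALLBACKS = {}


def register_postproc_callback(func=None, *, name="cuda_postproc", override=True):
    """Usable as @register_postproc_callback or with keyword arguments."""

    def _register(fn):
        if name in _CALLBACKS and not override:
            raise ValueError(f"callback {name!r} is already registered")
        _CALLBACKS[name] = fn
        return fn

    # Bare decorator use passes the function directly; otherwise return
    # the inner decorator so the keyword arguments take effect.
    return _register(func) if callable(func) else _register


def run_postproc(code: str, target: str, name: str = "cuda_postproc") -> str:
    """Apply the registered callback, if any, to generated source text."""
    callback = _CALLBACKS.get(name)
    return callback(code, target) if callback else code


@register_postproc_callback
def add_banner(code, target):
    return f"// post-processed for {target}\n{code}"
```

Returning the original function from the decorator keeps `add_banner` callable directly, which is convenient for unit tests.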
      
      * lint fix
      
      * Enhance BlockSparse GEMM Example with Autotuning and Configurable Parameters
      
      - Added argument parsing to allow dynamic configuration of matrix dimensions and sparsity ratio.
      - Implemented a function to generate various kernel configurations for autotuning.
      - Refactored the main execution block to support both autotuned and default configurations.
      - Improved the block mask generation to accommodate specified sparsity levels.
      - Updated the kernel compilation process to utilize the new configurations and ensure accurate results verification.
      f47b43c5
    • Lei Wang's avatar
      [Example] Implement Kernel Example cumsum (#258) · cd9ec62e
      Lei Wang authored
      * Add GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
      - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
      - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
      - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.
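
To make the inclusive-scan idea concrete, here is a CPU-side sketch in plain Python. It uses a Hillis-Steele style doubling scan, which is an assumption: the commit message does not say which scan algorithm the GPU kernel uses. The function names are also illustrative, and the reference check uses `itertools.accumulate` instead of PyTorch so the sketch stays dependency-free.

```python
from itertools import accumulate


def inclusive_scan_row(row):
    """Inclusive prefix sum of one row; length must be a power of two."""
    n = len(row)
    if n == 0 or n & (n - 1):
        raise ValueError(f"row length {n} is not a power of two")
    out = list(row)
    stride = 1
    while stride < n:
        prev = list(out)  # snapshot, mimicking a synchronised parallel step
        for i in range(stride, n):
            out[i] = prev[i] + prev[i - stride]
        stride *= 2
    return out


def cumsum_2d(matrix):
    """Row-wise cumulative sum over a 2D list."""
    return [inclusive_scan_row(row) for row in matrix]


data = [[1, 2, 3, 4], [5, 6, 7, 8]]
# Validate against a straightforward sequential reference.
assert cumsum_2d(data) == [list(accumulate(row)) for row in data]
```

Each pass doubles the stride, so a row of length n needs log2(n) passes, which is why the length check matters in this formulation.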
      
      * Refactor TileLang examples and enhance kernel compilation
      
      - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
      - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
      - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
      - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
      - Cleaned up import statements in `__init__.py` for better organization and clarity.
      
      * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
      - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.
      cd9ec62e
  10. 21 Mar, 2025 2 commits
    • Lei Wang's avatar
      [Language] Introduce `T.alloc_var` to define a variable like `int var;` (#255) · c770a58f
      Lei Wang authored
      * [Enhancement] Add matrix multiplication functions for integer and float variables in Cython JIT
      
      - Introduced `matmul_int_variable` and `matmul_float_variable` functions to support matrix multiplication with dynamic shapes and additional parameters.
      - Implemented corresponding `run_matmul_int_variable` and `run_matmul_float_variable` functions for testing.
      - Updated test cases to validate the new matrix multiplication implementations.
      - Enhanced error handling in library initialization and compilation processes across various modules.
      - Improved dynamic memory handling in CUDA kernel initialization to provide better error reporting.
      
      * lint fix
      
      * optimize
      
      * Support var define
      
      * lint fix
      
      * Update TVM submodule and add alloc_variable function to allocate local variables in TileLang
      
      - Updated the TVM submodule to the latest commit.
      - Introduced `alloc_variable` function in `allocate.py` to support local variable allocation with specified data types and scopes.
      
      * lint fix
      
      * Refactor variable allocation functions for consistency
      
      - Renamed `alloc_variable` to `alloc_var` across multiple files for improved consistency.
      - Updated corresponding test functions to reflect the new naming convention.
      - Adjusted imports in `__init__.py` to align with the changes.
      c770a58f
    • yyttt6's avatar
      add autotune to example_gemm.py (#252) · 316d3b97
      yyttt6 authored
      * add autotune to example_gemm.py
      
      * add autotune to example_gemm.py
      
      * add autotune to example_gemm.py
      
      * add autotune to example_gemm.py
      316d3b97
  11. 20 Mar, 2025 3 commits
    • Lei Wang's avatar
      [Enhancement] Support float variable as arguments (#250) · 2d0c4169
      Lei Wang authored
      * [Enhancement] Add matrix multiplication functions for integer and float variables in Cython JIT
      
      - Introduced `matmul_int_variable` and `matmul_float_variable` functions to support matrix multiplication with dynamic shapes and additional parameters.
      - Implemented corresponding `run_matmul_int_variable` and `run_matmul_float_variable` functions for testing.
      - Updated test cases to validate the new matrix multiplication implementations.
      - Enhanced error handling in library initialization and compilation processes across various modules.
      - Improved dynamic memory handling in CUDA kernel initialization to provide better error reporting.
      
      * lint fix
      
      * optimize
      2d0c4169
    • Lei Wang's avatar
      Update bib citation (#249) · 4fcf6abe
      Lei Wang authored
      4fcf6abe
    • Lei Wang's avatar
      [Refactor] Phase out LLVM Dependency by Making it Optional (#247) · f2e99180
      Lei Wang authored
      * remove llvm build
      
      * [Refactor] Update kernel compilation and profiling in examples
      
      - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
      - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
      - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.
      
      * lint fix
      
      * License Update
      
      * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files
      
      - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
      - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.
      
      * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files
      
      - Improved comment alignment and readability in `cuda.h`.
      - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix
      
      * License update
      
      * [Enhancement] Update JITKernel to use artifact for kernel source
      
      - Assigned the generated artifact to `self.artifact` for better management.
      - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.
      
      * lint fix
      
      * Add @tilelang.testing.requires_llvm decorator to vectorization tests
      
      * Enhance setup.py and env.py for library management
      
      - Added functionality to remove original files after copying in CMakeBuild.
      - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.
      
      * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py
      
      * Refactor CMakeBuild file handling in setup.py
      
      - Added a check to ensure the target library directory exists before copying .so files.
      - Improved the logic for creating the target directory and copying files to enhance robustness.
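
The directory check before copying `.so` files could be written roughly as below. This is a sketch under assumptions: `copy_shared_libs` is a hypothetical helper, the flat `*.so` glob is a guess at the file selection, and the real `CMakeBuild` logic in `setup.py` is not reproduced here.

```python
import shutil
from pathlib import Path


def copy_shared_libs(build_dir, target_dir):
    """Copy every .so file from build_dir into target_dir, creating it first."""
    target = Path(target_dir)
    # Create the destination (and any missing parents) before copying.
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    for lib in sorted(Path(build_dir).glob("*.so")):
        destination = target / lib.name
        shutil.copy2(lib, destination)  # copy2 also preserves timestamps
        copied.append(destination)
    return copied
```

Creating the directory with `exist_ok=True` makes the step safe to repeat across rebuilds.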
      
      * bugfix
      
      * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.
      
      * lint fix
      
      * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.
      
      * lint fix
      
      * Add support for C target in device code generation
      
      - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.
      
      * [Enhancement] Implement auto-clear cache feature based on environment variable
      
      * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
      * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
      * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
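
An environment-controlled cache reset along these lines is sketched below. Only the variable name `TILELANG_CLEAR_CACHE` comes from the commit message; the accepted spellings of "true", the helper names, and the use of `shutil.rmtree` are assumptions for illustration.

```python
import os
import shutil
from pathlib import Path

# Assumption: these spellings all count as "true".
_TRUTHY = {"1", "true", "yes", "on"}


def should_clear_cache(environ=None) -> bool:
    """Read TILELANG_CLEAR_CACHE from the given mapping or os.environ."""
    env = os.environ if environ is None else environ
    return env.get("TILELANG_CLEAR_CACHE", "").strip().lower() in _TRUTHY


def init_cache_dir(cache_dir, environ=None) -> Path:
    """Create the cache directory, wiping it first when the flag is set."""
    path = Path(cache_dir)
    if should_clear_cache(environ) and path.exists():
        shutil.rmtree(path)
    path.mkdir(parents=True, exist_ok=True)
    return path
```

Passing the environment as an argument keeps the helper easy to test without mutating the process environment.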
      
      * [Refactor] Update kernel invocation and import paths in tests and cache
      
      * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
      * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
      * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.
      
      * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py
      
      * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
      * Enhanced overall code formatting to align with project standards.
      
      * [Enhancement] Add bfloat16 test case and improve kernel caching logic
      
      * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
      * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
      * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
      * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
      * Improved code formatting and readability across several files.
      
      * lint fix
      
      * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
      f2e99180
  12. 19 Mar, 2025 3 commits
    • Chenghua's avatar
      [Examples] Implement elementwise add kernel (#219) · 43bd9d3e
      Chenghua authored
      * [Example] Modify tuning configurations for FlashAttention example
      
      * [Examples] formatting example_gqa_fwd_bshd.py
      
      * [Examples] Implement elementwise add kernel
      
      * [Doc] Update ElementWise Operators document
      
      * [Examples] Replace the example of elementwise add.
      43bd9d3e
    • alex_xiao's avatar
      [Feature] Add database storage for JITKernel cache with Cython and Ctypes adapters (#213) · e789808b
      alex_xiao authored
      
      
      * [Dev] Add database mechanism to cache
      
      * [Dev] Fix database cache and test for it
      
      * [Dev] Refactor env.py to use TILELANG_CACHE_DIR and remove extra comment.
      
      * [Refactor] Improve code formatting and readability in multiple files
      
      * [Enhancement] Add execution backend options and improve kernel adapter initialization
      
      * [Refactor] Rename cached function to cached_kernel and update related references
      
      * [Enhancement] Enable target and target_host parameters in kernel loading and improve gemm test case
      
      * [Enhancement] Update kernel compilation to specify execution backend as "cython"
      
      * [Refactor] Rename cached_kernel to cached and update references in the codebase
      
      * [Enhancement] Un-comment and add test cases for matrix multiplication correctness; improve kernel caching logic and remove redundant code
      
      * [Refactor] Clean up code formatting and improve readability in cache and adapter modules
      
      * [Refactor] Remove unused imports
      
      * [Refactor] Update cached function signature to use PrimFunc and Optional types for improved type safety
      
      * [Refactor] Update cached function calls to use PrimFunc and improve parameter handling
      
      * [Refactor] Clean up import statements and improve code formatting in cache and kernel test files
      
      * Update tilelang/jit/kernel.py
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      e789808b
    • Yuxi Chi's avatar
      [Enhancement][CUDA] Avoid C7508 for CUDA backend by assigning a default value to `minBlocksPerMultiprocessor` (#248) · efceb6ed
      Yuxi Chi authored
      efceb6ed