Commits · 9247a8792ff10c16b740bea624b9be04ab8fbe86 · OpenDAS / tilelang

13 Jun, 2025 1 commit

[Enhancement] Include Metadata (LayoutMap etc.) into hashing (#570) · 9247a879

Lei Wang authored Jun 13, 2025

- Modified the serialization of function scripts in both KernelCache and AutoTunerCache to include metadata by setting `show_meta=True` when generating the script that is passed to `cloudpickle.dumps()`. Hash keys for kernel configurations now reflect metadata such as the LayoutMap, improving cache accuracy and consistency.
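
A minimal sketch of why printing metadata into the hashed script matters. `ToyFunc` is a hypothetical stand-in for a scriptable function (it is not a TileLang or TVM class), and hashing the script text with `hashlib.sha256` is an assumed scheme; the real cache code may build its key differently.

```python
import hashlib


class ToyFunc:
    """Hypothetical stand-in for a function that can print itself as a script."""

    def __init__(self, body: str, metadata: dict):
        self.body = body
        self.metadata = metadata

    def script(self, show_meta: bool = False) -> str:
        # With show_meta=False the metadata section is left out of the text.
        if not show_meta:
            return self.body
        meta_lines = [f"# meta: {k}={self.metadata[k]}" for k in sorted(self.metadata)]
        return "\n".join(meta_lines + [self.body])


def cache_key(func: ToyFunc, show_meta: bool) -> str:
    """Hash the script text; metadata affects the key only if it is printed."""
    text = func.script(show_meta=show_meta)
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


# Two functions with the same body but different layout metadata.
a = ToyFunc("C = A @ B", {"layout": "row_major"})
b = ToyFunc("C = A @ B", {"layout": "swizzled"})

collide_without_meta = cache_key(a, False) == cache_key(b, False)
differ_with_meta = cache_key(a, True) != cache_key(b, True)
```

Without the metadata in the text, the two functions share one key and could be served each other's cached kernel; with it, the keys diverge.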


11 Jun, 2025 1 commit

[Bugfix] Add `__tune_params` into key hash for autotuning (#565) · ae386a7b

Lei Wang authored Jun 11, 2025

* [Enhancement] Update AutoTuner and Profiler for improved kernel handling and output validation

- Modified AutoTuner to store cache in a dedicated "autotuner" directory.
- Enhanced kernel source code saving logic in AutotuneResult and AutoTunerCache to check for None before writing.
- Updated Profiler to handle None outputs gracefully during tensor comparisons, improving robustness in output validation.

* lint fix
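
The "check for None before writing" change above can be sketched as follows. `save_kernel_source` and the `kernel.cu` file name are hypothetical, chosen only for illustration; the "autotuner" subdirectory mirrors the directory name mentioned in the message.

```python
import os
import tempfile
from typing import Optional


def save_kernel_source(directory: str, source: Optional[str],
                       filename: str = "kernel.cu") -> Optional[str]:
    """Write the kernel source if there is one; return the path, or None."""
    if source is None:
        # Nothing to persist, so skip the write instead of failing.
        return None
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, filename)
    with open(path, "w", encoding="utf-8") as f:
        f.write(source)
    return path


with tempfile.TemporaryDirectory() as tmp:
    skipped = save_kernel_source(tmp, None)
    written = save_kernel_source(os.path.join(tmp, "autotuner"), "// kernel source")
    written_exists = os.path.exists(written)
```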


09 Jun, 2025 1 commit

[Enhancement] Optimize debug info for auto tuning (#560) · e5e36dbf

Lei Wang authored Jun 09, 2025



* [Enhancement] Update AutoTuner and JIT compilation arguments

* Added functionality to return compile arguments in the JIT implementation, enhancing the autotuner's caching capabilities.
* Modified `CompileArgs` and `AutotuneResult` classes to support optional `out_idx` parameter, improving flexibility in compile argument handling.
* Refactored the `_AutoTunerImplementation` to utilize the new compile arguments, ensuring better integration and performance during tuning processes.

* Update tilelang/autotuner/param.py
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

* remove redundant comments

* Refactor kernel source retrieval and logging levels

* Updated `AutotuneResult` to use `kernel.get_kernel_source()` instead of `kernel.adapter.get_kernel_source()`.
* Changed logging level in `KernelCache` from `ERROR` to `DEBUG` for improved verbosity during kernel caching operations.
* Removed unnecessary verbose logging in JIT compilation process to streamline output.

* Merge branch 'main' of https://github.com/tile-ai/tilelang into bugfix_autotune_0604

* lint fix

---------
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>


05 Jun, 2025 1 commit

[Enhancement] Add nvrtc execution backend (#461) · 17f7394f

Gabriel Wu authored Jun 05, 2025



* [wip] feat: add nvrtc backend

* [wip] fix: handle out_idx

* [wip] refactor: move lib logic to libgen

* feat: cache for nvrtc backend

* fmt: run format

* fix: handle cuda bindings import error

* fix: handle cuda bindings import error

* fix: handle cuda bindings import error

* fix: handle cuda bindings import error

* fix: get kernel source

* refactor: speedup pyimport

* Improve error handling for missing cuda-python dependency in nvrtc backend. Raise ImportError with detailed installation instructions instead of logging a warning.

* Enhance nvrtc backend error handling by introducing a flag to check for cuda-python availability. Raise ImportError with detailed installation instructions during initialization if the nvrtc backend is unavailable, improving user experience and clarity.

* Update README.md to include recent NVRTC Backend addition, highlighting reduced compilation time for CUDA templates.

* fix tl_templates

* ensure CUDA context

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
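
The availability-flag pattern described in this commit (detect the optional dependency once, then raise `ImportError` with install instructions at initialization) might look roughly like this. `ToyNvrtcAdapter` and `is_module_available` are hypothetical names, and the sketch assumes `cuda` is the top-level package provided by cuda-python.

```python
import importlib.util


def is_module_available(name: str) -> bool:
    """Return True if `name` can be found, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


# Evaluated once at import time (assumption: cuda-python installs `cuda`).
NVRTC_AVAILABLE = is_module_available("cuda")


class ToyNvrtcAdapter:
    """Hypothetical adapter that refuses to start without its dependency."""

    def __init__(self, available: bool = NVRTC_AVAILABLE):
        if not available:
            raise ImportError(
                "The nvrtc backend requires cuda-python. "
                "Install it with `pip install cuda-python`."
            )


# Force the failure path to show the message a user would see.
try:
    ToyNvrtcAdapter(available=False)
    message = ""
except ImportError as exc:
    message = str(exc)
```

Raising at initialization rather than logging a warning at import time means users who never select the backend are not affected, while those who do get an actionable error.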


28 May, 2025 1 commit

[Autotune] Introduce cache mechanism for auto tuner (#527) · 7171aff6

Lei Wang authored May 28, 2025

* [Enhancement] Add commit ID to versioning and improve logging initialization

* Updated `get_tilelang_version` to include an optional commit ID in the version string.
* Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
* Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
* Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
* Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.

* [Refactor] Enhance AutoTuner and JITKernel for improved performance and caching

* Refactored the AutoTuner class to include new methods for setting compilation and profiling arguments, enhancing configurability.
* Introduced caching mechanisms for tuning results, allowing for faster retrieval of previously computed configurations.
* Updated JITKernel to store tuning results, including latency and configuration details, improving the kernel's performance tracking.
* Added new methods for generating cache keys and saving/loading results to/from disk, streamlining the tuning process.
* Enhanced the overall structure and readability of the autotuning logic, ensuring better maintainability and clarity.
* Minor adjustments in related modules to support the new caching and profiling features.

* [Refactor] Clean up code formatting and improve readability in AutoTuner and related modules

* Consolidated import statements and removed unnecessary line breaks for better readability.
* Standardized function argument formatting across the AutoTuner and CompileArgs classes.
* Enhanced consistency in the use of whitespace and indentation throughout the codebase.
* Minor adjustments in the Profiler and JITKernel classes to improve clarity and maintainability.
* Ensured that all changes adhere to the project's coding style guidelines.

* [Refactor] Remove redundant type hints in AutoTuner modules

* Simplified import statements in `__init__.py` and `param.py` by removing unnecessary duplicate type hints for `Any`.
* Improved code readability and maintainability by streamlining type imports across the AutoTuner module.

* [Refactor] Update AutoTuner configuration for improved profiling and target detection

* Enhanced the AutoTuner configuration across multiple examples by adding `set_profile_args` to better manage profiling settings.
* Standardized the use of `target="auto"` in compile arguments to ensure automatic target detection.
* Removed redundant target specifications in certain instances to streamline the configuration process.
* Improved overall clarity and maintainability of the autotuning logic in various example scripts.

* [Refactor] Simplify code formatting and improve readability in example scripts

* Consolidated function argument formatting in `benchmark_mla_decode_amd_tilelang.py`, `example_elementwise_add.py`, and `performance.py` for better clarity.
* Removed unnecessary line breaks and standardized argument placement across multiple files.
* Enhanced overall code readability and maintainability in autotuning examples and performance scripts.

* [Refactor] Update JIT decorator usage across multiple files

* Removed redundant parameters from the JIT decorator in various benchmark and example scripts, simplifying the code.
* Standardized the import of the JIT decorator from `tilelang`, enhancing consistency across the codebase.
* Improved overall readability and maintainability by consolidating import statements and cleaning up function definitions.

* [Refactor] Standardize JIT decorator formatting across benchmark and example scripts

* Simplified the formatting of the JIT decorator in multiple files by removing unnecessary line breaks.
* Enhanced code readability and consistency in the usage of the JIT decorator across benchmark and example scripts.
* Improved overall maintainability by ensuring uniformity in function definitions and decorator usage.
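
A rough sketch of the tuning-result cache this commit introduces: derive a key, save the best configuration and latency to disk, and load it on a later run. The function names, the JSON layout, and the `best_config.json` file name are assumptions made for illustration, not the actual AutoTuner format.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Optional


def make_key(kernel_name: str, tune_params: dict) -> str:
    """Derive a stable key; sort_keys makes it independent of dict order."""
    payload = json.dumps({"kernel": kernel_name, "params": tune_params}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def save_result(cache_dir: Path, key: str, config: dict, latency_ms: float) -> Path:
    """Persist the best configuration and its latency under cache_dir/key/."""
    entry = cache_dir / key
    entry.mkdir(parents=True, exist_ok=True)
    path = entry / "best_config.json"
    path.write_text(json.dumps({"config": config, "latency_ms": latency_ms}))
    return path


def load_result(cache_dir: Path, key: str) -> Optional[dict]:
    """Return the cached result, or None on a cache miss."""
    path = cache_dir / key / "best_config.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "autotuner"
    key = make_key("matmul", {"block_M": [64, 128], "block_N": [64, 128]})
    miss = load_result(cache_dir, key)  # nothing stored yet
    save_result(cache_dir, key, {"block_M": 128, "block_N": 64}, latency_ms=0.42)
    hit = load_result(cache_dir, key)   # second lookup skips tuning
```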


26 May, 2025 1 commit

[Enhancement] Add commit ID to versioning and improve logging initialization (#524) · 62a8d7f0

Lei Wang authored May 27, 2025

* Updated `get_tilelang_version` to include an optional commit ID in the version string.
* Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
* Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
* Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
* Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
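
One way to retrieve a commit hash and fold it into a version string is sketched below. `get_commit_id` and `format_version` are hypothetical helpers, and appending the hash after a `+` is an assumed format; the actual `version.py` may differ.

```python
import subprocess
from typing import Optional


def get_commit_id(cwd: Optional[str] = None) -> Optional[str]:
    """Return the current git commit hash, or None if it cannot be determined."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=cwd, stderr=subprocess.DEVNULL
        )
    except (subprocess.CalledProcessError, OSError):
        # Not a git checkout, or git is not installed.
        return None
    return out.decode("utf-8").strip() or None


def format_version(base: str, commit_id: Optional[str]) -> str:
    """Append the commit ID when one is known; otherwise return base unchanged."""
    if not commit_id:
        return base
    return f"{base}+{commit_id}"
```

Keeping the commit ID optional lets source tarballs and wheels built outside a git checkout still report a plain version.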


06 May, 2025 2 commits

[Feature] Add cache directory management functions in tilelang.cache (#453) · 0aaef97d

Lei Wang authored May 06, 2025

* [Feature] Add cache directory management functions in tilelang.cache

* Introduced `get_cache_dir` and `set_cache_dir` functions to manage the kernel cache directory.
* Updated `KernelCache` class to store cache directory as a `Path` object for improved path handling.
* Enhanced documentation with examples for new cache directory functions.

* [Refactor] Update cache imports in tilelang.__init__.py

* Added `set_cache_dir` and `get_cache_dir` functions to the import statement for improved cache directory management.
* This change enhances the accessibility of cache directory management functions within the module.
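
A standalone sketch of the getter/setter pair described above, with the directory held as a `Path`. This is not the `tilelang.cache` implementation: `ToyCacheDirHolder` and the default location are assumptions, and only the function names are taken from the message.

```python
from pathlib import Path


class ToyCacheDirHolder:
    """Hypothetical cache object that keeps its directory as a Path."""

    def __init__(self, cache_dir: str = "~/.cache/toy_kernels"):
        # Assumed default location, for illustration only.
        self.cache_dir = Path(cache_dir).expanduser()


_cache = ToyCacheDirHolder()


def set_cache_dir(path) -> None:
    """Point the shared cache at a new directory (accepts str or Path)."""
    _cache.cache_dir = Path(path).expanduser()


def get_cache_dir() -> Path:
    """Return the directory currently used by the shared cache."""
    return _cache.cache_dir


set_cache_dir("/tmp/my_kernels")
entry_path = get_cache_dir() / "abc123" / "kernel.cu"  # Path supports `/` joins
```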


[Enhancement] Introduce pass_configs parameter for kernel Caching (#452) · b1ba0cc8

Lei Wang authored May 06, 2025

* [Enhancement] Introduce pass_configs parameter for kernel compilation

* Added a new `pass_configs` parameter to the `tilelang.compile` function to allow for more flexible kernel compilation configurations.
* Updated related classes and methods to accommodate the new parameter, ensuring compatibility across the codebase.
* Enhanced the `torch_assert_close` function to include customizable tensor names for better debugging output.
* Refactored input handling in example scripts to streamline the process of obtaining inputs for kernel execution.

* lint fix


27 Apr, 2025 1 commit

[Typo] Rename warp_source into wrap_source (#440) · 72fea0a2

Gabriel Wu authored Apr 27, 2025

* Fix typo

* bugfix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

31 Mar, 2025 1 commit

[Cache] Implement in-memory cache (#308) · 5802c01b

Lei Wang authored Mar 31, 2025

* [Enhancement] Add support for CUDA architecture 8.9 in GEMM template

- Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
- This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.

* lintfix

* [Refactor] Clean up includes in gemm_sm89.h

- Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
- This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.

* [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings

- Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
- Updated the __new__ method to initialize the memory cache and added logic to check the cache before loading from disk.
- Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
- Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.

* lint fix
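
The in-memory layer described above (initialized in `__new__`, consulted before disk, and emptied by `clear_cache`) can be sketched like this. `ToyMemoryCache`, its JSON-file storage, and the `disk_reads` counter are hypothetical and exist only to make the behaviour observable.

```python
import json
import tempfile
import threading
from pathlib import Path


class ToyMemoryCache:
    """Hypothetical singleton cache: a dict in front of JSON files on disk."""

    _instance = None
    _lock = threading.Lock()

    def __new__(cls, cache_dir):
        with cls._lock:
            if cls._instance is None:
                inst = super().__new__(cls)
                inst.cache_dir = Path(cache_dir)
                inst._memory = {}    # in-memory layer, created exactly once
                inst.disk_reads = 0  # how often the disk had to be consulted
                cls._instance = inst
        return cls._instance

    def put(self, key: str, value: dict) -> None:
        self.cache_dir.mkdir(parents=True, exist_ok=True)
        (self.cache_dir / f"{key}.json").write_text(json.dumps(value))
        self._memory[key] = value

    def get(self, key: str):
        if key in self._memory:  # fast path: no disk access
            return self._memory[key]
        path = self.cache_dir / f"{key}.json"
        if not path.exists():
            return None
        self.disk_reads += 1
        value = json.loads(path.read_text())
        self._memory[key] = value  # remember for the next lookup
        return value

    def clear_cache(self) -> None:
        """Drop both the in-memory entries and the files on disk."""
        self._memory.clear()
        if self.cache_dir.exists():
            for path in self.cache_dir.glob("*.json"):
                path.unlink()


with tempfile.TemporaryDirectory() as tmp:
    cache = ToyMemoryCache(tmp)
    cache.put("k1", {"latency_ms": 1.5})
    cache._memory.clear()       # simulate a fresh process with a warm disk
    first = cache.get("k1")     # read from disk
    second = cache.get("k1")    # served from memory
    reads_after_two_gets = cache.disk_reads
    cache.clear_cache()
    after_clear = cache.get("k1")
```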


28 Mar, 2025 1 commit

[Refactor] Improve documentation and add detailed docstrings across multiple modules (#298) · 3f294650

Lei Wang authored Mar 28, 2025

* [Enhancement] Update AtomicAdd functions for BFLOAT16 in common.h

- Added conditional compilation for BFLOAT16 atomic operations to ensure compatibility with CUDA architectures greater than 7.5.
- Improved code clarity by organizing the AtomicAdd functions and adding relevant comments for better understanding.

* [Enhancement] Improve documentation and add detailed docstrings across multiple modules

- Updated the `__init__.py` file to enhance module documentation, providing clarity on auto-tuning functionalities.
- Added comprehensive docstrings to the `JITContext`, `AutotuneResult`, and `AutoTuner` classes, detailing their attributes and methods.
- Enhanced memory allocation utilities in `allocate.py` with detailed descriptions for each allocation function.
- Improved documentation for various intrinsic operations in `builtin.py`, `copy.py`, `customize.py`, `frame.py`, `gemm.py`, `memscope.py`, and `reduce.py`, ensuring clear explanations of parameters and return values.
- Refactored the `KernelCache` class to improve clarity and maintainability, including detailed comments and docstrings for methods.
- Overall, these changes aim to enhance code readability and provide better guidance for future developers and users of the Tile-AI framework.


25 Mar, 2025 1 commit

[Refactor] Update cache key generation in KernelCache (#283) · 7bd59f21

Lei Wang authored Mar 25, 2025

- Changed the cache key generation to use the serialized script of the function instead of the function object itself, improving the uniqueness of cache keys.
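
A small illustration of one plausible motivation for this change: two function objects built from identical code are distinct Python objects, while their script text is identical. `ToyPrimFunc` is a hypothetical stand-in, and hashing with `hashlib.sha256` is an assumption.

```python
import hashlib


class ToyPrimFunc:
    """Hypothetical stand-in for a function object that can print its script."""

    def __init__(self, source: str):
        self._source = source

    def script(self) -> str:
        return self._source


def key_from_script(func: ToyPrimFunc) -> str:
    """Key derived from content: equal scripts always give equal keys."""
    return hashlib.sha256(func.script().encode("utf-8")).hexdigest()


f1 = ToyPrimFunc("for i in range(8): C[i] = A[i] + B[i]")
f2 = ToyPrimFunc("for i in range(8): C[i] = A[i] + B[i]")

same_object = f1 is f2                                  # two distinct objects
same_key = key_from_script(f1) == key_from_script(f2)   # identical script text
```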


23 Mar, 2025 1 commit

[Language] Enhance alias to support blockwise memory load (#261) · 927e50d9

Lei Wang authored Mar 23, 2025

* [Enhancement] Introduce caching control and frame management in TileLang

- Added cache control functions (`enable_cache`, `disable_cache`, `is_cache_enabled`) in `env.py` to manage kernel caching behavior.
- Updated `kernel_cache.py` to utilize the cache state, preventing unnecessary kernel compilation when caching is disabled.
- Introduced a new `frame.py` module to manage LetFrame instances, including a stack for variable-value mapping and enhanced frame management.
- Updated imports in various modules to accommodate new caching and frame functionalities, improving overall organization and clarity.

* [Refactor] Clean up and enhance caching and frame management in TileLang

- Added spacing for improved readability in `env.py` and `frame.py`.
- Refactored `LetFrame` class to enhance clarity in buffer region assignment.
- Ensured consistent formatting and organization across caching control and frame management functions.

* [Feature] Add matrix multiplication functionality in TileLang

- Introduced a new test file `test_tilelang_language_alias.py` that implements a matrix multiplication function using TileLang's primitives.
- The `matmul` function defines a kernel for performing tile-level GEMM operations, with support for customizable block sizes and data types.
- Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
- Updated `gemm.py` to allow `tir.Buffer` or `tir.Var` as valid argument types for the `gemm` function, enhancing flexibility in argument handling.

* [Refactor] Improve formatting and readability in test_tilelang_language_alias.py

- Adjusted spacing and alignment in the `matmul` and `run_matmul` functions for better readability.
- Cleaned up unnecessary blank lines and ensured consistent formatting throughout the file.
- Enhanced overall code clarity without altering functionality.
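
The cache control functions named in this commit can be pictured with the sketch below. The three toggle names come from the message; `toy_compile`, `cached`, and the `stats` counter are hypothetical, and bypassing the cache entirely when it is disabled is an assumed behaviour.

```python
_CACHE_ENABLED = True
_store = {}
stats = {"compiles": 0}  # counts compilations so caching is observable


def enable_cache() -> None:
    global _CACHE_ENABLED
    _CACHE_ENABLED = True


def disable_cache() -> None:
    global _CACHE_ENABLED
    _CACHE_ENABLED = False


def is_cache_enabled() -> bool:
    return _CACHE_ENABLED


def toy_compile(source: str) -> str:
    """Pretend compilation step."""
    stats["compiles"] += 1
    return f"compiled<{source}>"


def cached(source: str) -> str:
    if not is_cache_enabled():
        return toy_compile(source)  # bypass the cache entirely
    if source not in _store:
        _store[source] = toy_compile(source)
    return _store[source]
```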


22 Mar, 2025 2 commits

[Refactor] Move compilation outside critical section (#260) · 001e7b2a

You Jiacheng authored Mar 23, 2025



* move compilation outside critical section

* lint fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
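
"Moving compilation outside the critical section" generally means holding the lock only for the dictionary lookup and the final store. The sketch below shows that general pattern with hypothetical names (`toy_compile`, `get_or_compile`); it is not the code from this commit.

```python
import threading

_lock = threading.Lock()
_cache = {}


def toy_compile(key: str) -> str:
    """Stand-in for a slow compilation step."""
    return f"kernel<{key}>"


def get_or_compile(key: str) -> str:
    # 1) Short critical section: only look up the shared dict.
    with _lock:
        hit = _cache.get(key)
    if hit is not None:
        return hit

    # 2) The slow work runs without the lock, so threads asking for
    #    other keys are not blocked behind this compilation.
    kernel = toy_compile(key)

    # 3) Short critical section again: publish the result. setdefault keeps
    #    the first stored value if another thread finished the same key first.
    with _lock:
        return _cache.setdefault(key, kernel)
```

The trade-off is that two threads may occasionally compile the same key redundantly; in exchange, a long compilation never serializes unrelated lookups.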


[Example] Implement Kernel Example cumsum (#258) · cd9ec62e

Lei Wang authored Mar 22, 2025

* Add GPU kernel for 2D continuous cumulative sum in TileLang example

- Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
- Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
- Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
- Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.

* Refactor TileLang examples and enhance kernel compilation

- Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
- Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
- Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
- Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
- Cleaned up import statements in `__init__.py` for better organization and clarity.

* Enhance GPU kernel for 2D continuous cumulative sum in TileLang example

- Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
- Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.
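
For readers unfamiliar with inclusive scans, here is a CPU-side sketch in plain Python of a doubling-offset (Hillis-Steele) scan with a power-of-two check, validated against `itertools.accumulate`. It scans each row independently, which is an assumption about what "2D" means here, and it is not the GPU kernel from the example script.

```python
from itertools import accumulate
from typing import List


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def inclusive_scan(row: List[int]) -> List[int]:
    """Hillis-Steele inclusive scan: log2(n) passes with doubling offsets."""
    n = len(row)
    if not is_power_of_two(n):
        raise ValueError(f"length must be a power of two, got {n}")
    out = list(row)
    offset = 1
    while offset < n:
        # Each pass reads the previous pass's values, as parallel threads would.
        prev = list(out)
        for i in range(offset, n):
            out[i] = prev[i] + prev[i - offset]
        offset *= 2
    return out


def cumsum_2d(matrix: List[List[int]]) -> List[List[int]]:
    """Apply the scan to every row."""
    return [inclusive_scan(r) for r in matrix]


data = [[1, 2, 3, 4], [5, 6, 7, 8]]
result = cumsum_2d(data)
reference = [list(accumulate(r)) for r in data]  # sequential reference
```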


20 Mar, 2025 1 commit

[Refactor] Phaseout LLVM Dependency by Making it Optional (#247) · f2e99180

Lei Wang authored Mar 20, 2025

* remove llvm build

* [Refactor] Update kernel compilation and profiling in examples

- Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
- Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
- Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.

* lint fix

* License Update

* [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files

- Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
- Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.

* [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files

- Improved comment alignment and readability in `cuda.h`.
- Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.

* lint fix

* fix

* License update

* [Enhancement] Update JITKernel to use artifact for kernel source

- Assigned the generated artifact to `self.artifact` for better management.
- Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.

* lint fix

* Add @tilelang.testing.requires_llvm decorator to vectorization tests

* Enhance setup.py and env.py for library management

- Added functionality to remove original files after copying in CMakeBuild.
- Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.

* Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py

* Refactor CMakeBuild file handling in setup.py

- Added a check to ensure the target library directory exists before copying .so files.
- Improved the logic for creating the target directory and copying files to enhance robustness.

* bugfix

* Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.

* lint fix

* Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.

* lint fix

* Add support for C target in device code generation

- Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.

* [Enhancement] Implement auto-clear cache feature based on environment variable

* Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
* Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
* Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.

* [Refactor] Update kernel invocation and import paths in tests and cache

* Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
* Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
* Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.

* [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py

* Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
* Enhanced overall code formatting to align with project standards.

* [Enhancement] Add bfloat16 test case and improve kernel caching logic

* Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
* Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
* Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
* Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
* Improved code formatting and readability across several files.

* lint fix

* Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
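
The `TILELANG_CLEAR_CACHE` behaviour listed in this commit (clear the cache at initialization when the variable is set to true) might be sketched as follows. `env_flag` and `init_cache` are hypothetical helpers, and the accepted spellings of "true" are an assumption.

```python
import os
import shutil
import tempfile
from pathlib import Path


def env_flag(name: str, default: str = "0") -> bool:
    """Interpret an environment variable as a boolean (spellings assumed)."""
    return os.environ.get(name, default).strip().lower() in ("1", "true", "yes", "on")


def init_cache(cache_dir: Path) -> Path:
    """Create the cache directory, wiping it first if the flag is set."""
    if env_flag("TILELANG_CLEAR_CACHE") and cache_dir.exists():
        shutil.rmtree(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    return cache_dir


with tempfile.TemporaryDirectory() as tmp:
    cache_dir = Path(tmp) / "cache"
    init_cache(cache_dir)
    (cache_dir / "stale.bin").write_bytes(b"old")

    os.environ["TILELANG_CLEAR_CACHE"] = "0"
    init_cache(cache_dir)
    kept = (cache_dir / "stale.bin").exists()         # flag off: entry survives

    os.environ["TILELANG_CLEAR_CACHE"] = "true"
    init_cache(cache_dir)
    cleared = not (cache_dir / "stale.bin").exists()  # flag on: entry removed

    del os.environ["TILELANG_CLEAR_CACHE"]
```

Setting the variable in CI, as the message describes, gives every test run a cold cache so stale kernels cannot mask a regression.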


19 Mar, 2025 1 commit

[Feature] Add database storage for JITKernel cache with Cython and Ctypes adapters (#213) · e789808b

alex_xiao authored Mar 19, 2025



* [Dev] Add database mechanism to cache

* [Dev] Fix database cache and test for it

* [Dev] Refactor env.py to use TILELANG_CACHE_DIR and remove extra comment.

* [Refactor] Improve code formatting and readability in multiple files

* [Enhancement] Add execution backend options and improve kernel adapter initialization

* [Refactor] Rename cached function to cached_kernel and update related references

* [Enhancement] Enable target and target_host parameters in kernel loading and improve gemm test case

* [Enhancement] Update kernel compilation to specify execution backend as "cython"

* [Refactor] Rename cached_kernel to cached and update references in the codebase

* [Enhancement] Un-comment and add test cases for matrix multiplication correctness; improve kernel caching logic and remove redundant code

* [Refactor] Clean up code formatting and improve readability in cache and adapter modules

* [Refactor] Remove unused imports

* [Refactor] Update cached function signature to use PrimFunc and Optional types for improved type safety

* [Refactor] Update cached function calls to use PrimFunc and improve parameter handling

* [Refactor] Clean up import statements and improve code formatting in cache and kernel test files

* Update tilelang/jit/kernel.py

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
