1. 12 Dec, 2025 1 commit
  2. 02 Dec, 2025 1 commit
  3. 18 Nov, 2025 2 commits
    • [FFI] Use tvm ffi as the default execution backend (#1259) · 74da3696
      Lei Wang authored
      * [Refactor] Update FFI type handling and simplify argument management
      
      * Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity.
      * Updated function registration in `runtime.cc` to utilize canonical names for better consistency.
      * Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled.
      * Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection.
      * Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity.
      
      * [Update] Sync TVM submodule and enhance kernel source handling
      
      * Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes.
      * Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging.
      * Commented out the main execution call in test files to prevent unintended execution during testing.
      * Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues.
      * Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends.
      
      * [Refactor] Clean up imports and improve code formatting
      
      * Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code.
      * Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency.
      * Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality.
      
      * Update execution backend options and improve resolution logic
      
      - Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target.
      - Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions.
      - Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target.
      - Updated documentation to reflect changes in execution backend options and their defaults.
      
      * lint fix
      
      * fix
      
      * Enhance argument handling in CUDA and HIP runtime modules
      
      - Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime.
      - Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers.
      - Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks.
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * minor fix
      
      * fix
      
      * recover check
      
      * Refactor argument binding and validation in `arg_binder.cc`
      
      - Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers.
      - Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards.
      - Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling.
      - Minor adjustments in test files to streamline kernel execution and improve readability.
      
      * lint fix
      
      * stride fix
      
      * minor fix
      
      * fix
      
      * lint fix
      
      * lint fix
      
      * Add CUDA stream access policy window helpers and integrate with L2 persistent cache management
      
      - Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage.
      - Updated runtime files to include new FFI packed functions for managing stream attributes.
      - Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown.
      - Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source.
      
      * check with symbolic
      
      * support null ptr
      
      * Update CMakeLists and lower.py for code generation and subproject status
      
      - Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support.
      - Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility.
      - Marked the TVM subproject as dirty to indicate local modifications.
      
      * lint fix
      
      * Update comments for clarity in quickstart.py
      74da3696
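      A rough sketch of what backend selection looks like from user code after this change, assuming `tilelang.jit` accepts the `execution_backend` keyword described above (exact signatures may differ):

      ```python
      import tilelang
      import tilelang.language as T

      # "auto" resolves the backend from the target (tvm_ffi is now the default);
      # "cython", "torch", and "nvrtc" can still be forced explicitly.
      @tilelang.jit(out_idx=[-1], execution_backend="auto")
      def vec_add(N, block=256, dtype="float32"):

          @T.prim_func
          def main(A: T.Tensor((N,), dtype), B: T.Tensor((N,), dtype),
                   C: T.Tensor((N,), dtype)):
              with T.Kernel(T.ceildiv(N, block), threads=block) as bx:
                  for i in T.Parallel(block):
                      C[bx * block + i] = A[bx * block + i] + B[bx * block + i]

          return main

      kernel = vec_add(4096)
      print(kernel.get_kernel_source())  # kernel-source retrieval mentioned above
      ```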
    • [BugFix] Adding extra parameters into autotune hashkey (#1274) · e805f8e5
      Chaofan Lin authored
      * [BugFix] Adding extra parameters into autotune hashkey
      
      * lint
      
      * None check
      
      * check serializable
      e805f8e5
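      The gist of the fix, as an illustrative sketch (names hypothetical): extra, non-tunable parameters are folded into the autotune cache key, with a `None` check and a serializability check so unusable values fail loudly:

      ```python
      import hashlib
      import json

      def autotune_hash_key(fn_name, config, extra_params=None):
          extras = {}
          for k, v in (extra_params or {}).items():  # None check
              try:
                  json.dumps(v)  # reject values that cannot be serialized into the key
              except TypeError as e:
                  raise ValueError(f"extra param {k!r} is not serializable") from e
              extras[k] = v
          payload = json.dumps({"fn": fn_name, "cfg": config, "extra": extras},
                               sort_keys=True)
          return hashlib.sha256(payload.encode()).hexdigest()
      ```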
  4. 04 Nov, 2025 1 commit
    • [Refactor] Improve Python 3.9 compatibility for ParamSpec and Self (#1190) · 7d961892
      Lei Wang authored
      * [Feature] Enhance fill operation to support various buffer types
      
      - Added support for `BufferLoad` in the `fill` function to handle different buffer types.
      - Updated `Fill` class to process region descriptors and buffer regions, improving flexibility in buffer handling.
      - Introduced checks for static bounds in region definitions to ensure safety during operations.
      - Refactored loop induction variable handling in `FillNode` to accommodate sliced regions.
      
      * lint fix
      
      * [Refactor] Improve Python compatibility for ParamSpec and Self
      
      - Added compatibility handling for ParamSpec and Self to support Python versions below 3.10 and 3.11 respectively.
      - Updated type annotations across multiple files to ensure consistent usage of typing features.
      
      * [Update] Require Python 3.9 and enhance type annotations
      
      - Updated the minimum required Python version from 3.8 to 3.9 in `pyproject.toml`.
      - Removed references to Python 3.8 in classifiers.
      - Changed type annotations from `int | None` to `Optional[int]` in multiple example files for better clarity and compatibility.
      - Improved import statements to use `collections.abc` for `Iterable` and `contextlib` for `AbstractContextManager` in relevant files.
      
      * [Refactor] Update import statements to enhance type annotations
      
      - Replaced imports from `typing` with `collections.abc` for `Iterable` and `Mapping` in relevant files to improve compatibility and clarity.
      - Updated the caching decorator from `functools.lru_cache` to `functools.cache` for better performance in the C++ compiler retrieval function.
      - Adjusted import statements in the language proxy file to maintain consistency in type annotations.
      
      * disable rocm rs nt test.
      
      * lint fix
      7d961892
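      The compatibility shim presumably follows the standard pattern (`ParamSpec` landed in `typing` in 3.10, `Self` in 3.11); a minimal sketch, assuming `typing_extensions` is available as a fallback:

      ```python
      import sys

      if sys.version_info >= (3, 10):
          from typing import ParamSpec
      else:  # Python 3.9
          from typing_extensions import ParamSpec

      if sys.version_info >= (3, 11):
          from typing import Self
      else:
          from typing_extensions import Self

      P = ParamSpec("P")  # usable identically on every supported version
      ```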
  5. 03 Nov, 2025 1 commit
    • [Language] Initial version of tilelang frontend v2 (#1120) · 5f202fe5
      Kurisu authored
      
      
      * tilelang frontend v2
      
      * syntax sugar: defining a local var by annotation
      
      * [Refactor] fix type linting warning like `T.float32`
      
      * Add tl.local_var_init for new tl.float32
      
      * allow passing default argument as function annotation
      
      * allow default arguments as annotation
      
      * fix lint error
      
      * minor fix
      
      * [Refactor] refactor tilelang.jit and tilelang.autotune
      
      * minor fix
      
      * minor fix
      
      * minor fix
      
      * fix metal get function name
      
      * add par_compile impl and tests
      
      * Type consistency on tvm datatype
      1. isinstance(tl.float32, tvm.DataType) == True
      2. Allow `tl.float32` as function annotations
      3. Allow `tl.float32` as argument to be passed to `tl.alloc` or other functions
      
      * fix lint error
      
      * add more warning in frontend
      
      * update tvm version
      
      * Minor fix on tvm_ffi annotations
      
      * add document and examples
      
      * fix lint error
      
      * Simplify index calculations in example_chunk_o_bwd.py
      
      Refactor index calculations for dg_last_fragment assignment.
      
      * minor fix
      
      * lint fix
      
      ---------
      Co-authored-by: Lei Wang <leiwang1999@outlook.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      5f202fe5
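      A sketch of the dtype behavior this entry describes, assuming a `tl` alias for `tilelang.language` (illustrative, not the full v2 surface):

      ```python
      import tvm
      import tilelang.language as tl

      # 1. dtype objects are genuine TVM DataTypes
      assert isinstance(tl.float32, tvm.DataType)

      # 2./3. they also serve as function annotations and as dtype arguments,
      # e.g. inside a v2 kernel body (shown as pseudocode):
      #     x: tl.float32 = 0.0                          # local var by annotation
      #     buf = tl.alloc_fragment((128,), tl.float32)  # dtype passed to alloc
      ```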
  6. 23 Oct, 2025 1 commit
  7. 20 Oct, 2025 1 commit
  8. 13 Oct, 2025 1 commit
    • [Build] Migrate to scikit-build-core (#939) · d89ba5b8
      Yichen Yan authored
      
      
      * cleanup
      
      * init
      
      * build first wheel that may not work
      
      * build cython ext
      
      * fix tvm build
      
      * use sabi
      
      * update rpath to support auditwheel
      
* pass editable build
      
      * update ci
      
      * fix warnings
      
      * do not use ccache in self host runner
      
      * test local uv cache
      
      * test pip index
      
      * update lib search to respect new lib location
      
      * fix
      
      * update ci
      
      * enable cuda by default
      
      * update src map
      
      * fix
      
      * fix
      
      * fix
      
      * Generate version with backend and git information at build time
      
      * copy tvm_cython to wheels
      
      * fix tvm lib search
      
      * fmt
      
      * remove unused
      
      * auto detect ccache
      
      * add back backend-related files
      
      * remove jit cython adaptor to simplify code
      
      * fmt
      
      * fix ci
      
      * ci fix 2
      
      * ci fix 3
      
      * workaround metal
      
      * ci fix 4
      
      * fmt
      
      * fmt
      
      * Revert "ci fix 4"
      
      This reverts commit d1de8291c3e40927955f3ad3cf87a75c78813676.
      
      * tmp
      
      * fix metal
      
      * trivial cleanup
      
      * add detailed build-time version for cuda
      
      * add back mlc
      
      * Restore wheel info and other trivial updates
      
      * update
      
      * fix cuda
      
      * upd
      
      * fix metal ci
      
      * test for ga build
      
      * test for nvidia/cuda
      
      * test ubuntu 20
      
      * fix
      
      * fix
      
      * Do not use `uv build`
      
      * fix
      
      * fix
      
      * log toolchain version
      
      * merge wheel
      
      * update
      
      * debug
      
      * fix
      
      * update
      
      * skip rocm
      
      * update artifacts each
      
      * fix
      
      * fix
      
      * add mac
      
      * fix cache
      
      * fix cache
      
      * fix cache
      
      * reset and add comment
      
      * upd
      
      * fix git version
      
      * update deps
      
      * trivial update
      
      * use in-tree build dir and install to src to speedup editable build
      
      * Revert "use in-tree build dir and install to src to speedup editable build"
      
      This reverts commit 6ab87b05c5eed811210136b8dca4fc3677dd51f2.
      
      * add build-dir
      
      * update docs
      
* remove old scripts
      
      * [1/n] cleanup scripts
      
      * [Lint]: [pre-commit.ci] auto fixes [...]
      
      * fix and update
      
      * wait for tvm fix
      
      * revert some tmp fix
      
      * fix
      
      * fix
      
      * spell
      
      * doc update
      
      * test cibuildwheel
      
      * fix and test macos on ci
      
      * Update .github/workflows/dist.yml
      Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
      
      * fix
      
      * test ga event
      
      * cleanup
      
      * bump tvm to support api3
      
      * test final version
      
      * add cron
      
      * Update .github/workflows/dist.yml
      Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
      
      * fix
      
      * test ccache for metal cibuildwheel
      
      * test newer macos
      
      * finish
      
      ---------
      Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
      Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
      d89ba5b8
  9. 07 Oct, 2025 1 commit
  10. 22 Sep, 2025 1 commit
    • [AMD][MLA] Fix mla autotune for rocm (#861) · 3b21a67d
      Lei Wang authored
      * Refactor matmul example to include ReLU activation and update batch size in benchmark script
      
      * lint fix
      
      * Enhance autotuning capabilities in benchmark script and update argument defaults
      
      - Introduced a new `get_configs` function to generate autotuning configurations for the benchmark.
      - Updated the default batch size and kv context length in the argument parser for improved performance.
      - Renamed the `--auto_tune` argument to `--autotune` for consistency.
      - Modified the kernel invocation logic to support autotuning based on the new configurations.
      
      * lint fix
      3b21a67d
  11. 13 Sep, 2025 1 commit
  12. 02 Sep, 2025 1 commit
    • [Cache] Introduce detailed target information for the disk kernel cache (#780) · 7ffc5b44
      Lei Wang authored
      * Fix type hint for target_host parameter in compile function to allow None value
      
      * Refactor target handling in compile function to utilize determine_target for improved clarity and consistency
      
      * Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code.
      
      * Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity.
      
      * Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling.
      
      * Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits.
      7ffc5b44
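      A sketch of the resolution step this entry centralizes (import path assumed):

      ```python
      from tilelang.utils.target import determine_target  # path assumed

      # "auto" probes the local device; the resolved target (with arch details)
      # is stringified and folded into the disk kernel-cache key, so kernels
      # built for different GPUs no longer collide in the cache.
      target = determine_target("auto")
      print(str(target))
      ```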
  13. 19 Aug, 2025 1 commit
    • [Refactor] Refactor env into a more flexible version (#740) · 72be4909
      Lei Wang authored
      * Fix environment variable name for compilation print setting in `env.py`
      
      * Remove deprecated test file for warp specialized pass configuration and refactor environment variable access in `env.py` to utilize a centralized `EnvVar` class for better management and clarity.
      
      * lint fix
      
      * Refactor cache check to use `env.is_cache_enabled()` for consistency in `tuner.py`
      72be4909
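      A minimal sketch of the centralized-accessor idea (class and variable names here are illustrative, not the actual `env.py` contents):

      ```python
      import os
      from dataclasses import dataclass

      @dataclass(frozen=True)
      class EnvVar:
          key: str
          default: str

          def get(self) -> str:
              return os.environ.get(self.key, self.default)

          def get_bool(self) -> bool:
              return self.get().lower() in ("1", "true", "on", "yes")

      TILELANG_CACHE = EnvVar("TILELANG_CACHE_ENABLED", "1")  # hypothetical name

      def is_cache_enabled() -> bool:
          return TILELANG_CACHE.get_bool()
      ```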
  14. 15 Aug, 2025 1 commit
  15. 31 Jul, 2025 1 commit
  16. 13 Jul, 2025 1 commit
    • [AutoTune] Support `with set_autotune_inputs` to set auto tuning input tensors (#632) · eec47592
      Lei Wang authored
      * [Refactor] Simplify and modularize autotuner implementation
      
      - Removed unused imports and extensive code sections from the autotuner module to enhance readability and maintainability.
      - Modularized the code by introducing new imports for autotuning and capturing functionalities, streamlining the overall structure.
      - Improved logging setup and removed redundant timeout handling functions, focusing on core autotuning logic.
      - Updated the AutoTuner class to better utilize the new modular structure, ensuring efficient performance during auto-tuning processes.
      
      * [Refactor] Clean up and enhance capture and tuner modules
      
      - Improved code readability by removing unnecessary blank lines and organizing imports in `capture.py` and `tuner.py`.
      - Enhanced logging in the `AutoTuner` class to provide clearer warnings regarding the usage of `supply_prog` in the context of auto-tuning.
      - Streamlined the `CaptureStack` class for better thread-local context management.
      
      * lint fix
      
      * [Refactor] Simplify configuration and autotuning logic in blocksparse GEMM example
      
      - Updated `get_configs` function to reduce the number of configurations, enhancing performance and clarity.
      - Removed the `get_best_config` function, integrating its logic directly into the `blocksparse_matmul` function with the `@autotune` decorator for streamlined autotuning.
      - Adjusted the main function to directly utilize the autotuned kernel, simplifying the overall structure and improving readability.
      - Deleted obsolete test file for autotuning decorator, cleaning up the codebase.
      
      * [Refactor] Improve code formatting and readability in autotune test file
      
      - Reformatted the `matmul` function and `get_configs` function for better readability by adjusting line breaks and indentation.
      - Fixed a typo in the `enable_rasteration` parameter name to ensure consistency.
      - Cleaned up unnecessary blank lines to enhance overall code clarity.
      
      * Update example_blocksparse_gemm.py
      
      * Update capture.py
      eec47592
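      Usage presumably looks like the following sketch; `matmul` here is a stand-in for any `@autotune`-decorated kernel, and the import path is assumed:

      ```python
      import torch
      from tilelang.autotuner import set_autotune_inputs  # path assumed

      a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
      b = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)

      # Every configuration trial is fed these fixed tensors instead of
      # tuner-synthesized inputs (useful when accuracy depends on real data).
      with set_autotune_inputs([a, b]):
          kernel = matmul(1024, 1024, 1024)  # hypothetical autotuned function
      ```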
  17. 12 Jul, 2025 1 commit
    • [Enhancement] Add CPU utilization and count settings for Auto-Tuning (#630) · b6fe9582
      Lei Wang authored
      * [Enhancement] Add CPU utilization and count settings for Auto-Tuning
      
      - Introduced environment variables for CPU utilization, counts, and maximum CPU count for auto-tuning.
      - Updated the AutoTuner class to utilize these new settings, improving flexibility and performance in multi-threaded environments.
      - Enhanced logging to provide better insights into the auto-tuning process based on the configured CPU settings.
      
      * typo fix
      b6fe9582
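      The worker-count computation plausibly reduces to a clamp like this sketch (environment-variable names are illustrative):

      ```python
      import os

      # Hypothetical names mirroring the settings described above.
      utilization = float(os.environ.get("TILELANG_AUTOTUNE_CPU_UTILIZATION", "0.9"))
      override = os.environ.get("TILELANG_AUTOTUNE_CPU_COUNTS")
      hard_cap = int(os.environ.get("TILELANG_AUTOTUNE_MAX_CPU_COUNT", "-1"))

      available = os.cpu_count() or 1
      workers = int(override) if override else max(1, int(available * utilization))
      if hard_cap > 0:
          workers = min(workers, hard_cap)
      print(f"auto-tuning with {workers} worker(s)")
      ```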
  18. 08 Jul, 2025 1 commit
    • [Refactor] refactor autotune examples (#617) · d110d087
      Lei Wang authored
      * [Refactor] Update tilelang kernel functions and remove unused imports
      
      - Refactored the `flashattn_fwd`, `flashattn_bwd_preprocess`, and `flashattn_bwd_postprocess` functions to utilize direct kernel calls instead of cached versions, improving clarity and performance.
      - Added `@tilelang.jit` decorators with specified output indices to enhance kernel compilation.
      - Removed unused import of `cached` from `tilelang`, streamlining the code.
      - Commented out the main testing function call in `test_tilelang_kernel_mha_bwd.py` for potential future use.
      
      * [Refactor] Simplify configuration generation in benchmark and example scripts
      
      - Refactored the `get_configs` functions in multiple benchmark and example scripts to utilize a dictionary-based approach for parameter configuration, improving readability and maintainability.
      - Updated the `flashattn` and `chunk_scan_fwd` functions to directly accept configuration parameters, enhancing flexibility in kernel tuning.
      - Removed redundant code and streamlined the configuration generation process across various files, ensuring consistency in how configurations are defined and utilized.
      
      * [Refactor] Update configuration handling in benchmark scripts
      
      - Refactored the `get_configs` functions in benchmark scripts to accept a variable argument list, improving flexibility in configuration management.
      - Enhanced the `matmul` and `flashattn` functions to utilize the updated configuration approach, streamlining parameter handling for kernel tuning.
      - Added `@autotune` decorators to relevant functions, ensuring consistent autotuning behavior across benchmarks.
      - Cleaned up redundant code and improved overall readability in the affected files.
      
      * [Refactor] Clean up formatting and update subproject commit
      
      - Updated the subproject commit reference in the TVM directory to indicate a dirty state.
      - Removed unnecessary blank lines and improved formatting in the `benchmark_matmul` and `benchmark_matmul_fp8` scripts for better readability.
      - Streamlined the function definitions in the `flashattn` example script to enhance clarity and maintainability.
      
      * [Refactor] Update AutoTuner configuration handling
      
      - Modified the AutoTuner class to check if kernel parameters are set before processing tunable arguments, improving robustness in configuration handling.
      - Enhanced the logic for skipping compilation when tunable parameters are already provided, ensuring efficient use of resources.
      - Updated comments for clarity and maintainability.
      
      * lint fix
      
      * Update TVM subproject commit to indicate dirty state and modify MHA backward test cases
      
      - Updated the subproject commit reference in the TVM directory to reflect a dirty state.
      - Adjusted the `test_mha_bwd` function to use a new configuration for the MHA backward tests, changing the context size from 128 to 256.
      - Uncommented the main testing function call for potential execution.
      d110d087
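      The dictionary-based `get_configs` pattern referenced above is, in essence, a Cartesian product; a sketch:

      ```python
      import itertools

      def get_configs(**tunables):
          """Expand {param: [values, ...]} into a list of config dicts."""
          keys, value_lists = zip(*tunables.items())
          return [dict(zip(keys, combo)) for combo in itertools.product(*value_lists)]

      configs = get_configs(block_M=[64, 128], block_N=[64, 128], num_stages=[1, 2])
      # -> 8 candidate configs, e.g. for @autotune(configs=configs, ...)
      ```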
  19. 21 Jun, 2025 2 commits
    • [Refactor] Improve tensor shape compatibility checks in AutoTuner (#590) · 804735bf
      Lei Wang authored
      - Simplified the shape comparison logic in the AutoTuner class to enhance readability and maintainability.
      - Ensured that the shape compatibility checks are more concise while preserving functionality, contributing to overall code clarity.
      804735bf
    • [Bugfix] Fix input tensor compatibility checks in AutoTuner (#588) · cce6aed8
      Lei Wang authored
      
      
      * [Refactor] Remove cache existence check in kernel saving logic
      
      - Eliminated redundant checks for existing cache paths in `AutotuneResult` and `AutoTunerCache` classes, simplifying the kernel saving process.
      - Ensured that the cache directory is always created before saving kernel source code, improving reliability in kernel storage.
      
      * [Enhancement] Improve input tensor compatibility checks in AutoTuner
      
      - Enhanced the input tensor caching logic in the AutoTuner class to ensure compatibility between cached tensors and newly generated tensors during configuration trials.
      - Added detailed logging to warn users about potential mismatches in tensor properties, including shape and dtype, when caching is enabled.
      - Implemented a mechanism to regenerate input tensors if compatibility issues are detected, improving the robustness of the autotuning process.
      
      * [Refactor] Update L2 persistent map initialization in CUDA wrapper
      
      - Adjusted the L2 persistent map initialization function to use a consistent size parameter for cache limits and byte counts, improving clarity and reducing potential errors in memory management.
      - Simplified the formatting of the initialization function to enhance readability and maintainability of the code.
      
      * Update tilelang/autotuner/__init__.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      cce6aed8
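      The compatibility check amounts to comparing cached tensors against freshly generated ones and regenerating on mismatch; an illustrative sketch:

      ```python
      def inputs_compatible(cached, fresh):
          # True only when every cached tensor matches the shape and dtype the
          # current configuration trial expects; callers regenerate otherwise.
          if len(cached) != len(fresh):
              return False
          return all(a.shape == b.shape and a.dtype == b.dtype
                     for a, b in zip(cached, fresh))
      ```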
  20. 19 Jun, 2025 1 commit
    • [Bugfix] Fix autotuning params (#585) · f4bb9f6c
      Lei Wang authored
      * [Enhancement] Update AutoTuner and Profiler for improved kernel handling and output validation
      
      - Modified AutoTuner to store cache in a dedicated "autotuner" directory.
      - Enhanced kernel source code saving logic in AutotuneResult and AutoTunerCache to check for None before writing.
      - Updated Profiler to handle None outputs gracefully during tensor comparisons, improving robustness in output validation.
      
      * lint fix
      
      * [Enhancement] Improve error handling and documentation in AutoTuner
      
      - Added traceback logging for exceptions during configuration testing to enhance debugging.
      - Expanded the AutoTuner class docstring to include detailed descriptions of new parameters for input tensor generation and validation, improving clarity for users.
      f4bb9f6c
  21. 16 Jun, 2025 1 commit
    • [Refactor] Phaseout tf32 Casting from GEMM Templates (#573) · 9ba8b480
      Lei Wang authored
      * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic
      
      - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes.
      - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts.
      - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate.
      - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns.
      
      * lint fix
      
      * [Refactor] Update GEMM Layout Functions and Inference Logic
      
      - Removed the `enable_padding` parameter from `makeGemmABLayout` to simplify its signature.
      - Introduced `makeGemmABLayoutHopper` for enhanced layout handling specific to Hopper architecture.
      - Updated layout inference in GEMM operations to utilize the new `makeGemmABLayoutHopper` function, improving clarity and maintainability in layout selection.
      - Adjusted related layout functions to ensure consistent behavior across different architectures.
      
      * [Refactor] Remove tf32 Casting Logic from GEMM Templates
      
      - Eliminated the `cast_float_to_tf32` function from `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` templates to streamline the code.
      - Removed conditional casting logic for float32 to tfloat32 conversion, enhancing clarity and maintainability.
      - Updated relevant sections in GEMM operations to reflect the removal of casting, ensuring consistent behavior across templates.
      - Adjusted tensor view handling to improve performance and accuracy in matrix operations.
      
      * Update bulk_copy.cc
      
      * Fix profiler initialization in GEMM test by removing TensorSupplyType argument for improved flexibility.
      9ba8b480
  22. 11 Jun, 2025 2 commits
    • [Feature] Implement Swizzle 32B (#566) · ae9668a8
      Lei Wang authored
      * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic
      
      - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes.
      - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts.
      - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate.
      - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns.
      
      * lint fix
      
      * [Refactor] Update GEMM Layout Functions and Inference Logic
      
      - Removed the `enable_padding` parameter from `makeGemmABLayout` to simplify its signature.
      - Introduced `makeGemmABLayoutHopper` for enhanced layout handling specific to Hopper architecture.
      - Updated layout inference in GEMM operations to utilize the new `makeGemmABLayoutHopper` function, improving clarity and maintainability in layout selection.
      - Adjusted related layout functions to ensure consistent behavior across different architectures.
      
      * Update bulk_copy.cc
      
      * Update __init__.py
      ae9668a8
    • [Bugfix] Add `__tune_params` into key hash for autotuning (#565) · ae386a7b
      Lei Wang authored
      * [Enhancement] Update AutoTuner and Profiler for improved kernel handling and output validation
      
      - Modified AutoTuner to store cache in a dedicated "autotuner" directory.
      - Enhanced kernel source code saving logic in AutotuneResult and AutoTunerCache to check for None before writing.
      - Updated Profiler to handle None outputs gracefully during tensor comparisons, improving robustness in output validation.
      
      * lint fix
      ae386a7b
  23. 04 Jun, 2025 2 commits
    • [Autotune] Remove the out_idx argument from the autotune cache (#553) · 5fbfb80b
      Lei Wang authored
      
      
      * [Enhancement] Update AutoTuner and JIT compilation arguments
      
      * Added functionality to return compile arguments in the JIT implementation, enhancing the autotuner's caching capabilities.
      * Modified `CompileArgs` and `AutotuneResult` classes to support optional `out_idx` parameter, improving flexibility in compile argument handling.
      * Refactored the `_AutoTunerImplementation` to utilize the new compile arguments, ensuring better integration and performance during tuning processes.
      
      * Update tilelang/autotuner/param.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * remove redundant comments
      
      * Update tilelang/jit/__init__.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      5fbfb80b
    • [AMD][Enhancement] Add support for Vectorized FP8 DataPacking (#542) · 319bc6b1
      Lei Wang authored
      * [Enhancement] Add support for new FP8 types in HIP code generation
      
      * Updated `PrintConst` function in `codegen_hip.cc` to handle `float8_e4m3fnuz` type.
      * Introduced new functions in `hip_fp8.h` for creating FP8 types, including `make_fp8_e4_4_t` and `make_fp8_e4_8_t`, enhancing type handling for FP8 data structures.
      * Improved overall compatibility and performance for FP8 data types in HIP.
      
      * workaround for competition
      
      * enhance autotune
      
      * autotune cache fix
      
      * Implement validation for unused keys in AutoTuner configuration
      
      * Added a check in the AutoTuner class to raise a ValueError if there are unused keys in the configuration, enhancing error handling and ensuring configuration integrity.
      
      * lint fix
      
      * revert changes of threads
      
      * Update pipelining in `example_mla_decode.py` to improve performance
      
      * Changed the number of stages in the pipelined loop from 0 to 2, enhancing the efficiency of the attention mechanism in the decoding process.
      
      * Enhance Cython kernel validation by adding tensor attribute checks
      
      * Updated the `CythonKernelWrapper` to include dedicated methods for validating tensor device, dtype, and static shape.
      * Modified the `forward` method to utilize these new validation methods, improving error handling and ensuring input integrity.
      * Updated the `lambda_forward` function in `CythonKernelAdapter` to reflect changes in validation parameters.
      319bc6b1
  24. 28 May, 2025 1 commit
    • [Autotune] Introduce cache mechanism for auto tuner (#527) · 7171aff6
      Lei Wang authored
      * [Enhancement] Add commit ID to versioning and improve logging initialization
      
      * Updated `get_tilelang_version` to include an optional commit ID in the version string.
      * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
      * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
      * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
      * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
      
      * [Refactor] Enhance AutoTuner and JITKernel for improved performance and caching
      
      * Refactored the AutoTuner class to include new methods for setting compilation and profiling arguments, enhancing configurability.
      * Introduced caching mechanisms for tuning results, allowing for faster retrieval of previously computed configurations.
      * Updated JITKernel to store tuning results, including latency and configuration details, improving the kernel's performance tracking.
      * Added new methods for generating cache keys and saving/loading results to/from disk, streamlining the tuning process.
      * Enhanced the overall structure and readability of the autotuning logic, ensuring better maintainability and clarity.
      * Minor adjustments in related modules to support the new caching and profiling features.
      
      * [Refactor] Clean up code formatting and improve readability in AutoTuner and related modules
      
      * Consolidated import statements and removed unnecessary line breaks for better readability.
      * Standardized function argument formatting across the AutoTuner and CompileArgs classes.
      * Enhanced consistency in the use of whitespace and indentation throughout the codebase.
      * Minor adjustments in the Profiler and JITKernel classes to improve clarity and maintainability.
      * Ensured that all changes adhere to the project's coding style guidelines.
      
      * [Refactor] Remove redundant type hints in AutoTuner modules
      
      * Simplified import statements in `__init__.py` and `param.py` by removing unnecessary duplicate type hints for `Any`.
      * Improved code readability and maintainability by streamlining type imports across the AutoTuner module.
      
      * [Refactor] Update AutoTuner configuration for improved profiling and target detection
      
      * Enhanced the AutoTuner configuration across multiple examples by adding `set_profile_args` to better manage profiling settings.
      * Standardized the use of `target="auto"` in compile arguments to ensure automatic target detection.
      * Removed redundant target specifications in certain instances to streamline the configuration process.
      * Improved overall clarity and maintainability of the autotuning logic in various example scripts.
      
      * [Refactor] Simplify code formatting and improve readability in example scripts
      
      * Consolidated function argument formatting in `benchmark_mla_decode_amd_tilelang.py`, `example_elementwise_add.py`, and `performance.py` for better clarity.
      * Removed unnecessary line breaks and standardized argument placement across multiple files.
      * Enhanced overall code readability and maintainability in autotuning examples and performance scripts.
      
      * [Refactor] Update JIT decorator usage across multiple files
      
      * Removed redundant parameters from the JIT decorator in various benchmark and example scripts, simplifying the code.
      * Standardized the import of the JIT decorator from `tilelang`, enhancing consistency across the codebase.
      * Improved overall readability and maintainability by consolidating import statements and cleaning up function definitions.
      
      * [Refactor] Standardize JIT decorator formatting across benchmark and example scripts
      
      * Simplified the formatting of the JIT decorator in multiple files by removing unnecessary line breaks.
      * Enhanced code readability and consistency in the usage of the JIT decorator across benchmark and example scripts.
      * Improved overall maintainability by ensuring uniformity in function definitions and decorator usage.
      7171aff6
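      In outline, the disk cache described above needs a stable key plus save/load helpers; a minimal sketch (paths and fields illustrative):

      ```python
      import hashlib
      import json
      from pathlib import Path
      from typing import Optional

      def cache_key(kernel_source: str, config: dict, version: str) -> str:
          payload = json.dumps({"src": kernel_source, "cfg": config, "ver": version},
                               sort_keys=True)
          return hashlib.sha256(payload.encode()).hexdigest()

      def save_result(cache_dir: Path, key: str, result: dict) -> None:
          cache_dir.mkdir(parents=True, exist_ok=True)  # ensure the dir exists first
          (cache_dir / f"{key}.json").write_text(json.dumps(result))

      def load_result(cache_dir: Path, key: str) -> Optional[dict]:
          path = cache_dir / f"{key}.json"
          return json.loads(path.read_text()) if path.exists() else None
      ```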
  25. 26 May, 2025 1 commit
    • [Enhancement] Add commit ID to versioning and improve logging initialization (#524) · 62a8d7f0
      Lei Wang authored
      * Updated `get_tilelang_version` to include an optional commit ID in the version string.
      * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
      * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
      * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
      * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
      62a8d7f0
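      `get_git_commit_id` is presumably a thin subprocess wrapper along these lines (sketch):

      ```python
      import subprocess
      from typing import Optional

      def get_git_commit_id(short: bool = True) -> Optional[str]:
          cmd = ["git", "rev-parse"] + (["--short"] if short else []) + ["HEAD"]
          try:
              return subprocess.check_output(cmd, text=True,
                                             stderr=subprocess.DEVNULL).strip()
          except (subprocess.CalledProcessError, FileNotFoundError):
              return None  # not a git checkout, or git unavailable

      # e.g. append "+<commit>" to the version string written at build time
      ```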
  26. 16 May, 2025 1 commit
    • [Enhancement] Introduce flag to visualize shared memory merge plan (#496) · dca2fb48
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      dca2fb48
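      A signal-based `run_with_timeout` of the kind described here typically looks like this sketch (Unix, main thread only; a later commit moved latency measurement to `ThreadPoolExecutor` timeouts):

      ```python
      import signal

      class TimeoutException(Exception):
          pass

      def run_with_timeout(fn, timeout_s, *args, **kwargs):
          def _on_alarm(signum, frame):
              raise TimeoutException(f"timed out after {timeout_s}s")

          old_handler = signal.signal(signal.SIGALRM, _on_alarm)
          signal.alarm(timeout_s)
          try:
              return fn(*args, **kwargs)
          finally:
              signal.alarm(0)  # always cancel the pending alarm
              signal.signal(signal.SIGALRM, old_handler)
      ```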
  27. 12 May, 2025 2 commits
  28. 11 May, 2025 1 commit
  29. 09 May, 2025 1 commit
  30. 16 Apr, 2025 1 commit
    • [Enhancement] Introduce a smarter warp partition strategy (#396) · ca730c0a
      Lei Wang authored
      * make it python 3.8- happy
      
      * [Enhancement] Improve loop partitioning and vectorization logic in layout inference and loop vectorization
      
      - Enhanced the VisitStmt_ method to support local buffer handling in parallel loops, allowing for register usage without explicit thread binding.
      - Updated loop vectorization logic to simplify expressions and ensure accurate vector size calculations, improving performance and clarity in the vectorization process.
      
      * lint fix
      
      * [Refactor] Update warp size checks and enhance warp partitioning logic in GEMM
      
      - Changed warp_n size check from 16 to 8 in gemm_layouts.cc to improve compatibility with specific configurations.
      - Refactored warp partitioning logic in gemm.cc to prioritize N dimension for better performance based on aspect ratio.
      - Introduced a new CompileArgs dataclass in autotuner to streamline compile argument management and improve code clarity.
      
      * lint fix
      
      * [Enhancement] Initialize jit_compile in AutoTuner class
      
      - Added initialization for jit_compile attribute in the AutoTuner class to ensure it is set to None by default.
      - Updated the assignment logic for jit_compile to prevent overwriting an existing compile function, enhancing the flexibility of the AutoTuner's compilation process.
      ca730c0a
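      The `CompileArgs` dataclass mentioned above plausibly bundles what was previously passed around loose; an illustrative shape only:

      ```python
      from dataclasses import dataclass
      from typing import Optional, Tuple, Union

      @dataclass
      class CompileArgs:
          # Hypothetical fields; the real class lives in tilelang's autotuner.
          out_idx: Optional[Union[int, Tuple[int, ...]]] = None
          target: str = "auto"
          target_host: Optional[str] = None
          execution_backend: str = "cython"  # the pre-#1259 default
      ```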
  31. 10 Apr, 2025 2 commits
    • [Bugfix] Adjust Autotuner threadpool `max_workers` limit to available CPUs (#368) · 9a7a569d
      Haodong Tian authored
      * [Bugfix] Adjust Autotuner threadpool `max_workers` limit to available CPUs
      
      * [Example] Small fix on example_blocksparse_gemm.py
      9a7a569d
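      The fix is essentially a clamp on the pool size; sketch:

      ```python
      import os
      from concurrent.futures import ThreadPoolExecutor

      requested_workers = 16  # whatever the tuner would otherwise spawn
      max_workers = min(requested_workers, os.cpu_count() or 1)
      pool = ThreadPoolExecutor(max_workers=max_workers)
      ```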
    • [MLA][AMD] Add amd mla benchmarking (#367) · d3536d9e
      Lei Wang authored
      
      
      * [Add] Introduce benchmark scripts for MLA decoding with AMD support
      
      - Added three new benchmark scripts: `benchmark_mla_decode_amd_tilelang.py`, `benchmark_mla_decode_amd_torch.py`, and `benchmark_mla_decode_amd_triton.py` to evaluate the performance of the MLA decoding mechanism across different frameworks.
      - Each script includes implementations for attention calculation, performance profiling, and output validation against reference implementations.
      - Enhanced command-line argument parsing for customizable input parameters, including batch size, number of heads, and dimensions.
      - Integrated performance comparison functionality to facilitate benchmarking between different implementations.
      
      * lint fix
      
      * lint fix
      
      ---------
      Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>
      d3536d9e
  32. 09 Apr, 2025 1 commit
    • [Bugfix] Fix compilation issues for amd cdna element size check (#364) · d627fd58
      Lei Wang authored
      * [Refactor] Update AutoTuner run method and timeout handling
      
      - Modified the `run` method to reduce the default timeout from 100 to 30 seconds for improved responsiveness.
      - Changed the `get_input_tensors_supply` call to disable output generation, enhancing performance during tensor supply retrieval.
      - Refactored the latency measurement to streamline the benchmarking process, ensuring proper timeout handling with `ThreadPoolExecutor`.
      - Added logging for timeout occurrences to aid in debugging and performance analysis.
      
      * bug fix
      
      * lint fix
      d627fd58
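      Per-config benchmarking with a `ThreadPoolExecutor` timeout, as described above, reduces to a pattern like this sketch:

      ```python
      from concurrent.futures import ThreadPoolExecutor
      from concurrent.futures import TimeoutError as FuturesTimeout

      def bench_with_timeout(bench_fn, timeout_s=30.0, logger=None):
          pool = ThreadPoolExecutor(max_workers=1)
          future = pool.submit(bench_fn)
          try:
              return future.result(timeout=timeout_s)
          except FuturesTimeout:
              if logger is not None:
                  logger.warning("config timed out after %.0fs; skipping", timeout_s)
              return None  # caller skips this configuration
          finally:
              # wait=False so a hung benchmark thread cannot block the tuner;
              # the worker may linger until the process exits.
              pool.shutdown(wait=False)
      ```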
  33. 07 Apr, 2025 1 commit
    • [AutoTune] Refactor AutoTuneArtifact to utilize kernel as context instead of profiler (#344) · f005db9f
      Lei Wang authored
      * [Enhancement] Update GEMM examples and autotuner for improved performance
      
      - Modified `example_gemm_intrinsics.py` to enhance matrix multiplication configurations, increasing warp sizes and adjusting data types for better performance.
      - Updated the kernel compilation process to utilize the new `tilelang.compile` method and improved latency measurement with the profiler.
      - Refactored `example_gemm.py` to include a new autotuning configuration and ensure consistency in latency checks against reference results.
      - Adjusted tensor supply generation in `tilelang/utils/tensor.py` to use `torch.randn` for better randomness in tensor initialization.
      - Enhanced the `JITContext` in `tilelang/autotuner/__init__.py` to replace the profiler with a kernel instance for performance measurement, improving the overall structure of the autotuner.
      
      * bug fix
      
      * fix
      
      * [Enhancement] Update convolution tests and profiling assertions
      
      - Added a random seed setting for reproducibility in convolution tests.
      - Removed several redundant convolution test cases to streamline the testing process.
      - Updated the assertion in the matrix multiplication profiling to include a maximum mismatched ratio for improved accuracy in results.
      - Enabled the main testing function for better test execution.
      
      * lint fix
      f005db9f
  34. 05 Apr, 2025 1 commit
    • [Enhancement] Enhance FP8/FP4 type handling in CUDA codegen (#323) · 89725f7f
      Lei Wang authored
      
      
      * [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
      
      - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
      - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
      - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
      - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
      
      * [Refactor] Clean up whitespace in CUDA-related files
      
      - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
      - This change enhances the overall organization of the codebase without altering functionality.
      
      * [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
      
      - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
      - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
      - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
      - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
      
      * lint fix
      
      * Update submodule and enhance FP8 type handling in CUDA codegen
      
      - Updated the TVM submodule to the latest commit.
      - Modified FP8 type handling in `codegen_cuda.cc` to use more descriptive type codes.
      - Improved constant printing for FP8 and bfloat16 types, ensuring correct representation in generated code.
      - Added error handling for missing configuration keys in the AutoTuner class.
      
      * lint fix
      
      * Remove print statement from example script
      
      * lint fix
      
      * fix
      
      ---------
      Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
      89725f7f