1. 26 Sep, 2025 5 commits
    • [Precision] Introduce `T.ieee_rsqrt` and related high precision ops (#882) · a58bf9b6
      Lei Wang authored
      * Add fast math operations for CUDA: exp, exp10, log, log2, log10, tan, cos, and sin (#865)
      
      * Refactor fast math operation definitions for consistency and readability in CUDA code. Consolidated multiple definitions into single lines and improved formatting in related test files for better clarity.
      
      * Remove unnecessary pass configurations for warp specialization and TMA lowering in fast math operation tests for CUDA. This simplifies the test setup while maintaining the focus on fast math functionality.
      
      * Update fastmath tests to reflect that tl.* intrinsics generate no fastmath versions and disable cache in main execution.
      
      * Fix formatting in fastmath test comments for clarity on tl.* intrinsics behavior.
      
      * Add precision comparison tool for CUDA operations
      
      This commit introduces a new Python script and CUDA source file for a precision comparison tool that evaluates the accuracy of various CUDA operations (including division, reciprocal, exponential, logarithmic, and trigonometric functions) across different implementations: CUDA Precise, CUDA Fast, Triton, Triton LibDevice, and TileLang. The tool generates test data, executes the operations, and summarizes the error statistics for each implementation against a double precision reference. Additionally, a README file is added to document the results of the comparisons for various operations.
      
      * Add precision comparison tool for CUDA operations
      
      This commit introduces a new precision comparison tool implemented in Python and CUDA, designed to evaluate the accuracy of various mathematical operations (division, reciprocal, exponential, logarithmic, trigonometric, square root, etc.) across different frameworks including CUDA Precise/Fast, Triton, Triton LibDevice, PyTorch, and TileLang. The tool includes functionality for generating test data, executing operations, and summarizing error statistics for each implementation. Additionally, it provides a comprehensive README with error metrics for each operation tested.
      
      * Add IEEE-compliant mathematical operations and refactor fast math module
      
      This commit introduces new high precision mathematical operations including ieee_add, ieee_sub, ieee_mul, ieee_fmaf, ieee_frcp, ieee_fsqrt, ieee_frsqrt, and ieee_fdiv to the TileLang framework. The fast math module has been refactored to remove the deprecated fastmath.py file and update the import paths accordingly. Additionally, the CUDA code generation has been enhanced to support these new operations, ensuring compatibility with IEEE standards for floating-point arithmetic.
      
      * debug removed
      
      * Refactor IEEE math tests for improved readability and consistency
      
      This commit enhances the formatting of the `test_ieee_math.py` and `test_mathops_fastmath.py` files by adjusting line breaks for better clarity. It also removes unnecessary comments and ensures that the main execution of tests is streamlined. These changes aim to improve the overall maintainability of the test code.
      
      * Update README.md to enhance formatting of precision comparison results
      
      This commit reformats the precision comparison results in the README.md file, converting the error statistics tables into a more structured markdown format. This change improves readability and accessibility of the data for various mathematical operations across different implementations, including FP32 Precise, Triton, TileLang, and CUDA.
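      The error-summary step such a precision tool performs can be sketched in plain Python. The real tool runs CUDA/Triton/TileLang kernels and compares against a double-precision reference; `fast_rsqrt` below is only a toy stand-in for a fast approximation, refined with one Newton step.

```python
import math
import random

def error_stats(approx_fn, ref_fn, samples):
    """Summarize absolute/relative error of approx_fn against a
    double-precision reference, as the comparison tool does."""
    max_abs = max_rel = 0.0
    for x in samples:
        ref = ref_fn(x)
        err = abs(approx_fn(x) - ref)
        max_abs = max(max_abs, err)
        if ref != 0.0:
            max_rel = max(max_rel, err / abs(ref))
    return {"max_abs": max_abs, "max_rel": max_rel}

def fast_rsqrt(x):
    # Perturbed stand-in for a hardware fast-math approximation...
    y = 1.0 / x ** 0.5 * (1.0 + 1e-4)
    # ...followed by one Newton-Raphson refinement step.
    return y * (1.5 - 0.5 * x * y * y)

random.seed(0)
xs = [random.uniform(0.1, 100.0) for _ in range(1000)]
stats = error_stats(fast_rsqrt, lambda x: 1.0 / math.sqrt(x), xs)
print(stats["max_rel"] < 1e-6)  # → True
```

      One Newton step squares the initial relative error (1e-4 becomes roughly 1.5e-8), which is why the relative-error bound above holds with large margin.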
    • [Example] Add efficient attention sink backward implementations and tests (#877) · ec24561a
      Tong WU authored
      * [Example] Add a new example to support attention sink for MHA
      
      - Introduced a new example script for multi-head attention (MHA) with sliding window attention and sink tokens.
      - Added a reference attention function to validate the implementation against PyTorch.
      - Included argument parsing for command-line execution of the example.
      
      * [Example] Replace MHA sink forward example with updated implementation
      
      - Removed the old example script for multi-head attention (MHA) with sliding window attention and sink tokens.
      - Introduced a new example script that modifies the attention mechanism to enhance performance and maintainability.
      - Updated argument parsing and reference functions to align with the new implementation.
      
      * Enhance MHA sink example with sliding window support
      
      - Added a `window_size` parameter to the `flashattn` function to enable sliding window attention.
      - Implemented assertions to ensure `window_size` is compatible with `block_N`.
      - Updated the main function to include a `tune` option for performance tuning.
      - Introduced a new test file to validate both full attention and sliding window scenarios.
      - Adjusted FLOPS calculation to account for the sliding window configuration.
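      The sliding-window masking described above can be sketched as a plain-Python reference mask. Sink tokens and the BSHD layout are omitted, and the exact window convention (each query sees the `window_size` most recent keys, causally aligned) is an assumption for illustration.

```python
def sliding_window_mask(seq_q, seq_kv, window_size=None):
    """Causal attention mask with an optional sliding window; True means
    query i may attend to key j."""
    mask = [[False] * seq_kv for _ in range(seq_q)]
    offset = seq_kv - seq_q  # align last query with last key (causal)
    for i in range(seq_q):
        hi = i + offset  # last visible key for query i
        lo = 0 if window_size is None else max(0, hi - window_size + 1)
        for j in range(lo, hi + 1):
            mask[i][j] = True
    return mask

m = sliding_window_mask(4, 4, window_size=2)
# Each query sees at most `window_size` most recent keys.
print(all(sum(row) <= 2 for row in m))  # → True
```

      This also explains the FLOPS adjustment: with a window, each query attends to at most `window_size` keys rather than the full prefix.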
      
      * lint
      
      * [Fix] Add an inf-check step to fix the sliding window attention (SWA) bug
      
      * Migrate to BSHD layout to align with Triton baselines
      
      * lint
      
      * fix typo
      
      * Refactor MHA sink example to accept separate `seq_q` and `seq_kv` sequence-length parameters.
      
      * Add GQA sink example for optimized attention mechanism & lint fix
      
      * fix several typos and bugs
      
      * lint
      
      * fix speed issues of swa
      
      * Add flash attention example with backward pass for BHSD layout and corresponding test cases
      
      * Add backward pass implementation for flash attention with sinks and corresponding test case
      
      * fix lint and typo
      
      * Optimize the calculation of `dsinks`
      
      * Add support for swa backward and update examples
      
      * fix previous typos
      
      * Add example for GQA sink backward pass and update tests for both MHA and GQA sinks
      
      * fix lint
      
      * fix previous typos
      
      * typo
    • [FastMath] Disable default TVM fastmath intrinsic dispatch and add explicit... · 95c373f5
      Lei Wang authored
      [FastMath] Disable default TVM fastmath intrinsic dispatch and add explicit fastmath op to invoke (#875)
      
      * Add fast math operations for CUDA: exp, exp10, log, log2, log10, tan, cos, and sin (#865)
      
      * Refactor fast math operation definitions for consistency and readability in CUDA code. Consolidated multiple definitions into single lines and improved formatting in related test files for better clarity.
      
      * Remove unnecessary pass configurations for warp specialization and TMA lowering in fast math operation tests for CUDA. This simplifies the test setup while maintaining the focus on fast math functionality.
      
      * Update fastmath tests to reflect that tl.* intrinsics generate no fastmath versions and disable cache in main execution.
      
      * Fix formatting in fastmath test comments for clarity on tl.* intrinsics behavior.
      
      * Add precision comparison tool for CUDA operations
      
      This commit introduces a new Python script and CUDA source file for a precision comparison tool that evaluates the accuracy of various CUDA operations (including division, reciprocal, exponential, logarithmic, and trigonometric functions) across different implementations: CUDA Precise, CUDA Fast, Triton, Triton LibDevice, and TileLang. The tool generates test data, executes the operations, and summarizes the error statistics for each implementation against a double precision reference. Additionally, a README file is added to document the results of the comparisons for various operations.
      
      * Add precision comparison tool for CUDA operations
      
      This commit introduces a new precision comparison tool implemented in Python and CUDA, designed to evaluate the accuracy of various mathematical operations (division, reciprocal, exponential, logarithmic, trigonometric, square root, etc.) across different frameworks including CUDA Precise/Fast, Triton, Triton LibDevice, PyTorch, and TileLang. The tool includes functionality for generating test data, executing operations, and summarizing error statistics for each implementation. Additionally, it provides a comprehensive README with error metrics for each operation tested.
    • [CI][AMD] Remove AMD timeout test (#881) · 56f7494f
      alex_xiao authored
    • [Cython] Remove an incorrect check (#880) · 6f6ef7ad
      LJC00118 authored
  2. 25 Sep, 2025 4 commits
    • [Language] Support atomic add with ret (#870) · aa0b1090
      Lei Wang authored
      * Add atomic operations for CUDA templates in new atomic.h file
      
      - Introduced atomic functions including AtomicMax, AtomicMin, AtomicAdd, and their return variants for various data types.
      - Implemented support for half, bfloat16, and float types with appropriate memory ordering.
      - Moved atomic-related utilities from common.h to the new atomic.h file for better organization.
      - Added Python bindings for atomic operations in tilelang, including atomic_max, atomic_min, atomic_add, and their vectorized counterparts.
      - Updated customize.py to utilize the new atomic functions, enhancing modularity and maintainability.
      
      * Refactor atomic operations in CUDA templates for improved readability
      
      - Reformatted atomic operation implementations in atomic.h for better code clarity.
      - Adjusted function signatures in tilelang's atomic.py to enhance readability by aligning parameters.
      - Cleaned up unnecessary whitespace and comments in customize.py to streamline the codebase.
      
      * Add thread storage synchronization configuration option
      
      - Introduced a new configuration option `tl.disable_thread_storage_sync` to control the automatic insertion of thread synchronization barriers in shared memory access.
      - Updated the `ThreadSync` pass to check this configuration and bypass synchronization if disabled.
      - Enhanced documentation in `builtin.h` and `pass_config.py` to clarify the purpose and usage of the new option.
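      A minimal sketch of how a lowering pass can consult such a flag before inserting barriers. The pass machinery and statement representation below are simplified stand-ins for illustration, not TileLang's actual IR.

```python
# Config key taken from the commit text above.
DISABLE_SYNC_KEY = "tl.disable_thread_storage_sync"

def thread_sync_pass(stmts, pass_configs):
    """Insert a barrier after each shared-memory store unless the user
    opted out via the pass config."""
    if pass_configs.get(DISABLE_SYNC_KEY, False):
        return list(stmts)  # user opted out: emit no barriers
    out = []
    for s in stmts:
        out.append(s)
        if s.startswith("shared_store"):
            out.append("__syncthreads()")  # barrier after shared write
    return out

body = ["shared_store A", "shared_load A"]
print(thread_sync_pass(body, {DISABLE_SYNC_KEY: True}))
# → ['shared_store A', 'shared_load A']
```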
      
      * Refactor thread storage sync configuration retrieval
      
      - Simplified the retrieval of the thread storage sync configuration in the `ThreadSync` pass by removing unnecessary intermediate variables.
      - Ensured that the inclusion of `builtin.h` is consistent by moving it to the appropriate location in the file.
      
      * test fix
      
      * Update atomic operations and tests for improved functionality
      
      - Updated atomic operations in CUDA templates to remove unnecessary address_of calls, enhancing performance and readability.
      - Refactored atomic operation signatures in tilelang's atomic.py to accept references instead of pointers.
      - Added new atomic operations and corresponding test cases for atomic add, max, min, and load/store functionalities in the testing suite.
      - Updated the TVM subproject to the latest commit for better compatibility.
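      The return variant described above has fetch-and-add semantics: the add happens atomically and the value *before* the add is returned. A plain-Python model of the semantics (the real implementation is a CUDA atomic intrinsic, not a lock):

```python
import threading

class Accumulator:
    """Models atomic add-with-return: the add is atomic and the
    pre-add value is returned (fetch-and-add)."""
    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def atomic_add(self, delta):
        with self._lock:
            old = self._value
            self._value += delta
            return old  # value before the add

acc = Accumulator()
threads = [threading.Thread(target=acc.atomic_add, args=(1,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(acc._value)  # → 8
```

      The returned old value is what makes patterns like ticket counters and work-stealing queues expressible, which a fire-and-forget atomic add cannot do.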
      
      * Update attention sink examples to use 32 heads
      
      - Modified the `heads` parameter in both `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py` and `example_mha_sink_fwd_bhsd_wgmma_pipelined.py` from 1 to 32 to enhance performance in attention mechanisms.
      - Ensured consistency across example scripts for improved usability and testing.
      
      * Refactor atomic add handling in vectorization
      
      - Simplified the extraction of buffer loads for atomic add operations by removing unnecessary address_of calls, improving code clarity and performance.
      - Updated the data type retrieval for vectorization size calculation to directly access the buffer load node, enhancing efficiency.
      
      * Add loop break functionality and enhance thread synchronization
      
      - Introduced a new `loop_break` function in `customize.py` to allow breaking out of loops, returning a call to the `tl.loop_break` intrinsic.
      - Updated the `sync_threads` function in `builtin.py` to accept optional parameters for `barrier_id` and `arrive_count`, improving its flexibility for thread synchronization.
      - Added necessary imports in `__init__.py` to include the new `loop_break` function for broader accessibility.
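      In plain Python, `loop_break` corresponds to `break`; the early-exit search pattern it enables inside serial device loops looks like this (ordinary Python shown for illustration, not kernel code):

```python
def first_index(xs, pred):
    """Return the index of the first element satisfying pred, or -1.
    The `break` corresponds to a call to the loop_break intrinsic."""
    found = -1
    for i, x in enumerate(xs):
        if pred(x):
            found = i
            break  # corresponds to T.loop_break() in kernel code
    return found

print(first_index([3, 1, 4, 1, 5], lambda x: x > 3))  # → 2
```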
      
      * test fix
    • [Bugfix] Use `ExprDeepEqual` instead of `StructuralEqual` when merging consecutive If stmts (#876) · 1dfac2e8
      Lei Wang authored
      * Update submodule TVM to latest commit and fix condition comparison in merge_if_stmt.cc
      
      * Update submodule TVM to latest commit 0524f760
      
      * lint fix
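      The switch matters because a structural equality maps free variables positionally (alpha-equivalence), so two `if` conditions over *different* variables can compare equal and the merge pass would wrongly fuse their bodies; a deep equality requires the very same variables. A toy model of the two comparators over simplified expression trees (not TVM's IR):

```python
def deep_equal(a, b):
    """Strict: identical structure AND identical leaves (variables must
    be the same variable)."""
    if isinstance(a, tuple) and isinstance(b, tuple):
        return len(a) == len(b) and all(map(deep_equal, a, b))
    return a == b

def structural_equal(a, b, varmap=None):
    """Lenient: free variables are matched positionally, so `x` and `y`
    unify on first encounter."""
    varmap = {} if varmap is None else varmap
    if isinstance(a, str) and isinstance(b, str):  # variable leaves
        return varmap.setdefault(a, b) == b
    if isinstance(a, tuple) and isinstance(b, tuple):
        return len(a) == len(b) and all(
            structural_equal(x, y, varmap) for x, y in zip(a, b))
    return a == b

cond1 = ("<", "x", 16)
cond2 = ("<", "y", 16)
print(structural_equal(cond1, cond2), deep_equal(cond1, cond2))  # → True False
```

      Merging `if (x < 16)` with `if (y < 16)` would change program behavior, which is exactly the bug the deep comparison prevents.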
    • [Language] Support loop_break primitive (#873) · 15a303d2
      Yu Cheng authored
    • [Language] Support sequence comparisons (#872) · c538d8ab
      Lei Wang authored
      * Update submodule 'tvm' to latest commit 7a71ee34
      
      * lint fix
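      The feature being enabled is Python's chained (sequence) comparisons inside kernels, such as `0 <= i < n`, which desugar into a conjunction of pairwise comparisons:

```python
def in_range(i, n):
    """Chained comparison: equivalent to (0 <= i) and (i < n)."""
    return 0 <= i < n

print([in_range(i, 4) for i in (-1, 0, 3, 4)])  # → [False, True, True, False]
```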
  3. 24 Sep, 2025 2 commits
  4. 23 Sep, 2025 6 commits
  5. 22 Sep, 2025 4 commits
    • [AMD][MLA] Fix MLA autotune for ROCm (#861) · 3b21a67d
      Lei Wang authored
      * Refactor matmul example to include ReLU activation and update batch size in benchmark script
      
      * lint fix
      
      * Enhance autotuning capabilities in benchmark script and update argument defaults
      
      - Introduced a new `get_configs` function to generate autotuning configurations for the benchmark.
      - Updated the default batch size and kv context length in the argument parser for improved performance.
      - Renamed the `--auto_tune` argument to `--autotune` for consistency.
      - Modified the kernel invocation logic to support autotuning based on the new configurations.
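      A `get_configs`-style helper typically cross-products a few tile-size choices into config dicts for the autotuner. A hedged sketch of the pattern; the parameter names and value sets below are illustrative, not the benchmark script's exact ones.

```python
import itertools

def get_configs():
    """Cross-product candidate tile sizes and pipeline depths into
    autotuning configurations."""
    block_M = [64, 128]
    block_N = [64, 128]
    num_stages = [1, 2]
    return [dict(block_M=m, block_N=n, num_stages=s)
            for m, n, s in itertools.product(block_M, block_N, num_stages)]

configs = get_configs()
print(len(configs))  # → 8
```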
      
      * lint fix
    • [TMA] Bugfix when a shared buffer is issued with both TMA store and TMA load (#857) · b9a51c43
      Lei Wang authored
      - Updated `init_desc_arg_map` to use `Var` as the key instead of `String` in `lower_hopper_intrin.cc`.
      - Enhanced `func_call_args` method in `TLCUDASourceWrapper` to accept additional parameters for better argument mapping.
      - Added assertions to ensure consistency between function parameters and arguments during kernel launches.
      - Modified `generate_tma_descriptor_args` to utilize a mapping of variable names for TMA descriptor initialization.
    • [Doc] Optimize the quickstart guide for clarity, not just for CUDA (#858) · 058a670b
      Lei Wang authored
      * Refactor matmul example to include ReLU activation and update batch size in benchmark script
      
      * lint fix
  6. 21 Sep, 2025 1 commit
    • [PATCH] Static libg++ linking fix (#854) · a3497ebc
      Lei Wang authored
      * bump version to 0.1.6
      
      * phaseout py38
      
      * py39
      
      * Update submodule 'tvm' to latest commit adc0e48
      
      * [Build] Update CMake and Python environment settings
      
      - Added static linking flags for GCC and libstdc++ in CMakeLists.txt to enhance library linking.
      - Removed the cmake version requirement from pyproject.toml to allow for broader compatibility.
      - Updated the tox command in the Docker distribution script to include Python 3.8 for testing environments.
      
      * [Build] Update Python version requirements in scripts and documentation
      
      - Changed Python version requirement in README.md from 3.9+ to 3.8+.
      - Updated installation and testing scripts to use Python 3.8 instead of 3.9, ensuring compatibility with the new minimum version.
      - Adjusted tox commands in local and PyPI distribution scripts to include Python 3.8 in the testing environments.
      
      * [Build] Update Python and CMake requirements in Dockerfile and pyproject.toml
      
      - Added CMake version requirement (>=3.26) to pyproject.toml for build compatibility.
      - Created a Python 3.8 environment in the Dockerfile and added a symlink for easier access to the Python 3.8 executable.
      
      * [Build] Update CMake and Dockerfile for improved compatibility
      
      - Removed static linking flags from CMakeLists.txt to simplify build configuration.
      - Updated Dockerfile to use Ubuntu 20.04 and streamlined the installation of dependencies, removing gcc-9 and g++-9.
      - Adjusted symlink creation for Python environments to use the `-sf` option for safer linking.
      
      * [Build] Bump version to 0.1.6.post1 for post-release updates
      
      * [Build] Remove static linking flags from CMakeLists.txt
      
      - Eliminated static linking flags for GCC and libstdc++ to simplify build configuration and avoid potential conflicts with Python extensions.
      
      * [Build] Update Docker distribution scripts for manylinux compatibility
      
      - Changed base image from `tilelang-builder:18.04` to `tilelang-builder:manylinux` in both local and PyPI distribution scripts.
      - Updated Dockerfile references to use `pypi.manylinux.Dockerfile`.
      - Added `--gpus all` flag to the Docker run command to enable GPU support during execution.
      
      * lint fix
      
      * add cmake
  7. 19 Sep, 2025 3 commits
    • [Release] Bump Version to 0.1.6 (#818) · 1ad6e461
      Lei Wang authored
      * bump version to 0.1.6
      
      * phaseout py38
      
      * py39
      
      * Update submodule 'tvm' to latest commit adc0e48
      
      * [Build] Update CMake and Python environment settings
      
      - Added static linking flags for GCC and libstdc++ in CMakeLists.txt to enhance library linking.
      - Removed the cmake version requirement from pyproject.toml to allow for broader compatibility.
      - Updated the tox command in the Docker distribution script to include Python 3.8 for testing environments.
      
      * [Build] Update Python version requirements in scripts and documentation
      
      - Changed Python version requirement in README.md from 3.9+ to 3.8+.
      - Updated installation and testing scripts to use Python 3.8 instead of 3.9, ensuring compatibility with the new minimum version.
      - Adjusted tox commands in local and PyPI distribution scripts to include Python 3.8 in the testing environments.
      
      * [Build] Update Python and CMake requirements in Dockerfile and pyproject.toml
      
      - Added CMake version requirement (>=3.26) to pyproject.toml for build compatibility.
      - Created a Python 3.8 environment in the Dockerfile and added a symlink for easier access to the Python 3.8 executable.
    • [Refactor] Enhance buffer store transformation in TIR pass (#851) · 094e2298
      Lei Wang authored
      - Updated the `AddWrapperForSingleBufStore` function to improve the handling of buffer stores by adding detailed checks for fragment buffer accesses and ensuring only index 0 is used.
      - Introduced new helper functions for collecting buffer accesses and indices, enhancing code readability and maintainability.
      - Refined the logic for determining tile operations and thread bindings to ensure accurate transformations without affecting existing parallel structures.
    • [Py38] Revert typing and parser updates for Python 3.8 compatibility (#850) · bc9623fc
      Lei Wang authored
      * Update submodule TVM to commit 872e32c1 and adjust type hints in nvcc.py and utils.py for compatibility with Python typing standards.
      
      * Update requirements.txt to specify ml_dtypes without a version constraint, indicating that versions greater than 0.5.1 are needed for fp4 support.
  8. 18 Sep, 2025 6 commits
    • [AMD] Fix bf16x2 dtype codegen (#847) · 6efeb743
      Jiaxing Ding authored
    • [Refactor] Turn off `ENABLE_FAST_MATH` by default (#846) · e7e38355
      Lei Wang authored
      * [Enhancement] Enable fast math optimization in tilelang JIT configurations
      
      - Updated multiple examples and kernel functions to include `pass_configs` for enabling fast math optimization.
      - Added support for the `TL_ENABLE_FAST_MATH` configuration option in the built-in operations.
      - Enhanced the `LibraryGenerator` to handle the new fast math configuration, ensuring compatibility with existing settings.
      - Updated documentation to reflect the changes in fast math handling and deprecation of the `TL_DISABLE_FAST_MATH` option.
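      With fast math off by default, opting back in becomes a per-kernel decision through a pass config. A hedged config sketch; the key spelling below is inferred from the `TL_ENABLE_FAST_MATH` option named above and may differ from the released API.

```python
# Illustrative only: verify the exact key against the tilelang docs.
pass_configs = {"tl.enable_fast_math": True}
# Passed to the JIT entry point that accepts pass_configs, this re-enables
# fast-math codegen for that kernel only.
```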
      
      * lint fix
      
      * [Refactor] Introduce deprecated_warning utility for improved deprecation handling
      
      - Added a new `deprecated_warning` function to streamline deprecation messages.
      - Updated the `LibraryGenerator` to utilize the new function for warning about the deprecated `TL_DISABLE_FAST_MATH` configuration.
      - Enhanced the `deprecated` decorator to support phaseout version messaging, improving clarity for users.
    • [CI] Test Fix: Handle BufferLoad nodes when T.gemm input has a stride (#843) · ebea77d9
      Lei Wang authored
      * bugfix
      
      * fix
      
      * test fix
    • [Refactor] Refactor some build-related configurations (#827) · 232782dd
      Lei Wang authored
      * bugfix
      
      * [Build] Update build dependencies and Dockerfile configuration
      
      - Updated `pyproject.toml` and `requirements-build.txt` to specify Cython version as `Cython>=3.0.0`.
      - Removed unnecessary dependencies from the build system.
      - Enhanced `pypi.Dockerfile` to install gcc-9 and g++-9, and added ninja-build for improved build performance.
      - Updated conda environment creation to include Python 3.9 to 3.12, while removing the Python 3.8 environment.
      
      * cmake fix
      
      * fix
      
      * fix
  9. 17 Sep, 2025 5 commits
    • [Enhancement] Add an MXFP4 grouped GEMM example for FusedMoE (#811) · 8554cb01
      Tong WU authored
      * [Enhancement] Enhance dequantization examples and utilities
      
      - Added a new example for grouped matrix multiplication with experts in `example_dequant_groupgemm_bf16_mxfp4_hopper.py`.
      - Improved dequantization logic in existing examples by replacing nested loops with vectorized operations for better performance.
      - Updated `torch_convert_bit_twiddling` function in `utils.py` to utilize parallel processing, enhancing efficiency and clarity in the conversion process.
      Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
      
      * fix typos in docstrings
      
      * remove redundant code
      
      * [Format] Unreproducible debug with T.print
      
      * [BugFix] Correct dtype in ref dequantize; larger data distribution
      
      * [Format]
      
      * [Refactor] Clean up and optimize example_dequant_groupgemm_bf16_mxfp4_hopper.py and utils.py
      
      - Removed unnecessary cache disabling and manual seed setting in the example.
      - Simplified nested loops into parallelized operations for better readability and performance.
      - Updated the assertion function in utils.py to print detailed error messages.
      - Adjusted tensor sizes in examples
      
      * [Refactor] Update import path in example_dequant_gemm_fine_grained.py
      
      - Changed the import statement for `_tir_packed_to_unsigned_convert` from `bitblas.quantization` to `tilelang.quantize` to reflect the new module structure.
      
      * lint
      
      * rename and add test
      
      * lint
      
      * [Feature] Enhance autotuning and configuration generation in example_dequant_groupedgemm_bf16_mxfp4_hopper.py
      
      - Added a new function `get_configs()` to generate hyperparameter configurations for tuning.
      - Updated the `matmul` function to utilize autotuning with the new configurations.
      - Improve kernel performance via vectorization and threadblock swizzle.
      - Enhanced the main function to support the new autotuning inputs and updated parameters for better performance.
      
      * lint
      
      * fix typo
      
      * fix typo and lint
      
      * make ci format check happy
      
      * fix ci
      
      ---------
      Co-authored-by: Zhengju Tang <97930865+tzj-fxz@users.noreply.github.com>
      Co-authored-by: tzj-fxz <tzjfxz@gmail.com>
    • [Bugfix] Skip fp4 dtype binding when using older versions of ml_dtypes (#824) · e4a346fe
      Lei Wang authored
      * bug fix when git is not installed
      
      * ml_dtypes_fix
    • a57f8270
    • [DSL] Support Python ternary if-then-else expressions (#822) · 15479958
      Lei Wang authored
      * support Python ternary if-then-else expressions
      
      * lint fix
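      Inside a kernel, the ternary lowers to a scalar select; a plain-Python stand-in for the intrinsic (`T.if_then_else` is the usual TileLang/TVM-script name, assumed here):

```python
def if_then_else(cond, a, b):
    """Scalar stand-in for the select intrinsic the parser can now emit
    for Python ternary expressions inside kernels."""
    return a if cond else b

# `y = x if x > 0 else 0` inside a kernel now parses; a ReLU, effectively.
print([if_then_else(x > 0, x, 0) for x in [-2, -1, 0, 3]])  # → [0, 0, 0, 3]
```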
  10. 16 Sep, 2025 3 commits
    • [Example] Remove redundant param (#821) · 907c3ff0
      botbw authored
    • [CI] Fix ROCm CI (#819) · d3e75b70
      Cunxiao Ni authored
      * [CI] fix rocm ci
      
      * Trigger CI
    • [Example] Add w4a8 GEMM kernel (#815) · 4bcb1593
      Cunxiao Ni authored
      * [Bugfix] fix autotune bug
      
      * [Example] add w4a8 gemm kernel
      
      * fix lint: pin the version of `ml_dtypes`
      The version of `ml_dtypes` should be pinned in the dependency specification; if it is too old, errors such as fp4 not being defined may occur.
      
      * Renames example for dequantization GEMM
      
      * format
      
      * add w4a8 example to ci
      
      * fix lint
  11. 15 Sep, 2025 1 commit
    • [Refactor] Update TVM subproject and streamline buffer store handling (#816) · 85d1a6b3
      Yu Cheng authored
      - Updated the TVM subproject to the latest commit for improved functionality.
      - Refactored `warp_specialized_rewriter.cc` to replace placeholder implementations for `BlockNode` and `BlockRealizeNode` with proper role filtering, enhancing code clarity and maintainability.
      - Ensured consistent handling of the `cp_async_barrier_noinc` function in `builtin.py` by adding a newline at the end of the file.