Commits · cd681e6384c72fb8fd0375e21b58791e549ce8fc · OpenDAS / tilelang

19 Nov, 2025 1 commit

[Fix] Fix memory leak bug (#1281) · cd681e63

Kuris authored Nov 19, 2025

* add typing stub for tir.ir

* remove idents

* minor update

* [Refactor] add numpy conversion for dtype

* fix lint error

* remove unused np.float_ in dtype conversion

* fix type in np.int_

* fix typo

* minor fix

* remove debug files

* fix memory leak bug

* fix lint error

* add comments

* fix lint error

* remove duplicated, because tilelang doesn't depend on deprecated

cd681e63

18 Nov, 2025 3 commits

[FFI] Use tvm ffi as the default execution backend (#1259) · 74da3696

Lei Wang authored Nov 18, 2025

* [Refactor] Update FFI type handling and simplify argument management

* Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity.
* Updated function registration in `runtime.cc` to utilize canonical names for better consistency.
* Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled.
* Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection.
* Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity.

* [Update] Sync TVM submodule and enhance kernel source handling

* Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes.
* Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging.
* Commented out the main execution call in test files to prevent unintended execution during testing.
* Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues.
* Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends.

* [Refactor] Clean up imports and improve code formatting

* Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code.
* Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency.
* Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality.

* Update execution backend options and improve resolution logic

- Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target.
- Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions.
- Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target.
- Updated documentation to reflect changes in execution backend options and their defaults.

* lint fix

* fix

* Enhance argument handling in CUDA and HIP runtime modules

- Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime.
- Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers.
- Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks.

* lint fix

* minor fix

* fix

* recover check

* Refactor argument binding and validation in `arg_binder.cc`

- Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers.
- Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards.
- Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling.
- Minor adjustments in test files to streamline kernel execution and improve readability.

* lint fix

* stride fix

* minor fix

* fix

* lint fix

* Add CUDA stream access policy window helpers and integrate with L2 persistent cache management

- Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage.
- Updated runtime files to include new FFI packed functions for managing stream attributes.
- Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown.
- Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source.

* check with symbolic

* support null ptr

* Update CMakeLists and lower.py for code generation and subproject status

- Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support.
- Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility.
- Marked the TVM subproject as dirty to indicate local modifications.

* lint fix

* Update comments for clarity in quickstart.py

74da3696

[Language] Add shape check in `T.view/reshape` (#1277) · 921b96a3
Chaofan Lin authored Nov 18, 2025
```
* [Language] Add shape check in T.view/reshape

* address comments
```
921b96a3
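
Note on the shape check above: a minimal sketch of what such a check could look like, assuming it compares element counts between the source and target shapes. The helper name `check_reshape` is hypothetical and not part of TileLang; a `T.view` that also changes the dtype would presumably compare byte sizes instead, which this sketch ignores.

```python
from math import prod


def check_reshape(src_shape, dst_shape):
    """Return dst_shape as a tuple, or raise if the element counts differ.

    Hypothetical helper for illustration only; it is not the TileLang API.
    """
    src_elems = prod(src_shape)
    dst_elems = prod(dst_shape)
    if src_elems != dst_elems:
        raise ValueError(
            f"cannot reshape {tuple(src_shape)} ({src_elems} elements) "
            f"into {tuple(dst_shape)} ({dst_elems} elements)"
        )
    return tuple(dst_shape)
```

For example, `check_reshape((4, 8), (32,))` passes, while `check_reshape((4, 8), (5, 6))` raises `ValueError`.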

Fix various issues under `int64_t` static and dynamic shape. (#1218) · 49c85715

Elevator14B authored Nov 18, 2025



* Fix various issues under int64_t static and dynamic shape.

* Resolve reviewed issues.

* Add unit test.

* fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

49c85715

17 Nov, 2025 1 commit

[Refactor] add support for numpy dtype conversion (#1255) · 041d4a06

Kuris authored Nov 17, 2025

* add typing stub for tir.ir

* remove idents

* minor update

* [Refactor] add numpy conversion for dtype

* fix lint error

* remove unused np.float_ in dtype conversion

* fix type in np.int_

* fix typo

* minor fix

* remove debug files

041d4a06

16 Nov, 2025 1 commit

[BugFix] Remove memory_order in atomic constexpr and fix NSA bwd (#1260) · 2de566e7

Kevinzz authored Nov 16, 2025



* fix nsa bwd and atomic

* [Lint]

* [BugFix]
- New implementation for atomicMax and atomicMin using atomicCAS
- PTX version atomicAdd for single 16-byte data
- Modify the test cases

* [Lint]

---------
Co-authored-by: tzj-fxz <tzjfxz@gmail.com>

2de566e7
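
Note on the atomicCAS-based `atomicMax`/`atomicMin` mentioned above: the general technique is a compare-and-swap retry loop. The sketch below simulates it on the host in Python. `AtomicCell` is a hypothetical stand-in for one word of device memory, and the lock only emulates the atomicity that hardware CAS provides; the real implementation lives in CUDA templates and is not reproduced here.

```python
import threading


class AtomicCell:
    """Toy stand-in for one memory word that supports compare-and-swap."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_and_swap(self, expected, new):
        """Store `new` only if the cell still holds `expected`; return the old value."""
        with self._lock:
            old = self._value
            if old == expected:
                self._value = new
            return old


def atomic_max(cell, candidate):
    """CAS loop: retry until the cell holds at least `candidate`.

    Returns the value observed before the update (or before giving up),
    mirroring the "return old value" convention of CUDA atomics.
    """
    old = cell.load()
    while old < candidate:
        seen = cell.compare_and_swap(old, candidate)
        if seen == old:
            break  # our write landed
        old = seen  # another writer got in first; re-check against its value
    return old
```

An `atomic_min` follows the same pattern with the comparison reversed.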

14 Nov, 2025 1 commit

[Language] Add missing while statement (#1254) · 5eb30a4f

Kuris authored Nov 14, 2025

* add typing stub for tir.ir

* remove idents

* minor update

* [Language] Add missing while statement

* add test

5eb30a4f

12 Nov, 2025 1 commit

[Enhancement] Support Layout/Fragment Reshape (#1241) · 4370309b

Lei Wang authored Nov 12, 2025



* Update layout handling and introduce reshape functionality

- Updated the `LayoutNode` class to include a new `Reshape` method, allowing for dynamic reshaping of layouts based on input shapes.
- Enhanced the `OutputShape` method to provide better handling of cases where the analyzer cannot form an `IntervalSet`, implementing fallback mechanisms to ensure safe extents.
- Refactored the `ReduceOpNode` to utilize `BufferRegion` for improved memory handling during reduction operations.
- Added tests for reshaping functionality and layout transformations to ensure correctness and performance in various scenarios.

* lint fix

* Revert tvm submodule pointer to 1815c3e0b6ec4ead36370bbd1562025d8529017c; keep src unchanged

* Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1

* Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1

* remove useless prove

* remove comment

---------
Co-authored-by: tilelang-bot <bot@tilelang>

4370309b
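
Note on layout reshape: one way to think about reshaping a layout is to flatten the new logical index in row-major order, unflatten it into the old shape, and then apply the original layout. The sketch below does this on concrete integers. All names are hypothetical; the `Reshape` method on `LayoutNode` described above operates on symbolic expressions and may differ in detail.

```python
from math import prod


def ravel(index, shape):
    """Row-major flattening of a multi-dimensional index."""
    flat = 0
    for i, extent in zip(index, shape):
        flat = flat * extent + i
    return flat


def unravel(flat, shape):
    """Inverse of ravel: recover the multi-dimensional index."""
    index = []
    for extent in reversed(shape):
        index.append(flat % extent)
        flat //= extent
    return tuple(reversed(index))


def reshape_layout(layout, old_shape, new_shape):
    """Return a layout over new_shape that addresses the same elements as `layout`."""
    if prod(old_shape) != prod(new_shape):
        raise ValueError("reshape must preserve the number of elements")

    def reshaped(index):
        return layout(unravel(ravel(index, new_shape), old_shape))

    return reshaped


def column_major_2x3(index):
    """Example layout: a 2x3 buffer stored column-major."""
    i, j = index
    return j * 2 + i


flat_view = reshape_layout(column_major_2x3, (2, 3), (6,))
```

Here `flat_view((4,))` resolves logical element 4 to `(1, 1)` in the old shape and therefore to physical offset 3.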

11 Nov, 2025 1 commit

[Enhancement] Add thread count validation for ReduceOp fragment layout inference (#1225) · 67cc8611

Lei Wang authored Nov 11, 2025

* [Enhancement] Add thread count validation for ReduceOp fragment layout inference

* Introduced a check to ensure that the thread count is divisible by the replicate extent during layout inference in ReduceOpNode. This validation prevents layout inference failures and provides detailed error messages to guide users in resolving issues related to thread block sizes and fragment layouts.
* Updated tests to remove unsupported configurations that could lead to layout inference errors, ensuring more robust testing scenarios.

* lint fix

67cc8611
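
Note on the divisibility check above: a minimal sketch of the validation, assuming it simply requires the thread count to be a multiple of the replicate extent. The function name and the error wording are hypothetical, and the return value (threads per replica) is an illustration rather than anything the commit describes.

```python
def check_thread_count(num_threads, replicate_extent):
    """Validate that the threads split evenly across replicas.

    Returns the number of threads per replica. Hypothetical helper for
    illustration only.
    """
    if replicate_extent <= 0:
        raise ValueError("replicate extent must be positive")
    if num_threads % replicate_extent != 0:
        raise ValueError(
            f"thread count {num_threads} is not divisible by replicate "
            f"extent {replicate_extent}; adjust the thread block size or "
            f"the fragment layout"
        )
    return num_threads // replicate_extent
```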

06 Nov, 2025 2 commits

[Feat] Add A Pass to Handle Negative Index (#1192) · 0592834f
Kurisu authored Nov 06, 2025

0592834f

[Feat] Add support for `T.serial` with step and negative step (#1188) · 777881e1

Kurisu authored Nov 06, 2025



* [Feature] Support serial for with step

* add more tests

* fix

* Enhance trip count validation in SerialForWithStep to ensure non-zero step values and prevent undefined behavior. Added error handling for zero step values and improved logging for non-constant steps.

* Update builder.py

* fix lint error

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

777881e1
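
Note on the trip-count validation above: for a serial loop with a constant step, the trip count can be computed with a ceiling division, and a zero step has to be rejected up front. The sketch below follows Python `range` semantics, which is an assumption about how `T.serial` treats its bounds; `serial_trip_count` is a hypothetical name.

```python
def serial_trip_count(start, stop, step):
    """Number of iterations of a loop from start towards stop with the given step."""
    if step == 0:
        raise ValueError("step must be non-zero")
    # Distance to cover, measured in the direction of travel.
    span = stop - start if step > 0 else start - stop
    if span <= 0:
        return 0
    stride = abs(step)
    return (span + stride - 1) // stride  # ceiling division
```

For instance, `serial_trip_count(10, 0, -3)` visits 10, 7, 4, 1 and returns 4.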

05 Nov, 2025 2 commits

[Feature] Add `tl.infinity` operator for infinity handling of bfloat16 (#1175) · 11456de2

Tong WU authored Nov 06, 2025



* Update dependency version for apache-tvm-ffi in pyproject.toml to fix CI

* [Math] Add `tl.infinity` operation and update Python interface for infinity handling

- Implemented `infinity_op` in C++ to return infinity values for supported data types.
- Registered new operation `tl.infinity` with appropriate attributes.
- Updated Python interface to call the new `tl.infinity` operation instead of the previous method.

* Add unit tests for `tl.infinity` operation in TileLang

- Introduced a new test file `test_tilelang_language_infinity.py` to validate the behavior of the `tl.infinity` operation across multiple data types (float16, bfloat16, float32, float64).
- Implemented a kernel to fill a tensor with infinity values and asserted the correctness of the output against PyTorch's `torch.inf`.

* lint

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

11456de2

[Feat] Add swap like grammar in tuple assignment (#1185) · 055f8500

Kurisu authored Nov 05, 2025

* [Feat] add 2 phase binding to allow swap two var

* Minor update tvm dtype constructor

* fix lint error

055f8500
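
Note on two-phase binding: a swap such as `a, b = b, a` only works if every right-hand side is evaluated before any target is rebound. The sketch below models that with a plain dictionary as the environment. The names are hypothetical and the real frontend binds TIR variables rather than dictionary entries.

```python
def bind_two_phase(env, targets, value_exprs):
    """Evaluate all right-hand sides first, then bind all targets."""
    # Phase 1: evaluate against the environment as it was before the assignment.
    values = [expr(env) for expr in value_exprs]
    # Phase 2: bind the targets, so no target sees a half-updated environment.
    for name, value in zip(targets, values):
        env[name] = value
    return env


def bind_sequential(env, targets, value_exprs):
    """Naive one-phase binding, shown only to illustrate the failure mode."""
    for name, expr in zip(targets, value_exprs):
        env[name] = expr(env)
    return env


swap_rhs = [lambda e: e["b"], lambda e: e["a"]]
swapped = bind_two_phase({"a": 1, "b": 2}, ["a", "b"], swap_rhs)
broken = bind_sequential({"a": 1, "b": 2}, ["a", "b"], swap_rhs)
```

With two-phase binding `swapped` is `{"a": 2, "b": 1}`; the sequential variant yields `{"a": 2, "b": 2}` because `b` reads the already overwritten `a`.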

03 Nov, 2025 1 commit

[Language] Initial version of tilelang frontend v2 (#1120) · 5f202fe5

Kurisu authored Nov 03, 2025



* tilelang frontend v2

* syntax sugar: defining a local var by annotation

* [Refactor] fix type linting warning like `T.float32`

* Add tl.local_var_init for new tl.float32

* allow passing default argument as function annotation

* allow default arguments as annotation

* fix lint error

* minor fix

* [Refactor] refactor tilelang.jit and tilelang.autotune

* minor fix

* minor fix

* minor fix

* fix metal get function name

* add par_compile impl and tests

* Type consistency on tvm datatype
1. isinstance(tl.float32, tvm.DataType) == True
2. Allow `tl.float32` as function annotations
3. Allow `tl.float32` as argument to be passed to `tl.alloc` or other functions

* fix lint error

* add more warning in frontend

* update tvm version

* Minor fix on tvm_ffi annotations

* add document and examples

* fix lint error

* Simplify index calculations in example_chunk_o_bwd.py

Refactor index calculations for dg_last_fragment assignment.

* minor fix

* lint fix

---------
Co-authored-by: Lei Wang <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

5f202fe5

02 Nov, 2025 2 commits

[Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986) · aef0a6bb

Lei Wang authored Nov 03, 2025



* remove debug print

* pipeline fix

* use the correct buffer access scope

* rs support

* warp warpgroup_fence_operand

* fix

* fp8 dtype ptx enhance

* mma fix

* TCGEN05 Interface

* tcgen05 support

* rebase

* update

* Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.

* lint fix

* Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.

* wgmma fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

aef0a6bb

[Bugfix] Fix tvm import path for editable build (#1172) · c85bb3ac
Lei Wang authored Nov 02, 2025

c85bb3ac

01 Nov, 2025 1 commit
- [Testing] Move TMA 1D and test for its functionality (#1167) · 5c62d00a
  Zhengju Tang authored Nov 01, 2025
```
* [Testing] Move TMA 1D and test for its functionality

* [Lint]
```
  5c62d00a
29 Oct, 2025 2 commits

[Bugfix] Enhance LetStmt handling in Vectorize Loop Pass (#1159) · 79730b11

Lei Wang authored Oct 30, 2025

* [Refactor] Enhance TLVectorizer with loop vectorization convenience method and improve let variable handling

* lint fix

* let test fix

* lint fix

79730b11

[Enhancement] Enhance Cast operations Vectorization (#1156) · feef9ef6
LJC00118 authored Oct 29, 2025
```
* Enhance Cast vectorized

* Add Parallel vectorized cast test

* code lint

* merge newest commit
```
feef9ef6

28 Oct, 2025 1 commit
- [BugFix] alloc_var init failed to handle complex expression (#1144) · 399af087
  Kurisu authored Oct 28, 2025
```
* [Fix] init var with complex expression

* fix lint error
```
  399af087
23 Oct, 2025 1 commit

[Feature] Enhance vectorized conversion support in CUDA codegen (#1095) · a148d62a

Tong WU authored Oct 23, 2025

* [Feature] Add vectorized float16 and float32 conversion support in CUDA codegen

* Implemented handling for conversions between float16 and float32 types, specifically for vectorized operations using __half22float2 and __float22half2_rn.
* Enhanced the existing code to support both directions of conversion based on the lane count.
* Improved overall type handling in the VisitExpr_ method for better compatibility with TileLang.

* [Feature] Add float32 to float8 conversion support in CUDA codegen

* Implemented handling for conversion from float32 to float8 (E4M3/E5M2) in the VisitExpr_ method.
* Added vectorized conversion support using __nv_cvt_float2_to_fp8x2 for float2 to fp8x2 transformations.
* Enhanced type handling for better compatibility with TileLang, particularly for float8 types.

* lint

* fix a bug

* [Enhancement] Support lanes=4 cases and add unit test for vectorized cast

* lint

* [Feature] Refactor bf16 conversion operations and remove legacy compile flags

* lint

a148d62a

21 Oct, 2025 1 commit

[Language] Support tilelang `alloc_var(dtype, init=x)` (#1092) · bddb125e

Lei Wang authored Oct 21, 2025

* - carry existing local-var initializer map into OpaqueBlockLower, reattach it to
    generated Allocates and the PrimFunc attrs
  - thread the map through FlattenBuffer and StorageRewrite so flattened/merged
    allocations keep their tl.local_var_init annotations
  - teach annotation handling to accept scalar initializers, resolve buffers, and merge
    with existing stat

* lint fix

* enhance

* lint fix

* lint fix

bddb125e

20 Oct, 2025 3 commits
- [Language] Efficient `T.reduce_` with shared memory input/output (#1080) · bc37ea69
  Lei Wang authored Oct 20, 2025
```
* Support reduce ss

* lint fix

* test fix

* lint fix
```
  bc37ea69
- [Language] Recommend using `T.dynamic` instead of `T.symbolic` (#1076) · a7730272
  Lei Wang authored Oct 20, 2025
```
* recommend using T.dynamic instead of T.symbolic

* lint fix

* lint fix
```
  a7730272
- [Parallel] Support `T.Parallel` with dynamic extents (#990) · 27701c3d
  Lei Wang authored Oct 20, 2025
```
* Allow dynamic extents in loop partition; warn when layout inversion falls back to NoCheck

* add test and introduce predicate

* test fix

* fix

* enhance

* inverse with level

* test fix

* bug fix
```
  27701c3d
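
Note on dynamic extents with a predicate: when the loop extent is only known at run time, it may not divide evenly by the thread count, so each thread runs a rounded-up number of iterations and a predicate masks the out-of-range ones. The sketch below simulates that on the host. It is an assumption about the general technique, not a description of the actual loop-partition pass, and `run_parallel` is a hypothetical name.

```python
def run_parallel(extent, num_threads):
    """Simulate a guarded parallel loop and return the indices that were executed."""
    if num_threads <= 0:
        raise ValueError("num_threads must be positive")
    iters_per_thread = -(-extent // num_threads)  # ceiling division
    executed = []
    for it in range(iters_per_thread):
        for tid in range(num_threads):
            idx = it * num_threads + tid
            if idx < extent:  # predicate guarding the tail iterations
                executed.append(idx)
    return executed
```

With `extent=10` and `num_threads=4`, each thread runs three iterations and the predicate masks the two surplus slots in the last round.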
17 Oct, 2025 2 commits

[Refactor] Refactor Pass `LegalizeSafeMemoryAccess` to support recursive load/store rewrite (#1050) · 72111642

Chaofan Lin authored Oct 17, 2025



* [Refactor] Refactor Pass `LegalizeSafeMemoryAccess` to support recursive load/store rewrite

* lint

* recursive collect conds for call_extern

* fix name

* [Lint]: [pre-commit.ci] auto fixes [...]

* lint

* [Lint]: [pre-commit.ci] auto fixes [...]

* lint

* [Lint]: [pre-commit.ci] auto fixes [...]

* address comment

* rename pad_value to safe_value

* lint

* add oob store test

* [Lint]: [pre-commit.ci] auto fixes [...]

* fix

* fix

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

72111642

[Enhancement] Remove constraint requiring last dimension stride to be 1 (#1040) · 35cf8885

LJC00118 authored Oct 17, 2025



* remove last dimension stride must be 1 constraint

* add vectorize test

* minor fix

* [Lint]: [pre-commit.ci] auto fixes [...]

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

35cf8885

15 Oct, 2025 1 commit

[Language] Expose `T.get_warp_idx_sync` and `T.shuffle_elect` for efficient thread election (#989) · b78d8404

Lei Wang authored Oct 15, 2025



* Expose CUDA warp/lane intrinsics in TileLang frontend

* generalize warp indexing intrinsics and add coverage

* [Lint]: [pre-commit.ci] auto fixes [...]

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

b78d8404

14 Oct, 2025 2 commits

[Language] Support Consequential assignments like 'a = b = c = 1' (#992) · e59e7f9a

Lei Wang authored Oct 14, 2025



* chained assignments

* test update

* [Lint]: [pre-commit.ci] auto fixes [...]

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

e59e7f9a
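
Note on chained assignments: in Python's own `ast` module, `a = b = c = 1` parses to a single `Assign` node with three targets and one value. The sketch below only demonstrates that standard-library behaviour. How the TileLang frontend consumes the node is not shown in this log, and `chained_targets` is a hypothetical helper.

```python
import ast


def chained_targets(source):
    """Return the target names and the literal value of a chained assignment."""
    node = ast.parse(source).body[0]
    if not isinstance(node, ast.Assign):
        raise TypeError("expected an assignment statement")
    names = [target.id for target in node.targets]
    return names, ast.literal_eval(node.value)
```

Calling `chained_targets("a = b = c = 1")` returns `(["a", "b", "c"], 1)`, so a frontend has to bind the same value to every entry in `targets`.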

[Transform] Migrate `LowerIntrin` from tvm into tilelang (#999) · 7a5077e4
Lei Wang authored Oct 14, 2025
```
* Do not lower ceildiv to >>

* lint fix

* test fix

* fallback ceildiv changes
```
7a5077e4

11 Oct, 2025 1 commit
- [TileOp] Implement `CumSum1D` (#978) · 747381ae
  Lei Wang authored Oct 11, 2025
```
* support cumsum-1d

* cumsum 1d support
```
  747381ae
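
Note on 1-D cumulative sum: as a reference for what a 1-D cumsum computes, here is a host-side sketch of an inclusive prefix sum. The `reverse` flag is an assumption added for illustration; this log does not say which options `CumSum1D` supports.

```python
from itertools import accumulate


def cumsum_1d(values, reverse=False):
    """Inclusive prefix sum over a 1-D sequence; optionally from the right."""
    values = list(values)
    if reverse:
        return list(accumulate(reversed(values)))[::-1]
    return list(accumulate(values))
```

For example, `cumsum_1d([1, 2, 3, 4])` gives `[1, 3, 6, 10]`, and with `reverse=True` it gives `[10, 9, 7, 4]`.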
07 Oct, 2025 1 commit

[Enhancement] Add buffer load copy functions and improve copy logic in tilelang (#946) · c61971e8

Lei Wang authored Oct 07, 2025

- Introduced new functions for buffer load copy with stride and parallel execution.
- Enhanced the copy logic in `copy.py` to simplify nested if statements for BufferLoad nodes.
- Added corresponding test cases for the new buffer load functionalities.

c61971e8

25 Sep, 2025 2 commits

[Language] Support atomic add with ret (#870) · aa0b1090

Lei Wang authored Sep 26, 2025

* Add atomic operations for CUDA templates in new atomic.h file

- Introduced atomic functions including AtomicMax, AtomicMin, AtomicAdd, and their return variants for various data types.
- Implemented support for half, bfloat16, and float types with appropriate memory ordering.
- Moved atomic-related utilities from common.h to the new atomic.h file for better organization.
- Added Python bindings for atomic operations in tilelang, including atomic_max, atomic_min, atomic_add, and their vectorized counterparts.
- Updated customize.py to utilize the new atomic functions, enhancing modularity and maintainability.

* Refactor atomic operations in CUDA templates for improved readability

- Reformatted atomic operation implementations in atomic.h for better code clarity.
- Adjusted function signatures in tilelang's atomic.py to enhance readability by aligning parameters.
- Cleaned up unnecessary whitespace and comments in customize.py to streamline the codebase.

* Add thread storage synchronization configuration option

- Introduced a new configuration option `tl.disable_thread_storage_sync` to control the automatic insertion of thread synchronization barriers in shared memory access.
- Updated the `ThreadSync` pass to check this configuration and bypass synchronization if disabled.
- Enhanced documentation in `builtin.h` and `pass_config.py` to clarify the purpose and usage of the new option.

* Refactor thread storage sync configuration retrieval

- Simplified the retrieval of the thread storage sync configuration in the `ThreadSync` pass by removing unnecessary intermediate variables.
- Ensured that the inclusion of `builtin.h` is consistent by moving it to the appropriate location in the file.

* test fix

* Update atomic operations and tests for improved functionality

- Updated atomic operations in CUDA templates to remove unnecessary address_of calls, enhancing performance and readability.
- Refactored atomic operation signatures in tilelang's atomic.py to accept references instead of pointers.
- Added new atomic operations and corresponding test cases for atomic add, max, min, and load/store functionalities in the testing suite.
- Updated the TVM subproject to the latest commit for better compatibility.

* Update attention sink examples to use 32 heads

- Modified the `heads` parameter in both `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py` and `example_mha_sink_fwd_bhsd_wgmma_pipelined.py` from 1 to 32 to enhance performance in attention mechanisms.
- Ensured consistency across example scripts for improved usability and testing.

* Refactor atomic add handling in vectorization

- Simplified the extraction of buffer loads for atomic add operations by removing unnecessary address_of calls, improving code clarity and performance.
- Updated the data type retrieval for vectorization size calculation to directly access the buffer load node, enhancing efficiency.

* Add loop break functionality and enhance thread synchronization

- Introduced a new `loop_break` function in `customize.py` to allow breaking out of loops, returning a call to the `tl.loop_break` intrinsic.
- Updated the `sync_threads` function in `builtin.py` to accept optional parameters for `barrier_id` and `arrive_count`, improving its flexibility for thread synchronization.
- Added necessary imports in `__init__.py` to include the new `loop_break` function for broader accessibility.

* test fix

aa0b1090

[Language] Support sequence comparisons (#872) · c538d8ab
Lei Wang authored Sep 25, 2025
```
* Update submodule 'tvm' to latest commit 7a71ee34

* lint fix
```
c538d8ab

17 Sep, 2025 1 commit
- [DSL] Support python ternary if then else expression (#822) · 15479958
  Lei Wang authored Sep 17, 2025
```
* support python ternary if then else expression

* lint fix
```
  15479958
24 Aug, 2025 1 commit

[Typo] Remove `disable_cache` in some tests (#755) · 556d411e

Lei Wang authored Aug 25, 2025

* Update test parameters and remove debug print statement

- Adjusted test cases in `test_tilelang_dynamic_symbolic_bench.py` to use smaller matrix sizes (1024x1024) for improved performance and quicker execution.
- Removed a debug print statement from `phase.py` to clean up the code and enhance clarity.

* Refactor loop stack management in warp_specialized_rewriter

- Introduced a new `LoopInfo` struct to encapsulate loop variable details, including `loop_var`, `extent`, and `min`, enhancing clarity and maintainability.
- Updated the `loop_stack_` to utilize `LoopInfo` instead of a pair, improving type safety and readability.
- Adjusted linear index calculations to account for the new structure, ensuring correct behavior in loop transformations.

* Remove unused `torch.backends` import and `tilelang.disable_cache()` calls from multiple test files to enhance code clarity and maintainability.

556d411e

23 Aug, 2025 1 commit

[Refactor] Merge ThreadPartialSync and ThreadStorageSync (#741) · 6b125028

Lei Wang authored Aug 23, 2025

* Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency.

* Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency.

* Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability.

* Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality.

* typo fix

* Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues.

* lint fix

* test fix

* Enhance CI configuration by adding verbose output to pip install command for better visibility during installation.

* use ninja instead of make

* Add CMake configuration step for Ninja build system in setup.py

* Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja.

* Enhance CI configuration by adding verbose output to pytest commands for improved test visibility.

* Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks.

* Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly.

* Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes.

* Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification. Update buffer index calculations to ensure consistent simplification of range expressions.

* bugfix

* Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp.

* Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts.

* Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality.

6b125028

22 Aug, 2025 1 commit

[Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746) · 5c11d245

Lei Wang authored Aug 22, 2025

* [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy

* Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed.
* Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
* Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable.

* lint fix

* Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.

* remove bulk copy

* Refactor copy and atomic add operations to support TMA lower configuration

- Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
- Modified `Lower` method in `Copy` to incorporate the new TMA configuration.
- Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
- Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity.
- Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.

* Enhance TMA bulk copy logic in `LowerBulkCopy` method

- Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
- Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.

* lint fix

* Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.

* Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions

- Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
- Added TODO comments to indicate the need for further improvements in shared memory handling.

* Update `native_sparse_attention` function to include TMA configuration options

- Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations.
- Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.

* Refactor JIT decorator formatting in `native_sparse_attention` function

- Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase.
- No functional changes were made; this update focuses on code clarity and maintainability.

* Enhance thread management and logging in TileLang compilation

- Added a method to check if printing is enabled during compilation, improving control over logging behavior.
- Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output.
- Added comments to clarify the purpose of changes and improve code readability.

* Add warp specialization scope and refactor register management in TileLang

- Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management.
- Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process.
- Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management.
- Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.

* Refactor test for InjectSetMaxNReg pass in TileLang

- Improved readability by restructuring conditional checks and assertions in the test cases.
- Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic.
- Ensured consistent formatting and spacing throughout the test functions for better maintainability.

* Enhance bulk copy and store checks in `Copy` class

- Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options.
- Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations.
- Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.

* lint fix

5c11d245

17 Aug, 2025 1 commit

[Language] Introduce `StridedTensor` to support non-contiguous torch inputs (#722) · 1b308baf

Lei Wang authored Aug 18, 2025



* Update submodule 'tvm' to commit e11521e6936a827efa334588d29571fbb4620107

* Support strided tensors

* Refactor target attribute helper functions for improved clarity

* No code changes made in proxy.py and setup.py

* lint fix

* lint fix via gemini

* lint fix

* test fix

* test fix

* lint fix

* Update wrapper.py

* test fix

* Enhance test for InjectSoftwarePipeline by adding LowerOpaqueBlock transformation and updating expected function signature to use match_buffer for better clarity.

* lint fix

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

1b308baf
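
Note on strided tensors: a non-contiguous input is addressed through per-dimension strides rather than a packed row-major layout. The sketch below shows the standard offset arithmetic in element units. The helper names are hypothetical and do not reflect the `StridedTensor` interface.

```python
def contiguous_strides(shape):
    """Row-major strides (in elements) for a densely packed tensor."""
    strides = []
    running = 1
    for extent in reversed(shape):
        strides.append(running)
        running *= extent
    return tuple(reversed(strides))


def strided_offset(index, strides, storage_offset=0):
    """Element offset of `index` in the underlying storage."""
    return storage_offset + sum(i * s for i, s in zip(index, strides))


def is_contiguous(shape, strides):
    """True when the strides match the packed row-major layout."""
    return tuple(strides) == contiguous_strides(shape)
```

A transposed view of a 3x4 tensor has shape `(4, 3)` and strides `(1, 4)`; `strided_offset((2, 1), (1, 4))` is 6, and `is_contiguous((4, 3), (1, 4))` is `False`.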

15 Aug, 2025 1 commit
- [Chore] fix typos (#719) · d0742860
  Gabriel Wu authored Aug 15, 2025
```
* chore: fix typos

* chore: fix ruff

* chore: fix clang-format
```
  d0742860