Commits · d2afb5130f0030d9946d98fee18f97d8b017bb50 · OpenDAS / tilelang

03 Aug, 2025 3 commits

[Refactor] Introduce GemmInst for different targets handling (#688) · d2afb513

Lei Wang authored Aug 03, 2025

* [Enhancement] Refactor GEMM operations for improved warp partitioning and target instruction handling

- Introduced a new `GetGemmInst` method to determine the appropriate GEMM instruction based on block size and target architecture.
- Updated `ComputeWarpPartition` to accept the GEMM instruction type, enhancing flexibility in warp partitioning logic.
- Added `TargetGetWarpSize` utility to streamline warp size retrieval based on target architecture.
- Refactored layout inference and lowering methods to utilize the new GEMM instruction handling, improving clarity and maintainability of the codebase.
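
A minimal sketch of the kind of selection the first and third bullets describe, not the code from this commit. Only the names `GetGemmInst` and `TargetGetWarpSize` come from the commit message; the enum members, the string target names, and the thresholds (block M of at least 64, warp count divisible by 4) are assumptions chosen for illustration.

```python
from enum import Enum


class GemmInst(Enum):
    # Hypothetical instruction families; the real enum may differ.
    MMA = "mma"
    WGMMA = "wgmma"
    MFMA = "mfma"


def target_get_warp_size(target: str) -> int:
    # AMD CDNA wavefronts have 64 lanes; NVIDIA warps have 32.
    return 64 if target == "cdna" else 32


def get_gemm_inst(block_m: int, num_threads: int, target: str) -> GemmInst:
    """Pick a GEMM instruction family from block size and target (illustrative rules)."""
    num_warps = num_threads // target_get_warp_size(target)
    # Assumed rule: warpgroup instructions need a large enough M tile
    # and a whole number of 4-warp groups.
    if target == "hopper" and block_m >= 64 and num_warps % 4 == 0:
        return GemmInst.WGMMA
    if target == "cdna":
        return GemmInst.MFMA
    return GemmInst.MMA
```

Passing the chosen instruction into the warp-partitioning step, as the second bullet describes, lets that step apply instruction-specific constraints instead of re-deriving them from the target.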

* bug fix

* test fix

* lint fix

d2afb513

[Refactor] Rebase pipeline injector from upstream tvm (#687) · 73bf8346

Lei Wang authored Aug 03, 2025

* [Enhancement] Introduce software pipeline rewriter and refactor buffer access handling

- Added a new `PipelineOpaqueAccessRewriter` class to manage opaque buffer accesses in the software pipeline.
- Refactored the `PipelineBodyRewriter` to utilize the new rewriter for improved buffer access handling.
- Enhanced the `PipelineRewriter` to support additional fragment information and streamline pipeline construction.
- Updated tests to reflect changes in buffer management and access patterns, ensuring compatibility with the new structure.
- Removed obsolete code related to previous buffer access methods for clarity and maintainability.

* test fix

73bf8346

[Feature]:Add auto vectorize for atomic add (#686) · b45e9c45
yyttt6 authored Aug 03, 2025
```
* [Feature]:Add auto vectorize for atomic add

* fix

* fix2

* format
```
b45e9c45

01 Aug, 2025 1 commit

[Enhancement] Add `--ptxas-options=--register-usage-level=10` option (#684) · c5df7938

Lei Wang authored Aug 01, 2025



* Add `--ptxas-options=--register-usage-level=10` option

* lint fix

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

c5df7938

31 Jul, 2025 5 commits

[Fix] fix some issues with JIT decorators existing in the examples (#681) · 950ed16c

Cunxiao Ni authored Aug 01, 2025



* [Fix] fix some issues with JIT decorators existing in the examples

* format

* Use PassConfigKey instead of str

---------
Co-authored-by: Cunxiao <nicunxiao@bytedance.com>

950ed16c

[Enhancement] Refactored buffer detection logic in warp_specialized_rewriter.cc (#685) · 689ee52b

Yu Cheng authored Jul 31, 2025

- Renamed TMAFinder to ProducerBufferDetector and improved handling of CallNode and BufferLoadNode.
- This change aims to enhance code maintainability and performance by more accurately tracking producer buffer usage.

689ee52b

Add Flash Attn example on amd mi300 series (#682) · adcba275

alex_xiao authored Jul 31, 2025



* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

---------
Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

adcba275

[Enhancement] Enhance warp specialization logic (#680) · 05f2fc6d

Yu Cheng authored Jul 31, 2025



- Removed unnecessary configurations from the @tilelang.jit decorator in `example_grouped_gemm_fwd.py`, simplifying the kernel compilation process.
- Updated the `grouped_gemm` function to accept a tuple for batch sizes, enhancing compatibility with the kernel invocation.
- Added logic in `warp_specialized_rewriter.cc` to track buffer usage in `CallNode` expressions, improving the handling of TMA load operations.

This refactor aims to streamline the code and improve maintainability while ensuring better performance in grouped matrix multiplication operations.
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

05f2fc6d

[Enhancement] Output cache-file-related messages with verbose=True (#683) · 042c60fb

Yang Chen authored Jul 30, 2025

This is a minor enhancement to output verbose messages indicating where
cache files are saved and loaded. These messages are useful for
examining the relevant intermediate files.

042c60fb

30 Jul, 2025 5 commits

[CI] Update CI workflow to use Python 3.12 (#679) · eb026b79

Lei Wang authored Jul 30, 2025

* Update CI workflow to use Python 3.12 and enable build isolation for pip installations

- Changed the Python version in the CI configuration from 3.9 to 3.12 to ensure compatibility with the latest features and improvements.
- Updated the `PIP_NO_BUILD_ISOLATION` environment variable from `0` to `1` in the CI configuration, allowing pip to install testing requirements with build isolation enabled, which enhances the installation process during CI runs.

* Update CI workflow to trigger on pull requests instead of pull_request_target

- Changed the event trigger in the CI configuration from `pull_request_target` to `pull_request` to ensure the workflow runs on pull requests, enhancing the integration process.

* Refactor CI workflow to remove unnecessary repository and token settings

- Removed the repository and token parameters from the checkout step in the CI configuration, simplifying the workflow setup and improving security by not exposing sensitive information.

* Remove pip install command from CI workflow to streamline installation process

* Refactor reshape functions and tests for shared memory operations

- Renamed and updated `reshape_test_smem` to `reshape_test_smem_1d_2_2d` and `run_reshape_smem` to `run_reshape_smem_1d_2_2d` for clarity.
- Introduced a new reshape function `reshape_test_smem_2d_2_1d` and its corresponding runner `run_reshape_smem_2d_2_1d`.
- Updated tests to reflect the new function names and added a test for the 2D to 1D reshape functionality, enhancing test coverage and clarity.
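
For readers unfamiliar with the two directions being tested, a reshape between a 1D and a 2D buffer can be viewed as an index remapping. The helpers below are a plain-Python sketch with hypothetical names and an assumed row-major layout; they are not the TileLang test code.

```python
def index_1d_to_2d(i: int, cols: int) -> tuple[int, int]:
    """Map a flat index to (row, col) in a row-major 2D view with `cols` columns."""
    return divmod(i, cols)


def index_2d_to_1d(row: int, col: int, cols: int) -> int:
    """Map (row, col) in a row-major 2D view back to the flat index."""
    return row * cols + col
```

The two mappings are inverses of each other, which is the property a 1D-to-2D test and a 2D-to-1D test check from opposite sides.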

eb026b79

[Refactor] Phase out version with commit id in editable mode (#677) · ca1138c3

Lei Wang authored Jul 30, 2025



* merge from lab

* Add `TILELANG_PRINT_ON_COMPILATION`

* Update CI workflow to disable build isolation for pip installations in testing requirements

- Changed the `PIP_NO_BUILD_ISOLATION` environment variable from `1` to `0` in the CI configuration, ensuring that pip installs the testing requirements without build isolation. This adjustment aims to improve compatibility and streamline the installation process during CI runs.

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

ca1138c3

Do not check for short variables (#676) · 4878cc5d
Yichen Yan authored Jul 30, 2025
```
of which there are a lot
```
4878cc5d

Refactor to support upstream tvm (#595) · a7c9a8b9

Siyuan Feng authored Jul 30, 2025

**Summary of part of the rebase PR:**

1. **Support T.thread_return() → CUDA return syntax**  
   Added support for translating `T.thread_return()` to CUDA's native `return` statement.

2. **Dynamic type support for function inputs**  
   Functions now accept dynamically typed parameters using `typing`:
   ```python
   dyn_type = T.int32 or T.float
   @T.prim_func
   def main(
       a: dyn_type,
   )
   ```

3. **Device Function Codegen**  
   Added support for generating `__device__` functions in CUDA:
   ```python
   @I.ir_module
   class Module:
       @T.prim_func(private=True)
       def add(a: T.int32, b: T.int32) -> T.int32:
           return a + b

       @T.prim_func
       def main(
           A: T.Buffer((128, 128), "int32"),
           B: T.Buffer((128, 128), "int32"),
           C: T.Buffer((128, 128), "int32"),
       ):
           T.func_attr({"global_symbol": "main"})
           length: T.int32 = Module.add(64, 64)  # Host call
           for bx in...
   ```

a7c9a8b9

Update ci.yml (#675) · 8edd6941
Wenhao Xie authored Jul 30, 2025

8edd6941

29 Jul, 2025 6 commits

[Enhancement] passing verbose to LibraryGenerator (#673) · 9c9e67eb

Yang Chen authored Jul 29, 2025



* [Enhancement] passing verbose to LibraryGenerator

This PR enables passing a verbose parameter to LibraryGenerator
via CtypesKernelAdapter and CythonKernelAdapter.
When verbose is set to True, we will print out the NVCC
compilation command.

This slightly improves debuggability.

* fix ci

---------
Co-authored-by: xwhzz <wh.xie@outlook.com>

9c9e67eb

[Bugfix][CI] Use valid runner labels in workflow (#674) · 4eba852a
Wenhao Xie authored Jul 29, 2025

4eba852a

[CI] Improve format check output and automate commit of changes (#669) · 562796ef
Wenhao Xie authored Jul 29, 2025
```
* update format check ci

* upd

* upd
```
562796ef

[Bugfix] Passing correct nvcc to cmake (#670) · 8ea00774

Yang Chen authored Jul 28, 2025

cmake doesn't take the nvcc specified by CUDA_HOME by default.
Consequently, the following command failed for me because cmake still
used the nvcc from the default location (e.g. in my case
/usr/local/cuda/bin/nvcc):

```
$ PATH=/home/yangche/cuda-12.8/bin:$PATH CUDA_HOME=/home/yangche/cuda-12.8 pip install -e . -v
```

This minor fix forces cmake to use the nvcc specified by the CUDA_HOME env.
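
The idea can be sketched as follows; this is an illustration under assumptions, not the code from the commit. `cuda_cmake_args` is a hypothetical helper, while `CMAKE_CUDA_COMPILER` is the standard CMake variable for choosing the CUDA compiler.

```python
import os


def cuda_cmake_args(env=None):
    """Return extra CMake arguments that pin nvcc to the one under CUDA_HOME.

    Hypothetical helper: returns an empty list when CUDA_HOME is unset,
    leaving CMake to fall back to its default compiler search.
    """
    env = os.environ if env is None else env
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        return []
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    return [f"-DCMAKE_CUDA_COMPILER={nvcc}"]
```

A build script could append the returned list to its `cmake` invocation, so the compiler follows `CUDA_HOME` rather than whatever CMake finds first.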

8ea00774

Revert "[Enhancement] Add flash attn example for AMD MI300 series(#671)" (#672) · 56a8a644
Lei Wang authored Jul 29, 2025
```
This reverts commit e8cc372f.
```
56a8a644

[Enhancement] Add flash attn example for AMD MI300 series(#671) · e8cc372f

alex_xiao authored Jul 29, 2025



* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

---------
Co-authored-by: xinxyxiao <xinyxiao@amd.com>

e8cc372f

25 Jul, 2025 2 commits

[Bugfix] Remove redundant T.fill to fix precision issue (#667) · 98f93db1

徐畅 authored Jul 26, 2025

98f93db1

[Bugfix] Consider buffer data type into indices provably disjoint analysis (#664) · 722c2a8c

Lei Wang authored Jul 25, 2025

722c2a8c

24 Jul, 2025 3 commits

[Enhancement] Improve buffer conflict detection in thread storage synchronization (#658) · a16f0cf5

Lei Wang authored Jul 24, 2025

* [Enhancement] Improve buffer conflict detection in thread storage synchronization

- Added a new boolean variable `range_is_overlap` to accurately determine if buffer indices overlap, enhancing the conflict detection logic in `thread_storage_sync.cc`.
- Updated the return logic to reflect the overlap status, ensuring correct conflict resolution based on buffer index comparisons.
- Removed an unnecessary comment in `OptimizeForTarget` to streamline the code and improve clarity.
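
As a rough illustration of the overlap test behind a flag like `range_is_overlap` (a sketch, not the logic in `thread_storage_sync.cc`): two accesses are modeled here as half-open index ranges `[start, start + extent)`. The function name and that representation are assumptions.

```python
def ranges_overlap(start_a: int, extent_a: int, start_b: int, extent_b: int) -> bool:
    """True if half-open ranges [start, start + extent) share at least one index."""
    if extent_a <= 0 or extent_b <= 0:
        return False  # an empty range cannot conflict with anything
    # Each range must begin before the other one ends.
    return start_a < start_b + extent_b and start_b < start_a + extent_a
```

Under this model, adjacent ranges such as `[0, 4)` and `[4, 8)` do not overlap, while `[0, 4)` and `[2, 6)` do.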

* example fix

* enhancement

* improve ci

a16f0cf5

[Bugfix][Docs] Update documentation build process and configurations for autoapi support (#663) · c8edb957
Wenhao Xie authored Jul 24, 2025
```
* [Bugfix][Docs] Update documentation build process and configurations for autoapi support

* lint fix
```
c8edb957

[BugFix] Do not modify strict layout in common or relax level of layout... · fe6cdc9d

Zhengju Tang authored Jul 24, 2025


[BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking (#653)

* [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking

* Lint

* test fix

* Update CI workflow to install dependencies without user site packages

- Modified the installation commands in the CI workflow to include the `--no-user` flag for both `requirements-dev.txt` and `requirements-test.txt`, ensuring that packages are installed in the virtual environment rather than the user site directory.

* Update CI workflow to install pip without user site packages

- Added the `--no-user` flag to the pip installation command in the CI workflow for both development and testing dependencies, ensuring that packages are installed within the virtual environment.

* Update requirements-test.txt

* reduce ci problem size

* Refactor example_mla_decode.py for consistent formatting and remove unused imports in test_example_mla_decode.py

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

fe6cdc9d

23 Jul, 2025 5 commits

[Examples] Add the support of rocm arch detecting (#661) · 8361eb5c
Zhang Jason authored Jul 24, 2025
```
Co-authored-by: zhangnju <ningzhan@SMC-SC-DI08-33.dh144.dcgpu>
```
8361eb5c

[Enhancement] Add compile_flags parameter to JIT kernel and adapter classes... · d764dca8

Wenhao Xie authored Jul 24, 2025


[Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control (#656)

* [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control

* lint fix

* upd

* lint fix

* fix typo

* update typing

* update the use case of compile flags

* ci fix

* fix

* Fix CI workflow to correctly activate virtual environment from shared cache directory

* use local cache

* fix

* fix

* fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

d764dca8

[Cache] Support shared cache directories for multiple process (#649) · 267d9b3b

Lei Wang authored Jul 23, 2025



* Support shared cache directories for multiple users

* ruff fix

* ci_fix

* Add CI step to show worker info

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

267d9b3b

[CI] Enable cache for virtual env and parallelize pytest via xdist (#660) · c12eb181
Lei Wang authored Jul 23, 2025

c12eb181

[Bugfix][CI] Bug fixing and migrate CI from ada to hopper (#652) · e9a608e2

Wenhao Xie authored Jul 23, 2025

* fix CI bugs in hopper

* lint fix

* Update bulk_copy.cc

* Refactor bulk copy logic in LowerBulkCopy function

- Removed unnecessary blank lines for improved code readability.
- Enhanced stride validation by checking for null pointers in global stride calculations, ensuring robustness against symbolic strides.
- Updated pass configuration handling in dynamic tile language tests to streamline dynamic alignment and TMA lower pass settings.

* test fix

* ci fix

* Update flash-attention dependencies and clean up example code

- Downgraded `flash-attn` dependency version in `requirements-test.txt` to `<=2.2.0`.
- Removed unused imports and commented-out code in various example files to enhance readability and maintainability.
- Updated the `flashattn` function signature to include default parameters for `block_M`, `block_N`, `num_stages`, and `threads`.
- Cleaned up the `example_mha_fwd_varlen.py` and `example_mha_bwd_wgmma_pipelined.py` files by removing unnecessary comments and improving code clarity.
- Deleted the `example_mha_inference.py` file as it is no longer needed.

* Update CI workflow to remove `--user` flag from pip install commands

- Removed the `--user` flag from the pip install commands in both the development and testing sections of the CI workflow to ensure proper installation of dependencies in the virtual environment.

* Update CI workflow to include `--no-user` flag in pip install commands

- Added the `--no-user` flag to the pip install commands in both the development and testing sections of the CI workflow to ensure dependencies are installed correctly within the virtual environment.

* Update CI workflow to include `--no-user` flag in pip install command for wheel mode

- Added the `--no-user` flag to the pip install command in the wheel mode section of the CI workflow to ensure dependencies are installed correctly within the virtual environment.

* test fix

* avoid conflict with system environments

* test fix

* add comments

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

e9a608e2

22 Jul, 2025 1 commit

[Enhancement] Add role assignment for AllocateNode in warp specialization (#657) · 5bd3f942

Yu Cheng authored Jul 23, 2025

- Implemented a new role assignment for `AllocateNode` in `warp_specialized_rewriter.cc`, setting the role to `kConsumer` to ensure proper handling of memory allocation scenarios.
- This avoids a bug when using `T.reduce(clear=False)`.

5bd3f942

21 Jul, 2025 2 commits

[Refactor] Remove small array reuse condition in shared memory allocation merging (#654) · 8205791d

Lei Wang authored Jul 21, 2025

- Eliminated the condition that disabled the reuse of small arrays (const_nbits <= 32) in the `MergeSharedMemoryAllocations` function, allowing for more flexible memory management.
- Added a comment in `OptimizeForTarget` to clarify the order of applying `MergeSharedMemoryAllocations` after `SplitHostDevice`, ensuring correct allocation site handling in device functions.

8205791d

[Bugfix] Assign Target for jit kernel (#648) · 6e994b12

meinie authored Jul 21, 2025



* fix: Copy Target to self.target

* refactor: Remove unused target attribute and adjust context management in JITKernel

- Removed the unused `target` attribute from the `JITKernel` class.
- Updated the context management in the `compile` method to utilize `self.target`, improving clarity and ensuring proper resource handling during compilation.

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

6e994b12

20 Jul, 2025 2 commits

[Bugfix] Adjust role assignment in warp specialization based on read access (#647) · fec9b930

Yu Cheng authored Jul 20, 2025



* [Bugfix] Adjust role assignment in warp specialization based on read access

- Updated the role assignment logic in `warp_specialized_rewriter.cc` to set the role to `kConsumer` when no reads are detected, ensuring correct behavior in memory access scenarios.

* Apply suggestion from @gemini-code-assist[bot]
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

fec9b930

[Bugfix] Added missing thread offsets and other information to reduce. (#646) · 3a408158
Lei Wang authored Jul 20, 2025

3a408158

17 Jul, 2025 2 commits

[Enhancement] Align dynamic shared memory allocations in phase.py (#644) · b060c9f7

Lei Wang authored Jul 17, 2025

- Added a comment to clarify the alignment of dynamic shared memory allocations in the `OptimizeForTarget` function.
- Refactored the handling of shared memory allocation merging and synchronization to streamline the process, ensuring consistent behavior regardless of the aggressive merge flag.
- Improved code clarity by removing redundant conditional checks related to synchronization and memory allocation.

b060c9f7

[Enhancement] Add Cython cache directory to setup.py (#643) · 6c0a5841

Lei Wang authored Jul 17, 2025

- Included the Cython cache directory in the list of source files for the TileLang build process, ensuring proper handling of cached Cython files during the build.

6c0a5841

16 Jul, 2025 3 commits

[Example] Add paged block-sparse flash-decoding kernel (#638) · 2aded11a

YizhaoGao authored Jul 17, 2025



* Add paged block-sparse flash-decoding kernel

* Update example_tilelang_sparse_gqa_decode_paged.py

* lint fix

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

2aded11a

[Enhancement] Extend pythonic_expr to support dtype mapping in utils.py (#641) · 60974197

Lei Wang authored Jul 16, 2025

- Updated the `pythonic_expr` function to accept an optional `dtype_map` parameter, allowing for more flexible type conversions.
- Refactored calls to `pythonic_expr` in `TLCUDASourceWrapper` to utilize the new mapping feature, improving type handling in kernel generation.
- Enhanced code clarity by consolidating repeated calls to `pythonic_expr` into a private method within the wrapper class.

60974197

[Bugfix] Put thread_extent into reduce (#640) · 156ff85e

Lei Wang authored Jul 16, 2025

* [Enhancement] Update AllReduce operation to include thread offset in kernel generation

- Modified the `ReduceOp::Lower` method to incorporate the thread offset in the AllReduce kernel generation for the sm_90 architecture.
- This change improves the accuracy of thread management during reduction operations, enhancing performance on specific GPU architectures.

* [Enhancement] Refactor thread offset handling in AllReduce kernel generation

- Updated the `ReduceOp::Lower` method to streamline the handling of thread offset for AllReduce operations, ensuring consistent usage across different architectures.
- This change enhances code clarity and maintains performance improvements for the sm_90 architecture by reducing redundancy in thread offset calculations.

156ff85e