Commits · adcba2757a22ea8382f1986cd61700cc822fd997 · OpenDAS / tilelang

31 Jul, 2025 3 commits

Add Flash Attn example on amd mi300 series (#682) · adcba275

alex_xiao authored Jul 31, 2025



* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

* Add input configuration file and update AMD example script for enhanced flexibility

- Introduced a new input.txt file for configurable parameters.
- Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
- Streamlined the main function for better clarity and organization.
- Added a new test script to facilitate running the example with specified parameters.

* Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations

- Deleted input.txt and test.sh files as they are no longer needed.
- Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
- Reintroduced swizzle usage in the kernel for better performance.

* Refactor AMD example script for FlashAttention-2

- Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
- Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
- Removed outdated comments and improved code organization for better readability.

* Refactor formatting in AMD FlashAttention example script

- Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
- Streamlined the `main` function parameter formatting for consistency.
- Removed unnecessary blank lines to enhance overall code organization.

* Update example_amd_flash_attn_fwd.py

---------
Co-authored-by: xinxyxiao <xinyxiao@amd.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

adcba275

[Enhancement] Enhance warp specialization logic (#680) · 05f2fc6d

Yu Cheng authored Jul 31, 2025



- Removed unnecessary configurations from the @tilelang.jit decorator in `example_grouped_gemm_fwd.py`, simplifying the kernel compilation process.
- Updated the `grouped_gemm` function to accept a tuple for batch sizes, enhancing compatibility with the kernel invocation.
- Added logic in `warp_specialized_rewriter.cc` to track buffer usage in `CallNode` expressions, improving the handling of TMA load operations.

This refactor aims to streamline the code and improve maintainability while ensuring better performance in grouped matrix multiplication operations.
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

05f2fc6d

[Enhancement] Output cache-file-related messages with verbose=True (#683) · 042c60fb

Yang Chen authored Jul 30, 2025

This is a minor enhancement to output verbose messages indicating where
cache files are saved and loaded. These messages are useful for
examining the relevant intermediate files.

042c60fb

30 Jul, 2025 5 commits

[CI] Update CI workflow to use Python 3.12 (#679) · eb026b79

Lei Wang authored Jul 30, 2025

* Update CI workflow to use Python 3.12 and enable build isolation for pip installations

- Changed the Python version in the CI configuration from 3.9 to 3.12 to ensure compatibility with the latest features and improvements.
- Updated the `PIP_NO_BUILD_ISOLATION` environment variable from `0` to `1` in the CI configuration, allowing pip to install testing requirements with build isolation enabled, which enhances the installation process during CI runs.

* Update CI workflow to trigger on pull requests instead of pull_request_target

- Changed the event trigger in the CI configuration from `pull_request_target` to `pull_request` to ensure the workflow runs on pull requests, enhancing the integration process.

* Refactor CI workflow to remove unnecessary repository and token settings

- Removed the repository and token parameters from the checkout step in the CI configuration, simplifying the workflow setup and improving security by not exposing sensitive information.

* Remove pip install command from CI workflow to streamline installation process

* Refactor reshape functions and tests for shared memory operations

- Renamed and updated `reshape_test_smem` to `reshape_test_smem_1d_2_2d` and `run_reshape_smem` to `run_reshape_smem_1d_2_2d` for clarity.
- Introduced a new reshape function `reshape_test_smem_2d_2_1d` and its corresponding runner `run_reshape_smem_2d_2_1d`.
- Updated tests to reflect the new function names and added a test for the 2D to 1D reshape functionality, enhancing test coverage and clarity.

eb026b79

[Refactor] Phaseout version with commit id in editable model (#677) · ca1138c3

Lei Wang authored Jul 30, 2025



* merge from lab

* Add `TILELANG_PRINT_ON_COMPILATION`

* Update CI workflow to disable build isolation for pip installations in testing requirements

- Changed the `PIP_NO_BUILD_ISOLATION` environment variable from `1` to `0` in the CI configuration, ensuring that pip installs the testing requirements without build isolation. This adjustment aims to improve compatibility and streamline the installation process during CI runs.

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

ca1138c3

Do not check for short variables (#676) · 4878cc5d
Yichen Yan authored Jul 30, 2025
```
which there's a lot
```
4878cc5d

Refactor to support upstream tvm (#595) · a7c9a8b9

Siyuan Feng authored Jul 30, 2025

**Summary of part of the rebase PR:**

1. **Support T.thread_return() → CUDA return syntax**  
   Added support for translating `T.thread_return()` to CUDA's native `return` statement.

2. **Dynamic type support for function inputs**  
   Functions now accept dynamically typed parameters using `typing`:
   ```python
   dyn_type = T.int32  # or T.float

   @T.prim_func
   def main(
       a: dyn_type,
   ):
       ...
   ```

3. **Device Function Codegen**  
   Added support for generating `__device__` functions in CUDA:
   ```python
   @I.ir_module
   class Module:
       @T.prim_func(private=True)
       def add(a: T.int32, b: T.int32) -> T.int32:
           return a + b

       @T.prim_func
       def main(
           A: T.Buffer((128, 128), "int32"),
           B: T.Buffer((128, 128), "int32"),
           C: T.Buffer((128, 128), "int32"),
       ):
           T.func_attr({"global_symbol": "main"})
           length: T.int32 = Module.add(64, 64)  # Host call
           for bx in...
   ```

a7c9a8b9

Update ci.yml (#675) · 8edd6941
Wenhao Xie authored Jul 30, 2025

8edd6941

29 Jul, 2025 6 commits

[Enhancement] passing verbose to LibraryGenerator (#673) · 9c9e67eb

Yang Chen authored Jul 29, 2025



* [Enhancement] passing verbose to LibraryGenerator

This PR enables passing a verbose parameter to LibraryGenerator
via CtypesKernelAdapter and CythonKernelAdapter.
When verbose is set to True, we print out the NVCC
compilation command.

This slightly improves debuggability.

* fix ci

---------
Co-authored-by: xwhzz <wh.xie@outlook.com>

9c9e67eb

[Bugfix][CI] Use valid runner labels in workflow (#674) · 4eba852a
Wenhao Xie authored Jul 29, 2025

4eba852a

[CI] Improve format check output and automate commit of changes (#669) · 562796ef
Wenhao Xie authored Jul 29, 2025
```
* update format check ci

* upd

* upd
```
562796ef

[Bugfix] Passing correct nvcc to cmake (#670) · 8ea00774

Yang Chen authored Jul 28, 2025

By default, cmake does not pick up the nvcc specified by CUDA_HOME.
Consequently, the following command failed for me because cmake still
used the nvcc from the default location (in my case,
/usr/local/cuda/bin/nvcc):

```
$ PATH=/home/yangche/cuda-12.8/bin:$PATH CUDA_HOME=/home/yangche/cuda-12.8 pip install -e . -v
```

This minor fix forces cmake to use the nvcc specified by the CUDA_HOME environment variable.
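
One way a build script could pin the compiler is sketched below. This is not the code from the PR: the helper name `cuda_cmake_args` is hypothetical, and the sketch only assumes the `CUDA_HOME` variable from the message above and CMake's standard `CMAKE_CUDA_COMPILER` cache variable.

```python
import os


def cuda_cmake_args(env=None):
    """Return extra CMake arguments that pin nvcc to the one under CUDA_HOME.

    Hypothetical helper for illustration only. It returns an empty list
    when CUDA_HOME is unset, so CMake falls back to its default lookup.
    """
    env = os.environ if env is None else env
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        return []
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    return [f"-DCMAKE_CUDA_COMPILER={nvcc}"]


# Example: the resulting list could be appended to the cmake command line.
print(cuda_cmake_args({"CUDA_HOME": "/opt/cuda-12.8"}))
```

Passing the path explicitly avoids depending on whichever nvcc CMake happens to find first.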

8ea00774

Revert "[Enhancement] Add flash attn example for AMD MI300 series(#671)" (#672) · 56a8a644
Lei Wang authored Jul 29, 2025
```
This reverts commit e8cc372f.
```
56a8a644

[Enhancement] Add flash attn example for AMD MI300 series(#671) · e8cc372f

alex_xiao authored Jul 29, 2025



* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)

- Enhanced buffer index handling to address precision issues by removing redundant operations.
- Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
- Updated related documentation to reflect changes in buffer management practices.

* Remove obsolete test script for AMD example, streamlining the examples directory.

* Remove unused dtype_size variable in AMD example script to streamline code.

---------
Co-authored-by: xinxyxiao <xinyxiao@amd.com>

e8cc372f

25 Jul, 2025 2 commits

[Bugfix] Remove redundant T.fill to fix precision issue (#667) · 98f93db1
徐畅 authored Jul 26, 2025

98f93db1

[Bugfix] Consider buffer data type into indices provably disjoint analysis (#664) · 722c2a8c
Lei Wang authored Jul 25, 2025

722c2a8c

24 Jul, 2025 3 commits

[Enhancement] Improve buffer conflict detection in thread storage synchronization (#658) · a16f0cf5

Lei Wang authored Jul 24, 2025

* [Enhancement] Improve buffer conflict detection in thread storage synchronization

- Added a new boolean variable `range_is_overlap` to accurately determine if buffer indices overlap, enhancing the conflict detection logic in `thread_storage_sync.cc`.
- Updated the return logic to reflect the overlap status, ensuring correct conflict resolution based on buffer index comparisons.
- Removed an unnecessary comment in `OptimizeForTarget` to streamline the code and improve clarity.

* example fix

* enhancement

* improve ci
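
The idea behind the `range_is_overlap` flag in the first item above can be illustrated with plain intervals. The sketch below is an assumption about the general technique, not the logic in `thread_storage_sync.cc`: it treats each access as a half-open index range and reports a conflict only when the two ranges share at least one index. The function name `ranges_overlap` is hypothetical.

```python
def ranges_overlap(a_min, a_extent, b_min, b_extent):
    """Return True when [a_min, a_min + a_extent) and
    [b_min, b_min + b_extent) share at least one index."""
    return a_min < b_min + b_extent and b_min < a_min + a_extent


# Adjacent ranges do not conflict; partially overlapping ranges do.
print(ranges_overlap(0, 4, 4, 4))
print(ranges_overlap(0, 4, 2, 4))
```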

a16f0cf5

[Bugfix][Docs] Update documentation build process and configurations for autoapi support (#663) · c8edb957
Wenhao Xie authored Jul 24, 2025
```
* [Bugfix][Docs] Update documentation build process and configurations for autoapi support

* lint fix
```
c8edb957

[BugFix] Do not modify strict layout in common or relax level of layout... · fe6cdc9d

Zhengju Tang authored Jul 24, 2025


[BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking (#653)

* [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking

* Lint

* test fix

* Update CI workflow to install dependencies without user site packages

- Modified the installation commands in the CI workflow to include the `--no-user` flag for both `requirements-dev.txt` and `requirements-test.txt`, ensuring that packages are installed in the virtual environment rather than the user site directory.

* Update CI workflow to install pip without user site packages

- Added the `--no-user` flag to the pip installation command in the CI workflow for both development and testing dependencies, ensuring that packages are installed within the virtual environment.

* Update requirements-test.txt

* reduce ci problem size

* Refactor example_mla_decode.py for consistent formatting and remove unused imports in test_example_mla_decode.py

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

fe6cdc9d

23 Jul, 2025 5 commits

[Examples] Add the support of rocm arch detecting (#661) · 8361eb5c
Zhang Jason authored Jul 24, 2025
```
Co-authored-by: zhangnju <ningzhan@SMC-SC-DI08-33.dh144.dcgpu>
```
8361eb5c

[Enhancement] Add compile_flags parameter to JIT kernel and adapter classes... · d764dca8

Wenhao Xie authored Jul 24, 2025


[Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control (#656)

* [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control

* lint fix

* upd

* lint fix

* fix typo

* update typing

* update the use case of compile flags

* ci fix

* fix

* Fix CI workflow to correctly activate virtual environment from shared cache directory

* use local cache

* fix

* fix

* fix

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

d764dca8

[Cache] Support shared cache directories for multiple process (#649) · 267d9b3b

Lei Wang authored Jul 23, 2025



* Support shared cache directories for multiple users

* ruff fix

* ci_fix

* Add CI step to show worker info

---------
Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>

267d9b3b

[CI] Enable cache for virtual env and parallelize pytest via xdist (#660) · c12eb181
Lei Wang authored Jul 23, 2025

c12eb181

[Bugfix][CI] Bug fixing and migrate CI from ada to hopper (#652) · e9a608e2

Wenhao Xie authored Jul 23, 2025

* fix CI bugs in hopper

* lint fix

* Update bulk_copy.cc

* Refactor bulk copy logic in LowerBulkCopy function

- Removed unnecessary blank lines for improved code readability.
- Enhanced stride validation by checking for null pointers in global stride calculations, ensuring robustness against symbolic strides.
- Updated pass configuration handling in dynamic tile language tests to streamline dynamic alignment and TMA lower pass settings.

* test fix

* ci fix

* Update flash-attention dependencies and clean up example code

- Downgraded `flash-attn` dependency version in `requirements-test.txt` to `<=2.2.0`.
- Removed unused imports and commented-out code in various example files to enhance readability and maintainability.
- Updated the `flashattn` function signature to include default parameters for `block_M`, `block_N`, `num_stages`, and `threads`.
- Cleaned up the `example_mha_fwd_varlen.py` and `example_mha_bwd_wgmma_pipelined.py` files by removing unnecessary comments and improving code clarity.
- Deleted the `example_mha_inference.py` file as it is no longer needed.

* Update CI workflow to remove `--user` flag from pip install commands

- Removed the `--user` flag from the pip install commands in both the development and testing sections of the CI workflow to ensure proper installation of dependencies in the virtual environment.

* Update CI workflow to include `--no-user` flag in pip install commands

- Added the `--no-user` flag to the pip install commands in both the development and testing sections of the CI workflow to ensure dependencies are installed correctly within the virtual environment.

* Update CI workflow to include `--no-user` flag in pip install command for wheel mode

- Added the `--no-user` flag to the pip install command in the wheel mode section of the CI workflow to ensure dependencies are installed correctly within the virtual environment.

* test fix

* avoid conflict with system environments

* test fix

* add comments

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

e9a608e2

22 Jul, 2025 1 commit

[Enhancement] Add role assignment for AllocateNode in warp specialization (#657) · 5bd3f942

Yu Cheng authored Jul 23, 2025

- Implemented a new role assignment for `AllocateNode` in `warp_specialized_rewriter.cc`, setting the role to `kConsumer` to ensure proper handling of memory allocation scenarios.
- This avoids a bug when using `T.reduce(clear=False)`.

5bd3f942

21 Jul, 2025 2 commits

[Refactor] Remove small array reuse condition in shared memory allocation merging (#654) · 8205791d

Lei Wang authored Jul 21, 2025

- Eliminated the condition that disabled the reuse of small arrays (const_nbits <= 32) in the `MergeSharedMemoryAllocations` function, allowing for more flexible memory management.
- Added a comment in `OptimizeForTarget` to clarify the order of applying `MergeSharedMemoryAllocations` after `SplitHostDevice`, ensuring correct allocation site handling in device functions.

8205791d

[Bugfix] Assign Target for jit kernel (#648) · 6e994b12

meinie authored Jul 21, 2025



* fix: Copy Target to self.target

* refactor: Remove unused target attribute and adjust context management in JITKernel

- Removed the unused `target` attribute from the `JITKernel` class.
- Updated the context management in the `compile` method to utilize `self.target`, improving clarity and ensuring proper resource handling during compilation.

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

6e994b12

20 Jul, 2025 2 commits

[Bugfix] Adjust role assignment in warp specialization based on read access (#647) · fec9b930

Yu Cheng authored Jul 20, 2025



* [Bugfix] Adjust role assignment in warp specialization based on read access

- Updated the role assignment logic in `warp_specialized_rewriter.cc` to set the role to `kConsumer` when no reads are detected, ensuring correct behavior in memory access scenarios.

* Apply suggestion from @gemini-code-assist[bot]
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

fec9b930

[Bugfix] Added missing thread offsets and other information to reduce. (#646) · 3a408158
Lei Wang authored Jul 20, 2025

3a408158

17 Jul, 2025 2 commits

[Enhancement] Align dynamic shared memory allocations in phase.py (#644) · b060c9f7

Lei Wang authored Jul 17, 2025

- Added a comment to clarify the alignment of dynamic shared memory allocations in the `OptimizeForTarget` function.
- Refactored the handling of shared memory allocation merging and synchronization to streamline the process, ensuring consistent behavior regardless of the aggressive merge flag.
- Improved code clarity by removing redundant conditional checks related to synchronization and memory allocation.
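
Aligning allocations inside a single dynamic shared memory arena amounts to rounding each starting offset up to a multiple of the alignment. A minimal sketch of that arithmetic follows, assuming a 16-byte alignment and a hypothetical `plan_offsets` helper; neither is taken from `phase.py`.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment."""
    return (offset + alignment - 1) // alignment * alignment


def plan_offsets(sizes, alignment=16):
    """Pack buffers of the given byte sizes back to back, aligning each start.

    Returns the list of starting offsets and the total bytes used.
    """
    offsets = []
    cursor = 0
    for size in sizes:
        cursor = align_up(cursor, alignment)
        offsets.append(cursor)
        cursor += size
    return offsets, cursor


offsets, total = plan_offsets([100, 40, 8])
print(offsets, total)
```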

b060c9f7

[Enhancement] Add Cython cache directory to setup.py (#643) · 6c0a5841

Lei Wang authored Jul 17, 2025

- Included the Cython cache directory in the list of source files for the TileLang build process, ensuring proper handling of cached Cython files during the build.

6c0a5841

16 Jul, 2025 5 commits

[Example] Add paged block-sparse flash-decoding kernel (#638) · 2aded11a

YizhaoGao authored Jul 17, 2025



* Add paged block-sparse flash-decoding kernel

* Update example_tilelang_sparse_gqa_decode_paged.py

* lint fix

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

2aded11a

[Enhancement] Extend pythonic_expr to support dtype mapping in utils.py (#641) · 60974197

Lei Wang authored Jul 16, 2025

- Updated the `pythonic_expr` function to accept an optional `dtype_map` parameter, allowing for more flexible type conversions.
- Refactored calls to `pythonic_expr` in `TLCUDASourceWrapper` to utilize the new mapping feature, improving type handling in kernel generation.
- Enhanced code clarity by consolidating repeated calls to `pythonic_expr` into a private method within the wrapper class.

60974197

[Bugfix] Put thread_extent into reduce (#640) · 156ff85e

Lei Wang authored Jul 16, 2025

* [Enhancement] Update AllReduce operation to include thread offset in kernel generation

- Modified the `ReduceOp::Lower` method to incorporate the thread offset in the AllReduce kernel generation for the sm_90 architecture.
- This change improves the accuracy of thread management during reduction operations, enhancing performance on specific GPU architectures.

* [Enhancement] Refactor thread offset handling in AllReduce kernel generation

- Updated the `ReduceOp::Lower` method to streamline the handling of thread offset for AllReduce operations, ensuring consistent usage across different architectures.
- This change enhances code clarity and maintains performance improvements for the sm_90 architecture by reducing redundancy in thread offset calculations.

156ff85e

[Refactor] Phaseout redundant CUDA_DEVICE_ORDER export (#639) · b5ac9bba
Lei Wang authored Jul 16, 2025

b5ac9bba

[Warp Specialize] Implicit Warp Specialize Programing Model (#605) · e2d25ba8

Lei Wang authored Jul 16, 2025

* [Enhancement] Improve memory access condition checks in GlobalMemChecker

- Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations.
- This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis.

* lintfix

* [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy

- Implemented checks to ensure that the shared memory range and global range are legal during the bulk copy operation.
- Added assertions to validate that the extents of global and shared ranges match, improving the robustness of memory access validation in the LowerBulkCopy function.

* [Refactor] Update barrier and clear operations in warp specialization examples

- Replaced `mbarrier_wait_parity` and `mbarrier_arrive` with `barrier_wait` and `barrier_arrive` for improved clarity and consistency in synchronization.
- Adjusted the order of `clear` operations for local fragments in `example_warp_specialize_gemm_copy_1_gemm_0` to enhance parallel execution efficiency.

* [Enhancement] Implement thread partial synchronization and improve shared memory allocation handling

- Added support for thread partial barrier synchronization in CUDA, allowing for more flexible thread management.
- Enhanced the `MergeSharedMemoryAllocations` function to accept alignment bytes, improving memory allocation efficiency based on target requirements.
- Updated the `Lower` methods in `Copy` and `Fill` classes to include conditional predicates for thread execution, ensuring better control over thread behavior.
- Refactored the `print` function to include warp group and warp IDs for more detailed debugging output.
- Improved the handling of dynamic shared memory allocations in the `LowerAndLegalize` function to align with target-specific requirements.

* [Enhancement] Add support for disabling TMA in Copy operations

- Introduced a new `disable_tma` parameter in the `Copy` class to control whether the Tensor Memory Accelerator (TMA) bulk-copy path is used.
- Updated the `Lower` method to conditionally execute bulk copy operations based on the `disable_tma` flag.
- Enhanced the `copy` function to accept the `disable_tma` argument, allowing for more flexible memory copy operations.
- Improved handling of `coalesced_width` to ensure it defaults to -1 when not provided, enhancing robustness in memory operations.

* [Refactor] Clean up whitespace and formatting in multiple files

- Removed unnecessary blank lines and adjusted line breaks for improved code readability in `example_mla_decode.py`, `example_warp_specialize_gemm_copy_gemm_0_1.py`, `phase.py`, and `copy.py`.
- Ensured consistent formatting across functions to enhance maintainability and clarity of the codebase.

* [Enhancement] Refactor flash attention implementation for improved performance and configurability

- Split the shared memory allocations for query and key-value pairs to optimize memory usage.
- Introduced command-line arguments for batch size, number of heads, and dimensions, enhancing flexibility in running the example.
- Updated kernel execution parameters to improve thread management and synchronization.
- Enhanced the overall structure of the flash attention function for better readability and maintainability.

* fix

* Update layout inference in ParallelOp to account for thread bounds; remove debug print in OptimizeForTarget

* Refactor barrier handling and update example configurations

- Replaced commented-out barrier creation with new barrier allocation in GEMM example.
- Updated kernel configuration in warp specialization example to include async copy settings.
- Enhanced barrier management in the phase optimization process to improve synchronization handling.
- Introduced new barrier allocation function for better memory management in shared contexts.

* Refactor barrier handling in LowerAndLegalize and OptimizeForTarget

- Reintroduced barrier lowering in OptimizeForTarget to enhance synchronization.
- Removed commented-out barrier lowering in LowerAndLegalize for cleaner code.
- Added exit() call in OptimizeForTarget to halt execution after barrier lowering.

* Enhance CMake configuration and clean up example scripts

- Enabled compile command export in CMakeLists.txt for better build integration.
- Removed unnecessary print statement in the warp specialization example.
- Cleaned up commented-out code in GEMM example for improved readability.
- Updated barrier handling in shared memory allocation transformations for better synchronization.

* Refactor barrier handling in warp specialization examples

- Replaced commented-out mbarrier code with new barrier allocation using T.alloc_barrier for improved synchronization.
- Updated barrier wait and arrive calls to align with the new allocation method across multiple example scripts.
- Enhanced code readability by removing unnecessary comments and ensuring consistent barrier management.

* Update lower_shared_barrier.cc

* Update phase.py

* Update warp specialization example and Cython wrapper

- Removed commented-out pass configuration options in the warp specialization example for clarity.
- Added functionality to write the generated kernel source to a file named "kernel.cu".
- Enhanced Cython wrapper to support boolean type conversion for improved type handling.

* Add storage synchronization call in shared barrier transformation

- Introduced a new evaluation statement to call the TVM storage sync function with "shared" as an argument, enhancing synchronization in the shared barrier handling process.

* remove debug files

* Remove kernel source output to file in warp specialization example

* remove comments

* Refactor tensor handling and update test execution in TileLang

- Changed `Buffer` to `Tensor` in `customize.py` for better type consistency.
- Updated `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to use `tir.BufferLoad` instead of `BufferLoad`.
- Commented out the main testing function in `test_tilelang_language_reshape.py` and replaced it with a direct call to `run_reshape_smem` for streamlined testing.
- Removed unnecessary NVCC compiler flags in `libgen.py` to reduce verbosity.

* Update test_tilelang_language_reshape.py

e2d25ba8

15 Jul, 2025 4 commits

support torch.bool as kernel input (#636) · 68989d80
Lei Wang authored Jul 15, 2025

68989d80

[Dev] Update benchmark and decoding scripts to refine condition checks and... · e937faa6

Yu Cheng authored Jul 15, 2025

[Dev] Update benchmark and decoding scripts to refine condition checks and optimize tensor operations (#637)

- Enhanced the condition in `compare_ab` to ensure baseline checks align with target exclusions.
- Removed unnecessary tensor allocation in `mla_decode_tilelang`, optimizing memory usage and improving performance by directly using shared tensors in GEMM operations.

e937faa6

[Pass][Simplify] Introduce symbolic level simplify for condition expression (#634) · 02a0cf59

Lei Wang authored Jul 15, 2025

* [Enhancement] Add argument simplification option to StmtSimplifier

- Introduced a new `simplify_arguments` flag in the `StmtSimplifier::Apply` method to control argument simplification behavior.
- Updated the `Simplify` function to accept the new flag, allowing for enhanced flexibility in the simplification process.
- Adjusted the `LowerAndLegalize` and `_Simplify` functions to utilize the new argument, ensuring consistent behavior across the codebase.
- Added comments to clarify the purpose of the new flag and its impact on simplification logic.

* lint fix

* [Enhancement] Improve layout inference and reduce operation handling

- Updated `ParallelOp::InferLayout` to check for pure buffer stores, enhancing layout inference logic.
- Modified `ReduceOp::Lower` to include all threads in the AllReduce operation, improving performance on specific architectures.
- Added a TODO comment in `AllReduce` to consider merging synchronization barriers for optimization.

* lint fix

* [Enhancement] Add input validation for GEMM parameters

- Introduced checks to ensure that the dimensions M and N are divisible by their respective warp sizes (kMPerWarp and kNPerWarp) in the Gemm::ComputeWarpPartition method.
- Added informative error messages to assist in debugging when the input parameters do not meet the required conditions.

* bug fix
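
The GEMM input validation described above boils down to a divisibility check with a descriptive error. The Python sketch below only illustrates that pattern: the real check lives in C++ in `Gemm::ComputeWarpPartition`, the function name `check_gemm_shape` is hypothetical, and the default values used for `kMPerWarp` and `kNPerWarp` are assumptions.

```python
def check_gemm_shape(M, N, m_per_warp=16, n_per_warp=8):
    """Raise ValueError unless M and N are multiples of the per-warp tile sizes.

    m_per_warp and n_per_warp stand in for kMPerWarp and kNPerWarp;
    the defaults here are illustrative assumptions.
    """
    if M % m_per_warp != 0:
        raise ValueError(f"M must be divisible by {m_per_warp}, but got {M}")
    if N % n_per_warp != 0:
        raise ValueError(f"N must be divisible by {n_per_warp}, but got {N}")


check_gemm_shape(128, 64)  # passes silently
try:
    check_gemm_shape(100, 64)
except ValueError as err:
    print(err)
```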

02a0cf59

fix typo (#635) · a0dfa516
Yuqing Xia authored Jul 15, 2025

a0dfa516