- 22 Oct, 2025 6 commits
-
-
Yu Cheng authored
-
Xuehai Pan authored
* [Maint] Remove pre-commit install in `format.sh`
* [Maint] Update uncommitted change detection command
* [Minor] update warning messages
-
Yu Cheng authored
-
Xuehai Pan authored
* [Lint] Retire `format.sh` and add `clang-tidy` to GHA workflow
* chore: update clang-tidy settings
* chore: upgrade clang-format and clang-tidy version
* lint: resolve clang-tidy errors
* [Maint] restore format.sh
* [CI] pre-commit autoupdate
* [Minor] fix `command -v` usage
-
Lei Wang authored
-
Lei Wang authored
* add alloc_reducer gemv example
* test
-
- 21 Oct, 2025 9 commits
-
-
Zhengju Tang authored
* [Lint]
* [BugFix] Freeze the memory order of all atomic_add operations
* [Lint]
* [Atomic] Move on to regional atomic add
* [Lint]
-
Yu Cheng authored
-
Lei Wang authored
* - carry existing local-var initializer map into OpaqueBlockLower, reattach it to generated Allocates and the PrimFunc attrs
  - thread the map through FlattenBuffer and StorageRewrite so flattened/merged allocations keep their tl.local_var_init annotations
  - teach annotation handling to accept scalar initializers, resolve buffers, and merge with existing stat
* lint fix
* enhance
* lint fix
* lint fix
-
Lei Wang authored
* Enable configurable StorageRewrite inplace detection
  - Add kStorageRewriteDetectInplace constant and register the flag with PassContext so C++ code no longer hard-codes the key.
  - Wire StorageRewrite to include TileLang builtin constants and honor the new config toggle when deciding inplace reuse.
  - Document the flag across Python surfaces (PassConfigKey, JIT/autotuner docs) with usage guidance and simplified IR examples.
* lint fix
* add test
* lint fix
-
Tong WU authored
* [Cleanup] Remove `tilelang.disable_cache()` calls from example scripts
* lint
* lint
-
Lei Wang authored
* Improve target docs and helper messaging
  Commit Message:
  - add SUPPORTED_TARGETS metadata and expose describe_supported_targets()
  - relax target validation to accept option suffixes and upgrade error messages
  - document target usage and compute capability mapping in docs/get_started/targets.md
  - note preference for string targets when caching and link the new guide in docs/index.md
* remove american english spelling
-
Lei Wang authored
* refactor cython wrapper
* optimize
* fix installations
-
Zhengju Tang authored
* [BugFix] Add memory order argument for non-vectorized atomic add
* [Lint]
* [BugFix] Memory order
* [Lint]
* [BugFix] Argument in cuda template
* [Lint]
-
Zhengju Tang authored
* [Feature] Add GQA backward kernel with varlen input
* [Lint]
* [BugFix] Freeze the memory order of all atomic_add operations
* [Lint]
* [Lint]
* [BugFix] Use release order to boost performance
-
- 20 Oct, 2025 11 commits
-
-
Tong WU authored
* [Enhancement] Update async intrinsic handling in inject_fence_proxy
* Added support for wgmma async intrinsics in IsAsyncIntrinsic function.
* Changed handling of unknown externs to treat them as Generic instead of Async, improving accuracy in proxy kind determination.
* test fix
* Update testing/python/transform/test_tilelang_transform_inject_fence_proxy.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
-
Yu Cheng authored
-
Lei Wang authored
* Support reduce ss
* lint fix
* test fix
* lint fix
-
Lei Wang authored
* recommend using T.dynamic instead of T.symbolic
* lint fix
* lint fix
-
Lei Wang authored
- extend matmul autotune test suite with a symbolic M case and allow run_autotune to accept concrete values for symbolic dims
- sanitize _kernel_parameters when generating cache keys so symbolic vars serialize deterministically
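The idea of serializing symbolic variables deterministically for a cache key can be sketched in plain Python. This is only an illustration under stated assumptions, not the project's code: `SymbolicDim`, `sanitize`, and `cache_key` are hypothetical names, and the sketch assumes a symbolic dimension can be identified by a stable name rather than by its object identity.

```python
import hashlib
import json


class SymbolicDim:
    """Hypothetical stand-in for a symbolic dimension (not a real TileLang class)."""

    def __init__(self, name):
        self.name = name


def sanitize(value):
    """Replace objects whose default repr is unstable with plain, stable data."""
    if isinstance(value, SymbolicDim):
        # Use the stable name instead of the object's identity-based repr.
        return {"symbolic": value.name}
    if isinstance(value, (list, tuple)):
        return [sanitize(v) for v in value]
    if isinstance(value, dict):
        return {str(k): sanitize(v) for k, v in value.items()}
    return value


def cache_key(params):
    """Hash a sanitized, key-sorted JSON dump so equal inputs give equal keys."""
    payload = json.dumps(sanitize(params), sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


key_a = cache_key({"M": SymbolicDim("m"), "N": 1024})
key_b = cache_key({"N": 1024, "M": SymbolicDim("m")})
key_c = cache_key({"M": SymbolicDim("k"), "N": 1024})
```

In this sketch `key_a` and `key_b` match because the symbolic entry is reduced to its name and the keys are sorted before hashing, while `key_c` differs because the symbolic name differs.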
-
Zhengju Tang authored
* [Feature] Support Reduce operators for bitwise and/or/xor
* [Lint]
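For reference, the semantics of a bitwise and/or/xor reduction can be shown in plain Python. This is a minimal sketch of what such a reduction computes, not the TileLang operator itself; `bitwise_reduce` is a hypothetical helper introduced only for illustration.

```python
from functools import reduce
import operator

# Map reduction names to the corresponding binary bitwise operators.
_BITWISE_OPS = {
    "and": operator.and_,
    "or": operator.or_,
    "xor": operator.xor,
}


def bitwise_reduce(values, op):
    """Fold a non-empty sequence of integers with the named bitwise operator."""
    return reduce(_BITWISE_OPS[op], values)


values = [0b0111, 0b0101, 0b1100]  # 7, 5, 12
result_and = bitwise_reduce(values, "and")  # 0b0100
result_or = bitwise_reduce(values, "or")    # 0b1111
result_xor = bitwise_reduce(values, "xor")  # 0b1110
```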
-
Lei Wang authored
-
Lei Wang authored
-
Lei Wang authored
* Allow dynamic extents in loop partition; warn when layout inversion falls back to NoCheck
* add test and introduce predicate
* test fix
* fix
* enhance
* inverse with level
* test fix
* bug fix
-
Yu Cheng authored
-
dependabot[bot] authored
Bumps [actions/checkout](https://github.com/actions/checkout) from 4 to 5.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](https://github.com/actions/checkout/compare/v4...v5)
---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 19 Oct, 2025 5 commits
-
-
Lei Wang authored
-
Tong WU authored
[Enhancement] Deprecate split&sum in attn bwd examples on Hopper and migrate to vectorized atomic add (#1065)
-
Tong WU authored
* [Refactor][Example] Update linear attention examples and add tests
  - Refactored the backward and forward linear attention kernels to use shared memory and atomic additions for improved performance.
  - Introduced L2 normalization in the main functions of both examples.
  - Added a new test suite for the linear attention examples to ensure correctness and performance.
  - Updated argument parsing in the main functions for better usability.
* upd docstring for tma atomic add
* lint
* Add flash-linear-attention dependency to requirements.txt
* Rename main function to chunk_linear_attn_bwd
* Rename main function to chunk_linear_attn_fwd
* chore
---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Xuehai Pan authored
-
Lei Wang authored
* Add document PYTHONPATH build path
* update fp8 benchmark result
* remove redpath
* remove path
* tflops fix
-
- 18 Oct, 2025 3 commits
-
-
Yuqi Dong authored
* [CI]: Reduce test shapes to avoid OOM errors during CI.
* rabbit
* Increase number of processes for pytest from 2 to 4
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Lei Wang authored
-
Lei Wang authored
-
- 17 Oct, 2025 6 commits
-
-
Chaofan Lin authored
* [Refactor] Refactor Pass to support recursive load/store rewrite
* lint
* recursive collect conds for call_extern
* fix name
* [Lint]: [pre-commit.ci] auto fixes [...]
* lint
* [Lint]: [pre-commit.ci] auto fixes [...]
* lint
* [Lint]: [pre-commit.ci] auto fixes [...]
* address comment
* rename pad_value to safe_value
* lint
* add oob store test
* [Lint]: [pre-commit.ci] auto fixes [...]
* fix
* fix
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Lei Wang authored
* [Enhancement] Improve layout inference for local buffer handling in parallel operations
* Added logic to check if a loop only manipulates "local" buffers, which affects thread binding decisions.
* Updated the condition for determining parallel loop execution to account for local buffer stores.
* Cleaned up comments for clarity and future considerations.
* [Refactor] Clean up parallel loop condition formatting in layout inference
* Reformatted the condition for determining parallel loop execution for better readability.
* Maintained existing logic while enhancing code clarity for future modifications.
---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
-
LJC00118 authored
* improve CUDA compiler detection in CMake
* Minor fix
-
Lei Wang authored
-
LJC00118 authored
* remove last dimension stride must be 1 constraint
* add vectorize test
* minor fix
* [Lint]: [pre-commit.ci] auto fixes [...]
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Lei Wang authored
-