- 10 Nov, 2025 6 commits
-
-
Lei Wang authored
* [Refactor] Update ReduceOpNode to use absolute values in Max computation and remove unused shared memory reduction logic
* Changed Max computation for AbsMax type to use absolute values of lhs and rhs.
* Removed unused shared memory reduction logic and related checks for buffer dimensions and thread extents, simplifying the Lower method.
* Added a fatal log for unsupported buffer scope reductions.
* reduce fix
* [Fix] Update type check for eval value in Builder class
* Changed the type check for eval values to raise a TypeError for unsupported types, specifically excluding instances of tvm.tir.Buffer. This improves error handling and clarity in the Builder class.
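The AbsMax change in this commit can be illustrated in plain Python. This is only a sketch of the combine rule the message describes (taking the absolute value of both operands before `max`), not the C++ code in `ReduceOpNode`; the names `absmax_combine` and `absmax` are hypothetical.

```python
from functools import reduce
from typing import Iterable


def absmax_combine(lhs: float, rhs: float) -> float:
    """Combine step for an AbsMax reduction: abs() is applied to BOTH sides."""
    return max(abs(lhs), abs(rhs))


def absmax(values: Iterable[float]) -> float:
    """Reduce values with the AbsMax combine rule.

    Starting from 0.0 guarantees that a single negative element still
    passes through abs() (hypothetical helper, for illustration only).
    """
    return reduce(absmax_combine, values, 0.0)


result = absmax([-3.0, 2.0, 1.0])
```

Applying `abs` to only one operand would let a negative accumulator survive the comparison, which is the kind of error the commit message suggests was being fixed.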
-
Lei Wang authored
* [Enhancement] Improve iterator handling in layout utilities and parallel operations * Added a new function, DivideUnusedIterators, to detect per-iterator gaps in fused index expressions, enhancing the accuracy of unused iterator detection. * Updated CompleteBufferFragment to prefer direct inversion for bijective index mappings and introduced a fallback mechanism for non-bijective cases, improving layout inversion robustness. * Added a new test for layout inference in fused kernels to ensure correct compilation and execution without layout inversion failures. * lint fix
-
Lei Wang authored
* Added logging and exception handling for layout errors in InverseWithLevel method. * Replaced direct error check with a throw statement to enhance error reporting and debugging capabilities.
-
Lei Wang authored
* [Enhancement] Add NVCC support for PTX and SASS generation in TileLang
* Introduced functions to compile CUDA C++ source to PTX and SASS formats, enhancing the ability to generate intermediate representations for CUDA kernels.
* Added default compile options for NVCC, including paths for TileLang templates, CUTLASS, and CUDA includes.
* Implemented methods to export and display generated PTX and SASS code, improving usability for developers working with CUDA targets.
* Updated JITKernel class to integrate new NVCC functionalities for PTX and SASS handling, ensuring compatibility with existing workflows.
* [Fix] Improve error handling in get_sass_from_source function
* Added contextlib to suppress exceptions when removing temporary files, enhancing robustness.
* Fixed formatting of error message for clarity when CUDA tools are not found, ensuring better user feedback.
* [Enhancement] Preserve user flags in NVCC compile options
* Updated the default_compile_options function to preserve user-specified compile flags, including repeated tokens, by utilizing shlex for proper tokenization.
* This enhancement improves the flexibility and accuracy of NVCC compile options, ensuring that all user inputs are correctly handled.
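The flag-preservation idea in the last part of this commit can be sketched with the standard library's `shlex`. The function below is a hypothetical stand-in, not TileLang's `default_compile_options`; it assumes user flags arrive either as one string or as a list of strings, and it appends them without deduplication so that repeated tokens such as `-Xcompiler` survive.

```python
import shlex
from typing import List, Optional, Sequence, Union


def merge_compile_options(
    defaults: Sequence[str],
    user_flags: Optional[Union[str, Sequence[str]]] = None,
) -> List[str]:
    """Append user flags to the defaults, keeping order and repeats."""
    if user_flags is None:
        extra: List[str] = []
    elif isinstance(user_flags, str):
        # shlex.split honours shell quoting, so "-O3 -g" stays one token.
        extra = shlex.split(user_flags)
    else:
        extra = []
        for flag in user_flags:
            extra.extend(shlex.split(flag))
    # No set() or dict-based dedup here: repeated tokens are meaningful.
    return list(defaults) + extra


options = merge_compile_options(
    ["-std=c++17"], '-Xcompiler -fPIC -Xcompiler "-O3 -g"'
)
```

Deduplicating the merged list would collapse the two `-Xcompiler` tokens and change what the compiler sees, which is why the sketch only concatenates.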
-
Yichen Yan authored
-
Kuris authored
* Fix Buffer re-import typo in tilelang.language * fix lint error
-
- 09 Nov, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Introduce LetWrapper for handling loop variable substitutions in pipeline rewriting * Added LetWrapper struct to encapsulate variable and value pairs for loop variable substitutions. * Updated PipelineRewriter to accept a vector of LetWrapper instances, allowing for proper handling of Let statements that depend on the pipeline loop variable. * Enhanced the BuildPipeline method to incorporate LetWrapper instances into rewritten blocks, ensuring correct substitutions during pipeline execution. * Refactored logic for processing Let statements to differentiate between those that use the loop variable and those that do not, improving the flexibility of the pipeline transformation. * Refactor lambda expression for clarity in loop variable usage check in inject_pipeline.cc * [Test] Add regression test for loop variable handling in kernel compilation * Introduced a new test case to verify correct handling of loop variables in the kernel compilation process, addressing a regression issue with InjectSoftwarePipeline. * The test ensures that the loop variable is not left as a free variable, which previously caused failures in MakePackedAPI. * Configurations are set to disable warp specialization and TMA lowering to align with the original issue reproduction. * Remove unused import in regression test for loop variable handling in kernel compilation
-
- 08 Nov, 2025 2 commits
-
-
Lei Wang authored
* [Enhancement] Improve handling of negative indices in legalize_negative_index pass
* Added logic to handle scalar and vector indices separately, enhancing the ability to determine non-negativity and negativity of indices.
* Introduced detailed logging for cases where non-negativity cannot be proven, improving debugging capabilities.
* Refactored index state determination for vector types, including support for Ramp and Broadcast nodes.
* Fix incorrect lane handling in legalize_negative_index pass by dereferencing lanes to obtain the correct integer value.
* Enhance legalize_negative_index pass by including necessary header for TIR operations. This addition supports improved functionality and maintainability of the transformation logic.
-
Jesse authored
-
- 07 Nov, 2025 3 commits
-
-
Lei Wang authored
* Enhance CUDA code generation by improving register type handling for float data types and introducing a workaround for TF32 compatibility. Updated MMA register type registration for A and B operands to boost performance and ensure correctness.
* lint fix
---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
-
Lei Wang authored
* Refactor storage access visitor in TileLang to improve readability and maintainability. Organized includes, enhanced comments, and preserved access summaries during condition evaluations in IfThenElse statements. Adjusted handling of buffer accesses and thread invariance checks for better clarity. * lint fix
-
LJC00118 authored
-
- 06 Nov, 2025 3 commits
-
-
Kurisu authored
-
Kurisu authored
* [Feature] Support serial for with step
* add more tests
* fix
* Enhance trip count validation in SerialForWithStep to ensure non-zero step values and prevent undefined behavior. Added error handling for zero step values and improved logging for non-constant steps.
* Update builder.py
* fix lint error
---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
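The trip-count validation mentioned for `SerialForWithStep` can be sketched as follows. This is an assumption-laden illustration for constant integer bounds only; `serial_trip_count` is a hypothetical name, and the actual builder may compute the count differently or symbolically.

```python
def serial_trip_count(start: int, stop: int, step: int) -> int:
    """Number of iterations of a serial loop from start to stop by step.

    A zero step is rejected up front, since it would mean either an
    infinite loop or a division by zero in the count below.
    """
    if step == 0:
        raise ValueError("step must be non-zero")
    # Distance to cover in the direction of travel.
    span = (stop - start) if step > 0 else (start - stop)
    if span <= 0:
        return 0
    # Ceiling division by the step magnitude.
    return (span + abs(step) - 1) // abs(step)


forward = serial_trip_count(0, 10, 3)    # visits 0, 3, 6, 9
backward = serial_trip_count(10, 0, -3)  # visits 10, 7, 4, 1
```

For constant arguments the result agrees with `len(range(start, stop, step))`, which gives a convenient cross-check.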
-
Yichen Yan authored
* Enable ccache for linux cibw, unify ccache settings. * hash cc files to avoid getting stuck in some cases * Add comments about ccache version * fix wrong gitignore
-
- 05 Nov, 2025 8 commits
-
-
Lei Wang authored
* [Feature] Add support for SM70 tensor core MMA instructions - Introduced new intrinsic `ptx_mma_sm70` for Volta GPUs, enabling m16n16k4 shape with FP16 inputs and FP16/FP32 accumulation. - Added `GemmMMASm70` class for handling GEMM operations specific to SM70 architecture. - Implemented layout functions for Volta swizzled layouts and updated existing GEMM layout inference logic. - Updated `requirements-dev.txt` to include `apache-tvm-ffi` dependency. - Added correctness evaluation script for testing GEMM operations on SM70. * [Refactor] Update formatting and installation commands in scripts - Modified `format.sh` to install `pre-commit` and `clang-tidy` with the `--user` flag for user-specific installations. - Improved readability in `correctness_evaluation_sm70.py` by adjusting the formatting of pytest parameters. - Cleaned up spacing and formatting in various C++ source files for better consistency and readability. - Removed unnecessary comments and improved layout function definitions in `mma_sm70_layout.py` and `mma_sm70_macro_generator.py` for clarity. - Ensured consistent formatting in layout initialization and swizzle functions. * typo fix
-
Tong WU authored
* Update dependency version for apache-tvm-ffi in pyproject.toml to fix CI
* [Math] Add `tl.infinity` operation and update Python interface for infinity handling
  - Implemented `infinity_op` in C++ to return infinity values for supported data types.
  - Registered new operation `tl.infinity` with appropriate attributes.
  - Updated Python interface to call the new `tl.infinity` operation instead of the previous method.
* Add unit tests for `tl.infinity` operation in TileLang
  - Introduced a new test file `test_tilelang_language_infinity.py` to validate the behavior of the `tl.infinity` operation across multiple data types (float16, bfloat16, float32, float64).
  - Implemented a kernel to fill a tensor with infinity values and asserted the correctness of the output against PyTorch's `torch.inf`.
* lint
---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
-
Lei Wang authored
[Refactor] Dynamic registration of FP8 data type for compatibility with older PyTorch versions (#1197)
-
Yu Cheng authored
* [Example] Update GQA varlen fwd * fix
-
Zhengju Tang authored
* [Test] Add cp async to avoid register spill * [BugFix] GQA fwd and bwd - Fix the undefined behavior of -inf in acc_s - Fix the causal loop range in varlen scenario * [TMA] Move on to TMA and locate the register spill issue * [Debug] Not the reason of zero-assignment. Probably the combination of Parallel op & conditional qkT * [Debug] The SIMT copy in producer occupies too many registers * [BugFix] Use 3D lse and delta to avoid illegal instruction * [Perf] Relaxed order for dQ and SIMT store for dKdV * [Feat] For atomic add version * [Lint] * [Bugfix] Enable code lowering with producer‑copy‑only program (#1168) * bugfix * lint fix * Enhance warp group register allocation to handle missing consumer bodies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns. * Refactor VisitExpr_ method in inject_tma_barrier.cc for improved readability. Adjusted formatting and spacing for clarity in barrier handling logic. * Update barrier handling in inject_tma_barrier.cc to accommodate newly appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions. * [Bugfix] Support 16bits shfl_sync (#1169) * Add type-safe warp shuffle helpers for 16-bit float types in common.h - Introduced generic passthrough functions for warp shuffle operations: `shfl_xor_sync`, `shfl_down_sync`, `shfl_up_sync`, and `shfl_sync`. - Added specializations for `cutlass::half_t` and `cutlass::bfloat16_t` to ensure type safety during shuffle operations. - Updated `reduce.h` to utilize the new shuffle functions, enhancing code clarity and maintainability. 
* lint fix * [Testing] Move TMA 1D and test for its functionality (#1167) * [Testing] Move TMA 1D and test for its functionality * [Lint] * [Refactor]: Change the params in pytest to avoid oom error during ci (#1170) * [Refactor]: Change the params in pytest to avoid oom error during ci * format * fix * Update test_example_cast.py * Update parameters in test_example_cast * Update test_example_flash_attention.py * update * format * fix * fix * format * [Bugfix] Fix tvm import path for editable build (#1172) * [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986) * remove debug print * pipeline fix * use the correct buffer access scope * rs support * warp warpgroup_fence_operand * fix * fp8 dtype ptx enhance * mma fix * TCGEN05 Interface * tcgen05 support * rebase * update * Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors. * lint fix * Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module. * wgmma fix --------- Co-authored-by:
Zhiwen Mo <zm125@ic.ac.uk> * [Language] Add Correctness and performance check scripts for V2 (#1174) * fix * lint fix * fix * lint fix * fix * upd * [Bugfix] Legalize Datatype for mma intrinsic codegen (#1179) * fix * lint fix * Enhance CUDA code generation by updating register type handling for float data types. Introduced a workaround for TF32 type compatibility and improved the registration of MMA register types for A and B operands. * [Perf] Add layout and use_tma to boost performance * [Lint] * [Note] --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
Yuqi Dong <134183314+yyttt6@users.noreply.github.com> Co-authored-by:
Zhiwen Mo <zm125@ic.ac.uk>
-
Lei Wang authored
* fix * lint fix * fix * lint fix * fix * upd * support n>256 * Remove unnecessary pass configurations for fast math in MHA forward BHSD latency script. * lint fix * lint fix
-
Yichen Yan authored
* update exclude in sdist * reuse cibw workflow in maint * update * fix * fmt * upload artifacts for [Release] PRs * dot-prefix version file * update
-
Kurisu authored
* [Feat] add 2 phase binding to allow swap two var * Minor update tvm dtype constructor * fix lint error
-
- 04 Nov, 2025 4 commits
-
-
Lei Wang authored
* [Feature] Enhance fill operation to support various buffer types - Added support for `BufferLoad` in the `fill` function to handle different buffer types. - Updated `Fill` class to process region descriptors and buffer regions, improving flexibility in buffer handling. - Introduced checks for static bounds in region definitions to ensure safety during operations. - Refactored loop induction variable handling in `FillNode` to accommodate sliced regions. * lint fix * [Refactor] Improve Python compatibility for ParamSpec and Self - Added compatibility handling for ParamSpec and Self to support Python versions below 3.10 and 3.11 respectively. - Updated type annotations across multiple files to ensure consistent usage of typing features. * [Update] Require Python 3.9 and enhance type annotations - Updated the minimum required Python version from 3.8 to 3.9 in `pyproject.toml`. - Removed references to Python 3.8 in classifiers. - Changed type annotations from `int | None` to `Optional[int]` in multiple example files for better clarity and compatibility. - Improved import statements to use `collections.abc` for `Iterable` and `contextlib` for `AbstractContextManager` in relevant files. * [Refactor] Update import statements to enhance type annotations - Replaced imports from `typing` with `collections.abc` for `Iterable` and `Mapping` in relevant files to improve compatibility and clarity. - Updated the caching decorator from `functools.lru_cache` to `functools.cache` for better performance in the C++ compiler retrieval function. - Adjusted import statements in the language proxy file to maintain consistency in type annotations. * disable rocm rs nt test. * lint fix
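The switch from `functools.lru_cache` to `functools.cache` noted in this commit can be sketched as below. The lookup function and its candidate compiler names are hypothetical and only stand in for the C++ compiler retrieval function the message refers to.

```python
import functools
import shutil
from typing import Optional


@functools.cache  # available from Python 3.9, the new minimum version
def find_cxx_compiler() -> Optional[str]:
    """Return the path of the first C++ compiler found on PATH, if any.

    The candidate names are assumptions made for this illustration.
    """
    for name in ("c++", "g++", "clang++"):
        path = shutil.which(name)
        if path is not None:
            return path
    return None


first = find_cxx_compiler()   # performs the PATH lookup
second = find_cxx_compiler()  # served from the cache
```

`functools.cache` is equivalent to `functools.lru_cache(maxsize=None)`, so it skips the bookkeeping needed to evict old entries, which suits a zero-argument lookup whose result does not change during a run.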
-
Lei Wang authored
* [Feature] Enhance fill operation to support various buffer types - Added support for `BufferLoad` in the `fill` function to handle different buffer types. - Updated `Fill` class to process region descriptors and buffer regions, improving flexibility in buffer handling. - Introduced checks for static bounds in region definitions to ensure safety during operations. - Refactored loop induction variable handling in `FillNode` to accommodate sliced regions. * lint fix
-
Kurisu authored
* [Fix] Remove type params * fix lint error * [Fix] fix dtype new error
-
pre-commit-ci[bot] authored
* [CI] [pre-commit.ci] autoupdate updates: - [github.com/astral-sh/ruff-pre-commit: v0.14.1 → v0.14.3](https://github.com/astral-sh/ruff-pre-commit/compare/v0.14.1...v0.14.3)
* [CI] sync ruff version
---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
-
- 03 Nov, 2025 5 commits
-
-
Kurisu authored
* Fix incompatible floordiv in packed api * fix lint
-
Kurisu authored
* tilelang frontend v2
* syntax sugar: defining a local var by annotation
* [Refactor] fix type linting warning like `T.float32`
* Add tl.local_var_init for new tl.float32
* allow passing default argument as function annotation
* allow default arguments as annotation
* fix lint error
* minor fix
* [Refactor] refactor tilelang.jit and tilelang.autotune
* minor fix
* minor fix
* minor fix
* fix metal get function name
* add par_compile impl and tests
* Type consistency on tvm datatype
  1. isinstance(tl.float32, tvm.DataType) == True
  2. Allow `tl.float32` as function annotations
  3. Allow `tl.float32` as argument to be passed to `tl.alloc` or other functions
* fix lint error
* add more warning in frontend
* update tvm version
* Minor fix on tvm_ffi annotations
* add document and examples
* fix lint error
* Simplify index calculations in example_chunk_o_bwd.py. Refactor index calculations for dg_last_fragment assignment.
* minor fix
* lint fix
---------
Co-authored-by: Lei Wang <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
dependabot[bot] authored
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4 to 5.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](https://github.com/actions/upload-artifact/compare/v4...v5)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-version: '5'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
dependabot[bot] authored
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 5 to 6.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v5...v6)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-version: '6'
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
Lei Wang authored
* fix * lint fix * Enhance CUDA code generation by updating register type handling for float data types. Introduced a workaround for TF32 type compatibility and improved the registration of MMA register types for A and B operands.
-
- 02 Nov, 2025 4 commits
-
-
Lei Wang authored
* fix * lint fix * fix * lint fix * fix * upd
-
Lei Wang authored
* remove debug print
* pipeline fix
* use the correct buffer access scope
* rs support
* warp warpgroup_fence_operand
* fix
* fp8 dtype ptx enhance
* mma fix
* TCGEN05 Interface
* tcgen05 support
* rebase
* update
* Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.
* lint fix
* Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.
* wgmma fix
---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
-
Lei Wang authored
-
Yuqi Dong authored
* [Refactor]: Change the params in pytest to avoid oom error during ci * format * fix * Update test_example_cast.py * Update parameters in test_example_cast * Update test_example_flash_attention.py * update * format * fix * fix * format
-
- 01 Nov, 2025 1 commit
-
-
Zhengju Tang authored
* [Testing] Move TMA 1D and test for its functionality * [Lint]
-
- 31 Oct, 2025 3 commits
-
-
Lei Wang authored
* Add type-safe warp shuffle helpers for 16-bit float types in common.h - Introduced generic passthrough functions for warp shuffle operations: `shfl_xor_sync`, `shfl_down_sync`, `shfl_up_sync`, and `shfl_sync`. - Added specializations for `cutlass::half_t` and `cutlass::bfloat16_t` to ensure type safety during shuffle operations. - Updated `reduce.h` to utilize the new shuffle functions, enhancing code clarity and maintainability. * lint fix
-
Lei Wang authored
* bugfix * lint fix * Enhance warp group register allocation to handle missing consumer bodies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns. * Refactor VisitExpr_ method in inject_tma_barrier.cc for improved readability. Adjusted formatting and spacing for clarity in barrier handling logic. * Update barrier handling in inject_tma_barrier.cc to accommodate newly appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions.
-
Lei Wang authored
* 3rdparty tvm bump * bump tvm into v0.22.0 * lint fix * rebase tvm * Update submodule tvm to latest commit 3085bc4 * Refactor: Update configuration retrieval in CopyNode and adjust test registration in tilelang * test fix * add requirement * atomic_fix * atomic_fix * phaseout py39 * optimize * optimize * lint fix * do not clean cache * do not clean cache * [Minor] Minor update for Python versions and dependencies * [Lint] fix lint for py39 * [Lint] fix lint for ROCm * [Build][CI] Sync CI changes from upstream/sdist * [Lint] fix lint for ROCm * [Build][CI] Update `repair-wheel-command` * [Minor] update abi3audit result format * [Lint] fix lint for ROCm * [BugFix] fix build * [Lint] fix lint for ROCm * [BugFix] set rpath for libtvm and libtvm_runtime * [Deps] pin apache-tvm-ffi version * [Build] set Python 3.9 Limited API for Cython target * [Build] set Python 3.9 Limited API for Cython target * [Deps] Restore Python 3.8 support * [Build] use `apache-tvm-ffi`'s `libtvm_ffi` * [BugFix] use `;` as delimiter for RPATH on macOS * [BugFix] use `--ignore-missing-dependencies` for `delocate-wheel` * [Build] support `sccache` if available * [Build] add CIBW import test * [Build][CI] enable ccache for CIBW on Linux * [BugFix] set rpath for libtvm and libtvm_runtime * Revert "[Build][CI] enable ccache for CIBW on Linux" This reverts commit cd9ab57bb5ddd2572c60bcbbebde81480a658fd3. * [CI] fix perfbench bot * [BugFix] use Python 3.9 to build wheel * [Minor] update perfbench bot envs * [BugFix] fix CIBW environment on Linux * [CI] skip import test on CentOS 7 * [CI] use Python urllib to download file instead of Wget --------- Co-authored-by:Xuehai Pan <XuehaiPan@pku.edu.cn>
-