1. 11 Nov, 2025 3 commits
    • [Refactor] Simplify logic in the `CompleteBufferFragment` (#1226) · 7045f1d6
      Lei Wang authored
      
      
      * fix
      
      * Fix logging level in LayoutNode::InverseWithLevel method from WARNING to DLOG for symbolic layout fallback.
      
      * lint fix
      
      ---------
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
    • [Enhancement] Add thread count validation for ReduceOp fragment layout inference (#1225) · 67cc8611
      Lei Wang authored
      * [Enhancement] Add thread count validation for ReduceOp fragment layout inference
      
      * Introduced a check to ensure that the thread count is divisible by the replicate extent during layout inference in ReduceOpNode. This validation prevents layout inference failures and provides detailed error messages to guide users in resolving issues related to thread block sizes and fragment layouts.
      * Updated tests to remove unsupported configurations that could lead to layout inference errors, ensuring more robust testing scenarios.
      
      * lint fix
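The thread count validation described in this commit can be illustrated with a minimal Python sketch. The function name `check_thread_count`, its parameters, and the error wording are hypothetical and not taken from the TileLang sources; the real check lives in the C++ `ReduceOpNode` layout inference.

```python
def check_thread_count(num_threads: int, replicate_extent: int) -> int:
    """Return the number of threads per replica, or raise if the split is uneven.

    Hypothetical helper: mirrors the idea of rejecting a thread count that is
    not divisible by the replicate extent before layout inference proceeds.
    """
    if replicate_extent <= 0:
        raise ValueError(f"replicate extent must be positive, got {replicate_extent}")
    if num_threads % replicate_extent != 0:
        raise ValueError(
            f"thread count {num_threads} is not divisible by replicate extent "
            f"{replicate_extent}; adjust the thread block size or the fragment layout"
        )
    return num_threads // replicate_extent
```

Failing early with both numbers in the message is what lets a user see which block size or layout to change.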
    • [GQA] Add varlen decoding kernel with logits saving (#1223) · eb6e8973
      Zhengju Tang authored
      * [Example] Add GQA varlen decoding kernel with logits return
      
      * [Example] Support Sink for GQA varlen decoding
      
      * [Example] Add for no-varlen support
      
      * [Tune] Add high performance logits saving
      
      * [Lint]
      
      * [Lint]
      
      * [Rename]
  2. 10 Nov, 2025 6 commits
    • [Language] Refactor reduce and support shared memory as its in/out (#1219) · 47039f06
      Lei Wang authored
      * [Refactor] Update ReduceOpNode to use absolute values in Max computation and remove unused shared memory reduction logic
      
      * Changed Max computation for AbsMax type to use absolute values of lhs and rhs.
      * Removed unused shared memory reduction logic and related checks for buffer dimensions and thread extents, simplifying the Lower method.
      * Added a fatal log for unsupported buffer scope reductions.
      
      * reduce fix
      
      * [Fix] Update type check for eval value in Builder class
      
      * Changed the type check for eval values to raise a TypeError for unsupported types, specifically excluding instances of tvm.tir.Buffer. This improves error handling and clarity in the Builder class.
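The AbsMax change in this commit (taking absolute values of both `lhs` and `rhs`) can be sketched in plain Python. The names `absmax_combine` and `absmax` are illustrative only; the actual logic is C++ inside `ReduceOpNode`.

```python
from functools import reduce


def absmax_combine(lhs: float, rhs: float) -> float:
    # Take the absolute value of BOTH operands, so a negative running value
    # on the left-hand side cannot hide a larger magnitude.
    return max(abs(lhs), abs(rhs))


def absmax(values):
    # Fold the combiner over the inputs; 0.0 is the identity for |x| maxima.
    return reduce(absmax_combine, values, 0.0)
```

For comparison, `max(lhs, abs(rhs))` with `lhs = -5` and `rhs = 4` would yield 4 rather than 5, which is the kind of mistake the two-sided absolute value avoids.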
    • [Enhancement] Improve iterator handling in layout utilities and parallel operations (#1221) · 2957afca
      Lei Wang authored
      * [Enhancement] Improve iterator handling in layout utilities and parallel operations
      
      * Added a new function, DivideUnusedIterators, to detect per-iterator gaps in fused index expressions, enhancing the accuracy of unused iterator detection.
      * Updated CompleteBufferFragment to prefer direct inversion for bijective index mappings and introduced a fallback mechanism for non-bijective cases, improving layout inversion robustness.
      * Added a new test for layout inference in fused kernels to ensure correct compilation and execution without layout inversion failures.
      
      * lint fix
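The "prefer direct inversion when the mapping is bijective, otherwise fall back" idea from this commit can be sketched on a small finite index space. This is only an analogy under stated assumptions: `try_invert` is a hypothetical helper that enumerates integer indices, whereas `CompleteBufferFragment` works on symbolic layouts, and the fallback path itself is not described in the commit message.

```python
from typing import Callable, Dict, Optional


def try_invert(
    forward: Callable[[int], int], domain_size: int, codomain_size: int
) -> Optional[Dict[int, int]]:
    """Return the inverse mapping if `forward` is a bijection, else None."""
    inverse: Dict[int, int] = {}
    for i in range(domain_size):
        out = forward(i)
        if out in inverse or not 0 <= out < codomain_size:
            return None  # two inputs collide, or the output is out of range
        inverse[out] = i
    if len(inverse) != codomain_size:
        return None  # some output index is never produced
    return inverse
```

A caller would use the returned inverse directly when it exists and take its own fallback path when the result is `None`.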
    • [Bugfix] Improve error handling in LayoutNode::InverseWithLevel (#1215) (#1220) · cf46b7bd
      Lei Wang authored
      * Added logging and exception handling for layout errors in InverseWithLevel method.
      * Replaced direct error check with a throw statement to enhance error reporting and debugging capabilities.
    • [Utils] Add source export, NVCC-based PTX/SASS dump, logging (#1216) · 7e5b1cd2
      Lei Wang authored
      * [Enhancement] Add NVCC support for PTX and SASS generation in TileLang
      
      * Introduced functions to compile CUDA C++ source to PTX and SASS formats, enhancing the ability to generate intermediate representations for CUDA kernels.
      * Added default compile options for NVCC, including paths for TileLang templates, CUTLASS, and CUDA includes.
      * Implemented methods to export and display generated PTX and SASS code, improving usability for developers working with CUDA targets.
      * Updated JITKernel class to integrate new NVCC functionalities for PTX and SASS handling, ensuring compatibility with existing workflows.
      
      * [Fix] Improve error handling in get_sass_from_source function
      
      * Added contextlib to suppress exceptions when removing temporary files, enhancing robustness.
      * Fixed formatting of error message for clarity when CUDA tools are not found, ensuring better user feedback.
      
      * [Enhancement] Preserve user flags in NVCC compile options
      
      * Updated the default_compile_options function to preserve user-specified compile flags, including repeated tokens, by utilizing shlex for proper tokenization.
      * This enhancement improves the flexibility and accuracy of NVCC compile options, ensuring that all user inputs are correctly handled.
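Two of the techniques named in this commit, tokenizing user flags with `shlex` so repeated tokens survive and suppressing errors when removing temporary files, can be sketched as follows. The helper names `build_nvcc_ptx_command` and `remove_quietly` and the default architecture are assumptions for illustration, not TileLang's API; `-ptx`, `-arch`, `-I`, and `-o` are standard `nvcc` options.

```python
import contextlib
import os
import shlex


def build_nvcc_ptx_command(src_path, out_path, arch="sm_80", user_flags=""):
    """Build an nvcc command line that compiles CUDA C++ source to PTX."""
    cmd = ["nvcc", "-ptx", f"-arch={arch}"]
    # shlex.split honours shell quoting and keeps every token, including
    # repeated ones such as several "-I" flags, in the order given.
    cmd += shlex.split(user_flags)
    cmd += ["-o", out_path, src_path]
    return cmd


def remove_quietly(path):
    """Delete a temporary file, ignoring errors such as a missing file."""
    with contextlib.suppress(OSError):
        os.remove(path)
```

On a machine with the CUDA toolkit installed, the returned list could be passed to `subprocess.run(cmd, check=True)`; the sketch itself only constructs the command.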
    • [Fix] Fix buffer re-import typo in tilelang.language (#1214) · d5fda276
      Kuris authored
      * Fix Buffer re-import typo in tilelang.language
      
      * fix lint error
  3. 09 Nov, 2025 1 commit
    • [Bugfix] Enhance LetStmt Handling in Pipeline Transform (#1212) · 85218bd9
      Lei Wang authored
      * [Enhancement] Introduce LetWrapper for handling loop variable substitutions in pipeline rewriting
      
      * Added LetWrapper struct to encapsulate variable and value pairs for loop variable substitutions.
      * Updated PipelineRewriter to accept a vector of LetWrapper instances, allowing for proper handling of Let statements that depend on the pipeline loop variable.
      * Enhanced the BuildPipeline method to incorporate LetWrapper instances into rewritten blocks, ensuring correct substitutions during pipeline execution.
      * Refactored logic for processing Let statements to differentiate between those that use the loop variable and those that do not, improving the flexibility of the pipeline transformation.
      
      * Refactor lambda expression for clarity in loop variable usage check in inject_pipeline.cc
      
      * [Test] Add regression test for loop variable handling in kernel compilation
      
      * Introduced a new test case to verify correct handling of loop variables in the kernel compilation process, addressing a regression issue with InjectSoftwarePipeline.
      * The test ensures that the loop variable is not left as a free variable, which previously caused failures in MakePackedAPI.
      * Configurations are set to disable warp specialization and TMA lowering to align with the original issue reproduction.
      
      * Remove unused import in regression test for loop variable handling in kernel compilation
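The split between Let statements that use the pipeline loop variable and those that do not can be sketched in Python. Everything here is a stand-in: expressions are plain strings parsed with `ast` rather than TIR nodes, `split_lets` is a hypothetical name, and treating a binding as dependent when it uses an earlier dependent binding is an assumption of this sketch, not something the commit message states.

```python
import ast
from dataclasses import dataclass


@dataclass
class LetWrapper:
    var: str    # name bound by the Let statement
    value: str  # bound expression, as Python source standing in for TIR


def names_in(expr: str) -> set:
    """Collect the variable names referenced by an expression."""
    tree = ast.parse(expr, mode="eval")
    return {node.id for node in ast.walk(tree) if isinstance(node, ast.Name)}


def split_lets(lets, loop_var):
    """Partition bindings (in order) by dependence on the loop variable."""
    dependent_vars = {loop_var}
    dependent, independent = [], []
    for let in lets:
        if names_in(let.value) & dependent_vars:
            dependent_vars.add(let.var)  # later uses of this var also depend
            dependent.append(let)
        else:
            independent.append(let)
    return dependent, independent
```

In this picture, only the dependent bindings need to be carried into the rewritten pipeline blocks so that the loop variable is substituted consistently.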
  4. 08 Nov, 2025 2 commits
    • [Enhancement] Improve handling of negative indices for ramp and broadcast node (#1207) · 918a21bd
      Lei Wang authored
      * [Enhancement] Improve handling of negative indices in legalize_negative_index pass
      
      * Added logic to handle scalar and vector indices separately, enhancing the ability to determine non-negativity and negativity of indices.
      * Introduced detailed logging for cases where non-negativity cannot be proven, improving debugging capabilities.
      * Refactored index state determination for vector types, including support for Ramp and Broadcast nodes.
      
      * Fix incorrect lane handling in legalize_negative_index pass by dereferencing lanes to obtain the correct integer value.
      
      * Enhance legalize_negative_index pass by including necessary header for TIR operations. This addition supports improved functionality and maintainability of the transformation logic.
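The sign classification for vector indices described in this commit can be illustrated for the simple case of a ramp with constant integer base and stride. The names `ramp_sign` and `legalize_index` are hypothetical, and the real pass reasons about symbolic expressions with an analyzer rather than concrete integers.

```python
def ramp_sign(base: int, stride: int, lanes: int) -> str:
    """Classify the lanes base, base + stride, ..., base + stride * (lanes - 1).

    Assumes lanes >= 1. Because the lanes are monotone in the lane number,
    checking the first and last lane is enough.
    """
    first = base
    last = base + stride * (lanes - 1)
    lo, hi = min(first, last), max(first, last)
    if lo >= 0:
        return "non_negative"
    if hi < 0:
        return "negative"
    return "unknown"  # mixed signs: cannot be rewritten uniformly


def legalize_index(index: int, extent: int) -> int:
    """Map a negative index to its wrapped position along a dimension."""
    return index + extent if index < 0 else index
```

A broadcast index has the same value in every lane, so it can be classified like a scalar.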
    • Fix Dockerfile.cu128 (#1208) · 4818d209
      Jesse authored
  5. 07 Nov, 2025 3 commits
  6. 06 Nov, 2025 3 commits
  7. 05 Nov, 2025 8 commits
    • [SM70] Refactor and minor fix for SM70 (#1195) · 4a9cb470
      Lei Wang authored
      * [Feature] Add support for SM70 tensor core MMA instructions
      
      - Introduced new intrinsic `ptx_mma_sm70` for Volta GPUs, enabling m16n16k4 shape with FP16 inputs and FP16/FP32 accumulation.
      - Added `GemmMMASm70` class for handling GEMM operations specific to SM70 architecture.
      - Implemented layout functions for Volta swizzled layouts and updated existing GEMM layout inference logic.
      - Updated `requirements-dev.txt` to include `apache-tvm-ffi` dependency.
      - Added correctness evaluation script for testing GEMM operations on SM70.
      
      * [Refactor] Update formatting and installation commands in scripts
      
      - Modified `format.sh` to install `pre-commit` and `clang-tidy` with the `--user` flag for user-specific installations.
      - Improved readability in `correctness_evaluation_sm70.py` by adjusting the formatting of pytest parameters.
      - Cleaned up spacing and formatting in various C++ source files for better consistency and readability.
      - Removed unnecessary comments and improved layout function definitions in `mma_sm70_layout.py` and `mma_sm70_macro_generator.py` for clarity.
      - Ensured consistent formatting in layout initialization and swizzle functions.
      
      * typo fix
    • [Feature] Add `tl.infinity` operator for infinity handling of bfloat16 (#1175) · 11456de2
      Tong WU authored
      
      
      * Update dependency version for apache-tvm-ffi in pyproject.toml to fix CI
      
      * [Math] Add `tl.infinity` operation and update Python interface for infinity handling
      
      - Implemented `infinity_op` in C++ to return infinity values for supported data types.
      - Registered new operation `tl.infinity` with appropriate attributes.
      - Updated Python interface to call the new `tl.infinity` operation instead of the previous method.
      
      * Add unit tests for `tl.infinity` operation in TileLang
      
      - Introduced a new test file `test_tilelang_language_infinity.py` to validate the behavior of the `tl.infinity` operation across multiple data types (float16, bfloat16, float32, float64).
      - Implemented a kernel to fill a tensor with infinity values and asserted the correctness of the output against PyTorch's `torch.inf`.
      
      * lint
      
      ---------
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
    • [Refactor] Dynamic registration of FP8 data type for compatibility with older PyTorch versions (#1197) · c67d66a3
      Lei Wang authored
    • [Example] Update GQA varlen fwd (#1173) · a9d823b8
      Yu Cheng authored
      * [Example] Update GQA varlen fwd
      
      * fix
    • [GQA] Use TMA in GQA bwd kernel to boost performance (#1176) · 298ab480
      Zhengju Tang authored
      
      
      * [Test] Add cp async to avoid register spill
      
      * [BugFix] GQA fwd and bwd
      - Fix the undefined behavior of -inf in acc_s
      - Fix the causal loop range in varlen scenario
      
      * [TMA] Move on to TMA and locate the register spill issue
      
      * [Debug] Not the reason of zero-assignment. Probably the combination of Parallel op & conditional qkT
      
      * [Debug] The SIMT copy in producer occupies too many registers
      
      * [BugFix] Use 3D lse and delta to avoid illegal instruction
      
      * [Perf] Relaxed order for dQ and SIMT store for dKdV
      
      * [Feat] For atomic add version
      
      * [Lint]
      
      * [Bugfix] Enable code lowering with producer‑copy‑only program (#1168)
      
      * bugfix
      
      * lint fix
      
      * Enhance warp group register allocation to handle missing consumer bodies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns.
      
      * Refactor VisitExpr_ method in inject_tma_barrier.cc for improved readability. Adjusted formatting and spacing for clarity in barrier handling logic.
      
      * Update barrier handling in inject_tma_barrier.cc to accommodate newly appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions.
      
      * [Bugfix] Support 16bits shfl_sync (#1169)
      
      * Add type-safe warp shuffle helpers for 16-bit float types in common.h
      
      - Introduced generic passthrough functions for warp shuffle operations: `shfl_xor_sync`, `shfl_down_sync`, `shfl_up_sync`, and `shfl_sync`.
      - Added specializations for `cutlass::half_t` and `cutlass::bfloat16_t` to ensure type safety during shuffle operations.
      - Updated `reduce.h` to utilize the new shuffle functions, enhancing code clarity and maintainability.
      
      * lint fix
      
      * [Testing] Move TMA 1D and test for its functionality (#1167)
      
      * [Testing] Move TMA 1D and test for its functionality
      
      * [Lint]
      
      * [Refactor]: Change the params in pytest to avoid oom error during ci (#1170)
      
      * [Refactor]: Change the params in pytest to avoid oom error during ci
      
      * format
      
      * fix
      
      * Update test_example_cast.py
      
      * Update parameters in test_example_cast
      
      * Update test_example_flash_attention.py
      
      * update
      
      * format
      
      * fix
      
      * fix
      
      * format
      
      * [Bugfix] Fix tvm import path for editable build (#1172)
      
      * [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986)
      
      * remove debug print
      
      * pipeline fix
      
      * use the correct buffer access scope
      
      * rs support
      
      * wrap warpgroup_fence_operand
      
      * fix
      
      * fp8 dtype ptx enhance
      
      * mma fix
      
      * TCGEN05 Interface
      
      * tcgen05 support
      
      * rebase
      
      * update
      
      * Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.
      
      * lint fix
      
      * Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.
      
      * wgmma fix
      
      ---------
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
      
      * [Language] Add Correctness and performance check scripts for V2 (#1174)
      
      * fix
      
      * lint fix
      
      * fix
      
      * lint fix
      
      * fix
      
      * upd
      
      * [Bugfix] Legalize Datatype for mma intrinsic codegen (#1179)
      
      * fix
      
      * lint fix
      
      * Enhance CUDA code generation by updating register type handling for float data types. Introduced a workaround for TF32 type compatibility and improved the registration of MMA register types for A and B operands.
      
      * [Perf] Add layout and use_tma to boost performance
      
      * [Lint]
      
      * [Note]
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: Yuqi Dong <134183314+yyttt6@users.noreply.github.com>
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
    • [Language] Support n>256 for v2 (#1182) · b66a93c5
      Lei Wang authored
      * fix
      
      * lint fix
      
      * fix
      
      * lint fix
      
      * fix
      
      * upd
      
      * support n>256
      
      * Remove unnecessary pass configurations for fast math in MHA forward BHSD latency script.
      
      * lint fix
      
      * lint fix
    • [Release] Unify local build scripts to use `cibuildwheel` and reduce size of sdist (#1171) · 354e9aff
      Yichen Yan authored
      * update exclude in sdist
      
      * reuse cibw workflow in maint
      
      * update
      
      * fix
      
      * fmt
      
      * upload artifacts for [Release] PRs
      
      * dot-prefix version file
      
      * update
    • [Feat] Add swap like grammar in tuple assignment (#1185) · 055f8500
      Kurisu authored
      * [Feat] add 2 phase binding to allow swap two var
      
      * Minor update tvm dtype constructor
      
      * fix lint error
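The two-phase binding that makes swap-style tuple assignment work can be sketched in plain Python. `bind_two_phase` is a hypothetical helper operating on a dictionary environment; it illustrates the idea only and is not TileLang's parser code.

```python
def bind_two_phase(env, targets, value_exprs):
    """Assign several targets at once without reads seeing partial writes.

    `value_exprs` are callables that read from the environment.
    """
    # Phase 1: evaluate every right-hand side against the OLD environment.
    values = [expr(env) for expr in value_exprs]
    # Phase 2: only now bind the targets, so `a, b = b, a` swaps correctly.
    for name, value in zip(targets, values):
        env[name] = value
    return env
```

Binding one target at a time instead would assign `a = b` first and then read the already overwritten `a`, leaving both names with the same value.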
  8. 04 Nov, 2025 4 commits
    • [Refactor] Improve Python3.9 compatibility for ParamSpec and Self (#1190) · 7d961892
      Lei Wang authored
      * [Feature] Enhance fill operation to support various buffer types
      
      - Added support for `BufferLoad` in the `fill` function to handle different buffer types.
      - Updated `Fill` class to process region descriptors and buffer regions, improving flexibility in buffer handling.
      - Introduced checks for static bounds in region definitions to ensure safety during operations.
      - Refactored loop induction variable handling in `FillNode` to accommodate sliced regions.
      
      * lint fix
      
      * [Refactor] Improve Python compatibility for ParamSpec and Self
      
      - Added compatibility handling for ParamSpec and Self to support Python versions below 3.10 and 3.11 respectively.
      - Updated type annotations across multiple files to ensure consistent usage of typing features.
      
      * [Update] Require Python 3.9 and enhance type annotations
      
      - Updated the minimum required Python version from 3.8 to 3.9 in `pyproject.toml`.
      - Removed references to Python 3.8 in classifiers.
      - Changed type annotations from `int | None` to `Optional[int]` in multiple example files for better clarity and compatibility.
      - Improved import statements to use `collections.abc` for `Iterable` and `contextlib` for `AbstractContextManager` in relevant files.
      
      * [Refactor] Update import statements to enhance type annotations
      
      - Replaced imports from `typing` with `collections.abc` for `Iterable` and `Mapping` in relevant files to improve compatibility and clarity.
      - Updated the caching decorator from `functools.lru_cache` to `functools.cache` for better performance in the C++ compiler retrieval function.
      - Adjusted import statements in the language proxy file to maintain consistency in type annotations.
      
      * disable rocm rs nt test.
      
      * lint fix
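One common way to keep `ParamSpec` (in `typing` from Python 3.10) and `Self` (from Python 3.11) usable on Python 3.9 is sketched below. This shows a general pattern, not necessarily what the commit does; it assumes the `typing_extensions` backport is available on older interpreters, and `passthrough` and `FlagList` are illustrative names.

```python
import functools
from typing import TYPE_CHECKING, Callable, TypeVar

try:  # available in the standard library from Python 3.10 onward
    from typing import ParamSpec
except ImportError:  # Python 3.9: assumes typing_extensions is installed
    from typing_extensions import ParamSpec

if TYPE_CHECKING:  # Self is only needed by type checkers in this sketch
    from typing_extensions import Self

P = ParamSpec("P")
R = TypeVar("R")


def passthrough(fn: Callable[P, R]) -> Callable[P, R]:
    """Decorator that preserves the wrapped function's parameter types."""
    @functools.wraps(fn)
    def wrapper(*args: P.args, **kwargs: P.kwargs) -> R:
        return fn(*args, **kwargs)
    return wrapper


class FlagList:
    def __init__(self) -> None:
        self.flags = []

    def add(self, flag: str) -> "Self":  # string annotation: no runtime import
        self.flags.append(flag)
        return self
```

Quoting the `Self` annotation keeps it from being evaluated at runtime, so only `ParamSpec` needs a real import on every interpreter version.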
    • [Feature] Enhance fill operation to support various buffer types (#1189) · a03df604
      Lei Wang authored
      * [Feature] Enhance fill operation to support various buffer types
      
      - Added support for `BufferLoad` in the `fill` function to handle different buffer types.
      - Updated `Fill` class to process region descriptors and buffer regions, improving flexibility in buffer handling.
      - Introduced checks for static bounds in region definitions to ensure safety during operations.
      - Refactored loop induction variable handling in `FillNode` to accommodate sliced regions.
      
      * lint fix
    • [Fix] Remove unsupported type params (#1186) · 1768cbef
      Kurisu authored
      * [Fix] Remove type params
      
      * fix lint error
      
      * [Fix] fix dtype new error
    • [CI] [pre-commit.ci] autoupdate (#1183) · 778b97dc
      pre-commit-ci[bot] authored
      * [CI] [pre-commit.ci] autoupdate
      
      updates:
      - [github.com/astral-sh/ruff-pre-commit: v0.14.1 → v0.14.3](https://github.com/astral-sh/ruff-pre-commit/compare/v0.14.1...v0.14.3)
      
      * [CI] sync ruff version
      
      ---------
      Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
      Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
  9. 03 Nov, 2025 5 commits
  10. 02 Nov, 2025 4 commits
    • [Language] Add Correctness and performance check scripts for V2 (#1174) · d99853b6
      Lei Wang authored
      * fix
      
      * lint fix
      
      * fix
      
      * lint fix
      
      * fix
      
      * upd
    • [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986) · aef0a6bb
      Lei Wang authored
      
      
      * remove debug print
      
      * pipeline fix
      
      * use the correct buffer access scope
      
      * rs support
      
      * wrap warpgroup_fence_operand
      
      * fix
      
      * fp8 dtype ptx enhance
      
      * mma fix
      
      * TCGEN05 Interface
      
      * tcgen05 support
      
      * rebase
      
      * update
      
      * Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.
      
      * lint fix
      
      * Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.
      
      * wgmma fix
      
      ---------
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
    • Lei Wang's avatar
      c85bb3ac
    • [Refactor]: Change the params in pytest to avoid oom error during ci (#1170) · 13bdcd60
      Yuqi Dong authored
      * [Refactor]: Change the params in pytest to avoid oom error during ci
      
      * format
      
      * fix
      
      * Update test_example_cast.py
      
      * Update parameters in test_example_cast
      
      * Update test_example_flash_attention.py
      
      * update
      
      * format
      
      * fix
      
      * fix
      
      * format
  11. 01 Nov, 2025 1 commit