  1. 03 Nov, 2025 1 commit
    • [Bugfix] Legalize Datatype for mma intrinsic codegen (#1179) · 7c61d31a
      Lei Wang authored
      * fix
      
      * lint fix
      
      * Enhance CUDA code generation by updating register type handling for float data types. Introduced a workaround for TF32 type compatibility and improved the registration of MMA register types for A and B operands.
  2. 02 Nov, 2025 4 commits
    • [Language] Add Correctness and performance check scripts for V2 (#1174) · d99853b6
      Lei Wang authored
      * fix
      
      * lint fix
      
      * fix
      
      * lint fix
      
      * fix
      
      * upd
    • [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986) · aef0a6bb
      Lei Wang authored
      
      
      * remove debug print
      
      * pipeline fix
      
      * use the correct buffer access scope
      
      * rs support
      
      * wrap warpgroup_fence_operand
      
      * fix
      
      * fp8 dtype ptx enhance
      
      * mma fix
      
      * TCGEN05 Interface
      
      * tcgen05 support
      
      * rebase
      
      * update
      
      * Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.
      
      * lint fix
      
      * Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.
      
      * wgmma fix
      
      ---------
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
    • Lei Wang's avatar
      c85bb3ac
    • [Refactor]: Change the params in pytest to avoid oom error during ci (#1170) · 13bdcd60
      Yuqi Dong authored
      * [Refactor]: Change the params in pytest to avoid oom error during ci
      
      * format
      
      * fix
      
      * Update test_example_cast.py
      
      * Update parameters in test_example_cast
      
      * Update test_example_flash_attention.py
      
      * update
      
      * format
      
      * fix
      
      * fix
      
      * format
  3. 01 Nov, 2025 1 commit
  4. 31 Oct, 2025 4 commits
    • [Bugfix] Support 16bits shfl_sync (#1169) · 54d4bd62
      Lei Wang authored
      * Add type-safe warp shuffle helpers for 16-bit float types in common.h
      
      - Introduced generic passthrough functions for warp shuffle operations: `shfl_xor_sync`, `shfl_down_sync`, `shfl_up_sync`, and `shfl_sync`.
      - Added specializations for `cutlass::half_t` and `cutlass::bfloat16_t` to ensure type safety during shuffle operations.
      - Updated `reduce.h` to utilize the new shuffle functions, enhancing code clarity and maintainability.
      
      * lint fix
    • [Bugfix] Enable code lowering with producer‑copy‑only program (#1168) · 7a80b6df
      Lei Wang authored
      * bugfix
      
      * lint fix
      
      * Enhance warp group register allocation to handle missing consumer bodies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns.
      
      * Refactor VisitExpr_ method in inject_tma_barrier.cc for improved readability. Adjusted formatting and spacing for clarity in barrier handling logic.
      
      * Update barrier handling in inject_tma_barrier.cc to accommodate newly appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions.
    • [FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi (#1108) · 10911e28
      Lei Wang authored
      
      
      * 3rdparty tvm bump
      
      * bump tvm into v0.22.0
      
      * lint fix
      
      * rebase tvm
      
      * Update submodule tvm to latest commit 3085bc4
      
      * Refactor: Update configuration retrieval in CopyNode and adjust test registration in tilelang
      
      * test fix
      
      * add requirement
      
      * atomic_fix
      
      * atomic_fix
      
      * phaseout py39
      
      * optimize
      
      * optimize
      
      * lint fix
      
      * do not clean cache
      
      * do not clean cache
      
      * [Minor] Minor update for Python versions and dependencies
      
      * [Lint] fix lint for py39
      
      * [Lint] fix lint for ROCm
      
      * [Build][CI] Sync CI changes from upstream/sdist
      
      * [Lint] fix lint for ROCm
      
      * [Build][CI] Update `repair-wheel-command`
      
      * [Minor] update abi3audit result format
      
      * [Lint] fix lint for ROCm
      
      * [BugFix] fix build
      
      * [Lint] fix lint for ROCm
      
      * [BugFix] set rpath for libtvm and libtvm_runtime
      
      * [Deps] pin apache-tvm-ffi version
      
      * [Build] set Python 3.9 Limited API for Cython target
      
      * [Build] set Python 3.9 Limited API for Cython target
      
      * [Deps] Restore Python 3.8 support
      
      * [Build] use `apache-tvm-ffi`'s `libtvm_ffi`
      
      * [BugFix] use `;` as delimiter for RPATH on macOS
      
      * [BugFix] use `--ignore-missing-dependencies` for `delocate-wheel`
      
      * [Build] support `sccache` if available
      
      * [Build] add CIBW import test
      
      * [Build][CI] enable ccache for CIBW on Linux
      
      * [BugFix] set rpath for libtvm and libtvm_runtime
      
      * Revert "[Build][CI] enable ccache for CIBW on Linux"
      
      This reverts commit cd9ab57bb5ddd2572c60bcbbebde81480a658fd3.
      
      * [CI] fix perfbench bot
      
      * [BugFix] use Python 3.9 to build wheel
      
      * [Minor] update perfbench bot envs
      
      * [BugFix] fix CIBW environment on Linux
      
      * [CI] skip import test on CentOS 7
      
      * [CI] use Python urllib to download file instead of Wget
      
      ---------
      Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
    • [Release] Bump version to v0.1.6.post2 (#1160) · c37621c5
      Lei Wang authored
      * [Release] Update README and VERSION for v0.1.6.post2 compatibility with Python 3.8
      
      * [Enhancement] Update packaging configuration and Docker scripts for multi-architecture support
      
      * Add allowlist for TVM, CUTLASS, and Composable Kernel items in pyproject.toml
      * Enhance docker_local_distribute.sh to support cross-architecture builds using docker buildx
      * Modify pypi.manylinux.Dockerfile to accept TARGETARCH argument for better architecture handling
      
      * [Enhancement] Improve Docker scripts and build process for multi-architecture support
      
      * Update .gitignore to include dist directories
      * Refactor docker_local_distribute.sh for better cross-architecture handling and error management
      * Enhance docker_pypi_distribute.sh to support multi-architecture builds with docker buildx
      * Modify pypi_distribution.sh to clean up additional directories
      * Update pypi.manylinux.Dockerfile for improved environment configuration and architecture handling
      
      * fix
      
      * Remove outdated classifier for Artificial Intelligence from pyproject.toml
      
      * Update pyproject.toml classifiers and modify Docker distribution scripts for clarity
      
      * Add new classifier for Artificial Intelligence in pyproject.toml
      * Rename output directories in docker_local_distribute.sh and docker_pypi_distribute.sh for better context
  5. 29 Oct, 2025 6 commits
  6. 28 Oct, 2025 5 commits
  7. 27 Oct, 2025 9 commits
  8. 25 Oct, 2025 1 commit
  9. 24 Oct, 2025 1 commit
  10. 23 Oct, 2025 4 commits
    • [Feature] Support None type as input for `T.ptr` and `T.Tensor` (#1114) · 50e789dd
      Wenhao Xie authored
      * [Feature] Support None type as input for T.ptr and T.Tensor
      
      * lint
      
      * lint
      
      * lint
      
      * lint fix
    • [Feature] Enhance vectorized conversion support in CUDA codegen (#1095) · a148d62a
      Tong WU authored
      * [Feature] Add vectorized float16 and float32 conversion support in CUDA codegen
      
      * Implemented handling for conversions between float16 and float32 types, specifically for vectorized operations using __half22float2 and __float22half2_rn.
      * Enhanced the existing code to support both directions of conversion based on the lane count.
      * Improved overall type handling in the VisitExpr_ method for better compatibility with TileLang.
      
      * [Feature] Add float32 to float8 conversion support in CUDA codegen
      
      * Implemented handling for conversion from float32 to float8 (E4M3/E5M2) in the VisitExpr_ method.
      * Added vectorized conversion support using __nv_cvt_float2_to_fp8x2 for float2 to fp8x2 transformations.
      * Enhanced type handling for better compatibility with TileLang, particularly for float8 types.
      
      * lint
      
      * fix a bug
      
      * [Enhancement] Support lanes=4 cases and add unit test for vectorized cast
      
      * lint
      
      * [Feature] Refactor bf16 conversion operations and remove legacy compile flags
      
      * lint
    • [Refactor] Improve scalar handling in CopyNode and update loop partition dtype logi (#1111) · 86c8bb46
      Lei Wang authored
      * [Refactor] Improve scalar handling in CopyNode and update loop partition dtype logic
      
      * Refactored CopyNode::MakeSIMTLoop to handle scalar cases more efficiently by moving the scalar check to the end of the function.
      * Updated loop_partition.cc to set a default DataType for thread and vector extents, ensuring compatibility when loop_vars_ is empty.
      
      * lint fix
      
      * remove debug print
    • [Lint] Enable pyupgrade linter in ruff (#963) · f14fb111
      Yichen Yan authored
      * update rules
      
      * ruff check
      
      * other fixes
      
      * fmt
      
      * do not touch examples
      
      * fmt
  11. 22 Oct, 2025 4 commits