- 10 Dec, 2025 1 commit
-
-
Chaofan Lin authored
-
- 08 Dec, 2025 1 commit
-
-
Lei Wang authored
* [Build] Update CMake configuration for tilelang_cython_wrapper installation - Adjusted output directories for the tilelang_cython_wrapper to ensure that development builds place the extension in build/lib. - Updated installation paths to place the extension in tilelang/lib within the wheel, improving organization and avoiding potential conflicts with other modules. - Modified the internal library path exposure in env.py to prevent shadowing of common module names, enhancing compatibility and usability in user projects. * [Build] Standardize output directories for tilelang libraries - Set output directories for both tilelang and tilelang_module libraries to "${CMAKE_BINARY_DIR}/lib" for consistency in development builds. - This change enhances organization and ensures that all build artifacts are located in a unified directory structure.
-
- 28 Nov, 2025 1 commit
-
-
Lei Wang authored
[Enhancement] Improve error handling and assertion messages across runtime and argument binding (#1356) This commit enhances the error handling mechanisms in the runtime by introducing CPU-safe runtime helpers and refining assertion messages in the CodeGenCHost and ArgBinder. It includes structured packed error messages for various conditions, improving clarity in diagnostics. Additionally, the CMake configuration is updated to always include necessary runtime helpers, ensuring consistent error reporting. The changes aim to provide clearer feedback during runtime errors and improve the overall robustness of the argument binding process.
-
- 18 Nov, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Update FFI type handling and simplify argument management * Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity. * Updated function registration in `runtime.cc` to utilize canonical names for better consistency. * Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled. * Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection. * Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity. * [Update] Sync TVM submodule and enhance kernel source handling * Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes. * Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging. * Commented out the main execution call in test files to prevent unintended execution during testing. * Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues. * Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends. * [Refactor] Clean up imports and improve code formatting * Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code. * Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency. * Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality. * Update execution backend options and improve resolution logic - Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target. - Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions. - Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target. - Updated documentation to reflect changes in execution backend options and their defaults. * lint fix * fix * Enhance argument handling in CUDA and HIP runtime modules - Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime. - Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers. - Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks. * lint fix * lint fix * lint fix * lint fix * minor fix * fix * recover check * Refactor argument binding and validation in `arg_binder.cc` - Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers. - Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards. - Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling. - Minor adjustments in test files to streamline kernel execution and improve readability. * lint fix * stride fix * minor fix * fix * lint fix * lint fix * Add CUDA stream access policy window helpers and integrate with L2 persistent cache management - Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage. - Updated runtime files to include new FFI packed functions for managing stream attributes. - Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown. - Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source. * check with symbolic * support null ptr * Update CMakeLists and lower.py for code generation and subproject status - Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support. - Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility. - Marked the TVM subproject as dirty to indicate local modifications. * lint fix * Update comments for clarity in quickstart.py
-
- 12 Nov, 2025 1 commit
-
-
Lei Wang authored
* Add kernel selection option for GEMM v1 in environment settings - Introduced `TILELANG_USE_GEMM_V1` environment variable to control the selection of GEMM version. - Added `use_gemm_v1` method in the `Environment` class to determine if GEMM v1 should be used based on the environment variable. - Updated GEMM function assignment to default to v2, allowing for v1 to be forced via the new environment variable. * bug fix * Add kernel selection option for GEMM in environment settings - Introduced `TILELANG_USE_GEMM_V1` environment variable to allow users to select between GEMM v1 and v2 implementations. - Updated `gemm` function to default to v2 but switch to v1 if the environment variable is set to a truthy value. - Added a method `use_gemm_v1` in the `Environment` class to facilitate this selection based on the environment variable. * Refactor GEMM macro generator to use BufferRegion instead of Buffer - Updated `wgmma` and `wgmma_rs` methods in `TensorCoreIntrinEmitter` to accept `BufferRegion` parameters instead of `Buffer`. - Adjusted related calls in `GemmWGMMA` to ensure compatibility with the new parameter types. - Simplified buffer access logic for better clarity and maintainability. * Refactor GEMM functions to utilize BufferRegion for improved memory handling - Updated `run_gemm`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` functions to set `num_stages` based on block dimensions, enhancing performance for larger matrices. - Simplified calls to GEMM functions by removing redundant parameters and ensuring compatibility with BufferRegion. - Introduced utility functions for converting between Buffer, BufferLoad, and BufferRegion, improving code clarity and maintainability. - Enhanced error handling for full region checks in GEMM operations to ensure correctness in memory access. * Refactor GEMM code for improved readability and consistency - Cleaned up formatting and spacing in GEMM-related files for better readability. - Standardized comments and code structure across various GEMM functions and macros. - Enhanced error messages for clarity in buffer region checks. - Removed redundant lines and improved overall code maintainability. * Update GEMM correctness evaluation and macro generator for improved functionality - Modified `N_VALUES` in `correctness_evaluation_sm70.py` to include only relevant sizes for tests. - Updated test function call in `correctness_evaluation.py` to use `test_gemm_false_true` for better accuracy in testing. - Refactored buffer handling in `mma_sm70_macro_generator.py` to improve clarity and consistency in shared buffer access. - Enhanced `gemm_mma_sm70.py` to ensure full region checks for input and output buffers, improving correctness in GEMM operations. * Refactor GEMM and intrinsic files for improved clarity and functionality - Removed unused variable `A_stride_last` in `mma_sm70_macro_generator.py` to streamline code. - Adjusted function signature formatting in `swizzle.py` for better readability. - Restored the return of `GemmWGMMA` in `__init__.py` for correct GEMM instantiation. - Removed unused variable `B_buf` in `gemm_mma_sm70.py` to enhance code cleanliness. - Improved function signature formatting in `language.py` for consistency. * Enhance GEMM and MMA functionality for FP64 support - Refactored `GemmNode` to streamline the decision-making process for GEMM instruction selection. - Added support for FP64 inputs in the MMA dispatcher, enabling new tensor operations. - Introduced a new layout function for FP64 in `mma_layout.py` to facilitate shared memory storage. - Updated `TensorCoreIntrinEmitter` to handle FP64 data types, including adjustments for micro tile dimensions and loading mechanisms. - Enhanced utility functions to accommodate FP64 index mapping for shared memory operations. * lint fix * Refactor GEMM correctness evaluation and shared memory alignment handling - Reverted the GEMM function call in `correctness_evaluation.py` to the original implementation for consistency. - Added a helper function in `merge_shared_memory_allocations.cc` to streamline the marking of shared variables under alignment scope. - Enhanced the `VisitExpr_` methods to ensure proper handling of shared memory alignment for `BufferLoadNode` and `VarNode` types. - Cleaned up commented-out test code in `correctness_evaluation.py` for better readability. * Enhance GEMM and MMA implementations with region-based memory handling - Updated GEMM and MMA classes to utilize BufferRegion for input and output buffers, improving memory management and supporting strided GEMM operations. - Added checks to ensure full region compliance for input buffers, enhancing correctness in matrix multiplication. - Implemented clear accumulation functionality to reset output buffers before accumulation, ensuring accurate results in GEMM operations. * Refactor test_tilelang_example_deepseek_v32.py to improve import structure and function calls - Updated import statements to directly reference modules instead of individual test functions, enhancing clarity. - Modified function calls to use the new module structure for better organization and maintainability in testing examples. * Enhance OnArrayDeclaration method to handle repeated buffer declarations - Updated the OnArrayDeclaration method to merge metadata for buffers that may appear in multiple Allocate statements, improving robustness against upstream transformations. - Added logic to prefer concrete element data types and record extents when previously unknown, enhancing the handling of buffer declarations. * Add abbreviation for bfloat16 data type in mfma_macro_generator.py - Introduced a new abbreviation "bf16" for the bfloat16 data type in the mfma_macro_generator.py file, enhancing clarity and consistency in data type representation. * Refactor CodeGenTileLangHIP to enhance dtype handling and mfma call generation - Introduced a mapping function to normalize input data types to their corresponding scalar types, improving compatibility with MfmaTraits. - Updated the mfma call generation to utilize the new mapping, streamlining the code and enhancing clarity. - Removed outdated dtype mapping and replaced it with a more flexible approach to support additional data types like FP8. * lint fix * Enhance backend configuration in CMakeLists.txt and improve dtype handling in CodeGenTileLangHIP - Introduced a macro to define backend options for CUDA, ROCM, and Metal, allowing user overrides and caching of settings. - Updated logic to track user-selected backends and conditionally enable defaults based on environment variables. - Refactored dtype handling in CodeGenTileLangHIP to streamline mfma call generation and improve clarity. - Added support for bfloat16 in the mfma_macro_generator.py, enhancing data type representation consistency. * Update bfloat16 handling in CodeGenTileLangHIP and mfma_macro_generator.py - Changed the representation of bfloat16 in CodeGenTileLangHIP from "bfloat16x4" to "bfloat16x4_vec" for improved clarity. - Adjusted the mfma_suffix generation in mfma_macro_generator.py to remove the underscore before "bf16", aligning with HIP intrinsic requirements. * Change logging level from WARNING to DLOG in LegalizeNegativeIndex for non-negative index checks to reduce log verbosity. * Refactor attention sink examples to simplify index calculations - Updated index handling in `example_gqa_sink_bwd_bhsd.py` and `example_mha_sink_bwd_bhsd.py` to eliminate unnecessary local allocations and streamline logic for determining start and end indices. - Improved readability by using direct calculations instead of local variables for index bounds in pipelined loops. * Refactor attention sink examples to streamline index calculations - Simplified index handling in `example_gqa_sink_bwd_bhsd.py`, `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py`, `example_mha_sink_bwd_bhsd.py`, `example_mha_sink_fwd_bhsd_wgmma_pipelined.py`, and `example_mha_sink_fwd_bhsd.py` by removing unnecessary local allocations for start and end indices. - Enhanced readability by directly calculating index bounds for pipelined loops, improving overall code clarity. * lint fix * bugfix * Refactor reduce operation handling in CUDA and Python - Removed outdated shared memory reduction logic from `reduce.cc`. - Introduced fragment allocation and improved buffer handling in `reduce.py` to support shared and fragment scopes. - Updated CUDA header to define a wider accumulator type for better numerical accuracy. - Enhanced error handling for buffer scope validation in the reduction process. * Fix ReduceOpNode to correctly compute AbsMax by using absolute values of inputs * Enhance unit loop handling by refining annotation checks - Updated the condition for identifying effectively empty annotations in unit loops to include cases where only the `pragma_unroll_explicit` hint is present. - Introduced a new method, `IsEffectivelyEmptyAnnotation`, to encapsulate this logic, improving code clarity and maintainability. * clean clode
-
- 10 Nov, 2025 1 commit
-
-
Yichen Yan authored
-
- 06 Nov, 2025 1 commit
-
-
Yichen Yan authored
* Enable ccache for linux cibw, unify ccache settings. * hash cc files to avoid get stuck in some case * Add comments about ccache version * fix wrong gitignore
-
- 31 Oct, 2025 1 commit
-
-
Lei Wang authored
* 3rdparty tvm bump * bump tvm into v0.22.0 * lint fix * rebase tvm * Update submodule tvm to latest commit 3085bc4 * Refactor: Update configuration retrieval in CopyNode and adjust test registration in tilelang * test fix * add requirement * atomic_fix * atomic_fix * phaseout py39 * optimize * optimize * lint fix * do not clean cache * do not clean cache * [Minor] Minor update for Python versions and dependencies * [Lint] fix lint for py39 * [Lint] fix lint for ROCm * [Build][CI] Sync CI changes from upstream/sdist * [Lint] fix lint for ROCm * [Build][CI] Update `repair-wheel-command` * [Minor] update abi3audit result format * [Lint] fix lint for ROCm * [BugFix] fix build * [Lint] fix lint for ROCm * [BugFix] set rpath for libtvm and libtvm_runtime * [Deps] pin apache-tvm-ffi version * [Build] set Python 3.9 Limited API for Cython target * [Build] set Python 3.9 Limited API for Cython target * [Deps] Restore Python 3.8 support * [Build] use `apache-tvm-ffi`'s `libtvm_ffi` * [BugFix] use `;` as delimiter for RPATH on macOS * [BugFix] use `--ignore-missing-dependencies` for `delocate-wheel` * [Build] support `sccache` if available * [Build] add CIBW import test * [Build][CI] enable ccache for CIBW on Linux * [BugFix] set rpath for libtvm and libtvm_runtime * Revert "[Build][CI] enable ccache for CIBW on Linux" This reverts commit cd9ab57bb5ddd2572c60bcbbebde81480a658fd3. * [CI] fix perfbench bot * [BugFix] use Python 3.9 to build wheel * [Minor] update perfbench bot envs * [BugFix] fix CIBW environment on Linux * [CI] skip import test on CentOS 7 * [CI] use Python urllib to download file instead of Wget --------- Co-authored-by:Xuehai Pan <XuehaiPan@pku.edu.cn>
-
- 21 Oct, 2025 1 commit
-
-
Lei Wang authored
* refactor cython wrapper * optimize * fix installations
-
- 17 Oct, 2025 2 commits
- 14 Oct, 2025 1 commit
-
-
Yichen Yan authored
* Load libs from build dir, if present, to support faster rebuild. * typo * upd * refine check * md lint
-
- 13 Oct, 2025 1 commit
-
-
Yichen Yan authored
* cleanup * init * build first wheel that may not work * build cython ext * fix tvm build * use sabi * update rpath to support auditwheel * pass editible build * update ci * fix warnings * do not use ccache in self host runner * test local uv cache * test pip index * update lib search to respect new lib location * fix * update ci * enable cuda by default * update src map * fix * fix * fix * Generate version with backend and git information at build time * copy tvm_cython to wheels * fix tvm lib search * fmt * remove unused * auto detect ccache * add back backend-related files * remove jit cython adaptor to simplify code * fmt * fix ci * ci fix 2 * ci fix 3 * workaround metal * ci fix 4 * fmt * fmt * Revert "ci fix 4" This reverts commit d1de8291c3e40927955f3ad3cf87a75c78813676. * tmp * fix metal * trivial cleanup * add detailed build-time version for cuda * add back mlc * Restore wheel info and other trivial updates * update * fix cuda * upd * fix metal ci * test for ga build * test for nvidia/cuda * test ubuntu 20 * fix * fix * Do not use `uv build` * fix * fix * log toolchain version * merge wheel * update * debug * fix * update * skip rocm * update artifacts each * fix * fix * add mac * fix cache * fix cache * fix cache * reset and add comment * upd * fix git version * update deps * trivial update * use in-tree build dir and install to src to speedup editable build * Revert "use in-tree build dir and install to src to speedup editable build" This reverts commit 6ab87b05c5eed811210136b8dca4fc3677dd51f2. * add build-dir * update docs * remove old scrips * [1/n] cleanup scripts * [Lint]: [pre-commit.ci] auto fixes [...] * fix and update * wait for tvm fix * revert some tmp fix * fix * fix * spell * doc update * test cibuildwheel * fix and test macos on ci * Update .github/workflows/dist.yml Co-authored-by:
Xuehai Pan <XuehaiPan@outlook.com> * fix * test ga event * cleanup * bump tvm to support api3 * test final version * add cron * Update .github/workflows/dist.yml Co-authored-by:
Xuehai Pan <XuehaiPan@outlook.com> * fix * test ccache for metal cibuildwheel * test newer macos * finish --------- Co-authored-by:
pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by:
Xuehai Pan <XuehaiPan@outlook.com>
-
- 10 Oct, 2025 1 commit
-
-
Xuehai Pan authored
* chore: misc cleanup * feat: add pre-commit config * chore: update lint dependencies * style: fix lint issues * feat: add pre-commit hooks * fix: fix typos * chore: update .gitattributes * [Lint]: [pre-commit.ci] auto fixes [...] * docs: update CONTRIBUTING.md * chore: update default venv name * chore: revert and exclude CUDA files --------- Co-authored-by:pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
- 07 Oct, 2025 1 commit
-
-
Yichen Yan authored
* Reset * Fix other CUDA issue * fmt * fmt * fix cuda error * fix * fix * fmt * cleanup * fix * remove copyright * trivial update * readme update * lint fix --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com>
-
- 21 Sep, 2025 1 commit
-
-
Lei Wang authored
* bump version to 0.1.6 * phaseout py38 * py39 * Update submodule 'tvm' to latest commit adc0e48 * [Build] Update CMake and Python environment settings - Added static linking flags for GCC and libstdc++ in CMakeLists.txt to enhance library linking. - Removed the cmake version requirement from pyproject.toml to allow for broader compatibility. - Updated the tox command in the Docker distribution script to include Python 3.8 for testing environments. * [Build] Update Python version requirements in scripts and documentation - Changed Python version requirement in README.md from 3.9+ to 3.8+. - Updated installation and testing scripts to use Python 3.8 instead of 3.9, ensuring compatibility with the new minimum version. - Adjusted tox commands in local and PyPI distribution scripts to include Python 3.8 in the testing environments. * [Build] Update Python and CMake requirements in Dockerfile and pyproject.toml - Added CMake version requirement (>=3.26) to pyproject.toml for build compatibility. - Created a Python 3.8 environment in the Dockerfile and added a symlink for easier access to the Python 3.8 executable. * [Build] Update CMake and Dockerfile for improved compatibility - Removed static linking flags from CMakeLists.txt to simplify build configuration. - Updated Dockerfile to use Ubuntu 20.04 and streamlined the installation of dependencies, removing gcc-9 and g++-9. - Adjusted symlink creation for Python environments to use the `-sf` option for safer linking. * [Build] Bump version to 0.1.6.post1 for post-release updates * [Build] Remove static linking flags from CMakeLists.txt - Eliminated static linking flags for GCC and libstdc++ to simplify build configuration and avoid potential conflicts with Python extensions. * [Build] Update Docker distribution scripts for manylinux compatibility - Changed base image from `tilelang-builder:18.04` to `tilelang-builder:manylinux` in both local and PyPI distribution scripts. - Updated Dockerfile references to use `pypi.manylinux.Dockerfile`. - Added `--gpus all` flag to the Docker run command to enable GPU support during execution. * lint fix * add cmake
-
- 19 Sep, 2025 1 commit
-
-
Lei Wang authored
* bump version to 0.1.6 * phaseout py38 * py39 * Update submodule 'tvm' to latest commit adc0e48 * [Build] Update CMake and Python environment settings - Added static linking flags for GCC and libstdc++ in CMakeLists.txt to enhance library linking. - Removed the cmake version requirement from pyproject.toml to allow for broader compatibility. - Updated the tox command in the Docker distribution script to include Python 3.8 for testing environments. * [Build] Update Python version requirements in scripts and documentation - Changed Python version requirement in README.md from 3.9+ to 3.8+. - Updated installation and testing scripts to use Python 3.8 instead of 3.9, ensuring compatibility with the new minimum version. - Adjusted tox commands in local and PyPI distribution scripts to include Python 3.8 in the testing environments. * [Build] Update Python and CMake requirements in Dockerfile and pyproject.toml - Added CMake version requirement (>=3.26) to pyproject.toml for build compatibility. - Created a Python 3.8 environment in the Dockerfile and added a symlink for easier access to the Python 3.8 executable.
-
- 18 Sep, 2025 1 commit
-
-
Lei Wang authored
* bugfix * [Build] Update build dependencies and Dockerfile configuration - Updated `pyproject.toml` and `requirements-build.txt` to specify Cython version as `Cython>=3.0.0`. - Removed unnecessary dependencies from the build system. - Enhanced `pypi.Dockerfile` to install gcc-9 and g++-9, and added ninja-build for improved build performance. - Updated conda environment creation to include Python 3.9 to 3.12, while removing the Python 3.8 environment. * cmake fix * fix * fix
-
- 17 Sep, 2025 1 commit
-
-
Lei Wang authored
-
- 10 Sep, 2025 1 commit
-
-
Lei Wang authored
* Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability - Removed deprecated prime factorization functions from `gemm.cc` and `gemm_sp.cc`. - Introduced a new `GemmWarpPolicy` class to manage warp policy attributes and methods, improving encapsulation. - Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities. - Enhanced `GetArchInt` function in `utils.cc` for better readability and type safety. - Added new `gemm_v2` function in `gemm.py` for improved GEMM operation with additional parameters and checks. * Refactor GEMM and frontend legalize operations for improved clarity and functionality - Updated `gemm_py.h` to include the correct header for GEMM operations. - Renamed `FrontendLegalizer` class to `LetInliner` and updated related methods to reflect this change, enhancing code clarity. - Modified the pass function from `FrontendLegalize` to `LetInline` for better alignment with its purpose. - Updated test cases to utilize the new `gemm_v2` function and adjusted the testing framework for improved output and clarity. - Removed obsolete test file `test_tilelang_transform_frontend_legalize.py` to streamline the test suite. - Enhanced the `LowerAndLegalize` function to utilize the new `LetInline` pass, improving the overall transformation process. * Enhance CUDA code generation and testing for GEMM operations - Added indentation printing in `codegen_cuda.cc` for improved assembly code formatting. - Updated `test_tilelang_tilelibrary_gemm.py` to include additional GEMM test cases and shared memory allocation with specified scope. - Introduced new `matmul_sr` and `run_gemm_sr` functions for GEMM operations with shared and fragment memory layouts. - Refactored layout inference in `mma_macro_generator.py` to improve clarity and correctness in shared memory handling. - Enhanced `gemm/__init__.py` to support new GEMM operation combinations and layout inference logic. These changes improve the clarity, functionality, and testing coverage of GEMM operations in the TileLang framework. * Refactor GEMM layout and testing for improved clarity and functionality - Updated `gemm_layouts.cc` to enhance the layout generation logic for transposed and non-transposed GEMM operations. - Renamed and modified functions in `test_tilelang_tilelibrary_gemm.py` to reflect changes in GEMM function signatures and improve test coverage. - Introduced new GEMM operation combinations in `gemm/__init__.py` to support additional layouts and configurations. - Enhanced layout inference in `mma_layout.py` and `mma_macro_generator.py` for better handling of shared memory layouts. These changes improve the clarity, functionality, and testing coverage of GEMM operations in the TileLang framework. * Refactor GEMM layout and Python integration for improved functionality - Updated `gemm_layouts.cc` to correct the order of layout replication and repetition for transposed and non-transposed GEMM operations. - Enhanced `gemm_py.cc` to handle block realization more robustly, ensuring correct assignment of global symbols and block attributes. - Refactored `inject_pipeline.cc` to streamline buffer read/write region handling, improving clarity and maintainability. - Cleaned up test cases in `test_tilelang_tilelibrary_gemm.py` by removing unnecessary print statements and adjusting function calls for better test execution flow. These changes enhance the clarity, functionality, and robustness of GEMM operations and their testing in the TileLang framework. * Refactor GEMM layout and testing for improved clarity and functionality - Updated `gemm_layouts.cc` to enhance layout generation logic for transposed and non-transposed GEMM operations. - Improved block realization handling in `gemm_py.cc` for better assignment of global symbols. - Streamlined buffer read/write region handling in `inject_pipeline.cc` for clarity. - Enhanced test cases in `test_tilelang_tilelibrary_gemm.py` by adjusting function calls and adding new GEMM operation combinations. These changes improve the clarity, functionality, and robustness of GEMM operations and their testing in the TileLang framework. * tfloat32 support. * lint fix * lint fix * Refactor shared memory allocation in GEMM tests - Removed unnecessary scope specification in shared memory allocation for matrices A and B in `test_tilelang_tilelibrary_gemm.py`. - This change simplifies the allocation process and aligns with the updated GEMM function signatures.
-
- 02 Sep, 2025 1 commit
-
-
Lei Wang authored
* Fix type hint for target_host parameter in compile function to allow None value * Refactor target handling in compile function to utilize determine_target for improved clarity and consistency * Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code. * Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity. * Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling. * Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits. * Add intrin_rule source files to CMakeLists.txt and implement hrsqrt function for half_t in common.h * lint fix * remove cmake dep in pyproject as it may lead to different cmake paths in diff stages * lint fix * Add cmake dependency to pyproject.toml and improve build logging in setup.py
-
- 30 Jul, 2025 1 commit
-
-
Siyuan Feng authored
**Summarize part of the rebase pr:** 1. **Support T.thread_return() → CUDA return syntax** Added support for translating `T.thread_return()` to CUDA's native `return` statement. 2. **Dynamic type support for function inputs** Functions now accept dynamically typed parameters using `typing`: ```python dyn_type = T.int32 or T.float @T.prim_func def main( a: dyn_type, ) ``` 3. **Device Function Codegen** Added support for generating `__device__` functions in CUDA: ```python @I.ir_module class Module: @T.prim_func(private=True) def add(a: T.int32, b: T.int32) -> T.int32: return a + b @T.prim_func def main( A: T.Buffer((128, 128), "int32"), B: T.Buffer((128, 128), "int32"), C: T.Buffer((128, 128), "int32"), ): T.func_attr({"global_symbol": "main"}) length: T.int32 = Module.add(64, 64) # Host call for bx in T.thread_binding(length, "blockIdx.x"): for tx in T.thread_binding(length, "threadIdx.x"): C[bx, tx] = Module.add(A[bx, tx], B[bx, tx]) # Device call ``` After compilation, `add` becomes a CUDA `__device__` function. 4. **Cython-based Python/C++ interop** Replaced ctypes with Cython for all Python/C++ interactions: - Python → C++ calls - C++ → Cython calls This improves performance by around 100x and reduces CPU overhead during compile/runtime. 5. **FP8 data type standardization** Migrated `e5m2_float8` and similar types to Torch-standardized variants`float8_e5m2` and etc. * Refactor CMakeLists.txt to set default build type and manage dependencies for tvm_cython modules * Update default value of `check_well_formed` parameter in `prim_func` to False for improved flexibility in TIR function parsing. * Add StorageRewrite function to transform module Introduced the StorageRewrite function in the tilelang.transform module, which returns a TVM transform pass. This addition enhances the functionality of the module by providing a new transformation option for users. * Refactor null option handling in IR and layout inference - Updated instances of `NullOpt` to `std::nullopt` in `ir.cc` and `parallel.cc` for consistency with modern C++ practices. - Enhanced layout inference logic in `layout_inference.cc` to improve type safety by replacing `as<Fragment>().get()` with `as<FragmentNode>()`. - Adjusted error handling in `multi_version_buffer_rewriter.cc` and `persist_threadblock.cc` to use more concise null checks. - Cleaned up test files by commenting out `tilelang.testing.main()` and replacing it with specific test function calls for better clarity. - Removed unused test file `test_tilelang_kernel_deepseek_nsa.py` to streamline the testing suite. * Update TVM subproject and refactor cluster planning and tile operation handling - Updated the TVM subproject to a dirty commit state. - Refactored copyright headers in `cluster_planning.cc` to reflect the new licensing. - Enhanced error handling in `lower_tile_op.cc` to check for missing padding map annotations. - Modified test files to improve clarity and functionality, including adjustments to kernel compilation and test assertions. - Updated various test cases to ensure proper handling of annotations and configurations in the TileLang testing framework. * Update annotation type in warp specialized test for consistency - Changed the annotation type in the `test_warp_specialized` function from a literal integer to `T.int32(3)` for improved type safety and consistency with the TileLang framework. * Refactor test execution in warp specialized test - Replaced the direct call to `test_warp_specialized()` with `tilelang.testing.main()` in the test file to standardize test execution and improve integration with the TileLang testing framework. * refactor * [Enhancement] Add strict layout map for improved buffer layout inference (#594) - Introduced a `strict_layout_map` to enhance layout inference by ensuring that buffers with strict layout requirements are properly accounted for during the inference process. - Updated the inference logic to check for the presence of buffers in the `strict_layout_map` before applying layout changes, improving the accuracy of layout assignments. - Refactored the layout inference steps to include the copying of layouts into the new strict map, ensuring a clear separation of layout handling based on inference levels. * [Example] Update examples to use @tilelang.jit (#597) * [Example] Update kernel compilation in examples to use @tilelang.jit - Refactored multiple examples to eliminate the use of `tilelang.compile` for kernel creation, directly invoking the functions instead. - Added `@tilelang.jit` decorators with appropriate output indices to enhance performance and maintainability. - Improved code clarity by simplifying the kernel invocation process across various examples, ensuring consistency in how kernels are defined and executed. * format * Update example_tilelang_sparse_gqa_decode_varlen_indice.py * Update example_dequant_gemm_fine_grained.py * Update example_gemm_autotune.py --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com> * [Enhancement] Refine error messaging in LowerBulkCopy for global and shared range checks (#599) * [Enhancement] Improve error messaging for global and shared range legality checks in LowerBulkCopy - Updated error messages in the LowerBulkCopy function to provide clearer context when global and shared ranges are illegal. - Enhanced the readability of the error output by including tensor names, improving debugging and validation processes during bulk copy operations. * [Enhancement] Refine error messaging in LowerBulkCopy for global and shared range checks - Improved the clarity of error messages in the LowerBulkCopy function by enhancing the output format. - Included additional context in error messages to aid debugging when global and shared ranges are found to be illegal, ensuring better traceability during bulk copy operations. * [Enhancement] Introduce PassConfig `TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE` to enable aggressive shared memory reuse (#602) * [Enhancement] Add aggressive shared memory merge option in memory allocation - Introduced a new configuration option `tl.enable_aggressive_shared_memory_merge` to enable aggressive merging of shared memory allocations. - Updated the `SharedMemLinearAccessPatternFinder` class to support an aggressive merge strategy, allowing for improved memory reuse. - Modified the `MergeSharedMemoryAllocations` function to incorporate the new merging strategy based on the configuration. - Enhanced the `PassConfigKey` enumeration to include the new aggressive merge option, ensuring it can be configured appropriately. * lint fix * [Enhancement] Add aggressive shared memory merge configuration option - Introduced a new configuration option `kEnableAggressiveSharedMemoryMerge` to enable aggressive merging of shared memory allocations, enhancing memory management capabilities. * [Enhancement] Update MergeSharedMemoryAllocations to support aggressive merge option - Modified the `MergeSharedMemoryAllocations` function to accept an `enable_aggressive_merge` parameter, allowing for more flexible memory management. - Introduced a new helper function `should_enable_aggressive_merge` to determine the aggressive merge configuration based on the pass context and target. - Updated the relevant calls in the `phase.py` and `__init__.py` files to utilize the new aggressive merge functionality, enhancing the overall memory allocation strategy. * [Refactor] Update accumulation handling in gemm_sm90.h (#603) - Replaced the use of `tiled_mma.accumulate_ = GMMA::ScaleOut::Zero` with a call to `clear(acc)` for better clarity and maintainability in the accumulation logic. - This change enhances the readability of the code by standardizing the approach to clearing accumulation values across multiple sections of the file. * [Enhancement] Add tma bulk copy. (#600) * [Bugfix] Fixed mha_bwd shape inconsistency error (#604) * lint fix * Update requirements-lint.txt to maintain clang-format version consistency * [Bugfix] Avoid duplicate data access when cross thread buffer meet replicate register (#606) * [Enhancement] Improve debug output formatting in layout and fragment nodes - Updated the `DebugOutput` methods in `LayoutNode` and `FragmentNode` to provide more structured and informative output, including transformation details and thread range information. - Enhanced layout inference logic in `ParallelOp` to add predicates for cross-thread shared memory access, improving layout handling in parallel operations. - Minor adjustment in `layout_inference.cc` to ensure clarity in parallel loop handling. * lint fix * [Enhancement] Support tf32 gemm_rs (#607) - Added a line break in `quickstart.py` for better readability. - Simplified the JIT kernel compilation in `quickstart.py` by removing the unused execution backend option. - Modified `example_elementwise_add.py` to disable cache for `tilelang` and optimized the element-wise addition kernel by utilizing shared memory for input tensors, improving performance. - Updated default values for matrix dimensions and block sizes in the argument parser to enhance usability. * [Enhancement] Introduce option `TL_DISABLE_FAST_MATH` and `TL_ENABLE_PTXAS_VERBOSE_OUTPUT` (#609) * [Enhancement] Introduce new PassConfig options for fast math and PTXAS verbosity - Added `kDisableFastMath` and `kEnablePTXASVerboseOutput` configuration options to enhance control over compilation settings. - Updated `LibraryGenerator` to utilize these new pass configurations, allowing for more flexible compilation behavior based on user preferences. - Enhanced `PassConfigKey` enumeration to include the new options, ensuring they can be configured appropriately in the pass context. * [Refactor] Update PTXAS verbosity configuration key in LibraryGenerator - Changed the configuration key for PTXAS verbosity from `TL_VERBOSE_PTXAS_OUTPUT` to `TL_ENABLE_PTXAS_VERBOSE_OUTPUT` to align with the new naming convention introduced in recent enhancements. - This update ensures consistency in the configuration options used within the `LibraryGenerator` class, improving clarity and maintainability of the code. * lint fix * fix build * [Experimental][Language] add `T.GEMM_SP` for sm90 sparse tensor core (#526) * [experimental] add a draft gemm_sp * [3rdparty] bump cutlass to v3.9.3 * [lint] run format.sh * [chore] rebase * [chore] use abs path * [gemm_sp] add metadata layout * [ci] add more example * [lint] run format.sh * [chore] polish * [chore] move gemm_sp to experimental * [chore] polish * [lint] run format.sh * [Enhancement] Improve bulk copy handling and update GEMM sparse tensor test * Added a warning log for unsupported non-swizzled global layouts in the bulk copy operation, ensuring fallback to normal copy. * Refactored the GEMM sparse tensor test by removing unnecessary imports and simplifying the kernel compilation process. * Updated the test to directly call the `run_gemm_sp` function, enhancing clarity and functionality. * Implement Test * [Enhancement] Update GEMM SP and SM89 templates for improved functionality * Refactored GEMM SP computation to enhance warp partitioning logic, ensuring compatibility with Hopper architecture. * Updated layout inference to support new WGMMA conditions and improved error messaging for unsupported targets. * Modified SM89 templates to utilize new MMA atom structures, enhancing performance and compatibility with fp8 types. * Added conditional inclusion for GEMM SP header based on CUDA architecture version. * lint fix * [gemm_sp] support more layout and data types * Enhancement: sync T.gemm_sp's layout inference with T.gemm * Enhancement: support more block_k in compress util * [Enhancement] enable block_k=64 * [Lint] run format.sh * [Enhancement] compressor support more dtype * Enhancement: enable block_K=32 * [Lint] format.sh * [Fixbug] fix shape * Refactor: sync gemm * [Enhancement] enable transpose * [Enhancement] enable fp8_e4m3 * [Enhancement] enable int8 * [Lint] run format.sh * [Benchmark] add gemm_sp benchmark * [Example] fix 256 threads hang * [CI] fix ci * [Chore] resolve gemini feedback * [Benchmark] increase search space * [Lint] format * [CI] skip sparse tensor core related tests as only sm90 is supported * [CI] pass local run * Update gemm_sm89.h * lint fix * lint fix * [Enhancement] Add support for sparse GEMM and initialize CUDA architecture flags - Introduced a new boolean flag `enable_sparse_gemm_` to control the inclusion of sparse GEMM functionality in CUDA code generation. - Updated the `Finish` method to conditionally include the sparse GEMM header based on the new flag. - Implemented logic in `VisitStmt_` to enable sparse GEMM when the corresponding external call is detected. - Added a function to initialize the `TORCH_CUDA_ARCH_LIST` environment variable based on the target compute version, enhancing compatibility with PyTorch. - Refactored the initialization function into the appropriate module and ensured it is called in the sparse utilities module. * Update test_compress_utils.py --------- Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com> Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> * [Doc] Phaseout Legacy documentations (#610) - Added a new entry in the README for the introduction of `T.gemm_sp` supporting 2:4 sparse tensor core. - Removed several outdated documentation files related to convolution, flash attention, and other tutorials to streamline the documentation structure. * [Refactor] Phaseout Pass ParallelLoopTransformer (#611) * Refactor layout inference by removing the ParallelLoopTransformer class. Updated layout inference logic to streamline buffer access collection and condition handling in parallel loops. This change simplifies the code structure and enhances maintainability. * Update MHA backward test cases to use reduced dimensions for batch size and context length * fix build * [Enhancement] Update ReduceOp initialization values for integer types (#614) * [Enhancement] Update ReduceOp initialization values for integer types - Modified the `MakeInitValue` method in `ReduceOp` to handle integer data types correctly by returning appropriate minimum and maximum values based on the bit width. - Added checks for integer types to ensure correct initialization for `kMax` and `kMin` reduction types, enhancing the robustness of the reduction operations. * [Enhancement] Update ReduceOp to handle unsigned integer initialization values - Enhanced the `MakeInitValue` method in `ReduceOp` to include support for unsigned integer data types. - Added conditions to return appropriate initialization values for `kMax` and `kMin` reduction types based on the data type, improving the robustness of reduction operations. * Bump transformers from 4.50.0 to 4.51.0 in /examples/bitnet-1.58b (#615) Bumps [transformers](https://github.com/huggingface/transformers) from 4.50.0 to 4.51.0. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](https://github.com/huggingface/transformers/compare/v4.50.0...v4.51.0 ) --- updated-dependencies: - dependency-name: transformers dependency-version: 4.51.0 dependency-type: direct:production ... Signed-off-by:
dependabot[bot] <support@github.com> Co-authored-by:
dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * [Refactor] refactor autotune examples (#617) * [Refactor] Update tilelang kernel functions and remove unused imports - Refactored the `flashattn_fwd`, `flashattn_bwd_preprocess`, and `flashattn_bwd_postprocess` functions to utilize direct kernel calls instead of cached versions, improving clarity and performance. - Added `@tilelang.jit` decorators with specified output indices to enhance kernel compilation. - Removed unused import of `cached` from `tilelang`, streamlining the code. - Commented out the main testing function call in `test_tilelang_kernel_mha_bwd.py` for potential future use. * [Refactor] Simplify configuration generation in benchmark and example scripts - Refactored the `get_configs` functions in multiple benchmark and example scripts to utilize a dictionary-based approach for parameter configuration, improving readability and maintainability. - Updated the `flashattn` and `chunk_scan_fwd` functions to directly accept configuration parameters, enhancing flexibility in kernel tuning. - Removed redundant code and streamlined the configuration generation process across various files, ensuring consistency in how configurations are defined and utilized. * [Refactor] Update configuration handling in benchmark scripts - Refactored the `get_configs` functions in benchmark scripts to accept a variable argument list, improving flexibility in configuration management. - Enhanced the `matmul` and `flashattn` functions to utilize the updated configuration approach, streamlining parameter handling for kernel tuning. - Added `@autotune` decorators to relevant functions, ensuring consistent autotuning behavior across benchmarks. - Cleaned up redundant code and improved overall readability in the affected files. * [Refactor] Clean up formatting and update subproject commit - Updated the subproject commit reference in the TVM directory to indicate a dirty state. - Removed unnecessary blank lines and improved formatting in the `benchmark_matmul` and `benchmark_matmul_fp8` scripts for better readability. - Streamlined the function definitions in the `flashattn` example script to enhance clarity and maintainability. * [Refactor] Update AutoTuner configuration handling - Modified the AutoTuner class to check if kernel parameters are set before processing tunable arguments, improving robustness in configuration handling. - Enhanced the logic for skipping compilation when tunable parameters are already provided, ensuring efficient use of resources. - Updated comments for clarity and maintainability. * lint fix * Update TVM subproject commit to indicate dirty state and modify MHA backward test cases - Updated the subproject commit reference in the TVM directory to reflect a dirty state. - Adjusted the `test_mha_bwd` function to use a new configuration for the MHA backward tests, changing the context size from 128 to 256. - Uncommented the main testing function call for potential execution. * lint fix * Bump transformers from 4.51.0 to 4.52.1 in /examples/bitnet-1.58b (#619) Bumps [transformers](https://github.com/huggingface/transformers) from 4.51.0 to 4.52.1. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](https://github.com/huggingface/transformers/compare/v4.51.0...v4.52.1 ) --- updated-dependencies: - dependency-name: transformers dependency-version: 4.52.1 dependency-type: direct:production ... Signed-off-by:
dependabot[bot] <support@github.com> Co-authored-by:
dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * Fix PTXAS options flag in LibraryGenerator for consistency (#620) * Refactor FP8 type handling across multiple files to standardize usage of "float8_e4m3" and "float8_e5m2" instead of "e4m3_float8" and "e5m2_float8". This includes updates in benchmarks, examples, tests, and internal utilities. * [Refactor] Add parallel loop transform pass for condition extraction (#618) * [Refactor] Add parallel loop transform * done format check * pull 3rdparty repo * Refactor loop variable handling in transformation utilities - Updated the logic in `loop_parallel_transform_utils.h` to simplify the handling of related loop variables. - Removed the check that enforced a single related loop variable, replacing it with a return statement when multiple variables are detected, enhancing clarity and maintainability of the transformation process. * Update loop_parallel_transform_utils.h * Refactor loop variable handling in transformation utilities - Enhanced the logic in `loop_parallel_transform_utils.h` to improve clarity and maintainability by simplifying the handling of related loop variables. - Replaced the previous enforcement of a single related loop variable with a return statement for multiple variables detected. * remove disable cache flag as commit id has been key component * lint fix --------- Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com> Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> * [Dev] Update linear attention examples to enhance performance on Hopper GPUs (#621) * Tune linear attention examples on H100 * Add retnet fwd kernel * fix lint * [Enhancement] Add ahead of time cython compilation in setup.py (#622) * [Enhancement] Add Cython support and compiler detection in setup.py - Introduced a new `CythonExtension` class for building Cython-based extensions, enhancing the build process for Cython projects. - Implemented functions to detect the Cython compiler and C++ compiler, improving compatibility and user experience. - Updated the build process to handle Cython extensions alongside CMake extensions, ensuring a seamless integration for users. - Added caching mechanisms for Cython compilation to optimize build times and reduce unnecessary recompilation. * [Enhancement] Add Cython dependency and enable CMake extension building - Added Cython as a required dependency in `pyproject.toml` to support Cython-based extensions. - Updated `setup.py` to enable building CMake extensions, improving the build process for projects utilizing both Cython and CMake. - Modified the Cython compiler detection logic to streamline installation instructions for users. * [Enhancement] Support more flexible layout host pythonic expr (#623) * [Refactor] Enhance expression handling in utils.py and update wrapper to use pythonic_expr - Added support for additional TIR expressions (FloorDiv, Min, Max, Add, Sub, FloorMod) in the pythonic_expr function to improve string representation. - Replaced the deprecated legalize_c function calls in TLCUDASourceWrapper and TLCPUSourceWrapper with pythonic_expr for better expression handling in kernel launch code. * [Refactor] Simplify expression handling in pythonic_expr function - Consolidated binary and min/max operation handling in the pythonic_expr function to improve readability and maintainability. - Replaced individual checks for binary operations with a mapping approach, streamlining the code and enhancing performance in expression representation. * [Enhancement] Improve expression representation in pythonic_expr function - Added operator precedence handling to the pythonic_expr function, enhancing the conversion of TVM PrimExpr to Python-style strings. - Updated the visitor logic to intelligently add parentheses based on operator precedence, improving the accuracy of expression representation. - Included a docstring for better clarity on the function's purpose and usage. * test fix * [Enhancement] support composable expression for shape with symbolic vars (#624) * [Refactor] Enhance expression handling in utils.py and update wrapper to use pythonic_expr - Added support for additional TIR expressions (FloorDiv, Min, Max, Add, Sub, FloorMod) in the pythonic_expr function to improve string representation. - Replaced the deprecated legalize_c function calls in TLCUDASourceWrapper and TLCPUSourceWrapper with pythonic_expr for better expression handling in kernel launch code. * [Refactor] Simplify expression handling in pythonic_expr function - Consolidated binary and min/max operation handling in the pythonic_expr function to improve readability and maintainability. - Replaced individual checks for binary operations with a mapping approach, streamlining the code and enhancing performance in expression representation. * [Enhancement] Improve expression representation in pythonic_expr function - Added operator precedence handling to the pythonic_expr function, enhancing the conversion of TVM PrimExpr to Python-style strings. - Updated the visitor logic to intelligently add parentheses based on operator precedence, improving the accuracy of expression representation. - Included a docstring for better clarity on the function's purpose and usage. * test fix * minor update *
🐍 Fix the file name "test_exmaple_tilelang_nsa" (#629) * [Enhancement] Add CPU utilization and count settings for Auto-Tuning (#630) * [Enhancement] Add CPU utilization and count settings for Auto-Tuning - Introduced environment variables for CPU utilization, counts, and maximum CPU count for auto-tuning. - Updated the AutoTuner class to utilize these new settings, improving flexibility and performance in multi-threaded environments. - Enhanced logging to provide better insights into the auto-tuning process based on the configured CPU settings. * typo fix * [AutoTune] Support `with set_autotune_inputs` to set auto tuning input tensors (#632) * [Refactor] Simplify and modularize autotuner implementation - Removed unused imports and extensive code sections from the autotuner module to enhance readability and maintainability. - Modularized the code by introducing new imports for autotuning and capturing functionalities, streamlining the overall structure. - Improved logging setup and removed redundant timeout handling functions, focusing on core autotuning logic. - Updated the AutoTuner class to better utilize the new modular structure, ensuring efficient performance during auto-tuning processes. * [Refactor] Clean up and enhance capture and tuner modules - Improved code readability by removing unnecessary blank lines and organizing imports in `capture.py` and `tuner.py`. - Enhanced logging in the `AutoTuner` class to provide clearer warnings regarding the usage of `supply_prog` in the context of auto-tuning. - Streamlined the `CaptureStack` class for better thread-local context management. * lint fix * [Refactor] Simplify configuration and autotuning logic in blocksparse GEMM example - Updated `get_configs` function to reduce the number of configurations, enhancing performance and clarity. - Removed the `get_best_config` function, integrating its logic directly into the `blocksparse_matmul` function with the `@autotune` decorator for streamlined autotuning. - Adjusted the main function to directly utilize the autotuned kernel, simplifying the overall structure and improving readability. - Deleted obsolete test file for autotuning decorator, cleaning up the codebase. * [Refactor] Improve code formatting and readability in autotune test file - Reformatted the `matmul` function and `get_configs` function for better readability by adjusting line breaks and indentation. - Fixed a typo in the `enable_rasteration` parameter name to ensure consistency. - Cleaned up unnecessary blank lines to enhance overall code clarity. * Update example_blocksparse_gemm.py * Update capture.py * [Pass] Introduce flag to diable cp async lowering (#633) * [Enhancement] Update PipelinePlanner to support async copy configuration - Modified the `Substitute` method in `PipelinePlanner` to accept a `use_async_copy` parameter, allowing for more flexible pipeline planning based on async copy requirements. - Updated the constructor of `PipelinePlanner` to initialize the `use_async_copy_` member variable. - Adjusted the logic in the pipeline planning process to conditionally apply async copy annotations based on the new parameter. - Commented out the `LoopVectorizeDynamic` call in `LowerAndLegalize` to prevent unintended modifications during the legalizing phase. * Refactor PipelinePlanning function for improved readability - Adjusted the formatting of the `use_async_copy` variable assignment in the `PipelinePlanning` function to enhance code clarity and maintainability. * fix typo (#635) * [Pass][Simplify] Introduce symbolic level simplify for condition expression (#634) * [Enhancement] Add argument simplification option to StmtSimplifier - Introduced a new `simplify_arguments` flag in the `StmtSimplifier::Apply` method to control argument simplification behavior. - Updated the `Simplify` function to accept the new flag, allowing for enhanced flexibility in the simplification process. - Adjusted the `LowerAndLegalize` and `_Simplify` functions to utilize the new argument, ensuring consistent behavior across the codebase. - Added comments to clarify the purpose of the new flag and its impact on simplification logic. * lint fix * [Enhancement] Improve layout inference and reduce operation handling - Updated `ParallelOp::InferLayout` to check for pure buffer stores, enhancing layout inference logic. - Modified `ReduceOp::Lower` to include all threads in the AllReduce operation, improving performance on specific architectures. - Added a TODO comment in `AllReduce` to consider merging synchronization barriers for optimization. * lint fix * [Enhancement] Add input validation for GEMM parameters - Introduced checks to ensure that the dimensions M and N are divisible by their respective warp sizes (kMPerWarp and kNPerWarp) in the Gemm::ComputeWarpPartition method. - Added informative error messages to assist in debugging when the input parameters do not meet the required conditions. * bug fix * Enhance test coverage by adding LLVM requirement decorator to multiple function call tests. This ensures that tests for argument count, type code, null data pointer, and dimensionality checks are only executed when LLVM is available, improving test reliability and clarity. * lint fix * Fix software pipeline stage annotation and update optional config handling in StmtSimplifier * Add Python executable detection in CMake configuration and update TVM submodule reference. Remove unused vectorization tests for improved clarity. * Update TVM submodule reference and refactor FFI registration to use static initialization blocks for improved organization and clarity. * Refactor attribute handling in layout and IR nodes to use reflection registration. This change replaces the VisitAttrs method with a RegisterReflection method for improved clarity and organization across multiple classes, including KernelLaunchFrameNode, WarpSpecializeFrameNode, LayoutNode, FragmentNode, and SwizzledLayoutNode. * finish rebase * tvm update * Refactor FFI registration across tilelang modules to use the updated `tvm.ffi` namespace. This includes changes in various files to replace `tvm._ffi` with `tvm.ffi`, enhancing consistency and clarity in the codebase. * lint fix * Update TVM submodule reference and modify CUDA runtime argument handling to use the new runtime constants for improved clarity and consistency. * lint fix * Refactor tensor data type references from "e4m3_float8" and "e5m2_float8" to "float8_e4m3" and "float8_e5m2" across multiple files for consistency and clarity. * lint fix * Refactor forward_index initialization in Fragment class to default to an empty array instead of None, ensuring consistent handling of optional outputs. * test fix * lint fix * bugfix * lint fix * reduce fix * lint fix * carver fix * cast fix * Update submodule and enhance kernel launch functionality with optional block size parameter; add device kernel launch transformation. * lint fix * bugfix * Refactor test execution in test_tilelang_cpu_gemm.py and enhance device call checks in lower.py to exclude C packed functions from kernel launch conditions. * lint fix * Update runtime.cc * phase out lisence * Update subproject commit for TVM to 555cc71 * Update subproject commit for TVM to d39953fa * Update subproject commit for TVM to 9574805f * Update subproject commit for TVM to a08b7c3 * fix ci * ci fix --------- Signed-off-by:dependabot[bot] <support@github.com> Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com> Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
Cunxiao Ni <85601223+Cunxiao2002@users.noreply.github.com> Co-authored-by:
Yuxi Chi <cherichy@outlook.com> Co-authored-by:
Nathan Chen <120630832+Nathancgy@users.noreply.github.com> Co-authored-by:
botbw <wang1570@e.ntu.edu.sg> Co-authored-by:
dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by:
xs-keju <93414213+xs-keju@users.noreply.github.com> Co-authored-by:
Tong WU <109033598+Rachmanino@users.noreply.github.com> Co-authored-by:
Kadir Nar <kadir.nar@hotmail.com> Co-authored-by:
Yuqing Xia <35415939+xiayuqing0622@users.noreply.github.com> Co-authored-by:
xwhzz <wh.xie@outlook.com>
-
- 16 Jul, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Improve memory access condition checks in GlobalMemChecker - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations. - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis. * lintfix * [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy - Implemented checks to ensure that the shared memory range and global range are legal during the bulk copy operation. - Added assertions to validate that the extents of global and shared ranges match, improving the robustness of memory access validation in the LowerBulkCopy function. * [Refactor] Update barrier and clear operations in warp specialization examples - Replaced `mbarrier_wait_parity` and `mbarrier_arrive` with `barrier_wait` and `barrier_arrive` for improved clarity and consistency in synchronization. - Adjusted the order of `clear` operations for local fragments in `example_warp_specialize_gemm_copy_1_gemm_0` to enhance parallel execution efficiency. * [Enhancement] Implement thread partial synchronization and improve shared memory allocation handling - Added support for thread partial barrier synchronization in CUDA, allowing for more flexible thread management. - Enhanced the `MergeSharedMemoryAllocations` function to accept alignment bytes, improving memory allocation efficiency based on target requirements. - Updated the `Lower` methods in `Copy` and `Fill` classes to include conditional predicates for thread execution, ensuring better control over thread behavior. - Refactored the `print` function to include warp group and warp IDs for more detailed debugging output. - Improved the handling of dynamic shared memory allocations in the `LowerAndLegalize` function to align with target-specific requirements. * [Enhancement] Add support for disabling TMA in Copy operations - Introduced a new `disable_tma` parameter in the `Copy` class to control thread memory access behavior. - Updated the `Lower` method to conditionally execute bulk copy operations based on the `disable_tma` flag. - Enhanced the `copy` function to accept the `disable_tma` argument, allowing for more flexible memory copy operations. - Improved handling of `coalesced_width` to ensure it defaults to -1 when not provided, enhancing robustness in memory operations. * [Refactor] Clean up whitespace and formatting in multiple files - Removed unnecessary blank lines and adjusted line breaks for improved code readability in `example_mla_decode.py`, `example_warp_specialize_gemm_copy_gemm_0_1.py`, `phase.py`, and `copy.py`. - Ensured consistent formatting across functions to enhance maintainability and clarity of the codebase. * [Enhancement] Refactor flash attention implementation for improved performance and configurability - Split the shared memory allocations for query and key-value pairs to optimize memory usage. - Introduced command-line arguments for batch size, number of heads, and dimensions, enhancing flexibility in running the example. - Updated kernel execution parameters to improve thread management and synchronization. - Enhanced the overall structure of the flash attention function for better readability and maintainability. * fix * Update layout inference in ParallelOp to account for thread bounds; remove debug print in OptimizeForTarget * Refactor barrier handling and update example configurations - Replaced commented-out barrier creation with new barrier allocation in GEMM example. - Updated kernel configuration in warp specialization example to include async copy settings. - Enhanced barrier management in the phase optimization process to improve synchronization handling. - Introduced new barrier allocation function for better memory management in shared contexts. * Refactor barrier handling in LowerAndLegalize and OptimizeForTarget - Reintroduced barrier lowering in OptimizeForTarget to enhance synchronization. - Removed commented-out barrier lowering in LowerAndLegalize for cleaner code. - Added exit() call in OptimizeForTarget to halt execution after barrier lowering. * Enhance CMake configuration and clean up example scripts - Enabled compile command export in CMakeLists.txt for better build integration. - Removed unnecessary print statement in the warp specialization example. - Cleaned up commented-out code in GEMM example for improved readability. - Updated barrier handling in shared memory allocation transformations for better synchronization. * Refactor barrier handling in warp specialization examples - Replaced commented-out mbarrier code with new barrier allocation using T.alloc_barrier for improved synchronization. - Updated barrier wait and arrive calls to align with the new allocation method across multiple example scripts. - Enhanced code readability by removing unnecessary comments and ensuring consistent barrier management. * Update lower_shared_barrier.cc * Update phase.py * Update warp specialization example and Cython wrapper - Removed commented-out pass configuration options in the warp specialization example for clarity. - Added functionality to write the generated kernel source to a file named "kernel.cu". - Enhanced Cython wrapper to support boolean type conversion for improved type handling. * Add storage synchronization call in shared barrier transformation - Introduced a new evaluation statement to call the TVM storage sync function with "shared" as an argument, enhancing synchronization in the shared barrier handling process. * remove debug files * Remove kernel source output to file in warp specialization example * remove comments * Refactor tensor handling and update test execution in TileLang - Changed `Buffer` to `Tensor` in `customize.py` for better type consistency. - Updated `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to use `tir.BufferLoad` instead of `BufferLoad`. - Commented out the main testing function in `test_tilelang_language_reshape.py` and replaced it with a direct call to `run_reshape_smem` for streamlined testing. - Removed unnecessary NVCC compiler flags in `libgen.py` to reduce verbosity. * Update test_tilelang_language_reshape.py
-
- 03 Jun, 2025 1 commit
-
-
botbw authored
-
- 20 Mar, 2025 1 commit
-
-
Lei Wang authored
* remove llvm build * [Refactor] Update kernel compilation and profiling in examples - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation. - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency. - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations. * lint fix * License Update * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields. - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability. * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files - Improved comment alignment and readability in `cuda.h`. - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability. * lint fix * lint fix * lint fix * lint fix * fix * License update * [Enhancement] Update JITKernel to use artifact for kernel source - Assigned the generated artifact to `self.artifact` for better management. - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling. * lint fix * Add @tilelang.testing.requires_llvm decorator to vectorization tests * Enhance setup.py and env.py for library management - Added functionality to remove original files after copying in CMakeBuild. - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration. * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py * Refactor CMakeBuild file handling in setup.py - Added a check to ensure the target library directory exists before copying .so files. - Improved the logic for creating the target directory and copying files to enhance robustness. * bugfix * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement. * lint fix * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility. * lint fix * Add support for C target in device code generation - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function. * [Enhancement] Implement auto-clear cache feature based on environment variable * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing. * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing. * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true. * [Refactor] Update kernel invocation and import paths in tests and cache * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result. * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`. * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability. * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class. * Enhanced overall code formatting to align with project standards. * [Enhancement] Add bfloat16 test case and improve kernel caching logic * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`. * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading. * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management. * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database. * Improved code formatting and readability across several files. * lint fix * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
-
- 12 Mar, 2025 1 commit
-
-
Yu Cheng authored
* [Feature] Add TMA Store Synchronization Support - Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization - Add new builtin operations in op/builtin.cc and op/builtin.h - Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls - Update CUDA codegen to support new TMA store synchronization intrinsics - Add Python language bindings for new TMA store synchronization operations * [CMake] Add CUDA Major Version Detection for Conditional Compilation - Introduce CUDA_MAJOR_VERSION CMake variable to dynamically detect CUDA toolkit version - Update runtime and transform files to use CUDA_MAJOR_VERSION for version-specific code paths - Replace hardcoded __CUDACC_VER_MAJOR__ with dynamically set CUDA_MAJOR_VERSION - Improve cross-version compatibility for CUDA-dependent code sections
-
- 15 Feb, 2025 1 commit
-
-
Lei Wang authored
* bump version into v0.1.0 * [Enhancement] Add custom develop command for editable installs and update .gitignore * [Documentation] Update README to include system dependencies installation instructions * [Build] Update setup.py to support library file copying for both release and develop modes * [Build] Refactor library file copying logic in setup.py * [Documentation] Remove unnecessary install section header in Installation.md * [Build] Add tox configuration and local distribution script for multi-Python version support * [Build] Improve git submodule update function with better error handling * [Build] Update LLVM configuration path in ROCm installation script * [Build] Add .tox/ to .gitignore for tox testing environment * [Build] Add support for TVM prebuild path configuration in CMakeLists.txt * [Cleanup] Remove unused TVM runtime error codes header * [Cleanup] Fix TVM grid constant type reference in CUDA module * [Cleanup] Remove unused customized_code function from IR module * [Feature] Add TileLang thread synchronization and storage access analysis passes * [Build] Reorder DLL search path directories for more flexible library loading * [Refactor] Improve thread synchronization and library path handling - Rename ThreadSync and TileLangThreadSync functions in C++ code - Update Python docstring for ThreadSync with more detailed description - Reorder library path detection in tilelang environment setup - Minor comment and code cleanup in CUDA and warp specialization modules * [Refactor] Improve thread synchronization code style and formatting - Standardize pointer type spacing in storage_access.h and storage_access.cc - Update whitespace and indentation in thread_storage_sync.cc - Reorder include statements in thread_partial_sync.cc - Minor code formatting improvements across thread synchronization files * [Refactor] Fix global function registration for ThreadSync - Correct global function registration to use ThreadSync instead of TileLangThreadSync - Update TVM global registration to match recent refactoring efforts * [Refactor] Simplify ThreadSync global function registration - Remove unnecessary whitespace in global function registration - Compact the TVM global registration line for ThreadSync * [Feature] Add WebGPU code generation support in TileLang - Implement WebGPU code generator (codegen_webgpu.cc and codegen_webgpu.h) - Add WebGPU target support in lower.py and target.py - Update CMakeLists.txt to include WebGPU codegen source files - Introduce WebGPU-specific code generation for WGSL shader language * [Refactor] Improve WebGPU code generation formatting and readability - Enhance code formatting in codegen_webgpu.cc and codegen_webgpu.h - Standardize pointer type spacing and indentation - Improve line breaks and reduce line length for better readability - Minor code style improvements in WebGPU code generation * [Test] Add WebGPU matrix multiplication code generation test - Implement test_webgpu_codegen.py for WebGPU matrix multiplication - Add assert_gemm_codegen function to validate WebGPU code generation - Include basic matrix multiplication kernel test case * Update README with WebGPU codegen support announcement
-
- 14 Feb, 2025 1 commit
-
-
Lei Wang authored
* bump version into v0.1.0 * [Enhancement] Add custom develop command for editable installs and update .gitignore * [Documentation] Update README to include system dependencies installation instructions * [Build] Update setup.py to support library file copying for both release and develop modes * [Build] Refactor library file copying logic in setup.py * [Documentation] Remove unnecessary install section header in Installation.md * [Build] Add tox configuration and local distribution script for multi-Python version support * [Build] Improve git submodule update function with better error handling * [Build] Update LLVM configuration path in ROCm installation script * [Build] Add .tox/ to .gitignore for tox testing environment * [Build] Add support for TVM prebuild path configuration in CMakeLists.txt * [Cleanup] Remove unused TVM runtime error codes header * [Cleanup] Fix TVM grid constant type reference in CUDA module * [Cleanup] Remove unused customized_code function from IR module * [Feature] Add TileLang thread synchronization and storage access analysis passes * [Build] Reorder DLL search path directories for more flexible library loading * [Refactor] Improve thread synchronization and library path handling - Rename ThreadSync and TileLangThreadSync functions in C++ code - Update Python docstring for ThreadSync with more detailed description - Reorder library path detection in tilelang environment setup - Minor comment and code cleanup in CUDA and warp specialization modules * [Refactor] Improve thread synchronization code style and formatting - Standardize pointer type spacing in storage_access.h and storage_access.cc - Update whitespace and indentation in thread_storage_sync.cc - Reorder include statements in thread_partial_sync.cc - Minor code formatting improvements across thread synchronization files * [Refactor] Fix global function registration for ThreadSync - Correct global function registration to use ThreadSync instead of TileLangThreadSync - Update TVM global registration to match recent refactoring efforts * [Refactor] Simplify ThreadSync global function registration - Remove unnecessary whitespace in global function registration - Compact the TVM global registration line for ThreadSync
-
- 17 Jan, 2025 1 commit
-
-
Lei Wang authored
* README.md fixed * test fix * cpu backend update * cpu test case
-
- 11 Jan, 2025 1 commit
-
-
Lei Wang authored
* Add format.sh script for code formatting and linting * docs update * center align the title * lint fix * add ignore * Add .gitignore for 3rdparty directory * Add requirements-dev.txt, requirements-test.txt, and requirements.txt * 3rdparty * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h * Refactor CMakeLists.txt and include statements - Update CMakeLists.txt to use a newer version of CMake and add project name - Remove unnecessary include directories Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc - Update include paths to use relative paths instead of absolute paths * Update submodule for 3rdparty/tvm * update * load dll first * Refactor CMakeLists.txt and include statements * Refactor CMakeLists.txt and include statements * git keep update * Refactor CMakeLists.txt and include statements * Refactor CMakeLists.txt and include statements * refactor code structure * Update Readme * CMakeLists Customized * update readme * update README * update readme * update usage * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import * annotate lower transform global func with `transform` prefix * Migrate Simplify Pass from tilelang tvm branch * enhance system environment handling with __init__ and CMake * Initial commit * CODE_OF_CONDUCT.md committed * LICENSE committed * README.md committed * SECURITY.md committed * SUPPORT.md committed * CODE_OF_CONDUCT Commit * LICENSE Commit * SECURITY Commit * SUPPORT Commit * Modify Support * Update README.md * security ci update * remove examples * Update and implement clang-format * add composable kernel components * Migrate from latest update * submodule update * Test update * Update License * Spell check * lint fix * add clang-tidy to apply static analysis for c source * update tilelang examples * Update Install Docs * Refactor filetree * Enhance Install * conflict resloved * annotate_version * Initial Update * test fix * install * Implement setup.py * lint fix * Separate Init * Separate test * docker file commit * add logo * Update Readme and Examples * update readme * update logo * Implement AMD Installation * Add License * Update AMD MI300x Benchmark * update README * update mi300 benchmark scripts * update ignore * enhance build scirpt * update image * enhance setup.py to remove duplicated libraries * remove debug files * update readme * update image * update gemm examples * update flashattention README * readme update * add cmake into requirements * libinfo fix * auto update submodule * lint fix * Fix AMD Build and Test * Update check for transpose attribute for CDNA Arch * typo fix for amd * Implement Matmul Benchmark * Refactor Code * [TypoFix] Fix GEMM Example * [Docs] Init Linear Attention README * [TYPO] Typo fix * [Lint] Lint Fix * enhance example with intrinsics * [Enhancement] Improve Buffer Collection during IR Parser * [Dev] Introduce Current classmethod to get current frame * submodule update * fake test pass update * support thread_extent_api * code optimize * Add GEMM function implementation for matrix multiplication * Update logging format to reflect TileLang in logger messages * Refactor CMakeLists.txt for improved readability and set default build type to Release * Support Gemm SS Primitives Implementation * [README] Upload Tile Language Logo (#5) * update logo * Update README.md to enhance formatting and center the title --------- Co-authored-by:
microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com> Co-authored-by:
Microsoft Open Source <microsoftopensource@users.noreply.github.com> Co-authored-by:
Yu Cheng <yu.cheng@pku.edu.cn>
-