1. 20 Dec, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Enhance let binding handling in layout inference and warp specialized pass (#1484) · 7e8d1f82
      Lei Wang authored
      * [Feature] Add FullyReplicated Fragment Layout and Enhance Layout Inference
      
      * Introduced a new static method `FullyReplicated` in the `Fragment` class to create fully replicated fragment layouts, ensuring all threads hold identical copies of the buffer.
      * Updated `CopyNode` to collect fragment layouts and mark them as fully replicated during layout inference.
      * Enhanced `ParallelOpNode` to expand let bindings for fragment buffer accesses, improving layout inference accuracy.
      * Added documentation for new methods and updated existing methods to support the new layout features.
      
      * lint fix
      
      * Remove debug logging statements from layout inference process to streamline output and improve performance.
      7e8d1f82
  2. 26 Nov, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Phaseout vmap for Tile Operators (#1334) · f5d9da46
      Lei Wang authored
      
      
      * Refactor GEMM and Reduce operations by moving NormalizeToBufferRegion and MakeAccessPtrFromRegion to utils.{h,cc} for better code organization and reuse.
      
      * lint fix
      
      * Refactor region handling by removing the RegionOp and updating NormalizeToBufferRegion to only accept BufferLoad and BufferRegion. This change improves code organization and simplifies the handling of memory regions across various operations.
      
      * fix
      
      * Refactor memory region handling by introducing `tl.region` calls across various operations, including GEMM and fill functions. This change enhances the consistency of region management and improves code organization by utilizing utility functions for buffer region conversions.
      
      * fix
      
      * fix
      
      * test fix
      
      * lint fix
      
      * Refactor GEMM operations to improve memory region handling by replacing `mbarPtr_` with `mbarRegion_` and updating related logic in both C++ and Python implementations. This change enhances the clarity and consistency of buffer region management.
      
      * fix
      
      * lint fix
      
      * fix
      
      * fix
      
      * test fix
      
      * lint fix
      
      * lint fix
      
      * minor fix
      
      * fix
      
      ---------
      Co-authored-by: default avatarZhiwen Mo <zm125@ic.ac.uk>
      f5d9da46
  3. 31 Oct, 2025 1 commit
    • Lei Wang's avatar
      [FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi (#1108) · 10911e28
      Lei Wang authored
      
      
      * 3rdparty tvm bump
      
      * bump tvm into v0.22.0
      
      * lint fix
      
      * rebase tvm
      
      * Update submodule tvm to latest commit 3085bc4
      
      * Refactor: Update configuration retrieval in CopyNode and adjust test registration in tilelang
      
      * test fix
      
      * add requirement
      
      * atomic_fix
      
      * atomic_fix
      
      * phaseout py39
      
      * optimize
      
      * optimize
      
      * lint fix
      
      * do not clean cache
      
      * do not clean cache
      
      * [Minor] Minor update for Python versions and dependencies
      
      * [Lint] fix lint for py39
      
      * [Lint] fix lint for ROCm
      
      * [Build][CI] Sync CI changes from upstream/sdist
      
      * [Lint] fix lint for ROCm
      
      * [Build][CI] Update `repair-wheel-command`
      
      * [Minor] update abi3audit result format
      
      * [Lint] fix lint for ROCm
      
      * [BugFix] fix build
      
      * [Lint] fix lint for ROCm
      
      * [BugFix] set rpath for libtvm and libtvm_runtime
      
      * [Deps] pin apache-tvm-ffi version
      
      * [Build] set Python 3.9 Limited API for Cython target
      
      * [Build] set Python 3.9 Limited API for Cython target
      
      * [Deps] Restore Python 3.8 support
      
      * [Build] use `apache-tvm-ffi`'s `libtvm_ffi`
      
      * [BugFix] use `;` as delimiter for RPATH on macOS
      
      * [BugFix] use `--ignore-missing-dependencies` for `delocate-wheel`
      
      * [Build] support `sccache` if available
      
      * [Build] add CIBW import test
      
      * [Build][CI] enable ccache for CIBW on Linux
      
      * [BugFix] set rpath for libtvm and libtvm_runtime
      
      * Revert "[Build][CI] enable ccache for CIBW on Linux"
      
      This reverts commit cd9ab57bb5ddd2572c60bcbbebde81480a658fd3.
      
      * [CI] fix perfbench bot
      
      * [BugFix] use Python 3.9 to build wheel
      
      * [Minor] update perfbench bot envs
      
      * [BugFix] fix CIBW environment on Linux
      
      * [CI] skip import test on CentOS 7
      
      * [CI] use Python urllib to download file instead of Wget
      
      ---------
      Co-authored-by: default avatarXuehai Pan <XuehaiPan@pku.edu.cn>
      10911e28
  4. 02 Oct, 2025 1 commit
    • Zhiwen Mo's avatar
      [Bugfix] Fix tensor memory copy layout (#933) · 5ccac4fa
      Zhiwen Mo authored
      * Implements tcgen05.ld instruction support for copying from shared.tmem
        to local.fragment on SM100/Blackwell architecture. Adds layout inference
        and lowering logic for tensor memory operations with proper physical
        coordinate range analysis and warpgroup alignment checks.
      
        Changes:
        - Add kTMemLoad and kTMemStore to CopyInst enumeration
        - Implement CheckTMemLoad() and CheckTMemStore() validation functions
        - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics
        - Add tmem layout inference in InferLayout() using expandTcgen05Layout
        - Support multiple instruction variants (32dp32b/64b/128b/256b)
        - Add physical layout bounds analysis for tmem coordinates
        - Change clear_accum from bool to PrimExpr in GEMM operations
        - Fix std::optional access checks in layout_inference.cc
        - Add tmem_allocate/deallocate PTX intrinsic support
        - Fix cooperative_groups grid.sync() code generation
      
      * fix
      
      * pipeline fix
      
      * bug fix
      
      * bool fix
      5ccac4fa
  5. 06 Sep, 2025 1 commit
    • Lei Wang's avatar
      [TMA] Automatically lower 1d tma in appropriate cases (#788) · 9d7d45be
      Lei Wang authored
      * Enhance layout inference and copy operations with 1D TMA support
      
      - Updated `CopyNode` to introduce separate handling for 1D bulk load/store operations, including new methods for checking and lowering these operations.
      - Modified `InferLayout` and `GetCopyInst` to accommodate additional parameters for layout maps and analyzers.
      - Enhanced `AtomicAddNode` and `FillNode` to utilize the updated layout inference logic.
      - Improved buffer out-of-bounds checks during layout inference to ensure safe memory access.
      
      This update improves the efficiency and correctness of memory operations in the TileLang framework.
      
      * Refactor layout inference calls for improved readability
      
      - Updated `InferLayout` calls in `AtomicAddNode`, `CopyNode`, and `FillNode` to enhance code clarity by formatting parameters across multiple lines.
      - Cleaned up whitespace and formatting in `copy.h` and `layout_inference.cc` to adhere to coding standards and improve maintainability.
      
      This refactor aims to streamline the layout inference logic and improve overall code organization.
      
      * Fix shared tensor check in CopyNode for bulk copy operations
      
      - Updated the condition in `CheckBulkCopy1D` to verify contiguity of `shared_tensor` instead of `dst`, ensuring correct handling of shared memory layouts during bulk copy operations.
      - This change enhances the accuracy of memory operations in the TileLang framework.
      
      * Update test_example_gdn_compilation.py to invoke test function directly
      
      - Commented out the call to `tilelang.testing.main()` in `test_example_gdn_compilation.py` and replaced it with a direct call to `test_example_chunk_delta_bwd_compilation()`. This change simplifies the test execution flow and focuses on the specific test case.
      
      * Enhance bulk load/store checks in CopyNode with last dimension validation
      
      - Updated `CheckBulkLoad` and `CheckBulkStore` methods in `CopyNode` to include an optional parameter for validating the last dimension during bulk copy operations.
      - Adjusted related methods `CheckBulkLoad1D` and `CheckBulkStore1D` to pass the new parameter, improving the accuracy of bulk copy checks.
      - This change enhances the robustness of memory operations in the TileLang framework by ensuring compliance with dimensional requirements.
      
      * Refactor CheckBulkLoad and CheckBulkStore methods for improved readability
      
      - Reformatted the parameter lists of `CheckBulkLoad` and `CheckBulkStore` methods in `CopyNode` to enhance code clarity by aligning parameters across multiple lines.
      - This change improves the maintainability of the code and adheres to coding standards.
      9d7d45be
  6. 04 Sep, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Support python reflection for tile operators (#783) · 3cfefc8e
      Lei Wang authored
      * Implement Fill operator and related reflection methods in TileLang
      
      - Added Fill operator implementation in `fill.cc` and `fill.h` for element-wise filling of buffers.
      - Introduced reflection methods for Fill, AtomicAdd, Copy, Conv2DIm2Col, FinalizeReducer, Gemm, and Parallel operators to enhance introspection capabilities.
      - Updated relevant files to register reflection methods and ensure proper initialization in static blocks.
      - Removed outdated comments and unnecessary code in various operator files to improve clarity and maintainability.
      - Added new Python bindings for the Fill operator in `tilelang/ir/fill.py` and updated the module imports accordingly.
      
      * Refactor operator reflection methods and improve code clarity
      
      - Updated reflection methods for AtomicAdd, Copy, FinalizeReducer, Gemm, and Parallel operators to enhance readability by using `empty()` instead of size checks.
      - Consolidated static initialization blocks for various operators to a single line for improved consistency.
      - Cleaned up whitespace and formatting in multiple files to adhere to coding standards and improve maintainability.
      - Added new Python bindings for operators in the `tilelang/ir` module, ensuring proper registration and organization of imports.
      
      * Refactor GEMM and AtomicAdd operations for improved clarity
      
      - Updated the `GetArchInt` function in `atomic_add.cc` to use `std::string` and `std::stoi` for better readability and type safety.
      - Removed unnecessary variables and comments in `gemm_sp.cc` and `gemm.cc` to streamline the `ComputeWarpPartition` method.
      - Cleaned up the `layout_reducer.cc` file by removing unused variable declarations, enhancing code clarity.
      - Added import for the `ir` module in `tilelang/__init__.py` to ensure proper organization of module imports.
      
      * Remove deprecated operator files from the tilelang IR module
      
      - Deleted files for Fill, AtomicAdd, Copy, Gemm, GemmSP, FinalizeReducer, Parallel, Reduce, and Region operators to streamline the codebase.
      - This cleanup enhances maintainability by removing unused code and improving overall organization of the module.
      
      * Refactor imports in tilelang IR module for improved organization
      
      - Updated import statements in `tilelang/ir.py` to reflect changes in the TVM library structure, enhancing clarity and maintainability of the codebase.
      
      * lint fix
      
      * Refactor GEMM and GEMM-SP operations to enhance clarity and maintainability
      
      - Updated the `Gemm` and `GemmSP` classes to utilize a new `GemmWarpPolicy` object for warp partitioning, improving encapsulation and readability.
      - Removed deprecated `ComputeWarpPartition` methods and replaced them with calls to the new policy object, streamlining the code.
      - Cleaned up comments and unnecessary code in `gemm.cc`, `gemm_sp.cc`, and related header files to enhance overall clarity.
      - Introduced a new `GemmWarpPolicyNode` class to manage warp policy attributes and methods, facilitating better organization of related functionalities.
      - Updated reflection methods to include the new policy structure, ensuring proper registration and introspection capabilities.
      
      * Refactor Reduce operation to utilize ReduceType class for improved clarity and maintainability
      
      - Replaced multiple conditional checks for reduce types with a single ReduceType object, simplifying the code structure.
      - Introduced a new ReduceTypeNode class to encapsulate reduce type logic and methods, enhancing organization.
      - Updated MakeInitValue, MakeReduce, and Lower methods to leverage the new ReduceType class, improving readability.
      - Added Python bindings for the ReduceType class in tilelang IR module to ensure proper registration and usability.
      
      * comment
      
      * Refactor operator header files for improved readability
      
      - Cleaned up formatting and whitespace in `atomic_add.h`, `copy.h`, `fill.h`, `reduce.cc`, and `reduce.h` to enhance code clarity.
      - Consolidated comments and adjusted line breaks for better organization and maintainability across multiple operator definitions.
      
      * Refactor MakeReduce method in ReduceOpNode for clarity
      
      - Updated the parameter name in the MakeReduce method from `rhs` to `b` and assigned it to `rhs` for improved readability.
      - This change enhances the clarity of the method's purpose and aligns with the overall refactoring efforts in the Reduce operation.
      
      * Update Reduce operation type checks for consistency
      
      - Changed string comparisons for reduce types in the MakeReduce method from "abs_sum" to "abssum" and "abs_max" to "absmax" for uniformity.
      - This adjustment enhances the clarity and consistency of the reduce type handling in the codebase.
      3cfefc8e
  7. 02 Sep, 2025 1 commit
    • Lei Wang's avatar
      [Lint] Introduce clang-tidy into format.sh (#777) · cdc5d8d3
      Lei Wang authored
      * [Refactor] Update Clang-Tidy Checks and Improve Code Consistency
      
      - Enhanced .clang-tidy configuration by adding specific checks for better bug detection and performance optimization.
      - Refactored function signatures across multiple files to use `const` references for parameters, improving performance and code clarity.
      - Updated various methods to ensure consistent handling of parameters, particularly in `AddPredicate`, `Substitute`, and `PlanLoopPartition` functions.
      - Improved readability by replacing size checks with `empty()` method calls in several locations, ensuring clearer intent in the code.
      - General code cleanup and adherence to best practices for better maintainability.
      
      * [Refactor] Enhance Code Consistency and Clang-Tidy Configuration
      
      - Updated .clang-tidy configuration to include additional checks for improved code quality and performance.
      - Refactored function signatures across multiple files to use `const` references, enhancing performance and clarity.
      - Replaced size checks with `empty()` method calls in various locations for clearer intent.
      - Improved handling of parameters in several functions, ensuring consistent usage of `std::move` where applicable.
      - General code cleanup to adhere to best practices and improve maintainability.
      
      * [Refactor] Integrate Clang-Tidy Checks and Enhance Code Consistency
      
      - Added clang-tidy checks to the format script for improved code quality assurance.
      - Refactored function signatures across multiple files to consistently use `const` references, enhancing performance and clarity.
      - Updated the requirements-lint.txt file to include clang-tidy as a dependency.
      - General code cleanup to adhere to best practices and improve maintainability.
      
      * [CI] Update AMD CI Workflow to Include Build Directory Creation
      
      - Added steps to create a build directory and configure CMake with ROCm support during the format check process.
      - Ensured cleanup of the build directory after the format check to maintain a clean workspace.
      
      * [Refactor] Remove Unused Member Variables in AtomicAddNode and CopyNode
      
      - Removed the `args_` member variable from both `AtomicAddNode` and `CopyNode` classes to streamline the code and eliminate unnecessary data members.
      - This change enhances code clarity and maintainability by focusing on relevant attributes for each class.
      
      * [Refactor] Update Clang-Tidy Integration and Code Improvements
      
      - Modified the format script to include the `-fix` option in the clang-tidy command for automatic code fixes.
      - Refactored the `AtomicAddVectorizePlanner` class to improve variable handling and consistency, including changes to member variable types and function signatures.
      - Enhanced code clarity by removing unnecessary `std::move` calls and ensuring consistent usage of types across the class.
      - General code cleanup to adhere to best practices and improve maintainability.
      
      * [Refactor] Improve Parameter Handling and Consistency in AtomicAddVectorize
      
      - Updated function signatures in `AtomicAddVectorizePlanResult` and `AtomicAddVectorizeRewriter` to use `const` references and `std::move` for better performance and clarity.
      - Enhanced the `UpdateVectorSize` method to accept `const Array<PrimExpr>&` for improved efficiency.
      - General code cleanup to maintain consistency and adhere to best practices.
      
      * [CI] Add Git Submodule Initialization to CI Workflow
      
      - Included a step to initialize and update git submodules recursively in the CI workflow.
      - This change ensures that all necessary submodules are available during the format check process, improving build reliability.
      
      * [CI] Add Git Submodule Update Step to Format Check
      
      - Included a command to initialize and update git submodules recursively in the CI workflow during the format check process.
      - This enhancement ensures that all required submodules are available, contributing to improved build reliability.
      
      * [Refactor] Update Function Signatures in AtomicAddVectorize
      
      - Modified the `VectorizeAtomicAdd` function signature to use `const` references for `thread_var` and `thread_bounds`, enhancing performance and code clarity.
      - This change aligns with previous refactoring efforts to improve parameter handling and consistency across the codebase.
      cdc5d8d3
  8. 31 Aug, 2025 1 commit
  9. 29 Aug, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Refactor `Operator` into `TileOperator` and with tvm reflection (#763) · b38bd69e
      Lei Wang authored
      * Refactor operator classes to inherit from TileOperator and update layout inference methods
      
      - Changed base class of several operator classes (AtomicAdd, Copy, Gemm, etc.) from Operator to TileOperator for better alignment with tile operations.
      - Updated InferLayout and Lower methods to use 'override' specifier for clarity and consistency.
      - Adjusted header inclusions to replace "op.h" with "operator.h" across multiple files for improved organization.
      - Added missing layout inference implementations for Fill and Conv2DIm2ColOp.
      - Removed deprecated op.cc and op.h files to streamline the codebase.
      
      * lint fix
      
      * Refactor operator classes to use Node pattern and improve memory management
      
      - Updated several operator classes (AtomicAdd, Copy, Gemm, etc.) to utilize the Node pattern for better memory management and encapsulation.
      - Changed constructors to initialize member variables through a node object, enhancing clarity and reducing direct member access.
      - Updated Clone methods to return TileOperator instances instead of unique pointers, aligning with the new design.
      - Refactored InferLayout and Lower methods to ensure consistency across operator implementations.
      - Adjusted header files to reflect the new class structure and removed deprecated code for a cleaner codebase.
      
      * Enhance Clone methods in AtomicAdd and Copy classes to support parallel operation cloning
      
      - Updated the Clone methods in AtomicAddNode and CopyNode to ensure that the parallel operation (par_op_) is properly cloned when defined, improving the integrity of cloned objects.
      - Refactored the FillNode class to use ParallelOp directly instead of std::make_unique, streamlining the creation of parallel operations.
      - Made minor adjustments in layout inference and other related methods for consistency and clarity.
      
      * Refactor FillNode::Lower method to remove unused global function call
      
      - Eliminated the call to the global function "tl.fill.lower" in the FillNode::Lower method, streamlining the code and improving clarity.
      - Retained the core functionality of the method while enhancing maintainability by reducing unnecessary dependencies.
      b38bd69e
  10. 22 Aug, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746) · 5c11d245
      Lei Wang authored
      * [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy
      
      * Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed.
      * Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
      * Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable.
      
      * lint fix
      
      * Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.
      
      * remove bulk copy
      
      * Refactor copy and atomic add operations to support TMA lower configuration
      
      - Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
      - Modified `Lower` method in `Copy` to incorporate the new TMA configuration.
      - Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
      - Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity.
      - Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.
      
      * Enhance TMA bulk copy logic in `LowerBulkCopy` method
      
      - Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
      - Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.
      
      * lint fix
      
      * Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.
      
      * Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions
      
      - Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
      - Added TODO comments to indicate the need for further improvements in shared memory handling.
      
      * Update `native_sparse_attention` function to include TMA configuration options
      
      - Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations.
      - Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.
      
      * Refactor JIT decorator formatting in `native_sparse_attention` function
      
      - Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase.
      - No functional changes were made; this update focuses on code clarity and maintainability.
      
      * Enhance thread management and logging in TileLang compilation
      
      - Added a method to check if printing is enabled during compilation, improving control over logging behavior.
      - Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output.
      - Added comments to clarify the purpose of changes and improve code readability.
      
      * Add warp specialization scope and refactor register management in TileLang
      
      - Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management.
      - Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process.
      - Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management.
      - Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.
      
      * Refactor test for InjectSetMaxNReg pass in TileLang
      
      - Improved readability by restructuring conditional checks and assertions in the test cases.
      - Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic.
      - Ensured consistent formatting and spacing throughout the test functions for better maintainability.
      
      * Enhance bulk copy and store checks in `Copy` class
      
      - Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options.
      - Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations.
      - Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.
      
      * lint fix
      5c11d245