1. 02 Nov, 2025 1 commit
    • Lei Wang's avatar
      [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986) · aef0a6bb
      Lei Wang authored
      
      
      * remove debug print
      
      * pipeline fix
      
      * use the correct buffer access scope
      
      * rs support
      
      * warp warpgroup_fence_operand
      
      * fix
      
      * fp8 dtype ptx enhance
      
      * mma fix
      
      * TCGEN05 Interface
      
      * tcgen05 support
      
      * rebase
      
      * update
      
      * Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.
      
      * lint fix
      
      * Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.
      
      * wgmma fix
      
      ---------
      Co-authored-by: default avatarZhiwen Mo <zm125@ic.ac.uk>
      aef0a6bb
  2. 31 Oct, 2025 1 commit
    • Lei Wang's avatar
      [FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi (#1108) · 10911e28
      Lei Wang authored
      
      
      * 3rdparty tvm bump
      
      * bump tvm into v0.22.0
      
      * lint fix
      
      * rebase tvm
      
      * Update submodule tvm to latest commit 3085bc4
      
      * Refactor: Update configuration retrieval in CopyNode and adjust test registration in tilelang
      
      * test fix
      
      * add requirement
      
      * atomic_fix
      
      * atomic_fix
      
      * phaseout py39
      
      * optimize
      
      * optimize
      
      * lint fix
      
      * do not clean cache
      
      * do not clean cache
      
      * [Minor] Minor update for Python versions and dependencies
      
      * [Lint] fix lint for py39
      
      * [Lint] fix lint for ROCm
      
      * [Build][CI] Sync CI changes from upstream/sdist
      
      * [Lint] fix lint for ROCm
      
      * [Build][CI] Update `repair-wheel-command`
      
      * [Minor] update abi3audit result format
      
      * [Lint] fix lint for ROCm
      
      * [BugFix] fix build
      
      * [Lint] fix lint for ROCm
      
      * [BugFix] set rpath for libtvm and libtvm_runtime
      
      * [Deps] pin apache-tvm-ffi version
      
      * [Build] set Python 3.9 Limited API for Cython target
      
      * [Build] set Python 3.9 Limited API for Cython target
      
      * [Deps] Restore Python 3.8 support
      
      * [Build] use `apache-tvm-ffi`'s `libtvm_ffi`
      
      * [BugFix] use `;` as delimiter for RPATH on macOS
      
      * [BugFix] use `--ignore-missing-dependencies` for `delocate-wheel`
      
      * [Build] support `sccache` if available
      
      * [Build] add CIBW import test
      
      * [Build][CI] enable ccache for CIBW on Linux
      
      * [BugFix] set rpath for libtvm and libtvm_runtime
      
      * Revert "[Build][CI] enable ccache for CIBW on Linux"
      
      This reverts commit cd9ab57bb5ddd2572c60bcbbebde81480a658fd3.
      
      * [CI] fix perfbench bot
      
      * [BugFix] use Python 3.9 to build wheel
      
      * [Minor] update perfbench bot envs
      
      * [BugFix] fix CIBW environment on Linux
      
      * [CI] skip import test on CentOS 7
      
      * [CI] use Python urllib to download file instead of Wget
      
      ---------
      Co-authored-by: default avatarXuehai Pan <XuehaiPan@pku.edu.cn>
      10911e28
  3. 29 Oct, 2025 1 commit
  4. 27 Oct, 2025 1 commit
  5. 23 Oct, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Improve scalar handling in CopyNode and update loop partition dtype logi (#1111) · 86c8bb46
      Lei Wang authored
      * [Refactor] Improve scalar handling in CopyNode and update loop partition dtype logic
      
      * Refactored CopyNode::MakeSIMTLoop to handle scalar cases more efficiently by moving the scalar check to the end of the function.
      * Updated loop_partition.cc to set a default DataType for thread and vector extents, ensuring compatibility when loop_vars_ is empty.
      
      * lint fix
      
      * remove debug print
      86c8bb46
  6. 22 Oct, 2025 2 commits
  7. 21 Oct, 2025 3 commits
    • Lei Wang's avatar
      [Language] Support tilelang `alloc_var(dtype, init=x)` (#1092) · bddb125e
      Lei Wang authored
      * - carry existing local-var initializer map into OpaqueBlockLower, reattach it to
          generated Allocates and the PrimFunc attrs
        - thread the map through FlattenBuffer and StorageRewrite so flattened/merged
          allocations keep their tl.local_var_init annotations
        - teach annotation handling to accept scalar initializers, resolve buffers, and merge
          with existing stat
      
      * lint fix
      
      * enhance
      
      * lint fix
      
      * lint fix
      bddb125e
    • Lei Wang's avatar
      [PassConfig] Introduce PassConfig `TL_STORAGE_REWRITE_DETECT_INPLACE` (#1089) · cdc67fc4
      Lei Wang authored
      * • Enable configurable StorageRewrite inplace detection
      
        - Add kStorageRewriteDetectInplace constant and register the flag with PassContext so C++ code no longer hard-codes the key.
        - Wire StorageRewrite to include TileLang builtin constants and honor the new config toggle when deciding inplace reuse.
        - Document the flag across Python surfaces (PassConfigKey, JIT/autotuner docs) with usage guidance and simplified IR examples.
      
      * lint fix
      
      * add test
      
      * lint fix
      cdc67fc4
    • Zhengju Tang's avatar
      [BugFix] Add memory order argument for non-vectorized atomic add (#1081) · 1d4b7180
      Zhengju Tang authored
      * [BugFix] Add memory order argument for non-vectorized atomic add
      
      * [Lint]
      
      * [BugFix] Memory order
      
      * [Lint]
      
      * [BugFix] Argument in cuda template
      
      * [Lint]
      1d4b7180
  8. 20 Oct, 2025 2 commits
  9. 17 Oct, 2025 2 commits
  10. 16 Oct, 2025 2 commits
  11. 15 Oct, 2025 2 commits
  12. 14 Oct, 2025 1 commit
  13. 13 Oct, 2025 1 commit
  14. 11 Oct, 2025 3 commits
  15. 10 Oct, 2025 2 commits
    • Xuehai Pan's avatar
      [CI] add `pre-commit` integration (#955) · 8fe35402
      Xuehai Pan authored
      
      
      * chore: misc cleanup
      
      * feat: add pre-commit config
      
      * chore: update lint dependencies
      
      * style: fix lint issues
      
      * feat: add pre-commit hooks
      
      * fix: fix typos
      
      * chore: update .gitattributes
      
      * [Lint]: [pre-commit.ci] auto fixes [...]
      
      * docs: update CONTRIBUTING.md
      
      * chore: update default venv name
      
      * chore: revert and exclude CUDA files
      
      ---------
      Co-authored-by: default avatarpre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
      8fe35402
    • Lei Wang's avatar
      [Bugfix] Do not force inline let stmt (#947) · f8ae600c
      Lei Wang authored
      * remove debug print
      
      * Remove inline let expressions from the LowerAndLegalize function in phase.py
      
      * add test
      
      * Update sparse MLA examples to support SKV adjustment and correctness checks
      
      - Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
      - Added check_correctness parameter to test functions for validation of outputs.
      - Updated test cases to reflect new SKV values and correctness checks.
      
      * reduce test shape
      
      * Update documentation structure and refactor main function parameters in example_fusedmoe_tilelang.py
      
      - Added a new section for compiler internals in the documentation.
      - Refactored the main function in example_fusedmoe_tilelang.py to accept parameters for hidden dimensions, expert configurations, and batch/sequence sizes, improving flexibility and readability.
      
      * Update buffer access checks in merge_shared_memory_allocations.cc
      
      - Changed the condition for buffer access from less than (<) to less than or equal to (<=) to allow access at the same scope level.
      - Adjusted the logic for determining the access level when touching buffers to ensure correct handling of scope levels.
      
      * lint fix
      
      * Support pipeline with LetStmt
      
      * lint fix
      
      * • Fix LowerTileOp let handling to avoid LetInline dependency
      
        - inline let-bound BufferLoad nodes via resolver helpers and structured return
        - remap layouts/buffers using original data vars and only rewrite when needed
        - update pipeline planner to understand let-bound address_of buffers
        - document the new inline behaviour in docs/let_inline_fix.md
      
      * fix for wgmma pipeline with let binding
      
      * lint fix
      
      * test fix
      
      * reduce smem usage.
      
      * let binding enhancement
      
      * fix for dpgm
      
      * fix simplify
      
      * lint fix
      
      * use tilelang.Simplify instead of tir.Simplify
      
      * • Add TL_FORCE_LET_INLINE pass config and gate eager LetInline usage
      
        - register the new config in builtin headers/registration
        - add helper to pipeline enabling LetInline based on pass context
        - document LetStmt inlining controls and usage
      f8ae600c
  16. 09 Oct, 2025 1 commit
    • Lei Wang's avatar
      [TileOp] Implement WGMMA for T.gemm_v2 (#813) · a13cde28
      Lei Wang authored
      * [Feature] Introduce WGMMA support and enhance GEMM layout handling
      
      - Added support for the WGMMA intrinsic in the TileLang framework, enabling efficient matrix multiplication on newer architectures.
      - Refactored GEMM layout functions to accept a boolean parameter for K dimension handling, improving flexibility in layout generation.
      - Updated layout inference logic to accommodate new WGMMA configurations and ensure compatibility with existing GEMM operations.
      - Enhanced Python bindings for layout functions, allowing for better integration and usability in user-defined operations.
      - Improved documentation for layout functions and GEMM operations to clarify usage and parameters.
      
      These changes enhance the performance and usability of GEMM operations, particularly for advanced architectures, while maintaining backward compatibility with existing implementations.
      
      * [Refactor] Clean up code formatting and enhance layout function readability
      
      - Improved code formatting across multiple files for better readability, including consistent indentation and line breaks.
      - Updated layout function signatures to enhance clarity, particularly in `gemm_layouts.cc`, `layout.cc`, and `layout.h`.
      - Refactored lambda functions in `builtin.cc` and `gemm_py.cc` for improved structure and maintainability.
      - Enhanced comments and documentation in layout-related files to clarify usage and parameters.
      
      These changes contribute to a cleaner codebase and improved maintainability of layout functions in the TileLang framework.
      
      * [Feature] Add descriptor initialization and offset manipulation for WGMMA
      
      - Introduced new TileLang builtins `initialize_descriptor` and `increase_descriptor_offset` to facilitate descriptor management for WGMMA operations.
      - Updated `builtin.cc` and `builtin.h` to define and document the new builtins, enhancing the framework's capabilities for descriptor handling.
      - Modified `codegen_cuda.cc` and `ptx.cc` to integrate the new builtins into the code generation process, ensuring proper assembly generation for WGMMA operations.
      - Enhanced the `GemmWGMMA` class to utilize the new descriptor functionalities, improving the efficiency of matrix multiplication operations.
      - Updated related tests and documentation to reflect the new features and ensure comprehensive coverage.
      
      These changes enhance the TileLang framework's support for advanced matrix operations on newer architectures, improving performance and usability.
      
      * [Refactor] Improve code formatting and readability in various files
      
      - Enhanced code formatting across multiple files for better readability, including consistent indentation and line breaks.
      - Updated function signatures and comments in `builtin.h`, `codegen_cuda.cc`, and `ptx.cc` to improve clarity.
      - Refactored descriptor initialization and offset manipulation functions in `builtin.py` and `wgmma_macro_generator.py` for improved structure.
      - Cleaned up unnecessary whitespace and improved alignment in `common.h` and `allocate.py`.
      
      These changes contribute to a cleaner and more maintainable codebase in the TileLang framework.
      
      * [Update] Update subproject commit and refactor layout function call
      
      - Updated the subproject commit for `cutlass` to indicate a dirty state.
      - Refactored the `UpdateAnalyzer` function in `layout.cc` to call `LayoutNode::getVarMap()` instead of `getVarMap()`, improving clarity and ensuring proper context for variable mapping.
      
      These changes enhance the maintainability and clarity of the layout handling in the TileLang framework.
      
      * support more data types
      
      * gemm_rs support
      
      * lint fix
      
      * wgmma wrapper
      
      * Remove debug logging for wgmma assembly code and refactor swizzle byte size calculations in wgmma macro generator. Enhanced handling of leading and stride byte offsets based on swizzle mode, improving clarity and performance in tensor core intrinsic emissions.
      
      * Refactor GEMM layout functions to replace 'kfactor' with 'k_inner' for improved clarity and consistency. Update includes necessary changes in error messages for Hopper and Sm100 layouts. Additionally, include a new header for CUTE utilities in common.h.
      
      * Comprehensively support WGMMA GEMM SS
      
      * remove debug print
      
      * lint fix
      
      * remove debug print
      
      * reduce bwd test shape
      
      * lint fix
      
      * clear cache for pytest
      
      * lint fix
      
      * Update sparse MLA examples to support SKV adjustment and correctness checks
      
      - Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
      - Added check_correctness parameter to test functions for validation of outputs.
      - Updated test cases to reflect new SKV values and correctness checks.
      
      * test fix
      
      * adjust test case
      
      * test fix
      
      * skip some test currently
      a13cde28
  17. 05 Oct, 2025 1 commit
  18. 02 Oct, 2025 2 commits
    • Zhiwen Mo's avatar
      [Bugfix] Fix tensor memory copy layout (#933) · 5ccac4fa
      Zhiwen Mo authored
      * Implements tcgen05.ld instruction support for copying from shared.tmem
        to local.fragment on SM100/Blackwell architecture. Adds layout inference
        and lowering logic for tensor memory operations with proper physical
        coordinate range analysis and warpgroup alignment checks.
      
        Changes:
        - Add kTMemLoad and kTMemStore to CopyInst enumeration
        - Implement CheckTMemLoad() and CheckTMemStore() validation functions
        - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics
        - Add tmem layout inference in InferLayout() using expandTcgen05Layout
        - Support multiple instruction variants (32dp32b/64b/128b/256b)
        - Add physical layout bounds analysis for tmem coordinates
        - Change clear_accum from bool to PrimExpr in GEMM operations
        - Fix std::optional access checks in layout_inference.cc
        - Add tmem_allocate/deallocate PTX intrinsic support
        - Fix cooperative_groups grid.sync() code generation
      
      * fix
      
      * pipeline fix
      
      * bug fix
      
      * bool fix
      5ccac4fa
    • Lei Wang's avatar
      [Layout] Strict annotate completed replicated layout for fragment with constant index (#929) · fc4bd452
      Lei Wang authored
      * [Layout] Add IsCompletedReplicated method and enhance layout inference in ParallelOpNode
      
      - Introduced IsCompletedReplicated method in FragmentNode to check if a buffer is fully replicated.
      - Enhanced InferLayout in ParallelOpNode to handle layout inference for replicated buffers, ensuring only fragment[0] access is allowed.
      - Updated error handling for non-zero index access in fragment buffers to improve robustness.
      
      * [Layout] Improve code formatting and readability in layout.cc and parallel.cc
      
      - Enhanced formatting in FragmentNode's IsCompletedReplicated method for better clarity.
      - Updated InferLayout method in ParallelOpNode to improve code readability by adjusting line breaks and indentation.
      - Ensured consistent formatting across conditional statements and comments for improved maintainability.
      
      * updt
      
      * optimize const index related op
      
      * bug fix
      
      * reduce gdn test
      
      * test fix
      
      * lintfix
      
      * lint fix
      
      * test fix
      fc4bd452
  19. 28 Sep, 2025 2 commits
    • Tong WU's avatar
      [Bugfix] Fix CopyNode Lower method to include disable_tma flag in GetCopyInst (#888) · 599264ca
      Tong WU authored
      * Fix CopyNode Lower method to include disable_tma flag in GetCopyInst call
      
      * Refactor flash attention implementation to disable TMA for specific copy and allow TMA for other operations
      
      * attempt to fix lint
      599264ca
    • Zhiwen Mo's avatar
      [SM100] Add sm100 GEMM layouts and tcgen05 support (#887) · f58bcd43
      Zhiwen Mo authored
      * update sm100 related utcmma, tmem, ld/st256 in src
      * update sm100 related utcmma, tmem, ld/st256 in tilelang
      * Remove deprecated GEMM examples and related README documentation for SM100 architecture support
      * Update GEMM implementation to replace UTCMMA with TCGEN5MMA across relevant files
      * Remove gemm_umma.py example and update README to reflect TCGEN5MMA terminology changes
      * Update README.md for gemm_sm100 example by removing outdated API sections and streamlining documentation
      * Update README and source files to reflect TCGEN5.MMA terminology changes
      * Refactor CUDA GEMM header for improved readability
      f58bcd43
  20. 26 Sep, 2025 3 commits
    • Lei Wang's avatar
      [Layout] Introduce Flexible Parallel to Support T.serial and local buffers... · c382dcbc
      Lei Wang authored
      
      [Layout] Introduce Flexible Parallel to Support T.serial and local buffers inside T.Parallel loop (#844)
      
      * Support T.serial and local buffers inside T.Parallel loop.
      
      * Fix reducer layout in T.Parallel nested inside other loops
      
      * Debug output with LOG(INFO)
      
      * Add disable option for WGMMA.
      
      * fix
      
      * Use DLOG; fix missing registration for new pass config
      
      * bug fix
      
      * lint fix
      
      * Enhance GEMM instruction set with UTCMMA and improve local buffer handling in casting example
      
      * Update format.sh shebang, improve logging in layout inference, and enhance buffer store wrapper with detailed comments
      
      * Enhance GEMM instantiation logic and improve layout inference for local buffer detection
      
      - Updated the GEMM instantiation logic to include a check for WGMMA compatibility, ensuring that the conditions for using WGMMA are more robust.
      - Refined the layout inference process to better identify when loops manipulate only local buffers, improving the accuracy of thread binding decisions in parallel loops.
      
      ---------
      Co-authored-by: default avatarHuanqi Cao <caohuanqi@deepseek.com>
      c382dcbc
    • Lei Wang's avatar
      [Precision] Introduce `T.ieee_rsqrt` and related high precision op (#882) · a58bf9b6
      Lei Wang authored
      * Add fast math operations for CUDA: exp, exp10, log, log2, log10, tan, cos, and sin (#865)
      
      * Refactor fast math operation definitions for consistency and readability in CUDA code. Consolidated multiple definitions into single lines and improved formatting in related test files for better clarity.
      
      * Remove unnecessary pass configurations for warp specialization and TMA lowering in fast math operation tests for CUDA. This simplifies the test setup while maintaining the focus on fast math functionality.
      
      * Update fastmath tests to reflect that tl.* intrinsics generate no fastmath versions and disable cache in main execution.
      
      * Fix formatting in fastmath test comments for clarity on tl.* intrinsics behavior.
      
      * Add precision comparison tool for CUDA operations
      
      This commit introduces a new Python script and CUDA source file for a precision comparison tool that evaluates the accuracy of various CUDA operations (including division, reciprocal, exponential, logarithmic, and trigonometric functions) across different implementations: CUDA Precise, CUDA Fast, Triton, Triton LibDevice, and TileLang. The tool generates test data, executes the operations, and summarizes the error statistics for each implementation against a double precision reference. Additionally, a README file is added to document the results of the comparisons for various operations.
      
      * Add precision comparison tool for CUDA operations
      
      This commit introduces a new precision comparison tool implemented in Python and CUDA, designed to evaluate the accuracy of various mathematical operations (division, reciprocal, exponential, logarithmic, trigonometric, square root, etc.) across different frameworks including CUDA Precise/Fast, Triton, Triton LibDevice, PyTorch, and TileLang. The tool includes functionality for generating test data, executing operations, and summarizing error statistics for each implementation. Additionally, it provides a comprehensive README with error metrics for each operation tested.
      
      * Add IEEE-compliant mathematical operations and refactor fast math module
      
      This commit introduces new high precision mathematical operations including ieee_add, ieee_sub, ieee_mul, ieee_fmaf, ieee_frcp, ieee_fsqrt, ieee_frsqrt, and ieee_fdiv to the TileLang framework. The fast math module has been refactored to remove the deprecated fastmath.py file and update the import paths accordingly. Additionally, the CUDA code generation has been enhanced to support these new operations, ensuring compatibility with IEEE standards for floating-point arithmetic.
      
      * debug removed
      
      * Refactor IEEE math tests for improved readability and consistency
      
      This commit enhances the formatting of the `test_ieee_math.py` and `test_mathops_fastmath.py` files by adjusting line breaks for better clarity. It also removes unnecessary comments and ensures that the main execution of tests is streamlined. These changes aim to improve the overall maintainability of the test code.
      
      * Update README.md to enhance formatting of precision comparison results
      
      This commit reformats the precision comparison results in the README.md file, converting the error statistics tables into a more structured markdown format. This change improves readability and accessibility of the data for various mathematical operations across different implementations, including FP32 Precise, Triton, TileLang, and CUDA.
      a58bf9b6
    • Lei Wang's avatar
      [FastMath] Disable default TVM fastmath intrinsic dispatch and add explicit... · 95c373f5
      Lei Wang authored
      [FastMath] Disable default TVM fastmath intrinsic dispatch and add explicit fastmath op to invoke (#875)
      
      * Add fast math operations for CUDA: exp, exp10, log, log2, log10, tan, cos, and sin (#865)
      
      * Refactor fast math operation definitions for consistency and readability in CUDA code. Consolidated multiple definitions into single lines and improved formatting in related test files for better clarity.
      
      * Remove unnecessary pass configurations for warp specialization and TMA lowering in fast math operation tests for CUDA. This simplifies the test setup while maintaining the focus on fast math functionality.
      
      * Update fastmath tests to reflect that tl.* intrinsics generate no fastmath versions and disable cache in main execution.
      
      * Fix formatting in fastmath test comments for clarity on tl.* intrinsics behavior.
      
      * Add precision comparison tool for CUDA operations
      
      This commit introduces a new Python script and CUDA source file for a precision comparison tool that evaluates the accuracy of various CUDA operations (including division, reciprocal, exponential, logarithmic, and trigonometric functions) across different implementations: CUDA Precise, CUDA Fast, Triton, Triton LibDevice, and TileLang. The tool generates test data, executes the operations, and summarizes the error statistics for each implementation against a double precision reference. Additionally, a README file is added to document the results of the comparisons for various operations.
      
      * Add precision comparison tool for CUDA operations
      
      This commit introduces a new precision comparison tool implemented in Python and CUDA, designed to evaluate the accuracy of various mathematical operations (division, reciprocal, exponential, logarithmic, trigonometric, square root, etc.) across different frameworks including CUDA Precise/Fast, Triton, Triton LibDevice, PyTorch, and TileLang. The tool includes functionality for generating test data, executing operations, and summarizing error statistics for each implementation. Additionally, it provides a comprehensive README with error metrics for each operation tested.
      95c373f5
  21. 25 Sep, 2025 1 commit
    • Lei Wang's avatar
      [Language] Support atomic add with ret (#870) · aa0b1090
      Lei Wang authored
      * Add atomic operations for CUDA templates in new atomic.h file
      
      - Introduced atomic functions including AtomicMax, AtomicMin, AtomicAdd, and their return variants for various data types.
      - Implemented support for half, bfloat16, and float types with appropriate memory ordering.
      - Moved atomic-related utilities from common.h to the new atomic.h file for better organization.
      - Added Python bindings for atomic operations in tilelang, including atomic_max, atomic_min, atomic_add, and their vectorized counterparts.
      - Updated customize.py to utilize the new atomic functions, enhancing modularity and maintainability.
      
      * Refactor atomic operations in CUDA templates for improved readability
      
      - Reformatted atomic operation implementations in atomic.h for better code clarity.
      - Adjusted function signatures in tilelang's atomic.py to enhance readability by aligning parameters.
      - Cleaned up unnecessary whitespace and comments in customize.py to streamline the codebase.
      
      * Add thread storage synchronization configuration option
      
      - Introduced a new configuration option `tl.disable_thread_storage_sync` to control the automatic insertion of thread synchronization barriers in shared memory access.
      - Updated the `ThreadSync` pass to check this configuration and bypass synchronization if disabled.
      - Enhanced documentation in `builtin.h` and `pass_config.py` to clarify the purpose and usage of the new option.
      
      * Refactor thread storage sync configuration retrieval
      
      - Simplified the retrieval of the thread storage sync configuration in the `ThreadSync` pass by removing unnecessary intermediate variables.
      - Ensured that the inclusion of `builtin.h` is consistent by moving it to the appropriate location in the file.
      
      * test fix
      
      * Update atomic operations and tests for improved functionality
      
      - Updated atomic operations in CUDA templates to remove unnecessary address_of calls, enhancing performance and readability.
      - Refactored atomic operation signatures in tilelang's atomic.py to accept references instead of pointers.
      - Added new atomic operations and corresponding test cases for atomic add, max, min, and load/store functionalities in the testing suite.
      - Updated the TVM subproject to the latest commit for better compatibility.
      
      * Update attention sink examples to use 32 heads
      
      - Modified the `heads` parameter in both `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py` and `example_mha_sink_fwd_bhsd_wgmma_pipelined.py` from 1 to 32 to enhance performance in attention mechanisms.
      - Ensured consistency across example scripts for improved usability and testing.
      
      * Refactor atomic add handling in vectorization
      
      - Simplified the extraction of buffer loads for atomic add operations by removing unnecessary address_of calls, improving code clarity and performance.
      - Updated the data type retrieval for vectorization size calculation to directly access the buffer load node, enhancing efficiency.
      
      * Add loop break functionality and enhance thread synchronization
      
      - Introduced a new `loop_break` function in `customize.py` to allow breaking out of loops, returning a call to the `tl.loop_break` intrinsic.
      - Updated the `sync_threads` function in `builtin.py` to accept optional parameters for `barrier_id` and `arrive_count`, improving its flexibility for thread synchronization.
      - Added necessary imports in `__init__.py` to include the new `loop_break` function for broader accessibility.
      
      * test fix
      aa0b1090
  22. 18 Sep, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Turn off `ENABLE_FAST_MATH` by default (#846) · e7e38355
      Lei Wang authored
      * [Enhancement] Enable fast math optimization in tilelang JIT configurations
      
      - Updated multiple examples and kernel functions to include `pass_configs` for enabling fast math optimization.
      - Added support for the `TL_ENABLE_FAST_MATH` configuration option in the built-in operations.
      - Enhanced the `LibraryGenerator` to handle the new fast math configuration, ensuring compatibility with existing settings.
      - Updated documentation to reflect the changes in fast math handling and deprecation of the `TL_DISABLE_FAST_MATH` option.
      
      * lint fix
      
      * [Refactor] Introduce deprecated_warning utility for improved deprecation handling
      
      - Added a new `deprecated_warning` function to streamline deprecation messages.
      - Updated the `LibraryGenerator` to utilize the new function for warning about the deprecated `TL_DISABLE_FAST_MATH` configuration.
      - Enhanced the `deprecated` decorator to support phaseout version messaging, improving clarity for users.
      e7e38355
  23. 15 Sep, 2025 2 commits
    • Yu Cheng's avatar
      [Refactor] Update TVM subproject and refactor BlockNode handling in... · 8b005226
      Yu Cheng authored
      [Refactor] Update TVM subproject and refactor BlockNode handling in warp_specialized_rewriter.cc (#812)
      
      * [Feature] Introduce custom warp specialization attribute and enhance warp group register allocation
      
      - Added a new attribute `kCustomWarpSpecialization` to support custom warp specialization in the TileLang framework.
      - Updated the `Collect` method in `SetMaxNRegCollector` to handle cases where warp specialization is detected, returning an empty array accordingly.
      - Enhanced the `SetMaxNRegInjector` to skip processing when no registers are needed, improving efficiency.
      - Modified the `WarpSpecialized` pass to include the new attribute in the function body when warp specialization is enabled, ensuring proper handling in transformations.
      
      * lint
      
      * lint
      8b005226
    • botbw's avatar
      [feat] support gemm_sp for ampere and ada arch (#691) · 0b3683bf
      botbw authored
      
      
      * [feat] add an example mma atom
      
      * [fix] fix typo naming
      
      * [feat] add a template to enable compilation
      
      * [feat] add print util
      
      * [WIP] pass on single block tile
      
      * [feat] add sm80 metadata layout
      
      * [chore] clean codebase
      
      * [CI] format.sh
      
      * [feat] add sm80 compress utils
      
      * [bugfix] fix C fragment layout
      
      * [refactor] use nvcc version instead of str
      
      * [test] add test cases
      
      * [chore] add a param check
      
      * [chore] format a bit
      
      * [chore] rename func to satisfy PEP 8 and appease gemini
      
      * [chore] add check
      
      * [feat] support sm75 layout && add assertion && chore
      
      * [bug] fix illegal memory access when using two warps over N=32
      
      This could be a missing check related to cutlass 2.x implementation.
      Using the cutlass example can't trigger this cause it's bypassed by
      padding the input.
      
      For now I think it might be safe to increase the atom size and inve-
      sgate in the future.
      
      * [chore] add example
      
      * [chore] format
      
      * [example] update benchmark
      
      * [bugfix] fix namespace and format
      
      * [bugfix] fix incorrect param passing
      
      * [refactor] update variable declaration for clarity in gemm_layouts and gemm_sp
      
      * [Cleanup] Remove unnecessary blank lines in metadata layout functions in gemm_sp.py
      
      * [CI] fix arch
      
      * [example] add torch sparse benchmark
      
      * [misc] polish && add reference && apply review suggestionsi && format
      
      * [CI] format with clang-tidy
      
      * [Cleanup] Format and align template struct definitions in half.hpp, common.h, and gemm_sp_sm80.h
      
      * [Update] Modify CUDA version requirements in test_gemm_sp_sm80 and mark cutlass subproject as dirty
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      0b3683bf
  24. 14 Sep, 2025 1 commit
    • Yu Cheng's avatar
      [Feature] Add ptx_cp_async_barrier_noinc intrinsic and related functionality (#809) · ae9b7063
      Yu Cheng authored
      - Introduced a new intrinsic `ptx_cp_async_barrier_noinc` for handling the `cp.async.mbarrier.arrive.noinc` operation in TileLang.
      - Updated the CUDA code generation to support the new barrier operation.
      - Added a corresponding function in the TileLang Python API for ease of use.
      - Enhanced the barrier handling in CUDA templates to include the new no-increment operation, improving synchronization capabilities in parallel execution contexts.
      ae9b7063
  25. 11 Sep, 2025 1 commit