1. 21 Dec, 2025 1 commit
    • [Refactor] Phase out PassConfig `kDisableDynamicTailSplit` and `kDynamicAlignment` as they are legacy (#1486) · a874e4e8
      Lei Wang authored
      
      * [Cleanup] Remove dynamic shape example and related tests
      
      * Deleted the dynamic shape example script `example_dynamic.py` and its corresponding test file `test_example_dynamic.py` to streamline the codebase.
      * Removed unused dynamic tail split and dynamic alignment configurations from `builtin.h` and `pass_config.py`.
      * Cleaned up the dynamic shape testing files to eliminate redundancy and improve maintainability.
      
      * build fix
  2. 20 Dec, 2025 1 commit
    • [Enhancement] Enhance let binding handling in layout inference and warp specialized pass (#1484) · 7e8d1f82
      Lei Wang authored
      * [Feature] Add FullyReplicated Fragment Layout and Enhance Layout Inference
      
      * Introduced a new static method `FullyReplicated` in the `Fragment` class to create fully replicated fragment layouts, ensuring all threads hold identical copies of the buffer.
      * Updated `CopyNode` to collect fragment layouts and mark them as fully replicated during layout inference.
      * Enhanced `ParallelOpNode` to expand let bindings for fragment buffer accesses, improving layout inference accuracy.
      * Added documentation for new methods and updated existing methods to support the new layout features.
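      
      A quick pure-Python picture of the "fully replicated" idea above (an illustration of the concept only, not the C++ `Fragment` API):
      
      ```python
      # Illustration: a replicated layout maps (thread, local_idx) to a logical
      # index without consulting `thread`, so every thread holds identical data.
      def fully_replicated_layout(num_threads: int, buffer_size: int):
          return lambda thread, local_idx: local_idx
      
      layout = fully_replicated_layout(num_threads=32, buffer_size=8)
      assert layout(0, 5) == layout(31, 5) == 5  # all threads see element 5
      ```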
      
      * lint fix
      
      * Remove debug logging statements from layout inference process to streamline output and improve performance.
  3. 19 Dec, 2025 7 commits
    • [Enhancement] Use static Z3 context (#1482) · 168aec7b
      Lei Wang authored
      * use static Z3 context
      
      * Update submodule reference for TVM to indicate a dirty state
    • [News] update with latest news (#1475) · 2217eb74
      Lei Wang authored
      * Update README.md with latest news, including CuTeDSL backend support, Z3 theorem prover integration, and migration to apache-tvm-ffi for improved compatibility.
      
      * Update README.md to enhance CuTeDSL backend announcement with a link to related issue and clarify migration benefits to apache-tvm-ffi, reducing CPU overhead.
    • [Language] Enhance T.dtype.as_torch conversion for compatibility (#1473) · 3516f1ee
      Lei Wang authored
      * [Language] Enhance dtype conversion for PyTorch compatibility
      
      - Added support for new float8 and float4 data types in the __dtype_as_torch__ method.
      - Implemented backend-specific handling for float8_e4m3 based on HIP or CUDA.
      - Included assertions to ensure compatibility with the required PyTorch versions for each dtype.
      - Improved error handling for unsupported dtypes.
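      
      A minimal sketch of the conversion pattern described above; the helper name, the version threshold, and the exact HIP/CUDA split below are assumptions, not the real `__dtype_as_torch__` body:
      
      ```python
      # Hedged sketch, not the actual implementation. The version number is an
      # assumption; float8 dtypes appeared in PyTorch around the 2.1 release.
      import torch
      from packaging import version
      
      def dtype_as_torch(name: str, backend: str = "cuda") -> torch.dtype:
          if name.startswith("float8"):
              assert version.parse(torch.__version__) >= version.parse("2.1.0"), (
                  f"dtype {name!r} needs torch >= 2.1.0, got {torch.__version__}")
              if name == "float8_e4m3":
                  # Backend-specific handling: HIP uses the *fnuz* variant.
                  return torch.float8_e4m3fnuz if backend == "hip" else torch.float8_e4m3fn
              if name == "float8_e5m2":
                  return torch.float8_e5m2
          table = {"float16": torch.float16, "bfloat16": torch.bfloat16,
                   "float32": torch.float32, "int32": torch.int32}
          if name not in table:
              raise ValueError(f"unsupported dtype: {name}")  # improved error path
          return table[name]
      ```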
      
      * Fix test script execution and improve error messages for dtype assertions
      
      - Commented out the main execution call in the test script and replaced it with a direct call to the test function `test_divmod()`.
      - Enhanced error messages in the dtype conversion assertions to improve clarity and readability, ensuring proper guidance for required PyTorch versions.
    • [Refactor] Remove triton dependence in testing & move triton baseline into examples (#1470) · 95e3b5a7
      silentCoder-dev authored
      * remove triton dependence in testing & move triton baseline into example
      
      * use ceildiv and handle arbitrary M correctly for triton
    • [Language] Make TL scripts friendly to Python syntax highlights (#1466) · 1a3a64fb
      Chaofan Lin authored
      * [Language] Make TL scripts friendly to Python syntax highlights
      
      * add comments
      
      * fix submodule
    • [ArgBinder] Enhance shape variable handling and assertions (#1467) · f6db2014
      Lei Wang authored
      * feat(arg_binder): enhance shape variable handling and assertions
      
      - Implemented special handling for comparing if_then_else expressions to simplify conditions involving NULL checks.
      - Added methods to set shared shape variables and finalize deferred bindings, generating cascading if_then_else expressions and runtime assertions for non-NULL buffers.
      - Updated the binding logic to defer shape variable bindings for shared variables, ensuring proper handling across multiple nullable buffers.
      
      * refactor(arg_binder): clean up shape variable handling and remove unused code
      
      - Removed deprecated methods for setting shared shape variables and finalizing deferred bindings, streamlining the argument binding process.
      - Simplified the logic for handling shape values in the `BindDLTensor` function, ensuring immediate binding for normal shape variables.
      - Enhanced clarity by eliminating unnecessary comments and code related to cascading if_then_else expressions for shared variables.
      
      * refactor(arg_binder): enhance DLTensor binding with improved shape handling
      
      - Replaced the single `BindDLTensor` method with `BindDLTensors` to support multiple buffers, improving flexibility in handling DLTensor bindings.
      - Introduced a two-pass approach for shape variable handling, allowing for better management of symbolic dimensions and null checks.
      - Updated the logic to assert non-null conditions at runtime and utilize cascaded if_then_else expressions for shape retrieval, enhancing robustness.
      - Removed deprecated code and streamlined the binding process for clarity and maintainability.
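      
      The two-pass idea above, modeled in plain Python (the real pass emits cascaded `if_then_else` TIR expressions plus runtime assertions; the descriptor fields `.data`/`.shape` here are assumptions):
      
      ```python
      # Model of the cascaded lookup for a symbolic dim shared by nullable buffers.
      # Assumption: each descriptor has `.data` (None when the DLTensor is NULL)
      # and `.shape`.
      def resolve_shared_dim(buffers, axis):
          value = None
          for buf in buffers:  # one if_then_else level per candidate buffer
              if buf.data is None:
                  continue
              if value is None:
                  value = buf.shape[axis]          # first non-NULL buffer defines it
              else:
                  assert buf.shape[axis] == value  # runtime assertion in real codegen
          assert value is not None, "all candidate buffers were NULL"
          return value
      ```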
      
      * fix(test_nullable_buffer_params): improve formatting and consistency in test output
      
      - Updated string formatting for better readability in the `test_nullable_shared_shape` function.
      - Ensured consistent use of double quotes for string literals.
      - Added a missing newline at the end of the file for proper formatting.
      
      * refactor(arg_binder): simplify allocation size calculation in BindDLTensors
      
      - Streamlined the calculation of allocation size by replacing a lambda function with a direct loop, enhancing readability and maintainability.
      - Improved clarity in the null check message for data pointers, ensuring better understanding of the binding process.
      
      * Remove debug prints from phase.py
      
      Removed debug print statements after MakePackedAPI transformation.
    • [Refactor] Rename test for curand & add triton baseline in `test_tilelang_language_rand.py` (#1464) · f0672603
      silentCoder-dev authored
      * rename test for curand & add triton baseline
      
      * add a comment for calling T.rng_rand() four times
      
      * refactor tilelang&triton kernel
      
      * Add boundary checks for M not divisible by 128
  4. 18 Dec, 2025 3 commits
    • feat(cutedsl): add CuTeDSL backend (#1421) · 7248a810
      Gabriel Wu authored
      
      
      * feat: CuTeDSL backend
      
      * fix: clang-tidy
      
      * fix: clang-format
      
      * fix: ci
      
      * fix: revert example gemm fp8
      
      * fix: remove duplicate code
      
      * fix: switch-case
      
      * fix: fp16 silence
      
      * fix: TVM IR print
      
      * fix: useless tir
      
      * fix: clang-format
      
      * fix: remove tilelang/contrib/cutedsl/.gitignore
      
      * fix: use hexfloat
      
      * fix: gsym guard
      
      * fix: unknown storage sync type
      
      * fix: string literal
      
      * fix: add args guard
      
      * fix: name hint dedup
      
      * fix: better find_kernel_by_pattern
      
      * fix: set libpath for from_database path
      
      * fix: guard buffer.strides
      
      * fix: from guard
      
      * fix: eviction guard
      
      * fix: use thread local tma descs
      
      * fix: ruff
      
      * fix: drop tma_init_cpp
      
      * fix: exc_info
      
      * fix: negative unmatch early return
      
      * fix: rename postproc func and add test
      
      * fix: handle fast math according to pass config
      
      * fix: dyn_sym parse
      
      * fix: wrap_forward
      
      * fix: use tvm_ffi.libinfo instead of cli
      
      * fix: keep signature
      
      * fix: C++ string safety
      
      * fix: mark tma_store_add as unsupported
      
      * fix: tvm version
      
      * resolve ldsm and cpasync issues.
      
      * fix: minor fixes
      
      * fix: parse signature using ast
      
      * fix: guard global_addr
      
      * fix: create tempfile only when necessary
      
      * fix: use logger.execption for exceptions
      
      * fix: guard lib_path and host_func
      
      * fix: remove tma_cpp_init and add timeout for cpp compile
      
      * add timeout for mbarrier_wait.
      
      * fix: _load_kernel_from_disk signature
      
      * resolve codegen issues.
      
      * fix: logger.exception
      
      * add comment for div_by=1
      
      * merge
      
      * fix: reserve cutlass,cute,tl
      
      * fix: guard tma_store
      
      * fix: allow int64 offset in make_tensor_at_offset
      
      * fix: guard barrier
      
      * fix: add comments for div_by=16
      
      * fix: div_by=1 issue
      
      * delete div_by when offset is 0
      
      * use tl.make_tensor when offset is 0
      
      * fix: explicitly check cutedsl target
      
      * fix: use param.torch_dtype()
      
      ---------
      Co-authored-by: yuxic <yuxic@nvidia.com>
      Co-authored-by: Yong <yong@local>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • remove unused duplicated type check (#1462) · a6f59f31
      Jinjie Liu authored
      
      Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
    • [Language] Adds a random number generation capability through curand_kernel (#1461) · cae06edd
      silentCoder-dev authored
      
      
      * add curand.{curand_init, curand}
      
      * run format.sh
      
      * add default value for curand_init & add test for curand
      
      * Update testing/python/language/test_rand.py
      
      Remove unused thread binding
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      
      * remove unused library
      
      * enable tilelang cache for testing
      
      * run format.sh
      
      * Revert "run format.sh"
      
      This reverts commit 5afaff782f31cdf653e2c45b469da8dead228b8a.
      
      * Revert "enable tilelang cache for testing"
      
      This reverts commit c277a43e77938bd88d47a108dd1bd65734d4a1ae.
      
      * Revert "remove unused library"
      
      This reverts commit 568ad20611f039380113937fd131151a2bffd801.
      
      * run format.sh
      
      * ensure FreshName for __philox_state
      
      * ensure FreshName for __philox_state
      
      * change the return type of T.rng_init
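      
      A hedged sketch of how the new primitives might appear in a kernel; the exact signatures of `T.rng_init` and `T.rng_rand` are as this PR landed them and may differ from what is assumed below:
      
      ```python
      import tilelang
      import tilelang.language as T
      
      @tilelang.jit
      def rand_kernel(M: int = 1024, block_M: int = 128):
          @T.prim_func
          def main(Out: T.Tensor((M,), "uint32")):
              with T.Kernel(T.ceildiv(M, block_M), threads=block_M) as bx:
                  tx = T.get_thread_binding()
                  T.rng_init(42)               # assumed: seeds a per-thread Philox state
                  idx = bx * block_M + tx
                  if idx < M:
                      Out[idx] = T.rng_rand()  # assumed: one 32-bit draw per call
          return main
      ```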
      
      ---------
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
  5. 17 Dec, 2025 11 commits
    • [Cache] Rename sparse compress cache directory (#1460) · 48e70e68
      Lei Wang authored
      * Enhance cache directory structure by including version information in sparse.py to ensure separate caches for different versions.
      
      * Fix formatting in sparse.py by adding a newline for improved readability and consistency.
    • [Analyzer] Fix floordiv & floormod bug in z3 prover (#1458) · 91cf7966
      Kuris authored
      * fix floordiv & floormod in z3 prover
      
      * fix lint error
    • 3ee0939b
      Kuris authored
    • 6aaf3c7a
      Lei Wang authored
    • Lei Wang authored
    • [Enhancement] Update examples and tests for improved type handling functionality (#1448) · c750fb8a
      Lei Wang authored
      * [Enhancement] Update examples and tests for improved type handling and functionality
      
      - Enhanced various example scripts to support new data types and improve compatibility with PyTorch.
      - Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling.
      - Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management.
      - Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling.
      
      * [Refactor] Update accumulation data type to float32 across examples
      
      - Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability.
      - This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board.
      
      * [Refactor] Standardize data type usage across benchmark scripts
      
      - Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling.
      - Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard.
      - Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules.
      
      * [Refactor] Standardize data type usage in templates and scripts
      
      - Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity.
      - Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates.
      - This change aims to streamline type handling and improve compatibility with existing workflows.
      
      * [Refactor] Standardize data type usage in examples and benchmarks
      
      - Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability.
      - Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard.
      - Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples.
      
      * [Refactor] Import dtypes from language.v2 module
      
      - Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase.
      - This change aims to streamline data type management and improve overall code clarity.
      
      * fix
      
      * [Refactor] Standardize data type usage across scripts
      
      - Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity.
      - Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability.
      - This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase.
      
      * [Refactor] Update data type handling for consistency and clarity
      
      - Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency.
      - Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase.
      - Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling.
      - This refactor aims to streamline data type management and improve overall code clarity and maintainability.
      
      * [Enhancement] Improve data type handling and error messaging
      
      - Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation.
      - Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs.
      - Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience.
      
      * [Fix] Correct boolean flag in GEMM SP test case
      
      - Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case.
      - This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation.
      
      * [Refactor] Standardize data type usage across scripts
      
      - Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability.
      - Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase.
      - This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability.
      
      * [Refactor] Standardize data type usage in various modules
      
      - Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability.
      - Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase.
      - This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations.
      
      * [Refactor] Update argument parsing for data types in benchmarks
      
      - Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
      - This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability.
      
      * [Refactor] Update data type handling in benchmark and example scripts
      
      - Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency.
      - Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase.
      - This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples.
      
      * [Refactor] Fix data type conversion in multiple scripts
      
      - Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts.
      - This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations.
      
      * [Refactor] Update float8 data type usage across multiple scripts
      
      - Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling.
      - This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations.
      
      * [Refactor] Enhance float8 data type handling in CUDA code generation
      
      - Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic.
      - Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase.
      - Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation.
      - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
      
      * [Refactor] Streamline float8 data type handling in CUDA and related modules
      
      - Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks.
      - Updated layout inference for float8 types to improve clarity and maintainability across the implementation.
      - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
      
      * [Refactor] Remove unnecessary cache disabling in float8 example script
      
      - Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code.
      - This change enhances clarity and maintainability of the example script without affecting its functionality.
      
      * [Refactor] Update data type usage in debug print tests
      
      - Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type.
      - This update enhances consistency in data type handling within the test suite, improving clarity and maintainability.
      
      * lint fix
      
      * Update function parameter types from `str` to `T.dtype` for improved type safety in attention sink and related examples
      
      * Refactor `gemv_alloc_reducer` function signature for improved readability by formatting parameters across multiple lines.
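      
      For reference, the two dtype spellings this series moves between, side by side (a trivial sketch, not taken from the diff):
      
      ```python
      import tilelang.language as T
      
      @T.prim_func
      def copy(A: T.Tensor((128,), T.float16),   # dtype-object spelling
               B: T.Tensor((128,), "float16")):  # string spelling, equivalent
          with T.Kernel(1, threads=128) as bx:
              tx = T.get_thread_binding()
              B[tx] = A[tx]
      ```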
    • Update ROCm CI to Nightly-ROCm-7.1 (#1449) · 0c25c4f3
      Gongen-Ali authored
    • f914f2d7
      Kuris authored
    • [Language] Introduce `T.annotate_restrict_buffers` (#1428) · 0814b171
      Lei Wang authored
      * [Enhancement] Introduce non-restrict parameter support in code generation
      
      - Added a new PrimFunc-level attribute `tl.non_restrict_params` to specify handle Vars that should not be marked with the restrict qualifier during code generation.
      - Updated `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to handle non-restrict parameters, ensuring proper treatment of overlapping buffer aliases.
      - Implemented a new annotation function `annotate_restrict_buffers` to facilitate the marking of buffer parameters as non-restrict.
      - Enhanced the `SplitHostDevice` transformation to propagate non-restrict parameters from host to device functions.
      - Added a new transform function `HoistNonRestrictParams` to manage non-restrict parameters effectively.
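      
      A hedged usage sketch of the idea above; the exact argument form of `T.annotate_restrict_buffers` is an assumption here:
      
      ```python
      import tilelang.language as T
      
      @T.prim_func
      def axpy_inplace(A: T.Tensor((128,), "float32"),
                       B: T.Tensor((128,), "float32")):
          # Assumed call form: mark A and B as possibly aliasing, so codegen
          # does not emit the restrict qualifier for their handles.
          T.annotate_restrict_buffers([A, B])
          with T.Kernel(1, threads=128) as bx:
              tx = T.get_thread_binding()
              A[tx] = A[tx] + B[tx]
      ```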
      
      * [Enhancement] Improve HoistNonRestrictParams transformation
      
      - Updated the HoistNonRestrictParams function to recursively collect all `tl.non_restrict_params` annotations from nested blocks, enhancing flexibility in annotation placement.
      - Introduced a new NonRestrictCollector class to manage the collection and deduplication of non-restrict parameters.
      - Modified the SplitHostDevice transformation to remove the non-restrict attribute from the host-side PrimFunc after propagation to device kernels.
      - Adjusted the LowerAndLegalize function to directly apply the HoistNonRestrictParams transformation without exception handling, streamlining the process.
      
      * [Refactor] Simplify non-restrict parameter handling in code generation
      
      - Removed unnecessary normalization logic and associated data structures from `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP`.
      - Streamlined the handling of non-restrict parameters by directly inserting them into the `non_restrict` set, improving code clarity and maintainability.
      - Updated conditional checks to eliminate redundant checks against normalized names, enhancing performance and readability.
      
      * [Dependency] Update TVM subproject to latest commit 68aa8461
      
      - Updated the TVM subproject to the latest commit, ensuring compatibility with recent changes and improvements.
      - Refactored non-restrict parameter handling in `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to enhance code clarity and maintainability.
      - Adjusted the `SplitHostDevice` transformation to streamline the propagation of non-restrict parameters.
      
      * fix
    • [Bugfix] Improve autotune from elementwise_add function in examples (#1445) · f4f87f46
      senlyu163 authored
      * Remove JIT decorator from elementwise_add function in examples
      
      * fix kernel compilation without autotune
      
      * Refactor main function to accept parameters and update tests for autotune option
      
      * Refactor autotune test function for modern style
    • [Feat] Integrate Z3 in TVM Arith Analyzer (#1367) · 9c21586b
      Kuris authored
  6. 16 Dec, 2025 4 commits
    • Chaofan Lin authored
    • [Refactor] Use `pytest.mark.parameterize` to speedup parallel testing (#1447) · 0b6336b5
      Kuris authored
      
      
      * Refactor GEMM tests to use parameterized pytest fixtures
      
      - Converted multiple test cases for GEMM operations in `test_tilelang_tilelibrary_gemm_sp.py` to use `pytest.mark.parametrize` for better maintainability and readability.
      - Similar refactoring applied to `test_tilelang_tilelibrary_gemm_sp_v2.py`, consolidating test cases for `run_gemm_ss`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` into parameterized tests.
      - This change reduces code duplication and enhances the clarity of test configurations.
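      
      The refactor pattern in miniature (shapes, dtypes, and the `run_gemm_ss` runner below are placeholders, not the suite's real configurations):
      
      ```python
      import pytest
      
      @pytest.mark.parametrize("M,N,K", [(128, 128, 32), (256, 256, 64)])
      @pytest.mark.parametrize("in_dtype", ["float16", "bfloat16"])
      def test_gemm_ss(M, N, K, in_dtype):
          # One parameterized test replaces many near-identical test functions,
          # and each case can run in parallel under pytest-xdist.
          run_gemm_ss(M, N, K, in_dtype)  # hypothetical runner from the module
      ```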
      
      * Update testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      
      ---------
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
    • [Refactor] Reduce direct dependency on PyTorch due to its limited type support (#1444) · dda45126
      Lei Wang authored
      
      
      * [Enhancement] Update KernelParam to use tvm.DataType directly and add torch_dtype conversion method
      
      - Changed dtype in KernelParam from torch.dtype to tvm.DataType to support a wider range of data types and prevent information loss during conversions.
      - Added a new method, torch_dtype, to convert tvm.DataType back to torch.dtype for tensor creation.
      - Updated various adapters to utilize the new torch_dtype method for parameter type conversion during initialization.
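      
      The shape of the change, sketched below (class and method bodies are assumptions; only the keep-tvm.DataType-and-convert-late idea is from the commit):
      
      ```python
      import torch
      from tvm import DataType
      
      class KernelParam:  # sketch only, not the real class body
          def __init__(self, dtype: DataType):
              self.dtype = dtype  # keep tvm.DataType; no lossy torch round-trip
      
          def torch_dtype(self) -> torch.dtype:
              # Convert only at tensor-creation time (table deliberately partial).
              table = {"float16": torch.float16, "float32": torch.float32,
                       "int8": torch.int8, "int32": torch.int32}
              return table[str(self.dtype)]
      
      assert KernelParam(DataType("float16")).torch_dtype() is torch.float16
      ```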
      
      * [Enhancement] Refactor CUDA type handling and add support for FP4 and FP8 types
      
      - Renamed functions for clarity: GetFP8Type, GetFP6Type, and GetFP4Type are now GetTileLangFP8Type, GetTileLangFP6Type, and GetTileLangFP4Type respectively.
      - Enhanced FP4 type handling to support additional lane sizes (2, 4, 8, 16, 32, 64).
      - Updated CUDA code generation to include new FP8 and FP4 types, ensuring proper type handling in PrintType and related functions.
      - Introduced new structures for FP8 types in cuda_fp8.h to facilitate better memory management and type packing.
      - Added methods in KernelParam and tensor utilities to recognize and handle float4 types, improving compatibility with PyTorch.
      - Enhanced logging for debugging purposes in various CUDA functions to track type handling and memory operations more effectively.
      
      * lint fix
      
      * Remove unnecessary logging statements from CUDA code generation and delete obsolete matrix multiplication test file.
      
      * [Enhancement] Add support for FP4 and FP8 types in CUDA code generation
      
      - Enhanced PrintVecElemLoad and PrintVecElemStore functions to handle new FP4 types.
      - Updated arg_binder to allow float4 to match int8 at runtime, improving compatibility with PyTorch.
      - Modified loop_vectorize to account for buffer dtype lanes in vectorization calculations.
      - Refactored tensor type mapping to support new float4 and float8 types, ensuring correct type handling in tensor operations.
      - Added tests for FP4 and FP8 copy operations to validate functionality and integration with existing workflows.
      
      ---------
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
    • [Fix] Fix analyzer bind conflicting (#1446) · 81b8c1b7
      Kuris authored
  7. 15 Dec, 2025 11 commits
    • [Feature] Support region as input of T.cumsum (#1426) · 869f021b
      Dayuxiaoshui authored
      
      
      * [Feature] Support region as input of T.cumsum
      
      - Extend T.cumsum to accept BufferRegion and BufferLoad inputs in addition to Buffer
      - This enables operations on buffer slices/regions like:
        T.cumsum(InputG_fragment[i * chunk_size:(i + 1) * chunk_size], dim=0)
      - Update cumsum_fragment to handle region inputs properly
      - Add comprehensive tests for 1D and 2D region inputs including normal and reverse modes
      
      Fixes #879
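      
      A sketch shaped after the new tests (kernel scaffolding is an assumption; the region call form is from this PR):
      
      ```python
      import tilelang.language as T
      
      chunk = 32
      
      @T.prim_func
      def chunked_cumsum(X: T.Tensor((128,), "float32"),
                         Y: T.Tensor((128,), "float32")):
          with T.Kernel(1, threads=128) as bx:
              frag = T.alloc_fragment((128,), "float32")
              T.copy(X, frag)
              for i in T.serial(4):
                  # New in this PR: a BufferRegion (slice) as the cumsum input.
                  T.cumsum(frag[i * chunk:(i + 1) * chunk], dim=0)
              T.copy(frag, Y)
      ```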
      
      * Fix formatting and add docstring for cumsum_fragment
      
      - Add comprehensive docstring for cumsum_fragment function
      - Format code according to ruff style guidelines
      
      * Fix CodeRabbit review issues
      
      - Fix negative dimension bounds check (dim < -len(shape) instead of dim <= -len(shape))
      - Add src/dst shape compatibility validation for out-of-place cumsum
      - Update copy() type annotation to accept BufferRegion as dst parameter
      - Fix test in-place mutation issues by using out-of-place cumsum operations
      - Add non-divisible size test cases for tail region coverage
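      
      The bounds fix in isolation: for a rank-r tensor the valid negative dims are -r..-1, so -r itself must be accepted and the reject condition is `dim < -r`:
      
      ```python
      def normalize_dim(dim: int, shape) -> int:
          r = len(shape)
          if dim < -r or dim >= r:  # `dim <= -r` would wrongly reject dim == -r
              raise ValueError(f"dim {dim} out of range for rank-{r} tensor")
          return dim % r  # maps -1 -> r-1, and -r -> 0
      
      assert normalize_dim(-2, (4, 8)) == 0  # rejected under the old check
      ```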
      
      * Fix out-of-bounds access in region tests
      
      - Add bounds clamping using T.min() for chunk_end calculations
      - Prevents accessing beyond tensor bounds for non-divisible sizes
      - Matches reference implementation behavior
      - Fixes both 1D and 2D region test cases
      
      * Fix region test: use simple slice expressions instead of T.min()
      
      - Remove T.min() which cannot be used directly in slice indices
      - Use chunk_start + chunk_size form instead
      - Rely on system's automatic bounds checking for non-divisible sizes
      - Update comments to reflect this approach
      
      * Fix cumsum region: use region extents in lowering and update tests for shared memory
      
      * Simplify fragment scope check using is_fragment()
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • bcae814e
      Xiangwen Wang authored
    • [Enhancement] Refactor vectorization checks in loop_vectorize (#1440) · e387102c
      Lei Wang authored
      * Introduced a new function, IsExprInvariantInVectorBoundary, to encapsulate the logic for checking if an expression is invariant within vector boundaries, improving code clarity and reusability.
      * Updated the existing vectorization logic to utilize this new function, streamlining the process of determining vectorization feasibility based on boundary conditions.
      * Enhanced comments for better understanding of the vectorization criteria and mathematical rationale behind the checks.
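      
      The invariance criterion, checked in plain Python: within one vector iteration the lanes are v = 0..V-1, and an index expression such as (base + v) // V takes a single value whenever base is a multiple of V, so a boundary condition built from it cannot vary across the vector:
      
      ```python
      V = 4  # vector width
      for base in range(0, 32, V):  # bases aligned to the vector boundary
          assert len({(base + v) // V for v in range(V)}) == 1  # invariant across lanes
      ```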
    • [CI] Update lint dependencies and fix lint on trunk (#1433) · 4dbc910d
      Xuehai Pan authored
      * [CI] Update pre-commit hooks
      
      * [Lint] Pass correct `exclude-header-filter` to `clang-tidy`
      
      * [Lint] Download latest `run-clang-tidy` script
      
      * [CI] Show compile commands
      
      * [CI] Add output grouping to GHA
      
      * [Lint] Re-order pre-commit hooks
    • [Enhancement] Include PrimFunc name in memory cache logs for better debugging (#1437) · b8003a28
      Lei Wang authored
      * Added the `get_prim_func_name` utility to extract human-readable function names from TVM PrimFuncs.
      * Updated memory cache logging in `AutoTuner` and `KernelCache` classes to include the kernel name, improving clarity during cache hits.
      * Enhanced debug logging to provide more informative messages when checking disk cache for kernels.
    • [Enhancement] Improve InjectAssumes logic and make assumes work after SplitHostDevice (#1405) · 2feaa41e
      Chaofan Lin authored
      * [Refactor] Refactor InjectAssumes logic and make assumes work after SplitHostDevice
      
      * address comments
      
      * fix
      
      * fix submodule
      
      * fix
      
      * fix 3rdparty
    • [Enhancement] Improve buffer usage tracking in MakePackedAPI (#1435) · 0788feb8
      Lei Wang authored
      * Added detailed logging for data and shape variable parameters during buffer usage detection in the MakePackedAPI function.
      * Refactored the UsedBufferDetector to differentiate between used parameters by data and shape variables, enhancing clarity in buffer management.
      * Updated logic to ensure minimal carrier buffers are selected for shape symbols, improving the efficiency of parameter handling.
    • [Bugfix] Convey `compile_flags` to ffi compilation path with pass_configs (#1434) · fba12a5f
      Lei Wang authored
      * [Enhancement] Add device compile flags support in pass configuration
      
      * Introduced `kDeviceCompileFlags` option in the pass configuration to allow additional device compiler flags for CUDA compilation.
      * Updated the `tilelang_callback_cuda_compile` function to merge extra flags from the pass configuration, enhancing flexibility in compiler options.
      * Modified the `JITKernel` class to handle device compile flags appropriately, ensuring they are included during compilation.
      * Documented the new pass configuration key for clarity on usage and expected input formats.
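      
      A hedged usage sketch; the Python-side key string standing in for `kDeviceCompileFlags` below is an assumption:
      
      ```python
      import tilelang
      import tilelang.language as T
      
      @T.prim_func
      def fill(A: T.Tensor((128,), "float32")):
          with T.Kernel(1, threads=128) as bx:
              tx = T.get_thread_binding()
              A[tx] = 1.0
      
      # Extra flags are merged into the device compiler invocation by
      # tilelang_callback_cuda_compile (key string assumed, not verified).
      kernel = tilelang.compile(
          fill, pass_configs={"tl.device_compile_flags": ["-lineinfo"]})
      ```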
      
      * lint fix
      
      * [Refactor] Simplify compile_flags handling in JIT functions
      
      * Removed redundant string check for compile_flags in the compile, jit, and lazy_jit functions, ensuring compile_flags is consistently treated as a list.
      * Updated the JITKernel class to handle compile_flags as a list when a string is provided, enhancing code clarity and maintainability.
      
      * lint fix
      
      * fix
    • [CI]: Bump actions/download-artifact from 6 to 7 (#1432) · 87e9e170
      dependabot[bot] authored
      Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 6 to 7.
      - [Release notes](https://github.com/actions/download-artifact/releases)
      - [Commits](https://github.com/actions/download-artifact/compare/v6...v7)
      
      ---
      updated-dependencies:
      - dependency-name: actions/download-artifact
        dependency-version: '7'
        dependency-type: direct:production
        update-type: version-update:semver-major
      ...
      Signed-off-by: dependabot[bot] <support@github.com>
      Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    • [CI]: Bump actions/upload-artifact from 5 to 6 (#1431) · 3aa6938f
      dependabot[bot] authored
      Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 5 to 6.
      - [Release notes](https://github.com/actions/upload-artifact/releases)
      - [Commits](https://github.com/actions/upload-artifact/compare/v5...v6)
      
      ---
      updated-dependencies:
      - dependency-name: actions/upload-artifact
        dependency-version: '6'
        dependency-type: direct:production
        update-type: version-update:semver-major
      ...
      Signed-off-by: dependabot[bot] <support@github.com>
      Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    • [Refactor] Phase out the primitives folder since its design has been merged into tileop (#1429) · 89521e63
      Lei Wang authored
      * Phase out primitives
      
      * revert changes
      
      * Refactor GemmWarpPolicy method signature for clarity
      
      Updated the `from_warp_partition` method in the `GemmWarpPolicy` class to return the type `GemmWarpPolicy` instead of a string, enhancing type safety and clarity in the codebase. Removed an unnecessary blank line for improved readability.
      
      * fix
  8. 13 Dec, 2025 2 commits
    • [CUDA] Add read-only parameter annotation for CUDA codegen (#1416) · 00dd7388
      Lei Wang authored
      * [Enhancement] Add read-only parameter annotation for CUDA codegen
      
      * Introduced the `AnnotateReadOnlyParams` transformation to annotate read-only handle parameters in PrimFuncs, enabling the generation of `const` qualifiers in CUDA codegen.
      * Updated `PrintFunctionSignature` and `AddFunction` methods to utilize the new attribute `tl.readonly_param_indices`, enhancing performance by allowing read-only cache loads.
      * Modified the optimization pipeline to include the new annotation step, improving the overall efficiency of the code generation process.
      
      * lint fix
      
      * [Dependency] Update apache-tvm-ffi version to >=0.1.3
      
      * Updated the version of apache-tvm-ffi in pyproject.toml, requirements.txt, and requirements-dev.txt to ensure compatibility with the latest features and fixes.
      * Made adjustments in CUDA and HIP template files to use `const` qualifiers for global pointer parameters, enhancing code safety and clarity.
      
      * lint fix
      
      * [Enhancement] Refactor ReadWriteMarker for improved parameter handling
      
      * Updated the ReadWriteMarker class to accept a set of parameter or data variables, enhancing its ability to track written variables.
      * Introduced a new method, ResolveDataVarFromPtrArg, to resolve underlying buffer data from pointer-like arguments, improving accuracy in identifying written variables.
      * Modified the MarkReadOnlyParams function to gather handle parameters and their corresponding buffer data variables, streamlining the process of determining read-only parameters.
      * Enhanced the logic for identifying written variables to account for aliased data variables, ensuring comprehensive tracking of modifications.
      
      * lint fix
      
      * Update tma_load function to use const qualifier for global memory pointer
      
      * Changed the parameter type of gmem_ptr in the tma_load function from void* to void const* to enhance type safety and clarity in memory operations.
      * This modification ensures that the function correctly handles read-only global memory pointers, aligning with best practices in CUDA programming.
      
      * Remove commented-out code and reorder transformations in OptimizeForTarget function for clarity
      
      * Refactor buffer marking logic in annotate_read_only_params.cc to improve accuracy in identifying written variables. Update OptimizeForTarget function to reorder transformations for better clarity.
    • [Atomic] Use ptr for atomicAdd dst instead of reference (#1425) · 3546e2ee
      Lei Wang authored
      * [Enhancement] Update AtomicAdd function signature to accept pointer to destination
      
      * Modified AtomicAdd in CUDA to take a pointer instead of a reference for the destination argument.
      * Updated related code in atomicadd_vectorize.cc to ensure compatibility with the new signature.
      * Adjusted Python interface in atomic.py to pass the destination by pointer, aligning with device function requirements.
      
      * [Enhancement] Refactor AtomicAddRet function signature to accept pointer
      
      * Updated AtomicAddRet in both CUDA and HIP to take a pointer instead of a reference for the address argument, improving consistency with the AtomicAdd function.
      * Adjusted the implementation to ensure proper reinterpretation of the address type for atomic operations.
      
      * lint fix
      
      * [Enhancement] Refactor AtomicAddNode::MakeSIMTLoop to use destination pointer
      
      * Updated the MakeSIMTLoop function to build a pointer to the destination element using tvm_access_ptr instead of loading the destination value directly.
      * Simplified the handling of source and destination predicates, improving clarity and maintainability of the code.
      * Ensured compatibility with the new pointer-based approach for atomic operations.
      
      * lint fix
      
      * test fix
      
      * lint fix