1. 17 Dec, 2025 9 commits
    • Kuris's avatar
      3ee0939b
    • Lei Wang's avatar
      6aaf3c7a
    • Lei Wang's avatar
    • Lei Wang's avatar
      [Enhancement] Update examples and tests for improved type handling functionality (#1448) · c750fb8a
      Lei Wang authored
      * [Enhancement] Update examples and tests for improved type handling and functionality
      
      - Enhanced various example scripts to support new data types and improve compatibility with PyTorch.
      - Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling.
      - Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management.
      - Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling.
      
      * [Refactor] Update accumulation data type to float32 across examples
      
      - Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability.
      - This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board.
      
      * [Refactor] Standardize data type usage across benchmark scripts
      
      - Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling.
      - Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard.
      - Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules.
      
      * [Refactor] Standardize data type usage in templates and scripts
      
      - Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity.
      - Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates.
      - This change aims to streamline type handling and improve compatibility with existing workflows.
      
      * [Refactor] Standardize data type usage in examples and benchmarks
      
      - Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability.
      - Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard.
      - Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples.
      
      * [Refactor] Import dtypes from language.v2 module
      
      - Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase.
      - This change aims to streamline data type management and improve overall code clarity.
      
      * fix
      
      * [Refactor] Standardize data type usage across scripts
      
      - Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity.
      - Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability.
      - This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase.
      
      * [Refactor] Update data type handling for consistency and clarity
      
      - Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency.
      - Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase.
      - Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling.
      - This refactor aims to streamline data type management and improve overall code clarity and maintainability.
      
      * [Enhancement] Improve data type handling and error messaging
      
      - Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation.
      - Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs.
      - Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience.
      
      * [Fix] Correct boolean flag in GEMM SP test case
      
      - Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case.
      - This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation.
      
      * [Refactor] Standardize data type usage across scripts
      
      - Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability.
      - Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase.
      - This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability.
      
      * [Refactor] Standardize data type usage in various modules
      
      - Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability.
      - Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase.
      - This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations.
      
      * [Refactor] Update argument parsing for data types in benchmarks
      
      - Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
      - This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability.
      
      * [Refactor] Update data type handling in benchmark and example scripts
      
      - Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency.
      - Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase.
      - This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples.
      
      * [Refactor] Fix data type conversion in multiple scripts
      
      - Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts.
      - This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations.
      
      * [Refactor] Update float8 data type usage across multiple scripts
      
      - Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling.
      - This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations.
      
      * [Refactor] Enhance float8 data type handling in CUDA code generation
      
      - Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic.
      - Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase.
      - Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation.
      - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
      
      * [Refactor] Streamline float8 data type handling in CUDA and related modules
      
      - Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks.
      - Updated layout inference for float8 types to improve clarity and maintainability across the implementation.
      - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy.
      
      * [Refactor] Remove unnecessary cache disabling in float8 example script
      
      - Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code.
      - This change enhances clarity and maintainability of the example script without affecting its functionality.
      
      * [Refactor] Update data type usage in debug print tests
      
      - Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type.
      - This update enhances consistency in data type handling within the test suite, improving clarity and maintainability.
      
      * lint fix
      
      * Update function parameter types from `str` to `T.dtype` for improved type safety in attention sink and related examples
      
      * Refactor `gemv_alloc_reducer` function signature for improved readability by formatting parameters across multiple lines.
      c750fb8a
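The string-to-dtype handling described in #1448 can be pictured with a small plain-Python sketch. It does not use TileLang; `_ALIASES`, `_KNOWN`, and `canonical_dtype` are hypothetical names chosen for illustration, and the only detail taken from the log is that the legacy string "float" is treated as float32.

```python
# Hypothetical sketch: these names are illustrative and are not TileLang's API.
_ALIASES = {"float": "float32"}  # the log replaces "float" with T.float32

# A small subset of the dtype names mentioned in the log.
_KNOWN = {
    "bool", "int4", "int8", "int16", "int32",
    "float16", "bfloat16", "float32", "float8_e4m3fn",
}


def canonical_dtype(name: str) -> str:
    """Map a user-supplied dtype string to one canonical spelling."""
    key = name.strip().lower()
    key = _ALIASES.get(key, key)
    if key not in _KNOWN:
        raise ValueError(f"unsupported dtype string: {name!r}")
    return key
```

Normalizing once at the boundary means the rest of the code only ever compares canonical names, which is one way to get the clearer error messages the commit mentions.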
    • Gongen-Ali's avatar
      Update ROCm CI to Nightly-ROCm-7.1 (#1449) · 0c25c4f3
      Gongen-Ali authored
      0c25c4f3
    • Kuris's avatar
      f914f2d7
    • Lei Wang's avatar
      [Language] Introduce `T.annotate_restrict_buffers` (#1428) · 0814b171
      Lei Wang authored
      * [Enhancement] Introduce non-restrict parameter support in code generation
      
      - Added a new PrimFunc-level attribute `tl.non_restrict_params` to specify handle Vars that should not be marked with the restrict qualifier during code generation.
      - Updated `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to handle non-restrict parameters, ensuring proper treatment of overlapping buffer aliases.
      - Implemented a new annotation function `annotate_restrict_buffers` to facilitate the marking of buffer parameters as non-restrict.
      - Enhanced the `SplitHostDevice` transformation to propagate non-restrict parameters from host to device functions.
      - Added a new transform function `HoistNonRestrictParams` to manage non-restrict parameters effectively.
      
      * [Enhancement] Improve HoistNonRestrictParams transformation
      
      - Updated the HoistNonRestrictParams function to recursively collect all `tl.non_restrict_params` annotations from nested blocks, enhancing flexibility in annotation placement.
      - Introduced a new NonRestrictCollector class to manage the collection and deduplication of non-restrict parameters.
      - Modified the SplitHostDevice transformation to remove the non-restrict attribute from the host-side PrimFunc after propagation to device kernels.
      - Adjusted the LowerAndLegalize function to directly apply the HoistNonRestrictParams transformation without exception handling, streamlining the process.
      
      * [Refactor] Simplify non-restrict parameter handling in code generation
      
      - Removed unnecessary normalization logic and associated data structures from `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP`.
      - Streamlined the handling of non-restrict parameters by directly inserting them into the `non_restrict` set, improving code clarity and maintainability.
      - Updated conditional checks to eliminate redundant checks against normalized names, enhancing performance and readability.
      
      * [Dependency] Update TVM subproject to latest commit 68aa8461
      
      - Updated the TVM subproject to the latest commit, ensuring compatibility with recent changes and improvements.
      - Refactored non-restrict parameter handling in `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to enhance code clarity and maintainability.
      - Adjusted the `SplitHostDevice` transformation to streamline the propagation of non-restrict parameters.
      
      * fix
      0814b171
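The recursive collection and deduplication that #1428 attributes to `HoistNonRestrictParams` might work along these lines. This is a sketch over plain dictionaries, not TVM IR: the `annotations` and `children` fields and the function name are assumptions made for the example; only the attribute key `tl.non_restrict_params` comes from the log.

```python
from typing import Any, Dict, List

ATTR_KEY = "tl.non_restrict_params"  # attribute name taken from the log


def collect_non_restrict(block: Dict[str, Any]) -> List[str]:
    """Gather annotated parameter names from a block and every nested block.

    Names are deduplicated and returned in first-seen order.
    """
    seen = set()
    ordered: List[str] = []

    def visit(node: Dict[str, Any]) -> None:
        for name in node.get("annotations", {}).get(ATTR_KEY, []):
            if name not in seen:
                seen.add(name)
                ordered.append(name)
        for child in node.get("children", []):
            visit(child)

    visit(block)
    return ordered
```

Hoisting the merged list to the function level lets code generation make a single lookup per parameter when deciding whether to emit the restrict qualifier.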
    • senlyu163's avatar
      [Bugfix] Improve autotune from elementwise_add function in examples (#1445) · f4f87f46
      senlyu163 authored
      * Remove JIT decorator from elementwise_add function in examples
      
      * fix kernel compilation without autotune
      
      * Refactor main function to accept parameters and update tests for autotune option
      
      * Refactor autotune test function for modern style
      f4f87f46
    • Kuris's avatar
      [Feat] Integrate Z3 in TVM Arith Analyzer (#1367) · 9c21586b
      Kuris authored
      9c21586b
  2. 16 Dec, 2025 4 commits
    • Chaofan Lin's avatar
    • Kuris's avatar
      [Refactor] Use `pytest.mark.parametrize` to speed up parallel testing (#1447) · 0b6336b5
      Kuris authored
      
      
      * Refactor GEMM tests to use parameterized pytest fixtures
      
      - Converted multiple test cases for GEMM operations in `test_tilelang_tilelibrary_gemm_sp.py` to use `pytest.mark.parametrize` for better maintainability and readability.
      - Similar refactoring applied to `test_tilelang_tilelibrary_gemm_sp_v2.py`, consolidating test cases for `run_gemm_ss`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` into parameterized tests.
      - This change reduces code duplication and enhances the clarity of test configurations.
      
      * Update testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      
      ---------
      Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
      0b6336b5
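The refactor in #1447 replaces hand-written test variants with `pytest.mark.parametrize`. A minimal sketch of the pattern follows; `matmul` and `test_matmul_shape` are hypothetical stand-ins and are unrelated to the actual `run_gemm_*` helpers in the test suite.

```python
import pytest


def matmul(a, b):
    """Reference matrix product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


# Each (m, n, k) tuple becomes its own test item in pytest's collection.
@pytest.mark.parametrize("m, n, k", [(1, 1, 1), (2, 3, 4), (4, 2, 8)])
def test_matmul_shape(m, n, k):
    a = [[1] * k for _ in range(m)]
    b = [[1] * n for _ in range(k)]
    out = matmul(a, b)
    assert len(out) == m and all(len(row) == n for row in out)
    # With all-ones inputs every output element equals the reduction length k.
    assert all(value == k for row in out for value in row)
```

Because every parameter set is collected as a separate item, a parallel runner such as pytest-xdist can schedule the cases independently instead of running one long test function serially.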
    • Lei Wang's avatar
      [Refactor] Reduce direct dependency on PyTorch due to its limited type support (#1444) · dda45126
      Lei Wang authored
      
      
      * [Enhancement] Update KernelParam to use tvm.DataType directly and add torch_dtype conversion method
      
      - Changed dtype in KernelParam from torch.dtype to tvm.DataType to support a wider range of data types and prevent information loss during conversions.
      - Added a new method, torch_dtype, to convert tvm.DataType back to torch.dtype for tensor creation.
      - Updated various adapters to utilize the new torch_dtype method for parameter type conversion during initialization.
      
      * [Enhancement] Refactor CUDA type handling and add support for FP4 and FP8 types
      
      - Renamed functions for clarity: GetFP8Type, GetFP6Type, and GetFP4Type are now GetTileLangFP8Type, GetTileLangFP6Type, and GetTileLangFP4Type respectively.
      - Enhanced FP4 type handling to support additional lane sizes (2, 4, 8, 16, 32, 64).
      - Updated CUDA code generation to include new FP8 and FP4 types, ensuring proper type handling in PrintType and related functions.
      - Introduced new structures for FP8 types in cuda_fp8.h to facilitate better memory management and type packing.
      - Added methods in KernelParam and tensor utilities to recognize and handle float4 types, improving compatibility with PyTorch.
      - Enhanced logging for debugging purposes in various CUDA functions to track type handling and memory operations more effectively.
      
      * lint fix
      
      * Remove unnecessary logging statements from CUDA code generation and delete obsolete matrix multiplication test file.
      
      * [Enhancement] Add support for FP4 and FP8 types in CUDA code generation
      
      - Enhanced PrintVecElemLoad and PrintVecElemStore functions to handle new FP4 types.
      - Updated arg_binder to allow float4 to match int8 at runtime, improving compatibility with PyTorch.
      - Modified loop_vectorize to account for buffer dtype lanes in vectorization calculations.
      - Refactored tensor type mapping to support new float4 and float8 types, ensuring correct type handling in tensor operations.
      - Added tests for FP4 and FP8 copy operations to validate functionality and integration with existing workflows.
      
      ---------
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
      dda45126
    • Kuris's avatar
      [Fix] Fix analyzer bind conflicting (#1446) · 81b8c1b7
      Kuris authored
      81b8c1b7
  3. 15 Dec, 2025 11 commits
    • Dayuxiaoshui's avatar
      [Feature] Support region as input of T.cumsum (#1426) · 869f021b
      Dayuxiaoshui authored
      
      
      * [Feature] Support region as input of T.cumsum
      
      - Extend T.cumsum to accept BufferRegion and BufferLoad inputs in addition to Buffer
      - This enables operations on buffer slices/regions like:
        T.cumsum(InputG_fragment[i * chunk_size:(i + 1) * chunk_size], dim=0)
      - Update cumsum_fragment to handle region inputs properly
      - Add comprehensive tests for 1D and 2D region inputs including normal and reverse modes
      
      Fixes #879
      
      * Fix formatting and add docstring for cumsum_fragment
      
      - Add comprehensive docstring for cumsum_fragment function
      - Format code according to ruff style guidelines
      
      * Fix CodeRabbit review issues
      
      - Fix negative dimension bounds check (dim < -len(shape) instead of dim <= -len(shape))
      - Add src/dst shape compatibility validation for out-of-place cumsum
      - Update copy() type annotation to accept BufferRegion as dst parameter
      - Fix test in-place mutation issues by using out-of-place cumsum operations
      - Add non-divisible size test cases for tail region coverage
      
      * Fix out-of-bounds access in region tests
      
      - Add bounds clamping using T.min() for chunk_end calculations
      - Prevents accessing beyond tensor bounds for non-divisible sizes
      - Matches reference implementation behavior
      - Fixes both 1D and 2D region test cases
      
      * Fix region test: use simple slice expressions instead of T.min()
      
      - Remove T.min() which cannot be used directly in slice indices
      - Use chunk_start + chunk_size form instead
      - Rely on system's automatic bounds checking for non-divisible sizes
      - Update comments to reflect this approach
      
      * Fix cumsum region: use region extents in lowering and update tests for shared memory
      
      * Simplify fragment scope check using is_fragment()
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      869f021b
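The region-wise behaviour that #1426 adds to `T.cumsum` can be mirrored in plain Python to show what the result looks like for chunked slices, reverse mode, and sizes that do not divide evenly. The helper names below are hypothetical and operate on Python lists rather than TileLang buffers; `normalize_dim` reflects the corrected bounds check, where `dim == -len(shape)` is accepted.

```python
from itertools import accumulate
from typing import List


def normalize_dim(dim: int, ndim: int) -> int:
    """Map a possibly negative dim to [0, ndim); -ndim itself is valid."""
    if dim < -ndim or dim >= ndim:
        raise ValueError(f"dim {dim} out of range for {ndim} dimensions")
    return dim + ndim if dim < 0 else dim


def chunked_cumsum(values: List[int], chunk_size: int, reverse: bool = False) -> List[int]:
    """Cumulative sum applied independently to each chunk of the input."""
    out: List[int] = []
    for start in range(0, len(values), chunk_size):
        # Python slices clamp at the end, which covers the short tail chunk.
        chunk = values[start:start + chunk_size]
        if reverse:
            chunk = list(accumulate(chunk[::-1]))[::-1]
        else:
            chunk = list(accumulate(chunk))
        out.extend(chunk)
    return out
```

For example, `chunked_cumsum([1, 2, 3, 4, 5], 2)` treats `[1, 2]`, `[3, 4]`, and the tail `[5]` as separate regions.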
    • Xiangwen Wang's avatar
      bcae814e
    • Lei Wang's avatar
      [Enhancement] Refactor vectorization checks in loop_vectorize (#1440) · e387102c
      Lei Wang authored
      * Introduced a new function, IsExprInvariantInVectorBoundary, to encapsulate the logic for checking if an expression is invariant within vector boundaries, improving code clarity and reusability.
      * Updated the existing vectorization logic to utilize this new function, streamlining the process of determining vectorization feasibility based on boundary conditions.
      * Enhanced comments for better understanding of the vectorization criteria and mathematical rationale behind the checks.
      e387102c
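One plausible reading of the check that #1440 factors into `IsExprInvariantInVectorBoundary` is: an expression is safe to treat as a scalar inside a vector if it takes the same value for every lane of each aligned group of loop indices. The brute-force sketch below illustrates that reading only; it is an assumption about the semantics, and the real pass reasons symbolically rather than by enumeration.

```python
from typing import Callable


def is_invariant_in_vector_boundary(
    expr: Callable[[int], int], extent: int, vector_size: int
) -> bool:
    """Return True if expr is constant within every aligned group of lanes."""
    for base in range(0, extent, vector_size):
        first = expr(base)
        # Compare every lane in this group against the first lane.
        for lane in range(base, min(base + vector_size, extent)):
            if expr(lane) != first:
                return False
    return True
```

Under this reading, `i // 4` is invariant for vector sizes 2 and 4 but not for 8, since indices 0 and 4 fall in the same 8-wide group yet produce different values.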
    • Xuehai Pan's avatar
      [CI] Update lint dependencies and fix lint on trunk (#1433) · 4dbc910d
      Xuehai Pan authored
      * [CI] Update pre-commit hooks
      
      * [Lint] Pass correct `exclude-header-filter` to `clang-tidy`
      
      * [Lint] Download latest `run-clang-tidy` script
      
      * [CI] Show compile commands
      
      * [CI] Add output grouping to GHA
      
      * [Lint] Re-order pre-commit hooks
      4dbc910d
    • Lei Wang's avatar
      [Enhancement] Include PrimFunc name in memory cache logs for better debugging (#1437) · b8003a28
      Lei Wang authored
      * Added the `get_prim_func_name` utility to extract human-readable function names from TVM PrimFuncs.
      * Updated memory cache logging in `AutoTuner` and `KernelCache` classes to include the kernel name, improving clarity during cache hits.
      * Enhanced debug logging to provide more informative messages when checking disk cache for kernels.
      b8003a28
    • Chaofan Lin's avatar
      [Enhancement] Improve InjectAssumes logic and make assumes work after SplitHostDevice (#1405) · 2feaa41e
      Chaofan Lin authored
      * [Refactor] Refactor InjectAssumes logic and make assumes work after SplitHostDevice
      
      * address comments
      
      * fix
      
      * fix submodule
      
      * fix
      
      * fix 3rdparty
      2feaa41e
    • Lei Wang's avatar
      [Enhancement] Improve buffer usage tracking in MakePackedAPI (#1435) · 0788feb8
      Lei Wang authored
      * Added detailed logging for data and shape variable parameters during buffer usage detection in the MakePackedAPI function.
      * Refactored the UsedBufferDetector to differentiate between used parameters by data and shape variables, enhancing clarity in buffer management.
      * Updated logic to ensure minimal carrier buffers are selected for shape symbols, improving the efficiency of parameter handling.
      0788feb8
    • Lei Wang's avatar
      [Bugfix] Convey `compile_flags` to ffi compilation path with pass_configs (#1434) · fba12a5f
      Lei Wang authored
      * [Enhancement] Add device compile flags support in pass configuration
      
      * Introduced `kDeviceCompileFlags` option in the pass configuration to allow additional device compiler flags for CUDA compilation.
      * Updated the `tilelang_callback_cuda_compile` function to merge extra flags from the pass configuration, enhancing flexibility in compiler options.
      * Modified the `JITKernel` class to handle device compile flags appropriately, ensuring they are included during compilation.
      * Documented the new pass configuration key for clarity on usage and expected input formats.
      
      * lint fix
      
      * [Refactor] Simplify compile_flags handling in JIT functions
      
      * Removed redundant string check for compile_flags in the compile, jit, and lazy_jit functions, ensuring compile_flags is consistently treated as a list.
      * Updated the JITKernel class to handle compile_flags as a list when a string is provided, enhancing code clarity and maintainability.
      
      * lint fix
      
      * fix
      fba12a5f
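The flag handling described in #1434 (treat `compile_flags` as a list, then merge extra device flags from the pass configuration) could be sketched as follows. The key string `"device_compile_flags"` and both function names are hypothetical; the log only names the option as `kDeviceCompileFlags`. Splitting a string with `shlex.split` is also an assumption about how a string input is turned into a list.

```python
import shlex
from typing import Dict, List, Optional, Sequence, Union

Flags = Optional[Union[str, Sequence[str]]]

# Hypothetical key; the log refers to the option only as kDeviceCompileFlags.
DEVICE_FLAGS_KEY = "device_compile_flags"


def normalize_flags(flags: Flags) -> List[str]:
    """Return compile flags as a list, whatever form the caller passed."""
    if flags is None:
        return []
    if isinstance(flags, str):
        return shlex.split(flags)  # assumption: strings are split shell-style
    return list(flags)


def merged_flags(compile_flags: Flags, pass_configs: Optional[Dict[str, Flags]] = None) -> List[str]:
    """Append extra device flags from the pass configuration, if any."""
    extra = (pass_configs or {}).get(DEVICE_FLAGS_KEY)
    return normalize_flags(compile_flags) + normalize_flags(extra)
```

Normalizing in one place means downstream code never has to check whether it received a string or a list.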
    • dependabot[bot]'s avatar
      [CI]: Bump actions/download-artifact from 6 to 7 (#1432) · 87e9e170
      dependabot[bot] authored
      Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 6 to 7.
      - [Release notes](https://github.com/actions/download-artifact/releases)
      - [Commits](https://github.com/actions/download-artifact/compare/v6...v7)
      
      ---
      updated-dependencies:
      - dependency-name: actions/download-artifact
        dependency-version: '7'
        dependency-type: direct:production
        update-type: version-update:semver-major
      ...
      Signed-off-by: dependabot[bot] <support@github.com>
      Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
      87e9e170
    • dependabot[bot]'s avatar
      [CI]: Bump actions/upload-artifact from 5 to 6 (#1431) · 3aa6938f
      dependabot[bot] authored
      Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 5 to 6.
      - [Release notes](https://github.com/actions/upload-artifact/releases)
      - [Commits](https://github.com/actions/upload-artifact/compare/v5...v6)
      
      ---
      updated-dependencies:
      - dependency-name: actions/upload-artifact
        dependency-version: '6'
        dependency-type: direct:production
        update-type: version-update:semver-major
      ...
      Signed-off-by: dependabot[bot] <support@github.com>
      Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
      3aa6938f
    • Lei Wang's avatar
      [Refactor] Phase out the primitives folder since its design has been merged into tileop (#1429) · 89521e63
      Lei Wang authored
      * Phase out primitives
      
      * revert changes
      
      * Refactor GemmWarpPolicy method signature for clarity
      
      Updated the `from_warp_partition` method in the `GemmWarpPolicy` class to return the type `GemmWarpPolicy` instead of a string, enhancing type safety and clarity in the codebase. Removed an unnecessary blank line for improved readability.
      
      * fix
      89521e63
  4. 13 Dec, 2025 2 commits
    • Lei Wang's avatar
      [CUDA] Add read-only parameter annotation for CUDA codegen (#1416) · 00dd7388
      Lei Wang authored
      * [Enhancement] Add read-only parameter annotation for CUDA codegen
      
      * Introduced the `AnnotateReadOnlyParams` transformation to annotate read-only handle parameters in PrimFuncs, enabling the generation of `const` qualifiers in CUDA codegen.
      * Updated `PrintFunctionSignature` and `AddFunction` methods to utilize the new attribute `tl.readonly_param_indices`, enhancing performance by allowing read-only cache loads.
      * Modified the optimization pipeline to include the new annotation step, improving the overall efficiency of the code generation process.
      
      * lint fix
      
      * [Dependency] Update apache-tvm-ffi version to >=0.1.3
      
      * Updated the version of apache-tvm-ffi in pyproject.toml, requirements.txt, and requirements-dev.txt to ensure compatibility with the latest features and fixes.
      * Made adjustments in CUDA and HIP template files to use `const` qualifiers for global pointer parameters, enhancing code safety and clarity.
      
      * lint fix
      
      * [Enhancement] Refactor ReadWriteMarker for improved parameter handling
      
      * Updated the ReadWriteMarker class to accept a set of parameter or data variables, enhancing its ability to track written variables.
      * Introduced a new method, ResolveDataVarFromPtrArg, to resolve underlying buffer data from pointer-like arguments, improving accuracy in identifying written variables.
      * Modified the MarkReadOnlyParams function to gather handle parameters and their corresponding buffer data variables, streamlining the process of determining read-only parameters.
      * Enhanced the logic for identifying written variables to account for aliased data variables, ensuring comprehensive tracking of modifications.
      
      * lint fix
      
      * Update tma_load function to use const qualifier for global memory pointer
      
      * Changed the parameter type of gmem_ptr in the tma_load function from void* to void const* to enhance type safety and clarity in memory operations.
      * This modification ensures that the function correctly handles read-only global memory pointers, aligning with best practices in CUDA programming.
      
      * Remove commented-out code and reorder transformations in OptimizeForTarget function for clarity
      
      * Refactor buffer marking logic in annotate_read_only_params.cc to improve accuracy in identifying written variables. Update OptimizeForTarget function to reorder transformations for better clarity.
      00dd7388
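The read-only analysis in #1416 (find handle parameters that are never written, including through aliases, and record their indices) can be approximated with a small sketch. It works on plain strings rather than TVM variables; `resolve`, `readonly_param_indices`, and the `alias_of` map are hypothetical and only illustrate the idea behind `tl.readonly_param_indices`.

```python
from typing import Dict, Iterable, List


def resolve(var: str, alias_of: Dict[str, str]) -> str:
    """Follow alias links until reaching the underlying data variable."""
    seen = set()
    while var in alias_of and var not in seen:
        seen.add(var)
        var = alias_of[var]
    return var


def readonly_param_indices(
    params: List[str], alias_of: Dict[str, str], writes: Iterable[str]
) -> List[int]:
    """Indices of parameters whose underlying data is never written."""
    written = {resolve(w, alias_of) for w in writes}
    return [i for i, p in enumerate(params) if resolve(p, alias_of) not in written]
```

A write through any alias marks the underlying buffer as written, so a parameter is reported as read-only only when no alias of its data appears in the write set. Those are the parameters that could then receive a `const` qualifier in the generated signature.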
    • Lei Wang's avatar
      [Atomic] Use ptr for atomicAdd dst instead of reference (#1425) · 3546e2ee
      Lei Wang authored
      * [Enhancement] Update AtomicAdd function signature to accept pointer to destination
      
      * Modified AtomicAdd in CUDA to take a pointer instead of a reference for the destination argument.
      * Updated related code in atomicadd_vectorize.cc to ensure compatibility with the new signature.
      * Adjusted Python interface in atomic.py to pass the destination by pointer, aligning with device function requirements.
      
      * [Enhancement] Refactor AtomicAddRet function signature to accept pointer
      
      * Updated AtomicAddRet in both CUDA and HIP to take a pointer instead of a reference for the address argument, improving consistency with the AtomicAdd function.
      * Adjusted the implementation to ensure proper reinterpretation of the address type for atomic operations.
      
      * lint fix
      
      * [Enhancement] Refactor AtomicAddNode::MakeSIMTLoop to use destination pointer
      
      * Updated the MakeSIMTLoop function to build a pointer to the destination element using tvm_access_ptr instead of loading the destination value directly.
      * Simplified the handling of source and destination predicates, improving clarity and maintainability of the code.
      * Ensured compatibility with the new pointer-based approach for atomic operations.
      
      * lint fix
      
      * test fix
      
      * lint fix
      3546e2ee
  5. 12 Dec, 2025 5 commits
    • Lei Wang's avatar
      29051439
    • Xiangwen Wang's avatar
      [Enhancement] Improve vectorization invariant check (#1398) · e84b24bc
      Xiangwen Wang authored
      * Improve loop vectorize
      
      * Improve loop vectorize
      
      * Improve loop vectorize
      
      * Improve loop vectorize
      
      * Improve loop vectorize
      
      * Add some vectorize tests and comments
      e84b24bc
    • Lei Wang's avatar
      [Enhancement] Introduce `T.__ldg` (#1414) · 6f67da84
      Lei Wang authored
      * [Enhancement] Add __ldg intrinsic for CUDA read-only cache loads
      
      * Introduced the __ldg intrinsic to enable explicit read-only cached loads from global memory in CUDA.
      * Updated the corresponding documentation and added support in both CUDA and HIP code generation.
      * Enhanced the Python interface for __ldg to accept BufferLoad and Buffer types, improving usability.
      
      * [Enhancement] Update formatting and linting rules in pyproject.toml; minor test adjustment
      
      * Added new formatting rules in pyproject.toml to enforce consistent code style, including hanging indents and argument splitting.
      * Updated test_tilelang_language_intrinsics_codegen.py to improve readability by adding a blank line before the main execution block.
      * Refactored error messages in builtin.py for better clarity and consistency, ensuring proper formatting in function definitions and raising ValueErrors.
      
      * lint fix
      6f67da84
    • Lei Wang's avatar
    • Lei Wang's avatar
      [Dependency] Add torch-c-dlpack-ext to project requirements (#1403) · ba2c1856
      Lei Wang authored
      
      
      * [Dependency] Add torch-c-dlpack-ext to project requirements
      
      * Added torch-c-dlpack-ext to both pyproject.toml and requirements.txt to provide prebuilt torch extensions, which may prevent JIT compilation on first import of TVM FFI.
      
      * [Build] Update manylinux images in project configuration
      
      * Changed the manylinux image for x86_64 from "manylinux2014" to "manylinux_2_28" in both pyproject.toml and the Dockerfile to align with updated standards for compatibility and performance.
      
      * [Build] Update CUDA repository configuration in pyproject.toml
      
      * Changed the package manager command from `yum-config-manager` to `dnf config-manager` for adding the CUDA repository, ensuring compatibility with newer systems.
      
      * fix
      
      * [Build] Update CUDA repository to RHEL 8
      
      * Changed the CUDA repository configuration in both pyproject.toml and the manylinux Dockerfile from RHEL 7 to RHEL 8, ensuring compatibility with newer systems.
      
      * test: run out of space
      
      * use cu130 to reduce size
      
      * upd
      
      * upd comment
      
      * upd
      
      ---------
      Co-authored-by: Your Name <wenji.yyc@alibaba-inc.com>
      ba2c1856
  6. 11 Dec, 2025 5 commits
  7. 10 Dec, 2025 4 commits
    • danielhua23's avatar
      d19142f6
    • Lei Wang's avatar
      [Enhancement] Refactor inflight computing to support dynamic pipeline extents (#1399) · f2858fa1
      Lei Wang authored
      * [Build] Update CMake configuration for tilelang_cython_wrapper installation
      
      - Adjusted output directories for the tilelang_cython_wrapper to ensure that development builds place the extension in build/lib.
      - Updated installation paths to place the extension in tilelang/lib within the wheel, improving organization and avoiding potential conflicts with other modules.
      - Modified the internal library path exposure in env.py to prevent shadowing of common module names, enhancing compatibility and usability in user projects.
      
      * [Build] Standardize output directories for tilelang libraries
      
      - Set output directories for both tilelang and tilelang_module libraries to "${CMAKE_BINARY_DIR}/lib" for consistency in development builds.
      - This change enhances organization and ensures that all build artifacts are located in a unified directory structure.
      
      * [Refactor] Update TVM subproject and enhance pipeline loop handling
      
      - Updated the TVM subproject to commit 90581fe9e5287bbcf1844ad14255a1e1e8cdf7f0.
      - Added new fields to `PipelineAnnotation` and `RewrittenBlockInfo` structures to track original statement indices and improve async state management.
      - Refactored `EmitImpl` and `PopulateWaitCounts` methods to enhance clarity and functionality, including better handling of commit groups and wait counts.
      - Simplified access index calculations and strengthened analyzer constraints for loop bounds.
      
      * [Cleanup] Remove license block and unused includes from inject_pipeline.cc
      
      - Eliminated the Apache license block from the top of the file to streamline the code.
      - Removed unused include directives for memory and stringstream to enhance code clarity and reduce unnecessary dependencies.
      
      * [Refactor] Enhance transformation pipeline and test execution
      
      - Added an additional Simplify transformation in the InjectSoftwarePipeline to improve optimization.
      - Updated the test file to call `test_trival_pipeline()` directly, commenting out the previous main execution for better test isolation.
      f2858fa1
    • Chaofan Lin's avatar
      [Doc] Update logging docs (#1395) · bc084aa4
      Chaofan Lin authored
      bc084aa4
    • Kuris's avatar