1. 18 Nov, 2025 2 commits
  2. 17 Nov, 2025 2 commits
  3. 16 Nov, 2025 1 commit
  4. 15 Nov, 2025 1 commit
    • [fix] NVRTC execution backend (#1256) · eb415744
      Gabriel Wu authored
      * [fix] NVRTC execution backend
      
      * [fmt] run pre-commit
      
      * [fix] coderabbit reviews
      
      * [test] add cuda-python to test dep
      
      * [fix] coderabbit reviews
      
      * [fix] CUDA 13 compatibility
      
      * [fix] sm90
      
      * [fix] CUDA 13 compatibility
      
      * [fix] pre-commit
      
      * [fix] always use cuda::std::__atomic_ref_impl
      
      * [fix] restore to external API
      
      * Revert "[fix] restore to external API"
      
      This reverts commit 49bd875638fb631d270015f408991d38fd1e9a5d.
      
      * [fmt] use spaces instead of tabs for py codegen
      
      * [fix] im2col API
      
      * [fix] revert atomic.h
      
      * [fix] dynamic shape
      
      * [refactor] extract common utils
      
      * [feat] support L2 persistent map
      
      * [fix] l2 persistent map
      
      * [fix] pre-commit
      
      * [fix] restore _TYPE_MAP
      
      * [fix] pre-commit
      
      * [fix] avoid duplicate TMA descs
      
      * [docs] add docstring
      
      * [fix] coderabbit
      
      * [fix] coderabbit
      
      * [fix] coderabbit
      
      * [fix] coderabbit
      eb415744
  5. 14 Nov, 2025 1 commit
  6. 13 Nov, 2025 2 commits
    • [Refactor] Update buffer handling in copy and atomic operations (#1247) · 2c0072a8
      Lei Wang authored
      * [Refactor] Update buffer handling in copy and atomic operations
      
      * Refactored the `copy` and `atomic_add` functions to use element-wise minimum for defining copy extents, ensuring correct handling of overlapping regions.
      * Updated utility functions to create `BufferLoad` instances with explicit extents, improving memory management and clarity.
      * Removed unused imports from `atomic.py` and `copy.py` to streamline the codebase.
      * Adjusted logging in `copy.cc` to provide clearer warnings for fallback scenarios in bulk copy operations.
      
      * Remove obsolete .git_commit.txt file
      
      * Add unit test for dynamic copy extent handling in TileLang
      
      * Introduced a new test file `test_tilelang_issue_1237.py` to verify that the `T.copy` function correctly manages dynamic extents during primitive function building.
      * The test reproduces a specific issue related to dynamic slice lengths and static buffer sizes, ensuring robustness in the handling of such scenarios.
      * The test does not require execution of the kernel, as building the primitive function is sufficient to validate the fix.
      
      * lint fix
      
      * fix
      
      * Revert "fix"
      
      This reverts commit 828b4c1e4de76a7d11e4d4092927303fbbe00097.
      
      * Update TVM submodule and refactor atomic and copy functions
      
      * Updated the TVM submodule to a dirty state.
      * Refactored `atomic_add` and `copy` functions to pass extents explicitly to the `_to_region` helper, improving clarity and correctness in handling buffer regions.
      * Commented out the main execution call in the test example for `cast` and added a new function call to better demonstrate the example usage.
      
      * Enhance extent handling in atomic and copy functions
      
      * Introduced `legalize_pairwise_extents` utility to align and broadcast extent lists for `atomic_add` and `copy` functions, ensuring compatibility and correctness in buffer operations.
      * Updated both functions to utilize the new utility, improving clarity and robustness in handling dynamic and static extents.
      * Added comments to clarify the extent handling logic.
      
      * Enhance `legalize_pairwise_extents` function with early-exit rule
      
      * Added an early-exit condition to the `legalize_pairwise_extents` function to return original extents if the number of non-1 dimensions in both source and destination extents is equal, improving performance by avoiding unnecessary adjustments.
      * Updated the function's documentation to clarify the new behavior and maintain clarity in the extent handling logic.
      
      * lint fix
      2c0072a8
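The extent alignment and early-exit rule described in this commit can be pictured with a small sketch. It is an illustration only, not the TileLang implementation: `align_extents_sketch` is a hypothetical name, extents are plain Python integers rather than TIR expressions, and both the leading-1 padding and the broadcast rule are assumptions.

```python
from typing import List, Sequence, Tuple


def count_non_unit(extents: Sequence[int]) -> int:
    """Return the number of dimensions whose extent is not 1."""
    return sum(1 for e in extents if e != 1)


def align_extents_sketch(
    src: Sequence[int], dst: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Hypothetical illustration of pairwise extent alignment."""
    # Early exit: both sides have the same number of non-1 dimensions,
    # so the original extents are returned unchanged.
    if count_non_unit(src) == count_non_unit(dst):
        return list(src), list(dst)

    # Assumption: pad the shorter list with leading 1s to equalize rank.
    rank = max(len(src), len(dst))
    src_p = [1] * (rank - len(src)) + list(src)
    dst_p = [1] * (rank - len(dst)) + list(dst)

    # Assumption: a unit extent broadcasts to the other side's extent;
    # a non-unit extent is kept as-is.
    src_out = [d if s == 1 else s for s, d in zip(src_p, dst_p)]
    dst_out = [s if d == 1 else d for s, d in zip(src_p, dst_p)]
    return src_out, dst_out
```

With these assumptions, `align_extents_sketch([1, 64], [64])` takes the early exit and returns both lists untouched, while `align_extents_sketch([4, 8], [8])` pads and broadcasts the destination to `[4, 8]`.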
    • Jiaxing Ding's avatar
  7. 12 Nov, 2025 2 commits
    • [Enhancement] Support Layout/Fragment Reshape (#1241) · 4370309b
      Lei Wang authored
      
      
      * Update layout handling and introduce reshape functionality
      
      - Updated the `LayoutNode` class to include a new `Reshape` method, allowing for dynamic reshaping of layouts based on input shapes.
      - Enhanced the `OutputShape` method to provide better handling of cases where the analyzer cannot form an `IntervalSet`, implementing fallback mechanisms to ensure safe extents.
      - Refactored the `ReduceOpNode` to utilize `BufferRegion` for improved memory handling during reduction operations.
      - Added tests for reshaping functionality and layout transformations to ensure correctness and performance in various scenarios.
      
      * lint fix
      
      * Revert tvm submodule pointer to 1815c3e0b6ec4ead36370bbd1562025d8529017c; keep src unchanged
      
      * Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1
      
      * Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1
      
      * remove useless prove
      
      * remove comment
      
      ---------
      Co-authored-by: tilelang-bot <bot@tilelang>
      4370309b
    • [Refactor] Add kernel selection option for GEMM v1 in environment settings (#1200) · 8fbe1b3a
      Lei Wang authored
      * Add kernel selection option for GEMM v1 in environment settings
      
      - Introduced `TILELANG_USE_GEMM_V1` environment variable to control the selection of GEMM version.
      - Added `use_gemm_v1` method in the `Environment` class to determine if GEMM v1 should be used based on the environment variable.
      - Updated GEMM function assignment to default to v2, allowing for v1 to be forced via the new environment variable.
      
      * bug fix
      
      * Add kernel selection option for GEMM in environment settings
      
      - Introduced `TILELANG_USE_GEMM_V1` environment variable to allow users to select between GEMM v1 and v2 implementations.
      - Updated `gemm` function to default to v2 but switch to v1 if the environment variable is set to a truthy value.
      - Added a method `use_gemm_v1` in the `Environment` class to facilitate this selection based on the environment variable.
      
      * Refactor GEMM macro generator to use BufferRegion instead of Buffer
      
      - Updated `wgmma` and `wgmma_rs` methods in `TensorCoreIntrinEmitter` to accept `BufferRegion` parameters instead of `Buffer`.
      - Adjusted related calls in `GemmWGMMA` to ensure compatibility with the new parameter types.
      - Simplified buffer access logic for better clarity and maintainability.
      
      * Refactor GEMM functions to utilize BufferRegion for improved memory handling
      
      - Updated `run_gemm`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` functions to set `num_stages` based on block dimensions, enhancing performance for larger matrices.
      - Simplified calls to GEMM functions by removing redundant parameters and ensuring compatibility with BufferRegion.
      - Introduced utility functions for converting between Buffer, BufferLoad, and BufferRegion, improving code clarity and maintainability.
      - Enhanced error handling for full region checks in GEMM operations to ensure correctness in memory access.
      
      * Refactor GEMM code for improved readability and consistency
      
      - Cleaned up formatting and spacing in GEMM-related files for better readability.
      - Standardized comments and code structure across various GEMM functions and macros.
      - Enhanced error messages for clarity in buffer region checks.
      - Removed redundant lines and improved overall code maintainability.
      
      * Update GEMM correctness evaluation and macro generator for improved functionality
      
      - Modified `N_VALUES` in `correctness_evaluation_sm70.py` to include only relevant sizes for tests.
      - Updated test function call in `correctness_evaluation.py` to use `test_gemm_false_true` for better accuracy in testing.
      - Refactored buffer handling in `mma_sm70_macro_generator.py` to improve clarity and consistency in shared buffer access.
      - Enhanced `gemm_mma_sm70.py` to ensure full region checks for input and output buffers, improving correctness in GEMM operations.
      
      * Refactor GEMM and intrinsic files for improved clarity and functionality
      
      - Removed unused variable `A_stride_last` in `mma_sm70_macro_generator.py` to streamline code.
      - Adjusted function signature formatting in `swizzle.py` for better readability.
      - Restored the return of `GemmWGMMA` in `__init__.py` for correct GEMM instantiation.
      - Removed unused variable `B_buf` in `gemm_mma_sm70.py` to enhance code cleanliness.
      - Improved function signature formatting in `language.py` for consistency.
      
      * Enhance GEMM and MMA functionality for FP64 support
      
      - Refactored `GemmNode` to streamline the decision-making process for GEMM instruction selection.
      - Added support for FP64 inputs in the MMA dispatcher, enabling new tensor operations.
      - Introduced a new layout function for FP64 in `mma_layout.py` to facilitate shared memory storage.
      - Updated `TensorCoreIntrinEmitter` to handle FP64 data types, including adjustments for micro tile dimensions and loading mechanisms.
      - Enhanced utility functions to accommodate FP64 index mapping for shared memory operations.
      
      * lint fix
      
      * Refactor GEMM correctness evaluation and shared memory alignment handling
      
      - Reverted the GEMM function call in `correctness_evaluation.py` to the original implementation for consistency.
      - Added a helper function in `merge_shared_memory_allocations.cc` to streamline the marking of shared variables under alignment scope.
      - Enhanced the `VisitExpr_` methods to ensure proper handling of shared memory alignment for `BufferLoadNode` and `VarNode` types.
      - Cleaned up commented-out test code in `correctness_evaluation.py` for better readability.
      
      * Enhance GEMM and MMA implementations with region-based memory handling
      
      - Updated GEMM and MMA classes to utilize BufferRegion for input and output buffers, improving memory management and supporting strided GEMM operations.
      - Added checks to ensure full region compliance for input buffers, enhancing correctness in matrix multiplication.
      - Implemented clear accumulation functionality to reset output buffers before accumulation, ensuring accurate results in GEMM operations.
      
      * Refactor test_tilelang_example_deepseek_v32.py to improve import structure and function calls
      
      - Updated import statements to directly reference modules instead of individual test functions, enhancing clarity.
      - Modified function calls to use the new module structure for better organization and maintainability in testing examples.
      
      * Enhance OnArrayDeclaration method to handle repeated buffer declarations
      
      - Updated the OnArrayDeclaration method to merge metadata for buffers that may appear in multiple Allocate statements, improving robustness against upstream transformations.
      - Added logic to prefer concrete element data types and record extents when previously unknown, enhancing the handling of buffer declarations.
      
      * Add abbreviation for bfloat16 data type in mfma_macro_generator.py
      
      - Introduced a new abbreviation "bf16" for the bfloat16 data type in the mfma_macro_generator.py file, enhancing clarity and consistency in data type representation.
      
      * Refactor CodeGenTileLangHIP to enhance dtype handling and mfma call generation
      
      - Introduced a mapping function to normalize input data types to their corresponding scalar types, improving compatibility with MfmaTraits.
      - Updated the mfma call generation to utilize the new mapping, streamlining the code and enhancing clarity.
      - Removed outdated dtype mapping and replaced it with a more flexible approach to support additional data types like FP8.
      
      * lint fix
      
      * Enhance backend configuration in CMakeLists.txt and improve dtype handling in CodeGenTileLangHIP
      
      - Introduced a macro to define backend options for CUDA, ROCM, and Metal, allowing user overrides and caching of settings.
      - Updated logic to track user-selected backends and conditionally enable defaults based on environment variables.
      - Refactored dtype handling in CodeGenTileLangHIP to streamline mfma call generation and improve clarity.
      - Added support for bfloat16 in the mfma_macro_generator.py, enhancing data type representation consistency.
      
      * Update bfloat16 handling in CodeGenTileLangHIP and mfma_macro_generator.py
      
      - Changed the representation of bfloat16 in CodeGenTileLangHIP from "bfloat16x4" to "bfloat16x4_vec" for improved clarity.
      - Adjusted the mfma_suffix generation in mfma_macro_generator.py to remove the underscore before "bf16", aligning with HIP intrinsic requirements.
      
      * Change logging level from WARNING to DLOG in LegalizeNegativeIndex for non-negative index checks to reduce log verbosity.
      
      * Refactor attention sink examples to simplify index calculations
      
      - Updated index handling in `example_gqa_sink_bwd_bhsd.py` and `example_mha_sink_bwd_bhsd.py` to eliminate unnecessary local allocations and streamline logic for determining start and end indices.
      - Improved readability by using direct calculations instead of local variables for index bounds in pipelined loops.
      
      * Refactor attention sink examples to streamline index calculations
      
      - Simplified index handling in `example_gqa_sink_bwd_bhsd.py`, `example_gqa_sink_fwd_bhsd_wgmma_pipelined.py`, `example_mha_sink_bwd_bhsd.py`, `example_mha_sink_fwd_bhsd_wgmma_pipelined.py`, and `example_mha_sink_fwd_bhsd.py` by removing unnecessary local allocations for start and end indices.
      - Enhanced readability by directly calculating index bounds for pipelined loops, improving overall code clarity.
      
      * lint fix
      
      * bugfix
      
      * Refactor reduce operation handling in CUDA and Python
      
      - Removed outdated shared memory reduction logic from `reduce.cc`.
      - Introduced fragment allocation and improved buffer handling in `reduce.py` to support shared and fragment scopes.
      - Updated CUDA header to define a wider accumulator type for better numerical accuracy.
      - Enhanced error handling for buffer scope validation in the reduction process.
      
      * Fix ReduceOpNode to correctly compute AbsMax by using absolute values of inputs
      
      * Enhance unit loop handling by refining annotation checks
      
      - Updated the condition for identifying effectively empty annotations in unit loops to include cases where only the `pragma_unroll_explicit` hint is present.
      - Introduced a new method, `IsEffectivelyEmptyAnnotation`, to encapsulate this logic, improving code clarity and maintainability.
      
      * clean code
      8fbe1b3a
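The `TILELANG_USE_GEMM_V1` switch introduced in this commit (v2 by default, v1 when the variable is truthy) could look roughly like the sketch below. It is a standalone approximation: the commit places `use_gemm_v1` on the `Environment` class, whereas here it is a free function, and the set of accepted truthy spellings and the `select_gemm` helper are assumptions.

```python
import os
from typing import Mapping, Optional

# Assumption: these spellings count as truthy; the real parser may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def use_gemm_v1(environ: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when TILELANG_USE_GEMM_V1 is set to a truthy value."""
    env = os.environ if environ is None else environ
    return env.get("TILELANG_USE_GEMM_V1", "0").strip().lower() in _TRUTHY


def select_gemm(environ: Optional[Mapping[str, str]] = None) -> str:
    """Hypothetical helper: default to v2, force v1 via the variable."""
    return "gemm_v1" if use_gemm_v1(environ) else "gemm_v2"
```

Passing an explicit mapping instead of reading `os.environ` directly keeps the selection logic easy to exercise in tests.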
  8. 11 Nov, 2025 1 commit
    • [Enhancement] Add thread count validation for ReduceOp fragment layout inference (#1225) · 67cc8611
      Lei Wang authored
      * [Enhancement] Add thread count validation for ReduceOp fragment layout inference
      
      * Introduced a check to ensure that the thread count is divisible by the replicate extent during layout inference in ReduceOpNode. This validation prevents layout inference failures and provides detailed error messages to guide users in resolving issues related to thread block sizes and fragment layouts.
      * Updated tests to remove unsupported configurations that could lead to layout inference errors, ensuring more robust testing scenarios.
      
      * lint fix
      67cc8611
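The divisibility check this commit adds to layout inference can be sketched in a few lines. The function name, the error text, and the returned quotient are hypothetical; the actual check lives in the C++ `ReduceOpNode` and is not reproduced here.

```python
def check_replicate_extent(num_threads: int, replicate_extent: int) -> int:
    """Validate that the thread count is divisible by the replicate extent.

    Returns the number of threads per replica. Hypothetical helper that
    mirrors the rule described in the commit message.
    """
    if replicate_extent <= 0:
        raise ValueError(f"replicate extent must be positive, got {replicate_extent}")
    if num_threads % replicate_extent != 0:
        raise ValueError(
            f"thread count {num_threads} is not divisible by replicate extent "
            f"{replicate_extent}; adjust the thread block size or the fragment layout"
        )
    return num_threads // replicate_extent
```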
  9. 10 Nov, 2025 2 commits
    • [Enhancement] Improve iterator handling in layout utilities and parallel operations (#1221) · 2957afca
      Lei Wang authored
      * [Enhancement] Improve iterator handling in layout utilities and parallel operations
      
      * Added a new function, DivideUnusedIterators, to detect per-iterator gaps in fused index expressions, enhancing the accuracy of unused iterator detection.
      * Updated CompleteBufferFragment to prefer direct inversion for bijective index mappings and introduced a fallback mechanism for non-bijective cases, improving layout inversion robustness.
      * Added a new test for layout inference in fused kernels to ensure correct compilation and execution without layout inversion failures.
      
      * lint fix
      2957afca
    • [Fix] Fix buffer re-import typo in tilelang.language (#1214) · d5fda276
      Kuris authored
      * Fix Buffer re-import typo in tilelang.language
      
      * fix lint error
      d5fda276
  10. 09 Nov, 2025 1 commit
    • [Bugfix] Enhance LetStmt Handling in Pipeline Transform (#1212) · 85218bd9
      Lei Wang authored
      * [Enhancement] Introduce LetWrapper for handling loop variable substitutions in pipeline rewriting
      
      * Added LetWrapper struct to encapsulate variable and value pairs for loop variable substitutions.
      * Updated PipelineRewriter to accept a vector of LetWrapper instances, allowing for proper handling of Let statements that depend on the pipeline loop variable.
      * Enhanced the BuildPipeline method to incorporate LetWrapper instances into rewritten blocks, ensuring correct substitutions during pipeline execution.
      * Refactored logic for processing Let statements to differentiate between those that use the loop variable and those that do not, improving the flexibility of the pipeline transformation.
      
      * Refactor lambda expression for clarity in loop variable usage check in inject_pipeline.cc
      
      * [Test] Add regression test for loop variable handling in kernel compilation
      
      * Introduced a new test case to verify correct handling of loop variables in the kernel compilation process, addressing a regression issue with InjectSoftwarePipeline.
      * The test ensures that the loop variable is not left as a free variable, which previously caused failures in MakePackedAPI.
      * Configurations are set to disable warp specialization and TMA lowering to align with the original issue reproduction.
      
      * Remove unused import in regression test for loop variable handling in kernel compilation
      85218bd9
  11. 06 Nov, 2025 2 commits
  12. 05 Nov, 2025 2 commits
    • [Feature] Add `tl.infinity` operator for infinity handling of bfloat16 (#1175) · 11456de2
      Tong WU authored
      
      
      * Update dependency version for apache-tvm-ffi in pyproject.toml to fix CI
      
      * [Math] Add `tl.infinity` operation and update Python interface for infinity handling
      
      - Implemented `infinity_op` in C++ to return infinity values for supported data types.
      - Registered new operation `tl.infinity` with appropriate attributes.
      - Updated Python interface to call the new `tl.infinity` operation instead of the previous method.
      
      * Add unit tests for `tl.infinity` operation in TileLang
      
      - Introduced a new test file `test_tilelang_language_infinity.py` to validate the behavior of the `tl.infinity` operation across multiple data types (float16, bfloat16, float32, float64).
      - Implemented a kernel to fill a tensor with infinity values and asserted the correctness of the output against PyTorch's `torch.inf`.
      
      * lint
      
      ---------
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
      11456de2
    • [Feat] Add swap-like grammar in tuple assignment (#1185) · 055f8500
      Kurisu authored
      * [Feat] add two-phase binding to allow swapping two variables
      
      * Minor update tvm dtype constructor
      
      * fix lint error
      055f8500
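Two-phase binding, as named in this commit, means every right-hand side is evaluated before any target is rebound, which is what makes `a, b = b, a` behave as a swap. The sketch below models this on a plain dictionary of variables. It is not the TileLang frontend code, and `assign_tuple` and `assign_sequential` are hypothetical names.

```python
from typing import Any, Callable, Dict, Sequence

Env = Dict[str, Any]
Expr = Callable[[Env], Any]


def assign_tuple(env: Env, targets: Sequence[str], exprs: Sequence[Expr]) -> Env:
    """Two-phase binding: evaluate all values first, then bind them."""
    # Phase 1: evaluate every right-hand side against the current bindings.
    values = [expr(env) for expr in exprs]
    # Phase 2: bind the precomputed values to their targets.
    for name, value in zip(targets, values):
        env[name] = value
    return env


def assign_sequential(env: Env, targets: Sequence[str], exprs: Sequence[Expr]) -> Env:
    """Single-phase binding for contrast: later reads see earlier writes."""
    for name, expr in zip(targets, exprs):
        env[name] = expr(env)
    return env


swap_exprs = [lambda e: e["b"], lambda e: e["a"]]
swapped = assign_tuple({"a": 1, "b": 2}, ["a", "b"], swap_exprs)
clobbered = assign_sequential({"a": 1, "b": 2}, ["a", "b"], swap_exprs)
```

Here `swapped` ends up with the two values exchanged, while `clobbered` holds the same value under both names because `a` was overwritten before `b` was computed.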
  13. 04 Nov, 2025 2 commits
    • [Refactor] Improve Python3.9 compatibility for ParamSpec and Self (#1190) · 7d961892
      Lei Wang authored
      * [Feature] Enhance fill operation to support various buffer types
      
      - Added support for `BufferLoad` in the `fill` function to handle different buffer types.
      - Updated `Fill` class to process region descriptors and buffer regions, improving flexibility in buffer handling.
      - Introduced checks for static bounds in region definitions to ensure safety during operations.
      - Refactored loop induction variable handling in `FillNode` to accommodate sliced regions.
      
      * lint fix
      
      * [Refactor] Improve Python compatibility for ParamSpec and Self
      
      - Added compatibility handling for ParamSpec and Self to support Python versions below 3.10 and 3.11 respectively.
      - Updated type annotations across multiple files to ensure consistent usage of typing features.
      
      * [Update] Require Python 3.9 and enhance type annotations
      
      - Updated the minimum required Python version from 3.8 to 3.9 in `pyproject.toml`.
      - Removed references to Python 3.8 in classifiers.
      - Changed type annotations from `int | None` to `Optional[int]` in multiple example files for better clarity and compatibility.
      - Improved import statements to use `collections.abc` for `Iterable` and `contextlib` for `AbstractContextManager` in relevant files.
      
      * [Refactor] Update import statements to enhance type annotations
      
      - Replaced imports from `typing` with `collections.abc` for `Iterable` and `Mapping` in relevant files to improve compatibility and clarity.
      - Updated the caching decorator from `functools.lru_cache` to `functools.cache` for better performance in the C++ compiler retrieval function.
      - Adjusted import statements in the language proxy file to maintain consistency in type annotations.
      
      * disable rocm rs nt test.
      
      * lint fix
      7d961892
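The Python 3.9 conventions listed in this commit (`Optional[int]`-style annotations, `collections.abc` imports, and `functools.cache`) combine as in the sketch below. The helper names and the list of candidate compilers are hypothetical and are not taken from the TileLang sources.

```python
import functools
import shutil
from collections.abc import Iterable
from typing import Optional


def first_available(names: Iterable[str]) -> Optional[str]:
    """Return the path of the first executable found on PATH, or None."""
    for name in names:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


@functools.cache  # available since Python 3.9; equivalent to lru_cache(maxsize=None)
def find_cxx_compiler() -> Optional[str]:
    """Hypothetical cached lookup of a C++ compiler."""
    # Assumption: these candidate names are illustrative only.
    return first_available(("g++", "clang++", "c++"))
```

`Optional[str]` is used instead of `str | None` because the `|` syntax in evaluated annotations requires Python 3.10 or later.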
    • [Feature] Enhance fill operation to support various buffer types (#1189) · a03df604
      Lei Wang authored
      * [Feature] Enhance fill operation to support various buffer types
      
      - Added support for `BufferLoad` in the `fill` function to handle different buffer types.
      - Updated `Fill` class to process region descriptors and buffer regions, improving flexibility in buffer handling.
      - Introduced checks for static bounds in region definitions to ensure safety during operations.
      - Refactored loop induction variable handling in `FillNode` to accommodate sliced regions.
      
      * lint fix
      a03df604
  14. 03 Nov, 2025 2 commits
    • [Fix] fix type incompatible error in #1115 (#1180) · 4ef94f22
      Kurisu authored
      * Fix incompatible floordiv in packed api
      
      * fix lint
      4ef94f22
    • [Language] Initial version of tilelang frontend v2 (#1120) · 5f202fe5
      Kurisu authored
      
      
      * tilelang frontend v2
      
      * syntax sugar: defining a local var by annotation
      
      * [Refactor] fix type linting warning like `T.float32`
      
      * Add tl.local_var_init for new tl.float32
      
      * allow passing default argument as function annotation
      
      * allow default arguments as annotation
      
      * fix lint error
      
      * minor fix
      
      * [Refactor] refactor tilelang.jit and tilelang.autotune
      
      * minor fix
      
      * minor fix
      
      * minor fix
      
      * fix metal get function name
      
      * add par_compile impl and tests
      
      * Type consistency on tvm datatype
      1. isinstance(tl.float32, tvm.DataType) == True
      2. Allow `tl.float32` as function annotations
      3. Allow `tl.float32` as argument to be passed to `tl.alloc` or other functions
      
      * fix lint error
      
      * add more warning in frontend
      
      * update tvm version
      
      * Minor fix on tvm_ffi annotations
      
      * add document and examples
      
      * fix lint error
      
      * Simplify index calculations in example_chunk_o_bwd.py
      
      Refactor index calculations for dg_last_fragment assignment.
      
      * minor fix
      
      * lint fix
      
      ---------
      Co-authored-by: Lei Wang <leiwang1999@outlook.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      5f202fe5
  15. 02 Nov, 2025 3 commits
    • [Language] Add Correctness and performance check scripts for V2 (#1174) · d99853b6
      Lei Wang authored
      * fix
      
      * lint fix
      
      * fix
      
      * lint fix
      
      * fix
      
      * upd
      d99853b6
    • [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986) · aef0a6bb
      Lei Wang authored
      
      
      * remove debug print
      
      * pipeline fix
      
      * use the correct buffer access scope
      
      * rs support
      
      * wrap warpgroup_fence_operand
      
      * fix
      
      * fp8 dtype ptx enhance
      
      * mma fix
      
      * TCGEN05 Interface
      
      * tcgen05 support
      
      * rebase
      
      * update
      
      * Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.
      
      * lint fix
      
      * Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.
      
      * wgmma fix
      
      ---------
      Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
      aef0a6bb
    • Lei Wang's avatar
      c85bb3ac
  16. 01 Nov, 2025 1 commit
  17. 31 Oct, 2025 1 commit
    • [FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi (#1108) · 10911e28
      Lei Wang authored
      
      
      * 3rdparty tvm bump
      
      * bump tvm into v0.22.0
      
      * lint fix
      
      * rebase tvm
      
      * Update submodule tvm to latest commit 3085bc4
      
      * Refactor: Update configuration retrieval in CopyNode and adjust test registration in tilelang
      
      * test fix
      
      * add requirement
      
      * atomic_fix
      
      * atomic_fix
      
      * phaseout py39
      
      * optimize
      
      * optimize
      
      * lint fix
      
      * do not clean cache
      
      * do not clean cache
      
      * [Minor] Minor update for Python versions and dependencies
      
      * [Lint] fix lint for py39
      
      * [Lint] fix lint for ROCm
      
      * [Build][CI] Sync CI changes from upstream/sdist
      
      * [Lint] fix lint for ROCm
      
      * [Build][CI] Update `repair-wheel-command`
      
      * [Minor] update abi3audit result format
      
      * [Lint] fix lint for ROCm
      
      * [BugFix] fix build
      
      * [Lint] fix lint for ROCm
      
      * [BugFix] set rpath for libtvm and libtvm_runtime
      
      * [Deps] pin apache-tvm-ffi version
      
      * [Build] set Python 3.9 Limited API for Cython target
      
      * [Build] set Python 3.9 Limited API for Cython target
      
      * [Deps] Restore Python 3.8 support
      
      * [Build] use `apache-tvm-ffi`'s `libtvm_ffi`
      
      * [BugFix] use `;` as delimiter for RPATH on macOS
      
      * [BugFix] use `--ignore-missing-dependencies` for `delocate-wheel`
      
      * [Build] support `sccache` if available
      
      * [Build] add CIBW import test
      
      * [Build][CI] enable ccache for CIBW on Linux
      
      * [BugFix] set rpath for libtvm and libtvm_runtime
      
      * Revert "[Build][CI] enable ccache for CIBW on Linux"
      
      This reverts commit cd9ab57bb5ddd2572c60bcbbebde81480a658fd3.
      
      * [CI] fix perfbench bot
      
      * [BugFix] use Python 3.9 to build wheel
      
      * [Minor] update perfbench bot envs
      
      * [BugFix] fix CIBW environment on Linux
      
      * [CI] skip import test on CentOS 7
      
      * [CI] use Python urllib to download file instead of Wget
      
      ---------
      Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
      10911e28
  18. 29 Oct, 2025 3 commits
  19. 28 Oct, 2025 3 commits
  20. 27 Oct, 2025 1 commit
  21. 23 Oct, 2025 2 commits
    • [Feature] Support None type as input for `T.ptr` and `T.Tensor` (#1114) · 50e789dd
      Wenhao Xie authored
      * [Feature] Support None type as input for T.ptr and T.Tensor
      
      * lint
      
      * lint
      
      * lint
      
      * lint fix
      50e789dd
    • [Feature] Enhance vectorized conversion support in CUDA codegen (#1095) · a148d62a
      Tong WU authored
      * [Feature] Add vectorized float16 and float32 conversion support in CUDA codegen
      
      * Implemented handling for conversions between float16 and float32 types, specifically for vectorized operations using __half22float2 and __float22half2_rn.
      * Enhanced the existing code to support both directions of conversion based on the lane count.
      * Improved overall type handling in the VisitExpr_ method for better compatibility with TileLang.
      
      * [Feature] Add float32 to float8 conversion support in CUDA codegen
      
      * Implemented handling for conversion from float32 to float8 (E4M3/E5M2) in the VisitExpr_ method.
      * Added vectorized conversion support using __nv_cvt_float2_to_fp8x2 for float2 to fp8x2 transformations.
      * Enhanced type handling for better compatibility with TileLang, particularly for float8 types.
      
      * lint
      
      * fix a bug
      
      * [Enhancement] Support lanes=4 cases and add unit test for vectorized cast
      
      * lint
      
      * [Feature] Refactor bf16 conversion operations and remove legacy compile flags
      
      * lint
      a148d62a
  22. 21 Oct, 2025 3 commits
    • [Language] Support tilelang `alloc_var(dtype, init=x)` (#1092) · bddb125e
      Lei Wang authored
      * - carry existing local-var initializer map into OpaqueBlockLower, reattach it to
          generated Allocates and the PrimFunc attrs
        - thread the map through FlattenBuffer and StorageRewrite so flattened/merged
          allocations keep their tl.local_var_init annotations
        - teach annotation handling to accept scalar initializers, resolve buffers, and merge
          with existing stat
      
      * lint fix
      
      * enhance
      
      * lint fix
      
      * lint fix
      bddb125e
    • [PassConfig] Introduce PassConfig `TL_STORAGE_REWRITE_DETECT_INPLACE` (#1089) · cdc67fc4
      Lei Wang authored
      * Enable configurable StorageRewrite inplace detection
      
        - Add kStorageRewriteDetectInplace constant and register the flag with PassContext so C++ code no longer hard-codes the key.
        - Wire StorageRewrite to include TileLang builtin constants and honor the new config toggle when deciding inplace reuse.
        - Document the flag across Python surfaces (PassConfigKey, JIT/autotuner docs) with usage guidance and simplified IR examples.
      
      * lint fix
      
      * add test
      
      * lint fix
      cdc67fc4
    • [Cleanup] Remove `tilelang.disable_cache()` calls from examples and tests (#1088) · 0c7e7419
      Tong WU authored
      * [Cleanup] Remove `tilelang.disable_cache()` calls from example scripts
      
      * lint
      
      * lint
      0c7e7419