- 05 Feb, 2026 1 commit
qisan authored
- 04 Feb, 2026 1 commit
zhangqha authored
- 22 Dec, 2025 4 commits
qisan authored
qisan authored
qisan authored
guchaoyang authored
- 21 Dec, 2025 1 commit
Lei Wang authored
[Refactor] Phase out the legacy PassConfig options `kDisableDynamicTailSplit` and `kDynamicAlignment` (#1486)
* [Cleanup] Remove the dynamic shape example and related tests.
* Deleted the dynamic shape example script `example_dynamic.py` and its corresponding test file `test_example_dynamic.py` to streamline the codebase.
* Removed the unused dynamic tail split and dynamic alignment configurations from `builtin.h` and `pass_config.py`.
* Cleaned up the dynamic shape testing files to eliminate redundancy and improve maintainability.
* Build fix.
- 20 Dec, 2025 1 commit
Lei Wang authored
* [Feature] Add FullyReplicated fragment layout and enhance layout inference.
* Introduced a new static method `FullyReplicated` in the `Fragment` class to create fully replicated fragment layouts, ensuring all threads hold identical copies of the buffer.
* Updated `CopyNode` to collect fragment layouts and mark them as fully replicated during layout inference.
* Enhanced `ParallelOpNode` to expand let bindings for fragment buffer accesses, improving layout inference accuracy.
* Added documentation for the new methods and updated existing methods to support the new layout features.
* Lint fix.
* Removed debug logging statements from the layout inference process to streamline output.
- 19 Dec, 2025 7 commits
Lei Wang authored
* Use a static Z3 context.
* Update the TVM submodule reference to indicate a dirty state.
Lei Wang authored
* Update README.md with the latest news, including CuTeDSL backend support, Z3 theorem prover integration, and the migration to apache-tvm-ffi for improved compatibility.
* Update README.md to enhance the CuTeDSL backend announcement with a link to the related issue and clarify the benefits of migrating to apache-tvm-ffi, which reduces CPU overhead.
Lei Wang authored
* [Language] Enhance dtype conversion for PyTorch compatibility
  - Added support for the new float8 and float4 data types in the __dtype_as_torch__ method.
  - Implemented backend-specific handling for float8_e4m3 based on HIP or CUDA.
  - Included assertions to ensure compatibility with the PyTorch versions required for each dtype.
  - Improved error handling for unsupported dtypes.
* Fix test script execution and improve error messages for dtype assertions
  - Commented out the main execution call in the test script and replaced it with a direct call to the test function `test_divmod()`.
  - Enhanced the error messages in the dtype conversion assertions to clearly point to the required PyTorch versions.
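For reference, the kind of mapping this commit describes could look like the sketch below: a minimal, illustrative dtype-string to `torch.dtype` conversion with a backend-specific float8 variant and a version guard. The helper name and the exact guard are assumptions, not tilelang's actual `__dtype_as_torch__` code.

```python
# Hedged sketch only: dtype -> torch.dtype mapping with a backend-specific
# float8 variant and a version guard. Not tilelang's implementation.
import torch

_BASIC = {
    "float16": torch.float16,
    "bfloat16": torch.bfloat16,
    "float32": torch.float32,
    "int8": torch.int8,
    "int32": torch.int32,
}

def dtype_as_torch(dtype: str, backend: str = "cuda") -> torch.dtype:
    if dtype in _BASIC:
        return _BASIC[dtype]
    if dtype == "float8_e4m3":
        # HIP exposes the *fnuz variant, CUDA the *fn variant.
        name = "float8_e4m3fnuz" if backend == "hip" else "float8_e4m3fn"
        # float8 tensors only exist in sufficiently new PyTorch builds.
        assert hasattr(torch, name), f"{name} requires a newer PyTorch"
        return getattr(torch, name)
    raise ValueError(f"unsupported dtype: {dtype}")
```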
silentCoder-dev authored
* Remove the Triton dependence from testing and move the Triton baseline into the example.
* Use ceildiv so the Triton baseline handles arbitrary M correctly.
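The ceildiv-plus-mask pattern referred to in the second bullet looks roughly like the sketch below; this is an illustrative elementwise kernel, not the baseline from the example itself.

```python
# Hedged sketch: ceiling-division grid plus an in-kernel mask so a Triton
# kernel handles arbitrary M (illustrative kernel, not the actual baseline).
import torch
import triton
import triton.language as tl

@triton.jit
def _scale_kernel(x_ptr, out_ptr, M, scale, BLOCK_M: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK_M + tl.arange(0, BLOCK_M)
    mask = offs < M                       # guard the tail when M % BLOCK_M != 0
    x = tl.load(x_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x * scale, mask=mask)

def scale(x: torch.Tensor, s: float, block_m: int = 128) -> torch.Tensor:
    out = torch.empty_like(x)
    grid = (triton.cdiv(x.numel(), block_m),)  # ceildiv over the flattened size
    _scale_kernel[grid](x, out, x.numel(), s, BLOCK_M=block_m)
    return out
```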
Chaofan Lin authored
* [Language] Make TL scripts friendly to Python syntax highlighting.
* Add comments.
* Fix submodule.
Lei Wang authored
* feat(arg_binder): enhance shape variable handling and assertions
  - Implemented special handling for comparing if_then_else expressions to simplify conditions involving NULL checks.
  - Added methods to set shared shape variables and finalize deferred bindings, generating cascading if_then_else expressions and runtime assertions for non-NULL buffers.
  - Updated the binding logic to defer shape variable bindings for shared variables, ensuring proper handling across multiple nullable buffers.
* refactor(arg_binder): clean up shape variable handling and remove unused code
  - Removed the deprecated methods for setting shared shape variables and finalizing deferred bindings, streamlining the argument binding process.
  - Simplified the handling of shape values in the `BindDLTensor` function, binding normal shape variables immediately.
  - Eliminated unnecessary comments and code related to cascading if_then_else expressions for shared variables.
* refactor(arg_binder): enhance DLTensor binding with improved shape handling
  - Replaced the single `BindDLTensor` method with `BindDLTensors` to support multiple buffers.
  - Introduced a two-pass approach for shape variable handling, allowing better management of symbolic dimensions and null checks.
  - Asserted non-null conditions at runtime and used cascaded if_then_else expressions for shape retrieval.
  - Removed deprecated code and streamlined the binding process for clarity and maintainability.
* fix(test_nullable_buffer_params): improve formatting and consistency in test output
  - Updated string formatting for readability in the `test_nullable_shared_shape` function, used double quotes consistently, and added a missing newline at the end of the file.
* refactor(arg_binder): simplify allocation size calculation in BindDLTensors
  - Replaced a lambda with a direct loop when computing the allocation size and clarified the null-check message for data pointers.
* Remove debug prints from phase.py
  - Removed debug print statements after the MakePackedAPI transformation.
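The cascading pattern mentioned above can be illustrated with a small TIR sketch: when several nullable buffers share a symbolic dimension, the shape value is read from the first non-NULL buffer. This helper is purely illustrative and is not the `arg_binder` code.

```python
# Illustrative sketch of the cascading if_then_else idea (not the arg_binder code):
# pick the shared shape value from the first non-NULL buffer.
from tvm import tir

def cascade_shape(candidates):
    """candidates: list of (is_non_null, shape_value) PrimExpr pairs."""
    expr = candidates[-1][1]                       # fall back to the last buffer
    for not_null, value in reversed(candidates[:-1]):
        expr = tir.if_then_else(not_null, value, expr)
    return expr
```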
silentCoder-dev authored
* Rename the test for curand and add a Triton baseline.
* Add a comment explaining the four calls to T.rng_rand().
* Refactor the tilelang and Triton kernels.
* Add boundary checks for M not divisible by 128.
- 18 Dec, 2025 4 commits
qisan authored
Gabriel Wu authored
* feat: CuTeDSL backend
* fix: clang-tidy
* fix: clang-format
* fix: ci
* fix: revert example gemm fp8
* fix: remove duplicate code
* fix: switch-case
* fix: fp16 silence
* fix: TVM IR print
* fix: useless tir
* fix: clang-format
* fix: remove tilelang/contrib/cutedsl/.gitignore
* fix: use hexfloat
* fix: gsym guard
* fix: unknown storage sync type
* fix: string literal
* fix: add args guard
* fix: name hint dedup
* fix: better find_kernel_by_pattern
* fix: set libpath for from_database path
* fix: guard buffer.strides
* fix: from guard
* fix: eviction guard
* fix: use thread local tma descs
* fix: ruff
* fix: drop tma_init_cpp
* fix: exc_info
* fix: negative unmatch early return
* fix: rename postproc func and add test
* fix: handle fast math according to pass config
* fix: dyn_sym parse
* fix: wrap_forward
* fix: use tvm_ffi.libinfo instead of cli
* fix: keep signature
* fix: C++ string safety
* fix: mark tma_store_add as unsupported
* fix: tvm version
* resolve ldsm and cpasync issues
* fix: minor fixes
* fix: parse signature using ast
* fix: guard global_addr
* fix: create tempfile only when necessary
* fix: use logger.exception for exceptions
* fix: guard lib_path and host_func
* fix: remove tma_cpp_init and add timeout for cpp compile
* add timeout for mbarrier_wait
* fix: _load_kernel_from_disk signature
* resolve codegen issues
* fix: logger.exception
* add comment for div_by=1
* merge
* fix: reserve cutlass,cute,tl
* fix: guard tma_store
* fix: allow int64 offset in make_tensor_at_offset
* fix: guard barrier
* fix: add comments for div_by=16
* fix: div_by=1 issue
* delete div_by when offset is 0
* use tl.make_tensor when offset is 0
* fix: explicitly check cutedsl target
* fix: use param.torch_dtype()

Co-authored-by: yuxic <yuxic@nvidia.com>
Co-authored-by: Yong <yong@local>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Jinjie Liu authored
Signed-off-by: Jinjie Liu <jjliu@baai.ac.cn>
silentCoder-dev authored
* add curand.{curand_init, curand}
* run format.sh
* add default value for curand_init & add test for curand
* Update testing/python/language/test_rand.py: remove unused thread binding (Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>)
* remove unused library
* enable tilelang cache for testing
* run format.sh
* Revert "run format.sh" (reverts commit 5afaff782f31cdf653e2c45b469da8dead228b8a)
* Revert "enable tilelang cache for testing" (reverts commit c277a43e77938bd88d47a108dd1bd65734d4a1ae)
* Revert "remove unused library" (reverts commit 568ad20611f039380113937fd131151a2bffd801)
* run format.sh
* ensure FreshName for __philox_state
* ensure FreshName for __philox_state
* change the return type of T.rng_init

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
- 17 Dec, 2025 11 commits
Lei Wang authored
* Enhance the cache directory structure by including version information in sparse.py, ensuring separate caches for different versions.
* Fix formatting in sparse.py by adding a newline for readability and consistency.
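A minimal sketch of the cache-layout idea in the first bullet, assuming the package exposes a `__version__` string; the path layout and helper name are illustrative, not the one used by sparse.py.

```python
# Illustrative only: keep caches from different tilelang versions separate by
# adding the version string to the cache path. Layout is an assumption.
import os
import tilelang

def versioned_cache_dir(root: str, key: str) -> str:
    return os.path.join(root, f"tilelang-{tilelang.__version__}", key)
```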
Kuris authored
* Fix floordiv and floormod handling in the Z3 prover.
* Fix lint error.
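For context, floordiv/floormod in TIR follow floor (Python-style) semantics, which any Z3 encoding has to respect. The sketch below uses the z3 Python bindings to check one consequence of the defining axioms for a positive divisor; it is illustrative and not the prover code touched by this commit.

```python
# Illustrative z3py check (not the tilelang prover): axiomatize q = floordiv(a, b)
# and r = floormod(a, b) for b > 0, then verify floormod(a + b, b) == floormod(a, b).
from z3 import Ints, Solver, sat

a, b, q, r, q2, r2 = Ints("a b q r q2 r2")
s = Solver()
s.add(b > 0, a == b * q + r, 0 <= r, r < b)    # defining axioms for (q, r)
s.add(a + b == b * q2 + r2, 0 <= r2, r2 < b)   # same axioms applied to a + b
s.add(r2 != r)                                 # search for a counterexample
print("counterexample" if s.check() == sat else "property holds (unsat)")
```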
Kuris authored
Lei Wang authored
Lei Wang authored
Lei Wang authored
* [Enhancement] Update examples and tests for improved type handling: support new data types and improve PyTorch compatibility in the example scripts, update tests across multiple modules, streamline tensor operations and memory management in the examples, and add tests for the new type-conversion and buffer-handling behavior.
* [Refactor] Update the accumulation data type across examples: change it from "float" to T.float32 in the flash attention, GEMM analysis, convolution, and DeepSeek MLA examples for consistency and numerical stability.
* [Refactor] Standardize data type usage across benchmark scripts on T.float16 and T.float32, adjusting dtype assignments in matmul functions and configuration setups.
* [Refactor] Standardize data type usage in templates and scripts on string representations (e.g., "float16", "int32") instead of T.float16 and T.int32, covering the convolution, elementwise, and matrix multiplication templates.
* [Refactor] Standardize data type usage in examples and benchmarks on T.float16 and T.int32, adjusting dtype assignments in kernel functions and configuration setups for the attention, matrix multiplication, and GEMM examples.
* [Refactor] Import dtypes from the language.v2 module.
* fix
* [Refactor] Standardize data type usage across scripts on string representations (e.g., "float16", "int8") instead of T.float16 and T.int8, affecting the benchmark and attention modules.
* [Refactor] Update data type handling for consistency: use T.float32 and T.int32 in the Hint class, add the new "int4" and "int16" data types to the dtypes module, and update function signatures and assertions in the lop3 and mxfp modules to use them.
* [Enhancement] Improve data type handling and error messaging: introduce a mapping from canonical data types to their display strings, use it in the dtype creation logic, and refine the error messages for invalid source formats in the lop3 module.
* [Fix] Correct a boolean flag in the test_gemm_sp_sm90 GEMM SP test case.
* [Refactor] Standardize data type usage across scripts on T.float16 and T.bfloat16, adjusting dtype assignments in function signatures and argument parsing in the benchmarks and examples.
* [Refactor] Standardize data type usage in various modules on T.float32, T.int8, and T.int32, adjusting function signatures and parameter types across benchmarks, examples, and tests for matrix multiplication, GEMM, and tensor operations.
* [Refactor] Update argument parsing in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float.
* [Refactor] Update data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16, affecting the attention and tensor-operation examples.
* [Refactor] Fix the data type conversion call from dtype..as_torch() to dtype.as_torch() across benchmark and example scripts.
* [Refactor] Update float8 data type usage across benchmark, example, and test scripts, changing T.float8_e4m3 to T.float8_e4m3fn, particularly in the matrix multiplication and tensor-operation modules.
* [Refactor] Enhance float8 data type handling in CUDA code generation to cover additional float8 variants, tighten the type checks for float8 conversions, and streamline the float8 checks in layout inference.
* [Refactor] Streamline float8 data type handling in CUDA and related modules by refining the type conversion logic and keeping the type checks consistent, including layout inference for float8 types.
* [Refactor] Remove the unnecessary tilelang.disable_cache() call from example_group_per_split_token_cast_to_fp8.py.
* [Refactor] Update the debug print tests: pass T.bool instead of a string for dtype in the test_debug_print_buffer function.
* lint fix
* Update function parameter types from `str` to `T.dtype` for improved type safety in the attention sink and related examples.
* Refactor the `gemv_alloc_reducer` function signature for readability by formatting parameters across multiple lines.
Gongen-Ali authored
Kuris authored
Lei Wang authored
* [Enhancement] Introduce non-restrict parameter support in code generation
  - Added a new PrimFunc-level attribute `tl.non_restrict_params` to specify handle Vars that should not be marked with the restrict qualifier during code generation.
  - Updated `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP` to handle non-restrict parameters, ensuring proper treatment of overlapping buffer aliases.
  - Implemented a new annotation function `annotate_restrict_buffers` to mark buffer parameters as non-restrict.
  - Enhanced the `SplitHostDevice` transformation to propagate non-restrict parameters from host to device functions.
  - Added a new transform, `HoistNonRestrictParams`, to manage non-restrict parameters.
* [Enhancement] Improve the HoistNonRestrictParams transformation
  - Recursively collect all `tl.non_restrict_params` annotations from nested blocks, allowing more flexible annotation placement.
  - Introduced a NonRestrictCollector class to collect and deduplicate non-restrict parameters.
  - Modified SplitHostDevice to remove the non-restrict attribute from the host-side PrimFunc after propagating it to device kernels.
  - Adjusted LowerAndLegalize to apply HoistNonRestrictParams directly, without exception handling.
* [Refactor] Simplify non-restrict parameter handling in code generation
  - Removed the unnecessary normalization logic and associated data structures from `CodeGenTileLangCPP`, `CodeGenTileLangCUDA`, and `CodeGenTileLangHIP`.
  - Inserted non-restrict parameters directly into the `non_restrict` set and dropped the redundant checks against normalized names.
* [Dependency] Update the TVM subproject to commit 68aa8461 and adjust the non-restrict handling in the codegens and SplitHostDevice accordingly.
* fix
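For illustration, attaching the attribute introduced here to an existing PrimFunc could look like the sketch below; the attribute key comes from the commit, while the helper and the choice of parameters are assumptions.

```python
# Illustrative sketch: mark handle parameters that may alias as non-restrict by
# attaching the PrimFunc-level attribute named in the commit. The helper and
# the "first two params" choice are assumptions, not tilelang's API.
from tvm import tir

def mark_aliasing_params(func: tir.PrimFunc) -> tir.PrimFunc:
    aliasing = [func.params[0], func.params[1]]   # handle Vars that may overlap
    return func.with_attr("tl.non_restrict_params", aliasing)
```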
senlyu163 authored
* Remove the JIT decorator from the elementwise_add function in the examples.
* Fix kernel compilation without autotune.
* Refactor the main function to accept parameters and update the tests for the autotune option.
* Refactor the autotune test function to a more modern style.
Kuris authored
- 16 Dec, 2025 4 commits
Chaofan Lin authored
Kuris authored
* Refactor GEMM tests to use parameterized pytest fixtures
  - Converted multiple GEMM test cases in `test_tilelang_tilelibrary_gemm_sp.py` to `pytest.mark.parametrize` for better maintainability and readability.
  - Applied similar refactoring to `test_tilelang_tilelibrary_gemm_sp_v2.py`, consolidating the test cases for `run_gemm_ss`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` into parameterized tests.
  - This reduces code duplication and clarifies the test configurations.
* Update testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
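The parameterization pattern adopted here is standard pytest usage; a generic sketch follows, where the parameter values and the test body are placeholders rather than the real test matrix.

```python
# Generic sketch of the pytest.mark.parametrize pattern used in the refactor;
# the parameter values and the body are placeholders, not the real tests.
import pytest

@pytest.mark.parametrize("trans_A,trans_B", [(False, False), (False, True)])
@pytest.mark.parametrize("in_dtype,out_dtype", [("float16", "float16"), ("float16", "float32")])
def test_gemm_ss(in_dtype, out_dtype, trans_A, trans_B):
    # stands in for the run_gemm_ss(...) helper consolidated by the refactor
    ...
```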
Lei Wang authored
* [Enhancement] Update KernelParam to use tvm.DataType directly and add a torch_dtype conversion method
  - Changed dtype in KernelParam from torch.dtype to tvm.DataType to support a wider range of data types and prevent information loss during conversions.
  - Added a new method, torch_dtype, to convert tvm.DataType back to torch.dtype for tensor creation.
  - Updated the adapters to use the new torch_dtype method for parameter type conversion during initialization.
* [Enhancement] Refactor CUDA type handling and add support for FP4 and FP8 types
  - Renamed GetFP8Type, GetFP6Type, and GetFP4Type to GetTileLangFP8Type, GetTileLangFP6Type, and GetTileLangFP4Type for clarity.
  - Enhanced FP4 type handling to support additional lane sizes (2, 4, 8, 16, 32, 64).
  - Updated CUDA code generation to include the new FP8 and FP4 types, ensuring proper handling in PrintType and related functions.
  - Introduced new structures for FP8 types in cuda_fp8.h for better memory management and type packing.
  - Added methods in KernelParam and the tensor utilities to recognize and handle float4 types, improving compatibility with PyTorch.
  - Added logging in various CUDA functions to help track type handling and memory operations.
* lint fix
* Remove the unnecessary logging statements from CUDA code generation and delete an obsolete matrix multiplication test file.
* [Enhancement] Add support for FP4 and FP8 types in CUDA code generation
  - Enhanced PrintVecElemLoad and PrintVecElemStore to handle the new FP4 types.
  - Updated arg_binder to allow float4 to match int8 at runtime, improving compatibility with PyTorch.
  - Modified loop_vectorize to account for buffer dtype lanes in vectorization calculations.
  - Refactored the tensor type mapping to support the new float4 and float8 types.
  - Added tests for FP4 and FP8 copy operations.

Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
Kuris authored
- 15 Dec, 2025 6 commits
Dayuxiaoshui authored
* [Feature] Support regions as input to T.cumsum
  - Extend T.cumsum to accept BufferRegion and BufferLoad inputs in addition to Buffer.
  - This enables operations on buffer slices/regions such as T.cumsum(InputG_fragment[i * chunk_size:(i + 1) * chunk_size], dim=0).
  - Update cumsum_fragment to handle region inputs properly.
  - Add tests for 1D and 2D region inputs covering normal and reverse modes. Fixes #879.
* Fix formatting and add a docstring for cumsum_fragment, formatted per ruff style guidelines.
* Fix CodeRabbit review issues
  - Fix the negative dimension bounds check (dim < -len(shape) instead of dim <= -len(shape)).
  - Add src/dst shape compatibility validation for out-of-place cumsum.
  - Update the copy() type annotation to accept BufferRegion as the dst parameter.
  - Fix in-place mutation issues in the tests by using out-of-place cumsum operations, and add non-divisible size test cases for tail-region coverage.
* Fix out-of-bounds access in the region tests by clamping chunk_end with T.min(), matching the reference implementation for non-divisible sizes in both the 1D and 2D cases.
* Fix the region tests again: T.min() cannot be used directly in slice indices, so use the chunk_start + chunk_size form and rely on automatic bounds checking for non-divisible sizes.
* Fix cumsum region lowering to use the region extents, and update the tests for shared memory.
* Simplify the fragment scope check using is_fragment().

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
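A minimal sketch of the new call form, slicing a fragment the way the commit's example does; the kernel structure, shapes, and scopes below are assumptions and may need adjustment to compile.

```python
# Hedged sketch: chunked in-place cumsum over slices of a fragment, mirroring
# the T.cumsum(frag[i * chunk:(i + 1) * chunk], dim=0) form from the commit.
# Kernel structure, shapes, and scopes are illustrative assumptions.
import tilelang.language as T

def chunked_cumsum(M: int, chunk: int, dtype: str = "float32"):
    @T.prim_func
    def main(A: T.Tensor((M,), dtype), B: T.Tensor((M,), dtype)):
        with T.Kernel(1, threads=128):
            frag = T.alloc_fragment((M,), dtype)
            T.copy(A, frag)
            for i in T.serial(M // chunk):
                # cumsum over a BufferRegion slice rather than a whole Buffer
                T.cumsum(frag[i * chunk:(i + 1) * chunk], dim=0)
            T.copy(frag, B)
    return main
```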
Xiangwen Wang authored
Lei Wang authored
* Introduced a new function, IsExprInvariantInVectorBoundary, to encapsulate the logic for checking whether an expression is invariant within vector boundaries, improving code clarity and reusability.
* Updated the existing vectorization logic to use this new function when determining vectorization feasibility based on boundary conditions.
* Enhanced the comments explaining the vectorization criteria and the mathematical rationale behind the checks.
Xuehai Pan authored
* [CI] Update pre-commit hooks
* [Lint] Pass the correct `exclude-header-filter` to `clang-tidy`
* [Lint] Download the latest `run-clang-tidy` script
* [CI] Show compile commands
* [CI] Add output grouping to GHA
* [Lint] Re-order pre-commit hooks
Lei Wang authored
* Added the `get_prim_func_name` utility to extract human-readable function names from TVM PrimFuncs.
* Updated the memory cache logging in the `AutoTuner` and `KernelCache` classes to include the kernel name, improving clarity during cache hits.
* Enhanced the debug logging to provide more informative messages when checking the disk cache for kernels.
Chaofan Lin authored
* [Refactor] Refactor the InjectAssumes logic and make assumes work after SplitHostDevice.
* Address review comments.
* fix
* Fix submodule.
* fix
* Fix 3rdparty.