- 19 Dec, 2025 2 commits
-
-
silentCoder-dev authored
* remove triton dependence in testing & move triton baseline into example * use ceildiv and handles arbitrary M correctly for triton
-
silentCoder-dev authored
* rename test for curand & add triton baseline * add a comment for calling T.rng_rand() four times * refactor tilelang&triton kernel * Add boundary checks for M not divisible by 128
-
- 18 Dec, 2025 1 commit
-
-
silentCoder-dev authored
* add curand.{curand_init, curand} * run format.sh * add default value for curand_init & add test for curand * Update testing/python/language/test_rand.py Remove unused thread binding Co-authored-by:coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> * remove unused library * enable tilelang cache for testing * run format.sh * Revert "run format.sh" This reverts commit 5afaff782f31cdf653e2c45b469da8dead228b8a. * Revert "enable tilelang cache for testing" This reverts commit c277a43e77938bd88d47a108dd1bd65734d4a1ae. * Revert "remove unused library" This reverts commit 568ad20611f039380113937fd131151a2bffd801. * run format.sh * ensure FreshName for __philox_state * ensure FreshName for __philox_state * change the return type of T.rng_init --------- Co-authored-by:
coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
-
- 17 Dec, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Update examples and tests for improved type handling and functionality - Enhanced various example scripts to support new data types and improve compatibility with PyTorch. - Updated tests across multiple modules to ensure correct functionality with the latest changes in type handling. - Refactored code in examples to streamline operations and improve clarity, particularly in tensor operations and memory management. - Added comprehensive tests for new features and fixed existing issues related to type conversions and buffer handling. * [Refactor] Update accumulation data type to float32 across examples - Changed accumulation data type from "float" to T.float32 in multiple example scripts to ensure consistency and improve numerical stability. - This update affects various modules including flash attention, GEMM analysis, convolution, and deepseek MLA examples, enhancing type handling across the board. * [Refactor] Standardize data type usage across benchmark scripts - Updated data type definitions in benchmark scripts to use T.float16 and T.float32 consistently, enhancing clarity and type handling. - Adjusted dtype assignments in matmul functions and configuration setups to align with the new standard. - Improved overall code consistency and maintainability by ensuring uniform data type usage across various modules. * [Refactor] Standardize data type usage in templates and scripts - Updated data type definitions in various templates and scripts to use string representations (e.g., "float16", "int32") instead of T.float16 and T.int32 for improved consistency and clarity. - Enhanced overall code maintainability by ensuring uniform data type usage across multiple modules, including convolution, elementwise operations, and matrix multiplication templates. - This change aims to streamline type handling and improve compatibility with existing workflows. * [Refactor] Standardize data type usage in examples and benchmarks - Updated data type definitions in various example and benchmark scripts to use T.float16 and T.int32 consistently, enhancing clarity and maintainability. - Adjusted dtype assignments in kernel functions and configuration setups to align with the new standard. - Improved overall code consistency by ensuring uniform data type usage across multiple modules, including attention mechanisms, matrix multiplication, and GEMM examples. * [Refactor] Import dtypes from language.v2 module - Added import statement for dtypes from the language.v2 module to enhance type handling and maintain consistency across the codebase. - This change aims to streamline data type management and improve overall code clarity. * fix * [Refactor] Standardize data type usage across scripts - Updated data type definitions in various scripts to use string representations (e.g., "float16", "int8") instead of T.float16 and T.int8 for improved consistency and clarity. - Adjusted dtype assignments in functions and configuration setups to align with the new standard, enhancing overall code maintainability. - This change affects multiple modules, including benchmark and attention mechanisms, ensuring uniform data type usage throughout the codebase. * [Refactor] Update data type handling for consistency and clarity - Changed string representations of data types in the Hint class to use T.float32 and T.int32 for improved consistency. - Added new data types "int4" and "int16" to the dtypes module, enhancing type support across the codebase. - Updated function signatures and assertions in the lop3 and mxfp modules to utilize the new data types, ensuring uniformity in type handling. - This refactor aims to streamline data type management and improve overall code clarity and maintainability. * [Enhancement] Improve data type handling and error messaging - Introduced a mapping for canonical data types to their display strings, enhancing clarity in type representation. - Updated the dtype creation logic to utilize the new mapping, ensuring more intuitive handling of string inputs. - Refined error messages in the lop3 module to provide clearer feedback on invalid source formats, improving debugging and user experience. * [Fix] Correct boolean flag in GEMM SP test case - Updated the boolean flag in the test_gemm_sp_sm90 function to ensure proper functionality in the test case. - This change enhances the accuracy of the test and aligns it with expected behavior for the GEMM SP implementation. * [Refactor] Standardize data type usage across scripts - Updated data type definitions in various scripts to use T.float16 and T.bfloat16 consistently, enhancing clarity and maintainability. - Adjusted dtype assignments in function signatures and argument parsing to align with the new standard, ensuring uniform data type usage throughout the codebase. - This change affects multiple modules, including benchmarks and examples, improving overall code consistency and readability. * [Refactor] Standardize data type usage in various modules - Updated data type assignments in multiple scripts to utilize T.float32, T.int8, and T.int32 consistently, enhancing clarity and maintainability. - Adjusted function signatures and parameter types across benchmarks, examples, and tests to align with the new standard, ensuring uniform data type usage throughout the codebase. - This change improves overall code consistency and readability, impacting modules related to matrix multiplication, GEMM, and tensor operations. * [Refactor] Update argument parsing for data types in benchmarks - Changed argument parsing for data types in benchmark_matmul_intrinsic.py and benchmark_matmul_sp.py to use string representations ("float16", "int8", "float") instead of T.float16 and T.float. - This update enhances consistency in data type handling across benchmark scripts, improving clarity and maintainability. * [Refactor] Update data type handling in benchmark and example scripts - Changed data type arguments in benchmark and example scripts to use string representations ("float16") instead of T.float16 for improved consistency. - Updated function signatures and argument parsing to align with the new standard, enhancing clarity and maintainability across the codebase. - This change affects multiple modules related to attention mechanisms and tensor operations, ensuring uniform data type usage throughout the examples. * [Refactor] Fix data type conversion in multiple scripts - Corrected the usage of the data type conversion method from dtype..as_torch() to dtype.as_torch() across various benchmark and example scripts. - This change enhances consistency in data type handling and improves code readability, impacting modules related to attention mechanisms and tensor operations. * [Refactor] Update float8 data type usage across multiple scripts - Changed instances of T.float8_e4m3 to T.float8_e4m3fn in various benchmark, example, and test scripts to ensure consistency in data type handling. - This update enhances clarity and maintainability across the codebase, particularly in modules related to matrix multiplication and tensor operations. * [Refactor] Enhance float8 data type handling in CUDA code generation - Updated the handling of float8 data types in the CUDA code generation to include additional float8 variants, improving type conversion logic. - Adjusted conditions to ensure proper type checks for float8 conversions, enhancing clarity and maintainability in the codebase. - Modified layout inference to streamline float8 type checks, ensuring consistency across the implementation. - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy. * [Refactor] Streamline float8 data type handling in CUDA and related modules - Enhanced float8 data type handling in CUDA code generation by refining type conversion logic and ensuring consistent type checks. - Updated layout inference for float8 types to improve clarity and maintainability across the implementation. - This change impacts modules related to matrix operations and CUDA code generation, improving overall type handling and conversion accuracy. * [Refactor] Remove unnecessary cache disabling in float8 example script - Eliminated the call to tilelang.disable_cache() in example_group_per_split_token_cast_to_fp8.py to streamline the code. - This change enhances clarity and maintainability of the example script without affecting its functionality. * [Refactor] Update data type usage in debug print tests - Changed the argument for dtype in the test_debug_print_buffer function from a string representation to the corresponding T.bool type. - This update enhances consistency in data type handling within the test suite, improving clarity and maintainability. * lint fix * Update function parameter types from `str` to `T.dtype` for improved type safety in attention sink and related examples * Refactor `gemv_alloc_reducer` function signature for improved readability by formatting parameters across multiple lines.
-
- 16 Dec, 2025 2 commits
-
-
Kuris authored
* Refactor GEMM tests to use parameterized pytest fixtures - Converted multiple test cases for GEMM operations in `test_tilelang_tilelibrary_gemm_sp.py` to use `pytest.mark.parametrize` for better maintainability and readability. - Similar refactoring applied to `test_tilelang_tilelibrary_gemm_sp_v2.py`, consolidating test cases for `run_gemm_ss`, `run_gemm_rs`, `run_gemm_sr`, and `run_gemm_rr` into parameterized tests. - This change reduces code duplication and enhances the clarity of test configurations. * Update testing/python/amd/test_tilelang_gemm_mfma_preshuffle.py Co-authored-by:
coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> --------- Co-authored-by:
coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
-
Lei Wang authored
* [Enhancement] Update KernelParam to use tvm.DataType directly and add torch_dtype conversion method - Changed dtype in KernelParam from torch.dtype to tvm.DataType to support a wider range of data types and prevent information loss during conversions. - Added a new method, torch_dtype, to convert tvm.DataType back to torch.dtype for tensor creation. - Updated various adapters to utilize the new torch_dtype method for parameter type conversion during initialization. * [Enhancement] Refactor CUDA type handling and add support for FP4 and FP8 types - Renamed functions for clarity: GetFP8Type, GetFP6Type, and GetFP4Type are now GetTileLangFP8Type, GetTileLangFP6Type, and GetTileLangFP4Type respectively. - Enhanced FP4 type handling to support additional lane sizes (2, 4, 8, 16, 32, 64). - Updated CUDA code generation to include new FP8 and FP4 types, ensuring proper type handling in PrintType and related functions. - Introduced new structures for FP8 types in cuda_fp8.h to facilitate better memory management and type packing. - Added methods in KernelParam and tensor utilities to recognize and handle float4 types, improving compatibility with PyTorch. - Enhanced logging for debugging purposes in various CUDA functions to track type handling and memory operations more effectively. * lint fix * Remove unnecessary logging statements from CUDA code generation and delete obsolete matrix multiplication test file. * [Enhancement] Add support for FP4 and FP8 types in CUDA code generation - Enhanced PrintVecElemLoad and PrintVecElemStore functions to handle new FP4 types. - Updated arg_binder to allow float4 to match int8 at runtime, improving compatibility with PyTorch. - Modified loop_vectorize to account for buffer dtype lanes in vectorization calculations. - Refactored tensor type mapping to support new float4 and float8 types, ensuring correct type handling in tensor operations. - Added tests for FP4 and FP8 copy operations to validate functionality and integration with existing workflows. --------- Co-authored-by:Zhiwen Mo <zm125@ic.ac.uk>
-
- 15 Dec, 2025 2 commits
-
-
Dayuxiaoshui authored
* [Feature] Support region as input of T.cumsum - Extend T.cumsum to accept BufferRegion and BufferLoad inputs in addition to Buffer - This enables operations on buffer slices/regions like: T.cumsum(InputG_fragment[i * chunk_size:(i + 1) * chunk_size], dim=0) - Update cumsum_fragment to handle region inputs properly - Add comprehensive tests for 1D and 2D region inputs including normal and reverse modes Fixes #879 * Fix formatting and add docstring for cumsum_fragment - Add comprehensive docstring for cumsum_fragment function - Format code according to ruff style guidelines * Fix CodeRabbit review issues - Fix negative dimension bounds check (dim < -len(shape) instead of dim <= -len(shape)) - Add src/dst shape compatibility validation for out-of-place cumsum - Update copy() type annotation to accept BufferRegion as dst parameter - Fix test in-place mutation issues by using out-of-place cumsum operations - Add non-divisible size test cases for tail region coverage * Fix out-of-bounds access in region tests - Add bounds clamping using T.min() for chunk_end calculations - Prevents accessing beyond tensor bounds for non-divisible sizes - Matches reference implementation behavior - Fixes both 1D and 2D region test cases * Fix region test: use simple slice expressions instead of T.min() - Remove T.min() which cannot be used directly in slice indices - Use chunk_start + chunk_size form instead - Rely on system's automatic bounds checking for non-divisible sizes - Update comments to reflect this approach * Fix cumsum region: use region extents in lowering and update tests for shared memory * Simplify fragment scope check using is_fragment() --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
Xiangwen Wang authored
-
- 12 Dec, 2025 3 commits
-
-
Lei Wang authored
-
Xiangwen Wang authored
* Improve loop vectorize * Improve loop vectorize * Improve loop vectorize * Improve loop vectorize * Improve loop vectorize * Add some vectorize tests and comments
-
Lei Wang authored
* [Enhancement] Add __ldg intrinsic for CUDA read-only cache loads * Introduced the __ldg intrinsic to enable explicit read-only cached loads from global memory in CUDA. * Updated the corresponding documentation and added support in both CUDA and HIP code generation. * Enhanced the Python interface for __ldg to accept BufferLoad and Buffer types, improving usability. * [Enhancement] Update formatting and linting rules in pyproject.toml; minor test adjustment * Added new formatting rules in pyproject.toml to enforce consistent code style, including hanging indents and argument splitting. * Updated test_tilelang_language_intrinsics_codegen.py to improve readability by adding a blank line before the main execution block. * Refactored error messages in builtin.py for better clarity and consistency, ensuring proper formatting in function definitions and raising ValueErrors. * lint fix
-
- 06 Dec, 2025 1 commit
-
-
Kuris authored
* initial step * modify builder * scratch version of new frontend * write some tests * add many tests * add typing stub for tir.ir * remove idents * minor update * minor update * First version of jitv2 (renamed to LazyJIT) * fix pre-commit error * minor fix * fix lint error * fix lint error * Fix conditional check for PrimFunc instance --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 01 Dec, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Implement dynamic unroll factor in CUDA code generation This commit introduces support for specifying a dynamic unroll factor in the CUDA code generation. The `unroll_factor` map is added to store unroll factors for loop variables, allowing for more flexible and optimized loop unrolling. Additionally, the `unroll` function is integrated into the loop language, enabling users to define unroll factors directly in their code. This enhancement improves performance by allowing tailored unrolling strategies based on specific loop characteristics. * lint fix * [Bugfix] Correct initialization of non-zero counters in custom compress kernel and update TIR registration for gemm_sp_py to use the correct tile operation
-
- 26 Nov, 2025 2 commits
-
-
Kuris authored
-
LJC00118 authored
* Add test for T.assume * Add unit test for T.assume * Add unit test for T.assume * Add unit tests for T.assume * Remove debug print for kernel source Remove print statement for kernel source in tests. * Update test_tilelang_language_assume.py --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 25 Nov, 2025 2 commits
-
-
Chaofan Lin authored
-
Kuris authored
-
- 24 Nov, 2025 2 commits
-
-
Chaofan Lin authored
* [BugFix] Use BufferRegion in tl.cumsum to infer buffer shape * remove debug lines * remove rubbish * Fix decorator syntax for atomic_different_memory_orders_program --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Tong WU authored
* [Feat] Support warp reduce * lint * add test * lint
-
- 21 Nov, 2025 2 commits
-
-
Kuris authored
* [Fix] Fix #1307 by adding macro inside function * fix lint error * add comments and fix lint error * Remove debug print from enter_frame method Removed debug print statement from enter_frame method. --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Kuris authored
* [Feat] add missing support of uint32x2 * [Feat] Add `T.Ref` annotation and tests * fix lint error * minor update for error message on twice decl * Remove unused let_bindings_ in CodeGenC to fix #1300
-
- 20 Nov, 2025 3 commits
-
-
Lei Wang authored
* bugfix * lint fix * test * lint fix * increate procs * recover
-
Kuris authored
-
Kuris authored
* [Feature] Add support for A: T.Tensor(n + 1) and A: T.Tensor(2*n) * issue fix * fix * fix * decreate nproc for debugging --------- Co-authored-by:Lei Wang <leiwang1999@outlook.com>
-
- 19 Nov, 2025 2 commits
-
-
Chaofan Lin authored
* [Language][UX] Nested loop checker in pre-lowering stage * rename * comment * address comments
-
Kuris authored
* add typing stub for tir.ir * remove idents * minor update * [Refactor] add numpy conversion for dtype * fix lint error * remove unused np.float_ in dtype conversion * fix type in np.int_ * fix typo * minor fix * remove debug files * fix memory leak bug * fix lint error * add comments * fix lint error * remove duplicated, because tilelang doesn't dependent deprecated
-
- 18 Nov, 2025 3 commits
-
-
Lei Wang authored
* [Refactor] Update FFI type handling and simplify argument management * Refactored FFI type definitions in runtime and code generation files to use `TVMFFIAny` instead of `TVMValue`, enhancing type clarity. * Updated function registration in `runtime.cc` to utilize canonical names for better consistency. * Simplified argument handling in the `simplify` transformation, ensuring unused buffer parameters are removed only when simplification is enabled. * Adjusted autotuner and profiler parameters to standardize the execution backend to `tvm_ffi`, improving clarity in backend selection. * Removed obsolete `adapt_torch2tvm` function from tensor utilities to streamline the codebase and reduce complexity. * [Update] Sync TVM submodule and enhance kernel source handling * Updated the TVM submodule to commit cdc2aced, ensuring compatibility with recent changes. * Added functionality to print kernel source in `example_blocksparse_gemm.py` for better debugging. * Commented out the main execution call in test files to prevent unintended execution during testing. * Introduced `tilelang.disable_cache()` in various test files to streamline testing and avoid cache-related issues. * Refactored kernel source retrieval methods to improve clarity and consistency across different execution backends. * [Refactor] Clean up imports and improve code formatting * Removed unused import of `tilelang.testing` in `test_example_blocksparse_gemm.py` to streamline the code. * Reformatted several lines in `arg_binder.cc`, `make_packed_api.cc`, `tvm_ffi.py`, and `adapter.py` for improved readability and consistency. * Updated comments and spacing in `tvm_ffi.py` to enhance clarity without altering functionality. * Update execution backend options and improve resolution logic - Changed default execution backend from "cython" to "auto" in multiple locations to allow automatic selection based on the target. - Expanded the list of supported execution backends to include "torch" and "nvrtc" across various classes and functions. - Enhanced backend resolution logic in `KernelCache` and `AutoTuner` to ensure appropriate backend selection based on the target. - Updated documentation to reflect changes in execution backend options and their defaults. * lint fix * fix * Enhance argument handling in CUDA and HIP runtime modules - Updated `ExtractFuncInfo` in `rt_mod_cuda.cc` and `rt_mod_hip.cc` to map boolean argument types to int32, ensuring compatibility with device runtime. - Refactored `BindDLTensor` in `arg_binder.cc` to improve null handling and validation checks for DLTensor parameters, utilizing expression-level guards to prevent dereferencing null pointers. - Enhanced error checking for buffer shape, strides, and data fields, ensuring robust handling of optional inputs and maintaining consistency across various checks. * lint fix * lint fix * lint fix * lint fix * minor fix * fix * recover check * Refactor argument binding and validation in `arg_binder.cc` - Improved null handling and validation checks in `BindDLTensor`, ensuring safe dereferencing of pointers. - Enhanced consistency checks for buffer shape, strides, and data fields, utilizing expression-level guards. - Updated `MakePackedAPI` to maintain code clarity and consistency in argument handling. - Minor adjustments in test files to streamline kernel execution and improve readability. * lint fix * stride fix * minor fix * fix * lint fix * lint fix * Add CUDA stream access policy window helpers and integrate with L2 persistent cache management - Introduced functions to set and reset the CUDA stream access policy window, allowing for better control over L2 cache usage. - Updated runtime files to include new FFI packed functions for managing stream attributes. - Modified lower_hopper_intrin to incorporate prologue and epilogue statements for L2 cache setup and teardown. - Enhanced tests to verify the inclusion of new FFI calls in the generated kernel source. * check with symbolic * support null ptr * Update CMakeLists and lower.py for code generation and subproject status - Added `codegen_c_host.cc` to the list of source files in CMakeLists.txt for improved code generation support. - Updated the function call in `lower.py` to use `target.build.tilelang_c` for C target host code generation, enhancing compatibility. - Marked the TVM subproject as dirty to indicate local modifications. * lint fix * Update comments for clarity in quickstart.py
-
Chaofan Lin authored
* [Language] Add shape check in T.view/reshape * address comments
-
Elevator14B authored
* Fix various issues under int64_t static and dynamic shape. * Resolve reviewed issues. * Add unit test. * fix --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
- 17 Nov, 2025 1 commit
-
-
Kuris authored
* add typing stub for tir.ir * remove idents * minor update * [Refactor] add numpy conversion for dtype * fix lint error * remove unused np.float_ in dtype conversion * fix type in np.int_ * fix typo * minor fix * remove debug files
-
- 16 Nov, 2025 1 commit
-
-
Kevinzz authored
* fix nsa bwd and atomic * [Lint] * [BugFix] - New implementation for atomicMax and atomicMin using atomicCAS - PTX version atomicAdd for single 16-byte data - Modify the test cases * [Lint] --------- Co-authored-by:tzj-fxz <tzjfxz@gmail.com>
-
- 14 Nov, 2025 1 commit
-
-
Kuris authored
* add typing stub for tir.ir * remove idents * minor update * [Language] Add missing while statement * add test
-
- 12 Nov, 2025 1 commit
-
-
Lei Wang authored
* Update layout handling and introduce reshape functionality - Updated the `LayoutNode` class to include a new `Reshape` method, allowing for dynamic reshaping of layouts based on input shapes. - Enhanced the `OutputShape` method to provide better handling of cases where the analyzer cannot form an `IntervalSet`, implementing fallback mechanisms to ensure safe extents. - Refactored the `ReduceOpNode` to utilize `BufferRegion` for improved memory handling during reduction operations. - Added tests for reshaping functionality and layout transformations to ensure correctness and performance in various scenarios. * lint fix * Revert tvm submodule pointer to 1815c3e0b6ec4ead36370bbd1562025d8529017c; keep src unchanged * Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1 * Update tvm submodule to commit f0bbd3bf741413c35c389ba5dedd5be206000ad1 * remove useless prove * remove comment --------- Co-authored-by:tilelang-bot <bot@tilelang>
-
- 11 Nov, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Add thread count validation for ReduceOp fragment layout inference * Introduced a check to ensure that the thread count is divisible by the replicate extent during layout inference in ReduceOpNode. This validation prevents layout inference failures and provides detailed error messages to guide users in resolving issues related to thread block sizes and fragment layouts. * Updated tests to remove unsupported configurations that could lead to layout inference errors, ensuring more robust testing scenarios. * lint fix
-
- 06 Nov, 2025 2 commits
-
-
Kurisu authored
-
Kurisu authored
* [Feature] Support serial for with step * add more tests * fix * Enhance trip count validation in SerialForWithStep to ensure non-zero step values and prevent undefined behavior. Added error handling for zero step values and improved logging for non-constant steps. * Update builder.py * fix lint error --------- Co-authored-by:
Zhiwen Mo <zm125@ic.ac.uk> Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 05 Nov, 2025 2 commits
-
-
Tong WU authored
* Update dependency version for apache-tvm-ffi in pyproject.toml to fix CI * [Math] Add `tl.infinity` operation and update Python interface for infinity handling - Implemented `infinity_op` in C++ to return infinity values for supported data types. - Registered new operation `tl.infinity` with appropriate attributes. - Updated Python interface to call the new `tl.infinity` operation instead of the previous method. * Add unit tests for `tl.infinity` operation in TileLang - Introduced a new test file `test_tilelang_language_infinity.py` to validate the behavior of the `tl.infinity` operation across multiple data types (float16, bfloat16, float32, float64). - Implemented a kernel to fill a tensor with infinity values and asserted the correctness of the output against PyTorch's `torch.inf`. * lint --------- Co-authored-by:Zhiwen Mo <zm125@ic.ac.uk>
-
Kurisu authored
* [Feat] add 2 phase binding to allow swap two var * Minor update tvm dtype constructor * fix lint error
-
- 03 Nov, 2025 1 commit
-
-
Kurisu authored
* tilelang frontend v2 * syntax sugar: defining a local var by annotation * [Refactor] fix type linting warning like `T.float32` * Add tl.local_var_init for new tl.float32 * allow passing default argument as function annotation * allow default arguments as annotation * fix lint error * minor fix * [Refactor] refactor tilelang.jit and tilelang.autotune * minor fix * minor fix * minor fix * fix metal get function name * add par_compile impl and tests * Type consistency on tvm datatype 1. isinstance(tl.float32, tvm.DataType) == True 2. Allow `tl.float32` as function annotations 3. Allow `tl.float32` as argument to be passed to `tl.alloc` or other functions * fix lint error * add more warning in frontend * update tvm version * Minor fix on tvm_ffi annotations * add document and examples * fix lint error * Simplify index calculations in example_chunk_o_bwd.py Refactor index calculations for dg_last_fragment assignment. * minor fix * lint fix --------- Co-authored-by:
Lei Wang <leiwang1999@outlook.com> Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 02 Nov, 2025 1 commit
-
-
Lei Wang authored
* remove debug print * pipeline fix * use the correct buffer access scope * rs support * warp warpgroup_fence_operand * fix * fp8 dtype ptx enhance * mma fix * TCGEN05 Interface * tcgen05 support * rebase * update * Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors. * lint fix * Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module. * wgmma fix --------- Co-authored-by:Zhiwen Mo <zm125@ic.ac.uk>
-