1. 15 Jul, 2025 1 commit
    • [Pass][Simplify] Introduce symbolic level simplify for condition expression (#634) · 02a0cf59
      Lei Wang authored
      * [Enhancement] Add argument simplification option to StmtSimplifier
      
      - Introduced a new `simplify_arguments` flag in the `StmtSimplifier::Apply` method to control argument simplification behavior.
      - Updated the `Simplify` function to accept the new flag, allowing for enhanced flexibility in the simplification process.
      - Adjusted the `LowerAndLegalize` and `_Simplify` functions to utilize the new argument, ensuring consistent behavior across the codebase.
      - Added comments to clarify the purpose of the new flag and its impact on simplification logic.
      
      * lint fix
      
      * [Enhancement] Improve layout inference and reduce operation handling
      
      - Updated `ParallelOp::InferLayout` to check for pure buffer stores, enhancing layout inference logic.
      - Modified `ReduceOp::Lower` to include all threads in the AllReduce operation, improving performance on specific architectures.
      - Added a TODO comment in `AllReduce` to consider merging synchronization barriers for optimization.
      
      * lint fix
      
      * [Enhancement] Add input validation for GEMM parameters
      
      - Introduced checks to ensure that the dimensions M and N are divisible by their respective warp sizes (kMPerWarp and kNPerWarp) in the Gemm::ComputeWarpPartition method.
      - Added informative error messages to assist in debugging when the input parameters do not meet the required conditions.
      
      * bug fix
      02a0cf59
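      The divisibility requirement above is simple to state in code. A minimal Python sketch of the same validation (identifiers such as `m_per_warp` are illustrative stand-ins for the C++ constants, not the repo's API):

      ```python
      def check_warp_partition(m, n, m_per_warp, n_per_warp):
          # Mirrors the checks described for Gemm::ComputeWarpPartition: both
          # GEMM dimensions must tile evenly into per-warp tiles.
          if m % m_per_warp != 0:
              raise ValueError(f"M ({m}) must be divisible by kMPerWarp ({m_per_warp})")
          if n % n_per_warp != 0:
              raise ValueError(f"N ({n}) must be divisible by kNPerWarp ({n_per_warp})")

      check_warp_partition(128, 64, 16, 8)    # passes silently
      # check_warp_partition(100, 64, 16, 8)  # raises with an informative message
      ```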
  2. 14 Jul, 2025 1 commit
    • [Pass] Introduce flag to disable cp async lowering (#633) · 9c777b67
      Lei Wang authored
      * [Enhancement] Update PipelinePlanner to support async copy configuration
      
      - Modified the `Substitute` method in `PipelinePlanner` to accept a `use_async_copy` parameter, allowing for more flexible pipeline planning based on async copy requirements.
      - Updated the constructor of `PipelinePlanner` to initialize the `use_async_copy_` member variable.
      - Adjusted the logic in the pipeline planning process to conditionally apply async copy annotations based on the new parameter.
      - Commented out the `LoopVectorizeDynamic` call in `LowerAndLegalize` to prevent unintended modifications during the legalizing phase.
      
      * Refactor PipelinePlanning function for improved readability
      
      - Adjusted the formatting of the `use_async_copy` variable assignment in the `PipelinePlanning` function to enhance code clarity and maintainability.
      9c777b67
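      Conceptually, the new flag just gates whether copy stages receive async-copy annotations during planning. A toy sketch under that assumption (dicts stand in for TIR pipeline stages; only the annotation name is borrowed from TVM's software-pipeline convention):

      ```python
      def plan_pipeline(stages, use_async_copy=True):
          # With the flag off, global->shared copies are left for synchronous
          # lowering instead of being marked for cp.async.
          for stage in stages:
              if use_async_copy and stage.get("global_to_shared_copy"):
                  stage["annotation"] = "software_pipeline_async_stages"
          return stages

      planned = plan_pipeline([{"global_to_shared_copy": True}], use_async_copy=False)
      assert "annotation" not in planned[0]
      ```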
  3. 09 Jul, 2025 1 commit
    • [Refactor] Add parallel loop transform pass for condition extraction (#618) · 67b81609
      xs-keju authored
      
      
      * [Refactor] Add parallel loop transform
      
      * done format check
      
      * pull 3rdparty repo
      
      * Refactor loop variable handling in transformation utilities
      
      - Updated the logic in `loop_parallel_transform_utils.h` to simplify the handling of related loop variables.
      - Removed the check that enforced a single related loop variable, replacing it with a return statement when multiple variables are detected, enhancing clarity and maintainability of the transformation process.
      
      * Update loop_parallel_transform_utils.h
      
      * Refactor loop variable handling in transformation utilities
      
      - Enhanced the logic in `loop_parallel_transform_utils.h` to improve clarity and maintainability by simplifying the handling of related loop variables.
      - Replaced the previous enforcement of a single related loop variable with a return statement for multiple variables detected.
      
      * remove the disable-cache flag, as the commit id is now a key component of the cache key
      
      * lint fix
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      67b81609
  4. 08 Jul, 2025 1 commit
    • [Enhancement] Update ReduceOp initialization values for integer types (#614) · 80ffea6d
      Lei Wang authored
      * [Enhancement] Update ReduceOp initialization values for integer types
      
      - Modified the `MakeInitValue` method in `ReduceOp` to handle integer data types correctly by returning appropriate minimum and maximum values based on the bit width.
      - Added checks for integer types to ensure correct initialization for `kMax` and `kMin` reduction types, enhancing the robustness of the reduction operations.
      
      * [Enhancement] Update ReduceOp to handle unsigned integer initialization values
      
      - Enhanced the `MakeInitValue` method in `ReduceOp` to include support for unsigned integer data types.
      - Added conditions to return appropriate initialization values for `kMax` and `kMin` reduction types based on the data type, improving the robustness of reduction operations.
      80ffea6d
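      The initialization rule reduces to picking the comparison identity per dtype: a max-reduction starts from the smallest representable value, a min-reduction from the largest. A Python sketch mirroring the described behavior (not the C++ `MakeInitValue` itself):

      ```python
      def make_init_value(reduce_type, bits, signed):
          # Identity elements so that every real element wins the first comparison.
          if reduce_type == "max":
              return -(1 << (bits - 1)) if signed else 0
          if reduce_type == "min":
              return (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1
          raise ValueError(f"unsupported reduce type: {reduce_type}")

      assert make_init_value("max", 8, signed=True) == -128   # int8 min
      assert make_init_value("min", 8, signed=False) == 255   # uint8 max
      ```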
  5. 04 Jul, 2025 1 commit
    • [Refactor] Phaseout Pass ParallelLoopTransformer (#611) · 42c3b452
      Lei Wang authored
      * Refactor layout inference by removing the ParallelLoopTransformer class. Updated layout inference logic to streamline buffer access collection and condition handling in parallel loops. This change simplifies the code structure and enhances maintainability.
      
      * Update MHA backward test cases to use reduced dimensions for batch size and context length
      42c3b452
  6. 03 Jul, 2025 1 commit
    • [Experimental][Language] add `T.GEMM_SP` for sm90 sparse tensor core (#526) · be44758c
      botbw authored
      
      
      * [experimental] add a draft gemm_sp
      
      * [3rdparty] bump cutlass to v3.9.3
      
      * [lint] run format.sh
      
      * [chore] rebase
      
      * [chore] use abs path
      
      * [gemm_sp] add metadata layout
      
      * [ci] add more example
      
      * [lint] run format.sh
      
      * [chore] polish
      
      * [chore] move gemm_sp to experimental
      
      * [chore] polish
      
      * [lint] run format.sh
      
      * [Enhancement] Improve bulk copy handling and update GEMM sparse tensor test
      
      * Added a warning log for unsupported non-swizzled global layouts in the bulk copy operation, ensuring fallback to normal copy.
      * Refactored the GEMM sparse tensor test by removing unnecessary imports and simplifying the kernel compilation process.
      * Updated the test to directly call the `run_gemm_sp` function, enhancing clarity and functionality.
      
      * Implement Test
      
      * [Enhancement] Update GEMM SP and SM89 templates for improved functionality
      
      * Refactored GEMM SP computation to enhance warp partitioning logic, ensuring compatibility with Hopper architecture.
      * Updated layout inference to support new WGMMA conditions and improved error messaging for unsupported targets.
      * Modified SM89 templates to utilize new MMA atom structures, enhancing performance and compatibility with fp8 types.
      * Added conditional inclusion for GEMM SP header based on CUDA architecture version.
      
      * lint fix
      
      * [gemm_sp] support more layout and data types
      
      * Enhancement: sync T.gemm_sp's layout inference with T.gemm
      
      * Enhancement: support more block_k in compress util
      
      * [Enhancement] enable block_k=64
      
      * [Lint] run format.sh
      
      * [Enhancement] compressor support more dtype
      
      * Enhancement: enable block_K=32
      
      * [Lint] format.sh
      
      * [Fixbug] fix shape
      
      * Refactor: sync gemm
      
      * [Enhancement] enable transpose
      
      * [Enhancement] enable fp8_e4m3
      
      * [Enhancement] enable int8
      
      * [Lint] run format.sh
      
      * [Benchmark] add gemm_sp benchmark
      
      * [Example] fix 256 threads hang
      
      * [CI] fix ci
      
      * [Chore] resolve gemini feedback
      
      * [Benchmark] increase search space
      
      * [Lint] format
      
      * [CI] skip sparse tensor core related tests as only sm90 is supported
      
      * [CI] pass local run
      
      * Update gemm_sm89.h
      
      * lint fix
      
      * lint fix
      
      * [Enhancement] Add support for sparse GEMM and initialize CUDA architecture flags
      
      - Introduced a new boolean flag `enable_sparse_gemm_` to control the inclusion of sparse GEMM functionality in CUDA code generation.
      - Updated the `Finish` method to conditionally include the sparse GEMM header based on the new flag.
      - Implemented logic in `VisitStmt_` to enable sparse GEMM when the corresponding external call is detected.
      - Added a function to initialize the `TORCH_CUDA_ARCH_LIST` environment variable based on the target compute version, enhancing compatibility with PyTorch.
      - Refactored the initialization function into the appropriate module and ensured it is called in the sparse utilities module.
      
      * Update test_compress_utils.py
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      be44758c
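      `T.GEMM_SP` targets sm90 sparse tensor cores, which consume an operand compressed in a 2:4 pattern (two values kept per group of four) together with index metadata. A toy compressor illustrating the idea in pure Python (the repo's compress utility works on tensors and packs the metadata into bits):

      ```python
      def compress_2to4(row):
          # Keep the two largest-magnitude values per group of four and record
          # which lanes were kept -- the metadata the sparse tensor core needs.
          assert len(row) % 4 == 0
          values, meta = [], []
          for g in range(0, len(row), 4):
              group = row[g:g + 4]
              keep = sorted(sorted(range(4), key=lambda i: abs(group[i]))[-2:])
              values.extend(group[i] for i in keep)
              meta.append(tuple(keep))
          return values, meta

      vals, meta = compress_2to4([0.1, -3.0, 0.0, 2.5])
      assert vals == [-3.0, 2.5] and meta == [(1, 3)]
      ```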
  7. 02 Jul, 2025 1 commit
    • [Enhancement] Introduce options `TL_DISABLE_FAST_MATH` and `TL_ENABLE_PTXAS_VERBOSE_OUTPUT` (#609) · d7aebf4d
      Lei Wang authored
      * [Enhancement] Introduce new PassConfig options for fast math and PTXAS verbosity
      
      - Added `kDisableFastMath` and `kEnablePTXASVerboseOutput` configuration options to enhance control over compilation settings.
      - Updated `LibraryGenerator` to utilize these new pass configurations, allowing for more flexible compilation behavior based on user preferences.
      - Enhanced `PassConfigKey` enumeration to include the new options, ensuring they can be configured appropriately in the pass context.
      
      * [Refactor] Update PTXAS verbosity configuration key in LibraryGenerator
      
      - Changed the configuration key for PTXAS verbosity from `TL_VERBOSE_PTXAS_OUTPUT` to `TL_ENABLE_PTXAS_VERBOSE_OUTPUT` to align with the new naming convention introduced in recent enhancements.
      - This update ensures consistency in the configuration options used within the `LibraryGenerator` class, improving clarity and maintainability of the code.
      
      * lint fix
      d7aebf4d
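      Both switches plausibly map onto standard NVCC options: fast math is `--use_fast_math` and PTXAS verbosity is `-Xptxas -v`. Those flags are real; how `LibraryGenerator` actually assembles its command line is an assumption of this sketch:

      ```python
      def build_nvcc_flags(disable_fast_math, ptxas_verbose):
          flags = []
          if not disable_fast_math:          # TL_DISABLE_FAST_MATH unset/false
              flags.append("--use_fast_math")
          if ptxas_verbose:                  # TL_ENABLE_PTXAS_VERBOSE_OUTPUT
              flags += ["-Xptxas", "-v"]     # print per-kernel register/smem usage
          return flags

      assert build_nvcc_flags(False, True) == ["--use_fast_math", "-Xptxas", "-v"]
      assert build_nvcc_flags(True, False) == []
      ```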
  8. 01 Jul, 2025 1 commit
    • [Enhancement] Support tf32 gemm_rs (#607) · 0ff81755
      Lei Wang authored
      - Added a line break in `quickstart.py` for better readability.
      - Simplified the JIT kernel compilation in `quickstart.py` by removing the unused execution backend option.
      - Modified `example_elementwise_add.py` to disable cache for `tilelang` and optimized the element-wise addition kernel by utilizing shared memory for input tensors, improving performance.
      - Updated default values for matrix dimensions and block sizes in the argument parser to enhance usability.
      0ff81755
  9. 30 Jun, 2025 1 commit
    • [Bugfix] Avoid duplicate data access when cross-thread buffers meet replicated registers (#606) · 8df45c9d
      Lei Wang authored
      * [Enhancement] Improve debug output formatting in layout and fragment nodes
      
      - Updated the `DebugOutput` methods in `LayoutNode` and `FragmentNode` to provide more structured and informative output, including transformation details and thread range information.
      - Enhanced layout inference logic in `ParallelOp` to add predicates for cross-thread shared memory access, improving layout handling in parallel operations.
      - Minor adjustment in `layout_inference.cc` to ensure clarity in parallel loop handling.
      
      * lint fix
      8df45c9d
  10. 27 Jun, 2025 2 commits
  11. 26 Jun, 2025 2 commits
    • [Enhancement] Introduce PassConfig `TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE` to enable aggressive shared memory reuse (#602) · 3ca5a4ba
      Lei Wang authored
      
      * [Enhancement] Add aggressive shared memory merge option in memory allocation
      
      - Introduced a new configuration option `tl.enable_aggressive_shared_memory_merge` to enable aggressive merging of shared memory allocations.
      - Updated the `SharedMemLinearAccessPatternFinder` class to support an aggressive merge strategy, allowing for improved memory reuse.
      - Modified the `MergeSharedMemoryAllocations` function to incorporate the new merging strategy based on the configuration.
      - Enhanced the `PassConfigKey` enumeration to include the new aggressive merge option, ensuring it can be configured appropriately.
      
      * lint fix
      
      * [Enhancement] Add aggressive shared memory merge configuration option
      
      - Introduced a new configuration option `kEnableAggressiveSharedMemoryMerge` to enable aggressive merging of shared memory allocations, enhancing memory management capabilities.
      
      * [Enhancement] Update MergeSharedMemoryAllocations to support aggressive merge option
      
      - Modified the `MergeSharedMemoryAllocations` function to accept an `enable_aggressive_merge` parameter, allowing for more flexible memory management.
      - Introduced a new helper function `should_enable_aggressive_merge` to determine the aggressive merge configuration based on the pass context and target.
      - Updated the relevant calls in the `phase.py` and `__init__.py` files to utilize the new aggressive merge functionality, enhancing the overall memory allocation strategy.
      3ca5a4ba
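      A sketch of what the described `should_enable_aggressive_merge` helper could look like: an explicit pass config wins, otherwise a conservative opt-in default applies. The key string comes from the commit; the fallback policy (and ignoring `target` here) is assumed:

      ```python
      KEY = "tl.enable_aggressive_shared_memory_merge"

      def should_enable_aggressive_merge(pass_configs, target):
          # The real helper also consults the target; this sketch only honors
          # an explicit config and otherwise stays opt-in.
          if KEY in pass_configs:
              return bool(pass_configs[KEY])
          return False

      assert should_enable_aggressive_merge({}, "cuda") is False
      assert should_enable_aggressive_merge({KEY: True}, "cuda") is True
      ```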
    • [Enhancement] Refine error messaging in LowerBulkCopy for global and shared range checks (#599) · a664c998
      Lei Wang authored
      * [Enhancement] Improve error messaging for global and shared range legality checks in LowerBulkCopy
      
      - Updated error messages in the LowerBulkCopy function to provide clearer context when global and shared ranges are illegal.
      - Enhanced the readability of the error output by including tensor names, improving debugging and validation processes during bulk copy operations.
      
      * [Enhancement] Refine error messaging in LowerBulkCopy for global and shared range checks
      
      - Improved the clarity of error messages in the LowerBulkCopy function by enhancing the output format.
      - Included additional context in error messages to aid debugging when global and shared ranges are found to be illegal, ensuring better traceability during bulk copy operations.
      a664c998
  12. 24 Jun, 2025 1 commit
    • [Enhancement] Add strict layout map for improved buffer layout inference (#594) · 18889821
      Lei Wang authored
      - Introduced a `strict_layout_map` to enhance layout inference by ensuring that buffers with strict layout requirements are properly accounted for during the inference process.
      - Updated the inference logic to check for the presence of buffers in the `strict_layout_map` before applying layout changes, improving the accuracy of layout assignments.
      - Refactored the layout inference steps to include the copying of layouts into the new strict map, ensuring a clear separation of layout handling based on inference levels.
      18889821
  13. 23 Jun, 2025 1 commit
    • [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy (#592) · 78651bae
      Lei Wang authored
      * [Enhancement] Improve memory access condition checks in GlobalMemChecker
      
      - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations.
      - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis.
      
      * lint fix
      
      * [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy
      
      - Implemented checks to ensure that the shared memory range and global range are legal during the bulk copy operation.
      - Added assertions to validate that the extents of global and shared ranges match, improving the robustness of memory access validation in the LowerBulkCopy function.
      78651bae
  14. 22 Jun, 2025 1 commit
    • [Enhancement] Improve memory access condition checks in GlobalMemChecker (#591) · 41ec2bc6
      Lei Wang authored
      * [Enhancement] Improve memory access condition checks in GlobalMemChecker
      
      - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations.
      - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis.
      
      * lint fix
      41ec2bc6
  15. 20 Jun, 2025 2 commits
    • [Bugfix] fix missing node in ws role maker (#587) · c15d35e4
      botbw authored
      c15d35e4
    • [Enhancement] align shared memory allocations (#583) · fecc8336
      Lei Wang authored
      * [Enhancement] Update `pythonic_expr` to format type casts and improve tensor validation in Cython wrapper
      
      - Enhanced `pythonic_expr` to represent type casts as `(type)value` for better clarity in expression representation.
      - Modified tensor validation in `CythonKernelWrapper` to conditionally check for tensor contiguity based on a new `skip_tensor_validation` parameter.
      - Improved type mapping in `map_torch_type` to include version checks for new float8 types, ensuring compatibility with specific PyTorch versions.
      
      * [Feature] Implement dynamic shared memory allocation alignment
      
      - Added a new transformation pass `AlignDynamicSharedMemoryAllocations` to align dynamic shared memory allocations to specified byte boundaries, enhancing memory access efficiency.
      - Introduced a new utility class `TileLangAlignDynamicSharedMemoryAllocations` to handle the alignment logic for both allocation and buffer operations.
      - Updated the `LowerAndLegalize` function to apply the alignment transformation based on the target device's capabilities, ensuring compatibility with different architectures.
      
      * [Enhancement] Update dtype and argument defaults in GEMM autotuning example
      
      - Changed data type from `float16` to `bfloat16` for improved precision in computations.
      - Updated the default value of the `--with_roller` argument from `True` to `False` to modify the behavior of the autotuning process.
      
      * [Enhancement] Improve thread range computation in storage access
      
      - Added a new method `ComputeThreadRange` to calculate the range of threads for better access tracking.
      - Updated `AccessEntry` structure to include `thread_range`.
      - Modified various visitor methods to utilize `IRVisitorWithAnalyzer` for improved analysis during expression and statement visits.
      - Ensured thread range is computed and stored during buffer load and store operations, enhancing memory access efficiency.
      
      * [Refactor] Update comments for clarity in dynamic shared memory allocation alignment
      
      - Translated comments in `align_dynamic_shared_memory_allocations.cc` from Chinese to English for better understanding.
      - Removed an unnecessary call to `IRVisitorWithAnalyzer::VisitStmt_` in `storage_access.cc`.
      - Added a blank line for improved readability in `thread_storage_sync.cc`.
      
      * [Refactor] Enhance storage access analysis and thread range computation
      
      - Introduced `ExtractRealCondition` to improve condition handling in `IfThenElseNode` visits.
      - Updated `ComputeThreadRange` to use `Var` instead of `IterVar` for thread range mapping, enhancing clarity and consistency.
      - Wrapped statement visits in `With<arith::ConstraintContext>` to ensure proper analysis context during condition evaluations.
      
      * [Enhancement] Update default matrix dimensions in GEMM autotune example
      
      - Changed default values for matrix dimensions M, N, and K from 16384 to 4096 in `example_gemm_autotune.py` to facilitate quicker testing and benchmarking.
      
      * typo fix
      
      * enhancement
      
      * [Fix] Add conflict detection for buffer index size mismatch in thread storage sync
      
      - Implemented a check to return true if the sizes of previous and current buffer indices do not match, indicating a conflict.
      fecc8336
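      The heart of `AlignDynamicSharedMemoryAllocations` is one line of rounding arithmetic, shown here for reference (the byte boundary is a parameter of the pass; 16 is just an example):

      ```python
      def align_up(offset_bytes, alignment=16):
          # Round an allocation offset up to the next alignment boundary.
          return (offset_bytes + alignment - 1) // alignment * alignment

      assert align_up(0) == 0
      assert align_up(10) == 16   # 10 bytes -> next 16-byte boundary
      assert align_up(16) == 16   # already aligned
      ```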
  16. 18 Jun, 2025 1 commit
    • [Enhancement] Update warp specialization checking (#580) · 6cede73d
      Lei Wang authored
      * Fix L2 cache size calculation to handle symbolic expressions and ensure float conversion of hit ratios in annotation
      
      * [Enhancement] Update warp specialization check in phase.py
      
      * lint fix
      
      * [Enhancement] Add ContainsSeqStmt method to improve statement handling in merge_shared_memory_allocations.cc
      
      * [Refactor] Simplify memory copy operations in GEMM kernel tests
      
      - Updated memory copy operations in `test_tilelang_kernel_gemm.py` to use shared memory allocations for both A and B matrices, improving clarity and performance.
      - Adjusted the main execution block to include a new `run_gemm_rs` function call for testing, enhancing the test structure.
      
      * revert memory reuse pass.
      
      * revert the memory reuse and thread sync passes
      
      * Update test_tilelang_kernel_gemm.py
      
      * Update test_tilelang_kernel_mha_bwd.py
      6cede73d
  17. 16 Jun, 2025 2 commits
    • [Enhancement] Introduce wrapper util `pythonic_expr` to transform a PrimExpr into a Python string (#577) · 916ee60e
      Lei Wang authored
      
      * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic
      
      - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes.
      - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts.
      - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate.
      - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns.
      
      * lint fix
      
      * lint fix
      
      * rebase
      
      * rebase
      
      * typo
      
      * requirement fix
      
      * revert flash attention requirements
      916ee60e
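      The `(type)value` cast formatting can be pictured with a toy printer; tuples stand in for TVM PrimExpr nodes, whereas the real `pythonic_expr` walks actual IR:

      ```python
      def pythonic_expr(node):
          kind, *args = node
          if kind == "cast":                  # casts render as (type)value
              dtype, value = args
              return f"({dtype}){pythonic_expr(value)}"
          if kind == "add":
              a, b = args
              return f"{pythonic_expr(a)} + {pythonic_expr(b)}"
          return str(args[0])                 # leaves: ("var", "x"), ("const", 1)

      expr = ("cast", "int64", ("add", ("var", "x"), ("const", 1)))
      assert pythonic_expr(expr) == "(int64)x + 1"
      ```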
    • [Refactor] Phaseout tf32 Casting from GEMM Templates (#573) · 9ba8b480
      Lei Wang authored
      * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic
      
      - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes.
      - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts.
      - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate.
      - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns.
      
      * lint fix
      
      * [Refactor] Update GEMM Layout Functions and Inference Logic
      
      - Removed the `enable_padding` parameter from `makeGemmABLayout` to simplify its signature.
      - Introduced `makeGemmABLayoutHopper` for enhanced layout handling specific to Hopper architecture.
      - Updated layout inference in GEMM operations to utilize the new `makeGemmABLayoutHopper` function, improving clarity and maintainability in layout selection.
      - Adjusted related layout functions to ensure consistent behavior across different architectures.
      
      * [Refactor] Remove tf32 Casting Logic from GEMM Templates
      
      - Eliminated the `cast_float_to_tf32` function from `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` templates to streamline the code.
      - Removed conditional casting logic for float32 to tfloat32 conversion, enhancing clarity and maintainability.
      - Updated relevant sections in GEMM operations to reflect the removal of casting, ensuring consistent behavior across templates.
      - Adjusted tensor view handling to improve performance and accuracy in matrix operations.
      
      * Update bulk_copy.cc
      
      * Fix profiler initialization in GEMM test by removing TensorSupplyType argument for improved flexibility.
      9ba8b480
  18. 11 Jun, 2025 2 commits
    • [Feature] Implement Swizzle 32B (#566) · ae9668a8
      Lei Wang authored
      * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic
      
      - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes.
      - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts.
      - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate.
      - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns.
      
      * lint fix
      
      * [Refactor] Update GEMM Layout Functions and Inference Logic
      
      - Removed the `enable_padding` parameter from `makeGemmABLayout` to simplify its signature.
      - Introduced `makeGemmABLayoutHopper` for enhanced layout handling specific to Hopper architecture.
      - Updated layout inference in GEMM operations to utilize the new `makeGemmABLayoutHopper` function, improving clarity and maintainability in layout selection.
      - Adjusted related layout functions to ensure consistent behavior across different architectures.
      
      * Update bulk_copy.cc
      
      * Update __init__.py
      ae9668a8
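      Swizzled shared-memory layouts XOR column bits with row bits so that consecutive rows land in different banks. A generic illustration of the technique (the exact masks and strides of `makeQuarterBankSwizzleLayout` are not reproduced here):

      ```python
      def swizzle_offset(row, col, cols=8):
          # XOR-permute the column by the row so column 0 of successive rows
          # falls into different banks instead of all hitting bank 0.
          return row * cols + (col ^ (row % cols))

      assert [swizzle_offset(r, 0) % 8 for r in range(8)] == list(range(8))
      ```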
    • [Feature] Introduce Persistent Loop and Update GEMM Example (#563) · e7b97be2
      Yu Cheng authored
      * [Feature] Added Support for Synchronizing Grids and Persistent Threadblock Transformation
      
      - Defined the sync_grid operation in builtin.cc and builtin.h, allowing synchronization of all threads within a grid.
      - Implemented support for sync_grid in codegen_cuda.cc, ensuring proper handling of this operation in the generated CUDA code.
      - Added the PersistThreadblock transformation, enabling the conversion of thread blocks to persistent thread blocks, enhancing support for persistent kernels.
      - Updated relevant documentation and comments to reflect the addition of new features and usage instructions.
      
      * [Example] Add MLA Decode With Persistent Threadblock Example
      
      * [Feature] Introduce Persistent Loop and Update GEMM Example
      
      - Added a new persistent loop construct in the TIR framework, enabling more efficient kernel execution.
      - Updated the GEMM example to utilize the new persistent primitive, enhancing performance for matrix multiplication.
      - Introduced a `loop_break` intrinsic for better control flow within persistent loops.
      - Updated relevant files to support the new features, including changes in code generation and language interface.
      
      * lint fix
      e7b97be2
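      The persistent-loop construct follows the classic persistent-threadblock pattern: a fixed-size grid loops over all tiles instead of launching one block per tile, and `loop_break` gives an early exit from inside that loop. The scheduling idea in plain Python:

      ```python
      def persistent_tiles(num_tiles, num_blocks, block_id):
          # Each resident block strides through the tile space until exhausted.
          return list(range(block_id, num_tiles, num_blocks))

      # 10 tiles on a 4-block persistent grid:
      assert persistent_tiles(10, 4, 0) == [0, 4, 8]
      assert persistent_tiles(10, 4, 3) == [3, 7]
      ```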
  19. 07 Jun, 2025 2 commits
    • [Feature] Support persistent kernels and add persistent GEMM examples (#559) · 225aca61
      Yu Cheng authored
      * [Enhancement] Fix multi-version buffer index in nested-loop
      
      * [Feature] Support persistent kernels and add persistent GEMM example
      
      * lint fix
      
      * lint fix
      
      * [CI] Remove test_tilelang_transform_annotate_device_regions.py
      225aca61
    • [Bugfix] Add tf32 casting to GEMM templates (#556) · 8cc8db52
      Lei Wang authored
      * Add tf32 casting functionality to GEMM templates
      
      - Introduced a `cast_float_to_tf32` function to convert float32 values to tfloat32 format across gemm_sm80, gemm_sm89, and gemm_sm90 templates.
      - Implemented conditional casting in relevant sections of the GEMM operations to ensure compatibility with tfloat32 types.
      - Enhanced the handling of tensor views to support the new casting logic, improving performance and accuracy in matrix operations.
      
      * lint fix
      
      * Refactor tfloat32 casting logic in GEMM templates
      
      - Replaced the `is_tfloat32` boolean with `need_tfloat32_cast` to improve clarity and accuracy in determining when to cast float32 to tfloat32.
      - Updated relevant sections in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to utilize the new casting logic, enhancing compatibility with tfloat32 types.
      - Ensured consistent application of casting across tensor views, improving performance and correctness in matrix operations.
      
      * Refactor GEMM template functions for improved readability
      
      - Simplified the function signature of `body_rs` in both `gemm_sm80` and `gemm_sm90` templates for better clarity.
      - Adjusted the casting logic in `gemm_sm90` to ensure consistent application of `cast_float_to_tf32` across tensor views, enhancing performance and maintainability.
      
      * Enhance tf32 casting logic in GEMM templates
      
      - Updated the `cast_float_to_tf32` function in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to conditionally apply the casting only if the input is finite, improving robustness.
      - Simplified the `need_tfloat32_cast` logic to clarify the conditions under which tfloat32 casting is required, enhancing code readability and maintainability.
      
      * Refactor GEMM template functions and layout inference logic
      
      - Removed the `cast_float_to_tf32` function from `gemm_sm90` and updated the `body_sr` function to streamline the casting process for tensor views, enhancing code clarity and maintainability.
      - Improved layout inference in `layout_inference.cc` by adding checks for the layout map's definition, ensuring robustness in handling layout annotations.
      - Simplified the handling of layout maps in the `annotate_layout` function, allowing for more flexible layout definitions and error handling.
      8cc8db52
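      tf32 keeps fp32's 8-bit exponent but only 10 of its 23 mantissa bits, so a software cast amounts to clearing the 13 low mantissa bits. A sketch including the finiteness guard the commit mentions (hardware rounds to nearest; plain truncation keeps the sketch short):

      ```python
      import math
      import struct

      def cast_float_to_tf32(x):
          if not math.isfinite(x):
              return x                     # NaN/inf pass through unchanged
          bits = struct.unpack("<I", struct.pack("<f", x))[0]
          bits &= ~0x1FFF                  # zero the 13 mantissa bits tf32 drops
          return struct.unpack("<f", struct.pack("<I", bits))[0]

      assert cast_float_to_tf32(1.0) == 1.0            # exactly representable
      assert math.isinf(cast_float_to_tf32(math.inf))
      ```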
  20. 05 Jun, 2025 1 commit
    • [Enhancement] Add nvrtc execution backend (#461) · 17f7394f
      Gabriel Wu authored
      
      
      * [wip] feat: add nvrtc backend
      
      * [wip] fix: handle out_idx
      
      * [wip] refactor: move lib logic to libgen
      
      * feat: cache for nvrtc backend
      
      * fmt: run format
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: get kernel source
      
      * refactor: speedup pyimport
      
      * Improve error handling for missing cuda-python dependency in nvrtc backend. Raise ImportError with detailed installation instructions instead of logging a warning.
      
      * Enhance nvrtc backend error handling by introducing a flag to check for cuda-python availability. Raise ImportError with detailed installation instructions during initialization if the nvrtc backend is unavailable, improving user experience and clarity.
      
      * Update README.md to include recent NVRTC Backend addition, highlighting reduced compilation time for CUDA templates.
      
      * fix tl_templates
      
      * ensure CUDA context
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      17f7394f
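      The import-error handling pattern described above is worth spelling out: probe for `cuda-python` once at import time, then fail loudly when the backend is constructed. `from cuda import cuda, nvrtc` is the real cuda-python import; the class name and message text here are illustrative:

      ```python
      try:
          from cuda import cuda, nvrtc    # provided by the cuda-python package
          _CUDA_PYTHON_AVAILABLE = True
      except ImportError:
          _CUDA_PYTHON_AVAILABLE = False

      class NVRTCBackend:
          def __init__(self):
              # Raising here (not at module import) keeps other backends usable.
              if not _CUDA_PYTHON_AVAILABLE:
                  raise ImportError(
                      "The nvrtc backend requires cuda-python; install it with\n"
                      "    pip install cuda-python\n"
                      "or select a different execution backend.")
      ```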
  21. 04 Jun, 2025 3 commits
    • [Bugfix] Enhance layout inference pass for flexibility (#550) · 444b7c4e
      Lei Wang authored
      * Enhance Layout
      
      * strict update
      
      * lint fix
      
      * Refactor layout inference by removing unnecessary logging statements in `parallel.cc` and `layout_inference.cc`. This cleanup enhances code readability and reduces log clutter during layout inference steps.
      
      * lint fix
      
      * Refactor file copying logic in setup.py to simplify directory creation and file copying process. Removed unnecessary existence check before copying source files to the target directory.
      444b7c4e
    • [AMD][Enhancement] Add support for Vectorized FP8 DataPacking (#542) · 319bc6b1
      Lei Wang authored
      * [Enhancement] Add support for new FP8 types in HIP code generation
      
      * Updated `PrintConst` function in `codegen_hip.cc` to handle `float8_e4m3fnuz` type.
      * Introduced new functions in `hip_fp8.h` for creating FP8 types, including `make_fp8_e4_4_t` and `make_fp8_e4_8_t`, enhancing type handling for FP8 data structures.
      * Improved overall compatibility and performance for FP8 data types in HIP.
      
      * workaround for competition
      
      * enhance autotune
      
      * autotune cache fix
      
      * Implement validation for unused keys in AutoTuner configuration
      
      * Added a check in the AutoTuner class to raise a ValueError if there are unused keys in the configuration, enhancing error handling and ensuring configuration integrity.
      
      * lint fix
      
      * revert changes of threads
      
      * Update pipelining in `example_mla_decode.py` to improve performance
      
      * Changed the number of stages in the pipelined loop from 0 to 2, enhancing the efficiency of the attention mechanism in the decoding process.
      
      * Enhance Cython kernel validation by adding tensor attribute checks
      
      * Updated the `CythonKernelWrapper` to include dedicated methods for validating tensor device, dtype, and static shape.
      * Modified the `forward` method to utilize these new validation methods, improving error handling and ensuring input integrity.
      * Updated the `lambda_forward` function in `CythonKernelAdapter` to reflect changes in validation parameters.
      319bc6b1
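      The unused-key validation added to the AutoTuner is essentially a set difference; a sketch with illustrative names:

      ```python
      def validate_config(config, known_keys):
          # Reject tuning configs carrying keys the kernel never consumes.
          unused = set(config) - set(known_keys)
          if unused:
              raise ValueError(f"unused keys in config: {sorted(unused)}; "
                               f"expected a subset of {sorted(known_keys)}")

      validate_config({"block_M": 128}, {"block_M", "block_N", "num_stages"})  # ok
      # validate_config({"blk_M": 128}, {"block_M"})  # -> ValueError: unused keys
      ```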
    • [Refactor] Include several examples into ci (#531) · 3ca3a8af
      Lei Wang authored
      * Remove unused 2D continuous cumulative sum example and related functions from the cumsum module.
      
      * lint fix
      
      * fix split k example
      
      * Enable cache disabling in gemm_streamk example and add validation checks in if_stmt_binding transformation
      
      * Update gemm_streamk example to use tilelang's cdiv function for block calculations and add copyright notice
      3ca3a8af
  22. 01 Jun, 2025 1 commit
    • [AMD] Support float8 matrix core (#537) · 5872e647
      Lei Wang authored
      
      
      * [Enhancement] Add support for FP8 types in CUDA and HIP code generation
      
      * Updated `GetFP8Type` function in `codegen_cuda.cc` and `codegen_hip.cc` to handle new FP8 types, including `kFloat8_e4m3fnuz`.
      * Introduced a new header file `hip_fp8.h` for FP8 type definitions in HIP.
      * Modified type mappings in `dlpack.py` and `mfma_macro_generator.py` to accommodate new FP8 types.
      * Enhanced type handling in `TLHIPSourceWrapper` and `tensor.py` for better integration with FP8 types.
      * Added necessary includes and logic to support FP8 in the code generation process, improving performance and compatibility with FP8 data types.
      
      * lint fix
      
      * Update src/target/codegen_hip.cc
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * Update tilelang/intrinsics/mfma_macro_generator.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * workaround
      
      * fix
      
      * Update submodule TVM to latest commit 587028ffebfff0ded520f8f90d62f0f6b165906c
      
      * bug fix
      
      * Refactor tilelang matrix multiplication to support transposition and packing options. Adjusted shared memory shapes and loading logic for A and B matrices. Updated test cases to validate new functionality.
      
      * Refactor assertion function for tilelang matrix multiplication to improve readability by formatting parameters and aligning code. Cleaned up whitespace in intrinsic layout functions for consistency.
      
      * Update bfloat16 type definitions in common.h and gemm.h for consistency. Changed __hip_bfloat16 to hip_bfloat16 and updated MfmaTraits specialization accordingly.
      
      * lint fix
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      5872e647
  23. 31 May, 2025 1 commit
  24. 29 May, 2025 1 commit
    • [Language] Support `T.annotate_l2_hit_ratio` via `cudaStreamSetAttribute` (#539) · a65f481e
      Lei Wang authored
      * Refactor OptimizeForTarget function by removing redundant buffer allocation step and cleaning up code
      
      * Removed the PlanAndUpdateBufferAllocationLocation step from the OptimizeForTarget function to streamline the optimization process.
      * Cleaned up unnecessary whitespace in the function for improved readability.
      * Enhanced the overall clarity and maintainability of the code.
      
      * Refactor AllocateNode handling in vectorize_loop.cc
      
      * Simplified the VisitStmt_ method for AllocateNode by removing the complex extent mutation logic.
      * Streamlined the allocation process to directly call the base class method, enhancing code clarity and maintainability.
      * Improved overall readability by eliminating unnecessary comments and code related to extent handling.
      
      * Remove `tl_kernel.c` file, eliminating the backward kernel implementation and associated error handling functions. This cleanup enhances code maintainability by removing unused components related to the backward kernel processing.
      
      * Add buffer allocation planning step in OptimizeForTarget function
      
      * Introduced the PlanAndUpdateBufferAllocationLocation step to the OptimizeForTarget function, enhancing the optimization process.
      * This addition improves the overall efficiency of buffer allocation during the target optimization phase, ensuring better resource management.
      
      * Update submodule TVM to latest commit db50d4e, ensuring alignment with upstream changes.
      
      * Add L2 persistent annotation support and related functionality
      
      * Introduced a new file `lower_l2_persistent_annotation.cc` to handle the lowering of L2 persistent annotations.
      * Added functions to annotate L2 hit ratios for buffers, ensuring compatibility with global buffer requirements.
      * Updated the `LowerAndLegalize` function to include the new L2 persistent map lowering step.
      * Enhanced CUDA driver with a function to retrieve the maximum size of the persisting L2 cache.
      * Modified the `TLCUDASourceWrapper` class to integrate L2 persistent map handling during kernel launches.
      
      These changes improve the framework's ability to manage L2 cache optimizations, enhancing performance for CUDA applications.
      
      * lint fix
      a65f481e
  25. 28 May, 2025 1 commit
    • [Refactor] Disable legacy vectorization for buffer allocation (#535) · e71c7a17
      Lei Wang authored
      * Refactor OptimizeForTarget function by removing redundant buffer allocation step and cleaning up code
      
      * Removed the PlanAndUpdateBufferAllocationLocation step from the OptimizeForTarget function to streamline the optimization process.
      * Cleaned up unnecessary whitespace in the function for improved readability.
      * Enhanced the overall clarity and maintainability of the code.
      
      * Refactor AllocateNode handling in vectorize_loop.cc
      
      * Simplified the VisitStmt_ method for AllocateNode by removing the complex extent mutation logic.
      * Streamlined the allocation process to directly call the base class method, enhancing code clarity and maintainability.
      * Improved overall readability by eliminating unnecessary comments and code related to extent handling.
      
      * Remove `tl_kernel.c` file, eliminating the backward kernel implementation and associated error handling functions. This cleanup enhances code maintainability by removing unused components related to the backward kernel processing.
      
      * Add buffer allocation planning step in OptimizeForTarget function
      
      * Introduced the PlanAndUpdateBufferAllocationLocation step to the OptimizeForTarget function, enhancing the optimization process.
      * This addition improves the overall efficiency of buffer allocation during the target optimization phase, ensuring better resource management.
      e71c7a17
  26. 27 May, 2025 1 commit
    • [Enhancement] Add warp specialization attribute handling in IR and rewriter (#518) · 41bc15cb
      Yu Cheng authored
      * Introduced an `AttrFrame` for warp specialization in the IR, enhancing the handling of warp-specific optimizations.
      * Refactored the `VisitStmt_` method in `warp_specialized_rewriter.cc` to check for the new warp specialization attribute, improving the detection of warp specialization conditions.
      * Removed outdated code related to condition checks in `IfThenElseNode`, streamlining the specialization logic.
      41bc15cb
  27. 26 May, 2025 3 commits
    • [Enhancement] Add commit ID to versioning and improve logging initialization (#524) · 62a8d7f0
      Lei Wang authored
      * Updated `get_tilelang_version` to include an optional commit ID in the version string.
      * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
      * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
      * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
      * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
      62a8d7f0
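      `get_git_commit_id` is a thin wrapper over `git rev-parse`; a sketch of it, plus one way the version string could fold in the commit id (the base version shown is a placeholder, not the project's):

      ```python
      import subprocess

      def get_git_commit_id():
          # Returns the current commit hash, or None outside a git checkout.
          try:
              return subprocess.check_output(
                  ["git", "rev-parse", "HEAD"],
                  text=True, stderr=subprocess.DEVNULL).strip()
          except (subprocess.CalledProcessError, FileNotFoundError):
              return None

      def get_tilelang_version(with_commit_id=False):
          version = "0.1.0"  # placeholder; the real value comes from the VERSION file
          commit = get_git_commit_id() if with_commit_id else None
          return f"{version}+{commit[:8]}" if commit else version
      ```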
    • [Enhancement] Add atomicAdd for FLOAT16x2 and FLOAT16x4 (#522) · 46798f25
      Lei Wang authored
      * [Enhancement] Add atomic addition functions for FLOAT16x2 and FLOAT16x4 in CUDA
      
      * Introduced `AtomicAddx2` and `AtomicAddx4` functions for performing atomic addition operations on double-width float types in CUDA.
      * Updated `customize.py` to include the new `atomic_addx4` function for external calls.
      * Modified `__init__.py` to export the new atomic addition function, ensuring accessibility in the module.
      
      * lint fix
      46798f25
    • [Refactor] Replace default fp8 dtype with cute to perform fast cast (#520) · 6addc509
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
      
      * Add global barrier state and expectation handling in CUDA code generation
      
      * Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
      * Updated `Finish` method to declare the global barrier state if needed.
      * Implemented handling for `EvaluateNode` to initialize the barrier expectation.
      * Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
      * Enhanced CUDA FP8 type definitions for better alignment and structure.
      
      * Enhance CUDA FP8 type handling and debug printing
      
      * Updated `cuda_fp8.h` to replace NVidia's FP8 types with Cute's FP8 types for better compatibility and structure.
      * Added specializations for `debug_print_var` and `debug_print_buffer_value` functions to support the new FP8 types, improving debugging capabilities for these data types.
      * Updated `debug.h` to include the new `cuda_fp8.h` header for access to the FP8 type definitions.
      
      * Refactor CUDA code generation to remove unnecessary managed qualifier for global barrier state
      
      * Updated the `Finish` method in `codegen_cuda.cc` to declare the global barrier state without the `__managed__` qualifier, simplifying the declaration.
      * Added a new `sync_global` function in `builtin.py` to synchronize all threads in a block, enhancing synchronization capabilities in the TileLang framework.
      
      * Remove deprecated CUDA kernel and Python script for FP8 E4M3 casting
      
      * Deleted the `cast_to_fp8_e4m3_kernel` CUDA kernel implementation and its corresponding Python script, streamlining the codebase by removing unused components related to FP8 E4M3 type casting.
      * This cleanup enhances maintainability and reduces potential confusion regarding obsolete code.
      
      * lint fix
      6addc509
  28. 25 May, 2025 1 commit
    • [Enhancement] Support auto synchronization for global memory access (#519) · 623edf4c
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
      
      * Add global barrier state and expectation handling in CUDA code generation
      
      * Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
      * Updated `Finish` method to declare the global barrier state if needed.
      * Implemented handling for `EvaluateNode` to initialize the barrier expectation.
      * Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
      * Enhanced CUDA FP8 type definitions for better alignment and structure.
      623edf4c
  29. 24 May, 2025 1 commit
    • [Refactor] Support auto index bitwidth casting (#517) · 6ad73f6f
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
      6ad73f6f
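      The auto-check boils down to widening to 64-bit indices only when the addressed extent can overflow int32, unless a bitwidth was configured explicitly. A sketch of that rule:

      ```python
      INT32_MAX = 2**31 - 1

      def pick_index_bitwidth(max_extent, configured=None):
          # An explicit configuration wins; otherwise widen only when needed.
          if configured is not None:
              return configured
          return 64 if max_extent > INT32_MAX else 32

      assert pick_index_bitwidth(1 << 20) == 32
      assert pick_index_bitwidth(1 << 40) == 64
      assert pick_index_bitwidth(1 << 20, configured=64) == 64
      ```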
  30. 23 May, 2025 1 commit
    • [Refactor] Enhance MergeSharedMemoryAllocations Pass for Improved Liveness Analysis and Scope Management (#508) · 0fdefe2b
      Lei Wang authored
      
      * Introduced a new StmtAttr structure to track the scope level of statements, enhancing the liveness analysis process.
      * Updated the UpdateStmtAttr function to manage statement attributes effectively during memory allocation visits.
      * Modified the VisitStmt_ methods to utilize the new scope level tracking, ensuring accurate memory access patterns.
      * Refactored the LivenessAnalysis and PlanMemory functions to incorporate statement attributes, improving the handling of gen and kill points in memory management.
      * Added a new helper function allow_warp_specialized in phase.py to conditionally enable warp specialization based on pass context and target, addressing potential bugs in the MergeSharedMemoryAllocations pass.
      * Enhanced the OptimizeForTarget function to conditionally apply the MergeSharedMemoryAllocations pass based on warp specialization settings, improving robustness in memory allocation strategies.
      0fdefe2b