Commits · 62a8d7f0b1bc041fb90521e2a63a41dd5063e281 · OpenDAS / tilelang

26 May, 2025 4 commits

[Enhancement] Add commit ID to versioning and improve logging initialization (#524) · 62a8d7f0

Lei Wang authored May 27, 2025

* Updated `get_tilelang_version` to include an optional commit ID in the version string.
* Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
* Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
* Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
* Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
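
As a rough illustration of the versioning change described above, the sketch below reads the current commit hash with `git rev-parse HEAD` and appends a shortened form to a base version string. Only the name `get_git_commit_id` comes from the commit message. Its signature, the helper `format_version`, the `+<hash>` suffix format, and the example version `0.1.4` are assumptions made for illustration, not the project's actual code.

```python
import subprocess
from typing import Optional


def get_git_commit_id(repo_dir: str = ".") -> Optional[str]:
    """Return the current commit hash, or None if git or the repo is unavailable.

    The signature is an assumption; only the function name appears in the commit.
    """
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            stderr=subprocess.DEVNULL,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is not installed, or repo_dir is not inside a git repository
        return None
    return out.decode("utf-8").strip() or None


def format_version(base: str, commit_id: Optional[str] = None, length: int = 8) -> str:
    """Append a shortened commit ID to the base version (hypothetical format)."""
    if not commit_id:
        return base
    return f"{base}+{commit_id[:length]}"


if __name__ == "__main__":
    # "0.1.4" is a placeholder base version used only for this example.
    print(format_version("0.1.4", get_git_commit_id()))
```

Returning `None` instead of raising keeps a build from a source tarball (no `.git` directory) working, at the cost of a version string without the commit suffix.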

[Refactor] Reorganize Thread Synchronization Steps to make sure global... · 41c51d07

Lei Wang authored May 26, 2025

[Refactor] Reorganize Thread Synchronization Steps to make sure global synchronization can be correctly lowered (#521)

* [Refactor] Reorganize Thread Synchronization Steps in OptimizeForTarget Function

* Removed redundant thread synchronization steps for "global" and "shared" memory, streamlining the optimization process.
* Reintroduced necessary synchronization for "shared" and "shared.dyn" after the injection of PTX async copy, ensuring correct memory access patterns.
* Enhanced overall clarity and maintainability of the OptimizeForTarget function by restructuring the order of operations.

* [Refactor] Reorder Thread Synchronization and PTX Async Copy in OptimizeForTarget Function

* Removed redundant global thread synchronization step and adjusted the order of operations for shared memory synchronization.
* Ensured that the PTX async copy injection occurs after the global thread sync, improving memory access validity.
* Enhanced clarity and maintainability of the OptimizeForTarget function by restructuring synchronization steps.

[Enhancement] Add atomicAdd for FLOAT16x2 and FLOAT16x4 (#522) · 46798f25

Lei Wang authored May 26, 2025

* [Enhancement] Add atomic addition functions for FLOAT16x2 and FLOAT16x4 in CUDA

* Introduced `AtomicAddx2` and `AtomicAddx4` functions for performing atomic addition on packed FLOAT16x2 and FLOAT16x4 vector types in CUDA.
* Updated `customize.py` to include the new `atomic_addx4` function for external calls.
* Modified `__init__.py` to export the new atomic addition function, ensuring accessibility in the module.

* lint fix

[Refactor] Replace default fp8 dtype with cute to perform fast cast (#520) · 6addc509

Lei Wang authored May 26, 2025

* [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)

* Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
* Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
* Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
* Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
* Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.

* [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature

* Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
* Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
* Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.

* Add dynamic matrix multiplication example and corresponding test

* Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
* Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
* The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.

* lint fix

* Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device

* Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
* Updated the `__init__.py` file to include the new function in the module exports.

* lint fix

* Add global barrier state and expectation handling in CUDA code generation

* Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
* Updated `Finish` method to declare the global barrier state if needed.
* Implemented handling for `EvaluateNode` to initialize the barrier expectation.
* Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
* Enhanced CUDA FP8 type definitions for better alignment and structure.

* Enhance CUDA FP8 type handling and debug printing

* Updated `cuda_fp8.h` to replace NVidia's FP8 types with Cute's FP8 types for better compatibility and structure.
* Added specializations for `debug_print_var` and `debug_print_buffer_value` functions to support the new FP8 types, improving debugging capabilities for these data types.
* Updated `debug.h` to include the new `cuda_fp8.h` header for access to the FP8 type definitions.

* Refactor CUDA code generation to remove unnecessary managed qualifier for global barrier state

* Updated the `Finish` method in `codegen_cuda.cc` to declare the global barrier state without the `__managed__` qualifier, simplifying the declaration.
* Added a new `sync_global` function in `builtin.py` to synchronize all threads in a block, enhancing synchronization capabilities in the TileLang framework.

* Remove deprecated CUDA kernel and Python script for FP8 E4M3 casting

* Deleted the `cast_to_fp8_e4m3_kernel` CUDA kernel implementation and its corresponding Python script, streamlining the codebase by removing unused components related to FP8 E4M3 type casting.
* This cleanup enhances maintainability and reduces potential confusion regarding obsolete code.

* lint fix

25 May, 2025 1 commit

[Enhancement] Support auto synchronization for global memory access (#519) · 623edf4c

Lei Wang authored May 25, 2025

* [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)

* Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
* Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
* Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
* Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
* Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.

* [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature

* Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
* Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
* Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.

* Add dynamic matrix multiplication example and corresponding test

* Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
* Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
* The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.

* lint fix

* Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device

* Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
* Updated the `__init__.py` file to include the new function in the module exports.

* lint fix

* Add global barrier state and expectation handling in CUDA code generation

* Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
* Updated `Finish` method to declare the global barrier state if needed.
* Implemented handling for `EvaluateNode` to initialize the barrier expectation.
* Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
* Enhanced CUDA FP8 type definitions for better alignment and structure.

24 May, 2025 1 commit

[Refactor] Support auto index bitwidth casting (#517) · 6ad73f6f

Lei Wang authored May 24, 2025

* [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)

* Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
* Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
* Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
* Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
* Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.

* [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature

* Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
* Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
* Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.

* Add dynamic matrix multiplication example and corresponding test

* Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
* Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
* The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.

* lint fix

* Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device

* Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
* Updated the `__init__.py` file to include the new function in the module exports.

* lint fix
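
The auto-check behavior described above for `ConfigIndexBitwidthRewriter` can be pictured with a small sketch: use an explicitly configured index bitwidth when one is present, otherwise widen to 64 bits only when the largest flat index would overflow a signed 32-bit integer. The function `choose_index_bits` and its overflow criterion are assumptions for illustration. The real pass operates on TIR expressions and may decide differently.

```python
from typing import Optional, Sequence

INT32_MAX = 2**31 - 1


def choose_index_bits(shape: Sequence[int], configured_bits: Optional[int] = None) -> int:
    """Pick an index width for a buffer of the given shape (hypothetical helper).

    An explicit configuration wins; otherwise widen only when needed.
    """
    if configured_bits is not None:
        return configured_bits
    num_elements = 1
    for extent in shape:
        num_elements *= extent
    # The largest flat index into the buffer is num_elements - 1.
    return 64 if num_elements - 1 > INT32_MAX else 32
```

For example, a 1024 x 1024 buffer stays at 32-bit indices, while a 65536 x 65536 buffer has 2^32 elements and needs 64-bit indices.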

23 May, 2025 4 commits

Fix deepgemm example (#513) · 0d1eab57

Taoyu Zhu authored May 23, 2025



* fix deepgemm example

* fix deepgemm example

* make format

* Update example_deepgemm_fp8_2xAcc.py

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

[Dev] Add grouped GEMM backward example scripts (#515) · de028927

Yu Cheng authored May 23, 2025

* Introduced `example_grouped_gemm_fwd.py` and `example_grouped_gemm_bwd.py` to demonstrate grouped matrix multiplication with forward and backward operations.
* Implemented functions for grouped GEMM, input construction, and validation against PyTorch's implementation.
* Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions.
* Included a test function to validate the functionality with various input scenarios.

[Dev] Add grouped GEMM example with TileLang and PyTorch integration (#514) · fb801940

Yu Cheng authored May 23, 2025

* Introduced a new example script `example_grouped_gemm.py` demonstrating grouped matrix multiplication using TileLang and PyTorch.
* Implemented functions for performing grouped GEMM, constructing inputs, and validating results against PyTorch's implementation.
* Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions.
* Included a test function to validate the grouped GEMM functionality with various input scenarios.

[Refactor] Enhance MergeSharedMemoryAllocations Pass for Improved Liveness... · 0fdefe2b

Lei Wang authored May 23, 2025

[Refactor] Enhance MergeSharedMemoryAllocations Pass for Improved Liveness Analysis and Scope Management (#508)

* Introduced a new StmtAttr structure to track the scope level of statements, enhancing the liveness analysis process.
* Updated the UpdateStmtAttr function to manage statement attributes effectively during memory allocation visits.
* Modified the VisitStmt_ methods to utilize the new scope level tracking, ensuring accurate memory access patterns.
* Refactored the LivenessAnalysis and PlanMemory functions to incorporate statement attributes, improving the handling of gen and kill points in memory management.
* Added a new helper function allow_warp_specialized in phase.py to conditionally enable warp specialization based on pass context and target, addressing potential bugs in the MergeSharedMemoryAllocations pass.
* Enhanced the OptimizeForTarget function to conditionally apply the MergeSharedMemoryAllocations pass based on warp specialization settings, improving robustness in memory allocation strategies.

22 May, 2025 3 commits

[Enhancement] Introduce padding annotation and improve memory access validation (#511) · f23c4d30

Lei Wang authored May 22, 2025

* Added a new attribute `kPaddingMap` in `builtin.h` for managing padding annotations.
* Enhanced `SafeMemorysRewriter` to utilize an annotated padding map for buffer stores, improving memory access safety.
* Implemented checks in `layout_inference.cc` to ensure buffers are correctly referenced during layout mapping.
* Introduced a new test file for validating the padding annotation functionality in TileLang.

[Bugfix] Enhance smem copy selector for uncommon shape (#510) · dbe8689f

Lei Wang authored May 22, 2025

* [Refactor] Enhance GEMM warp partitioning logic for improved performance and flexibility

* Updated the warp partitioning logic in `Gemm::ComputeWarpPartition` to better handle various GEMM policies, including FullRow, FullCol, and Square.
* Implemented checks to dynamically adjust warp allocation based on matrix dimensions, ensuring optimal performance.
* Introduced a new `SelectCopy` template to streamline memory access patterns in CUDA templates, enhancing compatibility across different architectures.
* Refactored the Python `GemmWarpPolicy` class to align with the updated C++ logic, improving clarity and maintainability in warp allocation strategies.

* [Refactor] Optimize matrix multiplication parameters and performance in quickstart example

* Updated thread count in the kernel context from 256 to 128 to enhance performance.
* Set the matrix dimensions M and N to 1024 and the block sizes block_M and block_N to 128, improving computational efficiency.
* Adjusted the pipeline stages in the GEMM loop from 0 to 3 for better parallel execution.
* Cleaned up comments for clarity and corrected a typo in the memory copy comment.

* [Refactor] Simplify Copy type selection in OperandTraits for improved clarity

* Replaced the conditional Copy type definition with a new SelectCopy template in OperandTraits, enhancing readability and maintainability of the code.
* This change streamlines the logic for selecting memory copy patterns based on matrix dimensions and warp configurations.

[Refactor] Update buffer handling in layout transformation functions (#509) · 094796b6

Lei Wang authored May 22, 2025

* Modified `makeBufferWithLayout` to include a `var_remap` parameter for improved variable remapping during buffer creation.
* Enhanced buffer load and store operations to utilize the new variable remapping logic, ensuring correct buffer references.
* Commented out a check in `ThreadExtent` for clarity, maintaining functionality while improving code readability.

21 May, 2025 1 commit

[Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling... · 41d4988b

Lei Wang authored May 22, 2025

[Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling and initialization (#507)

* [Refactor] Update reduce functions to support default dimension values and improve dimension handling

* Added a helper function `_legalize_dim` to handle negative dimension values in reduction operations.
* Updated `reduce_max`, `reduce_min`, `reduce_sum`, `reduce_abssum`, and `reduce_absmax` functions to accept a default dimension value of -1, enhancing usability and flexibility in buffer reduction operations.
* Ensured consistent dimension handling across all reduction functions for improved clarity and correctness.

* Update submodule `tvm` to latest commit c2921fd, ensuring compatibility with recent changes.

* [Refactor] Enhance ReduceOp and JITKernel for improved dimension handling and initialization

* Updated ReduceOp to handle 1D reduction cases and ensure correct dimension checks, improving robustness in reduction operations.
* Initialized prim_func in JITKernel to enhance clarity and prevent potential null reference issues.
* Added whitespace for better code readability in reduce.py.
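
The negative-dimension handling attributed to `_legalize_dim` above follows the usual convention that `dim=-1` means the last axis. A minimal sketch, assuming a signature of `(ndim, dim)` that the commit message does not confirm:

```python
def legalize_dim(ndim: int, dim: int) -> int:
    """Map a possibly negative dimension index to its non-negative equivalent.

    Hypothetical stand-in for `_legalize_dim`; the real signature may differ.
    """
    if not -ndim <= dim < ndim:
        raise ValueError(f"dim {dim} is out of range for a {ndim}-D buffer")
    return dim + ndim if dim < 0 else dim
```

With this convention, a default of `dim=-1` on a 2-D buffer reduces along axis 1, and the same default on a 1-D buffer reduces along axis 0.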

20 May, 2025 4 commits

[Refactor] Update GlobalMemChecker to Detect Lower Bound illegal memory access automatically (#505) · 84ddb9e1

Lei Wang authored May 20, 2025

* [Refactor] Update GlobalMemChecker to use IRVisitorWithAnalyzer for improved analysis (#505)

* Refactored GlobalMemChecker to inherit from IRVisitorWithAnalyzer, enhancing its capabilities for expression analysis.
* Updated condition checks to utilize the new analyzer interface, improving clarity and correctness in memory access validation.
* Added additional lower bound condition checks to ensure comprehensive validation of memory access indices.

* [Refactor] Update GlobalMemChecker to use StmtExprVisitor for improved memory access validation

* Refactored GlobalMemChecker to inherit from StmtExprVisitor, enhancing its capabilities for expression analysis.
* Updated condition checks to utilize the new analyzer interface, improving clarity and correctness in memory access validation.
* Ensured that the analyzer is passed correctly during instantiation, maintaining consistency in condition checks.
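
To illustrate why a lower-bound check matters alongside the upper-bound one, the sketch below uses plain interval reasoning as a stand-in for the analyzer mentioned above: given the smallest and largest value an index can take, it lists the guards that cannot be proven away. The function `required_guards` and its string output are hypothetical and are not part of `GlobalMemChecker`.

```python
from typing import List


def required_guards(index_min: int, index_max: int, extent: int) -> List[str]:
    """Return the bounds conditions that must be checked at run time.

    index_min and index_max bound the index expression (inclusive);
    extent is the buffer size along that axis.
    """
    guards = []
    if index_min < 0:
        # Lower bound cannot be proven: the access may fall before the buffer.
        guards.append("index >= 0")
    if index_max >= extent:
        # Upper bound cannot be proven: the access may run past the buffer.
        guards.append(f"index < {extent}")
    return guards
```

An index such as `i - 1` with `i` in `[0, 127]` on a 128-element axis never exceeds the upper bound, so a checker that only looked at the upper bound would miss the access at `-1`.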

[Refactor] Adjust GEMM fragment layout for improved clarity and performance (#504) · c59e1aab

Lei Wang authored May 20, 2025

* Modified the layout creation in makeGemmFragmentB so that the Replicate method is called before Repeat, for better readability and performance.
* This change improves the logical flow of fragment creation, aligning with best practices for GEMM layout management.

[Refactor] Refactor `jit` to `_JitImplementation` to support `@tilelang.jit` (#502) · 8c8d8ca2

Lei Wang authored May 20, 2025

* [Refactor] Rename `jit` class to `_JitImplementation` and improve debug path handling

* Refactored the `jit` class to `_JitImplementation` for clarity and encapsulation.
* Enhanced handling of `debug_root_path` to ensure it is correctly set as an absolute path when provided.
* Updated the public `jit` function to serve as a decorator interface, allowing for both default and configured usage.
* Added validation to ensure input tensors are contiguous in the Cython wrapper, improving error handling.

* [Refactor] Improve formatting and handling in `_JitImplementation` and `jit` function

* Refactored the `_JitImplementation` class to enhance readability by adjusting comment formatting and consolidating conditions for setting `debug_root_path`.
* Updated the `jit` function signature for better alignment and clarity in parameter definitions.
* Ensured consistent spacing and comments throughout the code for improved maintainability.

* [Refactor] Update GEMM test parameters for performance optimization

* Set num_stages to 0 and adjusted matrix dimensions in the GEMM test function to enhance performance and consistency across tests in test_tilelang_jit_gemm.py.
* Reduced the number of threads used in the test to align with the updated configuration, improving overall test efficiency.

* [Refactor] Enhance buffer error logging in layout inference

* Updated the warning message in layout inference to provide clearer context when a buffer cannot be inferred due to its absence in the use list. This change improves the clarity of error reporting during layout inference operations.
* Refactored tensor handling in the Cython wrapper to ensure input tensors are checked for contiguity before processing, enhancing error handling and robustness in tensor management.

* bugfix
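
A decorator that works both bare and with arguments, as described above for the public `jit` function, is commonly written with an optional first positional parameter. The sketch below shows only that pattern. The name `jit_sketch`, the options `target` and `debug_root_path` as used here, and the `jit_config` attribute are illustrative assumptions, and no compilation takes place.

```python
import functools
import os
from typing import Callable, Optional


def jit_sketch(func: Optional[Callable] = None, *, target: str = "auto",
               debug_root_path: Optional[str] = None):
    """Usable as `@jit_sketch` or as `@jit_sketch(target=...)`."""
    if debug_root_path is not None:
        # Normalize to an absolute path when one is provided.
        debug_root_path = os.path.abspath(debug_root_path)

    def decorate(fn: Callable) -> Callable:
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            # A real implementation would compile and cache a kernel here.
            return fn(*args, **kwargs)

        wrapper.jit_config = {"target": target, "debug_root_path": debug_root_path}
        return wrapper

    # Bare usage passes the function directly; configured usage passes None.
    return decorate(func) if func is not None else decorate


@jit_sketch
def add_one(x):
    return x + 1


@jit_sketch(target="cuda")
def double(x):
    return x * 2
```

Making the options keyword-only avoids ambiguity between `@jit_sketch(some_function)` and `@jit_sketch(some_option)`.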

Fix amd docker issues. (#503) · 67bd9f69

Zhiwen Mo authored May 20, 2025


Co-authored-by: Ubuntu <srguser@srgmi300c.ibcr0fi0qgdu5pqgbnhfbyasxg.parx.internal.cloudapp.net>

18 May, 2025 1 commit

[Refactor] refactor `tilelang.jit` to support a faster and more flexible kernel cache (#501) · 25a50f1a

Lei Wang authored May 19, 2025

* [Refactor] Update JIT kernel functions and streamline GEMM tests

* Renamed and refactored matmul and run_gemm functions to matmul_kernel_jit and run_gemm_kernel_jit for clarity.
* Removed redundant JIT decorator from the matmul function, ensuring it is applied only to the kernel function.
* Updated test function names to reflect changes in the kernel functions, enhancing consistency and readability.
* Cleaned up commented-out code and unnecessary imports to improve overall code quality.

* Update main function call in GEMM test to use tilelang testing framework

* Update README and example scripts to include JIT decorator comments

* Added comments in README.md and various example scripts to indicate the use of the @tilelang.jit decorator for returning torch functions.
* Removed redundant comments that previously instructed to add the decorator, streamlining the documentation and improving clarity.

* Update GEMM test parameters for improved performance

* Set num_stages to 0 and adjusted matrix dimensions in test functions to enhance performance and consistency across GEMM tests in test_tilelang_kernel_gemm.py.

17 May, 2025 3 commits

[Refactor] Update GEMM layout and operand traits for improved CUDA compatibility (#500) · 33937683

Lei Wang authored May 18, 2025

* [Enhancement] Improve GEMM layout function and documentation

* Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies.
* Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations.
* Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases.

* lint fix

* [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility

* Adjusted the InferLayout method in gemm.cc to include trans_A in fragment creation, enhancing layout inference for transposed matrices.
* Updated OperandTraits in gemm_sm89.h and gemm_sm90.h to change the Copy type from SM75_U16x4_LDSM_N to SM75_U16x4_LDSM_T, optimizing memory access patterns for different warp configurations.
* Enhanced static assertions in gemm_sm90.h to clarify requirements for num_warp_m, ensuring compatibility with Hopper architecture.

* [Refactor] Clean up formatting in GEMM implementation and CUDA templates

* Simplified the formatting of the fragment creation in the InferLayout method of gemm.cc for better readability.
* Adjusted the static assertion message in gemm_sm90.h to enhance clarity regarding the num_warp_m requirement for Hopper architecture.

[Bugfix] Rename SM75_U16x8_LDSM_N into SM75_U16x8_LDSM_T for correctness (#499) · 2837878f

Lei Wang authored May 18, 2025

* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.

* Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.

* Add merge shared memory allocations pass and related configurations

- Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
- Registered configuration options for debugging and controlling the merging behavior.
- Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
- Adjusted import paths and added documentation for the new functionality.

* Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py

* Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.

* lint fix

* Update Copy type in OperandTraits for GEMM templates to use SM75_U16x4_LDSM_T and SM75_U16x8_LDSM_T for improved memory access patterns across CUDA architectures.
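
The autotuner timeout handling mentioned above (a custom `TimeoutException` and a `run_with_timeout` function with signal management) could look roughly like the following. The two names come from the commit message. The signature, the use of `SIGALRM` with `setitimer`, and the `demo` helper are assumptions, and this approach only works on Unix in the main thread.

```python
import signal
import time


class TimeoutException(Exception):
    """Raised when the wrapped call exceeds its time budget."""


def _raise_timeout(signum, frame):
    raise TimeoutException("function call timed out")


def run_with_timeout(func, timeout, *args, **kwargs):
    """Run func(*args, **kwargs), raising TimeoutException after `timeout` seconds."""
    previous = signal.signal(signal.SIGALRM, _raise_timeout)
    signal.setitimer(signal.ITIMER_REAL, timeout)
    try:
        return func(*args, **kwargs)
    finally:
        # Always cancel the timer and restore the previous handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


def demo():
    """Try to sleep for one second under a 50 ms budget."""
    try:
        run_with_timeout(time.sleep, 0.05, 1.0)
    except TimeoutException:
        return "timed out"
    return "finished"
```

Restoring the previous handler in `finally` keeps one timed-out configuration from affecting the next one the tuner tries.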

[Enhancement] Fallback transposed_ldmatrix into `SM75_U16x4_LDSM_N` when warp_n is 8 (#498) · 68a3c4f3

Lei Wang authored May 17, 2025

* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.

* Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.

* Add merge shared memory allocations pass and related configurations

- Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
- Registered configuration options for debugging and controlling the merging behavior.
- Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
- Adjusted import paths and added documentation for the new functionality.

* Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py

* Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.

* lint fix

16 May, 2025 3 commits

[Bugfix] Fix Hopper GEMM layout for small tile size (#497) · c93e8695

Lei Wang authored May 16, 2025

* [Enhancement] Improve GEMM layout function and documentation

* Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies.
* Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations.
* Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases.

* lint fix

[Refactor] Update main function structure in example scripts and add tests (#475) · 73ae8087

Yu Cheng authored May 16, 2025

* [Refactor] Update example_mla_decode.py and add tests for block_sparse_attn_tilelang

* Refactor example_mla_decode.py to define a main function for better structure and clarity.
* Introduce test_example_mla_decode.py to validate the functionality of example_mla_decode.
* Refactor block_sparse_attn_tilelang.py to define a main function and add test_block_sparse_attn_tilelang.py for testing.
* Ensure all new test files are integrated with tilelang testing framework.

* [Test] Enhance test_example_mla_decode with argument mocking

* Update test_example_mla_decode.py to mock sys.argv for better test isolation.
* Ensure the main function of example_mla_decode is called with the correct arguments during testing.
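
Mocking `sys.argv`, as described above, lets a test call a script's `main` function with chosen arguments without touching the real command line. A minimal sketch using `unittest.mock`: the `main` function and its `--batch` and `--heads` flags are placeholders, not the actual arguments of `example_mla_decode.py`.

```python
import argparse
import sys
from unittest import mock


def main():
    """Stand-in for an example script's entry point (flags are hypothetical)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch", type=int, default=1)
    parser.add_argument("--heads", type=int, default=8)
    args = parser.parse_args()  # reads sys.argv[1:]
    return args.batch, args.heads


def test_main_with_mocked_argv():
    # sys.argv is replaced only inside the with-block and restored afterwards.
    with mock.patch.object(sys, "argv", ["example.py", "--batch", "4"]):
        assert main() == (4, 8)
```

Because the patch is scoped to the `with` block, arguments passed to the test runner itself never leak into the example's parser.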

[Enhancement] Introduce flag to visualize shared memory merge plan (#496) · dca2fb48

Lei Wang authored May 16, 2025

* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.

* Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.

* Add merge shared memory allocations pass and related configurations

- Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
- Registered configuration options for debugging and controlling the merging behavior.
- Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
- Adjusted import paths and added documentation for the new functionality.

* Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py

14 May, 2025 1 commit

[Refactor] Introduce quantize components of TileLang and add testing for... · cde1886f

Lei Wang authored May 14, 2025

[Refactor] Introduce quantize components of TileLang and add testing for dequant gemm example (#494)

* Remove deprecated example_dequant_gemm.py and add DataType import in __init__.py

* lint fix

* lint fix

* Refactor dequantization examples to use tilelang imports and update data type handling in quantization utilities

* lint fix

13 May, 2025 3 commits

[CI] Add Reminder Bot for pull request contributions (#491) · 31dbb471

Wenhao Xie authored May 13, 2025

* [CI] Add Reminder Bot for pull request contributions

* upd

[CI] Add flash_decoding example to CI (#487) · 7b66fb19

徐畅 authored May 13, 2025

* [CI] Add flash_decoding example to CI

* Add output of ref latency

* format example_gqa_decode.py

[Enhancement] Support register input for gemm when trans_a or trans_b is true (#490) · d4f096ef

Lei Wang authored May 13, 2025

* [Refactor] Enhance makeGemmFragmentB to support transposition

* Updated the `makeGemmFragmentB` function to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition.
* Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation.
* Modified the function signature in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter.
* Added a new `matmul_sr` function in the test suite to validate the behavior of the updated fragment generation with transposition support.

* [Refactor] Enhance makeGemmFragmentA and makeGemmFragmentB for transposition support

* Updated the `makeGemmFragmentA` and `makeGemmFragmentB` functions to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition.
* Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation.
* Modified function signatures in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter.
* Added a new `matmul_rs` function in the test suite to validate the behavior of the updated fragment generation with transposition support.

* Improve error messaging in layout equality checks

* Enhanced the error output in layout equality checks to provide clearer context by adding line breaks for better readability in the debug output.
* This change ensures that when layouts are structurally unequal, the current and previous layouts are displayed more distinctly, aiding in debugging.

d4f096ef

12 May, 2025 2 commits

Revert "[Bugfix] Use AutoTune cache_input_tensors properly (#483)" (#488) · 39ae28e4

Lei Wang authored May 12, 2025

This reverts commit 22e6de184fa4b307640b108b779f3d46d132f96c.

39ae28e4

[Bugfix] Use AutoTune cache_input_tensors properly (#483) · a10882e0

yyttt6 authored May 12, 2025

a10882e0

11 May, 2025 2 commits

[Bugfix] Check CUDA target before checking for TMA #482 · fa0fca58

Thien Tran authored May 12, 2025

fa0fca58

[Feature] Fix Device Consistency in Autotuner Threads and Add Manual Profiler Check (#481) · 089cc0a7

yuanjypku authored May 11, 2025



* Fix Device Consistency in Autotuner Threads and Add Manual Profiler Check

* lint fix

* Update example_mla_decode.py

* Update __init__.py

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

089cc0a7

10 May, 2025 7 commits

Update version retrieval in conf.py to read from VERSION file (#478) · 2297af9a

Wenhao Xie authored May 10, 2025

2297af9a

[Refactor] Improve layout equality checks and error messaging (#471) · c2480907

Lei Wang authored May 10, 2025

* [Refactor] Simplify buffer_region_to_tile_region function in copy.py

* Removed redundant logic for handling region extents in the buffer_region_to_tile_region function, streamlining the code for better readability and maintainability.
* Enhanced error handling by focusing on essential checks while eliminating unnecessary complexity related to variable extents.

* [Refactor] Improve layout equality checks and error messaging

* Updated the `IsEqual` method in `FragmentNode` to ensure consistent evaluation of thread ranges.
* Enhanced error messaging in `ParallelOp::InferLayout` to include source buffer information for better debugging.
* Adjusted `ReduceOp::InferLayout` to set thread range during layout condensation, improving layout inference accuracy.

* lintfix

* [Refactor] Rename SetThreadRange to BindThreadRange for clarity

* Updated the `SetThreadRange` method in `FragmentNode` and related classes to `BindThreadRange`, improving method naming consistency and clarity.
* Adjusted all references to the renamed method across the codebase, ensuring proper functionality and maintaining existing behavior.
* Enhanced layout equality checks to handle thread ranges more robustly in `IsEqual` method.
* Updated layout inference methods in `Gemm`, `ParallelOp`, and `ReduceOp` to utilize the new method name, ensuring seamless integration with the updated API.

* [Refactor] Update BindThreadRange usage across layout inference methods

* Modified the implementation of `BindThreadRange` in `FragmentNode` to create a new object instance, enhancing thread range binding functionality.
* Updated all references to `BindThreadRange` in layout inference methods across `Gemm`, `ParallelOp`, and `ReduceOp` to ensure consistency with the new implementation.
* Adjusted the return statements in various layout inference functions to utilize the updated method, maintaining existing behavior while improving clarity.

* lint fix

c2480907
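
The change to `BindThreadRange` described in the commit above (returning a new object rather than modifying the existing one) can be sketched as a Python analogue. The real method is C++ on `FragmentNode`; the class `Fragment`, its fields, and `bind_thread_range` below are hypothetical stand-ins, assuming a thread range can be modelled as a `(min, extent)` pair.

```python
from dataclasses import dataclass, replace
from typing import Optional, Tuple


@dataclass(frozen=True)
class Fragment:
    """Hypothetical stand-in for a fragment layout."""

    layout: Tuple[int, ...]
    thread_range: Optional[Tuple[int, int]] = None  # assumed (min, extent)

    def bind_thread_range(self, thread_range: Tuple[int, int]) -> "Fragment":
        # Return a copy with the range bound; the receiver is left unchanged.
        return replace(self, thread_range=thread_range)


base = Fragment(layout=(4, 8))
bound = base.bind_thread_range((0, 128))
```

Because the receiver is never mutated, callers have to use the returned value, which is consistent with the commit's note about adjusting return statements in the layout inference functions.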

[Refactor] Skip patchelf if not installed (#477) · 273be768

Lei Wang authored May 10, 2025

* [Refactor] Enhance TMA barrier validation and support for additional architectures

* Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
* Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.

* Enhance logging in setup.py and refactor TMA architecture checks in phase.py

* Added logging configuration to setup.py, replacing print statements with logger for better traceability.
* Updated download and extraction functions to use logger for status messages.
* Refactored TMA architecture checks in phase.py to utilize the new `have_tma` function for improved clarity and maintainability.
* Introduced support for additional compute capabilities in nvcc.py, including TMA support checks.

* Update documentation for get_target_compute_version to reflect correct GPU compute capability range

* Refactor have_tma function to accept tvm.target.Target instead of compute_version

* Updated the `have_tma` function in nvcc.py to take a `target` parameter, improving clarity and usability.
* Adjusted calls to `have_tma` in phase.py to pass the target directly, enhancing maintainability and consistency in TMA support checks.

273be768
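
A target-based TMA check of the kind described in the commit above might look like the following sketch. It does not use TVM: `StubTarget` is a hypothetical stand-in for `tvm.target.Target`, and the body of `have_tma` is an assumption based on TMA being available from compute capability 9.0 (Hopper) onward. Only the function name and the idea of passing the target come from the commit message.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class StubTarget:
    """Hypothetical stand-in for tvm.target.Target (only the fields used here)."""

    kind: str  # e.g. "cuda" or "hip"
    arch: str  # e.g. "sm_80", "sm_90a"


# Assumption: TMA requires compute capability major version >= 9.
MIN_TMA_MAJOR = 9


def have_tma(target) -> bool:
    """Return True if the target is a CUDA target whose arch supports TMA."""
    # Check the target kind first so non-CUDA arch strings are never parsed.
    if target.kind != "cuda":
        return False
    # "sm_90a" -> major "9", minor "0"; "sm_100" -> major "10", minor "0".
    match = re.fullmatch(r"sm_(\d+)(\d)[a-z]?", target.arch)
    if match is None:
        return False
    return int(match.group(1)) >= MIN_TMA_MAJOR
```

Taking the whole target rather than a compute version string lets the function reject non-CUDA targets itself, so call sites do not have to repeat that guard.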

[CI] Add Analyzer and blocksparse_attention examples to CI (#472) · 8dec14e0

yyttt6 authored May 10, 2025



* yes

* [Bugfix] fix the unexpected keyword error of autotune

* format

* test

* [CI] Add Analyzer and blocksparse_attention examples to CI

* format

* try

* try

* try

* try

* t

* format

* d

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

8dec14e0

[Refactor] set USE_LLVM to optional. (#476) · 66dba763

Yuxuan Hu authored May 10, 2025

66dba763

[BugFix] Correct argparse for example_convolution test (#474) · 3f25bd1b

Wenhao Xie authored May 10, 2025



* add convolution example to CI

* lint fix

* Update test_example_convolution.py

* fix bug

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

3f25bd1b

[CI] Add Convolution example to CI (#473) · abe170a6

Wenhao Xie authored May 10, 2025



* add convolution example to CI

* lint fix

* Update test_example_convolution.py

---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>

abe170a6