- 23 May, 2025 4 commits
-
-
Taoyu Zhu authored
* fix deepgemm example * fix deepgemm example * make format * Update example_deepgemm_fp8_2xAcc.py --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Yu Cheng authored
* Introduced `example_grouped_gemm_fwd.py` and `example_grouped_gemm_bwd.py` to demonstrate grouped matrix multiplication with forward and backward operations. * Implemented functions for grouped GEMM, input construction, and validation against PyTorch's implementation. * Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions. * Included a test function to validate the functionality with various input scenarios.
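For readers unfamiliar with the operation, a grouped GEMM multiplies each slice of a stacked input by its own weight matrix. The pure-Python reference below is only a sketch of that idea: the row-stacked layout, the per-group weights, and the names `matmul` and `grouped_gemm` are assumptions for illustration, not the code in the example scripts, which validate against PyTorch.

```python
def matmul(a, b):
    """Plain (rows x K) @ (K x N) product on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def grouped_gemm(a, weights, batch_sizes):
    """Multiply each row slice of `a` by the weight matrix of its group.

    a           : rows of all groups stacked together, sum(batch_sizes) x K
    weights     : one K x N matrix per group
    batch_sizes : number of rows of `a` that belong to each group
    """
    if len(weights) != len(batch_sizes) or sum(batch_sizes) != len(a):
        raise ValueError("batch_sizes must match the weights and the rows of a")
    out, start = [], 0
    for size, w in zip(batch_sizes, weights):
        out.extend(matmul(a[start:start + size], w))
        start += size
    return out


a = [[1, 2], [3, 4], [5, 6]]
weights = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]  # identity, then 2 * identity
print(grouped_gemm(a, weights, batch_sizes=[2, 1]))  # [[1, 2], [3, 4], [10, 12]]
```

A reference of this shape is what a kernel's output would be compared against, group by group.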
-
Yu Cheng authored
* Introduced a new example script `example_grouped_gemm.py` demonstrating grouped matrix multiplication using TileLang and PyTorch. * Implemented functions for performing grouped GEMM, constructing inputs, and validating results against PyTorch's implementation. * Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions. * Included a test function to validate the grouped GEMM functionality with various input scenarios.
-
Lei Wang authored
[Refactor] Enhance MergeSharedMemoryAllocations Pass for Improved Liveness Analysis and Scope Management (#508) * Introduced a new StmtAttr structure to track the scope level of statements, enhancing the liveness analysis process. * Updated the UpdateStmtAttr function to manage statement attributes effectively during memory allocation visits. * Modified the VisitStmt_ methods to utilize the new scope level tracking, ensuring accurate memory access patterns. * Refactored the LivenessAnalysis and PlanMemory functions to incorporate statement attributes, improving the handling of gen and kill points in memory management. * Added a new helper function allow_warp_specialized in phase.py to conditionally enable warp specialization based on pass context and target, addressing potential bugs in the MergeSharedMemoryAllocations pass. * Enhanced the OptimizeForTarget function to conditionally apply the MergeSharedMemoryAllocations pass based on warp specialization settings, improving robustness in memory allocation strategies.
-
- 22 May, 2025 3 commits
-
-
Lei Wang authored
* Added a new attribute `kPaddingMap` in `builtin.h` for managing padding annotations. * Enhanced `SafeMemorysRewriter` to utilize an annotated padding map for buffer stores, improving memory access safety. * Implemented checks in `layout_inference.cc` to ensure buffers are correctly referenced during layout mapping. * Introduced a new test file for validating the padding annotation functionality in TileLang.
-
Lei Wang authored
* [Refactor] Enhance GEMM warp partitioning logic for improved performance and flexibility * Updated the warp partitioning logic in `Gemm::ComputeWarpPartition` to better handle various GEMM policies, including FullRow, FullCol, and Square. * Implemented checks to dynamically adjust warp allocation based on matrix dimensions, ensuring optimal performance. * Introduced a new `SelectCopy` template to streamline memory access patterns in CUDA templates, enhancing compatibility across different architectures. * Refactored the Python `GemmWarpPolicy` class to align with the updated C++ logic, improving clarity and maintainability in warp allocation strategies. * [Refactor] Optimize matrix multiplication parameters and performance in quickstart example * Updated thread count in the kernel context from 256 to 128 to enhance performance. * Increased block sizes for matrix dimensions (M, N, block_M, block_N) to 1024 and 128 respectively, improving computational efficiency. * Adjusted the pipeline stages in the GEMM loop from 0 to 3 for better parallel execution. * Cleaned up comments for clarity and corrected a typo in the memory copy comment. * [Refactor] Simplify Copy type selection in OperandTraits for improved clarity * Replaced the conditional Copy type definition with a new SelectCopy template in OperandTraits, enhancing readability and maintainability of the code. * This change streamlines the logic for selecting memory copy patterns based on matrix dimensions and warp configurations.
-
Lei Wang authored
* Modified `makeBufferWithLayout` to include a `var_remap` parameter for improved variable remapping during buffer creation. * Enhanced buffer load and store operations to utilize the new variable remapping logic, ensuring correct buffer references. * Commented out a check in `ThreadExtent` for clarity, maintaining functionality while improving code readability.
-
- 21 May, 2025 1 commit
-
-
Lei Wang authored
[Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling and initialization (#507) * [Refactor] Update reduce functions to support default dimension values and improve dimension handling * Added a helper function `_legalize_dim` to handle negative dimension values in reduction operations. * Updated `reduce_max`, `reduce_min`, `reduce_sum`, `reduce_abssum`, and `reduce_absmax` functions to accept a default dimension value of -1, enhancing usability and flexibility in buffer reduction operations. * Ensured consistent dimension handling across all reduction functions for improved clarity and correctness. * Update submodule `tvm` to latest commit c2921fd, ensuring compatibility with recent changes. * [Refactor] Enhance ReduceOp and JITKernel for improved dimension handling and initialization * Updated ReduceOp to handle 1D reduction cases and ensure correct dimension checks, improving robustness in reduction operations. * Initialized prim_func in JITKernel to enhance clarity and prevent potential null reference issues. * Added whitespace for better code readability in reduce.py.
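The negative-dimension handling described above can be sketched as follows. This is a minimal illustration, assuming the helper simply wraps a negative index around the buffer rank; the name `legalize_dim` and its signature are hypothetical and may differ from the actual `_legalize_dim`.

```python
def legalize_dim(ndim: int, dim: int) -> int:
    """Map a possibly negative axis index onto the range [0, ndim)."""
    if not -ndim <= dim < ndim:
        raise ValueError(f"dim {dim} is out of range for a buffer with {ndim} dimensions")
    return dim + ndim if dim < 0 else dim


# With a default of dim=-1, a reduction targets the last axis of the buffer.
shape = (4, 8, 16)
print(legalize_dim(len(shape), -1))  # 2
```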
-
- 20 May, 2025 4 commits
-
-
Lei Wang authored
* [Refactor] Update GlobalMemChecker to use IRVisitorWithAnalyzer for improved analysis (#505) * Refactored GlobalMemChecker to inherit from IRVisitorWithAnalyzer, enhancing its capabilities for expression analysis. * Updated condition checks to utilize the new analyzer interface, improving clarity and correctness in memory access validation. * Added additional lower bound condition checks to ensure comprehensive validation of memory access indices. * [Refactor] Update GlobalMemChecker to use StmtExprVisitor for improved memory access validation * Refactored GlobalMemChecker to inherit from StmtExprVisitor, enhancing its capabilities for expression analysis. * Updated condition checks to utilize the new analyzer interface, improving clarity and correctness in memory access validation. * Ensured that the analyzer is passed correctly during instantiation, maintaining consistency in condition checks.
-
Lei Wang authored
* Modified the layout creation in makeGemmFragmentB to enhance the order of operations, ensuring the Replicate method is called before Repeat for better readability and performance. * This change improves the logical flow of fragment creation, aligning with best practices for GEMM layout management.
-
Lei Wang authored
* [Refactor] Rename `jit` class to `_JitImplementation` and improve debug path handling * Refactored the `jit` class to `_JitImplementation` for clarity and encapsulation. * Enhanced handling of `debug_root_path` to ensure it is correctly set as an absolute path when provided. * Updated the public `jit` function to serve as a decorator interface, allowing for both default and configured usage. * Added validation to ensure input tensors are contiguous in the Cython wrapper, improving error handling. * [Refactor] Improve formatting and handling in `_JitImplementation` and `jit` function * Refactored the `_JitImplementation` class to enhance readability by adjusting comment formatting and consolidating conditions for setting `debug_root_path`. * Updated the `jit` function signature for better alignment and clarity in parameter definitions. * Ensured consistent spacing and comments throughout the code for improved maintainability. * [Refactor] Update GEMM test parameters for performance optimization * Set num_stages to 0 and adjusted matrix dimensions in the GEMM test function to enhance performance and consistency across tests in test_tilelang_jit_gemm.py. * Reduced the number of threads used in the test to align with the updated configuration, improving overall test efficiency. * [Refactor] Enhance buffer error logging in layout inference * Updated the warning message in layout inference to provide clearer context when a buffer cannot be inferred due to its absence in the use list. This change improves the clarity of error reporting during layout inference operations. * Refactored tensor handling in the Cython wrapper to ensure input tensors are checked for contiguity before processing, enhancing error handling and robustness in tensor management. * bugfix
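A decorator that works both bare and with configuration, as the public `jit` function is described above, is commonly written along these lines. The sketch is an assumption about the general pattern only: `_JitSketch`, the `jit_config` attribute, and the pass-through wrapper are hypothetical and perform no compilation.

```python
import functools
import os
from typing import Optional


class _JitSketch:
    """Hypothetical stand-in for the implementation class; it only stores configuration."""

    def __init__(self, out_idx: Optional[int] = None, debug_root_path: Optional[str] = None):
        self.out_idx = out_idx
        # Store a user-supplied debug path as an absolute path.
        self.debug_root_path = os.path.abspath(debug_root_path) if debug_root_path else None

    def __call__(self, func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            return func(*args, **kwargs)  # a real implementation would compile here

        wrapper.jit_config = self
        return wrapper


def jit(func=None, *, out_idx=None, debug_root_path=None):
    """Usable as `@jit` or as `@jit(out_idx=..., debug_root_path=...)`."""
    impl = _JitSketch(out_idx=out_idx, debug_root_path=debug_root_path)
    if func is not None:  # bare form: the decorated function arrives directly
        return impl(func)
    return impl  # configured form: return the object that will wrap the function


@jit
def add(a, b):
    return a + b


@jit(out_idx=-1, debug_root_path="debug")
def mul(a, b):
    return a * b


print(add(1, 2), mul(2, 3))  # 3 6
```

Making the options keyword-only keeps the bare form unambiguous: a single positional argument can only be the function being decorated.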
-
Zhiwen Mo authored
Co-authored-by: Ubuntu <srguser@srgmi300c.ibcr0fi0qgdu5pqgbnhfbyasxg.parx.internal.cloudapp.net>
-
- 18 May, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Update JIT kernel functions and streamline GEMM tests * Renamed and refactored matmul and run_gemm functions to matmul_kernel_jit and run_gemm_kernel_jit for clarity. * Removed redundant JIT decorator from the matmul function, ensuring it is applied only to the kernel function. * Updated test function names to reflect changes in the kernel functions, enhancing consistency and readability. * Cleaned up commented-out code and unnecessary imports to improve overall code quality. * Update main function call in GEMM test to use tilelang testing framework * Update README and example scripts to include JIT decorator comments * Added comments in README.md and various example scripts to indicate the use of the @tilelang.jit decorator for returning torch functions. * Removed redundant comments that previously instructed to add the decorator, streamlining the documentation and improving clarity. * Update GEMM test parameters for improved performance * Set num_stages to 0 and adjusted matrix dimensions in test functions to enhance performance and consistency across GEMM tests in test_tilelang_kernel_gemm.py.
-
- 17 May, 2025 3 commits
-
-
Lei Wang authored
* [Enhancement] Improve GEMM layout function and documentation * Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies. * Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations. * Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases. * lint fix * [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility * Adjusted the InferLayout method in gemm.cc to include trans_A in fragment creation, enhancing layout inference for transposed matrices. * Updated OperandTraits in gemm_sm89.h and gemm_sm90.h to change the Copy type from SM75_U16x4_LDSM_N to SM75_U16x4_LDSM_T, optimizing memory access patterns for different warp configurations. * Enhanced static assertions in gemm_sm90.h to clarify requirements for num_warp_m, ensuring compatibility with Hopper architecture. * [Refactor] Clean up formatting in GEMM implementation and CUDA templates * Simplified the formatting of the fragment creation in the InferLayout method of gemm.cc for better readability. * Adjusted the static assertion message in gemm_sm90.h to enhance clarity regarding the num_warp_m requirement for Hopper architecture.
-
Lei Wang authored
* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully. * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management. * Add merge shared memory allocations pass and related configurations - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage. - Registered configuration options for debugging and controlling the merging behavior. - Updated relevant files to integrate the new pass into the TileLang engine and transform modules. - Adjusted import paths and added documentation for the new functionality. * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels. * lint fix * Update Copy type in OperandTraits for GEMM templates to use SM75_U16x4_LDSM_T and SM75_U16x8_LDSM_T for improved memory access patterns across CUDA architectures.
-
Lei Wang authored
* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully. * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management. * Add merge shared memory allocations pass and related configurations - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage. - Registered configuration options for debugging and controlling the merging behavior. - Updated relevant files to integrate the new pass into the TileLang engine and transform modules. - Adjusted import paths and added documentation for the new functionality. * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels. * lint fix
-
- 16 May, 2025 3 commits
-
-
Lei Wang authored
* [Enhancement] Improve GEMM layout function and documentation * Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies. * Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations. * Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases. * lint fix
-
Yu Cheng authored
* [Refactor] Update example_mla_decode.py and add tests for block_sparse_attn_tilelang * Refactor example_mla_decode.py to define a main function for better structure and clarity. * Introduce test_example_mla_decode.py to validate the functionality of example_mla_decode. * Refactor block_sparse_attn_tilelang.py to define a main function and add test_block_sparse_attn_tilelang.py for testing. * Ensure all new test files are integrated with tilelang testing framework. * [Test] Enhance test_example_mla_decode with argument mocking * Update test_example_mla_decode.py to mock sys.argv for better test isolation. * Ensure the main function of example_mla_decode is called with the correct arguments during testing.
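Mocking `sys.argv` so that a script's `main` function sees controlled arguments can be done with the standard library alone. The sketch below assumes a hypothetical `main` with made-up `--batch` and `--heads` flags; it does not reproduce the arguments of `example_mla_decode.py`.

```python
import argparse
import sys
from unittest import mock


def main():
    """Stand-in for an example script's entry point (hypothetical arguments)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--batch", type=int, default=1)
    parser.add_argument("--heads", type=int, default=8)
    args = parser.parse_args()  # reads sys.argv[1:]
    return args.batch, args.heads


def test_main_with_mocked_argv():
    # patch.object restores the real sys.argv when the block exits,
    # so the test runner's own arguments never reach the parser.
    with mock.patch.object(sys, "argv", ["example.py", "--batch", "4"]):
        assert main() == (4, 8)


test_main_with_mocked_argv()
```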
-
Lei Wang authored
* Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully. * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management. * Add merge shared memory allocations pass and related configurations - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage. - Registered configuration options for debugging and controlling the merging behavior. - Updated relevant files to integrate the new pass into the TileLang engine and transform modules. - Adjusted import paths and added documentation for the new functionality. * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
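A signal-based timeout of the kind described above might look like the following. This is a sketch under stated assumptions: it relies on `SIGALRM`, so it works only on Unix and only in the main thread, and the signature of `run_with_timeout` here is illustrative rather than the autotuner's actual one.

```python
import signal
import time


class TimeoutException(Exception):
    """Raised when the wrapped call exceeds its time budget."""


def run_with_timeout(func, timeout, *args, **kwargs):
    """Call func(*args, **kwargs), raising TimeoutException after `timeout` seconds."""

    def _handler(signum, frame):
        raise TimeoutException(f"call exceeded {timeout} s")

    previous = signal.signal(signal.SIGALRM, _handler)
    signal.setitimer(signal.ITIMER_REAL, timeout)  # accepts fractional seconds
    try:
        return func(*args, **kwargs)
    finally:
        signal.setitimer(signal.ITIMER_REAL, 0)  # cancel any pending alarm
        signal.signal(signal.SIGALRM, previous)  # restore the earlier handler


try:
    run_with_timeout(time.sleep, 0.05, 1.0)
except TimeoutException as exc:
    print("skipped slow configuration:", exc)
```

Cancelling the timer and restoring the previous handler in `finally` keeps one timed call from leaking an alarm into the next.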
-
- 14 May, 2025 1 commit
-
-
Lei Wang authored
[Refactor] Introduce quantize components of TileLang and add testing for dequant gemm example (#494) * Remove deprecated example_dequant_gemm.py and add DataType import in __init__.py * lint fix * lint fix * Refactor dequantization examples to use tilelang imports and update data type handling in quantization utilities * lint fix
-
- 13 May, 2025 3 commits
-
-
Wenhao Xie authored
* [CI] Add Reminder Bot for pull request contributions * upd
-
徐畅 authored
* [CI] Add flash_decoding example to CI * Add output of ref latency * format example_gqa_decode.py
-
Lei Wang authored
* [Refactor] Enhance makeGemmFragmentB to support transposition * Updated the `makeGemmFragmentB` function to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition. * Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation. * Modified the function signature in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter. * Added a new `matmul_sr` function in the test suite to validate the behavior of the updated fragment generation with transposition support. * [Refactor] Enhance makeGemmFragmentA and makeGemmFragmentB for transposition support * Updated the `makeGemmFragmentA` and `makeGemmFragmentB` functions to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition. * Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation. * Modified function signatures in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter. * Added a new `matmul_rs` function in the test suite to validate the behavior of the updated fragment generation with transposition support. * Improve error messaging in layout equality checks * Enhanced the error output in layout equality checks to provide clearer context by adding line breaks for better readability in the debug output. * This change ensures that when layouts are structurally unequal, the current and previous layouts are displayed more distinctly, aiding in debugging.
-
- 12 May, 2025 2 commits
- 11 May, 2025 2 commits
-
-
Thien Tran authored
-
yuanjypku authored
* Fix Device Consistency in Autotuner Threads and Add Manual Profiler Check * lint fix * Update example_mla_decode.py * Update __init__.py --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 10 May, 2025 7 commits
-
-
Wenhao Xie authored
-
Lei Wang authored
* [Refactor] Simplify buffer_region_to_tile_region function in copy.py * Removed redundant logic for handling region extents in the buffer_region_to_tile_region function, streamlining the code for better readability and maintainability. * Enhanced error handling by focusing on essential checks while eliminating unnecessary complexity related to variable extents. * [Refactor] Improve layout equality checks and error messaging * Updated the `IsEqual` method in `FragmentNode` to ensure consistent evaluation of thread ranges. * Enhanced error messaging in `ParallelOp::InferLayout` to include source buffer information for better debugging. * Adjusted `ReduceOp::InferLayout` to set thread range during layout condensation, improving layout inference accuracy. * lintfix * [Refactor] Rename SetThreadRange to BindThreadRange for clarity * Updated the `SetThreadRange` method in `FragmentNode` and related classes to `BindThreadRange`, improving method naming consistency and clarity. * Adjusted all references to the renamed method across the codebase, ensuring proper functionality and maintaining existing behavior. * Enhanced layout equality checks to handle thread ranges more robustly in `IsEqual` method. * Updated layout inference methods in `Gemm`, `ParallelOp`, and `ReduceOp` to utilize the new method name, ensuring seamless integration with the updated API. * [Refactor] Update BindThreadRange usage across layout inference methods * Modified the implementation of `BindThreadRange` in `FragmentNode` to create a new object instance, enhancing thread range binding functionality. * Updated all references to `BindThreadRange` in layout inference methods across `Gemm`, `ParallelOp`, and `ReduceOp` to ensure consistency with the new implementation. * Adjusted the return statements in various layout inference functions to utilize the updated method, maintaining existing behavior while improving clarity. * lint fix
-
Lei Wang authored
* [Refactor] Enhance TMA barrier validation and support for additional architectures * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`. * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic. * Enhance logging in setup.py and refactor TMA architecture checks in phase.py * Added logging configuration to setup.py, replacing print statements with logger for better traceability. * Updated download and extraction functions to use logger for status messages. * Refactored TMA architecture checks in phase.py to utilize the new `have_tma` function for improved clarity and maintainability. * Introduced support for additional compute capabilities in nvcc.py, including TMA support checks. * Update documentation for get_target_compute_version to reflect correct GPU compute capability range * Refactor have_tma function to accept tvm.target.Target instead of compute_version * Updated the `have_tma` function in nvcc.py to take a `target` parameter, improving clarity and usability. * Adjusted calls to `have_tma` in phase.py to pass the target directly, enhancing maintainability and consistency in TMA support checks.
-
yyttt6 authored
* yes * [Bugfix] fix the unexpected keyword error of autotune * format * test * [CI] Add Analyzer and blocksparse_attention examples to CI * format * try * try * try * try * t * format * d --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Yuxuan Hu authored
-
Wenhao Xie authored
* add convolution example to CI * lint fix * Update test_example_convolution.py * fix bug --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Wenhao Xie authored
* add convolution example to CI * lint fix * Update test_example_convolution.py --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 09 May, 2025 6 commits
-
-
Lei Wang authored
* Removed redundant logic for handling region extents in the buffer_region_to_tile_region function, streamlining the code for better readability and maintainability. * Enhanced error handling by focusing on essential checks while eliminating unnecessary complexity related to variable extents.
-
Lei Wang authored
* Modified the `set_compile_args` method in `AutoTuner` to accept `None` as a valid input for the `out_idx` parameter, enhancing flexibility in argument handling.
-
Zhengju Tang authored
* [Refactor] Enhance TMA barrier validation and support for additional architectures (#463) * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`. * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic. * [CI] Add BlocksparseGemm, Dynamic, and Cast examples to CI. * Lint --------- Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Lei Wang authored
* typo fix * Rename `power_of_int` to `pow_of_int` in math operations and update corresponding Python API reference. Adjusted registration attributes to reflect the new naming convention.
-
Lei Wang authored
* [Refactor] Enhance TMA barrier validation and support for additional architectures (#463) * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`. * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic. * [Feature] Implement fast integer power operation and related API * Added a new math operation `tl.power_of_int` in `math.cc` for efficient integer exponentiation. * Introduced a corresponding Python API `pow_of_int` in `tir/op.py` to facilitate usage in TileLang. * Enhanced `common.h` with a template function for integer power calculations. * Updated documentation to reflect the new functionality and usage examples.
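The commit message does not say which algorithm the fast integer power uses. A common choice for this kind of operation is exponentiation by squaring, sketched below in plain Python as an assumption; the function here only borrows the name `pow_of_int` for illustration and is not the TileLang operator.

```python
def pow_of_int(base, exp: int):
    """Raise `base` to a non-negative integer power using O(log exp) multiplications."""
    if exp < 0:
        raise ValueError("exp must be a non-negative integer")
    result = 1
    while exp:
        if exp & 1:  # the lowest bit of the exponent is set: fold in the current power
            result *= base
        base *= base  # square for the next bit
        exp >>= 1
    return result


print(pow_of_int(3, 5))  # 243
```

For an exponent of 5 (binary 101) the loop performs three squarings and two accumulations, against four multiplications for the naive repeated product; the gap widens as the exponent grows.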
-
Lei Wang authored
* [Refactor] Enhance TMA barrier validation and support for additional architectures (#463) * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`. * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic. * [Refactor] Improve buffer region validation in copy.py * Added handling for variable extents in buffer_region_to_tile_region function to enhance type checking and error handling. * Introduced debug print statements to trace values of region extents and temporary extents during validation. * Updated logic to account for variable extent counts when determining alignment of extents. * [Refactor] Remove debug print statements in buffer_region_to_tile_region function * Eliminated unnecessary print statements that were used for debugging temporary extents and region extents. * Streamlined the code for better readability while maintaining the existing functionality of buffer region validation. * [Refactor] Clean up whitespace in buffer_region_to_tile_region function * Removed an unnecessary blank line in the buffer_region_to_tile_region function to improve code readability and maintain consistency in formatting.
-