1. 26 May, 2025 4 commits
    • Lei Wang's avatar
      [Enhancement] Add commit ID to versioning and improve logging initialization (#524) · 62a8d7f0
      Lei Wang authored
      * Updated `get_tilelang_version` to include an optional commit ID in the version string.
      * Enhanced the `TileLangBuilPydCommand` to write the version with commit ID to the VERSION file during the build process.
      * Introduced a new function `get_git_commit_id` in `version.py` to retrieve the current git commit hash.
      * Refactored logger initialization in `autotuner/__init__.py` to ensure handlers are set up only once, improving performance and clarity.
      * Minor fixes in `flatten_buffer.cc` and `kernel_cache.py` for better handling of versioning and logging.
      62a8d7f0
    • Lei Wang's avatar
      [Refactor] Reorganize Thread Synchronization Steps to make sure global... · 41c51d07
      Lei Wang authored
      [Refactor] Reorganize Thread Synchronization Steps to make sure global synchronization can be correctly lowered (#521)
      
      * [Refactor] Reorganize Thread Synchronization Steps in OptimizeForTarget Function
      
      * Removed redundant thread synchronization steps for "global" and "shared" memory, streamlining the optimization process.
      * Reintroduced necessary synchronization for "shared" and "shared.dyn" after the injection of PTX async copy, ensuring correct memory access patterns.
      * Enhanced overall clarity and maintainability of the OptimizeForTarget function by restructuring the order of operations.
      
      * [Refactor] Reorder Thread Synchronization and PTX Async Copy in OptimizeForTarget Function
      
      * Removed redundant global thread synchronization step and adjusted the order of operations for shared memory synchronization.
      * Ensured that the PTX async copy injection occurs after the global thread sync, improving memory access validity.
      * Enhanced clarity and maintainability of the OptimizeForTarget function by restructuring synchronization steps.
      41c51d07
    • Lei Wang's avatar
      [Enhancement] Add atomicAdd for FLOAT16x2 and FLOAT16x4 (#522) · 46798f25
      Lei Wang authored
      * [Enhancement] Add atomic addition functions for FLOAT16x2 and FLOAT16x4 in CUDA
      
      * Introduced `AtomicAddx2` and `AtomicAddx4` functions for performing atomic addition operations on double-width float types in CUDA.
      * Updated `customize.py` to include the new `atomic_addx4` function for external calls.
      * Modified `__init__.py` to export the new atomic addition function, ensuring accessibility in the module.
      
      * lint fix
      46798f25
    • Lei Wang's avatar
      [Refactor] Replace default fp8 dtype with cute to perform fast cast (#520) · 6addc509
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
      
      * Add global barrier state and expectation handling in CUDA code generation
      
      * Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
      * Updated `Finish` method to declare the global barrier state if needed.
      * Implemented handling for `EvaluateNode` to initialize the barrier expectation.
      * Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
      * Enhanced CUDA FP8 type definitions for better alignment and structure.
      
      * Enhance CUDA FP8 type handling and debug printing
      
      * Updated `cuda_fp8.h` to replace NVidia's FP8 types with Cute's FP8 types for better compatibility and structure.
      * Added specializations for `debug_print_var` and `debug_print_buffer_value` functions to support the new FP8 types, improving debugging capabilities for these data types.
      * Updated `debug.h` to include the new `cuda_fp8.h` header for access to the FP8 type definitions.
      
      * Refactor CUDA code generation to remove unnecessary managed qualifier for global barrier state
      
      * Updated the `Finish` method in `codegen_cuda.cc` to declare the global barrier state without the `__managed__` qualifier, simplifying the declaration.
      * Added a new `sync_global` function in `builtin.py` to synchronize all threads in a block, enhancing synchronization capabilities in the TileLang framework.
      
      * Remove deprecated CUDA kernel and Python script for FP8 E4M3 casting
      
      * Deleted the `cast_to_fp8_e4m3_kernel` CUDA kernel implementation and its corresponding Python script, streamlining the codebase by removing unused components related to FP8 E4M3 type casting.
      * This cleanup enhances maintainability and reduces potential confusion regarding obsolete code.
      
      * lint fix
      6addc509
  2. 25 May, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Support auto synchronization for global memory access (#519) · 623edf4c
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
      
      * Add global barrier state and expectation handling in CUDA code generation
      
      * Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
      * Updated `Finish` method to declare the global barrier state if needed.
      * Implemented handling for `EvaluateNode` to initialize the barrier expectation.
      * Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
      * Enhanced CUDA FP8 type definitions for better alignment and structure.
      623edf4c
  3. 24 May, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Support auto index bitwidth casting (#517) · 6ad73f6f
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
      6ad73f6f
  4. 23 May, 2025 4 commits
    • Taoyu Zhu's avatar
      Fix deepgemm exmaple (#513) · 0d1eab57
      Taoyu Zhu authored
      
      
      * fix deepgemm example
      
      * fix deepgemm example
      
      * make format
      
      * Update example_deepgemm_fp8_2xAcc.py
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      0d1eab57
    • Yu Cheng's avatar
      [Dev] Add grouped GEMM backward example scripts (#515) · de028927
      Yu Cheng authored
      * Introduced `example_grouped_gemm_fwd.py` and `example_grouped_gemm_bwd.py` to demonstrate grouped matrix multiplication with forward and backward operations.
      * Implemented functions for grouped GEMM, input construction, and validation against PyTorch's implementation.
      * Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions.
      * Included a test function to validate the functionality with various input scenarios.
      de028927
    • Yu Cheng's avatar
      [Dev] Add grouped GEMM example with TileLang and PyTorch integration (#514) · fb801940
      Yu Cheng authored
      * Introduced a new example script `example_grouped_gemm.py` demonstrating grouped matrix multiplication using TileLang and PyTorch.
      * Implemented functions for performing grouped GEMM, constructing inputs, and validating results against PyTorch's implementation.
      * Added command-line argument parsing for flexible input configuration, including batch sizes and matrix dimensions.
      * Included a test function to validate the grouped GEMM functionality with various input scenarios.
      fb801940
    • Lei Wang's avatar
      [Refactor] Enhance MergeSharedMemoryAllocations Pass for Improved Liveness... · 0fdefe2b
      Lei Wang authored
      [Refactor] Enhance MergeSharedMemoryAllocations Pass for Improved Liveness Analysis and Scope Management (#508)
      
      * Introduced a new StmtAttr structure to track the scope level of statements, enhancing the liveness analysis process.
      * Updated the UpdateStmtAttr function to manage statement attributes effectively during memory allocation visits.
      * Modified the VisitStmt_ methods to utilize the new scope level tracking, ensuring accurate memory access patterns.
      * Refactored the LivenessAnalysis and PlanMemory functions to incorporate statement attributes, improving the handling of gen and kill points in memory management.
      * Added a new helper function allow_warp_specialized in phase.py to conditionally enable warp specialization based on pass context and target, addressing potential bugs in the MergeSharedMemoryAllocations pass.
      * Enhanced the OptimizeForTarget function to conditionally apply the MergeSharedMemoryAllocations pass based on warp specialization settings, improving robustness in memory allocation strategies.
      0fdefe2b
  5. 22 May, 2025 3 commits
    • Lei Wang's avatar
      [Enhancement] Introduce padding annotation and improve memory access validation (#511) · f23c4d30
      Lei Wang authored
      * Added a new attribute `kPaddingMap` in `builtin.h` for managing padding annotations.
      * Enhanced `SafeMemorysRewriter` to utilize an annotated padding map for buffer stores, improving memory access safety.
      * Implemented checks in `layout_inference.cc` to ensure buffers are correctly referenced during layout mapping.
      * Introduced a new test file for validating the padding annotation functionality in TileLang.
      f23c4d30
    • Lei Wang's avatar
      [Bugfix] Enhance smem copy selector for uncommon shape (#510) · dbe8689f
      Lei Wang authored
      * [Refactor] Enhance GEMM warp partitioning logic for improved performance and flexibility
      
      * Updated the warp partitioning logic in `Gemm::ComputeWarpPartition` to better handle various GEMM policies, including FullRow, FullCol, and Square.
      * Implemented checks to dynamically adjust warp allocation based on matrix dimensions, ensuring optimal performance.
      * Introduced a new `SelectCopy` template to streamline memory access patterns in CUDA templates, enhancing compatibility across different architectures.
      * Refactored the Python `GemmWarpPolicy` class to align with the updated C++ logic, improving clarity and maintainability in warp allocation strategies.
      
      * [Refactor] Optimize matrix multiplication parameters and performance in quickstart example
      
      * Updated thread count in the kernel context from 256 to 128 to enhance performance.
      * Increased block sizes for matrix dimensions (M, N, block_M, block_N) to 1024 and 128 respectively, improving computational efficiency.
      * Adjusted the pipeline stages in the GEMM loop from 0 to 3 for better parallel execution.
      * Cleaned up comments for clarity and corrected a typo in the memory copy comment.
      
      * [Refactor] Simplify Copy type selection in OperandTraits for improved clarity
      
      * Replaced the conditional Copy type definition with a new SelectCopy template in OperandTraits, enhancing readability and maintainability of the code.
      * This change streamlines the logic for selecting memory copy patterns based on matrix dimensions and warp configurations.
      dbe8689f
    • Lei Wang's avatar
      [Refactor] Update buffer handling in layout transformation functions (#509) · 094796b6
      Lei Wang authored
      * Modified `makeBufferWithLayout` to include a `var_remap` parameter for improved variable remapping during buffer creation.
      * Enhanced buffer load and store operations to utilize the new variable remapping logic, ensuring correct buffer references.
      * Commented out a check in `ThreadExtent` for clarity, maintaining functionality while improving code readability.
      094796b6
  6. 21 May, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling... · 41d4988b
      Lei Wang authored
      [Enhancement] Enhance ReduceOp and JITKernel for improved dimension handling and initialization (#507)
      
      * [Refactor] Update reduce functions to support default dimension values and improve dimension handling
      
      * Added a helper function `_legalize_dim` to handle negative dimension values in reduction operations.
      * Updated `reduce_max`, `reduce_min`, `reduce_sum`, `reduce_abssum`, and `reduce_absmax` functions to accept a default dimension value of -1, enhancing usability and flexibility in buffer reduction operations.
      * Ensured consistent dimension handling across all reduction functions for improved clarity and correctness.
      
      * Update submodule `tvm` to latest commit c2921fd, ensuring compatibility with recent changes.
      
      * [Refactor] Enhance ReduceOp and JITKernel for improved dimension handling and initialization
      
      * Updated ReduceOp to handle 1D reduction cases and ensure correct dimension checks, improving robustness in reduction operations.
      * Initialized prim_func in JITKernel to enhance clarity and prevent potential null reference issues.
      * Added whitespace for better code readability in reduce.py.
      41d4988b
  7. 20 May, 2025 4 commits
    • Lei Wang's avatar
      [Refactor] Update GlobalMemChecker to Detect Lower Bound illegal memory access automatically (#505) · 84ddb9e1
      Lei Wang authored
      * [Refactor] Update GlobalMemChecker to use IRVisitorWithAnalyzer for improved analysis (#505)
      
      * Refactored GlobalMemChecker to inherit from IRVisitorWithAnalyzer, enhancing its capabilities for expression analysis.
      * Updated condition checks to utilize the new analyzer interface, improving clarity and correctness in memory access validation.
      * Added additional lower bound condition checks to ensure comprehensive validation of memory access indices.
      
      * [Refactor] Update GlobalMemChecker to use StmtExprVisitor for improved memory access validation
      
      * Refactored GlobalMemChecker to inherit from StmtExprVisitor, enhancing its capabilities for expression analysis.
      * Updated condition checks to utilize the new analyzer interface, improving clarity and correctness in memory access validation.
      * Ensured that the analyzer is passed correctly during instantiation, maintaining consistency in condition checks.
      84ddb9e1
    • Lei Wang's avatar
      [Refactor] Adjust GEMM fragment layout for improved clarity and performance (#504) · c59e1aab
      Lei Wang authored
      * Modified the layout creation in makeGemmFragmentB to enhance the order of operations, ensuring the Replicate method is called before Repeat for better readability and performance.
      * This change improves the logical flow of fragment creation, aligning with best practices for GEMM layout management.
      c59e1aab
    • Lei Wang's avatar
      [Refactor] Refactor `jit` to `_JitImplementation` to support `@tilelang.jit` (#502) · 8c8d8ca2
      Lei Wang authored
      * [Refactor] Rename `jit` class to `_JitImplementation` and improve debug path handling
      
      * Refactored the `jit` class to `_JitImplementation` for clarity and encapsulation.
      * Enhanced handling of `debug_root_path` to ensure it is correctly set as an absolute path when provided.
      * Updated the public `jit` function to serve as a decorator interface, allowing for both default and configured usage.
      * Added validation to ensure input tensors are contiguous in the Cython wrapper, improving error handling.
      
      * [Refactor] Improve formatting and handling in `_JitImplementation` and `jit` function
      
      * Refactored the `_JitImplementation` class to enhance readability by adjusting comment formatting and consolidating conditions for setting `debug_root_path`.
      * Updated the `jit` function signature for better alignment and clarity in parameter definitions.
      * Ensured consistent spacing and comments throughout the code for improved maintainability.
      
      * [Refactor] Update GEMM test parameters for performance optimization
      
      * Set num_stages to 0 and adjusted matrix dimensions in the GEMM test function to enhance performance and consistency across tests in test_tilelang_jit_gemm.py.
      * Reduced the number of threads used in the test to align with the updated configuration, improving overall test efficiency.
      
      * [Refactor] Enhance buffer error logging in layout inference
      
      * Updated the warning message in layout inference to provide clearer context when a buffer cannot be inferred due to its absence in the use list. This change improves the clarity of error reporting during layout inference operations.
      * Refactored tensor handling in the Cython wrapper to ensure input tensors are checked for contiguity before processing, enhancing error handling and robustness in tensor management.
      
      * bugfix
      8c8d8ca2
    • Zhiwen Mo's avatar
  8. 18 May, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] refactor `tilelang.jit` to support a faster and more flexible kernel cache (#501) · 25a50f1a
      Lei Wang authored
      * [Refactor] Update JIT kernel functions and streamline GEMM tests
      
      * Renamed and refactored matmul and run_gemm functions to matmul_kernel_jit and run_gemm_kernel_jit for clarity.
      * Removed redundant JIT decorator from the matmul function, ensuring it is applied only to the kernel function.
      * Updated test function names to reflect changes in the kernel functions, enhancing consistency and readability.
      * Cleaned up commented-out code and unnecessary imports to improve overall code quality.
      
      * Update main function call in GEMM test to use tilelang testing framework
      
      * Update README and example scripts to include JIT decorator comments
      
      * Added comments in README.md and various example scripts to indicate the use of the @tilelang.jit decorator for returning torch functions.
      * Removed redundant comments that previously instructed to add the decorator, streamlining the documentation and improving clarity.
      
      * Update GEMM test parameters for improved performance
      
      * Set num_stages to 0 and adjusted matrix dimensions in test functions to enhance performance and consistency across GEMM tests in test_tilelang_kernel_gemm.py.
      25a50f1a
  9. 17 May, 2025 3 commits
    • Lei Wang's avatar
      [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility (#500) · 33937683
      Lei Wang authored
      * [Enhancement] Improve GEMM layout function and documentation
      
      * Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies.
      * Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations.
      * Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases.
      
      * lint fix
      
      * [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility
      
      * Adjusted the InferLayout method in gemm.cc to include trans_A in fragment creation, enhancing layout inference for transposed matrices.
      * Updated OperandTraits in gemm_sm89.h and gemm_sm90.h to change the Copy type from SM75_U16x4_LDSM_N to SM75_U16x4_LDSM_T, optimizing memory access patterns for different warp configurations.
      * Enhanced static assertions in gemm_sm90.h to clarify requirements for num_warp_m, ensuring compatibility with Hopper architecture.
      
      * [Refactor] Clean up formatting in GEMM implementation and CUDA templates
      
      * Simplified the formatting of the fragment creation in the InferLayout method of gemm.cc for better readability.
      * Adjusted the static assertion message in gemm_sm90.h to enhance clarity regarding the num_warp_m requirement for Hopper architecture.
      33937683
    • Lei Wang's avatar
      [Bugfix] Rename SM75_U16x8_LDSM_N into SM75_U16x8_LDSM_T for correctness (#499) · 2837878f
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      
      * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.
      
      * lint fix
      
      * Update Copy type in OperandTraits for GEMM templates to use SM75_U16x4_LDSM_T and SM75_U16x8_LDSM_T for improved memory access patterns across CUDA architectures.
      2837878f
    • Lei Wang's avatar
      [Enhancement] Fallback transposed_ldmatrix into `SM75_U16x4_LDSM_N` when warp_n is 8 (#498) · 68a3c4f3
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      
      * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.
      
      * lint fix
      68a3c4f3
  10. 16 May, 2025 3 commits
    • Lei Wang's avatar
      [Bugfix] Fix Hopper GEMM layout for small tile size (#497) · c93e8695
      Lei Wang authored
      * [Enhancement] Improve GEMM layout function and documentation
      
      * Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies.
      * Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations.
      * Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases.
      
      * lint fix
      c93e8695
    • Yu Cheng's avatar
      [Refactor] Update main function structure in example scripts and add tests (#475) · 73ae8087
      Yu Cheng authored
      * [Refactor] Update example_mla_decode.py and add tests for block_sparse_attn_tilelang
      
      * Refactor example_mla_decode.py to define a main function for better structure and clarity.
      * Introduce test_example_mla_decode.py to validate the functionality of example_mla_decode.
      * Refactor block_sparse_attn_tilelang.py to define a main function and add test_block_sparse_attn_tilelang.py for testing.
      * Ensure all new test files are integrated with tilelang testing framework.
      
      * [Test] Enhance test_example_mla_decode with argument mocking
      
      * Update test_example_mla_decode.py to mock sys.argv for better test isolation.
      * Ensure the main function of example_mla_decode is called with the correct arguments during testing.
      73ae8087
    • Lei Wang's avatar
      [Enhancement] Introduce flag to visualize shared memory merge plan (#496) · dca2fb48
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      dca2fb48
  11. 14 May, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Introduce quantize components of TileLang and add testing for... · cde1886f
      Lei Wang authored
      [Refactor] Introduce quantize components of TileLang and add testing for dequant gemm exmaple (#494)
      
      * Remove deprecated example_dequant_gemm.py and add DataType import in __init__.py
      
      * lint fix
      
      * lint fix
      
      * Refactor dequantization examples to use tilelang imports and update data type handling in quantization utilities
      
      * lint fix
      cde1886f
  12. 13 May, 2025 3 commits
    • Wenhao Xie's avatar
      [CI] Add Reminder Bot for pull request contributions (#491) · 31dbb471
      Wenhao Xie authored
      * [CI] Add Reminder Bot for pull request contributions
      
      * upd
      31dbb471
    • 徐畅's avatar
      [CI] Add flash_decoding example to CI (#487) · 7b66fb19
      徐畅 authored
      * [CI] Add flash_decoding example to CI
      
      * Add output of ref latency
      
      * format example_gqa_decode.py
      7b66fb19
    • Lei Wang's avatar
      [Enhancement] Support register input for gemm when trans_a or trans_b is true (#490) · d4f096ef
      Lei Wang authored
      * [Refactor] Enhance makeGemmFragmentB to support transposition
      
      * Updated the `makeGemmFragmentB` function to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition.
      * Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation.
      * Modified the function signature in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter.
      * Added a new `matmul_sr` function in the test suite to validate the behavior of the updated fragment generation with transposition support.
      
      * [Refactor] Enhance makeGemmFragmentA and makeGemmFragmentB for transposition support
      
      * Updated the `makeGemmFragmentA` and `makeGemmFragmentB` functions to include a `transposed` parameter, allowing for flexible layout generation based on matrix transposition.
      * Adjusted layout calculations for both transposed and non-transposed cases to ensure correct fragment generation.
      * Modified function signatures in `layout.h` and updated all relevant calls in `gemm.cc` to accommodate the new parameter.
      * Added a new `matmul_rs` function in the test suite to validate the behavior of the updated fragment generation with transposition support.
      *
      
      * Improve error messaging in layout equality checks
      
      * Enhanced the error output in layout equality checks to provide clearer context by adding line breaks for better readability in the debug output.
      * This change ensures that when layouts are structurally unequal, the current and previous layouts are displayed more distinctly, aiding in debugging.
      d4f096ef
  13. 12 May, 2025 2 commits
  14. 11 May, 2025 2 commits
  15. 10 May, 2025 7 commits
    • Wenhao Xie's avatar
    • Lei Wang's avatar
      [Refactor] Improve layout equality checks and error messaging (#471) · c2480907
      Lei Wang authored
      * [Refactor] Simplify buffer_region_to_tile_region function in copy.py
      
      * Removed redundant logic for handling region extents in the buffer_region_to_tile_region function, streamlining the code for better readability and maintainability.
      * Enhanced error handling by focusing on essential checks while eliminating unnecessary complexity related to variable extents.
      
      * [Refactor] Improve layout equality checks and error messaging
      
      * Updated the `IsEqual` method in `FragmentNode` to ensure consistent evaluation of thread ranges.
      * Enhanced error messaging in `ParallelOp::InferLayout` to include source buffer information for better debugging.
      * Adjusted `ReduceOp::InferLayout` to set thread range during layout condensation, improving layout inference accuracy.
      
      * lintfix
      
      * [Refactor] Rename SetThreadRange to BindThreadRange for clarity
      
      * Updated the `SetThreadRange` method in `FragmentNode` and related classes to `BindThreadRange`, improving method naming consistency and clarity.
      * Adjusted all references to the renamed method across the codebase, ensuring proper functionality and maintaining existing behavior.
      * Enhanced layout equality checks to handle thread ranges more robustly in `IsEqual` method.
      * Updated layout inference methods in `Gemm`, `ParallelOp`, and `ReduceOp` to utilize the new method name, ensuring seamless integration with the updated API.
      
      * [Refactor] Update BindThreadRange usage across layout inference methods
      
      * Modified the implementation of `BindThreadRange` in `FragmentNode` to create a new object instance, enhancing thread range binding functionality.
      * Updated all references to `BindThreadRange` in layout inference methods across `Gemm`, `ParallelOp`, and `ReduceOp` to ensure consistency with the new implementation.
      * Adjusted the return statements in various layout inference functions to utilize the updated method, maintaining existing behavior while improving clarity.
      
      * lint fix
      c2480907
    • Lei Wang's avatar
      [Refactor] Skip patchelf if not installed (#477) · 273be768
      Lei Wang authored
      * [Refactor] Enhance TMA barrier validation and support for additional architectures
      
      * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
      * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.
      
      * Enhance logging in setup.py and refactor TMA architecture checks in phase.py
      
      * Added logging configuration to setup.py, replacing print statements with logger for better traceability.
      * Updated download and extraction functions to use logger for status messages.
      * Refactored TMA architecture checks in phase.py to utilize the new `have_tma` function for improved clarity and maintainability.
      * Introduced support for additional compute capabilities in nvcc.py, including TMA support checks.
      
      * Update documentation for get_target_compute_version to reflect correct GPU compute capability range
      
      * Refactor have_tma function to accept tvm.target.Target instead of compute_version
      
      * Updated the `have_tma` function in nvcc.py to take a `target` parameter, improving clarity and usability.
      * Adjusted calls to `have_tma` in phase.py to pass the target directly, enhancing maintainability and consistency in TMA support checks.
      273be768
    • yyttt6's avatar
      [CI] Add Analyzer and blocksparse_attention examples to CI (#472) · 8dec14e0
      yyttt6 authored
      
      
      * yes
      
      * [Bugfix] fix the unexpected keyword error of autotune
      
      * format
      
      * test
      
      * [CI] Add Analyzer and blocksparse_attention examples to CI
      
      * format
      
      * try
      
      * try
      
      * try
      
      * try
      
      * t
      
      * format
      
      * d
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      8dec14e0
    • Yuxuan Hu's avatar
      [Refactor] set USE_LLVM to optional. (#476) · 66dba763
      Yuxuan Hu authored
      66dba763
    • Wenhao Xie's avatar
      [BugFix] Correct argparse for example_convolution test (#474) · 3f25bd1b
      Wenhao Xie authored
      
      
      * add convolution example to CI
      
      * lint fix
      
      * Update test_example_convolution.py
      
      * fix bug
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      3f25bd1b
    • Wenhao Xie's avatar
      [CI] Add Convolution example to CI (#473) · abe170a6
      Wenhao Xie authored
      
      
      * add convolution example to CI
      
      * lint fix
      
      * Update test_example_convolution.py
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      abe170a6