1. 14 Aug, 2025 1 commit
  2. 11 Aug, 2025 2 commits
    • [Enhancement] Add eviction policy support for TMA operations, enhance CUDA codegen, and introduce new pass config (#690) · 6664d170
      Wenhao Xie authored
      
      * Enhance TMA and barrier handling in CUDA code generation
      
      - Updated `CodeGenTileLangCUDA` to support eviction policies for TMA operations, allowing for more flexible memory management.
      - Introduced a new `CacheHintSm90` enum to define eviction strategies in `copy_sm90.h`.
      - Modified TMA load/store functions to accept eviction policies, improving performance on different architectures.
      - Enhanced `TmaBarrierCollector` and `TmaBarrierRewriter` to account for SIMT copies, ensuring correct barrier insertion.
      - Refactored thread synchronization logic to utilize barrier IDs, improving the efficiency of partial thread synchronization.
      - Updated the Python interface for `copy` and `c2d_im2col` to accept optional eviction-policy parameters, enhancing usability (a usage sketch follows at the end of this commit message).
      
      * update shuffle and elect optimization
      
      * fix bug
      
      * fix bug
      
      * fix potential bug
      
      * lint fix
      
      * lint fix
      
      * update shuffle_elect template
      
      * fix bug
      
      * fix bug
      
      * fix template
      
      * lint and fix
      
      * fix typo
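
      A minimal sketch of the new Python-side parameter described above. The policy values are assumed to mirror the `CacheHintSm90` strategies (`evict_normal`, `evict_first`, `evict_last`); the commit only confirms that `copy` gained an optional eviction-policy argument.

      ```python
      import tilelang.language as T


      @T.prim_func
      def tiled_copy(A: T.Tensor((1024, 1024), "float16"),
                     B: T.Tensor((1024, 1024), "float16")):
          with T.Kernel(8, 8, threads=128) as (bx, by):
              A_shared = T.alloc_shared((128, 128), "float16")
              # Streaming data that will not be reused: hint the cache to
              # evict it first (policy string assumed).
              T.copy(A[by * 128, bx * 128], A_shared, eviction_policy="evict_first")
              T.copy(A_shared, B[by * 128, bx * 128])
      ```
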
    • [Feat] Support mma gemm with stride (#701) · fe70549f
      FeiyangChen authored

      * gemm_with_stride sm89
      
      * fix offset issue
      
      * bug fix
      
      * format
      
      * sm80 support
      
      * add sm90
      
      * add testing
      
      * format
      
      * add static_assert for wgmma
      
      * Enhance error message for inner_box_dim validation in LowerBulkCopy
      
      * lint fix
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  3. 04 Aug, 2025 1 commit
  4. 20 Jul, 2025 1 commit
  5. 16 Jul, 2025 1 commit
    • [Warp Specialize] Implicit Warp Specialize Programming Model (#605) · e2d25ba8
      Lei Wang authored
      * [Enhancement] Improve memory access condition checks in GlobalMemChecker
      
      - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations.
      - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis.
      
      * lintfix
      
      * [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy
      
      - Implemented checks to ensure that the shared memory range and global range are legal during the bulk copy operation.
      - Added assertions to validate that the extents of global and shared ranges match, improving the robustness of memory access validation in the LowerBulkCopy function.
      
      * [Refactor] Update barrier and clear operations in warp specialization examples
      
      - Replaced `mbarrier_wait_parity` and `mbarrier_arrive` with `barrier_wait` and `barrier_arrive` for improved clarity and consistency in synchronization.
      - Adjusted the order of `clear` operations for local fragments in `example_warp_specialize_gemm_copy_1_gemm_0` to enhance parallel execution efficiency.
      
      * [Enhancement] Implement thread partial synchronization and improve shared memory allocation handling
      
      - Added support for thread partial barrier synchronization in CUDA, allowing for more flexible thread management.
      - Enhanced the `MergeSharedMemoryAllocations` function to accept alignment bytes, improving memory allocation efficiency based on target requirements.
      - Updated the `Lower` methods in `Copy` and `Fill` classes to include conditional predicates for thread execution, ensuring better control over thread behavior.
      - Refactored the `print` function to include warp group and warp IDs for more detailed debugging output.
      - Improved the handling of dynamic shared memory allocations in the `LowerAndLegalize` function to align with target-specific requirements.
      
      * [Enhancement] Add support for disabling TMA in Copy operations
      
      - Introduced a new `disable_tma` parameter in the `Copy` class to control thread memory access behavior.
      - Updated the `Lower` method to conditionally execute bulk copy operations based on the `disable_tma` flag.
      - Enhanced the `copy` function to accept the `disable_tma` argument, allowing for more flexible memory copy operations.
      - Improved handling of `coalesced_width` to ensure it defaults to -1 when not provided, enhancing robustness in memory operations.
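
      A hedged sketch of the `disable_tma` escape hatch described above, forcing the SIMT copy path even where a bulk TMA copy would be legal:

      ```python
      import tilelang.language as T


      @T.prim_func
      def simt_copy(A: T.Tensor((256, 256), "float16"),
                    B: T.Tensor((256, 256), "float16")):
          with T.Kernel(2, 2, threads=128) as (bx, by):
              smem = T.alloc_shared((128, 128), "float16")
              # disable_tma=True lowers to a per-thread copy instead of TMA.
              T.copy(A[by * 128, bx * 128], smem, disable_tma=True)
              T.copy(smem, B[by * 128, bx * 128])
      ```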
      
      * [Refactor] Clean up whitespace and formatting in multiple files
      
      - Removed unnecessary blank lines and adjusted line breaks for improved code readability in `example_mla_decode.py`, `example_warp_specialize_gemm_copy_gemm_0_1.py`, `phase.py`, and `copy.py`.
      - Ensured consistent formatting across functions to enhance maintainability and clarity of the codebase.
      
      * [Enhancement] Refactor flash attention implementation for improved performance and configurability
      
      - Split the shared memory allocations for query and key-value pairs to optimize memory usage.
      - Introduced command-line arguments for batch size, number of heads, and dimensions, enhancing flexibility in running the example.
      - Updated kernel execution parameters to improve thread management and synchronization.
      - Enhanced the overall structure of the flash attention function for better readability and maintainability.
      
      * fix
      
      * Update layout inference in ParallelOp to account for thread bounds; remove debug print in OptimizeForTarget
      
      * Refactor barrier handling and update example configurations
      
      - Replaced commented-out barrier creation with new barrier allocation in GEMM example.
      - Updated kernel configuration in warp specialization example to include async copy settings.
      - Enhanced barrier management in the phase optimization process to improve synchronization handling.
      - Introduced new barrier allocation function for better memory management in shared contexts.
      
      * Refactor barrier handling in LowerAndLegalize and OptimizeForTarget
      
      - Reintroduced barrier lowering in OptimizeForTarget to enhance synchronization.
      - Removed commented-out barrier lowering in LowerAndLegalize for cleaner code.
      - Added exit() call in OptimizeForTarget to halt execution after barrier lowering.
      
      * Enhance CMake configuration and clean up example scripts
      
      - Enabled compile command export in CMakeLists.txt for better build integration.
      - Removed unnecessary print statement in the warp specialization example.
      - Cleaned up commented-out code in GEMM example for improved readability.
      - Updated barrier handling in shared memory allocation transformations for better synchronization.
      
      * Refactor barrier handling in warp specialization examples
      
      - Replaced commented-out mbarrier code with new barrier allocation using T.alloc_barrier for improved synchronization.
      - Updated barrier wait and arrive calls to align with the new allocation method across multiple example scripts.
      - Enhanced code readability by removing unnecessary comments and ensuring consistent barrier management.
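
      The new barrier primitives named above, in a producer/consumer shape. Only the function names come from this commit; the arrive-count argument, the parity argument, and the `T.get_thread_binding` usage are assumptions:

      ```python
      import tilelang.language as T


      @T.prim_func
      def producer_consumer(A: T.Tensor((128, 128), "float16"),
                            B: T.Tensor((128, 128), "float16")):
          with T.Kernel(1, threads=256) as bx:
              smem = T.alloc_shared((128, 128), "float16")
              bar = T.alloc_barrier(128)        # arrive count assumed
              tx = T.get_thread_binding()
              if tx < 128:                      # producer half fills smem
                  T.copy(A, smem)
                  T.barrier_arrive(bar)
              else:                             # consumer half waits, then reads
                  T.barrier_wait(bar, 0)        # parity 0 on first use
                  T.copy(smem, B)
      ```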
      
      * Update lower_shared_barrier.cc
      
      * Update phase.py
      
      * Update warp specialization example and Cython wrapper
      
      - Removed commented-out pass configuration options in the warp specialization example for clarity.
      - Added functionality to write the generated kernel source to a file named "kernel.cu".
      - Enhanced Cython wrapper to support boolean type conversion for improved type handling.
      
      * Add storage synchronization call in shared barrier transformation
      
      - Introduced a new evaluation statement to call the TVM storage sync function with "shared" as an argument, enhancing synchronization in the shared barrier handling process.
      
      * remove debug files
      
      * Remove kernel source output to file in warp specialization example
      
      * remove comments
      
      * Refactor tensor handling and update test execution in TileLang
      
      - Changed `Buffer` to `Tensor` in `customize.py` for better type consistency.
      - Updated `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to use `tir.BufferLoad` instead of `BufferLoad`.
      - Commented out the main testing function in `test_tilelang_language_reshape.py` and replaced it with a direct call to `run_reshape_smem` for streamlined testing.
      - Removed unnecessary NVCC compiler flags in `libgen.py` to reduce verbosity.
      
      * Update test_tilelang_language_reshape.py
  6. 15 Jul, 2025 1 commit
    • [Pass][Simplify] Introduce symbolic-level simplify for condition expressions (#634) · 02a0cf59
      Lei Wang authored
      * [Enhancement] Add argument simplification option to StmtSimplifier
      
      - Introduced a new `simplify_arguments` flag in the `StmtSimplifier::Apply` method to control argument simplification behavior.
      - Updated the `Simplify` function to accept the new flag, allowing for enhanced flexibility in the simplification process.
      - Adjusted the `LowerAndLegalize` and `_Simplify` functions to utilize the new argument, ensuring consistent behavior across the codebase.
      - Added comments to clarify the purpose of the new flag and its impact on simplification logic.
      
      * lint fix
      
      * [Enhancement] Improve layout inference and reduce operation handling
      
      - Updated `ParallelOp::InferLayout` to check for pure buffer stores, enhancing layout inference logic.
      - Modified `ReduceOp::Lower` to include all threads in the AllReduce operation, improving performance on specific architectures.
      - Added a TODO comment in `AllReduce` to consider merging synchronization barriers for optimization.
      
      * lint fix
      
      * [Enhancement] Add input validation for GEMM parameters
      
      - Introduced checks to ensure that the dimensions M and N are divisible by their respective warp sizes (kMPerWarp and kNPerWarp) in the Gemm::ComputeWarpPartition method.
      - Added informative error messages to assist in debugging when the input parameters do not meet the required conditions.
      
      * bug fix
  7. 03 Jul, 2025 1 commit
    • [Experimental][Language] add `T.GEMM_SP` for sm90 sparse tensor core (#526) · be44758c
      botbw authored

      * [experimental] add a draft gemm_sp
      
      * [3rdparty] bump cutlass to v3.9.3
      
      * [lint] run format.sh
      
      * [chore] rebase
      
      * [chore] use abs path
      
      * [gemm_sp] add metadata layout
      
      * [ci] add more example
      
      * [lint] run format.sh
      
      * [chore] polish
      
      * [chore] move gemm_sp to experimental
      
      * [chore] polish
      
      * [lint] run format.sh
      
      * [Enhancement] Improve bulk copy handling and update GEMM sparse tensor test
      
      * Added a warning log for unsupported non-swizzled global layouts in the bulk copy operation, ensuring fallback to normal copy.
      * Refactored the GEMM sparse tensor test by removing unnecessary imports and simplifying the kernel compilation process.
      * Updated the test to directly call the `run_gemm_sp` function, enhancing clarity and functionality.
      
      * Implement Test
      
      * [Enhancement] Update GEMM SP and SM89 templates for improved functionality
      
      * Refactored GEMM SP computation to enhance warp partitioning logic, ensuring compatibility with Hopper architecture.
      * Updated layout inference to support new WGMMA conditions and improved error messaging for unsupported targets.
      * Modified SM89 templates to utilize new MMA atom structures, enhancing performance and compatibility with fp8 types.
      * Added conditional inclusion for GEMM SP header based on CUDA architecture version.
      
      * lint fix
      
      * [gemm_sp] support more layout and data types
      
      * Enhancement: sync T.gemm_sp's layout inference with T.gemm
      
      * Enhancement: support more block_k in compress util
      
      * [Enhancement] enable block_k=64
      
      * [Lint] run format.sh
      
      * [Enhancement] compressor support more dtype
      
      * Enhancement: enable block_K=32
      
      * [Lint] format.sh
      
      * [Fixbug] fix shape
      
      * Refactor: sync gemm
      
      * [Enhancement] enable transpose
      
      * [Enhancement] enable fp8_e4m3
      
      * [Enhancement] enable int8
      
      * [Lint] run format.sh
      
      * [Benchmark] add gemm_sp benchmark
      
      * [Example] fix 256 threads hang
      
      * [CI] fix ci
      
      * [Chore] resolve gemini feedback
      
      * [Benchmark] increase search space
      
      * [Lint] format
      
      * [CI] skip sparse tensor core related tests as only sm90 is supported
      
      * [CI] pass local run
      
      * Update gemm_sm89.h
      
      * lint fix
      
      * lint fix
      
      * [Enhancement] Add support for sparse GEMM and initialize CUDA architecture flags
      
      - Introduced a new boolean flag `enable_sparse_gemm_` to control the inclusion of sparse GEMM functionality in CUDA code generation.
      - Updated the `Finish` method to conditionally include the sparse GEMM header based on the new flag.
      - Implemented logic in `VisitStmt_` to enable sparse GEMM when the corresponding external call is detected.
      - Added a function to initialize the `TORCH_CUDA_ARCH_LIST` environment variable based on the target compute version, enhancing compatibility with PyTorch.
      - Refactored the initialization function into the appropriate module and ensured it is called in the sparse utilities module.
      
      * Update test_compress_utils.py
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
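
      What a call to the experimental sparse GEMM plausibly looks like, given the pieces above (compressed A plus metadata from the compress utility, sm90 only). The shapes, the metadata dtype/layout, and the exact `T.gemm_sp` signature here are all assumptions:

      ```python
      import tilelang.language as T


      @T.prim_func
      def gemm_sp(A_sp: T.Tensor((128, 32), "float16"),  # 2:4-compressed A (K halved)
                  E: T.Tensor((128, 4), "uint8"),         # sparsity metadata (layout assumed)
                  B: T.Tensor((64, 128), "float16"),
                  C: T.Tensor((128, 128), "float32")):
          with T.Kernel(1, 1, threads=128) as (bx, by):
              A_s = T.alloc_shared((128, 32), "float16")
              E_s = T.alloc_shared((128, 4), "uint8")
              B_s = T.alloc_shared((64, 128), "float16")
              C_f = T.alloc_fragment((128, 128), "float32")
              T.copy(A_sp, A_s)
              T.copy(E, E_s)
              T.copy(B, B_s)
              T.clear(C_f)
              T.gemm_sp(A_s, E_s, B_s, C_f)   # sparse tensor-core MMA (sm90 only)
              T.copy(C_f, C)
      ```
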
  8. 27 Jun, 2025 2 commits
  9. 16 Jun, 2025 1 commit
    • [Refactor] Phase out tf32 Casting from GEMM Templates (#573) · 9ba8b480
      Lei Wang authored
      * [Feature] Add Quarter Bank Swizzle Layout and Update GEMM Layout Logic
      
      - Introduced a new `makeQuarterBankSwizzleLayout` function for layout swizzling of 32 bytes.
      - Updated `makeGemmABLayout` to include an `enable_padding` parameter, allowing for conditional layout selection between padded and quarter bank swizzle layouts.
      - Adjusted layout inference in GEMM operations to utilize the new quarter bank swizzle layout when appropriate.
      - Enhanced bulk copy operations to recognize and handle the new layout type, improving memory access patterns.
      
      * lint fix
      
      * [Refactor] Update GEMM Layout Functions and Inference Logic
      
      - Removed the `enable_padding` parameter from `makeGemmABLayout` to simplify its signature.
      - Introduced `makeGemmABLayoutHopper` for enhanced layout handling specific to Hopper architecture.
      - Updated layout inference in GEMM operations to utilize the new `makeGemmABLayoutHopper` function, improving clarity and maintainability in layout selection.
      - Adjusted related layout functions to ensure consistent behavior across different architectures.
      
      * [Refactor] Remove tf32 Casting Logic from GEMM Templates
      
      - Eliminated the `cast_float_to_tf32` function from `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` templates to streamline the code.
      - Removed conditional casting logic for float32 to tfloat32 conversion, enhancing clarity and maintainability.
      - Updated relevant sections in GEMM operations to reflect the removal of casting, ensuring consistent behavior across templates.
      - Adjusted tensor view handling to improve performance and accuracy in matrix operations.
      
      * Update bulk_copy.cc
      
      * Fix profiler initialization in GEMM test by removing TensorSupplyType argument for improved flexibility.
  10. 07 Jun, 2025 1 commit
    • [Bugfix] Add tf32 casting to GEMM templates (#556) · 8cc8db52
      Lei Wang authored
      * Add tf32 casting functionality to GEMM templates
      
      - Introduced a `cast_float_to_tf32` function to convert float32 values to tfloat32 format across gemm_sm80, gemm_sm89, and gemm_sm90 templates.
      - Implemented conditional casting in relevant sections of the GEMM operations to ensure compatibility with tfloat32 types.
      - Enhanced the handling of tensor views to support the new casting logic, improving performance and accuracy in matrix operations.
      
      * lint fix
      
      * Refactor tfloat32 casting logic in GEMM templates
      
      - Replaced the `is_tfloat32` boolean with `need_tfloat32_cast` to improve clarity and accuracy in determining when to cast float32 to tfloat32.
      - Updated relevant sections in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to utilize the new casting logic, enhancing compatibility with tfloat32 types.
      - Ensured consistent application of casting across tensor views, improving performance and correctness in matrix operations.
      
      * Refactor GEMM template functions for improved readability
      
      - Simplified the function signature of `body_rs` in both `gemm_sm80` and `gemm_sm90` templates for better clarity.
      - Adjusted the casting logic in `gemm_sm90` to ensure consistent application of `cast_float_to_tf32` across tensor views, enhancing performance and maintainability.
      
      * Enhance tf32 casting logic in GEMM templates
      
      - Updated the `cast_float_to_tf32` function in `gemm_sm80`, `gemm_sm89`, and `gemm_sm90` to conditionally apply the casting only if the input is finite, improving robustness.
      - Simplified the `need_tfloat32_cast` logic to clarify the conditions under which tfloat32 casting is required, enhancing code readability and maintainability.
      
      * Refactor GEMM template functions and layout inference logic
      
      - Removed the `cast_float_to_tf32` function from `gemm_sm90` and updated the `body_sr` function to streamline the casting process for tensor views, enhancing code clarity and maintainability.
      - Improved layout inference in `layout_inference.cc` by adding checks for the layout map's definition, ensuring robustness in handling layout annotations.
      - Simplified the handling of layout maps in the `annotate_layout` function, allowing for more flexible layout definitions and error handling.
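
      Background for the casting above: tf32 keeps fp32's sign and 8 exponent bits but only 10 of the 23 mantissa bits, so the cast amounts to dropping the low 13 mantissa bits (hardware rounds to nearest; plain truncation shown). An illustrative Python model of the conversion, including the finiteness guard this commit added — not the template code itself, which was later phased out in #573 above:

      ```python
      import math
      import struct


      def cast_float_to_tf32(x: float) -> float:
          """Model the fp32 -> tf32 narrowing done before tensor-core MMA."""
          if not math.isfinite(x):   # guard from this commit: pass NaN/inf through
              return x
          bits = struct.unpack("<I", struct.pack("<f", x))[0]
          bits &= 0xFFFFE000         # clear the low 13 of 23 mantissa bits
          return struct.unpack("<f", struct.pack("<I", bits))[0]


      assert cast_float_to_tf32(1.0) == 1.0
      ```
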
  11. 05 Jun, 2025 1 commit
    • [Enhancement] Add nvrtc execution backend (#461) · 17f7394f
      Gabriel Wu authored

      * [wip] feat: add nvrtc backend
      
      * [wip] fix: handle out_idx
      
      * [wip] refactor: move lib logic to libgen
      
      * feat: cache for nvrtc backend
      
      * fmt: run format
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: handle cuda bindings import error
      
      * fix: get kernel source
      
      * refactor: speedup pyimport
      
      * Improve error handling for missing cuda-python dependency in nvrtc backend. Raise ImportError with detailed installation instructions instead of logging a warning.
      
      * Enhance nvrtc backend error handling by introducing a flag to check for cuda-python availability. Raise ImportError with detailed installation instructions during initialization if the nvrtc backend is unavailable, improving user experience and clarity.
      
      * Update README.md to include recent NVRTC Backend addition, highlighting reduced compilation time for CUDA templates.
      
      * fix tl_templates
      
      * ensure CUDA context
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
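
      Opting into the new backend, following TileLang's `execution_backend` switch (requires the `cuda-python` package, per the import-error handling above):

      ```python
      import tilelang
      import tilelang.language as T


      @T.prim_func
      def scale(A: T.Tensor((1024,), "float32"),
                B: T.Tensor((1024,), "float32")):
          with T.Kernel(8, threads=128) as bx:
              for i in T.Parallel(128):
                  B[bx * 128 + i] = A[bx * 128 + i] * 2.0


      # NVRTC compiles the generated CUDA source in-process, avoiding the
      # slower nvcc round trip; out_idx marks B as the returned output.
      kernel = tilelang.compile(scale, out_idx=[1], execution_backend="nvrtc")
      ```
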
  12. 26 May, 2025 2 commits
    • [Enhancement] Add atomicAdd for FLOAT16x2 and FLOAT16x4 (#522) · 46798f25
      Lei Wang authored
      * [Enhancement] Add atomic addition functions for FLOAT16x2 and FLOAT16x4 in CUDA
      
      * Introduced `AtomicAddx2` and `AtomicAddx4` functions for performing atomic addition operations on double-width float types in CUDA.
      * Updated `customize.py` to include the new `atomic_addx4` function for external calls.
      * Modified `__init__.py` to export the new atomic addition function, ensuring accessibility in the module.
      
      * lint fix
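
      The packed variants presumably follow the scalar `T.atomic_add` call shape, covering four contiguous fp16 lanes per hardware atomic; a sketch with assumed indexing semantics:

      ```python
      import tilelang.language as T


      @T.prim_func
      def splitk_reduce(partial: T.Tensor((128, 128), "float16"),
                        C: T.Tensor((128, 128), "float16")):
          with T.Kernel(1, threads=128) as bx:
              local = T.alloc_fragment((128, 128), "float16")
              T.copy(partial, local)
              for i, j in T.Parallel(128, 32):
                  # One atomic covers 4 contiguous fp16 values, so j steps
                  # through 32 groups of 4 along the row (semantics assumed).
                  T.atomic_addx4(C[i, j * 4], local[i, j * 4])
      ```
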
    • [Refactor] Replace default fp8 dtype with cute to perform fast cast (#520) · 6addc509
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
      
      * Add global barrier state and expectation handling in CUDA code generation
      
      * Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
      * Updated `Finish` method to declare the global barrier state if needed.
      * Implemented handling for `EvaluateNode` to initialize the barrier expectation.
      * Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
      * Enhanced CUDA FP8 type definitions for better alignment and structure.
      
      * Enhance CUDA FP8 type handling and debug printing
      
      * Updated `cuda_fp8.h` to replace NVidia's FP8 types with Cute's FP8 types for better compatibility and structure.
      * Added specializations for `debug_print_var` and `debug_print_buffer_value` functions to support the new FP8 types, improving debugging capabilities for these data types.
      * Updated `debug.h` to include the new `cuda_fp8.h` header for access to the FP8 type definitions.
      
      * Refactor CUDA code generation to remove unnecessary managed qualifier for global barrier state
      
      * Updated the `Finish` method in `codegen_cuda.cc` to declare the global barrier state without the `__managed__` qualifier, simplifying the declaration.
      * Added a new `sync_global` function in `builtin.py` to synchronize all threads in a block, enhancing synchronization capabilities in the TileLang framework.
      
      * Remove deprecated CUDA kernel and Python script for FP8 E4M3 casting
      
      * Deleted the `cast_to_fp8_e4m3_kernel` CUDA kernel implementation and its corresponding Python script, streamlining the codebase by removing unused components related to FP8 E4M3 type casting.
      * This cleanup enhances maintainability and reduces potential confusion regarding obsolete code.
      
      * lint fix
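
      The SM-count helper above supports persistent-kernel launch sizing. The import path is an assumption based on the `cuda_driver.py` placement and `__init__.py` export noted in the commit:

      ```python
      # Module path assumed; the commit places get_num_sms in cuda_driver.py
      # and re-exports it via the package __init__.
      from tilelang.utils import get_num_sms

      num_sms = get_num_sms(0)   # streaming multiprocessors on CUDA device 0
      # Persistent-kernel pattern: launch one block per SM and loop over
      # tiles inside the kernel rather than oversubscribing the grid.
      grid_size = num_sms
      ```
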
  13. 25 May, 2025 1 commit
    • [Enhancement] Support auto synchronization for global memory access (#519) · 623edf4c
      Lei Wang authored
      * [Refactor] Enhance GEMM Warp Partitioning Logic and Introduce Buffer Remapping (#516)
      
      * Improved the warp partitioning logic in `Gemm::ComputeWarpPartition` to better accommodate various GEMM policies, including FullRow, FullCol, and Square, ensuring optimal performance based on matrix dimensions.
      * Introduced a new `RemapBufferRewriter` class to handle buffer reference updates and padding annotations during statement transformations, enhancing memory access safety and clarity.
      * Updated the `OptimizeForTarget` function to include a new step for configuring index bitwidth, improving the overall optimization process.
      * Refactored existing code to utilize constants for warp sizes, enhancing maintainability and readability.
      * Added checks to ensure correct warp allocation and padding map handling, improving robustness in memory management strategies.
      
      * [Refactor] Update ConfigIndexBitwidthRewriter to Support Auto-Check Feature
      
      * Modified the constructor of `ConfigIndexBitwidthRewriter` to include an `auto_check` parameter, allowing for dynamic bitwidth adjustments based on input conditions.
      * Enhanced the `VisitExpr_` methods to apply the new auto-check logic, ensuring that integer types are upgraded to 64 bits when necessary, or to a specified index bitwidth otherwise.
      * Updated the `ConfigIndexBitwidth` pass to determine the index bitwidth based on the presence of configuration, improving flexibility in handling different scenarios.
      
      * Add dynamic matrix multiplication example and corresponding test
      
      * Introduced `example_dynamic.py` to demonstrate dynamic matrix multiplication using TileLang and PyTorch, including a main function for execution and performance profiling.
      * Added `test_example_dynamic.py` to validate the functionality of the dynamic matrix multiplication example.
      * The example includes detailed parameter configurations and checks against PyTorch's implementation for correctness.
      
      * lint fix
      
      * Add get_num_sms function to retrieve the number of streaming multiprocessors on the CUDA device
      
      * Implemented the `get_num_sms` function in `cuda_driver.py` to return the count of streaming multiprocessors for a specified CUDA device.
      * Updated the `__init__.py` file to include the new function in the module exports.
      
      * lint fix
      
      * Add global barrier state and expectation handling in CUDA code generation
      
      * Introduced `vid_global_barrier_state_` and `vid_global_barrier_expect_` to manage global barrier synchronization in the CUDA code generator.
      * Updated `Finish` method to declare the global barrier state if needed.
      * Implemented handling for `EvaluateNode` to initialize the barrier expectation.
      * Removed unnecessary extern declaration for the global barrier state in `PrintStorageSync` method.
      * Enhanced CUDA FP8 type definitions for better alignment and structure.
  14. 22 May, 2025 1 commit
    • [Bugfix] Enhance smem copy selector for uncommon shape (#510) · dbe8689f
      Lei Wang authored
      * [Refactor] Enhance GEMM warp partitioning logic for improved performance and flexibility
      
      * Updated the warp partitioning logic in `Gemm::ComputeWarpPartition` to better handle various GEMM policies, including FullRow, FullCol, and Square.
      * Implemented checks to dynamically adjust warp allocation based on matrix dimensions, ensuring optimal performance.
      * Introduced a new `SelectCopy` template to streamline memory access patterns in CUDA templates, enhancing compatibility across different architectures.
      * Refactored the Python `GemmWarpPolicy` class to align with the updated C++ logic, improving clarity and maintainability in warp allocation strategies.
      
      * [Refactor] Optimize matrix multiplication parameters and performance in quickstart example
      
      * Updated thread count in the kernel context from 256 to 128 to enhance performance.
      * Increased block sizes for matrix dimensions (M, N, block_M, block_N) to 1024 and 128 respectively, improving computational efficiency.
      * Adjusted the pipeline stages in the GEMM loop from 0 to 3 for better parallel execution.
      * Cleaned up comments for clarity and corrected a typo in the memory copy comment.
      
      * [Refactor] Simplify Copy type selection in OperandTraits for improved clarity
      
      * Replaced the conditional Copy type definition with a new SelectCopy template in OperandTraits, enhancing readability and maintainability of the code.
      * This change streamlines the logic for selecting memory copy patterns based on matrix dimensions and warp configurations.
  15. 17 May, 2025 3 commits
    • [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility (#500) · 33937683
      Lei Wang authored
      * [Enhancement] Improve GEMM layout function and documentation
      
      * Added detailed documentation for the makeGemmABLayout function, explaining parameters and layout selection strategies.
      * Updated the layout selection logic to use mat_continuous consistently, enhancing clarity and correctness in memory layout calculations.
      * Adjusted the InferLayout method to reflect changes in the layout function, ensuring accurate matrix dimension handling for transposed cases.
      
      * lint fix
      
      * [Refactor] Update GEMM layout and operand traits for improved CUDA compatibility
      
      * Adjusted the InferLayout method in gemm.cc to include trans_A in fragment creation, enhancing layout inference for transposed matrices.
      * Updated OperandTraits in gemm_sm89.h and gemm_sm90.h to change the Copy type from SM75_U16x4_LDSM_N to SM75_U16x4_LDSM_T, optimizing memory access patterns for different warp configurations.
      * Enhanced static assertions in gemm_sm90.h to clarify requirements for num_warp_m, ensuring compatibility with Hopper architecture.
      
      * [Refactor] Clean up formatting in GEMM implementation and CUDA templates
      
      * Simplified the formatting of the fragment creation in the InferLayout method of gemm.cc for better readability.
      * Adjusted the static assertion message in gemm_sm90.h to enhance clarity regarding the num_warp_m requirement for Hopper architecture.
    • [Bugfix] Rename SM75_U16x8_LDSM_N into SM75_U16x8_LDSM_T for correctness (#499) · 2837878f
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      
      * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.
      
      * lint fix
      
      * Update Copy type in OperandTraits for GEMM templates to use SM75_U16x4_LDSM_T and SM75_U16x8_LDSM_T for improved memory access patterns across CUDA architectures.
    • [Enhancement] Fall back transposed_ldmatrix to `SM75_U16x4_LDSM_N` when warp_n is 8 (#498) · 68a3c4f3
      Lei Wang authored
      * Remove debug print statement from block_sparse_attn_triton.py and implement a timeout handler in autotuner for function execution. This enhances the robustness of the autotuner by allowing it to handle timeouts gracefully.
      
      * Enhance the autotuner module by adding a timeout handler for function execution, improving robustness in handling long-running tasks. This change includes the introduction of a custom TimeoutException and updates to the run_with_timeout function for better signal management.
      
      * Add merge shared memory allocations pass and related configurations
      
      - Introduced a new pass for merging shared memory allocations in GPU kernels, allowing for more efficient memory usage.
      - Registered configuration options for debugging and controlling the merging behavior.
      - Updated relevant files to integrate the new pass into the TileLang engine and transform modules.
      - Adjusted import paths and added documentation for the new functionality.
      
      * Reduce num_stages parameter in GEMM functions from 3 to 1 for improved performance in test_tilelang_kernel_gemm.py
      
      * Update Copy type in OperandTraits for GEMM templates to use conditional selection based on num_warp_n. This change enhances memory access patterns for different configurations in CUDA kernels.
      
      * lint fix
  16. 09 May, 2025 1 commit
    • [Feature] Implement fast integer power operation and related API (#466) · 1f5eb492
      Lei Wang authored
      * [Refactor] Enhance TMA barrier validation and support for additional architectures (#463)
      
      * Updated the TMA barrier validation in `inject_tma_barrier.cc` to check for non-empty `barrier_id_to_range_` before raising an error for missing `create_list_of_mbarrier`.
      * Refactored architecture checks in `phase.py` to utilize a new constant `SUPPORTED_TMA_ARCHS`, allowing for easier updates and improved readability in the target architecture validation logic.
      
      * [Feature] Implement fast integer power operation and related API
      
      * Added a new math operation `tl.power_of_int` in `math.cc` for efficient integer exponentiation.
      * Introduced a corresponding Python API `pow_of_int` in `tir/op.py` to facilitate usage in TileLang.
      * Enhanced `common.h` with a template function for integer power calculations.
      * Updated documentation to reflect the new functionality and usage examples.
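
      "Fast integer power" here is exponentiation by squaring — O(log n) multiplies instead of O(n); the `common.h` template presumably implements the same loop this Python sketch shows:

      ```python
      def pow_of_int(base: int, exp: int) -> int:
          """Exponentiation by squaring."""
          result = 1
          while exp > 0:
              if exp & 1:        # odd exponent: fold in one factor of base
                  result *= base
              base *= base       # square the base
              exp >>= 1          # halve the exponent
          return result


      assert pow_of_int(3, 7) == 3 ** 7
      ```
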
  17. 06 May, 2025 1 commit
    • [Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP (#450) · 0a8c8b99
      Lei Wang authored
      * [Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP
      
      * Introduced TILELANG_CHECK_LAST_ERROR macro to streamline error checking for kernel launches in both CUDA and HIP.
      * Updated kernel launch code in wrapper.py to utilize the new macro, enhancing readability and maintainability.
      * This change improves error reporting by providing detailed messages when kernel execution fails.
      
      * [Refactor] Standardize error message formatting in TILELANG_CHECK_LAST_ERROR macro
      
      * Updated the TILELANG_CHECK_LAST_ERROR macro in both CUDA and HIP implementations to ensure consistent formatting of error messages.
      * Enhanced readability by aligning the error message structure across different platforms, improving maintainability of error handling code.
  18. 29 Apr, 2025 1 commit
    • [Bugfix] Fix layout inference for free fragment buffer (#443) · 2ea45ae9
      Lei Wang authored
      * [Enhancement] Improve layout inference accuracy in ParallelOp (#441)
      
      * Added logic to use non-replicated buffers as source buffers for more accurate layout inference.
      * Enhanced comments to clarify the rationale behind buffer selection in layout inference process.
      
      * [Enhancement] Add error handling macros and refactor loop partitioning logic
      
      * Introduced TILELANG_CHECK macro for improved error handling in CUDA and HIP code, providing detailed error messages for kernel launches.
      * Enhanced loop partitioning logic to handle fragment buffers more effectively, ensuring correct replication based on thread extent.
      * Added logging for thread range in PlanLoopPartition to aid in debugging and performance analysis.
      * Updated pass configuration management to streamline vectorization control in the optimization process.
      
      * lint fix
      
      * remove debug print
  19. 25 Apr, 2025 1 commit
    • [Enhancement] Support cute mma tile mxn8ky (#434) · d1c15bc5
      Lei Wang authored
      * [Enhancement] Improve error handling in layout inference and update profiler type in tests
      
      * Added a detailed error message in the layout inference for local.fragment to clarify the requirement for trans_B.
      * Updated the profiler type in the cumulative sum test from TensorSupplyType.One to TensorDistributionType.Randn for better profiling accuracy.
      
      * lint fix
      
      * [Refactor] Update OperandTraits to include num_warp_n parameter
      
      * Modified OperandTraits templates across gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional num_warp_n parameter for improved flexibility in layout and copy operations.
      * Adjusted Copy type selection based on the new parameter to enhance performance and adaptability in various scenarios.
      
      * lint fix
      
      * [Refactor] Update DispatchInstruction templates to include N parameter
      
      * Modified DispatchInstruction templates in gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional N parameter, enhancing flexibility in tile size calculations.
      * Adjusted MMA_Group definitions to use std::min for improved handling of warp sizes, ensuring better performance and adaptability in various scenarios.
  20. 22 Apr, 2025 1 commit
    • [Language] Support tile operator `T.cumsum` (#423) · 88747fcd
      Lei Wang authored
      * [Feature] Implement CumSum operation in TileLang
      
      * Added CumSumOp class for cumulative sum operations, including argument validation and lowering logic.
      * Introduced CumSum2D template for CUDA, supporting both forward and reverse cumulative sums.
      * Created tests for CumSum functionality in shared memory and fragment contexts.
      * Updated language interface to include cumsum operation, enhancing the reduction capabilities of TileLang.
      * Refactored reduce.py to support cumsum functionality with appropriate memory allocation and copying mechanisms.
      
      * lint fix
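
      A sketch of the new tile operator; the keyword names for axis and direction are assumptions based on the forward/reverse support described above:

      ```python
      import tilelang.language as T


      @T.prim_func
      def block_cumsum(A: T.Tensor((128, 128), "float32"),
                       B: T.Tensor((128, 128), "float32")):
          with T.Kernel(1, threads=128) as bx:
              smem = T.alloc_shared((128, 128), "float32")
              T.copy(A, smem)
              # Running sum along the last axis, in place; reverse=True
              # would scan right-to-left (keyword names assumed).
              T.cumsum(smem, dim=1, reverse=False)
              T.copy(smem, B)
      ```
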
  21. 11 Apr, 2025 1 commit
    • [Language] Introduce `T.any_of` and `T.all_of` to reduce a bool array (#371) · c4638d65
      Lei Wang authored

      * [Enhancement] Introduce logical operations `any_of` and `all_of` for buffer checks
      
      - Added new logical operations `any_of` and `all_of` to the TileLang language interface, allowing users to check conditions across buffer elements.
      - Implemented corresponding intrinsic calls for CUDA, enhancing the functionality of the TileLang framework.
      - Updated the `allocate.py` to handle boolean types correctly in shared memory allocations.
      - Introduced tests for the new logical operations to ensure correctness and performance.
      Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>
      
      * lint fix
      
      ---------
      Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>
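
      A sketch of the new reductions over a boolean buffer. The exact call shape is assumed; the commit confirms only that they check a condition across buffer elements and that bool shared allocations are supported:

      ```python
      import tilelang.language as T


      @T.prim_func
      def has_positive(A: T.Tensor((128,), "float32"),
                       Out: T.Tensor((1,), "float32")):
          with T.Kernel(1, threads=128) as bx:
              flags = T.alloc_shared((128,), "bool")
              for i in T.Parallel(128):
                  flags[i] = A[i] > 0
              if T.any_of(flags):     # T.all_of(flags) requires every element
                  Out[0] = 1.0
      ```
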
  22. 03 Apr, 2025 1 commit
  23. 30 Mar, 2025 1 commit
    • [Enhancement] Add support for CUDA architecture 8.9 in GEMM template (#304) · edbb9b6d
      Lei Wang authored
      * [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
      
      - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
      - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
      
      * lintfix
      
      * [Refactor] Clean up includes in gemm_sm89.h
      
      - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
      - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
  24. 28 Mar, 2025 1 commit
  25. 27 Mar, 2025 1 commit
    • [Bugfix] Enable bfloat16 atomic operations only for CUDA architectures greater than 7.5 (#291) · 83412458
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      
      * [Refactor] Revamp cache management and enhance documentation in env.py and proxy.py
      
      - Replaced global cache functions with a CacheState class to improve encapsulation and management of kernel caching.
      - Updated the `from_ptr` method in BufferProxy and BaseTensorProxy classes to include detailed docstrings for better clarity on parameters and return values.
      - Enhanced class docstrings across various proxy classes to provide clearer descriptions of their purpose and functionality, improving overall code documentation.
      
      * [Refactor] Update imports in __init__.py for tir compatibility
      
      - Added imports for `prim_func` and `tir.op` to enhance compatibility with the upstream tir script.
      - Marked imports with `# noqa: F401` to suppress linting warnings for unused imports, indicating future removal once compatibility is achieved.
      
      * lint fix
      
      * [Refactor] Update imports in tir.ir.py for improved compatibility
      
      - Removed unused import of `PrimExpr` from `tvm.script.ir_builder.tir` and replaced it with the correct import from `tvm.tir`.
      - Added import for `tir.ir` in `__init__.py` to enhance module accessibility and maintain compatibility with upstream changes.
      
      * [Refactor] Update function calls in tir.ir.py to return values
      
      - Modified the `serial`, `parallel`, `vectorized`, `unroll`, `thread_binding`, and `grid` functions to return the results of their respective calls to `_ir` methods, enhancing clarity and ensuring proper value propagation.
      
      * bugfix
      
      * [Enhancement] Add support for uint16 data type in TLCUDASourceWrapper
      
      - Introduced the "uint16" mapping to the type dictionary in the TLCUDASourceWrapper class, expanding the range of supported data types for CUDA operations.
      
      * bugfix
      
      * [Update] Sync subproject commit and modify CUDA atomic add functions
      
      - Updated the subproject commit for TVM to edd35139a0481e9359aa269e3e50450b95ba2f5a.
      - Commented out the CUDA capability check in the example convolution script to prevent execution errors.
      - Refactored atomic add functions for BFLOAT16 in common.h to include a conditional compilation directive for improved compatibility with CUDA architectures.
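
      `T.make_tensor` builds a typed view over a raw pointer argument, replacing `T.Tensor.from_ptr`; the argument order (pointer, shape, dtype) is assumed from the `from_ptr` form it supersedes:

      ```python
      import tilelang.language as T


      @T.prim_func
      def copy_from_ptrs(a_ptr: T.handle, c_ptr: T.handle):
          # Typed views over raw pointers; shapes/dtype fixed at trace time.
          A = T.make_tensor(a_ptr, (1024, 1024), "float16")
          C = T.make_tensor(c_ptr, (1024, 1024), "float16")
          with T.Kernel(8, 8, threads=128) as (bx, by):
              smem = T.alloc_shared((128, 128), "float16")
              T.copy(A[by * 128, bx * 128], smem)
              T.copy(smem, C[by * 128, bx * 128])
      ```
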
  26. 20 Mar, 2025 1 commit
    • [Refactor] Phase out LLVM Dependency by Making it Optional (#247) · f2e99180
      Lei Wang authored
      * remove llvm build
      
      * [Refactor] Update kernel compilation and profiling in examples
      
      - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
      - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
      - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.
      
      * lint fix
      
      * License Update
      
      * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files
      
      - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
      - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.
      
      * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files
      
      - Improved comment alignment and readability in `cuda.h`.
      - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix
      
      * License update
      
      * [Enhancement] Update JITKernel to use artifact for kernel source
      
      - Assigned the generated artifact to `self.artifact` for better management.
      - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.
      
      * lint fix
      
      * Add @tilelang.testing.requires_llvm decorator to vectorization tests
      
      * Enhance setup.py and env.py for library management
      
      - Added functionality to remove original files after copying in CMakeBuild.
      - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.
      
      * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py
      
      * Refactor CMakeBuild file handling in setup.py
      
      - Added a check to ensure the target library directory exists before copying .so files.
      - Improved the logic for creating the target directory and copying files to enhance robustness.
      
      * bugfix
      
      * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.
      
      * lint fix
      
      * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.
      
      * lint fix
      
      * Add support for C target in device code generation
      
      - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.
      
      * [Enhancement] Implement auto-clear cache feature based on environment variable
      
      * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
      * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
      * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
      
      * [Refactor] Update kernel invocation and import paths in tests and cache
      
      * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
      * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
      * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.
      
      * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py
      
      * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
      * Enhanced overall code formatting to align with project standards.
      
      * [Enhancement] Add bfloat16 test case and improve kernel caching logic
      
      * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
      * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
      * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
      * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
      * Improved code formatting and readability across several files.
      
      * lint fix
      
      * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
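
      The replacement workflow in miniature — `tilelang.compile` instead of `tilelang.lower`, with profiling through the returned kernel handle, as this commit's example updates describe:

      ```python
      import tilelang
      import tilelang.language as T


      @T.prim_func
      def add_one(A: T.Tensor((1024,), "float32"),
                  B: T.Tensor((1024,), "float32")):
          with T.Kernel(8, threads=128) as bx:
              for i in T.Parallel(128):
                  B[bx * 128 + i] = A[bx * 128 + i] + 1.0


      kernel = tilelang.compile(add_one, out_idx=[1])
      profiler = kernel.get_profiler()
      latency_ms = profiler.do_bench()   # average kernel latency in ms
      ```
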
  27. 19 Mar, 2025 1 commit
    • [Enhancement] Add zero initialization option to GEMM operations (#246) · 701e9234
      Yu Cheng authored
      * [Enhancement] Add zero initialization option to GEMM operations
      
      - Introduced a new `zero_init` parameter to the GEMM function, allowing for optional zero initialization of the accumulator.
      - Updated the GEMM implementation across various CUDA architectures to support the new parameter.
      - Modified the Python interface for GEMM to include the `zero_init` argument, enhancing flexibility in kernel execution.
      - Kept existing behavior as the default while exposing explicit control over accumulator initialization.
      
      * rename zero_init to clear_accum
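
      For illustration, a minimal sketch of the renamed flag at a `T.gemm` call site; the kernel body, shapes, and dtypes are illustrative, and only the `clear_accum` keyword itself comes from this commit:

      ```python
      import tilelang.language as T

      @T.prim_func
      def matmul(A: T.Buffer((128, 64), "float16"),
                 B: T.Buffer((64, 128), "float16"),
                 C: T.Buffer((128, 128), "float")):
          with T.Kernel(1, 1, threads=128) as (bx, by):  # trivially 1x1 grid for brevity
              A_shared = T.alloc_shared((128, 64), "float16")
              B_shared = T.alloc_shared((64, 128), "float16")
              C_local = T.alloc_fragment((128, 128), "float")
              T.copy(A, A_shared)
              T.copy(B, B_shared)
              # clear_accum=True zero-initializes the accumulator, replacing a
              # separate T.clear(C_local) before the gemm.
              T.gemm(A_shared, B_shared, C_local, clear_accum=True)
              T.copy(C_local, C)
      ```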
      
      * lint
      701e9234
  28. 17 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Bugfix] Disable force inline for ldmatrix (#227) · a1da26f2
      Lei Wang authored
      * Refactor GEMM and Bulk Copy operations to enhance layout handling and support for Hopper architecture
      
      - Update `ComputeWarpPartition` to include a new parameter for Hopper WGMMA support.
      - Modify layout checks in `LowerBulkCopy` to accommodate new GEMM layout types.
      - Enhance layout inference logic in `InferLayout` for better compatibility with Hopper architecture.
      - Include necessary header files for built-in operations and layout inference improvements.
      
      * Refactor parameter formatting in CUDA matrix load functions for consistency
      
      - Adjusted parameter alignment in `ptx_ldmatrix_x1`, `ptx_ldmatrix_x2`, `ptx_ldmatrix_x4`, and their transposed counterparts for improved readability.
      - Added a blank line in `get_tensor_supply` function in `tensor.py` to enhance code clarity.
      
      * Enhance tensor supply generation in `get_tensor_supply` function
      
      - Introduced handling for unsigned integer and float8 tensor types, returning random tensors appropriate to each data type.
      - Refactored the existing conditions for clarity and maintainability.
      
      * Fix tensor supply generation logic in `get_tensor_supply` function
      
      - Updated the variable reference from `tensor` to `param` to ensure correct handling of tensor data types.
      - Improved the accuracy of unsigned integer and float8 checks for tensor supply generation, enhancing functionality and reliability.
      
      * Enhance tensor supply checks in `get_tensor_supply` function
      
      - Updated the logic for identifying unsigned integers and float8 types by using `removeprefix` on the dtype string, improving accuracy in tensor supply generation.
      - Ensured better handling of tensor data types for more reliable random tensor generation based on the updated checks.
      
      * Enhance KernelParam functionality and improve tensor supply checks
      
      - Added methods `is_unsigned` and `is_float8` to the `KernelParam` class for better type identification of parameters.
      - Updated the `get_tensor_supply` function to utilize the new methods, improving clarity and accuracy in tensor supply generation based on parameter types.
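
      A minimal sketch of the final shape of these checks; `KernelParam` is reduced to a stub here, and the dtype-string handling follows the commit's `removeprefix` approach while the random-generation details are assumptions:

      ```python
      import torch

      class KernelParam:
          """Stub of the kernel parameter record; only dtype handling is shown."""

          def __init__(self, dtype: str, shape: tuple):
              self.dtype = dtype  # torch-style dtype string, e.g. "torch.uint8"
              self.shape = shape

          def is_unsigned(self) -> bool:
              # Strip the "torch." prefix per the commit, then test the type family.
              return self.dtype.removeprefix("torch.").startswith("uint")

          def is_float8(self) -> bool:
              return self.dtype.removeprefix("torch.").startswith("float8")

      def get_tensor_supply(param: KernelParam) -> torch.Tensor:
          # Unsigned and float8 types cannot take the default randn path directly.
          if param.is_unsigned():
              return torch.randint(0, 8, param.shape, dtype=torch.uint8)
          if param.is_float8():
              # Requires a torch build with fp8 support (>= 2.1).
              return torch.randn(param.shape).to(torch.float8_e4m3fn)
          return torch.randn(param.shape)
      ```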
      a1da26f2
  29. 16 Mar, 2025 1 commit
  30. 14 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Allow mma fallback when wgmma is not supported (#206) · 45559a1f
      Lei Wang authored
      * Enhance error message for constant size stack allocation in CUDA codegen. Include the actual constant size and buffer variable name in the error output for better debugging.
      
      * Refactor GEMM and Bulk Copy operations to enhance layout handling and support for Hopper architecture
      
      - Update `ComputeWarpPartition` to include a new parameter for Hopper WGMMA support.
      - Modify layout checks in `LowerBulkCopy` to accommodate new GEMM layout types.
      - Enhance layout inference logic in `InferLayout` for better compatibility with Hopper architecture.
      - Include necessary header files for built-in operations and layout inference improvements.
      
      * lint fix
      
      * Remove unused builtin.h include directive
      
      * Update include path for builtin.h
      45559a1f
  31. 13 Mar, 2025 1 commit
    • zqh-wz's avatar
      [Feature] Upgrade cutlass version and support fp8 T.gemm (#202) · 2cccf1f5
      zqh-wz authored
      
      
      * upgrade cutlass to upstream v3.8.0
      
      * Implement fp8 gemm and add example script
      
      * Fix dtype retrieval with map_torch_type for fp8 inputs
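
      For illustration, a minimal sketch of the kind of mapping such a fix needs, since a plain `getattr(torch, dtype)` lookup cannot resolve fp8 names; the exact string keys are assumptions:

      ```python
      import torch

      def map_torch_type(dtype: str) -> torch.dtype:
          # fp8 dtype strings need an explicit table; the remaining dtypes
          # resolve directly as torch attributes (e.g. "float16", "int32").
          fp8_map = {
              "e4m3_float8": torch.float8_e4m3fn,
              "e5m2_float8": torch.float8_e5m2,
          }
          if dtype in fp8_map:
              return fp8_map[dtype]
          return getattr(torch, dtype)
      ```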
      
      * Disable vectorization of fp8 values
      
      * Make MMA declaration compatible with cutlass 3.4.0+
      
      * Add test for fp8 T.gemm
      
      * fix indent
      
      * fix indent
      
      * Add copyright and license header
      
      * Add copyright and license header
      
      * lint fix
      
      * Refactor matmul_nt and assert_matmul_correctness functions for improved readability by consolidating parameter definitions and adjusting formatting.
      
      * clang format lint
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      2cccf1f5
  32. 12 Mar, 2025 1 commit
    • Yu Cheng's avatar
      [Feature] Add TMA Store Synchronization Support (#195) · eba7dd5a
      Yu Cheng authored
      - Introduce TMAStoreArrive and TMAStoreWait operations for CUDA TMA store synchronization
      - Add new builtin operations in op/builtin.cc and op/builtin.h
      - Implement TMAStoreSyncInjector to automatically inject TMA store synchronization calls
      - Update CUDA codegen to support new TMA store synchronization intrinsics
      - Add Python language bindings for new TMA store synchronization operations
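
      A minimal sketch of what Python bindings over such builtins could look like; the `tl.*` op names and the `call_intrin` form are assumptions about how the intrinsics are exposed, not the repository's actual binding code:

      ```python
      from tvm import tir

      def tma_store_arrive() -> tir.Call:
          # Commits the outstanding TMA store group on the device side.
          return tir.call_intrin("handle", "tl.tma_store_arrive")

      def tma_store_wait() -> tir.Call:
          # Blocks until previously committed TMA stores have completed.
          return tir.call_intrin("handle", "tl.tma_store_wait")
      ```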
      eba7dd5a
  33. 11 Mar, 2025 1 commit
    • Yu Cheng's avatar
      [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug (#188) · fe0de672
      Yu Cheng authored
      * [Dev][Bugfix] Add RMS Normalization Kernels and Fix Reduce Bug
      
      - Add two RMS normalization kernels in TileLang:
        * `rms_norm_splitk`: Split-K reduction approach for large matrices
        * `rms_norm`: Full-reduction kernel with a simplified implementation (sketched after this list)
      - Add reference implementation using PyTorch for validation
      - Include performance benchmarking for both kernel variants
      - Demonstrate flexible block size and matrix size configurations
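
      A minimal sketch of the full-reduction variant in TileLang's Python DSL, assuming standard `T.copy`/`T.reduce_sum` primitives; the block size, buffer names, and kernel structure are illustrative rather than the exact kernel from this commit:

      ```python
      import tilelang.language as T

      def rms_norm(M, N, blk_m, dtype="float"):
          @T.prim_func
          def main(A: T.Buffer((M, N), dtype), B: T.Buffer((M, N), dtype)):
              with T.Kernel(T.ceildiv(M, blk_m), threads=128) as bx:
                  A_shared = T.alloc_shared((blk_m, N), dtype)
                  A_local = T.alloc_fragment((blk_m, N), dtype)
                  A_pow = T.alloc_fragment((blk_m, N), dtype)
                  A_sum = T.alloc_fragment((blk_m,), dtype)

                  T.copy(A[bx * blk_m, 0], A_shared)
                  T.copy(A_shared, A_local)
                  # Square each element, reduce along the row, then rescale.
                  for i, j in T.Parallel(blk_m, N):
                      A_pow[i, j] = A_local[i, j] * A_local[i, j]
                  T.reduce_sum(A_pow, A_sum, dim=1)
                  for i in T.Parallel(blk_m):
                      A_sum[i] = T.rsqrt(A_sum[i] / N)
                  for i, j in T.Parallel(blk_m, N):
                      A_local[i, j] *= A_sum[i]
                  T.copy(A_local, B[bx * blk_m, 0])

          return main
      ```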
      
      * [Examples] Simplify RMS Normalization Kernel Compilation
      
      - Remove commented-out code for split-K RMS normalization
      - Simplify kernel compilation by removing explicit TMA lowering configuration
      - Update copyright header to Tile-AI Corporation
      - Streamline main script for RMS normalization example
      fe0de672
  34. 05 Mar, 2025 2 commits
    • Lei Wang's avatar
      [Enhancement] Support debug print for unsigned char datatype (#145) · bb60f6ce
      Lei Wang authored
      * Fix debug print buffer template for unsigned char type
      
      - Update debug_print_buffer_value template specialization for unsigned char
      - Modify test_tilelang_debug_print.py to include additional dtype tests
      - Add test case for uint8 dtype in debug print buffer function
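
      For illustration, a minimal sketch of the kind of kernel such a uint8 test compiles, assuming the `T.print` debug primitive; shapes and names are illustrative:

      ```python
      import tilelang.language as T

      @T.prim_func
      def main(A: T.Buffer((64,), "uint8")):
          with T.Kernel(1, threads=64) as bx:
              A_shared = T.alloc_shared((64,), "uint8")
              T.copy(A, A_shared)
              # Exercises the unsigned char specialization in the debug printer.
              T.print(A_shared)
      ```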
      
      * Refactor debug print buffer template formatting for unsigned char
      
      - Improve code formatting for debug_print_buffer_value template specialization
      - Adjust line breaks and indentation for better readability
      - Maintain consistent code style with other template specializations
      bb60f6ce
    • Yu Cheng's avatar
      [Dev] Adjust computation logic to avoid precision loss when casting acc_s from... · e1d82bf3
      Yu Cheng authored
      [Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16 (#141)
      
      - Remove redundant `acc_s_0` fragment in flash attention kernel
      - Simplify memory copy and reduction operations
      - Reorder memory copy and scaling steps for improved performance
      - Add Hopper-specific synchronization method in CUDA reduce template
      - Update reduce operation to use architecture-specific synchronization
      e1d82bf3