1. 24 Jul, 2025 3 commits
    • [Enhancement] Improve buffer conflict detection in thread storage synchronization (#658) · a16f0cf5
      Lei Wang authored
      * [Enhancement] Improve buffer conflict detection in thread storage synchronization
      
      - Added a new boolean variable `range_is_overlap` to accurately determine if buffer indices overlap, enhancing the conflict detection logic in `thread_storage_sync.cc`.
      - Updated the return logic to reflect the overlap status, ensuring correct conflict resolution based on buffer index comparisons.
      - Removed an unnecessary comment in `OptimizeForTarget` to streamline the code and improve clarity.
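
      The overlap test described above can be sketched as follows. This is a minimal illustration only, not the code in `thread_storage_sync.cc`: the helper names `range_overlaps` and `accesses_conflict`, and the use of constant half-open integer ranges, are assumptions made for the example.

```python
from typing import Sequence, Tuple

# Hypothetical representation: (min, extent) is the half-open interval
# [min, min + extent) touched along one buffer dimension.
Range = Tuple[int, int]


def range_overlaps(a: Range, b: Range) -> bool:
    """Return True when two half-open intervals share at least one index."""
    a_min, a_extent = a
    b_min, b_extent = b
    return a_min < b_min + b_extent and b_min < a_min + a_extent


def accesses_conflict(a: Sequence[Range], b: Sequence[Range]) -> bool:
    """Two accesses conflict only if their ranges overlap in every dimension."""
    if len(a) != len(b):
        # Conservative choice: an unknown relationship is treated as a conflict.
        return True
    return all(range_overlaps(ra, rb) for ra, rb in zip(a, b))
```

      With this sketch, two accesses that are disjoint along any single dimension are reported as conflict-free, so no synchronization would be needed between them.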
      
      * example fix
      
      * enhancement
      
      * improve ci
    • [Bugfix][Docs] Update documentation build process and configurations for autoapi support (#663) · c8edb957
      Wenhao Xie authored
      * [Bugfix][Docs] Update documentation build process and configurations for autoapi support
      
      * lint fix
    • [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking (#653) · fe6cdc9d
      Zhengju Tang authored
      
      * [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking
      
      * Lint
      
      * test fix
      
      * Update CI workflow to install dependencies without user site packages
      
      - Modified the installation commands in the CI workflow to include the `--no-user` flag for both `requirements-dev.txt` and `requirements-test.txt`, ensuring that packages are installed in the virtual environment rather than the user site directory.
      
      * Update CI workflow to install pip without user site packages
      
      - Added the `--no-user` flag to the pip installation command in the CI workflow for both development and testing dependencies, ensuring that packages are installed within the virtual environment.
      
      * Update requirements-test.txt
      
      * reduce ci problem size
      
      * Refactor example_mla_decode.py for consistent formatting and remove unused imports in test_example_mla_decode.py
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  2. 23 Jul, 2025 5 commits
    • [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control (#656) · d764dca8
      Wenhao Xie authored
      
      * [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control
      
      * lint fix
      
      * upd
      
      * lint fix
      
      * fix typo
      
      * update typing
      
      * update the use case of compile flags
      
      * ci fix
      
      * fix
      
      * Fix CI workflow to correctly activate virtual environment from shared cache directory
      
      * use local cache
      
      * fix
      
      * fix
      
      * fix
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [Cache] Support shared cache directories for multiple processes (#649) · 267d9b3b
      Lei Wang authored
      
      
      * Support shared cache directories for multiple users
      
      * ruff fix
      
      * ci_fix
      
      * Add CI step to show worker info
      
      ---------
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
    • [Bugfix][CI] Bug fixing and migrate CI from ada to hopper (#652) · e9a608e2
      Wenhao Xie authored
      
      
      * fix CI bugs in hopper
      
      * lint fix
      
      * Update bulk_copy.cc
      
      * Refactor bulk copy logic in LowerBulkCopy function
      
      - Removed unnecessary blank lines for improved code readability.
      - Enhanced stride validation by checking for null pointers in global stride calculations, ensuring robustness against symbolic strides.
      - Updated pass configuration handling in dynamic tile language tests to streamline dynamic alignment and TMA lower pass settings.
      
      * test fix
      
      * ci fix
      
      * Update flash-attention dependencies and clean up example code
      
      - Downgraded `flash-attn` dependency version in `requirements-test.txt` to `<=2.2.0`.
      - Removed unused imports and commented-out code in various example files to enhance readability and maintainability.
      - Updated the `flashattn` function signature to include default parameters for `block_M`, `block_N`, `num_stages`, and `threads`.
      - Cleaned up the `example_mha_fwd_varlen.py` and `example_mha_bwd_wgmma_pipelined.py` files by removing unnecessary comments and improving code clarity.
      - Deleted the `example_mha_inference.py` file as it is no longer needed.
      
      * Update CI workflow to remove `--user` flag from pip install commands
      
      - Removed the `--user` flag from the pip install commands in both the development and testing sections of the CI workflow to ensure proper installation of dependencies in the virtual environment.
      
      * Update CI workflow to include `--no-user` flag in pip install commands
      
      - Added the `--no-user` flag to the pip install commands in both the development and testing sections of the CI workflow to ensure dependencies are installed correctly within the virtual environment.
      
      * Update CI workflow to include `--no-user` flag in pip install command for wheel mode
      
      - Added the `--no-user` flag to the pip install command in the wheel mode section of the CI workflow to ensure dependencies are installed correctly within the virtual environment.
      
      * test fix
      
      * avoid conflict with system environments
      
      * test fix
      
      * add comments
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  3. 22 Jul, 2025 1 commit
  4. 21 Jul, 2025 2 commits
    • [Refactor] Remove small array reuse condition in shared memory allocation merging (#654) · 8205791d
      Lei Wang authored
      - Eliminated the condition that disabled the reuse of small arrays (const_nbits <= 32) in the `MergeSharedMemoryAllocations` function, allowing for more flexible memory management.
      - Added a comment in `OptimizeForTarget` to clarify the order of applying `MergeSharedMemoryAllocations` after `SplitHostDevice`, ensuring correct allocation site handling in device functions.
    • [Bugfix] Assign Target for jit kernel (#648) · 6e994b12
      meinie authored
      
      
      * fix: Copy Target to self.target
      
      * refactor: Remove unused target attribute and adjust context management in JITKernel
      
      - Removed the unused `target` attribute from the `JITKernel` class.
      - Updated the context management in the `compile` method to utilize `self.target`, improving clarity and ensuring proper resource handling during compilation.
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  5. 20 Jul, 2025 2 commits
  6. 17 Jul, 2025 2 commits
    • [Enhancement] Align dynamic shared memory allocations in phase.py (#644) · b060c9f7
      Lei Wang authored
      - Added a comment to clarify the alignment of dynamic shared memory allocations in the `OptimizeForTarget` function.
      - Refactored the handling of shared memory allocation merging and synchronization to streamline the process, ensuring consistent behavior regardless of the aggressive merge flag.
      - Improved code clarity by removing redundant conditional checks related to synchronization and memory allocation.
    • [Enhancement] Add Cython cache directory to setup.py (#643) · 6c0a5841
      Lei Wang authored
      - Included the Cython cache directory in the list of source files for the TileLang build process, ensuring proper handling of cached Cython files during the build.
  7. 16 Jul, 2025 5 commits
    • [Example] Add paged block-sparse flash-decoding kernel (#638) · 2aded11a
      YizhaoGao authored
      
      
      * Add paged block-sparse flash-decoding kernel
      
      * Update example_tilelang_sparse_gqa_decode_paged.py
      
      * lint fix
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [Enhancement] Extend pythonic_expr to support dtype mapping in utils.py (#641) · 60974197
      Lei Wang authored
      - Updated the `pythonic_expr` function to accept an optional `dtype_map` parameter, allowing for more flexible type conversions.
      - Refactored calls to `pythonic_expr` in `TLCUDASourceWrapper` to utilize the new mapping feature, improving type handling in kernel generation.
      - Enhanced code clarity by consolidating repeated calls to `pythonic_expr` into a private method within the wrapper class.
    • [Bugfix] Put thread_extent into reduce (#640) · 156ff85e
      Lei Wang authored
      * [Enhancement] Update AllReduce operation to include thread offset in kernel generation
      
      - Modified the `ReduceOp::Lower` method to incorporate the thread offset in the AllReduce kernel generation for the sm_90 architecture.
      - This change improves the accuracy of thread management during reduction operations, enhancing performance on specific GPU architectures.
      
      * [Enhancement] Refactor thread offset handling in AllReduce kernel generation
      
      - Updated the `ReduceOp::Lower` method to streamline the handling of thread offset for AllReduce operations, ensuring consistent usage across different architectures.
      - This change enhances code clarity and maintains performance improvements for the sm_90 architecture by reducing redundancy in thread offset calculations.
    • Lei Wang's avatar
      b5ac9bba
    • [Warp Specialize] Implicit Warp Specialize Programming Model (#605) · e2d25ba8
      Lei Wang authored
      * [Enhancement] Improve memory access condition checks in GlobalMemChecker
      
      - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations.
      - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis.
      
      * lint fix
      
      * [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy
      
      - Implemented checks to ensure that the shared memory range and global range are legal during the bulk copy operation.
      - Added assertions to validate that the extents of global and shared ranges match, improving the robustness of memory access validation in the LowerBulkCopy function.
      
      * [Refactor] Update barrier and clear operations in warp specialization examples
      
      - Replaced `mbarrier_wait_parity` and `mbarrier_arrive` with `barrier_wait` and `barrier_arrive` for improved clarity and consistency in synchronization.
      - Adjusted the order of `clear` operations for local fragments in `example_warp_specialize_gemm_copy_1_gemm_0` to enhance parallel execution efficiency.
      
      * [Enhancement] Implement thread partial synchronization and improve shared memory allocation handling
      
      - Added support for thread partial barrier synchronization in CUDA, allowing for more flexible thread management.
      - Enhanced the `MergeSharedMemoryAllocations` function to accept alignment bytes, improving memory allocation efficiency based on target requirements.
      - Updated the `Lower` methods in `Copy` and `Fill` classes to include conditional predicates for thread execution, ensuring better control over thread behavior.
      - Refactored the `print` function to include warp group and warp IDs for more detailed debugging output.
      - Improved the handling of dynamic shared memory allocations in the `LowerAndLegalize` function to align with target-specific requirements.
      
      * [Enhancement] Add support for disabling TMA in Copy operations
      
      - Introduced a new `disable_tma` parameter in the `Copy` class to control whether the Tensor Memory Accelerator (TMA) is used for the copy.
      - Updated the `Lower` method to conditionally execute bulk copy operations based on the `disable_tma` flag.
      - Enhanced the `copy` function to accept the `disable_tma` argument, allowing for more flexible memory copy operations.
      - Improved handling of `coalesced_width` to ensure it defaults to -1 when not provided, enhancing robustness in memory operations.
      
      * [Refactor] Clean up whitespace and formatting in multiple files
      
      - Removed unnecessary blank lines and adjusted line breaks for improved code readability in `example_mla_decode.py`, `example_warp_specialize_gemm_copy_gemm_0_1.py`, `phase.py`, and `copy.py`.
      - Ensured consistent formatting across functions to enhance maintainability and clarity of the codebase.
      
      * [Enhancement] Refactor flash attention implementation for improved performance and configurability
      
      - Split the shared memory allocations for query and key-value pairs to optimize memory usage.
      - Introduced command-line arguments for batch size, number of heads, and dimensions, enhancing flexibility in running the example.
      - Updated kernel execution parameters to improve thread management and synchronization.
      - Enhanced the overall structure of the flash attention function for better readability and maintainability.
      
      * fix
      
      * Update layout inference in ParallelOp to account for thread bounds; remove debug print in OptimizeForTarget
      
      * Refactor barrier handling and update example configurations
      
      - Replaced commented-out barrier creation with new barrier allocation in GEMM example.
      - Updated kernel configuration in warp specialization example to include async copy settings.
      - Enhanced barrier management in the phase optimization process to improve synchronization handling.
      - Introduced new barrier allocation function for better memory management in shared contexts.
      
      * Refactor barrier handling in LowerAndLegalize and OptimizeForTarget
      
      - Reintroduced barrier lowering in OptimizeForTarget to enhance synchronization.
      - Removed commented-out barrier lowering in LowerAndLegalize for cleaner code.
      - Added exit() call in OptimizeForTarget to halt execution after barrier lowering.
      
      * Enhance CMake configuration and clean up example scripts
      
      - Enabled compile command export in CMakeLists.txt for better build integration.
      - Removed unnecessary print statement in the warp specialization example.
      - Cleaned up commented-out code in GEMM example for improved readability.
      - Updated barrier handling in shared memory allocation transformations for better synchronization.
      
      * Refactor barrier handling in warp specialization examples
      
      - Replaced commented-out mbarrier code with new barrier allocation using T.alloc_barrier for improved synchronization.
      - Updated barrier wait and arrive calls to align with the new allocation method across multiple example scripts.
      - Enhanced code readability by removing unnecessary comments and ensuring consistent barrier management.
      
      * Update lower_shared_barrier.cc
      
      * Update phase.py
      
      * Update warp specialization example and Cython wrapper
      
      - Removed commented-out pass configuration options in the warp specialization example for clarity.
      - Added functionality to write the generated kernel source to a file named "kernel.cu".
      - Enhanced Cython wrapper to support boolean type conversion for improved type handling.
      
      * Add storage synchronization call in shared barrier transformation
      
      - Introduced a new evaluation statement to call the TVM storage sync function with "shared" as an argument, enhancing synchronization in the shared barrier handling process.
      
      * remove debug files
      
      * Remove kernel source output to file in warp specialization example
      
      * remove comments
      
      * Refactor tensor handling and update test execution in TileLang
      
      - Changed `Buffer` to `Tensor` in `customize.py` for better type consistency.
      - Updated `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to use `tir.BufferLoad` instead of `BufferLoad`.
      - Commented out the main testing function in `test_tilelang_language_reshape.py` and replaced it with a direct call to `run_reshape_smem` for streamlined testing.
      - Removed unnecessary NVCC compiler flags in `libgen.py` to reduce verbosity.
      
      * Update test_tilelang_language_reshape.py
  8. 15 Jul, 2025 4 commits
    • support torch.bool as kernel input (#636) · 68989d80
      Lei Wang authored
    • [Dev] Update benchmark and decoding scripts to refine condition checks and optimize tensor operations (#637) · e937faa6
      Yu Cheng authored
      
      - Enhanced the condition in `compare_ab` to ensure baseline checks align with target exclusions.
      - Removed unnecessary tensor allocation in `mla_decode_tilelang`, optimizing memory usage and improving performance by directly using shared tensors in GEMM operations.
    • [Pass][Simplify] Introduce symbolic level simplify for condition expression (#634) · 02a0cf59
      Lei Wang authored
      * [Enhancement] Add argument simplification option to StmtSimplifier
      
      - Introduced a new `simplify_arguments` flag in the `StmtSimplifier::Apply` method to control argument simplification behavior.
      - Updated the `Simplify` function to accept the new flag, allowing for enhanced flexibility in the simplification process.
      - Adjusted the `LowerAndLegalize` and `_Simplify` functions to utilize the new argument, ensuring consistent behavior across the codebase.
      - Added comments to clarify the purpose of the new flag and its impact on simplification logic.
      
      * lint fix
      
      * [Enhancement] Improve layout inference and reduce operation handling
      
      - Updated `ParallelOp::InferLayout` to check for pure buffer stores, enhancing layout inference logic.
      - Modified `ReduceOp::Lower` to include all threads in the AllReduce operation, improving performance on specific architectures.
      - Added a TODO comment in `AllReduce` to consider merging synchronization barriers for optimization.
      
      * lint fix
      
      * [Enhancement] Add input validation for GEMM parameters
      
      - Introduced checks to ensure that the dimensions M and N are divisible by their respective warp sizes (kMPerWarp and kNPerWarp) in the Gemm::ComputeWarpPartition method.
      - Added informative error messages to assist in debugging when the input parameters do not meet the required conditions.
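
      The kind of divisibility check described above can be sketched in a few lines. This is an illustration in Python rather than the C++ in `Gemm::ComputeWarpPartition`; the function name `check_warp_partition` and the default tile sizes (16 and 8) are assumptions chosen for the example, not values taken from the source.

```python
def check_warp_partition(M: int, N: int, m_per_warp: int = 16, n_per_warp: int = 8) -> None:
    """Raise a descriptive error when M or N cannot be split evenly across warps.

    m_per_warp and n_per_warp stand in for kMPerWarp and kNPerWarp; the
    defaults here are placeholders for illustration only.
    """
    if M % m_per_warp != 0:
        raise ValueError(f"M must be divisible by {m_per_warp}, but got {M}")
    if N % n_per_warp != 0:
        raise ValueError(f"N must be divisible by {n_per_warp}, but got {N}")
```

      Failing early with the offending value in the message makes a bad tile shape easier to diagnose than a downstream layout error.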
      
      * bug fix
    • fix typo (#635) · a0dfa516
      Yuqing Xia authored
  9. 14 Jul, 2025 1 commit
    • [Pass] Introduce flag to disable cp async lowering (#633) · 9c777b67
      Lei Wang authored
      * [Enhancement] Update PipelinePlanner to support async copy configuration
      
      - Modified the `Substitute` method in `PipelinePlanner` to accept a `use_async_copy` parameter, allowing for more flexible pipeline planning based on async copy requirements.
      - Updated the constructor of `PipelinePlanner` to initialize the `use_async_copy_` member variable.
      - Adjusted the logic in the pipeline planning process to conditionally apply async copy annotations based on the new parameter.
      - Commented out the `LoopVectorizeDynamic` call in `LowerAndLegalize` to prevent unintended modifications during the legalizing phase.
      
      * Refactor PipelinePlanning function for improved readability
      
      - Adjusted the formatting of the `use_async_copy` variable assignment in the `PipelinePlanning` function to enhance code clarity and maintainability.
  10. 13 Jul, 2025 1 commit
    • [AutoTune] Support `with set_autotune_inputs` to set auto tuning input tensors (#632) · eec47592
      Lei Wang authored
      * [Refactor] Simplify and modularize autotuner implementation
      
      - Removed unused imports and extensive code sections from the autotuner module to enhance readability and maintainability.
      - Modularized the code by introducing new imports for autotuning and capturing functionalities, streamlining the overall structure.
      - Improved logging setup and removed redundant timeout handling functions, focusing on core autotuning logic.
      - Updated the AutoTuner class to better utilize the new modular structure, ensuring efficient performance during auto-tuning processes.
      
      * [Refactor] Clean up and enhance capture and tuner modules
      
      - Improved code readability by removing unnecessary blank lines and organizing imports in `capture.py` and `tuner.py`.
      - Enhanced logging in the `AutoTuner` class to provide clearer warnings regarding the usage of `supply_prog` in the context of auto-tuning.
      - Streamlined the `CaptureStack` class for better thread-local context management.
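
      A thread-local capture stack driven by a `with` block, as described above, might look roughly like the sketch below. It is a generic illustration, not TileLang's implementation: the `CaptureStack` methods shown, the `set_inputs` context manager, and `current_inputs` are hypothetical names for this example.

```python
import threading


class CaptureStack:
    """A stack whose contents are private to each thread."""

    def __init__(self):
        self._local = threading.local()

    def _items(self):
        # Lazily create the per-thread list on first use.
        if not hasattr(self._local, "items"):
            self._local.items = []
        return self._local.items

    def push(self, item):
        self._items().append(item)

    def pop(self):
        return self._items().pop()

    def top(self):
        items = self._items()
        return items[-1] if items else None


_STACK = CaptureStack()


class set_inputs:
    """Hypothetical context manager that exposes inputs to code run inside it."""

    def __init__(self, *tensors):
        self.tensors = tensors

    def __enter__(self):
        _STACK.push(self.tensors)
        return self

    def __exit__(self, exc_type, exc, tb):
        _STACK.pop()
        return False  # Do not swallow exceptions.


def current_inputs():
    """Return the innermost captured inputs for this thread, or None."""
    return _STACK.top()
```

      Because the stack lives in `threading.local`, inputs set in one thread are not visible from another, and nested `with` blocks restore the outer inputs on exit.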
      
      * lint fix
      
      * [Refactor] Simplify configuration and autotuning logic in blocksparse GEMM example
      
      - Updated `get_configs` function to reduce the number of configurations, enhancing performance and clarity.
      - Removed the `get_best_config` function, integrating its logic directly into the `blocksparse_matmul` function with the `@autotune` decorator for streamlined autotuning.
      - Adjusted the main function to directly utilize the autotuned kernel, simplifying the overall structure and improving readability.
      - Deleted obsolete test file for autotuning decorator, cleaning up the codebase.
      
      * [Refactor] Improve code formatting and readability in autotune test file
      
      - Reformatted the `matmul` function and `get_configs` function for better readability by adjusting line breaks and indentation.
      - Fixed a typo in the `enable_rasteration` parameter name to ensure consistency.
      - Cleaned up unnecessary blank lines to enhance overall code clarity.
      
      * Update example_blocksparse_gemm.py
      
      * Update capture.py
  11. 12 Jul, 2025 2 commits
  12. 10 Jul, 2025 3 commits
    • [Enhancement] support composable expression for shape with symbolic vars (#624) · 0b521a3b
      Lei Wang authored
      * [Refactor] Enhance expression handling in utils.py and update wrapper to use pythonic_expr
      
      - Added support for additional TIR expressions (FloorDiv, Min, Max, Add, Sub, FloorMod) in the pythonic_expr function to improve string representation.
      - Replaced the deprecated legalize_c function calls in TLCUDASourceWrapper and TLCPUSourceWrapper with pythonic_expr for better expression handling in kernel launch code.
      
      * [Refactor] Simplify expression handling in pythonic_expr function
      
      - Consolidated binary and min/max operation handling in the pythonic_expr function to improve readability and maintainability.
      - Replaced individual checks for binary operations with a mapping approach, streamlining the code and enhancing performance in expression representation.
      
      * [Enhancement] Improve expression representation in pythonic_expr function
      
      - Added operator precedence handling to the pythonic_expr function, enhancing the conversion of TVM PrimExpr to Python-style strings.
      - Updated the visitor logic to intelligently add parentheses based on operator precedence, improving the accuracy of expression representation.
      - Included a docstring for better clarity on the function's purpose and usage.
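
      Precedence-aware printing of the kind described above can be sketched on a toy expression tree. This does not use TVM and is not the `pythonic_expr` implementation: the `Var`, `Const`, and `BinOp` classes, the `PRECEDENCE` table, and `to_python` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Var:
    name: str


@dataclass
class Const:
    value: int


@dataclass
class BinOp:
    op: str
    lhs: "Expr"
    rhs: "Expr"


Expr = Union[Var, Const, BinOp]

# Higher number binds tighter; atoms bind tightest of all.
PRECEDENCE = {"+": 1, "-": 1, "*": 2, "//": 2, "%": 2}
ATOM = 3


def precedence(e: Expr) -> int:
    return PRECEDENCE[e.op] if isinstance(e, BinOp) else ATOM


def to_python(e: Expr) -> str:
    """Render an expression, adding parentheses only where precedence needs them."""
    if isinstance(e, Var):
        return e.name
    if isinstance(e, Const):
        return str(e.value)
    p = PRECEDENCE[e.op]
    lhs, rhs = to_python(e.lhs), to_python(e.rhs)
    if precedence(e.lhs) < p:
        lhs = f"({lhs})"
    # Operators are left-associative, so an equal-precedence right child
    # is parenthesized as well (e.g. a - (b - c)).
    if precedence(e.rhs) <= p:
        rhs = f"({rhs})"
    return f"{lhs} {e.op} {rhs}"
```

      For example, `to_python(BinOp("*", BinOp("+", Var("m"), Const(1)), Var("n")))` yields `(m + 1) * n`, while a product nested under a sum is printed without parentheses.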
      
      * test fix
      
      * minor update
    • [Enhancement] Support more flexible layout host pythonic expr (#623) · 22aed721
      Lei Wang authored
      * [Refactor] Enhance expression handling in utils.py and update wrapper to use pythonic_expr
      
      - Added support for additional TIR expressions (FloorDiv, Min, Max, Add, Sub, FloorMod) in the pythonic_expr function to improve string representation.
      - Replaced the deprecated legalize_c function calls in TLCUDASourceWrapper and TLCPUSourceWrapper with pythonic_expr for better expression handling in kernel launch code.
      
      * [Refactor] Simplify expression handling in pythonic_expr function
      
      - Consolidated binary and min/max operation handling in the pythonic_expr function to improve readability and maintainability.
      - Replaced individual checks for binary operations with a mapping approach, streamlining the code and enhancing performance in expression representation.
      
      * [Enhancement] Improve expression representation in pythonic_expr function
      
      - Added operator precedence handling to the pythonic_expr function, enhancing the conversion of TVM PrimExpr to Python-style strings.
      - Updated the visitor logic to intelligently add parentheses based on operator precedence, improving the accuracy of expression representation.
      - Included a docstring for better clarity on the function's purpose and usage.
      
      * test fix
    • [Enhancement] Add ahead of time cython compilation in setup.py (#622) · 5101e6bc
      Lei Wang authored
      * [Enhancement] Add Cython support and compiler detection in setup.py
      
      - Introduced a new `CythonExtension` class for building Cython-based extensions, enhancing the build process for Cython projects.
      - Implemented functions to detect the Cython compiler and C++ compiler, improving compatibility and user experience.
      - Updated the build process to handle Cython extensions alongside CMake extensions, ensuring a seamless integration for users.
      - Added caching mechanisms for Cython compilation to optimize build times and reduce unnecessary recompilation.
      
      * [Enhancement] Add Cython dependency and enable CMake extension building
      
      - Added Cython as a required dependency in `pyproject.toml` to support Cython-based extensions.
      - Updated `setup.py` to enable building CMake extensions, improving the build process for projects utilizing both Cython and CMake.
      - Modified the Cython compiler detection logic to streamline installation instructions for users.
  13. 09 Jul, 2025 4 commits
  14. 08 Jul, 2025 3 commits
    • [Refactor] refactor autotune examples (#617) · d110d087
      Lei Wang authored
      * [Refactor] Update tilelang kernel functions and remove unused imports
      
      - Refactored the `flashattn_fwd`, `flashattn_bwd_preprocess`, and `flashattn_bwd_postprocess` functions to utilize direct kernel calls instead of cached versions, improving clarity and performance.
      - Added `@tilelang.jit` decorators with specified output indices to enhance kernel compilation.
      - Removed unused import of `cached` from `tilelang`, streamlining the code.
      - Commented out the main testing function call in `test_tilelang_kernel_mha_bwd.py` for potential future use.
      
      * [Refactor] Simplify configuration generation in benchmark and example scripts
      
      - Refactored the `get_configs` functions in multiple benchmark and example scripts to utilize a dictionary-based approach for parameter configuration, improving readability and maintainability.
      - Updated the `flashattn` and `chunk_scan_fwd` functions to directly accept configuration parameters, enhancing flexibility in kernel tuning.
      - Removed redundant code and streamlined the configuration generation process across various files, ensuring consistency in how configurations are defined and utilized.
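
      A dictionary-based `get_configs` in the spirit of the refactor above could be sketched as below. The parameter names mirror ones mentioned elsewhere in this log (`block_M`, `block_N`, `num_stages`), but the candidate values and the exact shape of the function are assumptions for illustration, not the code in the example scripts.

```python
import itertools


def get_configs():
    """Expand a dict of candidate values into a list of config dicts."""
    # Candidate values are placeholders chosen for this sketch.
    params = {
        "block_M": [64, 128],
        "block_N": [64, 128],
        "num_stages": [2, 3],
    }
    keys = list(params)
    # One config per element of the Cartesian product of all value lists.
    return [dict(zip(keys, values)) for values in itertools.product(*params.values())]
```

      Each resulting dict can then be passed as keyword arguments to the kernel factory, so adding a tunable parameter only means adding one entry to `params`.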
      
      * [Refactor] Update configuration handling in benchmark scripts
      
      - Refactored the `get_configs` functions in benchmark scripts to accept a variable argument list, improving flexibility in configuration management.
      - Enhanced the `matmul` and `flashattn` functions to utilize the updated configuration approach, streamlining parameter handling for kernel tuning.
      - Added `@autotune` decorators to relevant functions, ensuring consistent autotuning behavior across benchmarks.
      - Cleaned up redundant code and improved overall readability in the affected files.
      
      * [Refactor] Clean up formatting and update subproject commit
      
      - Updated the subproject commit reference in the TVM directory to indicate a dirty state.
      - Removed unnecessary blank lines and improved formatting in the `benchmark_matmul` and `benchmark_matmul_fp8` scripts for better readability.
      - Streamlined the function definitions in the `flashattn` example script to enhance clarity and maintainability.
      
      * [Refactor] Update AutoTuner configuration handling
      
      - Modified the AutoTuner class to check if kernel parameters are set before processing tunable arguments, improving robustness in configuration handling.
      - Enhanced the logic for skipping compilation when tunable parameters are already provided, ensuring efficient use of resources.
      - Updated comments for clarity and maintainability.
      
      * lint fix
      
      * Update TVM subproject commit to indicate dirty state and modify MHA backward test cases
      
      - Updated the subproject commit reference in the TVM directory to reflect a dirty state.
      - Adjusted the `test_mha_bwd` function to use a new configuration for the MHA backward tests, changing the context size from 128 to 256.
      - Uncommented the main testing function call for potential execution.
    • Bump transformers from 4.50.0 to 4.51.0 in /examples/bitnet-1.58b (#615) · 78056597
      dependabot[bot] authored
      Bumps [transformers](https://github.com/huggingface/transformers) from 4.50.0 to 4.51.0.
      - [Release notes](https://github.com/huggingface/transformers/releases)
      - [Commits](https://github.com/huggingface/transformers/compare/v4.50.0...v4.51.0)
      
      ---
      updated-dependencies:
      - dependency-name: transformers
        dependency-version: 4.51.0
        dependency-type: direct:production
      ...
      Signed-off-by: dependabot[bot] <support@github.com>
      Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    • [Enhancement] Update ReduceOp initialization values for integer types (#614) · 80ffea6d
      Lei Wang authored
      * [Enhancement] Update ReduceOp initialization values for integer types
      
      - Modified the `MakeInitValue` method in `ReduceOp` to handle integer data types correctly by returning appropriate minimum and maximum values based on the bit width.
      - Added checks for integer types to ensure correct initialization for `kMax` and `kMin` reduction types, enhancing the robustness of the reduction operations.
      
      * [Enhancement] Update ReduceOp to handle unsigned integer initialization values
      
      - Enhanced the `MakeInitValue` method in `ReduceOp` to include support for unsigned integer data types.
      - Added conditions to return appropriate initialization values for `kMax` and `kMin` reduction types based on the data type, improving the robustness of reduction operations.
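
      The choice of initial values described above follows from the identity element of each reduction: a max reduction starts from the smallest representable value, and a min reduction from the largest. A minimal sketch, assuming two's-complement integers; the helper name `reduce_init_value` and its string `kind` argument are hypothetical and do not reflect the `MakeInitValue` signature.

```python
def reduce_init_value(kind: str, bits: int, signed: bool = True) -> int:
    """Return the identity value for a max or min reduction over integers."""
    if signed:
        lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lo, hi = 0, (1 << bits) - 1
    if kind == "max":
        return lo  # Any element is >= lo, so max(lo, x) == x.
    if kind == "min":
        return hi  # Any element is <= hi, so min(hi, x) == x.
    raise ValueError(f"unsupported reduction kind: {kind}")
```

      For an 8-bit signed max reduction this gives -128, and for a 32-bit unsigned min reduction it gives 4294967295.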
  15. 04 Jul, 2025 2 commits
    • [Refactor] Phase out Pass ParallelLoopTransformer (#611) · 42c3b452
      Lei Wang authored
      * Refactor layout inference by removing the ParallelLoopTransformer class. Updated layout inference logic to streamline buffer access collection and condition handling in parallel loops. This change simplifies the code structure and enhances maintainability.
      
      * Update MHA backward test cases to use reduced dimensions for batch size and context length
    • [Doc] Phase out legacy documentation (#610) · d9ae74c6
      Lei Wang authored
      - Added a new entry in the README announcing `T.gemm_sp`, which supports 2:4 sparse tensor cores.
      - Removed several outdated documentation files related to convolution, flash attention, and other tutorials to streamline the documentation structure.