"vscode:/vscode.git/clone" did not exist on "962d9aee04fbf7af04047d917266b1b40cc6a31a"
  1. 16 Jul, 2025 5 commits
    • YizhaoGao's avatar
      [Example] Add paged block-sparse flash-decoding kernel (#638) · 2aded11a
      YizhaoGao authored
      
      
      * Add paged block-sparse flash-decoding kernel
      
      * Update example_tilelang_sparse_gqa_decode_paged.py
      
      * lint fix
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      2aded11a
    • Lei Wang's avatar
      [Enhancement] Extend pythonic_expr to support dtype mapping in utils.py (#641) · 60974197
      Lei Wang authored
      - Updated the `pythonic_expr` function to accept an optional `dtype_map` parameter, allowing for more flexible type conversions.
      - Refactored calls to `pythonic_expr` in `TLCUDASourceWrapper` to utilize the new mapping feature, improving type handling in kernel generation.
      - Enhanced code clarity by consolidating repeated calls to `pythonic_expr` into a private method within the wrapper class.
      60974197
    • Lei Wang's avatar
      [Bugfix] Put thread_extent into reduce (#640) · 156ff85e
      Lei Wang authored
      * [Enhancement] Update AllReduce operation to include thread offset in kernel generation
      
      - Modified the `ReduceOp::Lower` method to incorporate the thread offset in the AllReduce kernel generation for the sm_90 architecture.
      - This change improves the accuracy of thread management during reduction operations, enhancing performance on specific GPU architectures.
      
      * [Enhancement] Refactor thread offset handling in AllReduce kernel generation
      
      - Updated the `ReduceOp::Lower` method to streamline the handling of thread offset for AllReduce operations, ensuring consistent usage across different architectures.
      - This change enhances code clarity and maintains performance improvements for the sm_90 architecture by reducing redundancy in thread offset calculations.
      156ff85e
    • Lei Wang's avatar
      b5ac9bba
    • Lei Wang's avatar
      [Warp Specialize] Implicit Warp Specialize Programing Model (#605) · e2d25ba8
      Lei Wang authored
      * [Enhancement] Improve memory access condition checks in GlobalMemChecker
      
      - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations.
      - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis.
      
      * lintfix
      
      * [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy
      
      - Implemented checks to ensure that the shared memory range and global range are legal during the bulk copy operation.
      - Added assertions to validate that the extents of global and shared ranges match, improving the robustness of memory access validation in the LowerBulkCopy function.
      
      * [Refactor] Update barrier and clear operations in warp specialization examples
      
      - Replaced `mbarrier_wait_parity` and `mbarrier_arrive` with `barrier_wait` and `barrier_arrive` for improved clarity and consistency in synchronization.
      - Adjusted the order of `clear` operations for local fragments in `example_warp_specialize_gemm_copy_1_gemm_0` to enhance parallel execution efficiency.
      
      * [Enhancement] Implement thread partial synchronization and improve shared memory allocation handling
      
      - Added support for thread partial barrier synchronization in CUDA, allowing for more flexible thread management.
      - Enhanced the `MergeSharedMemoryAllocations` function to accept alignment bytes, improving memory allocation efficiency based on target requirements.
      - Updated the `Lower` methods in `Copy` and `Fill` classes to include conditional predicates for thread execution, ensuring better control over thread behavior.
      - Refactored the `print` function to include warp group and warp IDs for more detailed debugging output.
      - Improved the handling of dynamic shared memory allocations in the `LowerAndLegalize` function to align with target-specific requirements.
      
      * [Enhancement] Add support for disabling TMA in Copy operations
      
      - Introduced a new `disable_tma` parameter in the `Copy` class to control thread memory access behavior.
      - Updated the `Lower` method to conditionally execute bulk copy operations based on the `disable_tma` flag.
      - Enhanced the `copy` function to accept the `disable_tma` argument, allowing for more flexible memory copy operations.
      - Improved handling of `coalesced_width` to ensure it defaults to -1 when not provided, enhancing robustness in memory operations.
      
      * [Refactor] Clean up whitespace and formatting in multiple files
      
      - Removed unnecessary blank lines and adjusted line breaks for improved code readability in `example_mla_decode.py`, `example_warp_specialize_gemm_copy_gemm_0_1.py`, `phase.py`, and `copy.py`.
      - Ensured consistent formatting across functions to enhance maintainability and clarity of the codebase.
      
      * [Enhancement] Refactor flash attention implementation for improved performance and configurability
      
      - Split the shared memory allocations for query and key-value pairs to optimize memory usage.
      - Introduced command-line arguments for batch size, number of heads, and dimensions, enhancing flexibility in running the example.
      - Updated kernel execution parameters to improve thread management and synchronization.
      - Enhanced the overall structure of the flash attention function for better readability and maintainability.
      
      * fix
      
      * Update layout inference in ParallelOp to account for thread bounds; remove debug print in OptimizeForTarget
      
      * Refactor barrier handling and update example configurations
      
      - Replaced commented-out barrier creation with new barrier allocation in GEMM example.
      - Updated kernel configuration in warp specialization example to include async copy settings.
      - Enhanced barrier management in the phase optimization process to improve synchronization handling.
      - Introduced new barrier allocation function for better memory management in shared contexts.
      
      * Refactor barrier handling in LowerAndLegalize and OptimizeForTarget
      
      - Reintroduced barrier lowering in OptimizeForTarget to enhance synchronization.
      - Removed commented-out barrier lowering in LowerAndLegalize for cleaner code.
      - Added exit() call in OptimizeForTarget to halt execution after barrier lowering.
      
      * Enhance CMake configuration and clean up example scripts
      
      - Enabled compile command export in CMakeLists.txt for better build integration.
      - Removed unnecessary print statement in the warp specialization example.
      - Cleaned up commented-out code in GEMM example for improved readability.
      - Updated barrier handling in shared memory allocation transformations for better synchronization.
      
      * Refactor barrier handling in warp specialization examples
      
      - Replaced commented-out mbarrier code with new barrier allocation using T.alloc_barrier for improved synchronization.
      - Updated barrier wait and arrive calls to align with the new allocation method across multiple example scripts.
      - Enhanced code readability by removing unnecessary comments and ensuring consistent barrier management.
      
      * Update lower_shared_barrier.cc
      
      * Update phase.py
      
      * Update warp specialization example and Cython wrapper
      
      - Removed commented-out pass configuration options in the warp specialization example for clarity.
      - Added functionality to write the generated kernel source to a file named "kernel.cu".
      - Enhanced Cython wrapper to support boolean type conversion for improved type handling.
      
      * Add storage synchronization call in shared barrier transformation
      
      - Introduced a new evaluation statement to call the TVM storage sync function with "shared" as an argument, enhancing synchronization in the shared barrier handling process.
      
      * remove debug files
      
      * Remove kernel source output to file in warp specialization example
      
      * remove comments
      
      * Refactor tensor handling and update test execution in TileLang
      
      - Changed `Buffer` to `Tensor` in `customize.py` for better type consistency.
      - Updated `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to use `tir.BufferLoad` instead of `BufferLoad`.
      - Commented out the main testing function in `test_tilelang_language_reshape.py` and replaced it with a direct call to `run_reshape_smem` for streamlined testing.
      - Removed unnecessary NVCC compiler flags in `libgen.py` to reduce verbosity.
      
      * Update test_tilelang_language_reshape.py
      e2d25ba8
  2. 15 Jul, 2025 4 commits
    • Lei Wang's avatar
      support torch.bool as kernel input (#636) · 68989d80
      Lei Wang authored
      68989d80
    • Yu Cheng's avatar
      [Dev] Update benchmark and decoding scripts to refine condition checks and... · e937faa6
      Yu Cheng authored
      [Dev] Update benchmark and decoding scripts to refine condition checks and optimize tensor operations (#637)
      
      - Enhanced the condition in `compare_ab` to ensure baseline checks align with target exclusions.
      - Removed unnecessary tensor allocation in `mla_decode_tilelang`, optimizing memory usage and improving performance by directly using shared tensors in GEMM operations.
      e937faa6
    • Lei Wang's avatar
      [Pass][Simplify] Introduce symbolic level simplify for condition expression (#634) · 02a0cf59
      Lei Wang authored
      * [Enhancement] Add argument simplification option to StmtSimplifier
      
      - Introduced a new `simplify_arguments` flag in the `StmtSimplifier::Apply` method to control argument simplification behavior.
      - Updated the `Simplify` function to accept the new flag, allowing for enhanced flexibility in the simplification process.
      - Adjusted the `LowerAndLegalize` and `_Simplify` functions to utilize the new argument, ensuring consistent behavior across the codebase.
      - Added comments to clarify the purpose of the new flag and its impact on simplification logic.
      
      * lint fix
      
      * [Enhancement] Improve layout inference and reduce operation handling
      
      - Updated `ParallelOp::InferLayout` to check for pure buffer stores, enhancing layout inference logic.
      - Modified `ReduceOp::Lower` to include all threads in the AllReduce operation, improving performance on specific architectures.
      - Added a TODO comment in `AllReduce` to consider merging synchronization barriers for optimization.
      
      * lint fix
      
      * [Enhancement] Add input validation for GEMM parameters
      
      - Introduced checks to ensure that the dimensions M and N are divisible by their respective warp sizes (kMPerWarp and kNPerWarp) in the Gemm::ComputeWarpPartition method.
      - Added informative error messages to assist in debugging when the input parameters do not meet the required conditions.
      
      * bug fix
      02a0cf59
    • Yuqing Xia's avatar
      fix typo (#635) · a0dfa516
      Yuqing Xia authored
      a0dfa516
  3. 14 Jul, 2025 1 commit
    • Lei Wang's avatar
      [Pass] Introduce flag to diable cp async lowering (#633) · 9c777b67
      Lei Wang authored
      * [Enhancement] Update PipelinePlanner to support async copy configuration
      
      - Modified the `Substitute` method in `PipelinePlanner` to accept a `use_async_copy` parameter, allowing for more flexible pipeline planning based on async copy requirements.
      - Updated the constructor of `PipelinePlanner` to initialize the `use_async_copy_` member variable.
      - Adjusted the logic in the pipeline planning process to conditionally apply async copy annotations based on the new parameter.
      - Commented out the `LoopVectorizeDynamic` call in `LowerAndLegalize` to prevent unintended modifications during the legalizing phase.
      
      * Refactor PipelinePlanning function for improved readability
      
      - Adjusted the formatting of the `use_async_copy` variable assignment in the `PipelinePlanning` function to enhance code clarity and maintainability.
      9c777b67
  4. 13 Jul, 2025 1 commit
    • Lei Wang's avatar
      [AutoTune] Support `with set_autotune_inputs` to set auto tuning input tensors (#632) · eec47592
      Lei Wang authored
      * [Refactor] Simplify and modularize autotuner implementation
      
      - Removed unused imports and extensive code sections from the autotuner module to enhance readability and maintainability.
      - Modularized the code by introducing new imports for autotuning and capturing functionalities, streamlining the overall structure.
      - Improved logging setup and removed redundant timeout handling functions, focusing on core autotuning logic.
      - Updated the AutoTuner class to better utilize the new modular structure, ensuring efficient performance during auto-tuning processes.
      
      * [Refactor] Clean up and enhance capture and tuner modules
      
      - Improved code readability by removing unnecessary blank lines and organizing imports in `capture.py` and `tuner.py`.
      - Enhanced logging in the `AutoTuner` class to provide clearer warnings regarding the usage of `supply_prog` in the context of auto-tuning.
      - Streamlined the `CaptureStack` class for better thread-local context management.
      
      * lint fix
      
      * [Refactor] Simplify configuration and autotuning logic in blocksparse GEMM example
      
      - Updated `get_configs` function to reduce the number of configurations, enhancing performance and clarity.
      - Removed the `get_best_config` function, integrating its logic directly into the `blocksparse_matmul` function with the `@autotune` decorator for streamlined autotuning.
      - Adjusted the main function to directly utilize the autotuned kernel, simplifying the overall structure and improving readability.
      - Deleted obsolete test file for autotuning decorator, cleaning up the codebase.
      
      * [Refactor] Improve code formatting and readability in autotune test file
      
      - Reformatted the `matmul` function and `get_configs` function for better readability by adjusting line breaks and indentation.
      - Fixed a typo in the `enable_rasteration` parameter name to ensure consistency.
      - Cleaned up unnecessary blank lines to enhance overall code clarity.
      
      * Update example_blocksparse_gemm.py
      
      * Update capture.py
      eec47592
  5. 12 Jul, 2025 2 commits
  6. 10 Jul, 2025 3 commits
    • Lei Wang's avatar
      [Enhancement] support composable expression for shape with symbolic vars (#624) · 0b521a3b
      Lei Wang authored
      * [Refactor] Enhance expression handling in utils.py and update wrapper to use pythonic_expr
      
      - Added support for additional TIR expressions (FloorDiv, Min, Max, Add, Sub, FloorMod) in the pythonic_expr function to improve string representation.
      - Replaced the deprecated legalize_c function calls in TLCUDASourceWrapper and TLCPUSourceWrapper with pythonic_expr for better expression handling in kernel launch code.
      
      * [Refactor] Simplify expression handling in pythonic_expr function
      
      - Consolidated binary and min/max operation handling in the pythonic_expr function to improve readability and maintainability.
      - Replaced individual checks for binary operations with a mapping approach, streamlining the code and enhancing performance in expression representation.
      
      * [Enhancement] Improve expression representation in pythonic_expr function
      
      - Added operator precedence handling to the pythonic_expr function, enhancing the conversion of TVM PrimExpr to Python-style strings.
      - Updated the visitor logic to intelligently add parentheses based on operator precedence, improving the accuracy of expression representation.
      - Included a docstring for better clarity on the function's purpose and usage.
      
      * test fix
      
      * minor update
      0b521a3b
    • Lei Wang's avatar
      [Enhancement] Support more flexible layout host pythonic expr (#623) · 22aed721
      Lei Wang authored
      * [Refactor] Enhance expression handling in utils.py and update wrapper to use pythonic_expr
      
      - Added support for additional TIR expressions (FloorDiv, Min, Max, Add, Sub, FloorMod) in the pythonic_expr function to improve string representation.
      - Replaced the deprecated legalize_c function calls in TLCUDASourceWrapper and TLCPUSourceWrapper with pythonic_expr for better expression handling in kernel launch code.
      
      * [Refactor] Simplify expression handling in pythonic_expr function
      
      - Consolidated binary and min/max operation handling in the pythonic_expr function to improve readability and maintainability.
      - Replaced individual checks for binary operations with a mapping approach, streamlining the code and enhancing performance in expression representation.
      
      * [Enhancement] Improve expression representation in pythonic_expr function
      
      - Added operator precedence handling to the pythonic_expr function, enhancing the conversion of TVM PrimExpr to Python-style strings.
      - Updated the visitor logic to intelligently add parentheses based on operator precedence, improving the accuracy of expression representation.
      - Included a docstring for better clarity on the function's purpose and usage.
      
      * test fix
      22aed721
    • Lei Wang's avatar
      [Enhancement] Add ahead of time cython compilation in setup.py (#622) · 5101e6bc
      Lei Wang authored
      * [Enhancement] Add Cython support and compiler detection in setup.py
      
      - Introduced a new `CythonExtension` class for building Cython-based extensions, enhancing the build process for Cython projects.
      - Implemented functions to detect the Cython compiler and C++ compiler, improving compatibility and user experience.
      - Updated the build process to handle Cython extensions alongside CMake extensions, ensuring a seamless integration for users.
      - Added caching mechanisms for Cython compilation to optimize build times and reduce unnecessary recompilation.
      
      * [Enhancement] Add Cython dependency and enable CMake extension building
      
      - Added Cython as a required dependency in `pyproject.toml` to support Cython-based extensions.
      - Updated `setup.py` to enable building CMake extensions, improving the build process for projects utilizing both Cython and CMake.
      - Modified the Cython compiler detection logic to streamline installation instructions for users.
      5101e6bc
  7. 09 Jul, 2025 4 commits
  8. 08 Jul, 2025 3 commits
    • Lei Wang's avatar
      [Refactor] refactor autotune examples (#617) · d110d087
      Lei Wang authored
      * [Refactor] Update tilelang kernel functions and remove unused imports
      
      - Refactored the `flashattn_fwd`, `flashattn_bwd_preprocess`, and `flashattn_bwd_postprocess` functions to utilize direct kernel calls instead of cached versions, improving clarity and performance.
      - Added `@tilelang.jit` decorators with specified output indices to enhance kernel compilation.
      - Removed unused import of `cached` from `tilelang`, streamlining the code.
      - Commented out the main testing function call in `test_tilelang_kernel_mha_bwd.py` for potential future use.
      
      * [Refactor] Simplify configuration generation in benchmark and example scripts
      
      - Refactored the `get_configs` functions in multiple benchmark and example scripts to utilize a dictionary-based approach for parameter configuration, improving readability and maintainability.
      - Updated the `flashattn` and `chunk_scan_fwd` functions to directly accept configuration parameters, enhancing flexibility in kernel tuning.
      - Removed redundant code and streamlined the configuration generation process across various files, ensuring consistency in how configurations are defined and utilized.
      
      * [Refactor] Update configuration handling in benchmark scripts
      
      - Refactored the `get_configs` functions in benchmark scripts to accept a variable argument list, improving flexibility in configuration management.
      - Enhanced the `matmul` and `flashattn` functions to utilize the updated configuration approach, streamlining parameter handling for kernel tuning.
      - Added `@autotune` decorators to relevant functions, ensuring consistent autotuning behavior across benchmarks.
      - Cleaned up redundant code and improved overall readability in the affected files.
      
      * [Refactor] Clean up formatting and update subproject commit
      
      - Updated the subproject commit reference in the TVM directory to indicate a dirty state.
      - Removed unnecessary blank lines and improved formatting in the `benchmark_matmul` and `benchmark_matmul_fp8` scripts for better readability.
      - Streamlined the function definitions in the `flashattn` example script to enhance clarity and maintainability.
      
      * [Refactor] Update AutoTuner configuration handling
      
      - Modified the AutoTuner class to check if kernel parameters are set before processing tunable arguments, improving robustness in configuration handling.
      - Enhanced the logic for skipping compilation when tunable parameters are already provided, ensuring efficient use of resources.
      - Updated comments for clarity and maintainability.
      
      * lint fix
      
      * Update TVM subproject commit to indicate dirty state and modify MHA backward test cases
      
      - Updated the subproject commit reference in the TVM directory to reflect a dirty state.
      - Adjusted the `test_mha_bwd` function to use a new configuration for the MHA backward tests, changing the context size from 128 to 256.
      - Uncommented the main testing function call for potential execution.
      d110d087
    • dependabot[bot]'s avatar
      Bump transformers from 4.50.0 to 4.51.0 in /examples/bitnet-1.58b (#615) · 78056597
      dependabot[bot] authored
      Bumps [transformers](https://github.com/huggingface/transformers) from 4.50.0 to 4.51.0.
      - [Release notes](https://github.com/huggingface/transformers/releases)
      - [Commits](https://github.com/huggingface/transformers/compare/v4.50.0...v4.51.0
      
      )
      
      ---
      updated-dependencies:
      - dependency-name: transformers
        dependency-version: 4.51.0
        dependency-type: direct:production
      ...
      Signed-off-by: default avatardependabot[bot] <support@github.com>
      Co-authored-by: default avatardependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
      78056597
    • Lei Wang's avatar
      [Enhancement] Update ReduceOp initialization values for integer types (#614) · 80ffea6d
      Lei Wang authored
      * [Enhancement] Update ReduceOp initialization values for integer types
      
      - Modified the `MakeInitValue` method in `ReduceOp` to handle integer data types correctly by returning appropriate minimum and maximum values based on the bit width.
      - Added checks for integer types to ensure correct initialization for `kMax` and `kMin` reduction types, enhancing the robustness of the reduction operations.
      
      * [Enhancement] Update ReduceOp to handle unsigned integer initialization values
      
      - Enhanced the `MakeInitValue` method in `ReduceOp` to include support for unsigned integer data types.
      - Added conditions to return appropriate initialization values for `kMax` and `kMin` reduction types based on the data type, improving the robustness of reduction operations.
      80ffea6d
  9. 04 Jul, 2025 2 commits
    • Lei Wang's avatar
      [Refactor] Phaseout Pass ParallelLoopTransformer (#611) · 42c3b452
      Lei Wang authored
      * Refactor layout inference by removing the ParallelLoopTransformer class. Updated layout inference logic to streamline buffer access collection and condition handling in parallel loops. This change simplifies the code structure and enhances maintainability.
      
      * Update MHA backward test cases to use reduced dimensions for batch size and context length
      42c3b452
    • Lei Wang's avatar
      [Doc] Phaseout Legacy documentations (#610) · d9ae74c6
      Lei Wang authored
      - Added a new entry in the README for the introduction of `T.gemm_sp` supporting 2:4 sparse tensor core.
      - Removed several outdated documentation files related to convolution, flash attention, and other tutorials to streamline the documentation structure.
      d9ae74c6
  10. 03 Jul, 2025 1 commit
    • botbw's avatar
      [Experimental][Language] add `T.GEMM_SP` for sm90 sparse tensor core (#526) · be44758c
      botbw authored
      
      
      * [experimental] add a draft gemm_sp
      
      * [3rdparty] bump cutlass to v3.9.3
      
      * [lint] run format.sh
      
      * [chore] rebase
      
      * [chore] use abs path
      
      * [gemm_sp] add metadata layout
      
      * [ci] add more example
      
      * [lint] run format.sh
      
      * [chore] polish
      
      * [chore] move gemm_sp to experimental
      
      * [chore] polish
      
      * [lint] run format.sh
      
      * [Enhancement] Improve bulk copy handling and update GEMM sparse tensor test
      
      * Added a warning log for unsupported non-swizzled global layouts in the bulk copy operation, ensuring fallback to normal copy.
      * Refactored the GEMM sparse tensor test by removing unnecessary imports and simplifying the kernel compilation process.
      * Updated the test to directly call the `run_gemm_sp` function, enhancing clarity and functionality.
      
      * Implement Test
      
      * [Enhancement] Update GEMM SP and SM89 templates for improved functionality
      
      * Refactored GEMM SP computation to enhance warp partitioning logic, ensuring compatibility with Hopper architecture.
      * Updated layout inference to support new WGMMA conditions and improved error messaging for unsupported targets.
      * Modified SM89 templates to utilize new MMA atom structures, enhancing performance and compatibility with fp8 types.
      * Added conditional inclusion for GEMM SP header based on CUDA architecture version.
      
      * lint fix
      
      * [gemm_sp] support more layout and data types
      
      * Enhancement: sync T.gemm_sp's layout inference with T.gemm
      
      * Enhancement: support more block_k in compress util
      
      * [Enhancement] enable block_k=64
      
      * [Lint] run format.sh
      
      * [Enhancement] compressor support more dtype
      
      * Enhancement: enable block_K=32
      
      * [Lint] format.sh
      
      * [Fixbug] fix shape
      
      * Refactor: sync gemm
      
      * [Enhancement] enable transpose
      
      * [Enhancement] enable fp8_e4m3
      
      * [Enhancement] enable int8
      
      * [Lint] run format.sh
      
      * [Benchmark] add gemm_sp benchmark
      
      * [Example] fix 256 threads hang
      
      * [CI] fix ci
      
      * [Chore] resolve gemini feedback
      
      * [Benchmark] increase search space
      
      * [Lint] format
      
      * [CI] skip sparse tensor core related tests as only sm90 is supported
      
      * [CI] pass local run
      
      * Update gemm_sm89.h
      
      * lint fix
      
      * lint fix
      
      * [Enhancement] Add support for sparse GEMM and initialize CUDA architecture flags
      
      - Introduced a new boolean flag `enable_sparse_gemm_` to control the inclusion of sparse GEMM functionality in CUDA code generation.
      - Updated the `Finish` method to conditionally include the sparse GEMM header based on the new flag.
      - Implemented logic in `VisitStmt_` to enable sparse GEMM when the corresponding external call is detected.
      - Added a function to initialize the `TORCH_CUDA_ARCH_LIST` environment variable based on the target compute version, enhancing compatibility with PyTorch.
      - Refactored the initialization function into the appropriate module and ensured it is called in the sparse utilities module.
      
      * Update test_compress_utils.py
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      be44758c
  11. 02 Jul, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Introduce option `TL_DISABLE_FAST_MATH` and `TL_ENABLE_PTXAS_VERBOSE_OUTPUT` (#609) · d7aebf4d
      Lei Wang authored
      * [Enhancement] Introduce new PassConfig options for fast math and PTXAS verbosity
      
      - Added `kDisableFastMath` and `kEnablePTXASVerboseOutput` configuration options to enhance control over compilation settings.
      - Updated `LibraryGenerator` to utilize these new pass configurations, allowing for more flexible compilation behavior based on user preferences.
      - Enhanced `PassConfigKey` enumeration to include the new options, ensuring they can be configured appropriately in the pass context.
      
      * [Refactor] Update PTXAS verbosity configuration key in LibraryGenerator
      
      - Changed the configuration key for PTXAS verbosity from `TL_VERBOSE_PTXAS_OUTPUT` to `TL_ENABLE_PTXAS_VERBOSE_OUTPUT` to align with the new naming convention introduced in recent enhancements.
      - This update ensures consistency in the configuration options used within the `LibraryGenerator` class, improving clarity and maintainability of the code.
      
      * lint fix
      d7aebf4d
  12. 01 Jul, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Support tf32 gemm_rs (#607) · 0ff81755
      Lei Wang authored
      - Added a line break in `quickstart.py` for better readability.
      - Simplified the JIT kernel compilation in `quickstart.py` by removing the unused execution backend option.
      - Modified `example_elementwise_add.py` to disable cache for `tilelang` and optimized the element-wise addition kernel by utilizing shared memory for input tensors, improving performance.
      - Updated default values for matrix dimensions and block sizes in the argument parser to enhance usability.
      0ff81755
  13. 30 Jun, 2025 2 commits
  14. 27 Jun, 2025 2 commits
  15. 26 Jun, 2025 2 commits
    • Lei Wang's avatar
      [Enhancement] Introduce PassConfig `TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE`... · 3ca5a4ba
      Lei Wang authored
      [Enhancement] Introduce PassConfig `TL_ENABLE_AGGRESSIVE_SHARED_MEMORY_MERGE` to enable aggressive shared memory reuse (#602)
      
      * [Enhancement] Add aggressive shared memory merge option in memory allocation
      
      - Introduced a new configuration option `tl.enable_aggressive_shared_memory_merge` to enable aggressive merging of shared memory allocations.
      - Updated the `SharedMemLinearAccessPatternFinder` class to support an aggressive merge strategy, allowing for improved memory reuse.
      - Modified the `MergeSharedMemoryAllocations` function to incorporate the new merging strategy based on the configuration.
      - Enhanced the `PassConfigKey` enumeration to include the new aggressive merge option, ensuring it can be configured appropriately.
      
      * lint fix
      
      * [Enhancement] Add aggressive shared memory merge configuration option
      
      - Introduced a new configuration option `kEnableAggressiveSharedMemoryMerge` to enable aggressive merging of shared memory allocations, enhancing memory management capabilities.
      
      * [Enhancement] Update MergeSharedMemoryAllocations to support aggressive merge option
      
      - Modified the `MergeSharedMemoryAllocations` function to accept an `enable_aggressive_merge` parameter, allowing for more flexible memory management.
      - Introduced a new helper function `should_enable_aggressive_merge` to determine the aggressive merge configuration based on the pass context and target.
      - Updated the relevant calls in the `phase.py` and `__init__.py` files to utilize the new aggressive merge functionality, enhancing the overall memory allocation strategy.
      3ca5a4ba
    • Lei Wang's avatar
      [Enhancement] Refine error messaging in LowerBulkCopy for global and shared range checks (#599) · a664c998
      Lei Wang authored
      * [Enhancement] Improve error messaging for global and shared range legality checks in LowerBulkCopy
      
      - Updated error messages in the LowerBulkCopy function to provide clearer context when global and shared ranges are illegal.
      - Enhanced the readability of the error output by including tensor names, improving debugging and validation processes during bulk copy operations.
      
      * [Enhancement] Refine error messaging in LowerBulkCopy for global and shared range checks
      
      - Improved the clarity of error messages in the LowerBulkCopy function by enhancing the output format.
      - Included additional context in error messages to aid debugging when global and shared ranges are found to be illegal, ensuring better traceability during bulk copy operations.
      a664c998
  16. 25 Jun, 2025 1 commit
    • Cunxiao Ni's avatar
      [Example] Update examples to use @tilelang.jit (#597) · 3db18726
      Cunxiao Ni authored
      
      
      * [Example] Update kernel compilation in examples to use @tilelang.jit
      
      - Refactored multiple examples to eliminate the use of `tilelang.compile` for kernel creation, directly invoking the functions instead.
      - Added `@tilelang.jit` decorators with appropriate output indices to enhance performance and maintainability.
      - Improved code clarity by simplifying the kernel invocation process across various examples, ensuring consistency in how kernels are defined and executed.
      
      * format
      
      * Update example_tilelang_sparse_gqa_decode_varlen_indice.py
      
      * Update example_dequant_gemm_fine_grained.py
      
      * Update example_gemm_autotune.py
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      3db18726
  17. 24 Jun, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Add strict layout map for improved buffer layout inference (#594) · 18889821
      Lei Wang authored
      - Introduced a `strict_layout_map` to enhance layout inference by ensuring that buffers with strict layout requirements are properly accounted for during the inference process.
      - Updated the inference logic to check for the presence of buffers in the `strict_layout_map` before applying layout changes, improving the accuracy of layout assignments.
      - Refactored the layout inference steps to include the copying of layouts into the new strict map, ensuring a clear separation of layout handling based on inference levels.
      18889821
  18. 23 Jun, 2025 2 commits
    • Jianqiao Lu's avatar
      [Example] Add a easy version for online softmax (#593) · a6b52c52
      Jianqiao Lu authored
      
      
      * feat: add a easy version for online softmax
      
      * fix: set x & y to fragment memory to load data from global memory
      
      * feat: apply format check
      
      * Add License
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      a6b52c52
    • Lei Wang's avatar
      [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy (#592) · 78651bae
      Lei Wang authored
      * [Enhancement] Improve memory access condition checks in GlobalMemChecker
      
      - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations.
      - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis.
      
      * lintfix
      
      * [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy
      
      - Implemented checks to ensure that the shared memory range and global range are legal during the bulk copy operation.
      - Added assertions to validate that the extents of global and shared ranges match, improving the robustness of memory access validation in the LowerBulkCopy function.
      78651bae
  19. 22 Jun, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Improve memory access condition checks in GlobalMemChecker (#591) · 41ec2bc6
      Lei Wang authored
      * [Enhancement] Improve memory access condition checks in GlobalMemChecker
      
      - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations.
      - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis.
      
      * lintfix
      41ec2bc6
  20. 21 Jun, 2025 1 commit