1. 19 Nov, 2025 1 commit
  2. 16 Nov, 2025 1 commit
  3. 05 Oct, 2025 1 commit
  4. 30 Sep, 2025 1 commit
    • Lei Wang's avatar
      [Example] Specify a fixed commit for the flash-linear-attention repository and... · 3ad6202d
      Lei Wang authored
      [Example] Specify a fixed commit for the flash-linear-attention repository and optimize nsa examples (#913)
      
      - Updated the requirements.txt to specify a fixed commit for the flash-linear-attention repository.
      - Refactored import paths in benchmark_nsa_fwd.py for better organization.
      - Added a new function to generate configurations for autotuning.
      - Modified the tilelang_sparse_attention function to accept parameters for block size, number of stages, and threads, enhancing flexibility.
      - Changed allocation of shared memory for accumulators to optimize performance.
      3ad6202d
  5. 18 Sep, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Turn off `ENABLE_FAST_MATH` by default (#846) · e7e38355
      Lei Wang authored
      * [Enhancement] Enable fast math optimization in tilelang JIT configurations
      
      - Updated multiple examples and kernel functions to include `pass_configs` for enabling fast math optimization.
      - Added support for the `TL_ENABLE_FAST_MATH` configuration option in the built-in operations.
      - Enhanced the `LibraryGenerator` to handle the new fast math configuration, ensuring compatibility with existing settings.
      - Updated documentation to reflect the changes in fast math handling and deprecation of the `TL_DISABLE_FAST_MATH` option.
      
      * lint fix
      
      * [Refactor] Introduce deprecated_warning utility for improved deprecation handling
      
      - Added a new `deprecated_warning` function to streamline deprecation messages.
      - Updated the `LibraryGenerator` to utilize the new function for warning about the deprecated `TL_DISABLE_FAST_MATH` configuration.
      - Enhanced the `deprecated` decorator to support phaseout version messaging, improving clarity for users.
      e7e38355
  6. 22 Aug, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Merge bulk copy into copy and improve layout inference for bulk copy (#746) · 5c11d245
      Lei Wang authored
      * [Refactor] Merge bulk copy into copy and refactor layout inference for bulk copy
      
      * Deleted the `bulk_copy` operator implementation and its header file as it is no longer needed.
      * Introduced a new function `cuTensorMapType()` to return the data type for CUDA tensor mapping.
      * Updated related files to reflect these changes, ensuring that the codebase remains clean and maintainable.
      
      * lint fix
      
      * Fix typos in intrinsic names and remove unused print statement in block_sparse_attn_tilelang.py. Updated references from `ptx_ldmatirx` to `ptx_ldmatrix` across multiple files for consistency.
      
      * remove bulk copy
      
      * Refactor copy and atomic add operations to support TMA lower configuration
      
      - Updated `GetCopyInst` to accept a `disable_tma_lower` parameter, allowing for conditional usage of TMA in bulk load/store operations.
      - Modified `Lower` method in `Copy` to incorporate the new TMA configuration.
      - Refactored `AtomicAdd::Lower` to streamline layout inference and vectorization logic.
      - Removed unused `disable_tma_lower` field from `LowerArgs` structure for clarity.
      - Enhanced atomic add vectorization by replacing the buggy implementation with a more robust loop vectorization approach.
      
      * Enhance TMA bulk copy logic in `LowerBulkCopy` method
      
      - Added a condition to set `desc.swizzle` to `CU_TENSOR_MAP_SWIZZLE_NONE` when `shared_layout` matches `linear_layout`, improving clarity in layout handling.
      - Updated warning log to provide more detailed information about fallback scenarios, including source and destination buffer names and shapes, enhancing debugging capabilities.
      
      * lint fix
      
      * Remove fallback logging for non-swizzled global layout in `LowerBulkCopy` method to streamline the bulk copy logic. This change enhances code clarity by eliminating unnecessary warning messages related to inner box dimensions.
      
      * Enhance reshape kernel compilation in `run_reshape` and `run_reshape_smem_1d_2_2d` functions
      
      - Updated the `tl.compile` method to include `pass_configs` that disable TMA lower and warp specialization, addressing shared memory layout transformation limitations.
      - Added TODO comments to indicate the need for further improvements in shared memory handling.
      
      * Update `native_sparse_attention` function to include TMA configuration options
      
      - Added `pass_configs` to the JIT decorator to disable TMA lower and warp specialization, addressing potential issues with shared memory layout transformations.
      - Updated comments to clarify modifications in tensor shapes for inference, specifically setting `q` sequence length to 1.
      
      * Refactor JIT decorator formatting in `native_sparse_attention` function
      
      - Improved readability by reformatting the JIT decorator parameters for `native_sparse_attention`, ensuring consistent style across the codebase.
      - No functional changes were made; this update focuses on code clarity and maintainability.
      
      * Enhance thread management and logging in TileLang compilation
      
      - Added a method to check if printing is enabled during compilation, improving control over logging behavior.
      - Updated the JIT kernel class to utilize the new method for logging compilation status, ensuring consistent and clear output.
      - Added comments to clarify the purpose of changes and improve code readability.
      
      * Add warp specialization scope and refactor register management in TileLang
      
      - Introduced a new constant `kWarpSpecializationScope` in `builtin.h` for better attribute management.
      - Removed the `SetMaxNRegCollector` class and its related logic from `warp_specialized_rewriter.cc`, streamlining the warp specialization process.
      - Added functions `annotate_producer_reg_dealloc` and `annotate_consumer_reg_alloc` in `builtin.py` to facilitate register management.
      - Implemented `AnnotateWarpGroupRegAlloc` in `__init__.py` to inject register allocation calls into warp-specialized functions, enhancing the overall register handling in the compilation process.
      
      * Refactor test for InjectSetMaxNReg pass in TileLang
      
      - Improved readability by restructuring conditional checks and assertions in the test cases.
      - Enhanced clarity in the collection of `set_max_nreg` calls by simplifying the logic.
      - Ensured consistent formatting and spacing throughout the test functions for better maintainability.
      
      * Enhance bulk copy and store checks in `Copy` class
      
      - Updated scope validation for source and destination tensors in `CheckBulkLoad` and `CheckBulkStore` methods to include both `shared.dyn` and `shared` as valid options.
      - Modified `CheckLDSMCopy` and `CheckSTSMCopy` methods to accommodate the new scope validation, ensuring compatibility with shared memory configurations.
      - Improved logging in `LowerBulkCopy` to provide clearer warnings regarding unsupported swizzle layouts, including source and destination names for better debugging.
      
      * lint fix
      5c11d245
  7. 12 Jul, 2025 1 commit
  8. 25 Jun, 2025 1 commit
    • Cunxiao Ni's avatar
      [Example] Update examples to use @tilelang.jit (#597) · 3db18726
      Cunxiao Ni authored
      
      
      * [Example] Update kernel compilation in examples to use @tilelang.jit
      
      - Refactored multiple examples to eliminate the use of `tilelang.compile` for kernel creation, directly invoking the functions instead.
      - Added `@tilelang.jit` decorators with appropriate output indices to enhance performance and maintainability.
      - Improved code clarity by simplifying the kernel invocation process across various examples, ensuring consistency in how kernels are defined and executed.
      
      * format
      
      * Update example_tilelang_sparse_gqa_decode_varlen_indice.py
      
      * Update example_dequant_gemm_fine_grained.py
      
      * Update example_gemm_autotune.py
      
      ---------
      Co-authored-by: default avatarLei Wang <34334180+LeiWang1999@users.noreply.github.com>
      3db18726
  9. 16 Jun, 2025 1 commit
  10. 04 Jun, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Include several examples into ci (#531) · 3ca3a8af
      Lei Wang authored
      * Remove unused 2D continuous cumulative sum example and related functions from the cumsum module.
      
      * lint fix
      
      * fix split k example
      
      * Enable cache disabling in gemm_streamk example and add validation checks in if_stmt_binding transformation
      
      * Update gemm_streamk example to use tilelang's cdiv function for block calculations and add copyright notice
      3ca3a8af
  11. 18 May, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] refactor `tilelang.jit` to support a faster and more flexible kernel cache (#501) · 25a50f1a
      Lei Wang authored
      * [Refactor] Update JIT kernel functions and streamline GEMM tests
      
      * Renamed and refactored matmul and run_gemm functions to matmul_kernel_jit and run_gemm_kernel_jit for clarity.
      * Removed redundant JIT decorator from the matmul function, ensuring it is applied only to the kernel function.
      * Updated test function names to reflect changes in the kernel functions, enhancing consistency and readability.
      * Cleaned up commented-out code and unnecessary imports to improve overall code quality.
      
      * Update main function call in GEMM test to use tilelang testing framework
      
      * Update README and example scripts to include JIT decorator comments
      
      * Added comments in README.md and various example scripts to indicate the use of the @tilelang.jit decorator for returning torch functions.
      * Removed redundant comments that previously instructed to add the decorator, streamlining the documentation and improving clarity.
      
      * Update GEMM test parameters for improved performance
      
      * Set num_stages to 0 and adjusted matrix dimensions in test functions to enhance performance and consistency across GEMM tests in test_tilelang_kernel_gemm.py.
      25a50f1a
  12. 12 Apr, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement][Pipeline] More precise copy code block detection in pipeline (#384) · abaacde5
      Lei Wang authored
      * Update legalize_safe_memory_access.cc
      
      * Add cache path handling and file locking in Cython adapter
      
      - Introduced a new cache path based on the code hash for the Cython JIT adapter, enhancing cache management.
      - Added a lock file mechanism to ensure safe access during cache operations, improving concurrency handling.
      - These changes aim to optimize the compilation process and prevent race conditions during library loading.
      
      * lint fix
      
      * refactor
      
      * refactor
      
      * Add GlobalCopyPatternDetector to identify global memory copy patterns
      
      - Introduced a new class, GlobalCopyPatternDetector, to detect specific memory copy patterns in statements.
      - Enhanced the PipelinePlanner to utilize this detector for determining copy stages based on global and local memory scopes.
      - Improved code clarity and maintainability by encapsulating detection logic within the new class.
      
      * Refactor copy stage detection logic in pipeline planning
      
      - Simplified the determination of copy stages by directly assigning the result of GlobalCopyPatternDetector to pinfo.copy_stage.
      - Removed redundant checks for read and write scopes, enhancing code clarity and maintainability.
      
      * lint fix
      abaacde5