1. 03 Aug, 2025 3 commits
    • [Refactor] Introduce GemmInst for different targets handling (#688) · d2afb513
      Lei Wang authored
      * [Enhancement] Refactor GEMM operations for improved warp partitioning and target instruction handling
      
      - Introduced a new `GetGemmInst` method to determine the appropriate GEMM instruction based on block size and target architecture.
      - Updated `ComputeWarpPartition` to accept the GEMM instruction type, enhancing flexibility in warp partitioning logic.
      - Added `TargetGetWarpSize` utility to streamline warp size retrieval based on target architecture.
      - Refactored layout inference and lowering methods to utilize the new GEMM instruction handling, improving clarity and maintainability of the codebase.
      
      * bug fix
      
      * test fix
      
      * lint fix
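To illustrate the warp-size lookup and warp counting that #688 describes, here is a minimal Python sketch. The function names, the target strings, and the error handling are assumptions made for illustration; this is not the code behind `TargetGetWarpSize` or `ComputeWarpPartition`.

```python
def target_warp_size(target_kind: str) -> int:
    """Hypothetical stand-in for a per-target warp-size lookup."""
    # NVIDIA warps have 32 threads; AMD CDNA wavefronts have 64.
    return 64 if target_kind == "cdna" else 32


def num_warps(block_size: int, target_kind: str) -> int:
    """Number of warps in a thread block of `block_size` threads."""
    warp_size = target_warp_size(target_kind)
    if block_size % warp_size != 0:
        raise ValueError("block size must be a multiple of the warp size")
    return block_size // warp_size
```

With these assumptions, a 128-thread block holds four warps on a CUDA-like target and two on a CDNA-like target, which is why a warp partitioner needs to know the target before splitting work.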
    • [Refactor] Rebase pipeline injector from upstream tvm (#687) · 73bf8346
      Lei Wang authored
      * [Enhancement] Introduce software pipeline rewriter and refactor buffer access handling
      
      - Added a new `PipelineOpaqueAccessRewriter` class to manage opaque buffer accesses in the software pipeline.
      - Refactored the `PipelineBodyRewriter` to utilize the new rewriter for improved buffer access handling.
      - Enhanced the `PipelineRewriter` to support additional fragment information and streamline pipeline construction.
      - Updated tests to reflect changes in buffer management and access patterns, ensuring compatibility with the new structure.
      - Removed obsolete code related to previous buffer access methods for clarity and maintainability.
      
      * test fix
    • [Feature]:Add auto vectorize for atomic add (#686) · b45e9c45
      yyttt6 authored
      * [Feature]:Add auto vectorize for atomic add
      
      * fix
      
      * fix2
      
      * format
  2. 01 Aug, 2025 1 commit
  3. 31 Jul, 2025 5 commits
    • [Fix] fix some issues with JIT decorators existing in the examples (#681) · 950ed16c
      Cunxiao Ni authored
      
      
      * [Fix] fix some issues with JIT decorators existing in the examples
      
      * format
      
      * Use PassConfigKey instead of str
      
      ---------
      Co-authored-by: Cunxiao <nicunxiao@bytedance.com>
    • [Enhancement] Refactored buffer detection logic in warp_specialized_rewriter.cc (#685) · 689ee52b
      Yu Cheng authored
      - Renamed TMAFinder to ProducerBufferDetector and improved handling of CallNode and BufferLoadNode.
      - This change aims to enhance code maintainability and performance by more accurately tracking producer buffer usage.
    • Add Flash Attn example on amd mi300 series (#682) · adcba275
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    • [Enhancement] Enhance warp specialization logic (#680) · 05f2fc6d
      Yu Cheng authored
      
      
      - Removed unnecessary configurations from the @tilelang.jit decorator in `example_grouped_gemm_fwd.py`, simplifying the kernel compilation process.
      - Updated the `grouped_gemm` function to accept a tuple for batch sizes, enhancing compatibility with the kernel invocation.
      - Added logic in `warp_specialized_rewriter.cc` to track buffer usage in `CallNode` expressions, improving the handling of TMA load operations.
      
      This refactor aims to streamline the code and improve maintainability while ensuring better performance in grouped matrix multiplication operations.
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [Enhancement] Output cache-file-related messages with verbose=True (#683) · 042c60fb
      Yang Chen authored
      This is a minor enhancement to output verbose messages indicating where
      cache files are saved and loaded. These messages are useful for
      examining the relevant intermediate files.
  4. 30 Jul, 2025 5 commits
    • [CI] Update CI workflow to use Python 3.12 (#679) · eb026b79
      Lei Wang authored
      * Update CI workflow to use Python 3.12 and enable build isolation for pip installations
      
      - Changed the Python version in the CI configuration from 3.9 to 3.12 to ensure compatibility with the latest features and improvements.
      - Updated the `PIP_NO_BUILD_ISOLATION` environment variable from `0` to `1` in the CI configuration, allowing pip to install testing requirements with build isolation enabled, which enhances the installation process during CI runs.
      
      * Update CI workflow to trigger on pull requests instead of pull_request_target
      
      - Changed the event trigger in the CI configuration from `pull_request_target` to `pull_request` to ensure the workflow runs on pull requests, enhancing the integration process.
      
      * Refactor CI workflow to remove unnecessary repository and token settings
      
      - Removed the repository and token parameters from the checkout step in the CI configuration, simplifying the workflow setup and improving security by not exposing sensitive information.
      
      * Remove pip install command from CI workflow to streamline installation process
      
      * Refactor reshape functions and tests for shared memory operations
      
      - Renamed and updated `reshape_test_smem` to `reshape_test_smem_1d_2_2d` and `run_reshape_smem` to `run_reshape_smem_1d_2_2d` for clarity.
      - Introduced a new reshape function `reshape_test_smem_2d_2_1d` and its corresponding runner `run_reshape_smem_2d_2_1d`.
      - Updated tests to reflect the new function names and added a test for the 2D to 1D reshape functionality, enhancing test coverage and clarity.
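The 1D to 2D and 2D to 1D reshapes named in these tests amount to a row-major index remapping. The following pure-Python sketch shows that mapping on plain lists; the helper names are hypothetical and this is not the tests' kernel code, which operates on shared memory buffers.

```python
def reshape_1d_to_2d(flat, rows, cols):
    """View a flat row-major sequence as a rows x cols nested list."""
    if len(flat) != rows * cols:
        raise ValueError("element count must equal rows * cols")
    # Element (r, c) of the result is flat[r * cols + c].
    return [list(flat[r * cols:(r + 1) * cols]) for r in range(rows)]


def reshape_2d_to_1d(grid):
    """Flatten a nested list in row-major order."""
    return [value for row in grid for value in row]
```

Because both directions use the same row-major order, applying one after the other returns the original data, which is a convenient property to assert in a round-trip test.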
    • [Refactor] Phaseout version with commit id in editable model (#677) · ca1138c3
      Lei Wang authored
      
      
      * merge from lab
      
      * Add `TILELANG_PRINT_ON_COMPILATION`
      
      * Update CI workflow to disable build isolation for pip installations in testing requirements
      
      - Changed the `PIP_NO_BUILD_ISOLATION` environment variable from `1` to `0` in the CI configuration, ensuring that pip installs the testing requirements without build isolation. This adjustment aims to improve compatibility and streamline the installation process during CI runs.
      
      ---------
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
    • Do not check for short variables (#676) · 4878cc5d
      Yichen Yan authored
      of which there are many
    • Refactor to support upstream tvm (#595) · a7c9a8b9
      Siyuan Feng authored
      **Summary of part of the rebase PR:**
      
      1. **Support T.thread_return() → CUDA return syntax**  
         Added support for translating `T.thread_return()` to CUDA's native `return` statement.
      
      2. **Dynamic type support for function inputs**  
         Functions now accept dynamically typed parameters using `typing`:
         ```python
         dyn_type = T.int32 or T.float
         @T.prim_func
         def main(
             a: dyn_type,
         )
         ```
      
      3. **Device Function Codegen**  
         Added support for generating `__device__` functions in CUDA:
         ```python
         @I.ir_module
         class Module:
             @T.prim_func(private=True)
             def add(a: T.int32, b: T.int32) -> T.int32:
                 return a + b
      
             @T.prim_func
             def main(
                 A: T.Buffer((128, 128), "int32"),
                 B: T.Buffer((128, 128), "int32"),
                 C: T.Buffer((128, 128), "int32"),
             ):
                 T.func_attr({"global_symbol": "main"})
                 length: T.int32 = Module.add(64, 64)  # Host call
                 for bx in...
         ```
    • Update ci.yml (#675) · 8edd6941
      Wenhao Xie authored
  5. 29 Jul, 2025 6 commits
  6. 25 Jul, 2025 2 commits
  7. 24 Jul, 2025 3 commits
    • [Enhancement] Improve buffer conflict detection in thread storage synchronization (#658) · a16f0cf5
      Lei Wang authored
      * [Enhancement] Improve buffer conflict detection in thread storage synchronization
      
      - Added a new boolean variable `range_is_overlap` to accurately determine if buffer indices overlap, enhancing the conflict detection logic in `thread_storage_sync.cc`.
      - Updated the return logic to reflect the overlap status, ensuring correct conflict resolution based on buffer index comparisons.
      - Removed an unnecessary comment in `OptimizeForTarget` to streamline the code and improve clarity.
      
      * example fix
      
      * enhancement
      
      * improve ci
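The overlap test that #658 describes can be pictured as an interval check per buffer dimension. The sketch below is an assumption-laden illustration in Python: it models each access as a list of `(min, extent)` pairs and borrows only the variable name `range_is_overlap` from the commit message. The real logic in `thread_storage_sync.cc` works on symbolic indices and may differ.

```python
def intervals_overlap(a_min, a_extent, b_min, b_extent):
    """True if half-open ranges [a_min, a_min + a_extent) and
    [b_min, b_min + b_extent) share at least one index."""
    return a_min < b_min + b_extent and b_min < a_min + a_extent


def accesses_conflict(a_ranges, b_ranges):
    """Two accesses to the same buffer can touch a common element only
    if their index ranges overlap in every dimension."""
    range_is_overlap = all(
        intervals_overlap(a_min, a_ext, b_min, b_ext)
        for (a_min, a_ext), (b_min, b_ext) in zip(a_ranges, b_ranges)
    )
    return range_is_overlap
```

Under this model, two accesses that are disjoint in any single dimension need no synchronization between them, which is the case a precise overlap flag lets the pass skip.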
    • [Bugfix][Docs] Update documentation build process and configurations for autoapi support (#663) · c8edb957
      Wenhao Xie authored
      * [Bugfix][Docs] Update documentation build process and configurations for autoapi support
      
      * lint fix
    • [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking (#653) · fe6cdc9d
      Zhengju Tang authored
      
      * [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking
      
      * Lint
      
      * test fix
      
      * Update CI workflow to install dependencies without user site packages
      
      - Modified the installation commands in the CI workflow to include the `--no-user` flag for both `requirements-dev.txt` and `requirements-test.txt`, ensuring that packages are installed in the virtual environment rather than the user site directory.
      
      * Update CI workflow to install pip without user site packages
      
      - Added the `--no-user` flag to the pip installation command in the CI workflow for both development and testing dependencies, ensuring that packages are installed within the virtual environment.
      
      * Update requirements-test.txt
      
      * reduce ci problem size
      
      * Refactor example_mla_decode.py for consistent formatting and remove unused imports in test_example_mla_decode.py
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  8. 23 Jul, 2025 5 commits
    • Zhang Jason's avatar
    • [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control (#656) · d764dca8
      Wenhao Xie authored
      
      * [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control
      
      * lint fix
      
      * upd
      
      * lint fix
      
      * fix typo
      
      * update typing
      
      * update the use case of compile flags
      
      * ci fix
      
      * fix
      
      * Fix CI workflow to correctly activate virtual environment from shared cache directory
      
      * use local cache
      
      * fix
      
      * fix
      
      * fix
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [Cache] Support shared cache directories for multiple process (#649) · 267d9b3b
      Lei Wang authored
      
      
      * Support shared cache directories for multiple users
      
      * ruff fix
      
      * ci_fix
      
      * Add CI step to show worker info
      
      ---------
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
    • Lei Wang's avatar
    • [Bugfix][CI] Bug fixing and migrate CI from ada to hopper (#652) · e9a608e2
      Wenhao Xie authored
      
      
      * fix CI bugs in hopper
      
      * lint fix
      
      * Update bulk_copy.cc
      
      * Refactor bulk copy logic in LowerBulkCopy function
      
      - Removed unnecessary blank lines for improved code readability.
      - Enhanced stride validation by checking for null pointers in global stride calculations, ensuring robustness against symbolic strides.
      - Updated pass configuration handling in dynamic tile language tests to streamline dynamic alignment and TMA lower pass settings.
      
      * test fix
      
      * ci fix
      
      * Update flash-attention dependencies and clean up example code
      
      - Downgraded `flash-attn` dependency version in `requirements-test.txt` to `<=2.2.0`.
      - Removed unused imports and commented-out code in various example files to enhance readability and maintainability.
      - Updated the `flashattn` function signature to include default parameters for `block_M`, `block_N`, `num_stages`, and `threads`.
      - Cleaned up the `example_mha_fwd_varlen.py` and `example_mha_bwd_wgmma_pipelined.py` files by removing unnecessary comments and improving code clarity.
      - Deleted the `example_mha_inference.py` file as it is no longer needed.
      
      * Update CI workflow to remove `--user` flag from pip install commands
      
      - Removed the `--user` flag from the pip install commands in both the development and testing sections of the CI workflow to ensure proper installation of dependencies in the virtual environment.
      
      * Update CI workflow to include `--no-user` flag in pip install commands
      
      - Added the `--no-user` flag to the pip install commands in both the development and testing sections of the CI workflow to ensure dependencies are installed correctly within the virtual environment.
      
      * Update CI workflow to include `--no-user` flag in pip install command for wheel mode
      
      - Added the `--no-user` flag to the pip install command in the wheel mode section of the CI workflow to ensure dependencies are installed correctly within the virtual environment.
      
      * test fix
      
      * avoid conflict with system environments
      
      * test fix
      
      * add comments
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  9. 22 Jul, 2025 1 commit
  10. 21 Jul, 2025 2 commits
    • [Refactor] Remove small array reuse condition in shared memory allocation merging (#654) · 8205791d
      Lei Wang authored
      - Eliminated the condition that disabled the reuse of small arrays (const_nbits <= 32) in the `MergeSharedMemoryAllocations` function, allowing for more flexible memory management.
      - Added a comment in `OptimizeForTarget` to clarify the order of applying `MergeSharedMemoryAllocations` after `SplitHostDevice`, ensuring correct allocation site handling in device functions.
    • [Bugfix] Assign Target for jit kernel (#648) · 6e994b12
      meinie authored
      
      
      * fix: Copy Target to self.target
      
      * refactor: Remove unused target attribute and adjust context management in JITKernel
      
      - Removed the unused `target` attribute from the `JITKernel` class.
      - Updated the context management in the `compile` method to utilize `self.target`, improving clarity and ensuring proper resource handling during compilation.
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
  11. 20 Jul, 2025 2 commits
  12. 17 Jul, 2025 2 commits
    • [Enhancement] Align dynamic shared memory allocations in phase.py (#644) · b060c9f7
      Lei Wang authored
      - Added a comment to clarify the alignment of dynamic shared memory allocations in the `OptimizeForTarget` function.
      - Refactored the handling of shared memory allocation merging and synchronization to streamline the process, ensuring consistent behavior regardless of the aggressive merge flag.
      - Improved code clarity by removing redundant conditional checks related to synchronization and memory allocation.
    • [Enhancement] Add Cython cache directory to setup.py (#643) · 6c0a5841
      Lei Wang authored
      - Included the Cython cache directory in the list of source files for the TileLang build process, ensuring proper handling of cached Cython files during the build.
  13. 16 Jul, 2025 3 commits
    • [Example] Add paged block-sparse flash-decoding kernel (#638) · 2aded11a
      YizhaoGao authored
      
      
      * Add paged block-sparse flash-decoding kernel
      
      * Update example_tilelang_sparse_gqa_decode_paged.py
      
      * lint fix
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [Enhancement] Extend pythonic_expr to support dtype mapping in utils.py (#641) · 60974197
      Lei Wang authored
      - Updated the `pythonic_expr` function to accept an optional `dtype_map` parameter, allowing for more flexible type conversions.
      - Refactored calls to `pythonic_expr` in `TLCUDASourceWrapper` to utilize the new mapping feature, improving type handling in kernel generation.
      - Enhanced code clarity by consolidating repeated calls to `pythonic_expr` into a private method within the wrapper class.
    • [Bugfix] Put thread_extent into reduce (#640) · 156ff85e
      Lei Wang authored
      * [Enhancement] Update AllReduce operation to include thread offset in kernel generation
      
      - Modified the `ReduceOp::Lower` method to incorporate the thread offset in the AllReduce kernel generation for the sm_90 architecture.
      - This change improves the accuracy of thread management during reduction operations, enhancing performance on specific GPU architectures.
      
      * [Enhancement] Refactor thread offset handling in AllReduce kernel generation
      
      - Updated the `ReduceOp::Lower` method to streamline the handling of thread offset for AllReduce operations, ensuring consistent usage across different architectures.
      - This change enhances code clarity and maintains performance improvements for the sm_90 architecture by reducing redundancy in thread offset calculations.
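To show why a thread offset matters for an all-reduce, here is a small Python simulation of a butterfly (XOR-partner) sum over a group of threads that does not start at thread 0. The function name, the list-based model of per-thread values, and the power-of-two restriction are assumptions for illustration; this is not the kernel that `ReduceOp::Lower` generates.

```python
def all_reduce_sum(values, thread_offset, extent):
    """Simulate a butterfly all-reduce over threads
    [thread_offset, thread_offset + extent); extent must be a power of two."""
    if extent <= 0 or extent & (extent - 1) != 0:
        raise ValueError("extent must be a positive power of two")
    vals = list(values)
    stride = extent // 2
    while stride >= 1:
        snapshot = list(vals)  # all threads read before anyone writes
        for local in range(extent):
            tid = thread_offset + local
            # The partner is computed on the group-local id, then shifted
            # back by the offset so it stays inside the group.
            partner = thread_offset + (local ^ stride)
            vals[tid] = snapshot[tid] + snapshot[partner]
        stride //= 2
    return vals
```

If the partner were computed from the global thread id instead of the group-local id, threads in an offset group could pair with threads outside the group, so the offset has to enter the index calculation.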