  1. 10 Aug, 2025 1 commit
    • [Pipeline] Optimize inject software pipeline and pipeline planning pass (#706) · 376ba9eb
      Lei Wang authored
      * Refactor inject_pipeline.cc to improve version handling and add unique producer head tracking
      
      - Updated version check to allow for cases with two or more versions.
      - Adjusted logic to decrement num_versions when multi-versioning is not needed.
      - Introduced a helper function to ensure unique producer heads are added to the commit group.
      - Removed obsolete AddAllocBuffers method to streamline code.
      
      * lint fix
      
      * Refactor pipeline planning logic to enhance copy stage dependency management
      
      - Removed obsolete conditional expression handling from the pipeline planning code.
      - Introduced a new structure to manage copy stage dependency reads, improving clarity and efficiency.
      - Updated logic to correctly identify producer stages for copy stages, ensuring accurate pipeline stage assignment.
      - Added a new block sparse matrix multiplication function in the testing suite to validate the pipeline planning changes.
      
      * Update ci.yml
      
      * Fix structural equality checks in AddUnique and Contains methods to compare buffer references instead of entire regions in pipeline planning.
      
      * Refactor pipeline planning logic to improve copy stage dependency propagation
      
      - Updated structural equality checks in AddUnique and Contains methods to use buffer reference comparison.
      - Enhanced the iteration logic for managing copy stage dependencies, ensuring accurate identification of producer stages.
      - Added safeguards against exceeding maximum iterations during dependency propagation.
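The buffer-level deduplication and bounded propagation described in the bullets above can be sketched in Python. This is a minimal illustration, not the actual C++ pass: `Region`, `DependencySet`, `propagate`, the stage dictionaries, and the default iteration cap are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    """Hypothetical stand-in for a buffer region: a buffer name plus an index range."""
    buffer: str
    start: int
    stop: int


class DependencySet:
    """Collects regions, treating two regions as equal when they share a buffer."""

    def __init__(self):
        self.regions = []

    def contains(self, region):
        # Compare buffer references only, not the full region extents.
        return any(r.buffer == region.buffer for r in self.regions)

    def add_unique(self, region):
        if self.contains(region):
            return False
        self.regions.append(region)
        return True


def propagate(copy_reads, stages, max_iterations=None):
    """Grow the dependency set until no stage contributes new buffers.

    `stages` is a list of dicts with "reads" and "writes" region lists.
    A stage whose writes touch a tracked buffer counts as a producer,
    so its reads become dependencies too.
    """
    deps = DependencySet()
    for region in copy_reads:
        deps.add_unique(region)
    # Assumed safeguard: one pass per stage, plus one pass to detect stability.
    limit = max_iterations if max_iterations is not None else len(stages) + 1
    for _ in range(limit):
        changed = False
        for stage in stages:
            if any(deps.contains(w) for w in stage["writes"]):
                for r in stage["reads"]:
                    changed |= deps.add_unique(r)
        if not changed:
            return deps
    raise RuntimeError("dependency propagation did not converge")
```

Comparing by buffer keeps the set small when the same buffer is read through differently shaped regions, and the iteration cap turns a potential infinite loop into an explicit error.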
  2. 08 Aug, 2025 3 commits
    • [Layout] Introduce a new layout inference mechanism (#699) · 407117e1
      Lei Wang authored
      
      
      * Implement new free stage layout inference.
      
      * Fix bug
      
      * Make replication upcasting and unnormalizable iterators safe.
      
      * Better handling of updating with more replicas
      
      * Remove unnecessary check.
      
      * Fix compilation.
      
      * Fix setup.py.
      
      * Simplify development mode.
      
      * Allow ParallelOp layout when there's already a compatible layout specified
      
      * lint fix
      
      * Add ProveFragmentContains function to validate thread access between small and large fragments
      
      This function checks if the threads accessing elements of a smaller fragment are a subset of those accessing a larger fragment, ensuring valid access during updates. The implementation includes deriving thread indices, computing logical indices, and verifying thread mappings.
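The subset check described above can be sketched as follows. This is a brute-force illustration over concrete indices, not the symbolic proof the real function presumably performs; `fragment_contains`, the layout callables, and the toy layouts are all hypothetical.

```python
from itertools import product


def fragment_contains(small_shape, small_threads, large_threads, to_large_index):
    """Return True if every thread touching a small-fragment element
    also touches the corresponding large-fragment element.

    small_threads / large_threads: map an index tuple to a set of thread ids.
    to_large_index: maps a small-fragment index to a large-fragment index.
    """
    for idx in product(*(range(extent) for extent in small_shape)):
        if not small_threads(idx) <= large_threads(to_large_index(idx)):
            return False
    return True


# Toy layouts (assumed): the large fragment replicates each row across two
# threads; the small fragment is one column of it, held by one thread per row.
large = lambda idx: {idx[0], idx[0] + 4}
small_ok = lambda idx: {idx[0]}
small_bad = lambda idx: {idx[0] + 1}
column = lambda idx: (idx[0], 0)

print(fragment_contains((4,), small_ok, large, column))   # True
print(fragment_contains((4,), small_bad, large, column))  # False
```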
      
      * Update dependencies in requirements files
      
      * Remove 'thefuzz' from requirements-dev.txt
      * Specify exact versions for 'torch' and add 'flash_attn' in requirements-test.txt
      
      * Update CI workflow to use SHA256 hash for requirements file
      
      * Update requirements and CI workflow for flash attention
      
      * Removed specific version for 'torch' in requirements-test.txt
      * Added installation of 'flash_attn==2.5.8' in CI workflow to ensure compatibility
      
      * Refactor flash attention import handling in examples
      
      * Removed availability checks for 'flash_attn' in multiple example scripts.
      * Simplified import statements for 'flash_attn' to ensure consistent usage across examples.
      
      ---------
      Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
    • [CI] Remove Flash Attention dependency (#705) · 87aae294
      Lei Wang authored
      * Update flash-attn version in requirements-test.txt from <=2.2.0 to ==2.5.8
      
      * lint fix
      
      * Remove unused dependencies from requirements-test.txt
      
      * Update import path for padding functions in example MHA forward variable length script
      
      * Refactor code formatting in bert_padding.py for improved readability
    • Trivial update to calculate target arch (#702) · da74c09d
      Yichen Yan authored
      
      
      * Trivial update to calculate target arch
      
      * Update tilelang/contrib/nvrtc.py
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
      
      * fmt
      
      ---------
      Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
  3. 07 Aug, 2025 2 commits
  4. 06 Aug, 2025 2 commits
    • [Example] Optimize warp specialize flashmla example (#698) · a1149cab
      Lei Wang authored
      * [Enhancement] Disable cache and append git commit ID to version in tilelang (#688)
      
      * Disabled caching in quickstart example for improved performance.
      * Added a function to retrieve the current git commit ID and appended it to the version string if not already present, enhancing version tracking and debugging capabilities.
      
      * revert quickstart
      
      * optimize code.
    • [Version] Keep local commit id as it somehow helps with debugging (#697) · ed1b96d5
      Lei Wang authored
      * [Enhancement] Disable cache and append git commit ID to version in tilelang (#688)
      
      * Disabled caching in quickstart example for improved performance.
      * Added a function to retrieve the current git commit ID and appended it to the version string if not already present, enhancing version tracking and debugging capabilities.
      
      * revert quickstart
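The version-tagging behaviour described above (look up the current git commit ID and append it to the version string only if it is not already there) can be sketched like this. The function names and the `+<id>` local-version separator are assumptions for illustration; the source does not say how the ID is formatted.

```python
import subprocess


def get_git_commit_id(cwd=None):
    """Return the short commit hash of the checkout, or None if unavailable."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"],
            cwd=cwd,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.CalledProcessError, OSError):
        # Not a git checkout, or git is not installed.
        return None
    return out.decode().strip() or None


def with_commit_id(version, commit_id):
    """Append the commit ID as a local version label unless already present."""
    if not commit_id or commit_id in version:
        return version
    separator = "." if "+" in version else "+"
    return f"{version}{separator}{commit_id}"
```

Calling `with_commit_id` twice with the same ID leaves the version unchanged, which matches the "if not already present" condition in the commit message.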
  5. 05 Aug, 2025 1 commit
    • [Smem Reuse] Optimize to do memory alignment on identical buffers. (#693) · 17fafc1b
      Lei Wang authored
      * [Enhancement] Refactor GEMM operations for improved warp partitioning and target instruction handling
      
      - Introduced a new `GetGemmInst` method to determine the appropriate GEMM instruction based on block size and target architecture.
      - Updated `ComputeWarpPartition` to accept the GEMM instruction type, enhancing flexibility in warp partitioning logic.
      - Added `TargetGetWarpSize` utility to streamline warp size retrieval based on target architecture.
      - Refactored layout inference and lowering methods to utilize the new GEMM instruction handling, improving clarity and maintainability of the codebase.
      
      * bug fix
      
      * test fix
      
      * lint fix
      
      * phase out Canonialize
      
      * add option --expt-relaxed-constexpr
      
      * [Enhancement] Introduce tilelang intrinsic operations for GEMM
      
      - Added `tl_gemm` and `tl_gemm_sp` built-in operations to support general and sparse matrix multiplication in tilelang.
      - Updated the lowering logic in `Gemm` and `GemmSP` to utilize the new tilelang operations.
      - Enhanced CUDA and HIP code generation to handle the new GEMM operations, ensuring proper argument validation and external call printing.
      - Implemented shared memory alignment planning for GEMM operations to optimize performance on supported architectures.
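As a rough picture of what alignment planning for shared memory involves, the sketch below assigns byte offsets in a single arena while honouring a per-buffer alignment. It is an illustration only: `align_up`, `plan_offsets`, the 16-byte default, and the 1024-byte example requirement are assumptions, not values taken from the commit.

```python
def align_up(offset, alignment):
    """Round offset up to the next multiple of alignment (a power of two)."""
    return (offset + alignment - 1) & ~(alignment - 1)


def plan_offsets(buffers, default_alignment=16):
    """Assign byte offsets in one shared-memory arena.

    buffers: list of (name, size_in_bytes, alignment_or_None) tuples.
    Returns ({name: offset}, total_bytes).
    """
    offsets = {}
    cursor = 0
    for name, size, alignment in buffers:
        # Buffers without an explicit requirement fall back to the default.
        cursor = align_up(cursor, alignment or default_alignment)
        offsets[name] = cursor
        cursor += size
    return offsets, cursor


# Hypothetical layout: B feeds a GEMM and asks for a stricter alignment.
offsets, total = plan_offsets([("A", 100, None), ("B", 256, 1024), ("C", 8, None)])
print(offsets, total)  # {'A': 0, 'B': 1024, 'C': 1280} 1288
```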
      
      * lint fix
      
      * lint fix
      
      * test fix
      
      * test fix
      
      * rebase
      
      * Update builtin.cc
  6. 04 Aug, 2025 1 commit
  7. 03 Aug, 2025 3 commits
    • [Refactor] Introduce GemmInst for different targets handling (#688) · d2afb513
      Lei Wang authored
      * [Enhancement] Refactor GEMM operations for improved warp partitioning and target instruction handling
      
      - Introduced a new `GetGemmInst` method to determine the appropriate GEMM instruction based on block size and target architecture.
      - Updated `ComputeWarpPartition` to accept the GEMM instruction type, enhancing flexibility in warp partitioning logic.
      - Added `TargetGetWarpSize` utility to streamline warp size retrieval based on target architecture.
      - Refactored layout inference and lowering methods to utilize the new GEMM instruction handling, improving clarity and maintainability of the codebase.
      
      * bug fix
      
      * test fix
      
      * lint fix
    • [Refactor] Rebase pipeline injector from upstream tvm (#687) · 73bf8346
      Lei Wang authored
      * [Enhancement] Introduce software pipeline rewriter and refactor buffer access handling
      
      - Added a new `PipelineOpaqueAccessRewriter` class to manage opaque buffer accesses in the software pipeline.
      - Refactored the `PipelineBodyRewriter` to utilize the new rewriter for improved buffer access handling.
      - Enhanced the `PipelineRewriter` to support additional fragment information and streamline pipeline construction.
      - Updated tests to reflect changes in buffer management and access patterns, ensuring compatibility with the new structure.
      - Removed obsolete code related to previous buffer access methods for clarity and maintainability.
      
      * test fix
    • [Feature]:Add auto vectorize for atomic add (#686) · b45e9c45
      yyttt6 authored
      * [Feature]:Add auto vectorize for atomic add
      
      * fix
      
      * fix2
      
      * format
  8. 01 Aug, 2025 1 commit
  9. 31 Jul, 2025 5 commits
    • [Fix] fix some issues with JIT decorators existing in the examples (#681) · 950ed16c
      Cunxiao Ni authored
      
      
      * [Fix] fix some issues with JIT decorators existing in the examples
      
      * format
      
      * Use PassConfigKey instead of str
      
      ---------
      Co-authored-by: Cunxiao <nicunxiao@bytedance.com>
    • [Enhancement] Refactored buffer detection logic in warp_specialized_rewriter.cc (#685) · 689ee52b
      Yu Cheng authored
      - Renamed TMAFinder to ProducerBufferDetector and improved handling of CallNode and BufferLoadNode.
      - This change aims to enhance code maintainability and performance by more accurately tracking producer buffer usage.
    • Add Flash Attn example on amd mi300 series (#682) · adcba275
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    • [Enhancement] Enhance warp specialization logic (#680) · 05f2fc6d
      Yu Cheng authored
      
      
      - Removed unnecessary configurations from the @tilelang.jit decorator in `example_grouped_gemm_fwd.py`, simplifying the kernel compilation process.
      - Updated the `grouped_gemm` function to accept a tuple for batch sizes, enhancing compatibility with the kernel invocation.
      - Added logic in `warp_specialized_rewriter.cc` to track buffer usage in `CallNode` expressions, improving the handling of TMA load operations.
      
      This refactor aims to streamline the code and improve maintainability while ensuring better performance in grouped matrix multiplication operations.
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [Enhancement] Output cache-file-related messages with verbose=True (#683) · 042c60fb
      Yang Chen authored
      This is a minor enhancement to output verbose messages indicating where
      cache files are saved and loaded. These messages are useful for
      examining the relevant intermediate files.
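A minimal sketch of the idea, assuming a file-based cache and Python's standard `logging` module: the helpers report where a cache file is written or read only when `verbose` is set. The function names, logger name, and message wording are hypothetical and do not reflect the project's actual cache code.

```python
import logging
import os

logger = logging.getLogger("cache_sketch")


def save_to_cache(cache_dir, key, data, verbose=False):
    """Write bytes under cache_dir/key and optionally report the path."""
    os.makedirs(cache_dir, exist_ok=True)
    path = os.path.join(cache_dir, key)
    with open(path, "wb") as f:
        f.write(data)
    if verbose:
        logger.info("Saved cache file to %s", path)
    return path


def load_from_cache(cache_dir, key, verbose=False):
    """Return cached bytes, or None on a cache miss."""
    path = os.path.join(cache_dir, key)
    if not os.path.exists(path):
        if verbose:
            logger.info("No cache file at %s", path)
        return None
    if verbose:
        logger.info("Loading cache file from %s", path)
    with open(path, "rb") as f:
        return f.read()
```

With `logging.basicConfig(level=logging.INFO)` enabled, a verbose save followed by a verbose load logs both paths, which makes it easy to locate the intermediate files on disk.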
  10. 30 Jul, 2025 5 commits
    • [CI] Update CI workflow to use Python 3.12 (#679) · eb026b79
      Lei Wang authored
      * Update CI workflow to use Python 3.12 and enable build isolation for pip installations
      
      - Changed the Python version in the CI configuration from 3.9 to 3.12 to ensure compatibility with the latest features and improvements.
      - Updated the `PIP_NO_BUILD_ISOLATION` environment variable from `0` to `1` in the CI configuration, allowing pip to install testing requirements with build isolation enabled, which enhances the installation process during CI runs.
      
      * Update CI workflow to trigger on pull requests instead of pull_request_target
      
      - Changed the event trigger in the CI configuration from `pull_request_target` to `pull_request` to ensure the workflow runs on pull requests, enhancing the integration process.
      
      * Refactor CI workflow to remove unnecessary repository and token settings
      
      - Removed the repository and token parameters from the checkout step in the CI configuration, simplifying the workflow setup and improving security by not exposing sensitive information.
      
      * Remove pip install command from CI workflow to streamline installation process
      
      * Refactor reshape functions and tests for shared memory operations
      
      - Renamed and updated `reshape_test_smem` to `reshape_test_smem_1d_2_2d` and `run_reshape_smem` to `run_reshape_smem_1d_2_2d` for clarity.
      - Introduced a new reshape function `reshape_test_smem_2d_2_1d` and its corresponding runner `run_reshape_smem_2d_2_1d`.
      - Updated tests to reflect the new function names and added a test for the 2D to 1D reshape functionality, enhancing test coverage and clarity.
    • [Refactor] Phase out version with commit id in editable mode (#677) · ca1138c3
      Lei Wang authored
      
      
      * merge from lab
      
      * Add `TILELANG_PRINT_ON_COMPILATION`
      
      * Update CI workflow to disable build isolation for pip installations in testing requirements
      
      - Changed the `PIP_NO_BUILD_ISOLATION` environment variable from `1` to `0` in the CI configuration, ensuring that pip installs the testing requirements without build isolation. This adjustment aims to improve compatibility and streamline the installation process during CI runs.
      
      ---------
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
    • Do not check for short variables (#676) · 4878cc5d
      Yichen Yan authored
      of which there are a lot
    • Refactor to support upstream tvm (#595) · a7c9a8b9
      Siyuan Feng authored
      **Summary of part of the rebase PR:**
      
      1. **Support T.thread_return() → CUDA return syntax**  
         Added support for translating `T.thread_return()` to CUDA's native `return` statement.
      
      2. **Dynamic type support for function inputs**  
         Functions now accept dynamically typed parameters using `typing`:
         ```python
         dyn_type = T.int32 or T.float
         @T.prim_func
         def main(
             a: dyn_type,
         ):
             ...
         ```
      
      3. **Device Function Codegen**  
         Added support for generating `__device__` functions in CUDA:
         ```python
         @I.ir_module
         class Module:
             @T.prim_func(private=True)
             def add(a: T.int32, b: T.int32) -> T.int32:
                 return a + b
      
             @T.prim_func
             def main(
                 A: T.Buffer((128, 128), "int32"),
                 B: T.Buffer((128, 128), "int32"),
                 C: T.Buffer((128, 128), "int32"),
             ):
                 T.func_attr({"global_symbol": "main"})
                 length: T.int32 = Module.add(64, 64)  # Host call
                 for bx in...
         ```
    • Update ci.yml (#675) · 8edd6941
      Wenhao Xie authored
  11. 29 Jul, 2025 6 commits
  12. 25 Jul, 2025 2 commits
  13. 24 Jul, 2025 3 commits
    • [Enhancement] Improve buffer conflict detection in thread storage synchronization (#658) · a16f0cf5
      Lei Wang authored
      * [Enhancement] Improve buffer conflict detection in thread storage synchronization
      
      - Added a new boolean variable `range_is_overlap` to accurately determine if buffer indices overlap, enhancing the conflict detection logic in `thread_storage_sync.cc`.
      - Updated the return logic to reflect the overlap status, ensuring correct conflict resolution based on buffer index comparisons.
      - Removed an unnecessary comment in `OptimizeForTarget` to streamline the code and improve clarity.
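The overlap test behind a flag like `range_is_overlap` can be illustrated with concrete integer ranges. The real pass works on symbolic index expressions inside `thread_storage_sync.cc`; the helpers below, and the assumption that both accesses target the same buffer, are hypothetical simplifications.

```python
def ranges_overlap(min_a, extent_a, min_b, extent_b):
    """True if half-open ranges [min, min + extent) share at least one index."""
    return min_a < min_b + extent_b and min_b < min_a + extent_a


def has_conflict(access_a, access_b):
    """Two accesses to the same buffer conflict only if their ranges
    overlap in every dimension.

    Each access is a list of (min, extent) pairs, one per dimension.
    """
    range_is_overlap = all(
        ranges_overlap(lo_a, ext_a, lo_b, ext_b)
        for (lo_a, ext_a), (lo_b, ext_b) in zip(access_a, access_b)
    )
    return range_is_overlap


print(has_conflict([(0, 16), (0, 8)], [(8, 16), (4, 8)]))   # True
print(has_conflict([(0, 16), (0, 8)], [(16, 16), (0, 8)]))  # False
```

Accesses that are disjoint in even one dimension touch different elements, so no synchronization is needed between them.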
      
      * example fix
      
      * enhancement
      
      * improve ci
    • [Bugfix][Docs] Update documentation build process and configurations for autoapi support (#663) · c8edb957
      Wenhao Xie authored
      * [Bugfix][Docs] Update documentation build process and configurations for autoapi support
      
      * lint fix
    • [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking (#653) · fe6cdc9d
      Zhengju Tang authored
      
      * [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking
      
      * Lint
      
      * test fix
      
      * Update CI workflow to install dependencies without user site packages
      
      - Modified the installation commands in the CI workflow to include the `--no-user` flag for both `requirements-dev.txt` and `requirements-test.txt`, ensuring that packages are installed in the virtual environment rather than the user site directory.
      
      * Update CI workflow to install pip without user site packages
      
      - Added the `--no-user` flag to the pip installation command in the CI workflow for both development and testing dependencies, ensuring that packages are installed within the virtual environment.
      
      * Update requirements-test.txt
      
      * reduce ci problem size
      
      * Refactor example_mla_decode.py for consistent formatting and remove unused imports in test_example_mla_decode.py
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  14. 23 Jul, 2025 5 commits
    • Zhang Jason's avatar
    • [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control (#656) · d764dca8
      Wenhao Xie authored
      
      * [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control
      
      * lint fix
      
      * upd
      
      * lint fix
      
      * fix typo
      
      * update typing
      
      * update the use case of compile flags
      
      * ci fix
      
      * fix
      
      * Fix CI workflow to correctly activate virtual environment from shared cache directory
      
      * use local cache
      
      * fix
      
      * fix
      
      * fix
      
      ---------
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [Cache] Support shared cache directories for multiple processes (#649) · 267d9b3b
      Lei Wang authored
      
      
      * Support shared cache directories for multiple users
      
      * ruff fix
      
      * ci_fix
      
      * Add CI step to show worker info
      
      ---------
      Co-authored-by: Chenggang Zhao <chenggangz@deepseek.com>
    • Lei Wang's avatar
    • [Bugfix][CI] Bug fixing and migrate CI from ada to hopper (#652) · e9a608e2
      Wenhao Xie authored
      
      
      * fix CI bugs in hopper
      
      * lint fix
      
      * Update bulk_copy.cc
      
      * Refactor bulk copy logic in LowerBulkCopy function
      
      - Removed unnecessary blank lines for improved code readability.
      - Enhanced stride validation by checking for null pointers in global stride calculations, ensuring robustness against symbolic strides.
      - Updated pass configuration handling in dynamic tile language tests to streamline dynamic alignment and TMA lower pass settings.
      
      * test fix
      
      * ci fix
      
      * Update flash-attention dependencies and clean up example code
      
      - Downgraded `flash-attn` dependency version in `requirements-test.txt` to `<=2.2.0`.
      - Removed unused imports and commented-out code in various example files to enhance readability and maintainability.
      - Updated the `flashattn` function signature to include default parameters for `block_M`, `block_N`, `num_stages`, and `threads`.
      - Cleaned up the `example_mha_fwd_varlen.py` and `example_mha_bwd_wgmma_pipelined.py` files by removing unnecessary comments and improving code clarity.
      - Deleted the `example_mha_inference.py` file as it is no longer needed.
      
      * Update CI workflow to remove `--user` flag from pip install commands
      
      - Removed the `--user` flag from the pip install commands in both the development and testing sections of the CI workflow to ensure proper installation of dependencies in the virtual environment.
      
      * Update CI workflow to include `--no-user` flag in pip install commands
      
      - Added the `--no-user` flag to the pip install commands in both the development and testing sections of the CI workflow to ensure dependencies are installed correctly within the virtual environment.
      
      * Update CI workflow to include `--no-user` flag in pip install command for wheel mode
      
      - Added the `--no-user` flag to the pip install command in the wheel mode section of the CI workflow to ensure dependencies are installed correctly within the virtual environment.
      
      * test fix
      
      * avoid conflict with system environments
      
      * test fix
      
      * add comments
      
      ---------
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>