"git@developer.sourcefind.cn:dadigang/Ventoy.git" did not exist on "e73ac04cb6eb86ca6e08c1f9c47506d2ee003e67"
  1. 09 Nov, 2025 1 commit
    • Lei Wang's avatar
      [Bugfix] Enhane LetStmt Handling in Pipeline Transform (#1212) · 85218bd9
      Lei Wang authored
      * [Enhancement] Introduce LetWrapper for handling loop variable substitutions in pipeline rewriting
      
      * Added LetWrapper struct to encapsulate variable and value pairs for loop variable substitutions.
      * Updated PipelineRewriter to accept a vector of LetWrapper instances, allowing for proper handling of Let statements that depend on the pipeline loop variable.
      * Enhanced the BuildPipeline method to incorporate LetWrapper instances into rewritten blocks, ensuring correct substitutions during pipeline execution.
      * Refactored logic for processing Let statements to differentiate between those that use the loop variable and those that do not, improving the flexibility of the pipeline transformation.
      
      * Refactor lambda expression for clarity in loop variable usage check in inject_pipeline.cc
      
      * [Test] Add regression test for loop variable handling in kernel compilation
      
      * Introduced a new test case to verify correct handling of loop variables in the kernel compilation process, addressing a regression issue with InjectSoftwarePipeline.
      * The test ensures that the loop variable is not left as a free variable, which previously caused failures in MakePackedAPI.
      * Configurations are set to disable warp specialization and TMA lowering to align with the original issue reproduction.
      
      * Remove unused import in regression test for loop variable handling in kernel compilation
      85218bd9
  2. 08 Nov, 2025 1 commit
    • Lei Wang's avatar
      [Enhancement] Improve handling of negative indices for ramp and broadcast node (#1207) · 918a21bd
      Lei Wang authored
      * [Enhancement] Improve handling of negative indices in legalize_negative_index pass
      
      * Added logic to handle scalar and vector indices separately, enhancing the ability to determine non-negativity and negativity of indices.
      * Introduced detailed logging for cases where non-negativity cannot be proven, improving debugging capabilities.
      * Refactored index state determination for vector types, including support for Ramp and Broadcast nodes.
      
      * Fix incorrect lane handling in legalize_negative_index pass by dereferencing lanes to obtain the correct integer value.
      
      * Enhance legalize_negative_index pass by including necessary header for TIR operations. This addition supports improved functionality and maintainability of the transformation logic.
      918a21bd
  3. 07 Nov, 2025 1 commit
  4. 06 Nov, 2025 1 commit
  5. 05 Nov, 2025 1 commit
    • Lei Wang's avatar
      [Langauge] Support n>256 for v2 (#1182) · b66a93c5
      Lei Wang authored
      * fix
      
      * lint fix
      
      * fix
      
      * lint fix
      
      * fix
      
      * upd
      
      * support n>256
      
      * Remove unnecessary pass configurations for fast math in MHA forward BHSD latency script.
      
      * lint fix
      
      * lint fix
      b66a93c5
  6. 04 Nov, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Improve Python3.9 compatibility for ParamSpec and Self (#1190) · 7d961892
      Lei Wang authored
      * [Feature] Enhance fill operation to support various buffer types
      
      - Added support for `BufferLoad` in the `fill` function to handle different buffer types.
      - Updated `Fill` class to process region descriptors and buffer regions, improving flexibility in buffer handling.
      - Introduced checks for static bounds in region definitions to ensure safety during operations.
      - Refactored loop induction variable handling in `FillNode` to accommodate sliced regions.
      
      * lint fix
      
      * [Refactor] Improve Python compatibility for ParamSpec and Self
      
      - Added compatibility handling for ParamSpec and Self to support Python versions below 3.10 and 3.11 respectively.
      - Updated type annotations across multiple files to ensure consistent usage of typing features.
      
      * [Update] Require Python 3.9 and enhance type annotations
      
      - Updated the minimum required Python version from 3.8 to 3.9 in `pyproject.toml`.
      - Removed references to Python 3.8 in classifiers.
      - Changed type annotations from `int | None` to `Optional[int]` in multiple example files for better clarity and compatibility.
      - Improved import statements to use `collections.abc` for `Iterable` and `contextlib` for `AbstractContextManager` in relevant files.
      
      * [Refactor] Update import statements to enhance type annotations
      
      - Replaced imports from `typing` with `collections.abc` for `Iterable` and `Mapping` in relevant files to improve compatibility and clarity.
      - Updated the caching decorator from `functools.lru_cache` to `functools.cache` for better performance in the C++ compiler retrieval function.
      - Adjusted import statements in the language proxy file to maintain consistency in type annotations.
      
      * disable rocm rs nt test.
      
      * lint fix
      7d961892
  7. 03 Nov, 2025 1 commit
  8. 02 Nov, 2025 1 commit
    • Lei Wang's avatar
      [Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986) · aef0a6bb
      Lei Wang authored
      
      
      * remove debug print
      
      * pipeline fix
      
      * use the correct buffer access scope
      
      * rs support
      
      * warp warpgroup_fence_operand
      
      * fix
      
      * fp8 dtype ptx enhance
      
      * mma fix
      
      * TCGEN05 Interface
      
      * tcgen05 support
      
      * rebase
      
      * update
      
      * Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.
      
      * lint fix
      
      * Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.
      
      * wgmma fix
      
      ---------
      Co-authored-by: default avatarZhiwen Mo <zm125@ic.ac.uk>
      aef0a6bb
  9. 31 Oct, 2025 2 commits
    • Lei Wang's avatar
      [Bugfix] Enable code lowering with producer‑copy‑only program (#1168) · 7a80b6df
      Lei Wang authored
      * bugfix
      
      * lint fix
      
      * Enhance warp group register allocation to handle missing consumer bodies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns.
      
      * Refactor VisitExpr_ method in inject_tma_barrier.cc for improved readability. Adjusted formatting and spacing for clarity in barrier handling logic.
      
      * Update barrier handling in inject_tma_barrier.cc to accommodate newly appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions.
      7a80b6df
    • Lei Wang's avatar
      [FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi (#1108) · 10911e28
      Lei Wang authored
      
      
      * 3rdparty tvm bump
      
      * bump tvm into v0.22.0
      
      * lint fix
      
      * rebase tvm
      
      * Update submodule tvm to latest commit 3085bc4
      
      * Refactor: Update configuration retrieval in CopyNode and adjust test registration in tilelang
      
      * test fix
      
      * add requirement
      
      * atomic_fix
      
      * atomic_fix
      
      * phaseout py39
      
      * optimize
      
      * optimize
      
      * lint fix
      
      * do not clean cache
      
      * do not clean cache
      
      * [Minor] Minor update for Python versions and dependencies
      
      * [Lint] fix lint for py39
      
      * [Lint] fix lint for ROCm
      
      * [Build][CI] Sync CI changes from upstream/sdist
      
      * [Lint] fix lint for ROCm
      
      * [Build][CI] Update `repair-wheel-command`
      
      * [Minor] update abi3audit result format
      
      * [Lint] fix lint for ROCm
      
      * [BugFix] fix build
      
      * [Lint] fix lint for ROCm
      
      * [BugFix] set rpath for libtvm and libtvm_runtime
      
      * [Deps] pin apache-tvm-ffi version
      
      * [Build] set Python 3.9 Limited API for Cython target
      
      * [Build] set Python 3.9 Limited API for Cython target
      
      * [Deps] Restore Python 3.8 support
      
      * [Build] use `apache-tvm-ffi`'s `libtvm_ffi`
      
      * [BugFix] use `;` as delimiter for RPATH on macOS
      
      * [BugFix] use `--ignore-missing-dependencies` for `delocate-wheel`
      
      * [Build] support `sccache` if available
      
      * [Build] add CIBW import test
      
      * [Build][CI] enable ccache for CIBW on Linux
      
      * [BugFix] set rpath for libtvm and libtvm_runtime
      
      * Revert "[Build][CI] enable ccache for CIBW on Linux"
      
      This reverts commit cd9ab57bb5ddd2572c60bcbbebde81480a658fd3.
      
      * [CI] fix perfbench bot
      
      * [BugFix] use Python 3.9 to build wheel
      
      * [Minor] update perfbench bot envs
      
      * [BugFix] fix CIBW environment on Linux
      
      * [CI] skip import test on CentOS 7
      
      * [CI] use Python urllib to download file instead of Wget
      
      ---------
      Co-authored-by: default avatarXuehai Pan <XuehaiPan@pku.edu.cn>
      10911e28
  10. 29 Oct, 2025 2 commits
  11. 28 Oct, 2025 1 commit
  12. 27 Oct, 2025 2 commits
  13. 24 Oct, 2025 1 commit
  14. 23 Oct, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Improve scalar handling in CopyNode and update loop partition dtype logi (#1111) · 86c8bb46
      Lei Wang authored
      * [Refactor] Improve scalar handling in CopyNode and update loop partition dtype logic
      
      * Refactored CopyNode::MakeSIMTLoop to handle scalar cases more efficiently by moving the scalar check to the end of the function.
      * Updated loop_partition.cc to set a default DataType for thread and vector extents, ensuring compatibility when loop_vars_ is empty.
      
      * lint fix
      
      * remove debug print
      86c8bb46
  15. 22 Oct, 2025 1 commit
  16. 21 Oct, 2025 4 commits
  17. 20 Oct, 2025 4 commits
  18. 17 Oct, 2025 2 commits
  19. 16 Oct, 2025 1 commit
  20. 15 Oct, 2025 2 commits
  21. 14 Oct, 2025 1 commit
  22. 13 Oct, 2025 1 commit
  23. 11 Oct, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group... · ddfaac36
      Lei Wang authored
      [Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend (#977)
      
      * • InjectFenceProxy docs and tests
      
        - annotate proxy fence injector with context comments for async/generic detection
        - add compiler internals doc covering the pass mechanics and link it in docs index
        - repair fence proxy test by fixing descriptor init usage and fence counter logic
      
      * do not consider call_extern as async.
      
      * doc update.
      
      * reduce test size for sparse mla
      ddfaac36
  24. 10 Oct, 2025 3 commits
    • Chaofan Lin's avatar
      [Bugfix] Fix dummy kernel compliation (#962) · 7913fb1d
      Chaofan Lin authored
      
      
      * [Bugfix] Fix visit EvaluateNode in BufferGemmCollector
      
      * address comment
      
      * lint
      
      * fix
      
      * Add TileLang SplitHostDevice pass and tighten issue 830 test names
      
      * lint fix
      
      * enhance for kernel value unpacking.
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      7913fb1d
    • Xuehai Pan's avatar
      [CI] add `pre-commit` integration (#955) · 8fe35402
      Xuehai Pan authored
      
      
      * chore: misc cleanup
      
      * feat: add pre-commit config
      
      * chore: update lint dependencies
      
      * style: fix lint issues
      
      * feat: add pre-commit hooks
      
      * fix: fix typos
      
      * chore: update .gitattributes
      
      * [Lint]: [pre-commit.ci] auto fixes [...]
      
      * docs: update CONTRIBUTING.md
      
      * chore: update default venv name
      
      * chore: revert and exclude CUDA files
      
      ---------
      Co-authored-by: default avatarpre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
      8fe35402
    • Lei Wang's avatar
      [Bugfix] Do not force inline let stmt (#947) · f8ae600c
      Lei Wang authored
      * remove debug print
      
      * Remove inline let expressions from the LowerAndLegalize function in phase.py
      
      * add test
      
      * Update sparse MLA examples to support SKV adjustment and correctness checks
      
      - Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
      - Added check_correctness parameter to test functions for validation of outputs.
      - Updated test cases to reflect new SKV values and correctness checks.
      
      * reduce test shape
      
      * Update documentation structure and refactor main function parameters in example_fusedmoe_tilelang.py
      
      - Added a new section for compiler internals in the documentation.
      - Refactored the main function in example_fusedmoe_tilelang.py to accept parameters for hidden dimensions, expert configurations, and batch/sequence sizes, improving flexibility and readability.
      
      * Update buffer access checks in merge_shared_memory_allocations.cc
      
      - Changed the condition for buffer access from less than (<) to less than or equal to (<=) to allow access at the same scope level.
      - Adjusted the logic for determining the access level when touching buffers to ensure correct handling of scope levels.
      
      * lint fix
      
      * Support pipeline with LetStmt
      
      * lint fix
      
      * • Fix LowerTileOp let handling to avoid LetInline dependency
      
        - inline let-bound BufferLoad nodes via resolver helpers and structured return
        - remap layouts/buffers using original data vars and only rewrite when needed
        - update pipeline planner to understand let-bound address_of buffers
        - document the new inline behaviour in docs/let_inline_fix.md
      
      * fix for wgmma pipeline with let binding
      
      * lint fix
      
      * test fix
      
      * reduce smem usage.
      
      * let binding enhancement
      
      * fix for dpgm
      
      * fix simplify
      
      * lint fix
      
      * use tilelang.Simplify instead of tir.Simplify
      
      * • Add TL_FORCE_LET_INLINE pass config and gate eager LetInline usage
      
        - register the new config in builtin headers/registration
        - add helper to pipeline enabling LetInline based on pass context
        - document LetStmt inlining controls and usage
      f8ae600c
  25. 09 Oct, 2025 1 commit
    • Lei Wang's avatar
      [TileOp] Implement WGMMA for T.gemm_v2 (#813) · a13cde28
      Lei Wang authored
      * [Feature] Introduce WGMMA support and enhance GEMM layout handling
      
      - Added support for the WGMMA intrinsic in the TileLang framework, enabling efficient matrix multiplication on newer architectures.
      - Refactored GEMM layout functions to accept a boolean parameter for K dimension handling, improving flexibility in layout generation.
      - Updated layout inference logic to accommodate new WGMMA configurations and ensure compatibility with existing GEMM operations.
      - Enhanced Python bindings for layout functions, allowing for better integration and usability in user-defined operations.
      - Improved documentation for layout functions and GEMM operations to clarify usage and parameters.
      
      These changes enhance the performance and usability of GEMM operations, particularly for advanced architectures, while maintaining backward compatibility with existing implementations.
      
      * [Refactor] Clean up code formatting and enhance layout function readability
      
      - Improved code formatting across multiple files for better readability, including consistent indentation and line breaks.
      - Updated layout function signatures to enhance clarity, particularly in `gemm_layouts.cc`, `layout.cc`, and `layout.h`.
      - Refactored lambda functions in `builtin.cc` and `gemm_py.cc` for improved structure and maintainability.
      - Enhanced comments and documentation in layout-related files to clarify usage and parameters.
      
      These changes contribute to a cleaner codebase and improved maintainability of layout functions in the TileLang framework.
      
      * [Feature] Add descriptor initialization and offset manipulation for WGMMA
      
      - Introduced new TileLang builtins `initialize_descriptor` and `increase_descriptor_offset` to facilitate descriptor management for WGMMA operations.
      - Updated `builtin.cc` and `builtin.h` to define and document the new builtins, enhancing the framework's capabilities for descriptor handling.
      - Modified `codegen_cuda.cc` and `ptx.cc` to integrate the new builtins into the code generation process, ensuring proper assembly generation for WGMMA operations.
      - Enhanced the `GemmWGMMA` class to utilize the new descriptor functionalities, improving the efficiency of matrix multiplication operations.
      - Updated related tests and documentation to reflect the new features and ensure comprehensive coverage.
      
      These changes enhance the TileLang framework's support for advanced matrix operations on newer architectures, improving performance and usability.
      
      * [Refactor] Improve code formatting and readability in various files
      
      - Enhanced code formatting across multiple files for better readability, including consistent indentation and line breaks.
      - Updated function signatures and comments in `builtin.h`, `codegen_cuda.cc`, and `ptx.cc` to improve clarity.
      - Refactored descriptor initialization and offset manipulation functions in `builtin.py` and `wgmma_macro_generator.py` for improved structure.
      - Cleaned up unnecessary whitespace and improved alignment in `common.h` and `allocate.py`.
      
      These changes contribute to a cleaner and more maintainable codebase in the TileLang framework.
      
      * [Update] Update subproject commit and refactor layout function call
      
      - Updated the subproject commit for `cutlass` to indicate a dirty state.
      - Refactored the `UpdateAnalyzer` function in `layout.cc` to call `LayoutNode::getVarMap()` instead of `getVarMap()`, improving clarity and ensuring proper context for variable mapping.
      
      These changes enhance the maintainability and clarity of the layout handling in the TileLang framework.
      
      * support more data types
      
      * gemm_rs support
      
      * lint fix
      
      * wgmma wrapper
      
      * Remove debug logging for wgmma assembly code and refactor swizzle byte size calculations in wgmma macro generator. Enhanced handling of leading and stride byte offsets based on swizzle mode, improving clarity and performance in tensor core intrinsic emissions.
      
      * Refactor GEMM layout functions to replace 'kfactor' with 'k_inner' for improved clarity and consistency. Update includes necessary changes in error messages for Hopper and Sm100 layouts. Additionally, include a new header for CUTE utilities in common.h.
      
      * Comprehensively support WGMMA GEMM SS
      
      * remove debug print
      
      * lint fix
      
      * remove debug print
      
      * reduce bwd test shape
      
      * lint fix
      
      * clear cache for pytest
      
      * lint fix
      
      * Update sparse MLA examples to support SKV adjustment and correctness checks
      
      - Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
      - Added check_correctness parameter to test functions for validation of outputs.
      - Updated test cases to reflect new SKV values and correctness checks.
      
      * test fix
      
      * adjust test case
      
      * test fix
      
      * skip some test currently
      a13cde28
  26. 05 Oct, 2025 1 commit
  27. 02 Oct, 2025 1 commit
    • Zhiwen Mo's avatar
      [Bugfix] Fix tensor memory copy layout (#933) · 5ccac4fa
      Zhiwen Mo authored
      * Implements tcgen05.ld instruction support for copying from shared.tmem
        to local.fragment on SM100/Blackwell architecture. Adds layout inference
        and lowering logic for tensor memory operations with proper physical
        coordinate range analysis and warpgroup alignment checks.
      
        Changes:
        - Add kTMemLoad and kTMemStore to CopyInst enumeration
        - Implement CheckTMemLoad() and CheckTMemStore() validation functions
        - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics
        - Add tmem layout inference in InferLayout() using expandTcgen05Layout
        - Support multiple instruction variants (32dp32b/64b/128b/256b)
        - Add physical layout bounds analysis for tmem coordinates
        - Change clear_accum from bool to PrimExpr in GEMM operations
        - Fix std::optional access checks in layout_inference.cc
        - Add tmem_allocate/deallocate PTX intrinsic support
        - Fix cooperative_groups grid.sync() code generation
      
      * fix
      
      * pipeline fix
      
      * bug fix
      
      * bool fix
      5ccac4fa