1. 04 Nov, 2025 1 commit
  2. 03 Nov, 2025 2 commits
  3. 31 Oct, 2025 1 commit
    • [FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi (#1108) · 10911e28
      Lei Wang authored
      
      
      * 3rdparty tvm bump
      
      * bump tvm into v0.22.0
      
      * lint fix
      
      * rebase tvm
      
      * Update submodule tvm to latest commit 3085bc4
      
      * Refactor: Update configuration retrieval in CopyNode and adjust test registration in tilelang
      
      * test fix
      
      * add requirement
      
      * atomic_fix
      
      * atomic_fix
      
      * phaseout py39
      
      * optimize
      
      * optimize
      
      * lint fix
      
      * do not clean cache
      
      * do not clean cache
      
      * [Minor] Minor update for Python versions and dependencies
      
      * [Lint] fix lint for py39
      
      * [Lint] fix lint for ROCm
      
      * [Build][CI] Sync CI changes from upstream/sdist
      
      * [Lint] fix lint for ROCm
      
      * [Build][CI] Update `repair-wheel-command`
      
      * [Minor] update abi3audit result format
      
      * [Lint] fix lint for ROCm
      
      * [BugFix] fix build
      
      * [Lint] fix lint for ROCm
      
      * [BugFix] set rpath for libtvm and libtvm_runtime
      
      * [Deps] pin apache-tvm-ffi version
      
      * [Build] set Python 3.9 Limited API for Cython target
      
      * [Build] set Python 3.9 Limited API for Cython target
      
      * [Deps] Restore Python 3.8 support
      
      * [Build] use `apache-tvm-ffi`'s `libtvm_ffi`
      
      * [BugFix] use `;` as delimiter for RPATH on macOS
      
      * [BugFix] use `--ignore-missing-dependencies` for `delocate-wheel`
      
      * [Build] support `sccache` if available
      
      * [Build] add CIBW import test
      
      * [Build][CI] enable ccache for CIBW on Linux
      
      * [BugFix] set rpath for libtvm and libtvm_runtime
      
      * Revert "[Build][CI] enable ccache for CIBW on Linux"
      
      This reverts commit cd9ab57bb5ddd2572c60bcbbebde81480a658fd3.
      
      * [CI] fix perfbench bot
      
      * [BugFix] use Python 3.9 to build wheel
      
      * [Minor] update perfbench bot envs
      
      * [BugFix] fix CIBW environment on Linux
      
      * [CI] skip import test on CentOS 7
      
      * [CI] use Python urllib to download file instead of Wget
      
      ---------
      Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
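The last item in the commit above replaces Wget with Python's `urllib` for downloading a file in CI. The log does not show the actual script, so the following is only a minimal sketch of what such a download step might look like. The helper name `download`, its parameters, and the demo paths are hypothetical, not taken from the repository.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path


def download(url: str, dest: Path, timeout: float = 30.0) -> Path:
    """Stream the resource at `url` into `dest` and return `dest`.

    Hypothetical helper: uses only the standard library, so it works on
    images where Wget is not installed.
    """
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(url, timeout=timeout) as response, open(dest, "wb") as out:
        # Copy in chunks instead of reading the whole body into memory.
        shutil.copyfileobj(response, out)
    return dest


if __name__ == "__main__":
    # Demonstrate with a local file:// URL so the sketch runs without network access.
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "source.txt"
        source.write_bytes(b"hello")
        copy = download(source.as_uri(), Path(tmp) / "out" / "copy.txt")
        print(copy.read_bytes())
```

In a CI job the same function would presumably be called with an `https://` URL; the `file://` URL here is only to keep the example self-contained.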
  4. 29 Oct, 2025 1 commit
  5. 27 Oct, 2025 3 commits
  6. 22 Oct, 2025 1 commit
  7. 20 Oct, 2025 1 commit
  8. 19 Oct, 2025 1 commit
  9. 18 Oct, 2025 1 commit
  10. 16 Oct, 2025 1 commit
  11. 15 Oct, 2025 1 commit
    • [CI][Refactor] Merge test CI workflow files into one (#973) · 8ce27782
      Xuehai Pan authored
      * refactor: merge test CI workflow files into one
      
      * chore: set `UV_INDEX_STRATEGY=unsafe-best-match`
      
      * feat: add AST test with Python 3.8
      
      * feat: implement manual caching mechanism for self-hosted runners
      
      * refactor: simplify cache logic for self-hosted runners
      
      * chore: clear uv cache on failure
      
      * chore: print format.sh output to logs
      
      * chore: improve uv caching
      
      * chore: disable parallel test
      
      * chore: use `PYTHONDEVMODE=1` in CI
      
      * feat: enable coredump generation
      
      * fix: fix perfbench condition
      
      * Revert "feat: enable coredump generation"
      
      This reverts commit c52da65cb572932e09905d08c43a39ec3cf47c54.
      
      * chore: move example CI down
      
      * Revert "chore: move example CI down"
      
      This reverts commit 9d8e65055e01d955c5268a9a6705d270c2de0d57.
      
      * chore: skip example `test_example_mha_sink_bwd_bhsd`
      
      * chore: skip example `test_example_gqa_sink_bwd_bhsd`
      
      * fix: fix example argument passing
      
      * fix: loosen test criteria
      
      * chore: rename `CMAKE_CONFIG...
  12. 14 Oct, 2025 1 commit
  13. 13 Oct, 2025 2 commits
    • [CI] Removes redundant environment variable (#1020) · eb37e459
      Cunxiao Ni authored
      * [CI] Removes redundant environment variable
      Removes the `UV_INDEX_URL`
      
      * trigger CI
      
      * trigger CI
      
      * trigger CI
      
      * trigger CI
    • [Build] Migrate to scikit-build-core (#939) · d89ba5b8
      Yichen Yan authored
      
      
      * cleanup
      
      * init
      
      * build first wheel that may not work
      
      * build cython ext
      
      * fix tvm build
      
      * use sabi
      
      * update rpath to support auditwheel
      
      * pass editable build
      
      * update ci
      
      * fix warnings
      
      * do not use ccache in self host runner
      
      * test local uv cache
      
      * test pip index
      
      * update lib search to respect new lib location
      
      * fix
      
      * update ci
      
      * enable cuda by default
      
      * update src map
      
      * fix
      
      * fix
      
      * fix
      
      * Generate version with backend and git information at build time
      
      * copy tvm_cython to wheels
      
      * fix tvm lib search
      
      * fmt
      
      * remove unused
      
      * auto detect ccache
      
      * add back backend-related files
      
      * remove jit cython adaptor to simplify code
      
      * fmt
      
      * fix ci
      
      * ci fix 2
      
      * ci fix 3
      
      * workaround metal
      
      * ci fix 4
      
      * fmt
      
      * fmt
      
      * Revert "ci fix 4"
      
      This reverts commit d1de8291c3e40927955f3ad3cf87a75c78813676.
      
      * tmp
      
      * fix metal
      
      * trivial cleanup
      
      * add detailed build-time version for cuda
      
      * add back mlc
      
      * Restore wheel info and other trivial updates
      
      * update
      
      * fix cuda
      
      * upd
      
      * fix metal ci
      
      * test for ga build
      
      * test for nvidia/cuda
      
      * test ubuntu 20
      
      * fix
      
      * fix
      
      * Do not use `uv build`
      
      * fix
      
      * fix
      
      * log toolchain version
      
      * merge wheel
      
      * update
      
      * debug
      
      * fix
      
      * update
      
      * skip rocm
      
      * update artifacts each
      
      * fix
      
      * fix
      
      * add mac
      
      * fix cache
      
      * fix cache
      
      * fix cache
      
      * reset and add comment
      
      * upd
      
      * fix git version
      
      * update deps
      
      * trivial update
      
      * use in-tree build dir and install to src to speedup editable build
      
      * Revert "use in-tree build dir and install to src to speedup editable build"
      
      This reverts commit 6ab87b05c5eed811210136b8dca4fc3677dd51f2.
      
      * add build-dir
      
      * update docs
      
      * remove old scripts
      
      * [1/n] cleanup scripts
      
      * [Lint]: [pre-commit.ci] auto fixes [...]
      
      * fix and update
      
      * wait for tvm fix
      
      * revert some tmp fix
      
      * fix
      
      * fix
      
      * spell
      
      * doc update
      
      * test cibuildwheel
      
      * fix and test macos on ci
      
      * Update .github/workflows/dist.yml
      Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
      
      * fix
      
      * test ga event
      
      * cleanup
      
      * bump tvm to support api3
      
      * test final version
      
      * add cron
      
      * Update .github/workflows/dist.yml
      Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
      
      * fix
      
      * test ccache for metal cibuildwheel
      
      * test newer macos
      
      * finish
      
      ---------
      Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
      Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
  14. 11 Oct, 2025 1 commit
  15. 09 Oct, 2025 7 commits
    • [TileOp] Implement WGMMA for T.gemm_v2 (#813) · a13cde28
      Lei Wang authored
      * [Feature] Introduce WGMMA support and enhance GEMM layout handling
      
      - Added support for the WGMMA intrinsic in the TileLang framework, enabling efficient matrix multiplication on newer architectures.
      - Refactored GEMM layout functions to accept a boolean parameter for K dimension handling, improving flexibility in layout generation.
      - Updated layout inference logic to accommodate new WGMMA configurations and ensure compatibility with existing GEMM operations.
      - Enhanced Python bindings for layout functions, allowing for better integration and usability in user-defined operations.
      - Improved documentation for layout functions and GEMM operations to clarify usage and parameters.
      
      These changes enhance the performance and usability of GEMM operations, particularly for advanced architectures, while maintaining backward compatibility with existing implementations.
      
      * [Refactor] Clean up code formatting and enhance layout function readability
      
      - Improved code formatting across multiple files for better readability, including consistent indentation and line breaks.
      - Updated layout function signatures to enhance clarity, particularly in `gemm_layouts.cc`, `layout.cc`, and `layout.h`.
      - Refactored lambda functions in `builtin.cc` and `gemm_py.cc` for improved structure and maintainability.
      - Enhanced comments and documentation in layout-related files to clarify usage and parameters.
      
      These changes contribute to a cleaner codebase and improved maintainability of layout functions in the TileLang framework.
      
      * [Feature] Add descriptor initialization and offset manipulation for WGMMA
      
      - Introduced new TileLang builtins `initialize_descriptor` and `increase_descriptor_offset` to facilitate descriptor management for WGMMA operations.
      - Updated `builtin.cc` and `builtin.h` to define and document the new builtins, enhancing the framework's capabilities for descriptor handling.
      - Modified `codegen_cuda.cc` and `ptx.cc` to integrate the new builtins into the code generation process, ensuring proper assembly generation for WGMMA operations.
      - Enhanced the `GemmWGMMA` class to utilize the new descriptor functionalities, improving the efficiency of matrix multiplication operations.
      - Updated related tests and documentation to reflect the new features and ensure comprehensive coverage.
      
      These changes enhance the TileLang framework's support for advanced matrix operations on newer architectures, improving performance and usability.
      
      * [Refactor] Improve code formatting and readability in various files
      
      - Enhanced code formatting across multiple files for better readability, including consistent indentation and line breaks.
      - Updated function signatures and comments in `builtin.h`, `codegen_cuda.cc`, and `ptx.cc` to improve clarity.
      - Refactored descriptor initialization and offset manipulation functions in `builtin.py` and `wgmma_macro_generator.py` for improved structure.
      - Cleaned up unnecessary whitespace and improved alignment in `common.h` and `allocate.py`.
      
      These changes contribute to a cleaner and more maintainable codebase in the TileLang framework.
      
      * [Update] Update subproject commit and refactor layout function call
      
      - Updated the subproject commit for `cutlass` to indicate a dirty state.
      - Refactored the `UpdateAnalyzer` function in `layout.cc` to call `LayoutNode::getVarMap()` instead of `getVarMap()`, improving clarity and ensuring proper context for variable mapping.
      
      These changes enhance the maintainability and clarity of the layout handling in the TileLang framework.
      
      * support more data types
      
      * gemm_rs support
      
      * lint fix
      
      * wgmma wrapper
      
      * Remove debug logging for wgmma assembly code and refactor swizzle byte size calculations in wgmma macro generator. Enhanced handling of leading and stride byte offsets based on swizzle mode, improving clarity and performance in tensor core intrinsic emissions.
      
      * Refactor GEMM layout functions to replace 'kfactor' with 'k_inner' for improved clarity and consistency. Update includes necessary changes in error messages for Hopper and Sm100 layouts. Additionally, include a new header for CUTE utilities in common.h.
      
      * Comprehensively support WGMMA GEMM SS
      
      * remove debug print
      
      * lint fix
      
      * remove debug print
      
      * reduce bwd test shape
      
      * lint fix
      
      * clear cache for pytest
      
      * lint fix
      
      * Update sparse MLA examples to support SKV adjustment and correctness checks
      
      - Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
      - Added check_correctness parameter to test functions for validation of outputs.
      - Updated test cases to reflect new SKV values and correctness checks.
      
      * test fix
      
      * adjust test case
      
      * test fix
      
      * skip some test currently
    • [CI]: Bump actions/checkout from 2 to 5 (#953) · 10adb79f
      dependabot[bot] authored
      Bumps [actions/checkout](https://github.com/actions/checkout) from 2 to 5.
      - [Release notes](https://github.com/actions/checkout/releases)
      - [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
      - [Commits](https://github.com/actions/checkout/compare/v2...v5)
      
      ---
      updated-dependencies:
      - dependency-name: actions/checkout
        dependency-version: '5'
        dependency-type: direct:production
        update-type: version-update:semver-major
      ...
      Signed-off-by: dependabot[bot] <support@github.com>
      Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
    • [CI]: Bump actions/github-script from 7 to 8 (#954) · 5d881a57
      dependabot[bot] authored
      Bumps [actions/github-script](https://github.com/actions/github-script) from 7 to 8.
      - [Release notes](https://github.com/actions/github-script/releases)
      - [Commits](https://github.com/actions/github-script/compare/v7...v8)
      
      ---
      updated-dependencies:
      - dependency-name: actions/github-script
        dependency-version: '8'
        dependency-type: direct:production
        update-type: version-update:semver-major
      ...
      Signed-off-by: dependabot[bot] <support@github.com>
      Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    • [CI]: Bump astral-sh/setup-uv from 6 to 7 (#952) · b6f90d25
      dependabot[bot] authored
      Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from 6 to 7.
      - [Release notes](https://github.com/astral-sh/setup-uv/releases)
      - [Commits](https://github.com/astral-sh/setup-uv/compare/v6...v7)
      
      ---
      updated-dependencies:
      - dependency-name: astral-sh/setup-uv
        dependency-version: '7'
        dependency-type: direct:production
        update-type: version-update:semver-major
      ...
      Signed-off-by: dependabot[bot] <support@github.com>
      Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    • [CI]: Bump actions/setup-python from 2 to 6 (#951) · d8fedc17
      dependabot[bot] authored
      Bumps [actions/setup-python](https://github.com/actions/setup-python) from 2 to 6.
      - [Release notes](https://github.com/actions/setup-python/releases)
      - [Commits](https://github.com/actions/setup-python/compare/v2...v6)
      
      ---
      updated-dependencies:
      - dependency-name: actions/setup-python
        dependency-version: '6'
        dependency-type: direct:production
        update-type: version-update:semver-major
      ...
      Signed-off-by: dependabot[bot] <support@github.com>
      Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
    • [CI] enable dependabot for GHA workflows (#950) · f6d4bd3a
      Xuehai Pan authored
      * chore: add .editorconfig
      
      * feat: enable dependabot for GHA workflows
  16. 07 Oct, 2025 1 commit
  17. 01 Oct, 2025 1 commit
  18. 26 Sep, 2025 1 commit
  19. 06 Sep, 2025 1 commit
    • [CI]Adds pytest timeout to CI (#792) · bcfc8343
      Cunxiao Ni authored
      * [CI]Adds pytest timeout to CI
      
      Adds a timeout to pytest runs in CI to prevent jobs from hanging indefinitely.
      This also adds `pytest-timeout` to the test requirements.
      
      * fix lint
  20. 03 Sep, 2025 1 commit
    • [CI] Adds pytest-durations for test timing (#782) · 141e01fb
      Cunxiao Ni authored
      * [Ci] Adds pytest-durations for test timing
      
      Adds `pytest-durations` to the test requirements and configures pytest to display test durations.
      
      This helps in identifying slow-running tests and optimizing the test suite for faster feedback.
      
      * add amd ci durations
      
      * Removes flash_attn installation from CI
  21. 02 Sep, 2025 2 commits
    • [Cache] Introduce detailed target information for the disk kernel cache (#780) · 7ffc5b44
      Lei Wang authored
      * Fix type hint for target_host parameter in compile function to allow None value
      
      * Refactor target handling in compile function to utilize determine_target for improved clarity and consistency
      
      * Update PrintConst function in codegen_cuda.cc to use hexfloat format for bfloat16 and float8/float4 types, while adding scientific notation comments for clarity. This change enhances the representation of floating-point constants in the generated code.
      
      * Refactor PrintType function in codegen_cuda.cc to remove unnecessary failure conditions for floating-point types with lane counts greater than 4. This change simplifies the logic and improves code clarity.
      
      * Enhance benchmark_matmul.py to conditionally print Reference TFlops only if ref_latency is not None. Update param.py to ensure target is converted to string for consistency. Refactor tuner.py to utilize determine_target for improved clarity in target handling.
      
      * Remove automatic commit and push step from AMD and NVIDIA CI workflows to streamline the process and avoid unnecessary commits.
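The `PrintConst` item in the commit above switches low-precision float constants to hexfloat notation and keeps a scientific-notation comment next to them. The actual change is in C++ (`codegen_cuda.cc`) and is not shown in this log. The Python sketch below only illustrates the general idea, on ordinary double-precision floats: a hexfloat literal round-trips exactly, while a short decimal rendering may not. The function name `format_constant` and the comment layout are hypothetical.

```python
def format_constant(value: float) -> str:
    """Render `value` as an exact hexfloat literal plus a readable decimal comment.

    Hypothetical formatter; the layout is illustrative and not the format
    emitted by the real code generator.
    """
    return f"{value.hex()} /* {value:.6e} */"


if __name__ == "__main__":
    x = 1 / 3

    # Hexfloat encodes the exact bit pattern, so parsing it back is lossless.
    assert float.fromhex(x.hex()) == x

    # A seven-significant-digit decimal rendering drops precision for this value.
    assert float(f"{x:.6e}") != x

    print(format_constant(0.5))
```

Keeping the decimal form in a comment is a readability aid only; the hexfloat part is what carries the value.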
    • [Lint] Introduce clang-tidy into format.sh (#777) · cdc5d8d3
      Lei Wang authored
      * [Refactor] Update Clang-Tidy Checks and Improve Code Consistency
      
      - Enhanced .clang-tidy configuration by adding specific checks for better bug detection and performance optimization.
      - Refactored function signatures across multiple files to use `const` references for parameters, improving performance and code clarity.
      - Updated various methods to ensure consistent handling of parameters, particularly in `AddPredicate`, `Substitute`, and `PlanLoopPartition` functions.
      - Improved readability by replacing size checks with `empty()` method calls in several locations, ensuring clearer intent in the code.
      - General code cleanup and adherence to best practices for better maintainability.
      
      * [Refactor] Enhance Code Consistency and Clang-Tidy Configuration
      
      - Updated .clang-tidy configuration to include additional checks for improved code quality and performance.
      - Refactored function signatures across multiple files to use `const` references, enhancing performance and clarity.
      - Replaced size checks with `empty()` method calls in various locations for clearer intent.
      - Improved handling of parameters in several functions, ensuring consistent usage of `std::move` where applicable.
      - General code cleanup to adhere to best practices and improve maintainability.
      
      * [Refactor] Integrate Clang-Tidy Checks and Enhance Code Consistency
      
      - Added clang-tidy checks to the format script for improved code quality assurance.
      - Refactored function signatures across multiple files to consistently use `const` references, enhancing performance and clarity.
      - Updated the requirements-lint.txt file to include clang-tidy as a dependency.
      - General code cleanup to adhere to best practices and improve maintainability.
      
      * [CI] Update AMD CI Workflow to Include Build Directory Creation
      
      - Added steps to create a build directory and configure CMake with ROCm support during the format check process.
      - Ensured cleanup of the build directory after the format check to maintain a clean workspace.
      
      * [Refactor] Remove Unused Member Variables in AtomicAddNode and CopyNode
      
      - Removed the `args_` member variable from both `AtomicAddNode` and `CopyNode` classes to streamline the code and eliminate unnecessary data members.
      - This change enhances code clarity and maintainability by focusing on relevant attributes for each class.
      
      * [Refactor] Update Clang-Tidy Integration and Code Improvements
      
      - Modified the format script to include the `-fix` option in the clang-tidy command for automatic code fixes.
      - Refactored the `AtomicAddVectorizePlanner` class to improve variable handling and consistency, including changes to member variable types and function signatures.
      - Enhanced code clarity by removing unnecessary `std::move` calls and ensuring consistent usage of types across the class.
      - General code cleanup to adhere to best practices and improve maintainability.
      
      * [Refactor] Improve Parameter Handling and Consistency in AtomicAddVectorize
      
      - Updated function signatures in `AtomicAddVectorizePlanResult` and `AtomicAddVectorizeRewriter` to use `const` references and `std::move` for better performance and clarity.
      - Enhanced the `UpdateVectorSize` method to accept `const Array<PrimExpr>&` for improved efficiency.
      - General code cleanup to maintain consistency and adhere to best practices.
      
      * [CI] Add Git Submodule Initialization to CI Workflow
      
      - Included a step to initialize and update git submodules recursively in the CI workflow.
      - This change ensures that all necessary submodules are available during the format check process, improving build reliability.
      
      * [CI] Add Git Submodule Update Step to Format Check
      
      - Included a command to initialize and update git submodules recursively in the CI workflow during the format check process.
      - This enhancement ensures that all required submodules are available, contributing to improved build reliability.
      
      * [Refactor] Update Function Signatures in AtomicAddVectorize
      
      - Modified the `VectorizeAtomicAdd` function signature to use `const` references for `thread_var` and `thread_bounds`, enhancing performance and code clarity.
      - This change aligns with previous refactoring efforts to improve parameter handling and consistency across the codebase.
  22. 23 Aug, 2025 1 commit
    • [Refactor] Merge ThreadPartialSync and ThreadStorageSync (#741) · 6b125028
      Lei Wang authored
      * Remove `thread_partial_sync.cc` and refactor `thread_storage_sync.cc` to streamline synchronization handling. Introduce `thread_sync_types.h` for thread-bound key definitions and reserved named barriers. Update related logic in `ThreadSyncInserter` and `TileLangThreadSync` for improved clarity and efficiency.
      
      * Remove `sync_thread_partial` references and related documentation from the codebase. Update CUDA and HIP code generation files to eliminate calls to the removed function. Refactor `__sync_thread_partial` to `sync_thread_partial` in CUDA common header for consistency.
      
      * Remove unused import of `bulk_copy.h` in `codegen_hip.cc` to enhance code clarity and maintainability.
      
      * Add import of `bulk_copy.h` in `codegen_hip.cc` to support new functionality.
      
      * typo fix
      
      * Update data type in reduce_sum tests from float16 to float32 for consistency and clarity. Remove redundant dtype tests and streamline run functions. Enhance reshape kernel compilation with pass configurations to address shared memory layout issues.
      
      * lint fix
      
      * test fix
      
      * Enhance CI configuration by adding verbose output to pip install command for better visibility during installation.
      
      * use ninja instead of make
      
      * Add CMake configuration step for Ninja build system in setup.py
      
      * Update pyproject.toml to include additional build dependencies: build, torch, tox, auditwheel, patchelf, and ninja.
      
      * Enhance CI configuration by adding verbose output to pytest commands for improved test visibility.
      
      * Update pyproject.toml to add Cython as a build dependency. Enhance thread storage synchronization in thread_storage_sync.cc by introducing new thread variable handling and improving index disjointness checks.
      
      * Update data type in cumulative sum tests from float16 to float32 for consistency. Modify run_cumsum function to utilize the updated dtype and enhance result validation with assertions. Adjust test cases accordingly.
      
      * Refactor storage access handling by introducing buffer data mapping in TileLangStorageAccessVisitor. Enhance access entry structure to include pointer access flag. Update thread storage synchronization to accommodate new buffer data mappings. Adjust quickstart example to print kernel source for debugging purposes.
      
      * Refactor linear index conversion in TileLangStorageAccessVisitor to utilize the analyzer for simplification. Update buffer index calculations to ensure consistent simplification of range expressions.
      
      * bugfix
      
      * Refactor buffer index calculation in TileLangStorageAccessVisitor to simplify access handling. Removed unused buffer mapping logic, ensuring consistent buffer index generation with a default ramp.
      
      * Refactor TileLangStorageAccessVisitor to replace buffer indices with buffer ranges for improved pointer access handling. Update AccessEntry structure to include buffer_ranges and adjust thread storage synchronization logic to account for pointer access conflicts.
      
      * Refactor thread storage synchronization to replace 'shared.dyn' with 'shared' for consistency in memory allocation. Update related test cases to reflect this change and ensure proper functionality.
  23. 18 Aug, 2025 1 commit
    • [CI] Fix AMD CI (#729) · a86223f4
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      * Enhance AMD example script and update CI workflows
      
      - Improved the `example_amd_flash_attn_fwd.py` script for better clarity and organization.
      - Added new CI workflows for AMD and documentation publishing.
      - Updated various requirements files to include necessary dependencies.
      - Introduced new test cases and examples for better coverage and functionality.
      - Refactored existing code for improved readability and maintainability.
      
      * Remove redundant tool cache cleanup step in AMD CI workflow
      
      * Remove `torch` dependency from `requirements-rocm.txt` to streamline requirements.
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
  24. 15 Aug, 2025 2 commits
    • [CI][AMD] Add AMD GPU CI and fix some related bugs (#694) · 8e1b88f3
      alex_xiao authored
      
      
      * [Enhancement] Refactor buffer index handling for improved precision and clarity (#668)
      
      - Enhanced buffer index handling to address precision issues by removing redundant operations.
      - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection.
      - Updated related documentation to reflect changes in buffer management practices.
      
      * Remove obsolete test script for AMD example, streamlining the examples directory.
      
      * Remove unused dtype_size variable in AMD example script to streamline code.
      
      * Add input configuration file and update AMD example script for enhanced flexibility
      
      - Introduced a new input.txt file for configurable parameters.
      - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack.
      - Streamlined the main function for better clarity and organization.
      - Added a new test script to facilitate running the example with specified parameters.
      
      * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations
      
      - Deleted input.txt and test.sh files as they are no longer needed.
      - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance.
      - Reintroduced swizzle usage in the kernel for better performance.
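A swizzled shared-memory layout is often built by XOR-ing the column index with the row index, so that threads walking down one column land in different banks. The sketch below only illustrates that idea and is not the layout tilelang generates. The 32-bank figure and the helper names `bank_of`, `swizzled_col`, and `banks_for_column_walk` are assumptions made for the example.

```python
NUM_BANKS = 32  # assumption: 32 banks, each one element wide


def bank_of(row, col, row_stride):
    """Bank hit by element (row, col) of a row-major tile."""
    return (row * row_stride + col) % NUM_BANKS


def swizzled_col(row, col):
    """XOR the column with the row so a column walk spreads across banks."""
    return col ^ (row % NUM_BANKS)


def banks_for_column_walk(col, rows, row_stride, swizzle):
    """Set of banks touched when `rows` threads each read the same column."""
    banks = set()
    for row in range(rows):
        c = swizzled_col(row, col) if swizzle else col
        banks.add(bank_of(row, c, row_stride))
    return banks


# Every row maps column 0 to the same bank without the swizzle.
plain = banks_for_column_walk(0, rows=32, row_stride=32, swizzle=False)
# With the swizzle, the 32 rows are spread over 32 distinct banks.
swizzled = banks_for_column_walk(0, rows=32, row_stride=32, swizzle=True)
```

Because XOR with a fixed value is a bijection, each row still stores every column exactly once; only the position inside the row changes.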
      
      * Refactor AMD example script for FlashAttention-2
      
      - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`.
      - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls.
      - Removed outdated comments and improved code organization for better readability.
      
      * Refactor formatting in AMD FlashAttention example script
      
      - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function.
      - Streamlined the `main` function parameter formatting for consistency.
      - Removed unnecessary blank lines to enhance overall code organization.
      
      * Update example_amd_flash_attn_fwd.py
      
      * Update AMD FlashAttention example and TVM submodule
      
      - Added a new example script `example_amd_flash_attn_fwd_k_block.py` for FlashAttention with K-blocking support.
      - Enhanced `example_amd_flash_attn_fwd.py` by expanding configuration options for block sizes and threads.
      - Updated the TVM submodule to the latest commit for improved functionality.
      - Introduced a new test script `test.sh` to facilitate running the new example with specified parameters.
      
      * Add CI workflow for automated format checking and testing
      
      - Introduced a new GitHub Actions workflow in `amd_ci.yml` to automate format checks and testing for pull requests.
      - The workflow includes steps for setting up a Python environment, running format checks, and executing tests.
      - Removed obsolete example script `example_amd_flash_attn_fwd_k_block.py` and test script `test.sh` to streamline the examples directory.
      
      * Rename CI workflow from "CI" to "AMD CI" for clarity and specificity.
      
      * Update AMD CI workflow to include copying PyTorch, TorchVision, and Torchaudio packages to the virtual environment for improved dependency management.
      
      * Update AMD CI workflow to install pytest directly instead of using requirements-test.txt
      
      * Update AMD CI workflow to remove 'flash-attn' from requirements and install dependencies from requirements-test.txt
      
      * Refactor AMD CI workflow to enhance clarity in removing 'flash-attn' from requirements-test.txt before installation
      
      * Remove Torchaudio package copying from AMD CI workflow to streamline dependency management.
      
      * Refactor AMD CI workflow to remove the format-check job and streamline the build-test process by directly copying PyTorch and TorchVision packages to the virtual environment.
      
      * Add installation of ROCm in AMD CI workflow
      
      - Included a step to execute the `install_rocm.sh` script for improved setup.
      - Removed unnecessary blank line for better readability in the workflow script.
      
      * Remove installation step for ROCm in AMD CI workflow to simplify the setup process.
      
      * Update AMD CI workflow to run specific test file with verbose output instead of all tests.
      
      * Add new tilelang built-in operations for AMD architecture
      
      - Introduced `tvm_mfma`, `tvm_mfma_store`, `tvm_rdna_wmma`, and `tvm_rdna_wmma_store` built-in operations to enhance support for matrix multiplication and storage in tilelang.
      - Each operation is configured with the appropriate number of inputs and marked as opaque in terms of call effects.
      
      * Enhance autotuner configurations and GEMM operations in AMD example
      
      - Updated block sizes and num_split_q parameters in `get_configs` for improved autotuning.
      - Modified `T.gemm` calls in `fast_flashattn` to utilize `GemmWarpPolicy.FullRow`, optimizing performance for matrix multiplications.
      
      * Update autotuner configurations in AMD example for enhanced performance
      
      - Refined block sizes, thread counts, and added new parameters in `get_configs` to optimize autotuning.
      - Adjusted `fast_flashattn` function to incorporate new parameters for panel size and coalesced widths, improving memory access patterns.
      
      * Enhance autotuner configurations and memory handling in AMD example
      
      - Expanded block sizes and thread counts in `get_configs` for improved autotuning capabilities.
      - Updated `fast_flashattn` to utilize a new shared memory allocation strategy, optimizing memory access patterns during GEMM operations.
      
      * Refine autotuner configurations and memory usage in AMD example
      
      - Reduced block sizes and adjusted thread counts in `get_configs` for optimized autotuning.
      - Updated `fast_flashattn` to utilize register fragments for accumulation, minimizing LDS usage and enhancing performance during GEMM operations.
      
      * Update autotuner configurations in AMD example for enhanced performance
      
      - Expanded block sizes and thread counts in `get_configs` to improve autotuning capabilities.
      - Adjusted `num_split_q` and `v_coalesced_width` parameters for better optimization during GEMM operations.
      
      * Enhance autotuner configurations and GEMM operations in AMD example
      
      - Expanded thread counts in `get_configs` to include higher values for improved autotuning.
      - Updated `fast_flashattn` to adjust accumulation logic and ensure proper handling of causal conditions, optimizing performance during matrix multiplications.
      
      * Update AMD CI workflow and remove obsolete test script
      
      - Modified the CI workflow to run on multiple environments: self-hosted, amd, and gpu.
      - Deleted the outdated `test.sh` script from the examples directory, streamlining the project structure.
      
      * Remove TVM subproject from 3rdparty directory
      
      * Refactor configuration generation and accumulation logic in AMD example
      
      - Reformatted the `get_configs` function for improved readability by aligning parameters.
      - Adjusted the `fast_flashattn` function to enhance clarity in the conditional logic for accumulation, ensuring better handling of causal conditions.
      
      * Enhance AMD CI workflow with additional logging and setup steps
      
      - Added echo statements to provide feedback during the CI process, indicating when the environment is running on an AMD GPU, copying necessary packages, and installing requirements.
      - Improved clarity in the workflow by explicitly stating when the project is being installed and when tests are being executed.
      
      * Comment out package copying in AMD CI workflow to prevent potential issues during environment setup
      
      * Update AMD CI workflow to install nightly versions of PyTorch and remove obsolete package copying steps
      
      * Enhance BuildTileLangHIP function by adding whitespace for improved readability
      
      * Refactor kTVMGridConstant definition for clarity and remove unnecessary comment
      
      * Update TVM subproject to latest commit a64a5926a6e59f5417ef2501f9d88b467337cf6a
      
      * lint fix
      
      * Update AMD CI workflow to use requirements-rocm.txt for dependency installation
      
      * fix ci
      
      * Remove dependency on format-check from AMD CI workflow
      
      * fix ci
      
      * fix ci
      
      * fix ci
      
      * Remove format-check job from AMD CI workflow
      
      * Add torch to requirements-rocm.txt and remove explicit pip install commands from AMD CI workflow
      
      * Add dependency on format-check job in AMD CI workflow
      
      * Add format-check job to AMD CI workflow
      
      * Update format-check job in AMD CI workflow to run on self-hosted environment
      
      * Enhance format-check job in AMD CI workflow with improved Python environment setup and automatic commit of lint changes
      
      * Update amd_ci.yml
      
      ---------
      Co-authored-by: xinxyxiao <xinyxiao@amd.com>
      Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
      Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
    • [CI] fix docs ci (#720) · 6545b084
      Wenhao Xie authored
  25. 14 Aug, 2025 1 commit
  26. 10 Aug, 2025 1 commit
    • [Pipeline] Optimize inject software pipeline and pipeline planning pass (#706) · 376ba9eb
      Lei Wang authored
      * Refactor inject_pipeline.cc to improve version handling and add unique producer head tracking
      
      - Updated version check to allow for cases with two or more versions.
      - Adjusted logic to decrement num_versions when multi-versioning is not needed.
      - Introduced a helper function to ensure unique producer heads are added to the commit group.
      - Removed obsolete AddAllocBuffers method to streamline code.
      
      * lint fix
      
      * Refactor pipeline planning logic to enhance copy stage dependency management
      
      - Removed obsolete conditional expression handling from the pipeline planning code.
      - Introduced a new structure to manage copy stage dependency reads, improving clarity and efficiency.
      - Updated logic to correctly identify producer stages for copy stages, ensuring accurate pipeline stage assignment.
      - Added a new block sparse matrix multiplication function in the testing suite to validate the pipeline planning changes.
      
      * Update ci.yml
      
      * Fix structural equality checks in AddUnique and Contains methods to compare buffer references instead of entire regions in pipeline planning.
      
      * Refactor pipeline planning logic to improve copy stage dependency propagation
      
      - Updated structural equality checks in AddUnique and Contains methods to use buffer reference comparison.
      - Enhanced the iteration logic for managing copy stage dependencies, ensuring accurate identification of producer stages.
      - Added safeguards against exceeding maximum iterations during dependency propagation.
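The two ideas in this commit, deduplicating regions by their underlying buffer and bounding the dependency propagation loop, can be sketched as follows. This is an illustrative Python model, not the C++ pass itself: `Buffer`, `Region`, `RegionSet`, `propagate_copy_deps`, and the `max_iters` default are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Buffer:
    name: str


@dataclass(frozen=True)
class Region:
    buffer: Buffer
    start: int
    extent: int


class RegionSet:
    """Keeps at most one region per underlying buffer object."""

    def __init__(self):
        self.regions = []

    def contains(self, region):
        # Compare buffer references, not the whole region.
        return any(r.buffer is region.buffer for r in self.regions)

    def add_unique(self, region):
        if not self.contains(region):
            self.regions.append(region)


def propagate_copy_deps(stages, seed_reads, max_iters=16):
    """Return indices of stages that transitively produce `seed_reads`.

    `stages` is a list of (reads, writes) pairs of Region lists.
    """
    deps = RegionSet()
    for region in seed_reads:
        deps.add_unique(region)
    producers = set()
    for _ in range(max_iters):
        changed = False
        for idx, (reads, writes) in enumerate(stages):
            if idx in producers:
                continue
            if any(deps.contains(w) for w in writes):
                producers.add(idx)
                for region in reads:
                    deps.add_unique(region)
                changed = True
        if not changed:
            return producers
    # Safeguard: stop instead of looping forever.
    raise RuntimeError("dependency propagation did not converge")


A, B, C, D, E = (Buffer(n) for n in "ABCDE")
stages = [
    ([Region(A, 0, 4)], [Region(B, 0, 4)]),  # stage 0: A -> B
    ([Region(B, 0, 4)], [Region(C, 0, 4)]),  # stage 1: B -> C
    ([Region(D, 0, 4)], [Region(E, 0, 4)]),  # stage 2: unrelated
]
# The copy stage reads only part of C; buffer-level matching still finds stage 1.
producers = propagate_copy_deps(stages, [Region(C, 2, 2)])
```

Matching on the buffer rather than the full region means a copy that reads a sub-range of `C` is still linked to the stage that writes a different range of `C`.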
  27. 08 Aug, 2025 1 commit
    • [Layout] Introduce a new layout inference mechanism (#699) · 407117e1
      Lei Wang authored
      
      
      * Implement new free stage layout inference.
      
      * Fix bug
      
      * Make replication upcasting and unnormalizable iterators safe.
      
      * Better handling of updating with more replicas
      
      * Remove unnecessary check.
      
      * Fix compilation.
      
      * Fix setup.py.
      
      * Simplify development mode.
      
      * Allow ParallelOp layout when there's already a compatible layout specified
      
      * lint fix
      
      * Add ProveFragmentContains function to validate thread access between small and large fragments
      
      This function checks if the threads accessing elements of a smaller fragment are a subset of those accessing a larger fragment, ensuring valid access during updates. The implementation includes deriving thread indices, computing logical indices, and verifying thread mappings.
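The containment condition described above can be illustrated with a brute-force check. The real function reasons about layouts symbolically; the sketch below instead enumerates every index of the small fragment, and every name in it (`fragment_contains`, the thread-mapping callables, the 4x8 example layout) is a hypothetical stand-in.

```python
from itertools import product


def fragment_contains(small_threads, large_threads, small_shape, to_large_index):
    """True if every thread touching small[idx] also touches the matching large element."""
    for idx in product(*(range(n) for n in small_shape)):
        if not small_threads(idx) <= large_threads(to_large_index(idx)):
            return False
    return True


def large_threads(idx):
    # Assumed layout: a 4x8 fragment where element (i, j) lives on thread i * 8 + j.
    i, j = idx
    return {i * 8 + j}


def small_one_lane(idx):
    # Small fragment held only by the first lane of each row.
    (i,) = idx
    return {i * 8}


def small_replicated(idx):
    # Small fragment replicated across all 8 lanes of a row.
    (i,) = idx
    return set(range(i * 8, i * 8 + 8))


def first_column(idx):
    # Map small index (i,) to large index (i, 0).
    return (idx[0], 0)


ok = fragment_contains(small_one_lane, large_threads, (4,), first_column)
bad = fragment_contains(small_replicated, large_threads, (4,), first_column)
```

In this model the single-lane fragment passes, while the replicated one fails because seven of its eight threads never touch element `(i, 0)` of the large fragment.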
      
      * Update dependencies in requirements files
      
      * Remove 'thefuzz' from requirements-dev.txt
      * Specify exact versions for 'torch' and add 'flash_attn' in requirements-test.txt
      
      * Update CI workflow to use SHA256 hash for requirements file
      
      * Update requirements and CI workflow for flash attention
      
      * Removed specific version for 'torch' in requirements-test.txt
      * Added installation of 'flash_attn==2.5.8' in CI workflow to ensure compatibility
      
      * Refactor flash attention import handling in examples
      
      * Removed availability checks for 'flash_attn' in multiple example scripts.
      * Simplified import statements for 'flash_attn' to ensure consistent usage across examples.
      
      ---------
      Co-authored-by: Huanqi Cao <caohuanqi@deepseek.com>
  28. 30 Jul, 2025 1 commit
    • [CI] Update CI workflow to use Python 3.12 (#679) · eb026b79
      Lei Wang authored
      * Update CI workflow to use Python 3.12 and enable build isolation for pip installations
      
      - Changed the Python version in the CI configuration from 3.9 to 3.12 to ensure compatibility with the latest features and improvements.
      - Updated the `PIP_NO_BUILD_ISOLATION` environment variable from `0` to `1` in the CI configuration, allowing pip to install testing requirements with build isolation enabled, which enhances the installation process during CI runs.
      
      * Update CI workflow to trigger on pull requests instead of pull_request_target
      
      - Changed the event trigger in the CI configuration from `pull_request_target` to `pull_request` to ensure the workflow runs on pull requests, enhancing the integration process.
      
      * Refactor CI workflow to remove unnecessary repository and token settings
      
      - Removed the repository and token parameters from the checkout step in the CI configuration, simplifying the workflow setup and improving security by not exposing sensitive information.
      
      * Remove pip install command from CI workflow to streamline installation process
      
      * Refactor reshape functions and tests for shared memory operations
      
      - Renamed and updated `reshape_test_smem` to `reshape_test_smem_1d_2_2d` and `run_reshape_smem` to `run_reshape_smem_1d_2_2d` for clarity.
      - Introduced a new reshape function `reshape_test_smem_2d_2_1d` and its corresponding runner `run_reshape_smem_2d_2_1d`.
      - Updated tests to reflect the new function names and added a test for the 2D to 1D reshape functionality, enhancing test coverage and clarity.
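A reference check for the two reshape directions might look like the following. It is a plain-Python sketch of row-major reshaping, not the tilelang kernels or their tests, and the helper names are hypothetical.

```python
def reshape_1d_to_2d(flat, rows, cols):
    """Row-major reshape of a flat list into `rows` lists of `cols` items."""
    if len(flat) != rows * cols:
        raise ValueError("element count does not match the target shape")
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


def reshape_2d_to_1d(grid):
    """Row-major flattening of a list of equally sized rows."""
    return [value for row in grid for value in row]


flat = list(range(6))
grid = reshape_1d_to_2d(flat, rows=2, cols=3)
round_trip = reshape_2d_to_1d(grid)
```

Reshaping in one direction and then the other should return the original data, which makes the round trip a convenient property to assert.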