- 17 Oct, 2025 7 commits
-
-
Chaofan Lin authored
* [Refactor] Refactor Pass to support recursive load/store rewrite * lint * recursive collect conds for call_extern * fix name * [Lint]: [pre-commit.ci] auto fixes [...] * lint * [Lint]: [pre-commit.ci] auto fixes [...] * lint * [Lint]: [pre-commit.ci] auto fixes [...] * address comment * rename pad_value to safe_value * lint * add oob store test * [Lint]: [pre-commit.ci] auto fixes [...] * fix * fix --------- Co-authored-by:pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Lei Wang authored
* [Enhancement] Improve layout inference for local buffer handling in parallel operations * Added logic to check if a loop only manipulates "local" buffers, which affects thread binding decisions. * Updated the condition for determining parallel loop execution to account for local buffer stores. * Cleaned up comments for clarity and future considerations. * [Refactor] Clean up parallel loop condition formatting in layout inference * Reformatted the condition for determining parallel loop execution for better readability. * Maintained existing logic while enhancing code clarity for future modifications. --------- Co-authored-by:Zhiwen Mo <zm125@ic.ac.uk>
-
LJC00118 authored
* improve CUDA compiler detection in CMake * Minor fix
-
Lei Wang authored
-
LJC00118 authored
* remove last dimension stride must be 1 constraint * add vectorize test * minor fix * [Lint]: [pre-commit.ci] auto fixes [...] --------- Co-authored-by:pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Lei Wang authored
-
Tong WU authored
[Enhancement] Add support for symbolic dimensions in Cython kernel adapter and improve static shape validation in wrapper (#1024) * [Enhancement] Add support for symbolic dimensions in Cython kernel adapter and improve static shape validation in wrapper * [BugFix] Fix shape mismatch and deprecate `T.if()` in fused_moe example * [Fix] Add `is_symbolic_expr` function to check for symbolic expressions in TIR - Introduced a new utility function `is_symbolic_expr` to determine if an expression is a symbolic expression, enhancing type checking capabilities. - Updated shape handling in `CythonKernelAdapter` to utilize the new function, improving handling for symbolic shapes.
-
- 16 Oct, 2025 4 commits
-
-
Xuehai Pan authored
* [CI] fix ROCm CI * feat: add a hook to error out on no test runs
-
Lei Wang authored
[Bugfix] Improves compatibility when checking for MPS availability in different PyTorch builds. (#1051)
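The commit message above does not show the check itself. One common way to make such a probe tolerant of PyTorch builds that lack the MPS backend is to look up each attribute defensively. The helper below is a hypothetical sketch (the name `mps_is_available` and its signature are assumptions, not the project's code). It takes the `torch` module as an argument so that it runs even where PyTorch is not installed.

```python
from typing import Any


def mps_is_available(torch_module: Any) -> bool:
    """Return True only if the given torch-like module reports a usable MPS backend.

    Hypothetical helper: every attribute is looked up with getattr so that
    builds without `backends.mps` yield False instead of raising.
    """
    backends = getattr(torch_module, "backends", None)
    mps = getattr(backends, "mps", None)
    is_available = getattr(mps, "is_available", None)
    if not callable(is_available):
        return False
    try:
        return bool(is_available())
    except Exception:
        # Treat any failure inside the probe as "not available".
        return False
```

With PyTorch installed, this would be called as `mps_is_available(torch)` after `import torch`.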
-
Yichen Yan authored
-
Yuqi Dong authored
* update * format * rabbit
-
- 15 Oct, 2025 8 commits
-
-
Yu Cheng authored
-
Tong WU authored
* [BugFix] Phase out dependency on Triton in sink examples to make CI happy - Added `benchmark_gqa_sink_fwd.py` and `benchmark_mha_sink_fwd.py` to evaluate performance of GQA and MHA attention mechanisms using Triton. - Refactored existing attention sink implementations to remove Triton kernel definitions from the reference programs, streamlining the code. - Updated input generation and benchmarking logic to enhance configurability and performance measurement. - Improved overall structure and organization of the examples for better clarity and usability. * [Lint]: [pre-commit.ci] auto fixes [...] --------- Co-authored-by:pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Xuehai Pan authored
* refactor: merge test CI workflow files into one * chore: set `UV_INDEX_STRATEGY=unsafe-best-match` * feat: add AST test with Python 3.8 * feat: implement manual caching mechanism for self-hosted runners * refactor: simplify cache logic for self-hosted runners * chore: clear uv cache on failure * chore: print format.sh output to logs * chore: improve uv caching * chore: disable parallel test * chore: use `PYTHONDEVMODE=1` in CI * feat: enable coredump generation * fix: fix perfbench condition * Revert "feat: enable coredump generation" This reverts commit c52da65cb572932e09905d08c43a39ec3cf47c54. * chore: move example CI down * Revert "chore: move example CI down" This reverts commit 9d8e65055e01d955c5268a9a6705d270c2de0d57. * chore: skip example `test_example_mha_sink_bwd_bhsd` * chore: skip example `test_example_gqa_sink_bwd_bhsd` * fix: fix example argument passing * fix: loosen test criteria * chore: rename `CMAKE_CONFIGURE_OPTIONS` -> `CLANG_TIDY_CMAKE_OPTIONS` for clarity * feat: enable parallel testings * chore: update pytest options * remove skipped test as now been resolved * chore: empty commit to re-trigger ci * test for n 1 * chore: remove ` --numprocesses=1` option in example * chore: disable failfast * chore: update cibw selection * fix: fix git submodule clone * chore: update cibw commands * fix: fix yapf multiprocessing * chore: setup ccache for CIBW on macOS only * chore: update comments * chore: update artifact listing * fix: do not fail if not found nvcc in PATH * fix: fix flash-attn installation * chore: update dist workflow trigger * chore: remove outdated comments * chore(workflows/dist): simplify build matrix strategy * fix: fix CUDA path finding * fix: fix CUDA path finding * chore: increase CI timeout * ci: disable failfast * fix: hide path prefix * chore: more verbose * chore: disable PR trigger for dist workflow * fix: seed for tests * fix: use nightly torch for ROCm tests * chore: enable PR trigger for dist workflow * chore: stop uploading debug wheels as artifacts in PR * chore: do not run workflows in forks * chore: housekeep requirements * chore: use Nightly-ROCm-6.3 for CI * chore: use Nightly-ROCm-6.4 for CI * Update ROCm toolkit version to 7.0 * chore: restore previous rocm-ci.yml for test * fix: cleanup PYTHONPATH * chore: remove previous rocm-ci.yml * ci fix * chore: remove previous rocm-ci.yml * chore: enable parallel example run --------- Co-authored-by: LeiWang1999 <leiwang1999@outlook.com> Co-authored-by: alex_xiao <xinyuxiao2024@gmail.com>
-
alex_xiao authored
* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668) - Enhanced buffer index handling to address precision issues by removing redundant operations. - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. - Updated related documentation to reflect changes in buffer management practices. * Remove obsolete test script for AMD example, streamlining the examples directory. * Remove unused dtype_size variable in AMD example script to streamline code. * Add input configuration file and update AMD example script for enhanced flexibility - Introduced a new input.txt file for configurable parameters. - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack. - Streamlined the main function for better clarity and organization. - Added a new test script to facilitate running the example...
-
Lei Wang authored
* Expose CUDA warp/lane intrinsics in TileLang frontend * generalize warp indexing intrinsics and add coverage * [Lint]: [pre-commit.ci] auto fixes [...] --------- Co-authored-by:pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
LJC00118 authored
* Remove an incorrect check * add fp8 pack function * code lint * minor fix * minor fix * minor fix * Minor fix * Minor fix
-
Lei Wang authored
-
Lei Wang authored
* keep >> instead of / * re think replicate * lint fix * handle const int buffers * rep fix --------- Co-authored-by:Zhiwen Mo <zm125@ic.ac.uk>
-
- 14 Oct, 2025 8 commits
-
-
Lei Wang authored
-
Lei Wang authored
* recover flex parallel process * lint fix --------- Co-authored-by:Zhiwen Mo <zm125@ic.ac.uk>
-
Tong WU authored
* [Enhancement] Update abs function for half_t and bfloat_t to use cutlass implementation * [Lint]: [pre-commit.ci] auto fixes [...] * optimize amd ci --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
-
Cunxiao Ni authored
* [CI] Removes debug print statements from the example. * add parse args * [Lint]: [pre-commit.ci] auto fixes [...] * format --------- Co-authored-by:pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Lei Wang authored
* chained assignments * test update * [Lint]: [pre-commit.ci] auto fixes [...] --------- Co-authored-by:pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Yichen Yan authored
* Load libs from build dir, if present, to support faster rebuild. * typo * upd * refine check * md lint
-
Xuehai Pan authored
Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
Lei Wang authored
* Do not lower ceildiv to >> * lint fix * test fix * fallback ceildiv changes
-
- 13 Oct, 2025 4 commits
-
-
Cunxiao Ni authored
* [CI] Removes redundant environment variable: removes the `UV_INDEX_URL` * trigger CI * trigger CI * trigger CI * trigger CI
-
Yichen Yan authored
* cleanup * init * build first wheel that may not work * build cython ext * fix tvm build * use sabi * update rpath to support auditwheel * pass editable build * update ci * fix warnings * do not use ccache in self host runner * test local uv cache * test pip index * update lib search to respect new lib location * fix * update ci * enable cuda by default * update src map * fix * fix * fix * Generate version with backend and git information at build time * copy tvm_cython to wheels * fix tvm lib search * fmt * remove unused * auto detect ccache * add back backend-related files * remove jit cython adaptor to simplify code * fmt * fix ci * ci fix 2 * ci fix 3 * workaround metal * ci fix 4 * fmt * fmt * Revert "ci fix 4" This reverts commit d1de8291c3e40927955f3ad3cf87a75c78813676. * tmp * fix metal * trivial cleanup * add detailed build-time version for cuda * add back mlc * Restore wheel info and other trivial updates * update * fix cuda * upd * fix metal ci * test for ga build * test for nvidia/cuda * test ubuntu 20 * fix * fix * Do not use `uv build` * fix * fix * log toolchain version * merge wheel * update * debug * fix * update * skip rocm * update artifacts each * fix * fix * add mac * fix cache * fix cache * fix cache * reset and add comment * upd * fix git version * update deps * trivial update * use in-tree build dir and install to src to speedup editable build * Revert "use in-tree build dir and install to src to speedup editable build" This reverts commit 6ab87b05c5eed811210136b8dca4fc3677dd51f2. * add build-dir * update docs * remove old scripts * [1/n] cleanup scripts * [Lint]: [pre-commit.ci] auto fixes [...] * fix and update * wait for tvm fix * revert some tmp fix * fix * fix * spell * doc update * test cibuildwheel * fix and test macos on ci * Update .github/workflows/dist.yml Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com> * fix * test ga event * cleanup * bump tvm to support api3 * test final version * add cron * Update .github/workflows/dist.yml Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com> * fix * test ccache for metal cibuildwheel * test newer macos * finish --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com> Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
-
Lei Wang authored
-
Yuqi Dong authored
* update * update * update * update
-
- 12 Oct, 2025 3 commits
-
-
Yuqi Dong authored
* [Refactor]: Add support for torch versions lower than 2.6.0 * update
-
Zhengju Tang authored
* [BugFix] Robust gemm policy for sparse_mla_fwd in Hopper and Ada Lovelace architectures * [Lint]
-
Degeneracy-Evil authored
* [Bugfix] Add NVIDIA HPC SDK support in CUDA detection (#974) Enhanced CUDA detection to recognize NVIDIA HPC SDK installations: - Added path check for nvhpc in nvcc binary path - Added fallback scan for default nvhpc paths: /opt/nvidia/hpc_sdk/Linux_x86_64 - Maintained backward compatibility with standard CUDA installations Verification: - Tested on Ubuntu 24.04 with NVIDIA HPC SDK 25.7 - Confirmed detection works without manual CUDA_HOME or CUDA_PATH setting Fixes #974 * [Bugfix] Fix CUDA home detection logic * [Bugfix] Safely handle None cuda_home during CUDA detection Adds a check for None before validating the CUDA home path to prevent errors when the path is not set. * [Bugfix] Fix CUDA detection edge cases in nvhpc support (#974) - Improved nvhpc path detection logic - Added None check for cuda_home to avoid crashes - Maintained existing CUDA installation compatibility Fixes #974 * chore: rerun CI --------- Co-authored-by:NaNExist <138002947+NaNExist@users.noreply.github.com>
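The detection order described in this commit (explicit environment variables, a path derived from the `nvcc` binary with special handling for nvhpc, then a fallback scan of the default HPC SDK root, with `None` handled throughout) can be sketched roughly as follows. This is not the project's implementation. The function name `find_cuda_home`, its parameters, and the assumed `<root>/<version>/compilers/bin/nvcc` and `<root>/<version>/cuda` layout are illustrative assumptions. Only the default root path comes from the commit message.

```python
import os
from typing import Mapping, Optional

# Default HPC SDK root quoted in the commit message.
NVHPC_DEFAULT_ROOT = "/opt/nvidia/hpc_sdk/Linux_x86_64"


def find_cuda_home(
    env: Mapping[str, str],
    nvcc_path: Optional[str],
    nvhpc_root: str = NVHPC_DEFAULT_ROOT,
) -> Optional[str]:
    """Return the first existing CUDA home candidate, or None (hypothetical sketch)."""
    candidates = []

    # 1. An explicit setting is tried first.
    explicit = env.get("CUDA_HOME") or env.get("CUDA_PATH")
    if explicit:
        candidates.append(explicit)

    # 2. Derive a candidate from the nvcc binary, if one was found.
    if nvcc_path is not None:
        bin_dir = os.path.dirname(nvcc_path)
        if "hpc_sdk" in nvcc_path or "nvhpc" in nvcc_path:
            # Assumed layout: <root>/<version>/compilers/bin/nvcc
            # with the toolkit under <root>/<version>/cuda.
            version_dir = os.path.dirname(os.path.dirname(bin_dir))
            candidates.append(os.path.join(version_dir, "cuda"))
        else:
            # Standard layout: <cuda_home>/bin/nvcc.
            candidates.append(os.path.dirname(bin_dir))

    # 3. Fallback scan of the HPC SDK root.
    if os.path.isdir(nvhpc_root):
        for version in sorted(os.listdir(nvhpc_root)):
            candidates.append(os.path.join(nvhpc_root, version, "cuda"))

    # Only directories that actually exist are accepted.
    for candidate in candidates:
        if os.path.isdir(candidate):
            return candidate
    return None
```

In practice the first two arguments would likely be `os.environ` and `shutil.which("nvcc")`. Passing them in explicitly keeps the sketch easy to exercise against a temporary directory tree.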
-
- 11 Oct, 2025 6 commits
-
-
Yu Cheng authored
* [Feature][Example] Support TMA reduce operation and update GQA bwd example * move GQA bwd with TMA reduce to new example * [Lint]: [pre-commit.ci] auto fixes [...] --------- Co-authored-by:pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
-
Lei Wang authored
* remove debug print * pipeline fix * use the correct buffer access scope
-
Lei Wang authored
-
Lei Wang authored
[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend (#977) * InjectFenceProxy docs and tests - annotate proxy fence injector with context comments for async/generic detection - add compiler internals doc covering the pass mechanics and link it in docs index - repair fence proxy test by fixing descriptor init usage and fence counter logic * do not consider call_extern as async. * doc update. * reduce test size for sparse mla
-
Lei Wang authored
* feat: add parser overrides for local.var aug assign. * lint fix
-
Lei Wang authored
* support cumsum-1d * cumsum 1d support
-