Commits · 2b1f5990653bd5767a39100d3670c0f37d89f5b7 · OpenDAS / tilelang

"git@developer.sourcefind.cn:dadigang/Ventoy.git" did not exist on "e73ac04cb6eb86ca6e08c1f9c47506d2ee003e67"

09 Nov, 2025 1 commit

[Bugfix] Enhane LetStmt Handling in Pipeline Transform (#1212) · 85218bd9

Lei Wang authored Nov 09, 2025

* [Enhancement] Introduce LetWrapper for handling loop variable substitutions in pipeline rewriting

* Added LetWrapper struct to encapsulate variable and value pairs for loop variable substitutions.
* Updated PipelineRewriter to accept a vector of LetWrapper instances, allowing for proper handling of Let statements that depend on the pipeline loop variable.
* Enhanced the BuildPipeline method to incorporate LetWrapper instances into rewritten blocks, ensuring correct substitutions during pipeline execution.
* Refactored logic for processing Let statements to differentiate between those that use the loop variable and those that do not, improving the flexibility of the pipeline transformation.

* Refactor lambda expression for clarity in loop variable usage check in inject_pipeline.cc

* [Test] Add regression test for loop variable handling in kernel compilation

* Introduced a new test case to verify correct handling of loop variables in the kernel compilation process, addressing a regression issue with InjectSoftwarePipeline.
* The test ensures that the loop variable is not left as a free variable, which previously caused failures in MakePackedAPI.
* Configurations are set to disable warp specialization and TMA lowering to align with the original issue reproduction.

* Remove unused import in regression test for loop variable handling in kernel compilation

85218bd9

08 Nov, 2025 1 commit

[Enhancement] Improve handling of negative indices for ramp and broadcast node (#1207) · 918a21bd

Lei Wang authored Nov 09, 2025

* [Enhancement] Improve handling of negative indices in legalize_negative_index pass

* Added logic to handle scalar and vector indices separately, enhancing the ability to determine non-negativity and negativity of indices.
* Introduced detailed logging for cases where non-negativity cannot be proven, improving debugging capabilities.
* Refactored index state determination for vector types, including support for Ramp and Broadcast nodes.

* Fix incorrect lane handling in legalize_negative_index pass by dereferencing lanes to obtain the correct integer value.

* Enhance legalize_negative_index pass by including necessary header for TIR operations. This addition supports improved functionality and maintainability of the transformation logic.

918a21bd

07 Nov, 2025 1 commit

[Bugfix] Improves the accuracy of dependency analysis in the storage access (#1205) · c8ec3469

Lei Wang authored Nov 07, 2025

* Refactor storage access visitor in TileLang to improve readability and maintainability. Organized includes, enhanced comments, and preserved access summaries during condition evaluations in IfThenElse statements. Adjusted handling of buffer accesses and thread invariance checks for better clarity.

* lint fix

c8ec3469

06 Nov, 2025 1 commit
- [Feat] Add A Pass to Handle Negative Index (#1192) · 0592834f
  Kurisu authored Nov 06, 2025
  
  0592834f
05 Nov, 2025 1 commit

[Langauge] Support n>256 for v2 (#1182) · b66a93c5

Lei Wang authored Nov 05, 2025

* fix

* lint fix

* fix

* lint fix

* fix

* upd

* support n>256

* Remove unnecessary pass configurations for fast math in MHA forward BHSD latency script.

* lint fix

* lint fix

b66a93c5

04 Nov, 2025 1 commit

[Refactor] Improve Python3.9 compatibility for ParamSpec and Self (#1190) · 7d961892

Lei Wang authored Nov 04, 2025

* [Feature] Enhance fill operation to support various buffer types

- Added support for `BufferLoad` in the `fill` function to handle different buffer types.
- Updated `Fill` class to process region descriptors and buffer regions, improving flexibility in buffer handling.
- Introduced checks for static bounds in region definitions to ensure safety during operations.
- Refactored loop induction variable handling in `FillNode` to accommodate sliced regions.

* lint fix

* [Refactor] Improve Python compatibility for ParamSpec and Self

- Added compatibility handling for ParamSpec and Self to support Python versions below 3.10 and 3.11 respectively.
- Updated type annotations across multiple files to ensure consistent usage of typing features.

* [Update] Require Python 3.9 and enhance type annotations

- Updated the minimum required Python version from 3.8 to 3.9 in `pyproject.toml`.
- Removed references to Python 3.8 in classifiers.
- Changed type annotations from `int | None` to `Optional[int]` in multiple example files for better clarity and compatibility.
- Improved import statements to use `collections.abc` for `Iterable` and `contextlib` for `AbstractContextManager` in relevant files.

* [Refactor] Update import statements to enhance type annotations

- Replaced imports from `typing` with `collections.abc` for `Iterable` and `Mapping` in relevant files to improve compatibility and clarity.
- Updated the caching decorator from `functools.lru_cache` to `functools.cache` for better performance in the C++ compiler retrieval function.
- Adjusted import statements in the language proxy file to maintain consistency in type annotations.

* disable rocm rs nt test.

* lint fix

7d961892

03 Nov, 2025 1 commit
- [Fix] fix type imcompatible error in #1115 (#1180) · 4ef94f22
  Kurisu authored Nov 04, 2025
```
* Fix incompatible floordiv in packed api

* fix lint
```
  4ef94f22
02 Nov, 2025 1 commit

[Language] Expose `T.warpgroup_fence_operand` for nvcc code motion (#986) · aef0a6bb

Lei Wang authored Nov 03, 2025



* remove debug print

* pipeline fix

* use the correct buffer access scope

* rs support

* warp warpgroup_fence_operand

* fix

* fp8 dtype ptx enhance

* mma fix

* TCGEN05 Interface

* tcgen05 support

* rebase

* update

* Enhance TCGEN05 support by adding new intrinsic operations and descriptors. Introduced `ptx_tcgen05_mma_ts` for tensor-memory to shared-memory instructions and `tcgen05_mma_arrive` for signaling barrier completion. Updated existing descriptors and code generation logic to accommodate these changes, ensuring compatibility with new instruction sets. Refactored related allocation functions and improved handling of shared memory descriptors.

* lint fix

* Refactor buffer reference handling in CUDA code generation and update test execution in tilelang. Ensure default annotations for unrolling are set correctly in TIR IR module.

* wgmma fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

aef0a6bb

31 Oct, 2025 2 commits

[Bugfix] Enable code lowering with producer‑copy‑only program (#1168) · 7a80b6df

Lei Wang authored Oct 31, 2025

* bugfix

* lint fix

* Enhance warp group register allocation to handle missing consumer bodies gracefully. Updated logic to annotate producer side when consumer is absent, ensuring robustness in degenerate warp-specialized patterns.

* Refactor VisitExpr_ method in inject_tma_barrier.cc for improved readability. Adjusted formatting and spacing for clarity in barrier handling logic.

* Update barrier handling in inject_tma_barrier.cc to accommodate newly appended entries. Adjusted the size of the replace vector to ensure it covers the full needed length, and modified the logic for appending barriers based on the updated replace conditions.

7a80b6df

[FFI] Rebase tvm to v0.22.0 to utilize tvm-ffi (#1108) · 10911e28

Lei Wang authored Oct 31, 2025



* 3rdparty tvm bump

* bump tvm into v0.22.0

* lint fix

* rebase tvm

* Update submodule tvm to latest commit 3085bc4

* Refactor: Update configuration retrieval in CopyNode and adjust test registration in tilelang

* test fix

* add requirement

* atomic_fix

* atomic_fix

* phaseout py39

* optimize

* optimize

* lint fix

* do not clean cache

* do not clean cache

* [Minor] Minor update for Python versions and dependencies

* [Lint] fix lint for py39

* [Lint] fix lint for ROCm

* [Build][CI] Sync CI changes from upstream/sdist

* [Lint] fix lint for ROCm

* [Build][CI] Update `repair-wheel-command`

* [Minor] update abi3audit result format

* [Lint] fix lint for ROCm

* [BugFix] fix build

* [Lint] fix lint for ROCm

* [BugFix] set rpath for libtvm and libtvm_runtime

* [Deps] pin apache-tvm-ffi version

* [Build] set Python 3.9 Limited API for Cython target

* [Build] set Python 3.9 Limited API for Cython target

* [Deps] Restore Python 3.8 support

* [Build] use `apache-tvm-ffi`'s `libtvm_ffi`

* [BugFix] use `;` as delimiter for RPATH on macOS

* [BugFix] use `--ignore-missing-dependencies` for `delocate-wheel`

* [Build] support `sccache` if available

* [Build] add CIBW import test

* [Build][CI] enable ccache for CIBW on Linux

* [BugFix] set rpath for libtvm and libtvm_runtime

* Revert "[Build][CI] enable ccache for CIBW on Linux"

This reverts commit cd9ab57bb5ddd2572c60bcbbebde81480a658fd3.

* [CI] fix perfbench bot

* [BugFix] use Python 3.9 to build wheel

* [Minor] update perfbench bot envs

* [BugFix] fix CIBW environment on Linux

* [CI] skip import test on CentOS 7

* [CI] use Python urllib to download file instead of Wget

---------
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>

10911e28

29 Oct, 2025 2 commits

[Bugfix] Enhance LetStmt handling in Vectorize Loop Pass (#1159) · 79730b11

Lei Wang authored Oct 30, 2025

* [Refactor] Enhance TLVectorizer with loop vectorization convenience method and improve let variable handling

* lint fix

* let test fix

* lint fix

79730b11

[Enhancement] Enhance Cast operations Vectorization (#1156) · feef9ef6
LJC00118 authored Oct 29, 2025
```
* Enhance Cast vectorized

* Add Parallel vectorized cast test

* code lint

* merge newest commit
```
feef9ef6

28 Oct, 2025 1 commit
- [Bugfix] Implement classic arena algorithm for shmem merge and WAW conflict detection (#1146) · f7ba45d8
  Lei Wang authored Oct 29, 2025
```
* atomic_fix

* atomic_fix

* mem fix

* lint fix

* add some comments

* fix

* fix

* lint fix

* handle async copy

* lint fix
```
  f7ba45d8
27 Oct, 2025 2 commits
- [Bugfix] Correctly construct the argument list for atomic add based on the vector size (#1137) · 7d389a43
  Lei Wang authored Oct 28, 2025
```
* atomic_fix

* atomic_fix
```
  7d389a43
- [Enhancement] Add missing `fence_barrier_init` primitive after mbarrier init (#1121) · 17a63976
  Yu Cheng authored Oct 27, 2025
```
* [Enhancement] Add missing  primitive after mbarrier init

* lint
```
  17a63976
24 Oct, 2025 1 commit
- [Bugfix] Resolve mixed stride dtype issue (inconsistent int32/int64 values) (#1119) · 65c4711f
  Lei Wang authored Oct 24, 2025
```
* fix int32 dtype issue

* lint fix

* lint

* lint fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
```
  65c4711f
23 Oct, 2025 1 commit

[Refactor] Improve scalar handling in CopyNode and update loop partition dtype logi (#1111) · 86c8bb46

Lei Wang authored Oct 23, 2025

* [Refactor] Improve scalar handling in CopyNode and update loop partition dtype logic

* Refactored CopyNode::MakeSIMTLoop to handle scalar cases more efficiently by moving the scalar check to the end of the function.
* Updated loop_partition.cc to set a default DataType for thread and vector extents, ensuring compatibility when loop_vars_ is empty.

* lint fix

* remove debug print

86c8bb46

22 Oct, 2025 1 commit

[CI][Lint] Retire `format.sh` and add `clang-tidy` to GHA workflow (#1044) · 5683e6a6

Xuehai Pan authored Oct 22, 2025

* [Lint] Retire `format.sh` and add `clang-tidy` to GHA workflow

* chore: update clang-tidy settings

* chore: upgrade clang-format and clang-tidy version

* lint: resolve clang-tidy errors

* [Maint] restore format.sh

* [CI] pre-commit autoupdate

* [Minor] fix `command -v` usage

5683e6a6

21 Oct, 2025 4 commits

[Bugfix] Fix missing host cuTensorMapEncodeIm2col call (#1094) · 5cb5c068
Yu Cheng authored Oct 22, 2025

5cb5c068

[Language] Support tilelang `alloc_var(dtype, init=x)` (#1092) · bddb125e

Lei Wang authored Oct 21, 2025

* - carry existing local-var initializer map into OpaqueBlockLower, reattach it to
    generated Allocates and the PrimFunc attrs
  - thread the map through FlattenBuffer and StorageRewrite so flattened/merged
    allocations keep their tl.local_var_init annotations
  - teach annotation handling to accept scalar initializers, resolve buffers, and merge
    with existing stat

* lint fix

* enhance

* lint fix

* lint fix

bddb125e

[PassConfig] Introduce PassConfig `TL_STORAGE_REWRITE_DETECT_INPLACE` (#1089) · cdc67fc4

Lei Wang authored Oct 21, 2025

* • Enable configurable StorageRewrite inplace detection

  - Add kStorageRewriteDetectInplace constant and register the flag with PassContext so C++ code no longer hard-codes the key.
  - Wire StorageRewrite to include TileLang builtin constants and honor the new config toggle when deciding inplace reuse.
  - Document the flag across Python surfaces (PassConfigKey, JIT/autotuner docs) with usage guidance and simplified IR examples.

* lint fix

* add test

* lint fix

cdc67fc4

[BugFix] Add memory order argument for non-vectorized atomic add (#1081) · 1d4b7180

Zhengju Tang authored Oct 21, 2025

* [BugFix] Add memory order argument for non-vectorized atomic add

* [Lint]

* [BugFix] Memory order

* [Lint]

* [BugFix] Argument in cuda template

* [Lint]

1d4b7180

20 Oct, 2025 4 commits

[Enhancement] Update async intrinsic handling in inject_fence_proxy (#1068) · bb8b3cd7

Tong WU authored Oct 21, 2025



* [Enhancement] Update async intrinsic handling in inject_fence_proxy

* Added support for wgmma async intrinsics in IsAsyncIntrinsic function.
* Changed handling of unknown externs to treat them as Generic instead of Async, improving accuracy in proxy kind determination.

* test fix

* Update testing/python/transform/test_tilelang_transform_inject_fence_proxy.py
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>

bb8b3cd7

[Bugfix] Fix missing reg alloc in custom warp specialization (#1084) · f8d3e73e
Yu Cheng authored Oct 21, 2025

f8d3e73e
[Layout] Utilizing IsEqual instead of StructuralEqual (#1073) · 6a388c0e
Lei Wang authored Oct 20, 2025

6a388c0e

[Parallel] Support `T.Parallel` with dynamic extents (#990) · 27701c3d

Lei Wang authored Oct 20, 2025

* Allow dynamic extents in loop partition; warn when layout inversion falls back to NoCheck

* add test and introduce predicate

* test fix

* fix

* enhance

* inverse with level

* test fix

* bug fix

27701c3d

17 Oct, 2025 2 commits

[Refactor] Refactor Pass `LegalizeSafeMemoryAccess` to support recursive load/store rewrite (#1050) · 72111642

Chaofan Lin authored Oct 17, 2025



* [Refactor] Refactor Pass  to support recursive load/store rewrite

* lint

* recursive collect conds for call_extern

* fix name

* [Lint]: [pre-commit.ci] auto fixes [...]

* lint

* [Lint]: [pre-commit.ci] auto fixes [...]

* lint

* [Lint]: [pre-commit.ci] auto fixes [...]

* address comment

* rename pad_value to safe_value

* lint

* add oob store test

* [Lint]: [pre-commit.ci] auto fixes [...]

* fix

* fix

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

72111642

[Enhancement] Introduce a workaround for layout inference for local buffer store (#1055) · 278c0fbf

Lei Wang authored Oct 17, 2025



* [Enhancement] Improve layout inference for local buffer handling in parallel operations

* Added logic to check if a loop only manipulates "local" buffers, which affects thread binding decisions.
* Updated the condition for determining parallel loop execution to account for local buffer stores.
* Cleaned up comments for clarity and future considerations.

* [Refactor] Clean up parallel loop condition formatting in layout inference

* Reformatted the condition for determining parallel loop execution for better readability.
* Maintained existing logic while enhancing code clarity for future modifications.

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>

278c0fbf

16 Oct, 2025 1 commit
- [Feature]: Add test for atomicadd auto vectorize and remove useless code (#1019) · 0ff4f427
  Yuqi Dong authored Oct 16, 2025
```
* update

* format

* rabbit
```
  0ff4f427
15 Oct, 2025 2 commits
- [Refactor] Use `has_simt_copy` to decide whether to insert `set_max_nreg` (#982) · bd1c7b39
  Yu Cheng authored Oct 16, 2025
  
  bd1c7b39
- [TIR] Revert some changes of Pass `LowerIntrin` (#1035) · e5399527
  Lei Wang authored Oct 15, 2025
```
* keep >> instead of /

* re think replicate

* lint fix

* handle const int buffers

* rep fix

---------
Co-authored-by: Zhiwen Mo <zm125@ic.ac.uk>
```
  e5399527
14 Oct, 2025 1 commit
- [Transform] Migrate `LowerIntrin` from tvm into tilelang (#999) · 7a5077e4
  Lei Wang authored Oct 14, 2025
```
* Donot lower ceildiv to >>

* lint fix

* test fix

* fallback ceildiv changes
```
  7a5077e4
13 Oct, 2025 1 commit
- [Bugfix] Fix atomicadd auto vectorize identify var error (#883) · 340bfc50
  Yuqi Dong authored Oct 13, 2025
```
* update

* update

* update

* update
```
  340bfc50
11 Oct, 2025 1 commit

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group... · ddfaac36

Lei Wang authored Oct 11, 2025

[Refactor] Refactor Pass `InjectFenceProxy` and expose some warp group primitives in frontend (#977)

* • InjectFenceProxy docs and tests

  - annotate proxy fence injector with context comments for async/generic detection
  - add compiler internals doc covering the pass mechanics and link it in docs index
  - repair fence proxy test by fixing descriptor init usage and fence counter logic

* do not consider call_extern as async.

* doc update.

* reduce test size for sparse mla

ddfaac36

10 Oct, 2025 3 commits

[Bugfix] Fix dummy kernel compliation (#962) · 7913fb1d

Chaofan Lin authored Oct 10, 2025



* [Bugfix] Fix visit EvaluateNode in BufferGemmCollector

* address comment

* lint

* fix

* Add TileLang SplitHostDevice pass and tighten issue 830 test names

* lint fix

* enhance for kernel value unpacking.

---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>

7913fb1d

[CI] add `pre-commit` integration (#955) · 8fe35402

Xuehai Pan authored Oct 10, 2025



* chore: misc cleanup

* feat: add pre-commit config

* chore: update lint dependencies

* style: fix lint issues

* feat: add pre-commit hooks

* fix: fix typos

* chore: update .gitattributes

* [Lint]: [pre-commit.ci] auto fixes [...]

* docs: update CONTRIBUTING.md

* chore: update default venv name

* chore: revert and exclude CUDA files

---------
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

8fe35402

[Bugfix] Do not force inline let stmt (#947) · f8ae600c

Lei Wang authored Oct 10, 2025

* remove debug print

* Remove inline let expressions from the LowerAndLegalize function in phase.py

* add test

* Update sparse MLA examples to support SKV adjustment and correctness checks

- Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
- Added check_correctness parameter to test functions for validation of outputs.
- Updated test cases to reflect new SKV values and correctness checks.

* reduce test shape

* Update documentation structure and refactor main function parameters in example_fusedmoe_tilelang.py

- Added a new section for compiler internals in the documentation.
- Refactored the main function in example_fusedmoe_tilelang.py to accept parameters for hidden dimensions, expert configurations, and batch/sequence sizes, improving flexibility and readability.

* Update buffer access checks in merge_shared_memory_allocations.cc

- Changed the condition for buffer access from less than (<) to less than or equal to (<=) to allow access at the same scope level.
- Adjusted the logic for determining the access level when touching buffers to ensure correct handling of scope levels.

* lint fix

* Support pipeline with LetStmt

* lint fix

* • Fix LowerTileOp let handling to avoid LetInline dependency

  - inline let-bound BufferLoad nodes via resolver helpers and structured return
  - remap layouts/buffers using original data vars and only rewrite when needed
  - update pipeline planner to understand let-bound address_of buffers
  - document the new inline behaviour in docs/let_inline_fix.md

* fix for wgmma pipeline with let binding

* lint fix

* test fix

* reduce smem usage.

* let binding enhancement

* fix for dpgm

* fix simplify

* lint fix

* use tilelang.Simplify instead of tir.Simplify

* • Add TL_FORCE_LET_INLINE pass config and gate eager LetInline usage

  - register the new config in builtin headers/registration
  - add helper to pipeline enabling LetInline based on pass context
  - document LetStmt inlining controls and usage

f8ae600c

09 Oct, 2025 1 commit

[TileOp] Implement WGMMA for T.gemm_v2 (#813) · a13cde28

Lei Wang authored Oct 10, 2025

* [Feature] Introduce WGMMA support and enhance GEMM layout handling

- Added support for the WGMMA intrinsic in the TileLang framework, enabling efficient matrix multiplication on newer architectures.
- Refactored GEMM layout functions to accept a boolean parameter for K dimension handling, improving flexibility in layout generation.
- Updated layout inference logic to accommodate new WGMMA configurations and ensure compatibility with existing GEMM operations.
- Enhanced Python bindings for layout functions, allowing for better integration and usability in user-defined operations.
- Improved documentation for layout functions and GEMM operations to clarify usage and parameters.

These changes enhance the performance and usability of GEMM operations, particularly for advanced architectures, while maintaining backward compatibility with existing implementations.

* [Refactor] Clean up code formatting and enhance layout function readability

- Improved code formatting across multiple files for better readability, including consistent indentation and line breaks.
- Updated layout function signatures to enhance clarity, particularly in `gemm_layouts.cc`, `layout.cc`, and `layout.h`.
- Refactored lambda functions in `builtin.cc` and `gemm_py.cc` for improved structure and maintainability.
- Enhanced comments and documentation in layout-related files to clarify usage and parameters.

These changes contribute to a cleaner codebase and improved maintainability of layout functions in the TileLang framework.

* [Feature] Add descriptor initialization and offset manipulation for WGMMA

- Introduced new TileLang builtins `initialize_descriptor` and `increase_descriptor_offset` to facilitate descriptor management for WGMMA operations.
- Updated `builtin.cc` and `builtin.h` to define and document the new builtins, enhancing the framework's capabilities for descriptor handling.
- Modified `codegen_cuda.cc` and `ptx.cc` to integrate the new builtins into the code generation process, ensuring proper assembly generation for WGMMA operations.
- Enhanced the `GemmWGMMA` class to utilize the new descriptor functionalities, improving the efficiency of matrix multiplication operations.
- Updated related tests and documentation to reflect the new features and ensure comprehensive coverage.

These changes enhance the TileLang framework's support for advanced matrix operations on newer architectures, improving performance and usability.

* [Refactor] Improve code formatting and readability in various files

- Enhanced code formatting across multiple files for better readability, including consistent indentation and line breaks.
- Updated function signatures and comments in `builtin.h`, `codegen_cuda.cc`, and `ptx.cc` to improve clarity.
- Refactored descriptor initialization and offset manipulation functions in `builtin.py` and `wgmma_macro_generator.py` for improved structure.
- Cleaned up unnecessary whitespace and improved alignment in `common.h` and `allocate.py`.

These changes contribute to a cleaner and more maintainable codebase in the TileLang framework.

* [Update] Update subproject commit and refactor layout function call

- Updated the subproject commit for `cutlass` to indicate a dirty state.
- Refactored the `UpdateAnalyzer` function in `layout.cc` to call `LayoutNode::getVarMap()` instead of `getVarMap()`, improving clarity and ensuring proper context for variable mapping.

These changes enhance the maintainability and clarity of the layout handling in the TileLang framework.

* support more data types

* gemm_rs support

* lint fix

* wgmma wrapper

* Remove debug logging for wgmma assembly code and refactor swizzle byte size calculations in wgmma macro generator. Enhanced handling of leading and stride byte offsets based on swizzle mode, improving clarity and performance in tensor core intrinsic emissions.

* Refactor GEMM layout functions to replace 'kfactor' with 'k_inner' for improved clarity and consistency. Update includes necessary changes in error messages for Hopper and Sm100 layouts. Additionally, include a new header for CUTE utilities in common.h.

* Comprehensively support WGMMA GEMM SS

* remove debug print

* lint fix

* remove debug print

* reduce bwd test shape

* lint fix

* clear cache for pytest

* lint fix

* Update sparse MLA examples to support SKV adjustment and correctness checks

- Changed SKV parameter from 32768 to 8192 in sparse MLA backward and forward tests.
- Added check_correctness parameter to test functions for validation of outputs.
- Updated test cases to reflect new SKV values and correctness checks.

* test fix

* adjust test case

* test fix

* skip some test currently

a13cde28

05 Oct, 2025 1 commit
- [Example] Disable TMA and enable FastMath for NSA Examples (#941) · 3aecab8f
  Lei Wang authored Oct 05, 2025
```
* tma disable

* int64 cast fix.
```
  3aecab8f
02 Oct, 2025 1 commit

[Bugfix] Fix tensor memory copy layout (#933) · 5ccac4fa

Zhiwen Mo authored Oct 03, 2025

* Implements tcgen05.ld instruction support for copying from shared.tmem
  to local.fragment on SM100/Blackwell architecture. Adds layout inference
  and lowering logic for tensor memory operations with proper physical
  coordinate range analysis and warpgroup alignment checks.

  Changes:
  - Add kTMemLoad and kTMemStore to CopyInst enumeration
  - Implement CheckTMemLoad() and CheckTMemStore() validation functions
  - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics
  - Add tmem layout inference in InferLayout() using expandTcgen05Layout
  - Support multiple instruction variants (32dp32b/64b/128b/256b)
  - Add physical layout bounds analysis for tmem coordinates
  - Change clear_accum from bool to PrimExpr in GEMM operations
  - Fix std::optional access checks in layout_inference.cc
  - Add tmem_allocate/deallocate PTX intrinsic support
  - Fix cooperative_groups grid.sync() code generation

* fix

* pipeline fix

* bug fix

* bool fix

5ccac4fa