- 12 Dec, 2025 1 commit
Lei Wang authored
- 30 Sep, 2025 1 commit
botbw authored
* [CI] optimize CI time
* [CI] fix transpose && format
* [misc] apply coderabbit suggestions && fix typo
- 30 Jul, 2025 1 commit
Siyuan Feng authored
**Summary of part of the rebase PR:**

1. **Support `T.thread_return()` → CUDA `return` syntax**

   Added support for translating `T.thread_return()` to CUDA's native `return` statement.

2. **Dynamic type support for function inputs**

   Functions now accept dynamically typed parameters using `typing`:

   ```python
   dyn_type = T.int32 or T.float

   @T.prim_func
   def main(
       a: dyn_type,
   )
   ```

3. **Device function codegen**

   Added support for generating `__device__` functions in CUDA:

   ```python
   @I.ir_module
   class Module:
       @T.prim_func(private=True)
       def add(a: T.int32, b: T.int32) -> T.int32:
           return a + b

       @T.prim_func
       def main(
           A: T.Buffer((128, 128), "int32"),
           B: T.Buffer((128, 128), "int32"),
           C: T.Buffer((128, 128), "int32"),
       ):
           T.func_attr({"global_symbol": "main"})
           length: T.int32 = Module.add(64, 64)  # Host call
           for bx in...
   ```
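The per-thread early exit that `T.thread_return()` lowers to can be pictured with a plain-Python analogue. This is an illustration of the semantics only, not TileLang API: `kernel_body` and `launch` are hypothetical names, and the sequential loop stands in for a CUDA thread grid.

```python
def kernel_body(tid, n, src, dst):
    # In a TileLang kernel this guard would call T.thread_return(),
    # which the codegen now lowers to CUDA's native `return`.
    if tid >= n:
        return  # out-of-bounds thread exits early
    dst[tid] = src[tid] * 2

def launch(n_threads, n, src, dst):
    # Emulate a 1-D grid: every "thread" runs the same kernel body.
    for tid in range(n_threads):
        kernel_body(tid, n, src, dst)

src = [1, 2, 3]
dst = [0, 0, 0, 0]
launch(4, 3, src, dst)  # thread 3 hits the guard and returns early
```

After the launch, `dst` holds `[2, 4, 6, 0]`: the fourth slot is untouched because its "thread" returned before the store.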
- 23 Jul, 2025 1 commit
Wenhao Xie authored
* fix CI bugs in hopper
* lint fix
* Update bulk_copy.cc
* Refactor bulk copy logic in LowerBulkCopy function
  - Removed unnecessary blank lines for improved code readability.
  - Enhanced stride validation by checking for null pointers in global stride calculations, ensuring robustness against symbolic strides.
  - Updated pass configuration handling in dynamic tile language tests to streamline dynamic alignment and TMA lower pass settings.
* test fix
* ci fix
* Update flash-attention dependencies and clean up example code
  - Downgraded the `flash-attn` dependency version in `requirements-test.txt` to `<=2.2.0`.
  - Removed unused imports and commented-out code in various example files to enhance readability and maintainability.
  - Updated the `flashattn` function signature to include default parameters for `block_M`, `block_N`, `num_stages`, and `threads`.
  - Cleaned up the `example_mha_fwd_varlen.py` and `example_mha_bwd_wgmma_pipelined.py` files by removing unnecessary comments and improving code clarity.
  - Deleted the `example_mha_inference.py` file as it is no longer needed.
* Update CI workflow to remove `--user` flag from pip install commands
  - Removed the `--user` flag from the pip install commands in both the development and testing sections of the CI workflow to ensure proper installation of dependencies in the virtual environment.
* Update CI workflow to include `--no-user` flag in pip install commands
  - Added the `--no-user` flag to the pip install commands in both the development and testing sections of the CI workflow to ensure dependencies are installed correctly within the virtual environment.
* Update CI workflow to include `--no-user` flag in pip install command for wheel mode
  - Added the `--no-user` flag to the pip install command in the wheel mode section of the CI workflow to ensure dependencies are installed correctly within the virtual environment.
* test fix
* avoid conflict with system environments
* test fix
* add comments

---------

Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
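The venv-isolation fix described in the CI bullets above can be sketched as a minimal install step. This is a config fragment under assumptions: the venv name `ci-venv` is hypothetical, while `requirements-test.txt` and pip's `--no-user` flag come from the commit message.

```shell
# Create and activate the job's own virtual environment.
python -m venv ci-venv
. ci-venv/bin/activate

# --no-user forces pip to install into the active venv even when a
# user site-packages directory would otherwise take precedence,
# avoiding conflicts with the system environment.
pip install --no-user -r requirements-test.txt
```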
- 03 Jul, 2025 1 commit
botbw authored
* [experimental] add a draft gemm_sp
* [3rdparty] bump cutlass to v3.9.3
* [lint] run format.sh
* [chore] rebase
* [chore] use abs path
* [gemm_sp] add metadata layout
* [ci] add more example
* [lint] run format.sh
* [chore] polish
* [chore] move gemm_sp to experimental
* [chore] polish
* [lint] run format.sh
* [Enhancement] Improve bulk copy handling and update GEMM sparse tensor test
  - Added a warning log for unsupported non-swizzled global layouts in the bulk copy operation, ensuring fallback to normal copy.
  - Refactored the GEMM sparse tensor test by removing unnecessary imports and simplifying the kernel compilation process.
  - Updated the test to directly call the `run_gemm_sp` function, enhancing clarity and functionality.
* Implement test
* [Enhancement] Update GEMM SP and SM89 templates for improved functionality
  - Refactored GEMM SP computation to enhance warp partitioning logic, ensuring compatibility with Hopper architecture.
  - Updated layout inference to support new WGMMA conditions and improved error messaging for unsupported targets.
  - Modified SM89 templates to utilize new MMA atom structures, enhancing performance and compatibility with fp8 types.
  - Added conditional inclusion for the GEMM SP header based on CUDA architecture version.
* lint fix
* [gemm_sp] support more layouts and data types
* [Enhancement] sync T.gemm_sp's layout inference with T.gemm
* [Enhancement] support more block_k in compress util
* [Enhancement] enable block_k=64
* [Lint] run format.sh
* [Enhancement] compressor supports more dtypes
* [Enhancement] enable block_K=32
* [Lint] format.sh
* [Fixbug] fix shape
* [Refactor] sync gemm
* [Enhancement] enable transpose
* [Enhancement] enable fp8_e4m3
* [Enhancement] enable int8
* [Lint] run format.sh
* [Benchmark] add gemm_sp benchmark
* [Example] fix 256-thread hang
* [CI] fix ci
* [Chore] resolve gemini feedback
* [Benchmark] increase search space
* [Lint] format
* [CI] skip sparse tensor core related tests, as only sm90 is supported
* [CI] pass local run
* Update gemm_sm89.h
* lint fix
* lint fix
* [Enhancement] Add support for sparse GEMM and initialize CUDA architecture flags
  - Introduced a new boolean flag `enable_sparse_gemm_` to control the inclusion of sparse GEMM functionality in CUDA code generation.
  - Updated the `Finish` method to conditionally include the sparse GEMM header based on the new flag.
  - Implemented logic in `VisitStmt_` to enable sparse GEMM when the corresponding external call is detected.
  - Added a function to initialize the `TORCH_CUDA_ARCH_LIST` environment variable based on the target compute version, enhancing compatibility with PyTorch.
  - Refactored the initialization function into the appropriate module and ensured it is called in the sparse utilities module.
* Update test_compress_utils.py

---------

Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
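The "compress util" and "metadata layout" mentioned in the commit above revolve around the 2:4 structured-sparsity format that sparse tensor cores consume: each group of four consecutive values keeps only two, plus small index metadata recording which positions survived. The NumPy sketch below illustrates that format only; the function name and the metadata encoding are assumptions, not TileLang's actual compressor.

```python
import numpy as np

def compress_2_4(a: np.ndarray):
    """Keep the 2 largest-magnitude values in each group of 4 along the
    last axis; return the compressed values and the index metadata."""
    assert a.shape[-1] % 4 == 0, "last dim must be a multiple of 4"
    groups = a.reshape(-1, 4)
    # Positions of the two largest |values| per group, kept in
    # ascending order so the compressed layout stays ordered.
    idx = np.argsort(-np.abs(groups), axis=1)[:, :2]
    idx.sort(axis=1)
    vals = np.take_along_axis(groups, idx, axis=1)
    return vals.reshape(*a.shape[:-1], -1), idx.astype(np.uint8)

a = np.array([[1.0, 0.0, -3.0, 0.5, 0.0, 2.0, 0.0, -1.0]])
vals, meta = compress_2_4(a)
# vals -> [[ 1., -3.,  2., -1.]]; each metadata row lists the two
# surviving positions within its group of four.
```

Real hardware packs the metadata into 2-bit indices per kept value; the dense `uint8` array here is just the readable equivalent of that layout.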