- 09 Oct, 2025 7 commits
-
-
dependabot[bot] authored
Bumps [astral-sh/setup-uv](https://github.com/astral-sh/setup-uv) from 6 to 7. - [Release notes](https://github.com/astral-sh/setup-uv/releases) - [Commits](https://github.com/astral-sh/setup-uv/compare/v6...v7 ) --- updated-dependencies: - dependency-name: astral-sh/setup-uv dependency-version: '7' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by:
dependabot[bot] <support@github.com> Co-authored-by:
dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
dependabot[bot] authored
Bumps [actions/setup-python](https://github.com/actions/setup-python) from 2 to 6. - [Release notes](https://github.com/actions/setup-python/releases) - [Commits](https://github.com/actions/setup-python/compare/v2...v6 ) --- updated-dependencies: - dependency-name: actions/setup-python dependency-version: '6' dependency-type: direct:production update-type: version-update:semver-major ... Signed-off-by:
dependabot[bot] <support@github.com> Co-authored-by:
dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
Wenhao Xie authored
-
Xiaoyu Zhang authored
-
Xuehai Pan authored
-
Shawn Liu authored
-
Xuehai Pan authored
* chore: add .editorconfig * feat: enable dependabot for GHA workflows
-
- 07 Oct, 2025 3 commits
-
-
Yichen Yan authored
* Reset * Fix other CUDA issue * fmt * fmt * fix cuda error * fix * fix * fmt * cleanup * fix * remove copyright * trivial update * readme update * lint fix --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com>
-
Xiaoyu Zhang authored
* unify nvrtc check style * unify nvrtc check style * unify nvrtc check style
-
Lei Wang authored
- Introduced new functions for buffer load copy with stride and parallel execution. - Enhanced the copy logic in `copy.py` to simplify nested if statements for BufferLoad nodes. - Added corresponding test cases for the new buffer load functionalities.
-
- 06 Oct, 2025 3 commits
-
-
Cunxiao Ni authored
* [Profiler]Adds CUPTI profiler support * format * rafactor cupti profiler * format * rafactor * rafactor * fix lint * fix lint * refactor * add profiler tests --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
Zhichen Zeng authored
* Add sparse mla bwd example * add bwd into test * Update README with bwd impl * comment * format fix * lint fix * fwd fix --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
Tong WU authored
* revert split+sum template for MHA backward * lint * Update example_mha_bwd.py * Update example_mha_bwd_wgmma_pipelined.py --------- Co-authored-by:Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 05 Oct, 2025 3 commits
-
-
Lei Wang authored
* tma disable * int64 cast fix.
-
Lei Wang authored
[Example] Introduce split+sum template, and optimize `atomic_add` performance for bwd examples (#940) * example fix * lint fix * bug fix * reduce test size.
-
Cunxiao Ni authored
* [Example] Fix lint to improve grouped GEMM performance with TMA * fix lint
-
- 04 Oct, 2025 3 commits
-
-
Tong WU authored
* [Enhancement] Enhance the GQA backward kernel by calculating `dq` and `dv` via copy&sum * [Example] Implement GQA backward example for Hopper with customized tiling and pipeline * [Example] Add relevant tests * Fix all typos of wrong shape of `V_shared` in macros
-
Lei Wang authored
-
lijinpei authored
* [Example] Optimize online_softmax example - Y should be output in float16. - BN needs to be equal to N to be really online. - On my H100 machine, this increase speedup from 1.424x to 2.788x. * enhance --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
- 02 Oct, 2025 2 commits
-
-
Zhiwen Mo authored
* Implements tcgen05.ld instruction support for copying from shared.tmem to local.fragment on SM100/Blackwell architecture. Adds layout inference and lowering logic for tensor memory operations with proper physical coordinate range analysis and warpgroup alignment checks. Changes: - Add kTMemLoad and kTMemStore to CopyInst enumeration - Implement CheckTMemLoad() and CheckTMemStore() validation functions - Add LowerTmemCopy() to generate tcgen05.ld/st/cp PTX intrinsics - Add tmem layout inference in InferLayout() using expandTcgen05Layout - Support multiple instruction variants (32dp32b/64b/128b/256b) - Add physical layout bounds analysis for tmem coordinates - Change clear_accum from bool to PrimExpr in GEMM operations - Fix std::optional access checks in layout_inference.cc - Add tmem_allocate/deallocate PTX intrinsic support - Fix cooperative_groups grid.sync() code generation * fix * pipeline fix * bug fix * bool fix
-
Lei Wang authored
* [Layout] Add IsCompletedReplicated method and enhance layout inference in ParallelOpNode - Introduced IsCompletedReplicated method in FragmentNode to check if a buffer is fully replicated. - Enhanced InferLayout in ParallelOpNode to handle layout inference for replicated buffers, ensuring only fragment[0] access is allowed. - Updated error handling for non-zero index access in fragment buffers to improve robustness. * [Layout] Improve code formatting and readability in layout.cc and parallel.cc - Enhanced formatting in FragmentNode's IsCompletedReplicated method for better clarity. - Updated InferLayout method in ParallelOpNode to improve code readability by adjusting line breaks and indentation. - Ensured consistent formatting across conditional statements and comments for improved maintainability. * updt * optimize const index related op * bug fix * reduce gdn test * test fix * lintfix * lint fix * test fix
-
- 01 Oct, 2025 5 commits
-
-
Wenhao Xie authored
-
Yu Cheng authored
-
Lei Wang authored
* Update requirements and refactor benchmark script for deepseek_nsa example - Updated the requirements.txt to specify a fixed commit for the flash-linear-attention repository. - Refactored import paths in benchmark_nsa_fwd.py for better organization. - Added a new function to generate configurations for autotuning. - Modified the tilelang_sparse_attention function to accept parameters for block size, number of stages, and threads, enhancing flexibility. - Changed allocation of shared memory for accumulators to optimize performance. * Refactor import paths in dequantization examples to use dequantize_utils - Updated import statements in multiple dequantization example scripts to replace references to the removed utils.py file with the new dequantize_utils module. - Ensured consistency across example scripts for better organization and maintainability.
-
M.D_v2.5 authored
In cases where JITKernel.artifact is None, it'll spit error - ``` 2025-10-01 01:06:18 [TileLang:tilelang:ERROR]: Error saving kernel source code to disk: 'NoneType' object has no attribute 'kernel_source' ``` Looking at properties of JITKernel, it seems that `JITKernel.kernel_source` is a better way to achieve this. Ref: https://github.com/tile-ai/tilelang/blob/main/tilelang/jit/kernel.py#L453-L455 Co-authored-by:
Dylan <miaoding.dai@plus.ai>
-
Tong WU authored
* [Cache] Add compile_flags parameter to KernelCache hash keys * [Cache] Update compile_flags parameter to accept both List[str] and str types * lint * [Refactor] Update compile_flags parameter to accept Union[List[str], str] type
-
- 30 Sep, 2025 3 commits
-
-
botbw authored
* [CI] optimize CI time * [CI] fix transpose && format * [misc] apply coderabbit suggestions && fix typo
-
Lei Wang authored
[Example] Specify a fixed commit for the flash-linear-attention repository and optimize nsa examples (#913) - Updated the requirements.txt to specify a fixed commit for the flash-linear-attention repository. - Refactored import paths in benchmark_nsa_fwd.py for better organization. - Added a new function to generate configurations for autotuning. - Modified the tilelang_sparse_attention function to accept parameters for block size, number of stages, and threads, enhancing flexibility. - Changed allocation of shared memory for accumulators to optimize performance.
-
Wenhao Xie authored
-
- 29 Sep, 2025 8 commits
-
-
Lei Wang authored
* Remove unused `fp8_mqa_logits.py` file and update README.md to reflect new directory structure and file descriptions for deepseek_v32 example. Added sections for architecture overview, Lightning Indexer, Top-k Selector, and Sparse MLA Forward implementations. * Update linting configurations and improve code formatting in deepseek_v32 example scripts - Added per-file ignores for the inference directory in `pyproject.toml`. - Refactored code in `topk_selector.py`, `convert.py`, `generate.py`, `kernel.py`, and `model.py` to enhance readability by adjusting spacing and line breaks. - Ensured consistent formatting across function definitions and assertions for better clarity. * Refactor test functions in deepseek_v32 example scripts for improved clarity and consistency - Updated `fp8_lighting_indexer.py` to define a dedicated test function for the lighting indexer. - Refactored `sparse_mla_fwd_pipelined.py` and `sparse_mla_fwd.py` to standardize test function parameters and improve readability. - Enhanced `topk_selector.py` by introducing a test function with parameters for batch size and sequence length. - Ensured all test functions are invoked correctly in the main execution block. * Enhance test functions in deepseek_v32 example scripts with CUDA requirements and parameterization - Added CUDA requirements decorators to `test_example_sparse_mla_fwd` and `test_example_sparse_mla_fwd_pipelined`. - Parameterized test functions to use specific small shapes for testing, improving test coverage and clarity. * lint fix * Update README.md to correct image path for DeepSeek V3.2 architecture diagram
-
Wenxuan Tan authored
* fix flops comp and softmax scale * format
-
Lei Wang authored
-
Wenhao Xie authored
* [Typo] Fix backend name for Huawei Ascend chips * update
-
Lei Wang authored
* Update README.md to include directory structure and file descriptions for deepseek_v32 example * Refactor and clean up deepseek_v32 example scripts - Removed unused imports and functions from `fp8_mqa_logits.py` to streamline the code. - Improved formatting and readability in `sparse_mla_fwd_pipelined.py` and `sparse_mla_fwd.py` by adjusting function signatures and indentation. - Added `# ruff: noqa` comments to suppress linting warnings in multiple files. - Enhanced the `generate_random_cu_seqlens` function in `utils.py` for better clarity and organization. - Updated print statements for consistency in output formatting.
-
Wenhao Xie authored
-
Lei Wang authored
* [Refactor] Enhance CopyNode Lower method to support disable_tma flag and improve flash attention implementation * Updated the CopyNode Lower method to correctly include the disable_tma flag in the GetCopyInst call. * Refactored the flash attention implementation to selectively disable TMA for specific copy operations while allowing it for others. * Addressed linting issues for improved code quality * sparse mla kernels * Remove deprecated sparse MLA and utility files to streamline the codebase.
-
Jiaxing Ding authored
-
- 28 Sep, 2025 2 commits
-
-
Tong WU authored
* Fix CopyNode Lower method to include disable_tma flag in GetCopyInst call * Refactor flash attention implementation to disable TMA for specific copy and allow TMA for other operations * attempt to fix lint
-
Zhiwen Mo authored
* update sm100 related utcmma, tmem, ld/st256 in src * update sm100 related utcmma, tmem, ld/st256 in tilelang * Remove deprecated GEMM examples and related README documentation for SM100 architecture support * Update GEMM implementation to replace UTCMMA with TCGEN5MMA across relevant files * Remove gemm_umma.py example and update README to reflect TCGEN5MMA terminology changes * Update README.md for gemm_sm100 example by removing outdated API sections and streamlining documentation * Update README and source files to reflect TCGEN5.MMA terminology changes * Refactor CUDA GEMM header for improved readability
-
- 26 Sep, 2025 1 commit
-
-
Lei Wang authored
[Layout] Introduce Flexible Parallel to Support T.serial and local buffers inside T.Parallel loop (#844) * Support T.serial and local buffers inside T.Parallel loop. * Fix reducer layout in T.Parallel nested inside other loops * Debug output with LOG(INFO) * Add disable option for WGMMA. * fix * Use DLOG; fix missing registration for new pass config * bug fix * lint fix * Enhance GEMM instruction set with UTCMMA and improve local buffer handling in casting example * Update format.sh shebang, improve logging in layout inference, and enhance buffer store wrapper with detailed comments * Enhance GEMM instantiation logic and improve layout inference for local buffer detection - Updated the GEMM instantiation logic to include a check for WGMMA compatibility, ensuring that the conditions for using WGMMA are more robust. - Refined the layout inference process to better identify when loops manipulate only local buffers, improving the accuracy of thread binding decisions in parallel loops. --------- Co-authored-by:Huanqi Cao <caohuanqi@deepseek.com>
-