- 11 Aug, 2025 1 commit
-
-
FeiyangChen authored
* gemm_with_stride sm89 * fix offset issue * bug fix * format * sm80 support * add sm90 * add testing * format * add static_assert for wgmma * Enhance error message for inner_box_dim validation in LowerBulkCopy * lint fix --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
- 10 Aug, 2025 2 commits
-
-
Zhengju Tang authored
* [MXFP4] Dequantize FP4 kernel example, MX scale todo * [BugFix] Fix the bug of fp4&fp16 exponential bias * [MXFP4] Add group scale factor for BF16xMXFP4 gemm * [Lint] * [Test] Add test script for BF16xMXFP4 gemm * [Lint] * [BugFix] Fix the shape of scale tensor * Update example_dequant_gemm_fp4_hopper.py --------- Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com> Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
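The FP4 dequantization and group scale factor mentioned in this commit can be illustrated with a minimal scalar sketch. It assumes the standard MX element formats: E2M1 for the 4-bit values (exponent bias 1) and E8M0 for the shared scale (a power of two with bias 127). The helper names and the low-nibble-first packing order are assumptions made for illustration; this is not the kernel from the example script.

```python
def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 value: 1 sign, 2 exponent, 1 mantissa bit, bias 1."""
    sign = -1.0 if nibble & 0x8 else 1.0
    exp = (nibble >> 1) & 0x3
    man = nibble & 0x1
    if exp == 0:
        # Subnormal: no implicit leading one, so the magnitude is 0 or 0.5.
        return sign * 0.5 * man
    # Normal: implicit leading one, exponent bias of 1.
    return sign * (1.0 + 0.5 * man) * 2.0 ** (exp - 1)


def decode_e8m0(scale_byte: int) -> float:
    """Decode an E8M0 shared scale: a pure power of two with bias 127.

    The NaN encoding (0xFF) is not handled in this sketch.
    """
    return 2.0 ** (scale_byte - 127)


def dequantize_group(packed: bytes, scale_byte: int) -> list[float]:
    """Dequantize packed FP4 values that share one group scale.

    Assumption: the low nibble of each byte holds the earlier element.
    """
    scale = decode_e8m0(scale_byte)
    out = []
    for byte in packed:
        out.append(decode_e2m1(byte & 0xF) * scale)
        out.append(decode_e2m1(byte >> 4) * scale)
    return out
```

Getting the exponent bias wrong shifts every normal value by a factor of two, which is the kind of error the "exponential bias" bug fix in this commit refers to.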
-
Lei Wang authored
* Refactor inject_pipeline.cc to improve version handling and add unique producer head tracking - Updated version check to allow for cases with two or more versions. - Adjusted logic to decrement num_versions when multi-versioning is not needed. - Introduced a helper function to ensure unique producer heads are added to the commit group. - Removed obsolete AddAllocBuffers method to streamline code. * lint fix * Refactor pipeline planning logic to enhance copy stage dependency management - Removed obsolete conditional expression handling from the pipeline planning code. - Introduced a new structure to manage copy stage dependency reads, improving clarity and efficiency. - Updated logic to correctly identify producer stages for copy stages, ensuring accurate pipeline stage assignment. - Added a new block sparse matrix multiplication function in the testing suite to validate the pipeline planning changes. * Update ci.yml * Fix structural equality checks in AddUnique and Contains methods to compare buffer references instead of entire regions in pipeline planning. * Refactor pipeline planning logic to improve copy stage dependency propagation - Updated structural equality checks in AddUnique and Contains methods to use buffer reference comparison. - Enhanced the iteration logic for managing copy stage dependencies, ensuring accurate identification of producer stages. - Added safeguards against exceeding maximum iterations during dependency propagation.
-
- 08 Aug, 2025 3 commits
-
-
Lei Wang authored
* Implement new free stage layout inference. * Fix bug * Make replication upcasting and unnormalizable iterators safe. * Better handling of updating with more replica * Remove unnecessary check. * Fix compilation. * Fix setup.py. * Simplify development mode. * Allow ParallelOp layout when there's already a compatible layout specified * lint fix * Add ProveFragmentContains function to validate thread access between small and large fragments This function checks if the threads accessing elements of a smaller fragment are a subset of those accessing a larger fragment, ensuring valid access during updates. The implementation includes deriving thread indices, computing logical indices, and verifying thread mappings. * Update dependencies in requirements files * Remove 'thefuzz' from requirements-dev.txt * Specify exact versions for 'torch' and add 'flash_attn' in requirements-test.txt * Update CI workflow to use SHA256 hash for requirements file * Update requirements and CI workflow for flash attention * Removed specific version for 'torch' in requirements-test.txt * Added installation of 'flash_attn==2.5.8' in CI workflow to ensure compatibility * Refactor flash attention import handling in examples * Removed availability checks for 'flash_attn' in multiple example scripts. * Simplified import statements for 'flash_attn' to ensure consistent usage across examples. --------- Co-authored-by:Huanqi Cao <caohuanqi@deepseek.com>
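The containment check that `ProveFragmentContains` is described as performing can be sketched by brute-force enumeration. The real function reasons about symbolic layouts; here each fragment is modelled as a plain callable from a logical index to the set of thread IDs that access it, and every name below is hypothetical.

```python
from typing import Callable, Iterable

Index = tuple[int, ...]


def fragment_contains(
    small_indices: Iterable[Index],
    small_threads: Callable[[Index], set[int]],
    large_threads: Callable[[Index], set[int]],
    to_large_index: Callable[[Index], Index],
) -> bool:
    """Return True if, for every element of the small fragment, the threads
    accessing it are a subset of the threads accessing the corresponding
    element of the large fragment."""
    for idx in small_indices:
        if not small_threads(idx) <= large_threads(to_large_index(idx)):
            return False
    return True


# Usage: a 4-element row fragment where element i is held by thread i,
# checked against a 4x8 fragment whose element (i, j) is held by thread i.
rows = [(i,) for i in range(4)]
ok = fragment_contains(
    rows,
    small_threads=lambda idx: {idx[0]},
    large_threads=lambda idx: {idx[0]},
    to_large_index=lambda idx: (idx[0], 0),
)

# If the large fragment is instead distributed by column, the check fails.
bad = fragment_contains(
    rows,
    small_threads=lambda idx: {idx[0]},
    large_threads=lambda idx: {idx[1]},
    to_large_index=lambda idx: (idx[0], 0),
)
```

Mapping each small index to a single large index is a simplification; a reduction-style layout would need the check repeated for every large element that corresponds to the small one.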
-
Lei Wang authored
* Update flash-attn version in requirements-test.txt from <=2.2.0 to ==2.5.8 * lint fix * Remove unused dependencies from requirements-test.txt * Update import path for padding functions in example MHA forward variable length script * Refactor code formatting in bert_padding.py for improved readability
-
Yichen Yan authored
* Trivial update to calculate target arch * Update tilelang/contrib/nvrtc.py Co-authored-by:
gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> * fmt --------- Co-authored-by:
gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
-
- 07 Aug, 2025 2 commits
-
-
Zhengju Tang authored
* [GDN] Add examples for GDN forward and backward kernels * [Refactor] Folder structure refactor for duplicated utils * [Test] Add test script for kernels * [Refactor] Rename examples to align with the repo * [Lint] Modify README * [Update] Modified README to align upstream repo * [BugFix] Path of FLA * [Fix] Copyright and test * [Lint] * [CI] Add GDN compilation test CI * [Lint] * [BugFix] Import error of fla
-
dependabot[bot] authored
Bumps [transformers](https://github.com/huggingface/transformers) from 4.52.1 to 4.53.0. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](https://github.com/huggingface/transformers/compare/v4.52.1...v4.53.0) --- updated-dependencies: - dependency-name: transformers dependency-version: 4.53.0 dependency-type: direct:production ... Signed-off-by:
dependabot[bot] <support@github.com> Co-authored-by:
dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 06 Aug, 2025 2 commits
-
-
Lei Wang authored
* [Enhancement] Disable cache and append git commit ID to version in tilelang (#688) * Disabled caching in quickstart example for improved performance. * Added a function to retrieve the current git commit ID and appended it to the version string if not already present, enhancing version tracking and debugging capabilities. * revert quickstart * optimize code.
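The version-tagging behaviour described in this commit (look up the current git commit ID and append it to the version string unless it is already there) might look roughly like the sketch below. The function names, the `+git<hash>` suffix format, and the 8-character abbreviation are assumptions for illustration, not the code in the repository.

```python
import subprocess
from typing import Optional


def get_git_commit_id(cwd: Optional[str] = None) -> Optional[str]:
    """Return the full commit hash of HEAD, or None when git is unavailable
    or the directory is not a repository."""
    try:
        out = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], cwd=cwd, stderr=subprocess.DEVNULL
        )
    except (subprocess.CalledProcessError, OSError):
        return None
    return out.decode("utf-8").strip()


def with_commit_suffix(version: str, commit_id: Optional[str], length: int = 8) -> str:
    """Append an abbreviated commit ID as a local version label.

    The version is returned unchanged if no commit ID is known or if the
    abbreviated ID is already part of the string.
    """
    if not commit_id:
        return version
    short = commit_id[:length]
    if short in version:
        return version
    # A version may contain only one '+', so extend an existing local label with '.'.
    sep = "." if "+" in version else "+"
    return f"{version}{sep}git{short}"
```

Typical use would be `with_commit_suffix(base_version, get_git_commit_id())`. Keeping the lookup separate from the string handling means an installation from a source tarball, where no git metadata exists, falls back to the plain version.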
-
Lei Wang authored
* [Enhancement] Disable cache and append git commit ID to version in tilelang (#688) * Disabled caching in quickstart example for improved performance. * Added a function to retrieve the current git commit ID and appended it to the version string if not already present, enhancing version tracking and debugging capabilities. * revert quickstart
-
- 05 Aug, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Refactor GEMM operations for improved warp partitioning and target instruction handling - Introduced a new `GetGemmInst` method to determine the appropriate GEMM instruction based on block size and target architecture. - Updated `ComputeWarpPartition` to accept the GEMM instruction type, enhancing flexibility in warp partitioning logic. - Added `TargetGetWarpSize` utility to streamline warp size retrieval based on target architecture. - Refactored layout inference and lowering methods to utilize the new GEMM instruction handling, improving clarity and maintainability of the codebase. * bug fix * test fix * lint fix * phase out Canonialize * add option --expt-relaxed-constexpr * [Enhancement] Introduce tilelang intrinsic operations for GEMM - Added `tl_gemm` and `tl_gemm_sp` built-in operations to support general and sparse matrix multiplication in tilelang. - Updated the lowering logic in `Gemm` and `GemmSP` to utilize the new tilelang operations. - Enhanced CUDA and HIP code generation to handle the new GEMM operations, ensuring proper argument validation and external call printing. - Implemented shared memory alignment planning for GEMM operations to optimize performance on supported architectures. * lint fix * lint fix * test fix * test fix * rebase * Update builtin.cc
-
- 04 Aug, 2025 1 commit
-
-
Wenhao Xie authored
* use more efficient bf16 type related conversion * update macro
-
- 03 Aug, 2025 3 commits
-
-
Lei Wang authored
* [Enhancement] Refactor GEMM operations for improved warp partitioning and target instruction handling - Introduced a new `GetGemmInst` method to determine the appropriate GEMM instruction based on block size and target architecture. - Updated `ComputeWarpPartition` to accept the GEMM instruction type, enhancing flexibility in warp partitioning logic. - Added `TargetGetWarpSize` utility to streamline warp size retrieval based on target architecture. - Refactored layout inference and lowering methods to utilize the new GEMM instruction handling, improving clarity and maintainability of the codebase. * bug fix * test fix * lint fix
-
Lei Wang authored
* [Enhancement] Introduce software pipeline rewriter and refactor buffer access handling - Added a new `PipelineOpaqueAccessRewriter` class to manage opaque buffer accesses in the software pipeline. - Refactored the `PipelineBodyRewriter` to utilize the new rewriter for improved buffer access handling. - Enhanced the `PipelineRewriter` to support additional fragment information and streamline pipeline construction. - Updated tests to reflect changes in buffer management and access patterns, ensuring compatibility with the new structure. - Removed obsolete code related to previous buffer access methods for clarity and maintainability. * test fix
-
yyttt6 authored
* [Feature]:Add auto vectorize for atomic add * fix * fix2 * format
-
- 01 Aug, 2025 1 commit
-
-
Lei Wang authored
* Add `--ptxas-options=--register-usage-level=10` option * lint fix --------- Co-authored-by:Chenggang Zhao <chenggangz@deepseek.com>
-
- 31 Jul, 2025 5 commits
-
-
Cunxiao Ni authored
* [Fix] fix some issues with JIT decorators existing in the examples * format * Uses PassConfigKey instead of str --------- Co-authored-by:Cunxiao <nicunxiao@bytedance.com>
-
Yu Cheng authored
- Renamed TMAFinder to ProducerBufferDetector and improved handling of CallNode and BufferLoadNode. - This change aims to enhance code maintainability and performance by more accurately tracking producer buffer usage.
-
alex_xiao authored
* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668) - Enhanced buffer index handling to address precision issues by removing redundant operations. - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. - Updated related documentation to reflect changes in buffer management practices. * Remove obsolete test script for AMD example, streamlining the examples directory. * Remove unused dtype_size variable in AMD example script to streamline code. * Add input configuration file and update AMD example script for enhanced flexibility - Introduced a new input.txt file for configurable parameters. - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack. - Streamlined the main function for better clarity and organization. - Added a new test script to facilitate running the example with specified parameters. * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations - Deleted input.txt and test.sh files as they are no longer needed. - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance. - Reintroduced swizzle usage in the kernel for better performance. * Refactor AMD example script for FlashAttention-2 - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`. - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls. - Removed outdated comments and improved code organization for better readability. * Refactor formatting in AMD FlashAttention example script - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function. - Streamlined the `main` function parameter formatting for consistency. 
- Removed unnecessary blank lines to enhance overall code organization. * Update example_amd_flash_attn_fwd.py --------- Co-authored-by:
xinxyxiao <xinyxiao@amd.com> Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Yu Cheng authored
- Removed unnecessary configurations from the @tilelang.jit decorator in `example_grouped_gemm_fwd.py`, simplifying the kernel compilation process. - Updated the `grouped_gemm` function to accept a tuple for batch sizes, enhancing compatibility with the kernel invocation. - Added logic in `warp_specialized_rewriter.cc` to track buffer usage in `CallNode` expressions, improving the handling of TMA load operations. This refactor aims to streamline the code and improve maintainability while ensuring better performance in grouped matrix multiplication operations. Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
Yang Chen authored
This is a minor enhancement to output verbose messages indicating where cache files are saved and loaded. These messages are useful for examining the relevant intermediate files.
-
- 30 Jul, 2025 5 commits
-
-
Lei Wang authored
* Update CI workflow to use Python 3.12 and enable build isolation for pip installations - Changed the Python version in the CI configuration from 3.9 to 3.12 to ensure compatibility with the latest features and improvements. - Updated the `PIP_NO_BUILD_ISOLATION` environment variable from `0` to `1` in the CI configuration, allowing pip to install testing requirements with build isolation enabled, which enhances the installation process during CI runs. * Update CI workflow to trigger on pull requests instead of pull_request_target - Changed the event trigger in the CI configuration from `pull_request_target` to `pull_request` to ensure the workflow runs on pull requests, enhancing the integration process. * Refactor CI workflow to remove unnecessary repository and token settings - Removed the repository and token parameters from the checkout step in the CI configuration, simplifying the workflow setup and improving security by not exposing sensitive information. * Remove pip install command from CI workflow to streamline installation process * Refactor reshape functions and tests for shared memory operations - Renamed and updated `reshape_test_smem` to `reshape_test_smem_1d_2_2d` and `run_reshape_smem` to `run_reshape_smem_1d_2_2d` for clarity. - Introduced a new reshape function `reshape_test_smem_2d_2_1d` and its corresponding runner `run_reshape_smem_2d_2_1d`. - Updated tests to reflect the new function names and added a test for the 2D to 1D reshape functionality, enhancing test coverage and clarity.
-
Lei Wang authored
* merge from lab * Add `TILELANG_PRINT_ON_COMPILATION` * Update CI workflow to disable build isolation for pip installations in testing requirements - Changed the `PIP_NO_BUILD_ISOLATION` environment variable from `1` to `0` in the CI configuration, ensuring that pip installs the testing requirements without build isolation. This adjustment aims to improve compatibility and streamline the installation process during CI runs. --------- Co-authored-by:Chenggang Zhao <chenggangz@deepseek.com>
-
Yichen Yan authored
which there's a lot
-
Siyuan Feng authored
**Summary of part of the rebase PR:**

1. **Support T.thread_return() → CUDA return syntax**
   Added support for translating `T.thread_return()` to CUDA's native `return` statement.

2. **Dynamic type support for function inputs**
   Functions now accept dynamically typed parameters using `typing`:
   ```python
   dyn_type = T.int32 or T.float

   @T.prim_func
   def main(
       a: dyn_type,
   )
   ```

3. **Device Function Codegen**
   Added support for generating `__device__` functions in CUDA:
   ```python
   @I.ir_module
   class Module:
       @T.prim_func(private=True)
       def add(a: T.int32, b: T.int32) -> T.int32:
           return a + b

       @T.prim_func
       def main(
           A: T.Buffer((128, 128), "int32"),
           B: T.Buffer((128, 128), "int32"),
           C: T.Buffer((128, 128), "int32"),
       ):
           T.func_attr({"global_symbol": "main"})
           length: T.int32 = Module.add(64, 64)  # Host call
           for bx in...
   ```
-
Wenhao Xie authored
-
- 29 Jul, 2025 6 commits
-
-
Yang Chen authored
* [Enhancement] passing verbose to LibraryGenerator This PR enables passing a verbose parameter to LibraryGenerator via CtypesKernelAdapter and CythonKernelAdapter. When verbose is set to True, we will print out the NVCC compilation command. This slightly improves debuggability. * fix ci --------- Co-authored-by:xwhzz <wh.xie@outlook.com>
-
Wenhao Xie authored
-
Wenhao Xie authored
* update format check ci * upd * upd
-
Yang Chen authored
cmake doesn't take the nvcc specified by CUDA_HOME by default. Consequently, the following command failed for me because cmake still used the nvcc from the default location (e.g. in my case /usr/local/cuda/bin/nvcc):
```
$ PATH=/home/yangche/cuda-12.8/bin:$PATH CUDA_HOME=/home/yangche/cuda-12.8 pip install -e . -v
```
This minor fix forces cmake to use the nvcc specified by the CUDA_HOME env.
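One way a build script could point cmake at the nvcc under CUDA_HOME is to pass the standard `CMAKE_CUDA_COMPILER` cache variable on the configure command line. The helper below is a hypothetical sketch of that idea; its name is invented, and it is not necessarily how this commit implements the fix.

```python
import os
from typing import Mapping, Optional


def cuda_compiler_args(environ: Optional[Mapping[str, str]] = None) -> list[str]:
    """Build extra cmake arguments that select the nvcc under CUDA_HOME.

    Returns an empty list when CUDA_HOME is unset, so cmake falls back to
    its own compiler detection.
    """
    env = os.environ if environ is None else environ
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        return []
    nvcc = os.path.join(cuda_home, "bin", "nvcc")
    # CMAKE_CUDA_COMPILER is the CMake variable naming the CUDA compiler.
    return [f"-DCMAKE_CUDA_COMPILER={nvcc}"]
```

The returned list would be appended to whatever argument list the build already hands to `cmake` at configure time.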
-
alex_xiao authored
* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668) - Enhanced buffer index handling to address precision issues by removing redundant operations. - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. - Updated related documentation to reflect changes in buffer management practices. * Remove obsolete test script for AMD example, streamlining the examples directory. * Remove unused dtype_size variable in AMD example script to streamline code. --------- Co-authored-by:xinxyxiao <xinyxiao@amd.com>
-
- 25 Jul, 2025 2 commits
- 24 Jul, 2025 3 commits
-
-
Lei Wang authored
* [Enhancement] Improve buffer conflict detection in thread storage synchronization - Added a new boolean variable `range_is_overlap` to accurately determine if buffer indices overlap, enhancing the conflict detection logic in `thread_storage_sync.cc`. - Updated the return logic to reflect the overlap status, ensuring correct conflict resolution based on buffer index comparisons. - Removed an unnecessary comment in `OptimizeForTarget` to streamline the code and improve clarity. * example fix * enhancement * improve ci
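The overlap test behind a flag like `range_is_overlap` can be sketched with concrete integer ranges. The actual pass in `thread_storage_sync.cc` works on symbolic index expressions; the helper names and the `(min, extent)` representation below are assumptions for illustration only.

```python
def ranges_overlap(a_min: int, a_extent: int, b_min: int, b_extent: int) -> bool:
    """Return True if half-open ranges [a_min, a_min + a_extent) and
    [b_min, b_min + b_extent) share at least one index."""
    if a_extent <= 0 or b_extent <= 0:
        return False  # an empty range touches nothing
    return a_min < b_min + b_extent and b_min < a_min + a_extent


def accesses_conflict(a: list[tuple[int, int]], b: list[tuple[int, int]]) -> bool:
    """Decide whether two accesses to the same buffer may touch the same element.

    Each access is a list of (min, extent) pairs, one per dimension. The
    accesses conflict only if their ranges overlap in every dimension.
    """
    if len(a) != len(b):
        return True  # conservative answer when the shapes cannot be compared
    return all(
        ranges_overlap(a_min, a_ext, b_min, b_ext)
        for (a_min, a_ext), (b_min, b_ext) in zip(a, b)
    )
```

Two accesses that are disjoint in any single dimension cannot race, so no barrier is required between them; a conservative `True` is returned whenever the comparison cannot be made.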
-
Wenhao Xie authored
* [Bugfix][Docs] Update documentation build process and configurations for autoapi support * lint fix
-
Zhengju Tang authored
[BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking (#653) * [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking * Lint * test fix * Update CI workflow to install dependencies without user site packages - Modified the installation commands in the CI workflow to include the `--no-user` flag for both `requirements-dev.txt` and `requirements-test.txt`, ensuring that packages are installed in the virtual environment rather than the user site directory. * Update CI workflow to install pip without user site packages - Added the `--no-user` flag to the pip installation command in the CI workflow for both development and testing dependencies, ensuring that packages are installed within the virtual environment. * Update requirements-test.txt * reduce ci problem size, * Refactor example_mla_decode.py for consistent formatting and remove unused imports in test_example_mla_decode.py --------- Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com> Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 23 Jul, 2025 3 commits
-
-
Zhang Jason authored
Co-authored-by:zhangnju <ningzhan@SMC-SC-DI08-33.dh144.dcgpu>
-
Wenhao Xie authored
[Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control (#656) * [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control * lint fix * upd * lint fix * fix typo * update typing * update the use case of compile flags * ci fix * fix * Fix CI workflow to correctly activate virtual environment from shared cache directory * use local cache * fix * fix * fix --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
Lei Wang authored
* Support shared cache directories for multiple users * ruff fix * ci_fix * Add CI step to show worker info --------- Co-authored-by:Chenggang Zhao <chenggangz@deepseek.com>
-