- 03 Aug, 2025 3 commits
-
-
Lei Wang authored
* [Enhancement] Refactor GEMM operations for improved warp partitioning and target instruction handling - Introduced a new `GetGemmInst` method to determine the appropriate GEMM instruction based on block size and target architecture. - Updated `ComputeWarpPartition` to accept the GEMM instruction type, enhancing flexibility in warp partitioning logic. - Added `TargetGetWarpSize` utility to streamline warp size retrieval based on target architecture. - Refactored layout inference and lowering methods to utilize the new GEMM instruction handling, improving clarity and maintainability of the codebase. * bug fix * test fix * lint fix
-
Lei Wang authored
* [Enhancement] Introduce software pipeline rewriter and refactor buffer access handling - Added a new `PipelineOpaqueAccessRewriter` class to manage opaque buffer accesses in the software pipeline. - Refactored the `PipelineBodyRewriter` to utilize the new rewriter for improved buffer access handling. - Enhanced the `PipelineRewriter` to support additional fragment information and streamline pipeline construction. - Updated tests to reflect changes in buffer management and access patterns, ensuring compatibility with the new structure. - Removed obsolete code related to previous buffer access methods for clarity and maintainability. * test fix
-
yyttt6 authored
* [Feature]:Add auto vectorize for atomic add * fix * fix2 * format
-
- 01 Aug, 2025 1 commit
-
-
Lei Wang authored
* Add `--ptxas-options=--register-usage-level=10` option * lint fix --------- Co-authored-by:Chenggang Zhao <chenggangz@deepseek.com>
-
- 31 Jul, 2025 5 commits
-
-
Cunxiao Ni authored
* [Fix] fix some issues with JIT decorators existing in the examples * format * Uses PassConfigKey instead of str --------- Co-authored-by: Cunxiao <nicunxiao@bytedance.com>
-
Yu Cheng authored
- Renamed TMAFinder to ProducerBufferDetector and improved handling of CallNode and BufferLoadNode. - This change aims to enhance code maintainability and performance by more accurately tracking producer buffer usage.
-
alex_xiao authored
* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668) - Enhanced buffer index handling to address precision issues by removing redundant operations. - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. - Updated related documentation to reflect changes in buffer management practices. * Remove obsolete test script for AMD example, streamlining the examples directory. * Remove unused dtype_size variable in AMD example script to streamline code. * Add input configuration file and update AMD example script for enhanced flexibility - Introduced a new input.txt file for configurable parameters. - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack. - Streamlined the main function for better clarity and organization. - Added a new test script to facilitate running the example with specified parameters. * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations - Deleted input.txt and test.sh files as they are no longer needed. - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance. - Reintroduced swizzle usage in the kernel for better performance. * Refactor AMD example script for FlashAttention-2 - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`. - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls. - Removed outdated comments and improved code organization for better readability. * Refactor formatting in AMD FlashAttention example script - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function. - Streamlined the `main` function parameter formatting for consistency. 
- Removed unnecessary blank lines to enhance overall code organization. * Update example_amd_flash_attn_fwd.py --------- Co-authored-by:
xinxyxiao <xinyxiao@amd.com> Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Yu Cheng authored
- Removed unnecessary configurations from the @tilelang.jit decorator in `example_grouped_gemm_fwd.py`, simplifying the kernel compilation process. - Updated the `grouped_gemm` function to accept a tuple for batch sizes, enhancing compatibility with the kernel invocation. - Added logic in `warp_specialized_rewriter.cc` to track buffer usage in `CallNode` expressions, improving the handling of TMA load operations. This refactor aims to streamline the code and improve maintainability while ensuring better performance in grouped matrix multiplication operations. Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
Yang Chen authored
This is a minor enhancement to output verbose messages indicating where cache files are saved and loaded. These messages are useful for examining the relevant intermediate files.
-
- 30 Jul, 2025 5 commits
-
-
Lei Wang authored
* Update CI workflow to use Python 3.12 and enable build isolation for pip installations - Changed the Python version in the CI configuration from 3.9 to 3.12 to ensure compatibility with the latest features and improvements. - Updated the `PIP_NO_BUILD_ISOLATION` environment variable from `0` to `1` in the CI configuration, allowing pip to install testing requirements with build isolation enabled, which enhances the installation process during CI runs. * Update CI workflow to trigger on pull requests instead of pull_request_target - Changed the event trigger in the CI configuration from `pull_request_target` to `pull_request` to ensure the workflow runs on pull requests, enhancing the integration process. * Refactor CI workflow to remove unnecessary repository and token settings - Removed the repository and token parameters from the checkout step in the CI configuration, simplifying the workflow setup and improving security by not exposing sensitive information. * Remove pip install command from CI workflow to streamline installation process * Refactor reshape functions and tests for shared memory operations - Renamed and updated `reshape_test_smem` to `reshape_test_smem_1d_2_2d` and `run_reshape_smem` to `run_reshape_smem_1d_2_2d` for clarity. - Introduced a new reshape function `reshape_test_smem_2d_2_1d` and its corresponding runner `run_reshape_smem_2d_2_1d`. - Updated tests to reflect the new function names and added a test for the 2D to 1D reshape functionality, enhancing test coverage and clarity.
-
Lei Wang authored
* merge from lab * Add `TILELANG_PRINT_ON_COMPILATION` * Update CI workflow to disable build isolation for pip installations in testing requirements - Changed the `PIP_NO_BUILD_ISOLATION` environment variable from `1` to `0` in the CI configuration, ensuring that pip installs the testing requirements without build isolation. This adjustment aims to improve compatibility and streamline the installation process during CI runs. --------- Co-authored-by:Chenggang Zhao <chenggangz@deepseek.com>
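A minimal sketch of how a boolean environment switch such as `TILELANG_PRINT_ON_COMPILATION` might be read. The commit does not say how the variable is parsed, so the helper name `env_flag`, the accepted truthy strings, and the default are all assumptions made for illustration.

```python
import os

# Assumption: these strings count as "enabled"; the real project may differ.
_TRUTHY = {"1", "true", "yes", "on"}


def env_flag(name, default=False, environ=None):
    """Return True when the environment variable `name` holds a truthy string.

    `environ` can be passed explicitly to make the helper easy to test.
    """
    env = os.environ if environ is None else environ
    raw = env.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in _TRUTHY


if __name__ == "__main__":
    # Hypothetical use: print a notice only when the switch is set.
    if env_flag("TILELANG_PRINT_ON_COMPILATION"):
        print("compiling kernel ...")
```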
-
Yichen Yan authored
which there's a lot
-
Siyuan Feng authored
**Summary of part of the rebase PR:**

1. **Support T.thread_return() → CUDA return syntax**
   Added support for translating `T.thread_return()` to CUDA's native `return` statement.
2. **Dynamic type support for function inputs**
   Functions now accept dynamically typed parameters using `typing`:
   ```python
   dyn_type = T.int32 or T.float

   @T.prim_func
   def main(
       a: dyn_type,
   )
   ```
3. **Device Function Codegen**
   Added support for generating `__device__` functions in CUDA:
   ```python
   @I.ir_module
   class Module:
       @T.prim_func(private=True)
       def add(a: T.int32, b: T.int32) -> T.int32:
           return a + b

       @T.prim_func
       def main(
           A: T.Buffer((128, 128), "int32"),
           B: T.Buffer((128, 128), "int32"),
           C: T.Buffer((128, 128), "int32"),
       ):
           T.func_attr({"global_symbol": "main"})
           length: T.int32 = Module.add(64, 64)  # Host call
           for bx in...
   ```
Wenhao Xie authored
-
- 29 Jul, 2025 6 commits
-
-
Yang Chen authored
* [Enhancement] passing verbose to LibraryGenerator This PR enables passing a verbose parameter to LibraryGenerator via CtypesKernelAdapter and CythonKernelAdapter. When verbose is set to True, we will print out the NVCC compilation command. This slightly improves debuggability. * fix ci --------- Co-authored-by:xwhzz <wh.xie@outlook.com>
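The idea of threading a `verbose` flag down to the point where the compiler command is assembled can be sketched as follows. This is not the `LibraryGenerator` code: the function name `build_compile_command`, its parameters, and the exact flag set are hypothetical, chosen only to show the pattern of printing the command before it is run.

```python
import shlex


def build_compile_command(source, output, arch="sm_90", verbose=False):
    """Assemble an nvcc command line and optionally print it.

    Hypothetical helper: the flags below are a plausible shared-library
    build, not the set used by the project.
    """
    cmd = [
        "nvcc",
        "-shared",
        "-Xcompiler", "-fPIC",   # position-independent code for a .so
        f"-arch={arch}",
        source,
        "-o", output,
    ]
    if verbose:
        # shlex.join quotes arguments so the line can be pasted into a shell.
        print("compile command:", shlex.join(cmd))
    return cmd


if __name__ == "__main__":
    build_compile_command("kernel.cu", "kernel.so", verbose=True)
```

Returning the command as a list, rather than running it here, keeps the sketch free of side effects; a caller would hand the list to `subprocess.run`.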
-
Wenhao Xie authored
-
Wenhao Xie authored
* update format check ci * upd * upd
-
Yang Chen authored
cmake does not use the nvcc specified by CUDA_HOME by default. Consequently, the following command failed for me because cmake still used the nvcc from the default location (in my case /usr/local/cuda/bin/nvcc):
```
$ PATH=/home/yangche/cuda-12.8/bin:$PATH CUDA_HOME=/home/yangche/cuda-12.8 pip install -e . -v
```
This minor fix forces cmake to use the nvcc specified by the CUDA_HOME environment variable.
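One way a build script could pin the compiler is to derive the nvcc path from `CUDA_HOME` and pass it to cmake through the standard `CMAKE_CUDA_COMPILER` variable. The sketch below illustrates that approach under stated assumptions: the commit does not show its actual change, and the helper name `cmake_cuda_args` is hypothetical.

```python
import os


def cmake_cuda_args(environ=None):
    """Return extra cmake arguments that pin nvcc to the one under CUDA_HOME.

    Returns an empty list when CUDA_HOME is unset, leaving cmake's own
    compiler detection untouched.
    """
    env = os.environ if environ is None else environ
    cuda_home = env.get("CUDA_HOME")
    if not cuda_home:
        return []
    # Assumption: a POSIX layout with the compiler at <CUDA_HOME>/bin/nvcc.
    nvcc = cuda_home.rstrip("/") + "/bin/nvcc"
    return [f"-DCMAKE_CUDA_COMPILER={nvcc}"]


if __name__ == "__main__":
    # A build script would append these to its cmake configure command.
    print(["cmake", ".."] + cmake_cuda_args())
```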
-
alex_xiao authored
* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668) - Enhanced buffer index handling to address precision issues by removing redundant operations. - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. - Updated related documentation to reflect changes in buffer management practices. * Remove obsolete test script for AMD example, streamlining the examples directory. * Remove unused dtype_size variable in AMD example script to streamline code. --------- Co-authored-by:xinxyxiao <xinyxiao@amd.com>
-
- 25 Jul, 2025 2 commits
- 24 Jul, 2025 3 commits
-
-
Lei Wang authored
* [Enhancement] Improve buffer conflict detection in thread storage synchronization - Added a new boolean variable `range_is_overlap` to accurately determine if buffer indices overlap, enhancing the conflict detection logic in `thread_storage_sync.cc`. - Updated the return logic to reflect the overlap status, ensuring correct conflict resolution based on buffer index comparisons. - Removed an unnecessary comment in `OptimizeForTarget` to streamline the code and improve clarity. * example fix * enhancement * improve ci
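The overlap check described above can be illustrated with a plain interval test. This is a simplified sketch, not the logic in `thread_storage_sync.cc`: it assumes each access is a half-open index range `[min, min + extent)` with constant integer bounds, and the function names are hypothetical.

```python
def ranges_overlap(a_min, a_extent, b_min, b_extent):
    """True when [a_min, a_min + a_extent) and [b_min, b_min + b_extent) intersect."""
    if a_extent <= 0 or b_extent <= 0:
        return False  # an empty range cannot conflict with anything
    return a_min < b_min + b_extent and b_min < a_min + a_extent


def accesses_conflict(a_ranges, b_ranges):
    """Two multi-dimensional accesses conflict only if every dimension overlaps.

    Each argument is a list of (min, extent) pairs, one pair per buffer dimension.
    """
    return all(
        ranges_overlap(a_min, a_ext, b_min, b_ext)
        for (a_min, a_ext), (b_min, b_ext) in zip(a_ranges, b_ranges)
    )


if __name__ == "__main__":
    # Rows 0..3 vs rows 4..7 of the same columns: disjoint, so no sync needed.
    print(accesses_conflict([(0, 4), (0, 8)], [(4, 4), (0, 8)]))
```

Requiring overlap in every dimension is what lets two accesses to disjoint rows of the same buffer proceed without a barrier.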
-
Wenhao Xie authored
* [Bugfix][Docs] Update documentation build process and configurations for autoapi support * lint fix
-
Zhengju Tang authored
[BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking (#653) * [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking * Lint * test fix * Update CI workflow to install dependencies without user site packages - Modified the installation commands in the CI workflow to include the `--no-user` flag for both `requirements-dev.txt` and `requirements-test.txt`, ensuring that packages are installed in the virtual environment rather than the user site directory. * Update CI workflow to install pip without user site packages - Added the `--no-user` flag to the pip installation command in the CI workflow for both development and testing dependencies, ensuring that packages are installed within the virtual environment. * Update requirements-test.txt * reduce ci problem size, * Refactor example_mla_decode.py for consistent formatting and remove unused imports in test_example_mla_decode.py --------- Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com> Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 23 Jul, 2025 5 commits
-
-
Zhang Jason authored
Co-authored-by:zhangnju <ningzhan@SMC-SC-DI08-33.dh144.dcgpu>
-
Wenhao Xie authored
[Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control (#656) * [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control * lint fix * upd * lint fix * fix typo * update typing * update the use case of compile flags * ci fix * fix * Fix CI workflow to correctly activate virtual environment from shared cache directory * use local cache * fix * fix * fix --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
Lei Wang authored
* Support shared cache directories for multiple users * ruff fix * ci_fix * Add CI step to show worker info --------- Co-authored-by:Chenggang Zhao <chenggangz@deepseek.com>
-
Lei Wang authored
-
Wenhao Xie authored
* fix CI bugs in hopper * lint fix * Update bulk_copy.cc * Refactor bulk copy logic in LowerBulkCopy function - Removed unnecessary blank lines for improved code readability. - Enhanced stride validation by checking for null pointers in global stride calculations, ensuring robustness against symbolic strides. - Updated pass configuration handling in dynamic tile language tests to streamline dynamic alignment and TMA lower pass settings. * test fix * ci fix * Update flash-attention dependencies and clean up example code - Downgraded `flash-attn` dependency version in `requirements-test.txt` to `<=2.2.0`. - Removed unused imports and commented-out code in various example files to enhance readability and maintainability. - Updated the `flashattn` function signature to include default parameters for `block_M`, `block_N`, `num_stages`, and `threads`. - Cleaned up the `example_mha_fwd_varlen.py` and `example_mha_bwd_wgmma_pipelined.py` files by removing unnecessary comments and improving code clarity. - Deleted the `example_mha_inference.py` file as it is no longer needed. * Update CI workflow to remove `--user` flag from pip install commands - Removed the `--user` flag from the pip install commands in both the development and testing sections of the CI workflow to ensure proper installation of dependencies in the virtual environment. * Update CI workflow to include `--no-user` flag in pip install commands - Added the `--no-user` flag to the pip install commands in both the development and testing sections of the CI workflow to ensure dependencies are installed correctly within the virtual environment. * Update CI workflow to include `--no-user` flag in pip install command for wheel mode - Added the `--no-user` flag to the pip install command in the wheel mode section of the CI workflow to ensure dependencies are installed correctly within the virtual environment. * test fix * avoid conflict with system environments * test fix * add comments --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com>
-
- 22 Jul, 2025 1 commit
-
-
Yu Cheng authored
- Implemented a new role assignment for `AllocateNode` in `warp_specialized_rewriter.cc`, setting the role to `kConsumer` to ensure proper handling of memory allocation scenarios. - This avoids a bug when using `T.reduce(clear=False)`.
-
- 21 Jul, 2025 2 commits
-
-
Lei Wang authored
- Eliminated the condition that disabled the reuse of small arrays (const_nbits <= 32) in the `MergeSharedMemoryAllocations` function, allowing for more flexible memory management. - Added a comment in `OptimizeForTarget` to clarify the order of applying `MergeSharedMemoryAllocations` after `SplitHostDevice`, ensuring correct allocation site handling in device functions.
-
meinie authored
* fix: Copy Target to self.target * refactor: Remove unused target attribute and adjust context management in JITKernel - Removed the unused `target` attribute from the `JITKernel` class. - Updated the context management in the `compile` method to utilize `self.target`, improving clarity and ensuring proper resource handling during compilation. --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
- 20 Jul, 2025 2 commits
-
-
Yu Cheng authored
* [Bugfix] Adjust role assignment in warp specialization based on read access - Updated the role assignment logic in `warp_specialized_rewriter.cc` to set the role to `kConsumer` when no reads are detected, ensuring correct behavior in memory access scenarios. * Apply suggestion from @gemini-code-assist[bot] Co-authored-by:
gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
-
Lei Wang authored
-
- 17 Jul, 2025 2 commits
-
-
Lei Wang authored
- Added a comment to clarify the alignment of dynamic shared memory allocations in the `OptimizeForTarget` function. - Refactored the handling of shared memory allocation merging and synchronization to streamline the process, ensuring consistent behavior regardless of the aggressive merge flag. - Improved code clarity by removing redundant conditional checks related to synchronization and memory allocation.
-
Lei Wang authored
- Included the Cython cache directory in the list of source files for the TileLang build process, ensuring proper handling of cached Cython files during the build.
-
- 16 Jul, 2025 3 commits
-
-
YizhaoGao authored
* Add paged block-sparse flash-decoding kernel * Update example_tilelang_sparse_gqa_decode_paged.py * lint fix --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com>
-
Lei Wang authored
- Updated the `pythonic_expr` function to accept an optional `dtype_map` parameter, allowing for more flexible type conversions. - Refactored calls to `pythonic_expr` in `TLCUDASourceWrapper` to utilize the new mapping feature, improving type handling in kernel generation. - Enhanced code clarity by consolidating repeated calls to `pythonic_expr` into a private method within the wrapper class.
-
Lei Wang authored
* [Enhancement] Update AllReduce operation to include thread offset in kernel generation - Modified the `ReduceOp::Lower` method to incorporate the thread offset in the AllReduce kernel generation for the sm_90 architecture. - This change improves the accuracy of thread management during reduction operations, enhancing performance on specific GPU architectures. * [Enhancement] Refactor thread offset handling in AllReduce kernel generation - Updated the `ReduceOp::Lower` method to streamline the handling of thread offset for AllReduce operations, ensuring consistent usage across different architectures. - This change enhances code clarity and maintains performance improvements for the sm_90 architecture by reducing redundancy in thread offset calculations.
-