- 31 Jul, 2025 3 commits
-
-
alex_xiao authored
* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668) - Enhanced buffer index handling to address precision issues by removing redundant operations. - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. - Updated related documentation to reflect changes in buffer management practices. * Remove obsolete test script for AMD example, streamlining the examples directory. * Remove unused dtype_size variable in AMD example script to streamline code. * Add input configuration file and update AMD example script for enhanced flexibility - Introduced a new input.txt file for configurable parameters. - Modified the example_amd_flash_attn_fwd.py script to allow for a wider range of configurations, including additional options for num_stages, enable_rasterization, and k_pack. - Streamlined the main function for better clarity and organization. - Added a new test script to facilitate running the example with specified parameters. * Remove input configuration file and obsolete test script; enhance AMD example with swizzle layout annotations - Deleted input.txt and test.sh files as they are no longer needed. - Updated example_amd_flash_attn_fwd.py to include swizzle layout annotations for shared memory, improving bank conflict avoidance. - Reintroduced swizzle usage in the kernel for better performance. * Refactor AMD example script for FlashAttention-2 - Updated function names for clarity, changing `get_v2_configs` to `get_configs` and `fast_flashattn_v2` to `fast_flashattn`. - Streamlined the main function by renaming `main_v2` to `main` and adjusting the corresponding calls. - Removed outdated comments and improved code organization for better readability. * Refactor formatting in AMD FlashAttention example script - Improved code readability by adjusting line breaks and indentation in the `fast_flashattn` function. - Streamlined the `main` function parameter formatting for consistency. - Removed unnecessary blank lines to enhance overall code organization. * Update example_amd_flash_attn_fwd.py --------- Co-authored-by:
xinxyxiao <xinyxiao@amd.com> Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Yu Cheng authored
- Removed unnecessary configurations from the @tilelang.jit decorator in `example_grouped_gemm_fwd.py`, simplifying the kernel compilation process. - Updated the `grouped_gemm` function to accept a tuple for batch sizes, enhancing compatibility with the kernel invocation. - Added logic in `warp_specialized_rewriter.cc` to track buffer usage in `CallNode` expressions, improving the handling of TMA load operations. This refactor aims to streamline the code and improve maintainability while ensuring better performance in grouped matrix multiplication operations. Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
Yang Chen authored
This is a minor enhancement to output verbose messages indicating where cache files are saved and loaded. These messages are useful for examining the relevant intermediate files.
-
- 30 Jul, 2025 5 commits
-
-
Lei Wang authored
* Update CI workflow to use Python 3.12 and enable build isolation for pip installations - Changed the Python version in the CI configuration from 3.9 to 3.12 to ensure compatibility with the latest features and improvements. - Updated the `PIP_NO_BUILD_ISOLATION` environment variable from `0` to `1` in the CI configuration, allowing pip to install testing requirements with build isolation enabled, which enhances the installation process during CI runs. * Update CI workflow to trigger on pull requests instead of pull_request_target - Changed the event trigger in the CI configuration from `pull_request_target` to `pull_request` to ensure the workflow runs on pull requests, enhancing the integration process. * Refactor CI workflow to remove unnecessary repository and token settings - Removed the repository and token parameters from the checkout step in the CI configuration, simplifying the workflow setup and improving security by not exposing sensitive information. * Remove pip install command from CI workflow to streamline installation process * Refactor reshape functions and tests for shared memory operations - Renamed and updated `reshape_test_smem` to `reshape_test_smem_1d_2_2d` and `run_reshape_smem` to `run_reshape_smem_1d_2_2d` for clarity. - Introduced a new reshape function `reshape_test_smem_2d_2_1d` and its corresponding runner `run_reshape_smem_2d_2_1d`. - Updated tests to reflect the new function names and added a test for the 2D to 1D reshape functionality, enhancing test coverage and clarity.
-
Lei Wang authored
* merge from lab * Add `TILELANG_PRINT_ON_COMPILATION` * Update CI workflow to disable build isolation for pip installations in testing requirements - Changed the `PIP_NO_BUILD_ISOLATION` environment variable from `1` to `0` in the CI configuration, ensuring that pip installs the testing requirements without build isolation. This adjustment aims to improve compatibility and streamline the installation process during CI runs. --------- Co-authored-by:Chenggang Zhao <chenggangz@deepseek.com>
-
Yichen Yan authored
which there's a lot
-
Siyuan Feng authored
**Summarize part of the rebase pr:** 1. **Support T.thread_return() → CUDA return syntax** Added support for translating `T.thread_return()` to CUDA's native `return` statement. 2. **Dynamic type support for function inputs** Functions now accept dynamically typed parameters using `typing`: ```python dyn_type = T.int32 or T.float @T.prim_func def main( a: dyn_type, ) ``` 3. **Device Function Codegen** Added support for generating `__device__` functions in CUDA: ```python @I.ir_module class Module: @T.prim_func(private=True) def add(a: T.int32, b: T.int32) -> T.int32: return a + b @T.prim_func def main( A: T.Buffer((128, 128), "int32"), B: T.Buffer((128, 128), "int32"), C: T.Buffer((128, 128), "int32"), ): T.func_attr({"global_symbol": "main"}) length: T.int32 = Module.add(64, 64) # Host call for bx in... -
Wenhao Xie authored
-
- 29 Jul, 2025 6 commits
-
-
Yang Chen authored
* [Enhancement] passing verbose to LibraryGenerator This PR enables passing a verbose parameter to LibraryGenerator via CtypesKernelAdapter and CythonKernelAdapter. When verbose is set to True, we will print out the NVCC compilation command. This slightly improves debuggability. * fix ci --------- Co-authored-by:xwhzz <wh.xie@outlook.com>
-
Wenhao Xie authored
-
Wenhao Xie authored
* update format check ci * upd * upd
-
Yang Chen authored
cmake doesn't take the nvcc specified by CUDA_HOME by default. Consequently, the follow command failed for me because cmake still used the nvcc from the default location (e.g. in my case /usr/local/cuda/bin/nvcc): ``` $ PATH=/home/yangche/cuda-12.8/bin:$PATH CUDA_HOME=/home/yangche/cuda-12.8 pip install -e . -v ``` This minor fix enforces cmake to use the nvcc specified by the CUDA_HOME env.
-
alex_xiao authored
* [Enhancement] Refactor buffer index handling for improved precision and clarity (#668) - Enhanced buffer index handling to address precision issues by removing redundant operations. - Streamlined the logic for determining buffer overlaps, ensuring more accurate conflict detection. - Updated related documentation to reflect changes in buffer management practices. * Remove obsolete test script for AMD example, streamlining the examples directory. * Remove unused dtype_size variable in AMD example script to streamline code. --------- Co-authored-by:xinxyxiao <xinyxiao@amd.com>
-
- 25 Jul, 2025 2 commits
- 24 Jul, 2025 3 commits
-
-
Lei Wang authored
* [Enhancement] Improve buffer conflict detection in thread storage synchronization - Added a new boolean variable `range_is_overlap` to accurately determine if buffer indices overlap, enhancing the conflict detection logic in `thread_storage_sync.cc`. - Updated the return logic to reflect the overlap status, ensuring correct conflict resolution based on buffer index comparisons. - Removed an unnecessary comment in `OptimizeForTarget` to streamline the code and improve clarity. * example fix * enhancement * improve ci
-
Wenhao Xie authored
* [Bugfix][Docs] Update documentation build process and configurations for autoapi support * lint fix
-
Zhengju Tang authored
[BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking (#653) * [BugFix] Do not modify strict layout in common or relax level of layout inference. More conditions on layout checking * Lint * test fix * Update CI workflow to install dependencies without user site packages - Modified the installation commands in the CI workflow to include the `--no-user` flag for both `requirements-dev.txt` and `requirements-test.txt`, ensuring that packages are installed in the virtual environment rather than the user site directory. * Update CI workflow to install pip without user site packages - Added the `--no-user` flag to the pip installation command in the CI workflow for both development and testing dependencies, ensuring that packages are installed within the virtual environment. * Update requirements-test.txt * reduce ci problem size, * Refactor example_mla_decode.py for consistent formatting and remove unused imports in test_example_mla_decode.py --------- Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com> Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 23 Jul, 2025 5 commits
-
-
Zhang Jason authored
Co-authored-by:zhangnju <ningzhan@SMC-SC-DI08-33.dh144.dcgpu>
-
Wenhao Xie authored
[Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control (#656) * [Enhancement] Add compile_flags parameter to JIT kernel and adapter classes for improved compilation control * lint fix * upd * lint fix * fix typo * update typing * update the use case of compile flags * ci fix * fix * Fix CI workflow to correctly activate virtual environment from shared cache directory * use local cache * fix * fix * fix --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
Lei Wang authored
* Support shared cache directories for multiple users * ruff fix * ci_fix * Add CI step to show worker info --------- Co-authored-by:Chenggang Zhao <chenggangz@deepseek.com>
-
Lei Wang authored
-
Wenhao Xie authored
* fix CI bugs in hopper * lint fix * Update bulk_copy.cc * Refactor bulk copy logic in LowerBulkCopy function - Removed unnecessary blank lines for improved code readability. - Enhanced stride validation by checking for null pointers in global stride calculations, ensuring robustness against symbolic strides. - Updated pass configuration handling in dynamic tile language tests to streamline dynamic alignment and TMA lower pass settings. * test fix * ci fix * Update flash-attention dependencies and clean up example code - Downgraded `flash-attn` dependency version in `requirements-test.txt` to `<=2.2.0`. - Removed unused imports and commented-out code in various example files to enhance readability and maintainability. - Updated the `flashattn` function signature to include default parameters for `block_M`, `block_N`, `num_stages`, and `threads`. - Cleaned up the `example_mha_fwd_varlen.py` and `example_mha_bwd_wgmma_pipelined.py` files by removing unnecessary comments and improving code clarity. - Deleted the `example_mha_inference.py` file as it is no longer needed. * Update CI workflow to remove `--user` flag from pip install commands - Removed the `--user` flag from the pip install commands in both the development and testing sections of the CI workflow to ensure proper installation of dependencies in the virtual environment. * Update CI workflow to include `--no-user` flag in pip install commands - Added the `--no-user` flag to the pip install commands in both the development and testing sections of the CI workflow to ensure dependencies are installed correctly within the virtual environment. * Update CI workflow to include `--no-user` flag in pip install command for wheel mode - Added the `--no-user` flag to the pip install command in the wheel mode section of the CI workflow to ensure dependencies are installed correctly within the virtual environment. * test fix * avoid conflict with system environments * test fix * add commnets --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com>
-
- 22 Jul, 2025 1 commit
-
-
Yu Cheng authored
- Implemented a new role assignment for `AllocateNode` in `warp_specialized_rewriter.cc`, setting the role to `kConsumer` to ensure proper handling of memory allocation scenarios. - This can avoid bug when using T.reduce(clear=False)
-
- 21 Jul, 2025 2 commits
-
-
Lei Wang authored
- Eliminated the condition that disabled the reuse of small arrays (const_nbits <= 32) in the `MergeSharedMemoryAllocations` function, allowing for more flexible memory management. - Added a comment in `OptimizeForTarget` to clarify the order of applying `MergeSharedMemoryAllocations` after `SplitHostDevice`, ensuring correct allocation site handling in device functions.
-
meinie authored
* fix: Copy Target to self.target * refactor: Remove unused target attribute and adjust context management in JITKernel - Removed the unused `target` attribute from the `JITKernel` class. - Updated the context management in the `compile` method to utilize `self.target`, improving clarity and ensuring proper resource handling during compilation. --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
- 20 Jul, 2025 2 commits
-
-
Yu Cheng authored
* [Bugfix] Adjust role assignment in warp specialization based on read access - Updated the role assignment logic in `warp_specialized_rewriter.cc` to set the role to `kConsumer` when no reads are detected, ensuring correct behavior in memory access scenarios. * Apply suggestion from @gemini-code-assist[bot] Co-authored-by:
gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
-
Lei Wang authored
-
- 17 Jul, 2025 2 commits
-
-
Lei Wang authored
- Added a comment to clarify the alignment of dynamic shared memory allocations in the `OptimizeForTarget` function. - Refactored the handling of shared memory allocation merging and synchronization to streamline the process, ensuring consistent behavior regardless of the aggressive merge flag. - Improved code clarity by removing redundant conditional checks related to synchronization and memory allocation.
-
Lei Wang authored
- Included the Cython cache directory in the list of source files for the TileLang build process, ensuring proper handling of cached Cython files during the build.
-
- 16 Jul, 2025 5 commits
-
-
YizhaoGao authored
* Add paged block-sparse flash-decoding kernel * Update example_tilelang_sparse_gqa_decode_paged.py * lint fix --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com>
-
Lei Wang authored
- Updated the `pythonic_expr` function to accept an optional `dtype_map` parameter, allowing for more flexible type conversions. - Refactored calls to `pythonic_expr` in `TLCUDASourceWrapper` to utilize the new mapping feature, improving type handling in kernel generation. - Enhanced code clarity by consolidating repeated calls to `pythonic_expr` into a private method within the wrapper class.
-
Lei Wang authored
* [Enhancement] Update AllReduce operation to include thread offset in kernel generation - Modified the `ReduceOp::Lower` method to incorporate the thread offset in the AllReduce kernel generation for the sm_90 architecture. - This change improves the accuracy of thread management during reduction operations, enhancing performance on specific GPU architectures. * [Enhancement] Refactor thread offset handling in AllReduce kernel generation - Updated the `ReduceOp::Lower` method to streamline the handling of thread offset for AllReduce operations, ensuring consistent usage across different architectures. - This change enhances code clarity and maintains performance improvements for the sm_90 architecture by reducing redundancy in thread offset calculations.
-
Lei Wang authored
-
Lei Wang authored
* [Enhancement] Improve memory access condition checks in GlobalMemChecker - Updated the condition checks in the GlobalMemChecker to utilize symbolic bounds in the CanProve method, enhancing the accuracy of memory access validations. - This change ensures that both upper and lower bound conditions are evaluated with improved proof strength, contributing to more robust memory access analysis. * lintfix * [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy - Implemented checks to ensure that the shared memory range and global range are legal during the bulk copy operation. - Added assertions to validate that the extents of global and shared ranges match, improving the robustness of memory access validation in the LowerBulkCopy function. * [Refactor] Update barrier and clear operations in warp specialization examples - Replaced `mbarrier_wait_parity` and `mbarrier_arrive` with `barrier_wait` and `barrier_arrive` for improved clarity and consistency in synchronization. - Adjusted the order of `clear` operations for local fragments in `example_warp_specialize_gemm_copy_1_gemm_0` to enhance parallel execution efficiency. * [Enhancement] Implement thread partial synchronization and improve shared memory allocation handling - Added support for thread partial barrier synchronization in CUDA, allowing for more flexible thread management. - Enhanced the `MergeSharedMemoryAllocations` function to accept alignment bytes, improving memory allocation efficiency based on target requirements. - Updated the `Lower` methods in `Copy` and `Fill` classes to include conditional predicates for thread execution, ensuring better control over thread behavior. - Refactored the `print` function to include warp group and warp IDs for more detailed debugging output. - Improved the handling of dynamic shared memory allocations in the `LowerAndLegalize` function to align with target-specific requirements. * [Enhancement] Add support for disabling TMA in Copy operations - Introduced a new `disable_tma` parameter in the `Copy` class to control thread memory access behavior. - Updated the `Lower` method to conditionally execute bulk copy operations based on the `disable_tma` flag. - Enhanced the `copy` function to accept the `disable_tma` argument, allowing for more flexible memory copy operations. - Improved handling of `coalesced_width` to ensure it defaults to -1 when not provided, enhancing robustness in memory operations. * [Refactor] Clean up whitespace and formatting in multiple files - Removed unnecessary blank lines and adjusted line breaks for improved code readability in `example_mla_decode.py`, `example_warp_specialize_gemm_copy_gemm_0_1.py`, `phase.py`, and `copy.py`. - Ensured consistent formatting across functions to enhance maintainability and clarity of the codebase. * [Enhancement] Refactor flash attention implementation for improved performance and configurability - Split the shared memory allocations for query and key-value pairs to optimize memory usage. - Introduced command-line arguments for batch size, number of heads, and dimensions, enhancing flexibility in running the example. - Updated kernel execution parameters to improve thread management and synchronization. - Enhanced the overall structure of the flash attention function for better readability and maintainability. * fix * Update layout inference in ParallelOp to account for thread bounds; remove debug print in OptimizeForTarget * Refactor barrier handling and update example configurations - Replaced commented-out barrier creation with new barrier allocation in GEMM example. - Updated kernel configuration in warp specialization example to include async copy settings. - Enhanced barrier management in the phase optimization process to improve synchronization handling. - Introduced new barrier allocation function for better memory management in shared contexts. * Refactor barrier handling in LowerAndLegalize and OptimizeForTarget - Reintroduced barrier lowering in OptimizeForTarget to enhance synchronization. - Removed commented-out barrier lowering in LowerAndLegalize for cleaner code. - Added exit() call in OptimizeForTarget to halt execution after barrier lowering. * Enhance CMake configuration and clean up example scripts - Enabled compile command export in CMakeLists.txt for better build integration. - Removed unnecessary print statement in the warp specialization example. - Cleaned up commented-out code in GEMM example for improved readability. - Updated barrier handling in shared memory allocation transformations for better synchronization. * Refactor barrier handling in warp specialization examples - Replaced commented-out mbarrier code with new barrier allocation using T.alloc_barrier for improved synchronization. - Updated barrier wait and arrive calls to align with the new allocation method across multiple example scripts. - Enhanced code readability by removing unnecessary comments and ensuring consistent barrier management. * Update lower_shared_barrier.cc * Update phase.py * Update warp specialization example and Cython wrapper - Removed commented-out pass configuration options in the warp specialization example for clarity. - Added functionality to write the generated kernel source to a file named "kernel.cu". - Enhanced Cython wrapper to support boolean type conversion for improved type handling. * Add storage synchronization call in shared barrier transformation - Introduced a new evaluation statement to call the TVM storage sync function with "shared" as an argument, enhancing synchronization in the shared barrier handling process. * remove debug files * Remove kernel source output to file in warp specialization example * remove comments * Refactor tensor handling and update test execution in TileLang - Changed `Buffer` to `Tensor` in `customize.py` for better type consistency. - Updated `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to use `tir.BufferLoad` instead of `BufferLoad`. - Commented out the main testing function in `test_tilelang_language_reshape.py` and replaced it with a direct call to `run_reshape_smem` for streamlined testing. - Removed unnecessary NVCC compiler flags in `libgen.py` to reduce verbosity. * Update test_tilelang_language_reshape.py
-
- 15 Jul, 2025 4 commits
-
-
Lei Wang authored
-
Yu Cheng authored
[Dev] Update benchmark and decoding scripts to refine condition checks and optimize tensor operations (#637) - Enhanced the condition in `compare_ab` to ensure baseline checks align with target exclusions. - Removed unnecessary tensor allocation in `mla_decode_tilelang`, optimizing memory usage and improving performance by directly using shared tensors in GEMM operations.
-
Lei Wang authored
* [Enhancement] Add argument simplification option to StmtSimplifier - Introduced a new `simplify_arguments` flag in the `StmtSimplifier::Apply` method to control argument simplification behavior. - Updated the `Simplify` function to accept the new flag, allowing for enhanced flexibility in the simplification process. - Adjusted the `LowerAndLegalize` and `_Simplify` functions to utilize the new argument, ensuring consistent behavior across the codebase. - Added comments to clarify the purpose of the new flag and its impact on simplification logic. * lint fix * [Enhancement] Improve layout inference and reduce operation handling - Updated `ParallelOp::InferLayout` to check for pure buffer stores, enhancing layout inference logic. - Modified `ReduceOp::Lower` to include all threads in the AllReduce operation, improving performance on specific architectures. - Added a TODO comment in `AllReduce` to consider merging synchronization barriers for optimization. * lint fix * [Enhancement] Add input validation for GEMM parameters - Introduced checks to ensure that the dimensions M and N are divisible by their respective warp sizes (kMPerWarp and kNPerWarp) in the Gemm::ComputeWarpPartition method. - Added informative error messages to assist in debugging when the input parameters do not meet the required conditions. * bug fix
-
Yuqing Xia authored
-