- 08 May, 2025 1 commit
-
-
Lei Wang authored
* Add example for warp specialization with flash attention * Introduced a new example script `example_warp_specialize_flashmla.py` demonstrating flash attention using warp specialization in TileLang. * Implemented the `flashattn` function with shared memory allocation and memory barrier synchronization for improved performance. * Added a reference program for validation against PyTorch's implementation, including profiling for latency and performance metrics. * Removed the outdated `example_warp_specialize_mla.py` to streamline examples and focus on the new implementation. * Add memory barrier functions to builtin.py * Introduced `barrier_wait` and `barrier_arrive` functions for memory barrier synchronization. * Enhanced documentation with detailed docstrings for both functions, clarifying their usage and parameters. * The `barrier_wait` function serves as a wrapper for `mbarrier_wait_parity`, supporting parity values 0 and 1. * Improved code organization and readability by adding blank lines for better separation of logical sections. * Enhance code readability by adding blank lines in example_warp_specialize_flashmla.py and builtin.py * Added blank lines to improve code organization and separation of logical sections in `example_warp_specialize_flashmla.py`. * Included blank lines in `builtin.py` around the `wait_wgmma` and `barrier_wait` functions for better readability. * [Refactor] Update barrier functions and add new example for GEMM with warp specialization * Refactored memory barrier functions in `example_warp_specialize_flashmla.py` to use the new `barrier_wait` and `barrier_arrive` methods for improved clarity and consistency. * Introduced a new example script `example_warp_specialize_gemm_copy_gemm_0_1.py` demonstrating matrix multiplication with warp specialization and shared memory allocation. * Enhanced the `layout.cc` and `elem.cc` files to improve structural equality checks and error handling in copy operations. * Updated `warpgroup.py` to refine thread ID calculations for better performance in warp specialization scenarios. * Added new shuffle operations in `builtin.py` for enhanced functionality in parallel computations. * lint fix * Update loop variable checks in SIMT loop and buffer region validation * Modified checks in `elem.cc` to ensure loop variable sizes are less than or equal to source and destination range sizes for better error handling. * Adjusted assertions in `copy.py` to reflect the updated logic, allowing for more flexible region extent comparisons and improved error messaging. * lint fix * test fix
-
- 07 May, 2025 1 commit
-
-
Yuxi Chi authored
* fix get_swizzle_layout implementation. * format.
-
- 06 May, 2025 5 commits
-
-
Lei Wang authored
* [Feature] Add cache directory management functions in tilelang.cache * Introduced `get_cache_dir` and `set_cache_dir` functions to manage the kernel cache directory. * Updated `KernelCache` class to store cache directory as a `Path` object for improved path handling. * Enhanced documentation with examples for new cache directory functions. * [Refactor] Update cache imports in tilelang.__init__.py * Added `set_cache_dir` and `get_cache_dir` functions to the import statement for improved cache directory management. * This change enhances the accessibility of cache directory management functions within the module.
-
Lei Wang authored
* [Enhancement] Introduce pass_configs parameter for kernel compilation * Added a new `pass_configs` parameter to the `tilelang.compile` function to allow for more flexible kernel compilation configurations. * Updated related classes and methods to accommodate the new parameter, ensuring compatibility across the codebase. * Enhanced the `torch_assert_close` function to include customizable tensor names for better debugging output. * Refactored input handling in example scripts to streamline the process of obtaining inputs for kernel execution. * lint fix
-
Lei Wang authored
* [Feature] Add TILELANG_CHECK_LAST_ERROR macro for improved error handling in CUDA and HIP * Introduced TILELANG_CHECK_LAST_ERROR macro to streamline error checking for kernel launches in both CUDA and HIP. * Updated kernel launch code in wrapper.py to utilize the new macro, enhancing readability and maintainability. * This change improves error reporting by providing detailed messages when kernel execution fails. * [Refactor] Standardize error message formatting in TILELANG_CHECK_LAST_ERROR macro * Updated the TILELANG_CHECK_LAST_ERROR macro in both CUDA and HIP implementations to ensure consistent formatting of error messages. * Enhanced readability by aligning the error message structure across different platforms, improving maintainability of error handling code.
-
Lei Wang authored
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability. * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management. * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files * Updated function names from CamelCase to snake_case for better consistency across the codebase. * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc. * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions. * [Refactor] Rename operations to snake_case for consistency * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others. * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability. * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions. * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang. * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames. * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions. * Enhanced the TileLang API with new methods for retrieving block and thread extents. * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation. * Improved layout inference and kernel launch logic for better performance and clarity. * [Refactor] Clean up code formatting and improve readability * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`. * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity. * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs. * Ensured consistent spacing and formatting across multiple files to enhance overall code readability. * lint fix * [Refactor] Update mbarrier functions for improved clarity and consistency * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability. * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity. * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code. * Added detailed docstrings to clarify usage examples for memory barrier functions. * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections. * [Feature] Add examples for warp specialization and TMA barrier integration * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers. * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance. * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch. * Updated the `phase.py` to include TMA barrier injection in the optimization process. * Improved documentation and comments for better clarity on usage and functionality. * [Feature] Add example for warp specialization in GEMM with TMA barriers * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers. * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance. * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation. * Enhanced documentation and comments for clarity on usage and functionality. * lint fix * [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement. * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results. * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis. * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness. * lint fix * [Feature] Add new examples for warp specialization and TMA integration * Introduced multiple new example scripts demonstrating warp specialization techniques, including `example_warp_specialize_flashmla.py`, `example_warp_specialize_gemm_barrierpipe_stage2.py`, `example_warp_specialize_gemm_copy_0_gemm_1.py`, `example_warp_specialize_gemm_copy_1_gemm_0.py`, and `example_warp_specialize_gemm_softpipe_stage2.py`. * Each example showcases matrix multiplication with warp specialization and TMA barriers, implementing kernel functions with shared memory allocation and memory barrier synchronization for enhanced performance. * Added a test suite in `test_example_warp_specialize.py` to validate the functionality of the new examples. * Updated the TileLang API to support these examples and improve kernel compilation and testing processes. * Removed outdated example scripts to streamline the codebase and enhance clarity on available functionalities. * lint fix * Remove outdated example scripts for warp specialization and TMA integration to streamline the codebase. This includes `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, `example_warp_specialize_gemm_stage2.py`, and `example_warp_specialize_mla.py`, which are no longer needed following recent updates and improvements in the TileLang API.
-
- 03 May, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability. * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management. * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files * Updated function names from CamelCase to snake_case for better consistency across the codebase. * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc. * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions. * [Refactor] Rename operations to snake_case for consistency * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others. * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability. * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions. * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang. * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames. * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions. * Enhanced the TileLang API with new methods for retrieving block and thread extents. * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation. * Improved layout inference and kernel launch logic for better performance and clarity. * [Refactor] Clean up code formatting and improve readability * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`. * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity. * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs. * Ensured consistent spacing and formatting across multiple files to enhance overall code readability. * lint fix * [Refactor] Update mbarrier functions for improved clarity and consistency * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability. * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity. * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code. * Added detailed docstrings to clarify usage examples for memory barrier functions. * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections. * [Feature] Add examples for warp specialization and TMA barrier integration * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers. * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance. * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch. * Updated the `phase.py` to include TMA barrier injection in the optimization process. * Improved documentation and comments for better clarity on usage and functionality. * [Feature] Add example for warp specialization in GEMM with TMA barriers * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers. * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance. * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation. * Enhanced documentation and comments for clarity on usage and functionality. * lint fix * [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement. * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results. * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis. * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness. * lint fix
-
- 01 May, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Improve layout inference accuracy in ParallelOp (#441) * Added logic to use non-replicated buffers as source buffers for more accurate layout inference. * Enhanced comments to clarify the rationale behind buffer selection in layout inference process. * [Enhancement] Add error handling macros and refactor loop partitioning logic * Introduced TILELANG_CHECK macro for improved error handling in CUDA and HIP code, providing detailed error messages for kernel launches. * Enhanced loop partitioning logic to handle fragment buffers more effectively, ensuring correct replication based on thread extent. * Added logging for thread range in PlanLoopPartition to aid in debugging and performance analysis. * Updated pass configuration management to streamline vectorization control in the optimization process. * lint fix * remove debug print * [Refactor] Update legalize_safe_memory_access.cc to improve memory access handling * Replaced Apache License header with MIT License. * Added logic to handle local buffer conditions in memory access. * Introduced IsLocalBuffer function to check buffer scope. * Enhanced comments for clarity on memory access operations.
-
- 30 Apr, 2025 2 commits
-
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability. * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management. * [Refactor] Rename operations for consistency in lower_hopper_intrin and related files * Updated function names from CamelCase to snake_case for better consistency across the codebase. * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc. * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions. * [Refactor] Rename operations to snake_case for consistency * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others. * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability. * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions. * [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang. * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames. * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions. * Enhanced the TileLang API with new methods for retrieving block and thread extents. * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation. * Improved layout inference and kernel launch logic for better performance and clarity. * [Refactor] Clean up code formatting and improve readability * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`. * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity. * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs. * Ensured consistent spacing and formatting across multiple files to enhance overall code readability. * lint fix * [Refactor] Update mbarrier functions for improved clarity and consistency * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability. * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity. * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code. * Added detailed docstrings to clarify usage examples for memory barrier functions. * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
-
dependabot[bot] authored
Bumps [transformers](https://github.com/huggingface/transformers) from 4.48.0 to 4.50.0. - [Release notes](https://github.com/huggingface/transformers/releases) - [Commits](https://github.com/huggingface/transformers/compare/v4.48.0...v4.50.0 ) --- updated-dependencies: - dependency-name: transformers dependency-version: 4.50.0 dependency-type: direct:production ... Signed-off-by:
dependabot[bot] <support@github.com> Co-authored-by:
dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 29 Apr, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Improve layout inference accuracy in ParallelOp (#441) * Added logic to use non-replicated buffers as source buffers for more accurate layout inference. * Enhanced comments to clarify the rationale behind buffer selection in layout inference process. * [Enhancement] Add error handling macros and refactor loop partitioning logic * Introduced TILELANG_CHECK macro for improved error handling in CUDA and HIP code, providing detailed error messages for kernel launches. * Enhanced loop partitioning logic to handle fragment buffers more effectively, ensuring correct replication based on thread extent. * Added logging for thread range in PlanLoopPartition to aid in debugging and performance analysis. * Updated pass configuration management to streamline vectorization control in the optimization process. * lint fix * remove debug print
-
- 28 Apr, 2025 1 commit
-
-
Lei Wang authored
* Added logic to use non-replicated buffers as source buffers for more accurate layout inference. * Enhanced comments to clarify the rationale behind buffer selection in layout inference process.
-
- 27 Apr, 2025 3 commits
-
-
Cunxiao Ni authored
* [Enhancement] Reduce CPU overhead during kernel execution * fix lint
-
Lei Wang authored
* Added comments to distinguish between CPU and GPU kernel launch sections for better code readability. * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
-
Gabriel Wu authored
* Fix typo * bugfix --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
- 26 Apr, 2025 3 commits
-
-
Lei Wang authored
[Enhancement] Simplify vectorization process in loop_vectorize.cc and add atomic add test (#436) (#439) * Removed redundant simplification step in vectorization logic to streamline performance. * Introduced a new test for atomic addition in TileLang, validating functionality with a reference implementation using PyTorch.
-
yyttt6 authored
* yes * [Bugfix] fix the unexpected keyword error of autotune * format * test
-
Lei Wang authored
* [Enhancement] Update reduce operations to support clear option in sum and abs sum (#436) * Modified reduce_sum and reduce_absmax functions to include a clear parameter, allowing for accumulation on existing values. * Updated ReduceOp::Lower method to handle initialization and buffer duplication based on the clear flag for sum and abs sum operations. * Added new tests for reduce_sum and reduce_max with clear functionality to ensure correctness in various scenarios. * Enhanced documentation for reduce functions to clarify the behavior of the clear parameter. * lint fix * Update tensor type annotations in test_tilelang_transform_annotate_device_regions.py from Buffer to Tensor * Update tensor type in reduce sum tests from float16 to float32 for improved precision
-
- 25 Apr, 2025 3 commits
-
-
Lei Wang authored
* [Enhancement] Improve error handling in layout inference and update profiler type in tests * Added a detailed error message in the layout inference for local.fragment to clarify the requirement for trans_B. * Updated the profiler type in the cumulative sum test from TensorSupplyType.One to TensorDistributionType.Randn for better profiling accuracy. * lint fix * [Refactor] Update OperandTraits to include num_warp_n parameter * Modified OperandTraits templates across gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional num_warp_n parameter for improved flexibility in layout and copy operations. * Adjusted Copy type selection based on the new parameter to enhance performance and adaptability in various scenarios. * lint fix * [Refactor] Update DispatchInstruction templates to include N parameter * Modified DispatchInstruction templates in gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional N parameter, enhancing flexibility in tile size calculations. * Adjusted MMA_Group definitions to use std::min for improved handling of warp sizes, ensuring better performance and adaptability in various scenarios. * [Refactor] Simplify store buffer scope checks in pipeline planning * Removed redundant condition for 'local' scope in the store buffer checks, streamlining the logic for identifying global copy patterns. * Enhanced code clarity by reducing complexity in the conditional statements.
-
Lei Wang authored
* [Enhancement] Improve error handling in layout inference and update profiler type in tests * Added a detailed error message in the layout inference for local.fragment to clarify the requirement for trans_B. * Updated the profiler type in the cumulative sum test from TensorSupplyType.One to TensorDistributionType.Randn for better profiling accuracy. * lint fix * [Refactor] Update OperandTraits to include num_warp_n parameter * Modified OperandTraits templates across gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional num_warp_n parameter for improved flexibility in layout and copy operations. * Adjusted Copy type selection based on the new parameter to enhance performance and adaptability in various scenarios. * lint fix * [Refactor] Update DispatchInstruction templates to include N parameter * Modified DispatchInstruction templates in gemm_sm80.h, gemm_sm89.h, and gemm_sm90.h to include an additional N parameter, enhancing flexibility in tile size calculations. * Adjusted MMA_Group definitions to use std::min for improved handling of warp sizes, ensuring better performance and adaptability in various scenarios.
-
Lei Wang authored
* [Refactor] Adjust layout inference calculations in Gemm and ParallelOp * Updated block size calculation in Gemm to account for the range of thread bounds, improving accuracy in layout inference. * Simplified layout conflict error messages in ParallelOp for better clarity, enhancing debugging experience. * Removed redundant buffer checks in ParallelOp layout inference logic, streamlining the code. * [Refactor] Clean up layout inference logic in Gemm and ParallelOp * Removed unnecessary warning log in Gemm related to WGMMA conditions, streamlining the layout inference process. * Commented out redundant checks in ParallelOp's layout inference, improving code clarity while maintaining functionality. * Enhanced error messages in ParallelOp to provide clearer context for layout conflicts, aiding in debugging efforts. * lint fix * [Enhancement] Improve cumulative sum functionality and annotations handling * Updated the `cumsum` function to include detailed documentation and error handling for dimension bounds. * Modified the `run_cumsum` test to utilize a random tensor supply type for profiling, enhancing test robustness. * Added annotations to the fused loop in `loop_fusion_utils.h`, ensuring proper metadata is preserved during loop fusion. * lint fix
-
- 24 Apr, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Adjust layout inference calculations in Gemm and ParallelOp * Updated block size calculation in Gemm to account for the range of thread bounds, improving accuracy in layout inference. * Simplified layout conflict error messages in ParallelOp for better clarity, enhancing debugging experience. * Removed redundant buffer checks in ParallelOp layout inference logic, streamlining the code. * [Refactor] Clean up layout inference logic in Gemm and ParallelOp * Removed unnecessary warning log in Gemm related to WGMMA conditions, streamlining the layout inference process. * Commented out redundant checks in ParallelOp's layout inference, improving code clarity while maintaining functionality. * Enhanced error messages in ParallelOp to provide clearer context for layout conflicts, aiding in debugging efforts. * lint fix
-
- 23 Apr, 2025 2 commits
-
-
Lei Wang authored
* [Enhancement] Improve layout inference in Copy operation (#426) * Updated the Copy operation to infer layouts at multiple levels (kCommon, kStrict, kFree) for enhanced flexibility in layout optimization. * Added detailed documentation for layout inference levels in ParallelOp, clarifying their purposes and use cases. * Refactored layout inference logic to accommodate new levels, improving overall robustness and performance in parallel operations. * lint fix
-
Lei Wang authored
* Update submodule 'tvm' to latest commit f4a8f9b * lint fix
-
- 22 Apr, 2025 7 commits
-
-
Lei Wang authored
-
Lei Wang authored
* [Feature] Implement CumSum operation in TileLang * Added CumSumOp class for cumulative sum operations, including argument validation and lowering logic. * Introduced CumSum2D template for CUDA, supporting both forward and reverse cumulative sums. * Created tests for CumSum functionality in shared memory and fragment contexts. * Updated language interface to include cumsum operation, enhancing the reduction capabilities of TileLang. * Refactored reduce.py to support cumsum functionality with appropriate memory allocation and copying mechanisms. * lint fix
-
Yu Cheng authored
* Introduced logic to check for TMA+WS enablement based on annotations in the pipeline planning stage. * Enhanced the handling of `order_anno` and `stage_anno` to determine if TMA+WS is activated, improving flexibility in loop processing. * Refactored the existing code to maintain clarity while integrating the new feature.
-
FeiyangChen authored
-
Yu Cheng authored
* Updated the layout inference in ParallelOp to improve the selection of source buffers for layout accuracy. * Introduced logic to choose the read source buffer based on the number of indices, ensuring more precise layout inference. * Refactored the loop handling to maintain clarity and improve the overall robustness of the layout inference process.
-
Zhao Wu authored
* Support to find Cython path more automatically * lint fix --------- Co-authored-by:LeiWang1999 <leiwang1999@outlook.com>
-
Lei Wang authored
* [Enhancement] Introduce thread range management in layout and operation handling * Added `SetThreadRange` method to `FragmentNode` for managing thread ranges. * Updated `LayoutNode::Inverse` to provide more informative error messages. * Refactored layout inference and operation lowering to utilize `thread_bounds` instead of `block_size`, enhancing flexibility for thread management. * Introduced new tests for tilelang operations to validate thread range functionality and ensure correctness in parallel execution scenarios. * lint fix * [Refactor] Improve thread variable handling in layout inference and operation lowering * Removed workaround for undefined thread_var in layout inference, ensuring proper handling of thread bounds. * Updated logic to define thread bounds based on the presence of thread_var, enhancing robustness in thread management. * Refactored thread_var initialization in lower_tile_op to maintain consistency across the codebase. * [Refactor] Update thread variable handling in layout inference and operation lowering * Refactored thread variable checks to ensure bounds are only accessed when defined, improving safety and clarity. * Initialized thread_var with a default range to prevent undefined behavior. * Updated logic in lower_tile_op to align with new thread variable handling, enhancing consistency across the codebase.
-
- 21 Apr, 2025 3 commits
-
-
alex_xiao authored
-
Lei Wang authored
* Introduced a new function `get_nvcc_compiler` in nvcc.py to obtain the path to the nvcc compiler. * Updated LibraryGenerator to use `get_nvcc_compiler` instead of hardcoding the nvcc command, improving maintainability and flexibility.
-
Lei Wang authored
* [New Feature] Add FP8 Flash Attention Implementation (#412) * Introduce a new example script for FP8 Flash Attention in `example_mla_decode_kv_fp8.py`, showcasing the use of tilelang for efficient attention computation. * Implement the `flashattn` function with optimized memory management and kernel execution. * Include a reference program for comparison and performance evaluation. * Add command-line argument parsing for batch size, number of heads, and dimensions to facilitate testing and experimentation. * Enhance the overall structure and readability of the code. This addition aims to improve the performance of attention mechanisms in deep learning models by leveraging FP8 precision and optimized kernel execution. * lint fix * optimize quick start * lint fix
-
- 19 Apr, 2025 4 commits
-
-
You Jiacheng authored
* [Language] make linter and type checker happy with mocking * Apply suggestions from code review Co-authored-by:
Copilot <175728472+Copilot@users.noreply.github.com> * Refactor BaseTensor class in proxy.py to implement __getitem__ and __setitem__ methods, enhancing type checking and linting compliance. Added method stubs for from_ptr and other subclasses for improved clarity and maintainability. * Refactor type imports in proxy.py to enhance clarity and maintainability by replacing the built-in Self type with the Self type from typing_extensions. --------- Co-authored-by:
Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by:
LeiWang1999 <leiwang1999@outlook.com>
-
Lei Wang authored
* Phase out attr * Remove unused dependencies from requirements files * Update TVM submodule to latest commit 4776d31
-
Andy Luo authored
* Update Installation.md * Update installation prerequisites in documentation * cu128 Docker Image --------- Co-authored-by:
Lei Wang <34334180+LeiWang1999@users.noreply.github.com> Co-authored-by:
Andy <andyluo918@gmail.com>
-
Lei Wang authored
* Update TVM submodule and enhance vectorization logic in loop_vectorize.cc - Updated the TVM submodule to the latest commit. - Simplified the vectorization process by ensuring that the vectorized expression is simplified after vectorization, improving expression handling. - Added checks in loop_fusion_utils.h to prevent fusion of loops with non-power-of-2 extents, enhancing robustness in loop transformations. * lint fix
-
- 18 Apr, 2025 1 commit
-
-
Lei Wang authored
-