- 08 May, 2025 2 commits
-
-
Lei Wang authored
[Refactor] Update barrier functions and remove argparse in example_warp_specialize_flashmla.py (#457)
* Refactored barrier functions to use new signatures for improved clarity and consistency.
* Replaced `mbarrier_arrive` and `mbarrier_wait_parity` with `barrier_arrive` and `barrier_wait`, respectively.
* Removed the argparse dependency and replaced it with hardcoded parameters for batch size and dimensions in the main function, simplifying the example script.
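For context, a minimal before/after sketch of the renamed call sites is shown below. It is a kernel-body fragment, and the exact `T.` spellings are taken from the commit text rather than verified against the released API; the barrier handle `bar` is illustrative.

```python
import tilelang.language as T

def handoff(bar):
    # Hedged fragment meant for a TileLang kernel body; `bar` is an mbarrier handle.
    # Old spelling (before #457):
    #   T.mbarrier_arrive(bar)
    #   T.mbarrier_wait_parity(bar, 0)
    # New spelling (after #457):
    T.barrier_arrive(bar)   # signal that this warp group's work on the tile is done
    T.barrier_wait(bar, 0)  # block until the barrier reaches the expected phase (0 or 1)
```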
-
Lei Wang authored
* Add example for warp specialization with flash attention
  * Introduced a new example script `example_warp_specialize_flashmla.py` demonstrating flash attention using warp specialization in TileLang.
  * Implemented the `flashattn` function with shared memory allocation and memory barrier synchronization for improved performance.
  * Added a reference program for validation against PyTorch's implementation, including profiling for latency and performance metrics.
  * Removed the outdated `example_warp_specialize_mla.py` to streamline examples and focus on the new implementation.
* Add memory barrier functions to builtin.py
  * Introduced `barrier_wait` and `barrier_arrive` functions for memory barrier synchronization.
  * Enhanced documentation with detailed docstrings for both functions, clarifying their usage and parameters.
  * The `barrier_wait` function serves as a wrapper for `mbarrier_wait_parity`, supporting parity values 0 and 1.
  * Improved code organization and readability by adding blank lines for better separation of logical sections.
* Enhance code readability by adding blank lines in example_warp_specialize_flashmla.py and builtin.py
  * Added blank lines to improve code organization and separation of logical sections in `example_warp_specialize_flashmla.py`.
  * Included blank lines in `builtin.py` around the `wait_wgmma` and `barrier_wait` functions for better readability.
* [Refactor] Update barrier functions and add new example for GEMM with warp specialization
  * Refactored memory barrier functions in `example_warp_specialize_flashmla.py` to use the new `barrier_wait` and `barrier_arrive` methods for improved clarity and consistency.
  * Introduced a new example script `example_warp_specialize_gemm_copy_gemm_0_1.py` demonstrating matrix multiplication with warp specialization and shared memory allocation.
  * Enhanced the `layout.cc` and `elem.cc` files to improve structural equality checks and error handling in copy operations.
  * Updated `warpgroup.py` to refine thread ID calculations for better performance in warp specialization scenarios.
  * Added new shuffle operations in `builtin.py` for enhanced functionality in parallel computations.
* lint fix
* Update loop variable checks in SIMT loop and buffer region validation
  * Modified checks in `elem.cc` to ensure loop variable sizes are less than or equal to source and destination range sizes for better error handling.
  * Adjusted assertions in `copy.py` to reflect the updated logic, allowing for more flexible region extent comparisons and improved error messaging.
* lint fix
* test fix
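The wrapper relationship described above (`barrier_wait` over `mbarrier_wait_parity`, parity restricted to 0 and 1) can be pictured with the small sketch below; the real builtin.py implementation may differ, and the `T.` spelling of the underlying intrinsic is an assumption based on the commit text.

```python
import tilelang.language as T

def barrier_wait(barrier, parity: int):
    """Hedged sketch of the wrapper described above; the actual builtin.py code may differ.

    Blocks until `barrier` reaches the given phase. Only parity values 0 and 1
    are meaningful, matching the double-buffered producer/consumer pattern used
    by the warp-specialized examples.
    """
    assert parity in (0, 1), "parity encodes the mbarrier phase and must be 0 or 1"
    return T.mbarrier_wait_parity(barrier, parity)
```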
-
- 06 May, 2025 2 commits
-
-
Lei Wang authored
* [Enhancement] Introduce pass_configs parameter for kernel compilation
  * Added a new `pass_configs` parameter to the `tilelang.compile` function to allow for more flexible kernel compilation configurations.
  * Updated related classes and methods to accommodate the new parameter, ensuring compatibility across the codebase.
  * Enhanced the `torch_assert_close` function to include customizable tensor names for better debugging output.
  * Refactored input handling in example scripts to streamline the process of obtaining inputs for kernel execution.
* lint fix
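A hedged usage sketch of the new parameter follows; the dictionary key shown is a hypothetical placeholder (the commit does not list the supported pass names), and `my_gemm` / `out_idx` stand in for a prim_func and output marker defined elsewhere.

```python
import tilelang

# Hedged sketch: `my_gemm` is a TileLang prim_func defined elsewhere; the
# pass_configs key is a hypothetical placeholder, not a documented option.
kernel = tilelang.compile(
    my_gemm,
    out_idx=[-1],                                # assumed: marks the output argument
    pass_configs={"tl.some_pass_option": True},  # forwarded to the lowering pipeline
)
```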
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
  * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
  * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
  * Updated function names from CamelCase to snake_case for better consistency across the codebase.
  * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
  * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
* [Refactor] Rename operations to snake_case for consistency
  * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
  * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
  * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
  * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
  * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
  * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
  * Enhanced the TileLang API with new methods for retrieving block and thread extents.
  * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
  * Improved layout inference and kernel launch logic for better performance and clarity.
* [Refactor] Clean up code formatting and improve readability
  * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
  * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
  * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
  * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
* lint fix
* [Refactor] Update mbarrier functions for improved clarity and consistency
  * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
  * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
  * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
  * Added detailed docstrings to clarify usage examples for memory barrier functions.
  * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
* [Feature] Add examples for warp specialization and TMA barrier integration
  * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
  * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
  * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
  * Updated the `phase.py` to include TMA barrier injection in the optimization process.
  * Improved documentation and comments for better clarity on usage and functionality.
* [Feature] Add example for warp specialization in GEMM with TMA barriers
  * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
  * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
  * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
  * Enhanced documentation and comments for clarity on usage and functionality.
* lint fix
* [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection
  * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
  * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
  * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
  * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
* lint fix
* [Feature] Add new examples for warp specialization and TMA integration
  * Introduced multiple new example scripts demonstrating warp specialization techniques, including `example_warp_specialize_flashmla.py`, `example_warp_specialize_gemm_barrierpipe_stage2.py`, `example_warp_specialize_gemm_copy_0_gemm_1.py`, `example_warp_specialize_gemm_copy_1_gemm_0.py`, and `example_warp_specialize_gemm_softpipe_stage2.py`.
  * Each example showcases matrix multiplication with warp specialization and TMA barriers, implementing kernel functions with shared memory allocation and memory barrier synchronization for enhanced performance.
  * Added a test suite in `test_example_warp_specialize.py` to validate the functionality of the new examples.
  * Updated the TileLang API to support these examples and improve kernel compilation and testing processes.
  * Removed outdated example scripts to streamline the codebase and enhance clarity on available functionalities.
* lint fix
* Remove outdated example scripts for warp specialization and TMA integration to streamline the codebase. This includes `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, `example_warp_specialize_gemm_stage2.py`, and `example_warp_specialize_mla.py`, which are no longer needed following recent updates and improvements in the TileLang API.
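Much of the above revolves around the explicit-parameter mbarrier calls and double-buffered parity. A minimal kernel-body fragment is sketched below, assuming the `T.mbarrier_arrive` / `T.mbarrier_wait_parity` spellings from `builtin.py`; the barrier list, loop bounds, and parity schedule are illustrative rather than the exact logic of `gemm_ws.py`.

```python
import tilelang.language as T

def double_buffered_handoff(bar_k_ready, num_iters):
    # Hedged kernel-body fragment; barrier handles and names are illustrative.
    for ko in range(num_iters):
        buf = ko % 2            # which of the two shared-memory buffers is in flight
        phase = (ko // 2) % 2   # parity flips each time the same buffer is reused
        T.mbarrier_arrive(bar_k_ready[buf])               # producer: tile data is ready
        T.mbarrier_wait_parity(bar_k_ready[buf], phase)   # consumer: wait for that tile
```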
-
- 03 May, 2025 1 commit
-
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
  * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
  * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
  * Updated function names from CamelCase to snake_case for better consistency across the codebase.
  * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
  * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
* [Refactor] Rename operations to snake_case for consistency
  * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
  * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
  * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
  * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
  * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
  * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
  * Enhanced the TileLang API with new methods for retrieving block and thread extents.
  * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
  * Improved layout inference and kernel launch logic for better performance and clarity.
* [Refactor] Clean up code formatting and improve readability
  * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
  * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
  * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
  * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
* lint fix
* [Refactor] Update mbarrier functions for improved clarity and consistency
  * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
  * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
  * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
  * Added detailed docstrings to clarify usage examples for memory barrier functions.
  * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
* [Feature] Add examples for warp specialization and TMA barrier integration
  * Introduced three new example scripts: `example_warp_specialize_gemm.py`, `example_warp_specialize_gemm_barrier4.py`, and `example_warp_specialize_mla.py` demonstrating matrix multiplication with warp specialization and TMA barriers.
  * Implemented kernel functions with shared memory allocation and memory barrier synchronization for improved performance.
  * Enhanced the TileLang API with new methods for compiling and testing kernels in Python using PyTorch.
  * Updated the `phase.py` to include TMA barrier injection in the optimization process.
  * Improved documentation and comments for better clarity on usage and functionality.
* [Feature] Add example for warp specialization in GEMM with TMA barriers
  * Introduced a new example script `example_warp_specialize_gemm_stage2.py` demonstrating matrix multiplication using warp specialization and TMA barriers.
  * Implemented a kernel function with shared memory allocation and memory barrier synchronization for enhanced performance.
  * Included functionality to compile the kernel into a PyTorch-compatible function and validate its correctness against PyTorch's reference implementation.
  * Enhanced documentation and comments for clarity on usage and functionality.
* lint fix
* [Feature] Implement WarpSpecializedDetector for TMA and MBarrier Detection
  * Added the `WarpSpecializedDetector` class to identify the presence of TMA operations and memory barrier operations within a given TIR statement.
  * Enhanced the `WarpSpecialized` pass to utilize the detector, allowing for conditional substitution based on the detection results.
  * Improved code organization by including necessary headers and utilizing the `IRVisitorWithAnalyzer` for analysis.
  * This addition aims to optimize warp specialization by ensuring that only relevant functions are transformed, enhancing performance and correctness.
* lint fix
-
- 30 Apr, 2025 2 commits
-
-
Lei Wang authored
* [Refactor] Update KernelLaunch to clarify CPU and GPU kernel launch logic
  * Added comments to distinguish between CPU and GPU kernel launch sections for better code readability.
  * Changed the creation of empty blocks to use a consistent "root" identifier, enhancing clarity in frame management.
* [Refactor] Rename operations for consistency in lower_hopper_intrin and related files
  * Updated function names from CamelCase to snake_case for better consistency across the codebase.
  * Refactored calls to `CreateTMADescriptorOp`, `CreateListofMBarrierOp`, and similar functions to their new names: `create_tma_descriptor`, `create_list_of_mbarrier`, etc.
  * Adjusted corresponding test cases to reflect these changes, ensuring compatibility with the new naming conventions.
* [Refactor] Rename operations to snake_case for consistency
  * Updated function names from CamelCase to snake_case across various files, including `CreateTMADescriptorOp` to `create_tma_descriptor`, `GetMBarrierOp` to `get_mbarrier`, and others.
  * Adjusted corresponding calls and definitions in the codebase to reflect these naming changes, ensuring uniformity and improved readability.
  * Enhanced layout inference and loop partitioning logic to accommodate the new naming conventions.
* [Feature] Introduce Warp Specialization and Eliminate Storage Sync for MBarrier
  * Added a new example `gemm_ws.py` demonstrating matrix multiplication with warp specialization using TileLang.
  * Implemented `WarpSpecializeFrame` and `WarpSpecialize` functionality to manage warp group indices in TIR frames.
  * Introduced `EliminateStorageSyncForMBarrier` transformation to optimize storage synchronization in mbarrier regions.
  * Enhanced the TileLang API with new methods for retrieving block and thread extents.
  * Updated the `LowerAndLegalize` and `OptimizeForTarget` functions to incorporate the new transformation.
  * Improved layout inference and kernel launch logic for better performance and clarity.
* [Refactor] Clean up code formatting and improve readability
  * Added blank lines for better separation of code blocks in `gemm_ws.py`, `phase.py`, `kernel.py`, and `warpgroup.py`.
  * Reformatted the `tilelang.compile` call in `gemm_ws.py` for improved clarity.
  * Updated comments in `warpgroup.py` to clarify the availability of the `WarpSpecialize` function for NVIDIA GPUs.
  * Ensured consistent spacing and formatting across multiple files to enhance overall code readability.
* lint fix
* [Refactor] Update mbarrier functions for improved clarity and consistency
  * Refactored `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to accept explicit parameters for better readability.
  * Updated calls in `gemm_ws.py` to use the new function signatures, enhancing code clarity.
  * Adjusted `warpgroup.py` to remove unused thread extent variable, streamlining the code.
  * Added detailed docstrings to clarify usage examples for memory barrier functions.
  * Added blank lines in `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` for improved code readability and separation of logical sections.
-
dependabot[bot] authored
Bumps [transformers](https://github.com/huggingface/transformers) from 4.48.0 to 4.50.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.48.0...v4.50.0)
---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.50.0
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
- 26 Apr, 2025 1 commit
-
-
yyttt6 authored
* yes
* [Bugfix] fix the unexpected keyword error of autotune
* format
* test
-
- 21 Apr, 2025 1 commit
-
-
Lei Wang authored
* [New Feature] Add FP8 Flash Attention Implementation (#412)
  * Introduce a new example script for FP8 Flash Attention in `example_mla_decode_kv_fp8.py`, showcasing the use of tilelang for efficient attention computation.
  * Implement the `flashattn` function with optimized memory management and kernel execution.
  * Include a reference program for comparison and performance evaluation.
  * Add command-line argument parsing for batch size, number of heads, and dimensions to facilitate testing and experimentation.
  * Enhance the overall structure and readability of the code.
  This addition aims to improve the performance of attention mechanisms in deep learning models by leveraging FP8 precision and optimized kernel execution.
* lint fix
* optimize quick start
* lint fix
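The command-line surface the example adds can be sketched as below; the flag names and default values are assumptions, the commit only states that batch size, number of heads, and dimensions are configurable.

```python
import argparse

def parse_args():
    # Hedged sketch of the example's CLI; flag names and defaults are illustrative.
    parser = argparse.ArgumentParser(description="FP8 MLA decode flash attention example")
    parser.add_argument("--batch", type=int, default=1, help="batch size")
    parser.add_argument("--heads", type=int, default=128, help="number of attention heads")
    parser.add_argument("--dim", type=int, default=512, help="head dimension")
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"batch={args.batch}, heads={args.heads}, dim={args.dim}")
```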
-
- 16 Apr, 2025 2 commits
-
-
dependabot[bot] authored
Bumps [transformers](https://github.com/huggingface/transformers) from 4.40 to 4.48.0.
- [Release notes](https://github.com/huggingface/transformers/releases)
- [Commits](https://github.com/huggingface/transformers/compare/v4.40.0...v4.48.0)
---
updated-dependencies:
- dependency-name: transformers
  dependency-version: 4.48.0
  dependency-type: direct:production
...
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
-
Lei Wang authored
* Update copyright notice in example_mha_bwd_wgmma_pipelined.py to reflect Tile-AI Corporation ownership. * lint fix
-
- 14 Apr, 2025 1 commit
-
-
Lei Wang authored
* Update README.md for deepseek_mla: Refine performance comparison details and add acknowledgment section. Adjusted performance metrics for TileLang, highlighting its efficiency over Triton and assembly kernels. Included gratitude to the AMD ROCm team for their contributions.
* Update README.md for deepseek_mla: Clarify performance metrics for TileLang, specifying the range of performance parity with hand-optimized assembly kernels. This adjustment enhances the accuracy of the comparative analysis against Triton implementations.
-
- 13 Apr, 2025 1 commit
-
-
Zhengju Tang authored
* [Dynamic Symbolic] Add pass_config to customize vectorization and tail split
* Lint
* Only check for vectorized dimension. Add docs.
* Lint
* Update comment for cache directory in .gitignore
* Use CUTLASS convention to represent dynamic alignment. Fix bugs
* Add benchmark examples
* Add more benchmarks. Fix accumulate type bug.
* Lint
* Lint
* Test Lint
* Lint
* Test Lint
* Lint
* Fix typo
* Lint
* Lint
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 12 Apr, 2025 3 commits
-
-
Lei Wang authored
* Update legalize_safe_memory_access.cc
* Add cache path handling and file locking in Cython adapter
  - Introduced a new cache path based on the code hash for the Cython JIT adapter, enhancing cache management.
  - Added a lock file mechanism to ensure safe access during cache operations, improving concurrency handling.
  - These changes aim to optimize the compilation process and prevent race conditions during library loading.
* lint fix
* refactor
* refactor
* Add GlobalCopyPatternDetector to identify global memory copy patterns
  - Introduced a new class, GlobalCopyPatternDetector, to detect specific memory copy patterns in statements.
  - Enhanced the PipelinePlanner to utilize this detector for determining copy stages based on global and local memory scopes.
  - Improved code clarity and maintainability by encapsulating detection logic within the new class.
* Refactor copy stage detection logic in pipeline planning
  - Simplified the determination of copy stages by directly assigning the result of GlobalCopyPatternDetector to pinfo.copy_stage.
  - Removed redundant checks for read and write scopes, enhancing code clarity and maintainability.
* lint fix
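The cache-path-plus-lock-file pattern described above follows a common recipe; the sketch below is illustrative only, with the cache root, file names, and the use of the third-party `filelock` package all being assumptions rather than the adapter's actual implementation.

```python
import hashlib
import os
from filelock import FileLock  # third-party: pip install filelock

def cache_dir_for(source_code: str, root: str = os.path.expanduser("~/.tilelang_cache")) -> str:
    # Derive a stable cache directory from a hash of the generated code.
    code_hash = hashlib.sha256(source_code.encode()).hexdigest()
    path = os.path.join(root, code_hash)
    os.makedirs(path, exist_ok=True)
    return path

def compile_with_lock(source_code: str, build_fn, load_fn):
    # build_fn/load_fn are caller-provided callables (hypothetical stand-ins for
    # the real build and dlopen steps).
    path = cache_dir_for(source_code)
    lib = os.path.join(path, "kernel.so")
    # The lock file serializes concurrent processes building/loading the same library.
    with FileLock(lib + ".lock"):
        if not os.path.exists(lib):
            build_fn(source_code, lib)
        return load_fn(lib)
```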
-
Lei Wang authored
* Add deepseek_mla to documentation index (#380) * lint fix
-
Lei Wang authored
* [Add] Introduce deepseek_mla documentation for high-performance FlashMLA with TileLang
  - Added a comprehensive guide on writing high-performance kernels using TileLang, focusing on the Multi-Head Latent Attention (MLA) mechanism.
  - Included benchmark results comparing FlashMLA, TileLang, Torch, Triton, and FlashInfer, highlighting TileLang's efficiency and ease of use.
  - Detailed implementation strategies, including layout inference, threadblock swizzling, shared memory swizzling, and warp specialization.
  - Provided examples and explanations of optimization techniques to enhance performance in GPU kernel programming.
* doc update
* [Add] Enhance AMD FlashMLA implementation and documentation
  - Refactored variable names in `benchmark_mla_decode_amd_tilelang.py` for clarity, changing `Q_shared` and `Q_pe_shared` to `Q_local` and `Q_pe_local` to reflect their usage in register allocation.
  - Added a new `README.md` detailing the high-performance FlashMLA implementation on AMD MI300X accelerators, including architectural considerations, optimization strategies, and performance evaluation.
  - Introduced a performance comparison figure to illustrate the efficiency of the TileLang implementation against other frameworks.
* lint fix
* [Add] Expand deepseek_mla documentation for AMD MI300X optimization strategies
  - Introduced a new section detailing architectural differences and optimization strategies for implementing FlashMLA on AMD MI300X accelerators.
  - Highlighted key considerations such as instruction set variations, shared memory constraints, tile size flexibility, and memory bank conflict swizzling.
  - Included performance evaluation results demonstrating TileLang's efficiency compared to other frameworks.
  - Discussed future optimization opportunities for memory bank conflict mitigation and dimension parallelization.
-
- 10 Apr, 2025 2 commits
-
-
Haodong Tian authored
* [Bugfix] Adjust Autotuner threadpool `max_workers` limit to available CPUs
* [Example] Small fix on example_blocksparse_gemm.py
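The clamp described in the first bullet is essentially the following; the surrounding autotuner attribute names are not shown in the commit, so the variables here are illustrative.

```python
import os
from concurrent.futures import ThreadPoolExecutor

requested = 64  # illustrative: whatever worker count the autotuner previously requested
max_workers = min(requested, os.cpu_count() or 1)  # never exceed the available CPUs
executor = ThreadPoolExecutor(max_workers=max_workers)
```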
-
Lei Wang authored
* [Add] Introduce benchmark scripts for MLA decoding with AMD support
  - Added three new benchmark scripts: `benchmark_mla_decode_amd_tilelang.py`, `benchmark_mla_decode_amd_torch.py`, and `benchmark_mla_decode_amd_triton.py` to evaluate the performance of the MLA decoding mechanism across different frameworks.
  - Each script includes implementations for attention calculation, performance profiling, and output validation against reference implementations.
  - Enhanced command-line argument parsing for customizable input parameters, including batch size, number of heads, and dimensions.
  - Integrated performance comparison functionality to facilitate benchmarking between different implementations.
* lint fix
* lint fix
---------
Co-authored-by: Zhiwen Mo <zhiwen.mo25@ic.ac.uk>
-
- 09 Apr, 2025 3 commits
-
-
Lei Wang authored
* [Bugfix] Correct dynamic shared memory size error handling in HIP wrapper
  - Updated the error handling logic in `PREDEF_ATTRIBUTE_SET_DYNAMIC_MEMORY_HIP` to check if the dynamic shared memory size exceeds the maximum limit of 65536.
  - Improved error message clarity by specifying the function name and the attempted size, ensuring better debugging information.
  - Ensured the function returns 0 upon successful setting of the dynamic shared memory size.
* [Add] Implement example for MLA decoding with AMD support
  - Introduced a new example script `example_mla_decode_amd.py` demonstrating the use of the flash attention mechanism with AMD hardware.
  - Implemented functions for attention calculation, including support for split processing and combining outputs.
  - Added command-line argument parsing for customizable input parameters such as batch size, number of heads, and dimensions.
  - Included a reference implementation for validation against the Tile-AI output, ensuring correctness of the implementation.
  - Enhanced performance profiling and output comparison for debugging and optimization purposes.
* lint fix
-
Yuqing Xia authored
[Example] Handle Scenarios in Which a Threadblock is Assigned Only Invalid Block Indices for Sparse Attention (#361)
* Fix issue where a threadblock with only invalid blocks produces incorrect output.
* fix score scale
* format
-
Yu Cheng authored
* Added a new example script `example_gemm_autotune.py` to demonstrate autotuning for matrix multiplication (GEMM) using TileLang.
* Implemented functions for generating configurations, selecting the best configuration, and benchmarking performance.
* Refactored the existing `matmul` function to support dynamic configuration parameters and improved kernel compilation.
* Updated the main execution block to include command-line argument parsing for matrix dimensions and autotuning options.
* Enhanced the example to validate results against a reference implementation, ensuring correctness in matrix multiplication operations.
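A hedged sketch of the config-generation half of such an example is shown below; the parameter names mirror common TileLang GEMM knobs, but the exact value sets used by `example_gemm_autotune.py` are assumptions.

```python
import itertools

def get_configs():
    # Illustrative search space; the real example's values may differ.
    block_M = [64, 128, 256]
    block_N = [64, 128, 256]
    block_K = [32, 64]
    num_stages = [2, 3]
    threads = [128, 256]
    return [
        dict(block_M=m, block_N=n, block_K=k, num_stages=s, threads=t)
        for m, n, k, s, t in itertools.product(block_M, block_N, block_K, num_stages, threads)
    ]

# Each config would be compiled and benchmarked, keeping the lowest-latency one.
print(len(get_configs()), "candidate configurations")
```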
-
- 08 Apr, 2025 2 commits
-
-
Yu Cheng authored
- Modified the `group_per_split_token_cast_to_fp8` function to support `bfloat16`, `float`, and `float16` data types.
- Updated local fragment allocations to use the new `accum_dtype` for consistency.
- Enhanced the main execution block to handle different tensor data types based on the specified `dtype`, improving flexibility in tensor operations.
-
Lei Wang authored
* [Refactor] Update import structure in benchmark_mla.py
  - Moved the import of `flash_mla` functions to the `run_flash_mla` function for better encapsulation.
  - Added a comment for `flashinfer` installation to clarify dependencies.
  - Cleaned up unused imports to enhance code readability.
* lint fix
-
- 07 Apr, 2025 1 commit
-
-
Lei Wang authored
* [Enhancement] Update GEMM examples and autotuner for improved performance
  - Modified `example_gemm_intrinsics.py` to enhance matrix multiplication configurations, increasing warp sizes and adjusting data types for better performance.
  - Updated the kernel compilation process to utilize the new `tilelang.compile` method and improved latency measurement with the profiler.
  - Refactored `example_gemm.py` to include a new autotuning configuration and ensure consistency in latency checks against reference results.
  - Adjusted tensor supply generation in `tilelang/utils/tensor.py` to use `torch.randn` for better randomness in tensor initialization.
  - Enhanced the `JITContext` in `tilelang/autotuner/__init__.py` to replace the profiler with a kernel instance for performance measurement, improving the overall structure of the autotuner.
* bug fix
* fix
* [Enhancement] Update convolution tests and profiling assertions
  - Added a random seed setting for reproducibility in convolution tests.
  - Removed several redundant convolution test cases to streamline the testing process.
  - Updated the assertion in the matrix multiplication profiling to include a maximum mismatched ratio for improved accuracy in results.
  - Enabled the main testing function for better test execution.
* lint fix
-
- 06 Apr, 2025 2 commits
-
-
Yu Cheng authored
- Modified the `group_per_split_token_cast_to_fp8` function to include a conditional check for batch sizes, ensuring that the scaling factor is applied only when within the valid range. This change enhances the robustness of the FP8 conversion process for grouped per-split tokens.
-
YizhaoGao authored
* [Example] Add triton block sparse gqa decode
* lint fix
---------
Co-authored-by: LeiWang1999 <leiwang1999@outlook.com>
-
- 05 Apr, 2025 3 commits
-
-
Yu Cheng authored
Implements FP8 type conversion functionality for grouped per-split tokens. The script includes several helper functions for handling tensor TMA alignment and FP8 conversion, enhancing support for FP8 data types and providing performance benchmarks. This change provides users with more flexible examples of FP8 operations.
-
Lei Wang authored
* [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
  - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
  - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
  - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
  - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
* [Refactor] Clean up whitespace in CUDA-related files
  - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
  - This change enhances the overall organization of the codebase without altering functionality.
* [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
  - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
  - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
  - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
  - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
* lint fix
* Update submodule and enhance FP8 type handling in CUDA codegen
  - Updated the TVM submodule to the latest commit.
  - Modified FP8 type handling in `codegen_cuda.cc` to use more descriptive type codes.
  - Improved constant printing for FP8 and bfloat16 types, ensuring correct representation in generated code.
  - Added error handling for missing configuration keys in the AutoTuner class.
* lint fix
* Remove print statement from example script
* lint fix
* fix
---------
Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
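For illustration, the two device-property helpers named above can be implemented directly against the CUDA driver API; the sketch below uses ctypes and is only an assumption about the shape of the `cuda_driver` module, which may well use different bindings internally.

```python
import ctypes

# On some systems the library is "libcuda.so.1"; adjust as needed.
_cuda = ctypes.CDLL("libcuda.so")
_cuda.cuInit(0)

CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK = 8  # CUDA driver API enum value

def get_device_name(device_id: int = 0) -> str:
    dev = ctypes.c_int()
    _cuda.cuDeviceGet(ctypes.byref(dev), device_id)
    name = ctypes.create_string_buffer(256)
    _cuda.cuDeviceGetName(name, 256, dev)
    return name.value.decode()

def get_shared_memory_per_block(device_id: int = 0) -> int:
    dev = ctypes.c_int()
    _cuda.cuDeviceGet(ctypes.byref(dev), device_id)
    value = ctypes.c_int()
    _cuda.cuDeviceGetAttribute(
        ctypes.byref(value), CU_DEVICE_ATTRIBUTE_MAX_SHARED_MEMORY_PER_BLOCK, dev)
    return value.value
```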
-
Yuqing Xia authored
* add example gqa decode wgmma pipelined
* add sparse gqa
* support num split
* support num split
* add if condition
* add heuristic num split
* clean code
* add ref
* fix bug
* add torch ref
* fix bug
* integrate to torch
* symbolic
* clean mask
* rm actual_num_blocks
* clean code
* get num_sm via torch
* add sparse gqa decode example
* format
* rm example_gqa_decode_wgmma_pipelined.py
* Add license headers to example scripts
* format
* Remove commented-out cache disabling lines
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
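The "get num_sm via torch" step above is a one-liner; the split heuristic sketched after it is an illustrative assumption, not the example's actual policy.

```python
import torch

# Number of streaming multiprocessors on the current GPU.
num_sm = torch.cuda.get_device_properties(0).multi_processor_count

def pick_num_split(batch: int, heads: int, num_sm: int) -> int:
    # Illustrative heuristic: keep doubling the KV split until there are roughly
    # enough threadblocks to occupy every SM.
    blocks = batch * heads
    split = 1
    while blocks * split < num_sm:
        split *= 2
    return split
```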
-
- 04 Apr, 2025 3 commits
-
-
Yu Cheng authored
- Introduced a new local fragment for squared values to improve performance.
- Updated the computation of the RMS normalization to use the new fragment, enhancing memory efficiency.
- Refactored the final multiplication step to operate on the local fragment instead of shared memory.
- Added a configuration option to the kernel compilation for better control over TMA lowering.
These changes enhance the efficiency and clarity of the RMS normalization implementation.
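For reference, the semantics being computed are plain RMS normalization; a hedged PyTorch sketch (not the TileLang kernel itself) is given below, with the squared values kept in a separate temporary to mirror the local fragment mentioned above. Whether the example applies a learned weight or uses this epsilon is an assumption.

```python
import torch

def rms_norm_ref(x: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # The kernel accumulates x*x into a local fragment; here that is just a temporary.
    sq = x * x
    return x * torch.rsqrt(sq.mean(dim=-1, keepdim=True) + eps)
```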
-
Lei Wang authored
- Added a manual seed for reproducibility in PyTorch.
- Refactored local variable allocations for better memory management.
- Enhanced parallel processing in the flashattn function to improve performance.
- Updated layout annotations for clarity and efficiency.
These changes optimize the flash attention mechanism and ensure consistent behavior across runs.
-
Zhengju Tang authored
* [Dynamic Symbolic] Adaptively vectorize with different condition expressions
* Format
* Format
* Format
* Format
* Add MIT License headers to Python files
* Simplify return statement in loop vectorization
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
- 03 Apr, 2025 4 commits
-
-
Lei Wang authored
* [Enhancement] Introduce CUDA driver module and refactor CUDA device handling
  - Added a new `cuda_driver` module to encapsulate CUDA device properties and functionalities.
  - Updated `CUDA` class in `cuda.py` to utilize the new driver for fetching device name and shared memory capabilities.
  - Introduced `get_device_name` and `get_shared_memory_per_block` functions in the `cuda_driver` for improved device property management.
  - This refactor enhances code organization and maintainability while improving the handling of CUDA device attributes.
* [Refactor] Clean up whitespace in CUDA-related files
  - Removed unnecessary blank lines in `cuda.py`, `__init__.py`, and `cuda_driver.py` to improve code readability and maintainability.
  - This change enhances the overall organization of the codebase without altering functionality.
* [Benchmark] Add FP8 Matrix Multiplication Benchmark Script
  - Introduced a new benchmark script for FP8 matrix multiplication in `benchmark/matmul_fp8/benchmark_matmul.py`.
  - The script includes functions for reference matrix multiplication, configuration generation for autotuning, and an autotuned kernel for performance measurement.
  - Added command-line argument parsing for matrix dimensions and the option to enable BitBLAS roller for search space exploration.
  - The benchmark computes and prints the best latency and performance metrics, enhancing the benchmarking capabilities for FP8 operations.
* lint fix
---------
Co-authored-by: LeiWang1999 <wyatuestc@gmail.com>
-
Chunan Zeng authored
-
yyttt6 authored
* refactor autotune
* refactor autotune
* refactor autotune
* refactor autotune
* format init.py
* add tutorial for autotune
* merge
* merge
* format analyzer
* add readme for analyzer
* format
* [Tools] Summarize TFLOPS Information from a tilelang program
* Summarize TFLOPS Information from a tilelang program
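For a GEMM-like program, the TFLOPS summary such an analyzer reports boils down to the arithmetic below; the analyzer's actual interface is not shown in the commit, so this is only the underlying formula.

```python
def gemm_tflops(M: int, N: int, K: int, latency_ms: float) -> float:
    # One multiply and one add per inner-product term gives 2*M*N*K FLOPs.
    flops = 2.0 * M * N * K
    return flops / (latency_ms * 1e-3) / 1e12

print(f"{gemm_tflops(8192, 8192, 8192, 5.0):.2f} TFLOPS")
```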
-
Yu Cheng authored
* [Dev] Add FP8 Quantization Examples and Absolute Maximum Reduction Operation Support
  * Added `example_per_token_cast_to_fp8.py` in examples/cast, providing token-wise FP8 quantization implementation.
  * Added `example_triton_cast_to_fp8.py` in examples/cast, providing Triton-based FP8 quantization implementation.
  * Added support for absolute maximum (absmax) reduction operation in reduce.cc and reduce.h.
  * Implemented `reduce_absmax` function in reduce.py, allowing absolute maximum reduction on input buffers.
  * Updated tilelang.language module to include the new `reduce_absmax` function.
  These changes enhance FP8 quantization capabilities and extend reduction operation support.
* [Enhancement] Update per_token_cast_to_fp8 for improved FP8 quantization
  * Modified the `per_token_cast_to_fp8` function to support variable block sizes and improved memory layout annotations.
  * Adjusted the handling of absolute maximum values and scaling factors for better performance and accuracy.
  * Updated the main execution block to allow for larger matrix dimensions and refined the profiler setup for benchmarking.
  These changes enhance the flexibility and efficiency of the FP8 quantization process.
* lint
* [Dev] Update per_token_cast_fp8.py
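The per-token FP8 cast these examples implement has a compact PyTorch reference, sketched below; 448 is the float8_e4m3 maximum representable magnitude, while the clamp epsilon and the exact tiling used in the TileLang version are assumptions (requires PyTorch with `torch.float8_e4m3fn` support).

```python
import torch

def per_token_cast_to_fp8_ref(x: torch.Tensor):
    # Per-token (row-wise) absmax, mirroring the new reduce_absmax primitive.
    amax = x.abs().amax(dim=-1, keepdim=True).clamp_min(1e-4)  # epsilon is illustrative
    scale = amax / 448.0
    x_fp8 = (x / scale).to(torch.float8_e4m3fn)
    return x_fp8, scale
```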
-
- 31 Mar, 2025 2 commits
-
-
Lei Wang authored
* Remove logging statement from LoopVectorizerDynamic Substitute method for cleaner output.
* Refactor flashattn example to improve CUDA configuration handling
  - Updated the `flashattn` function in `example_gqa_decode.py` to utilize a heuristic configuration based on CUDA device capabilities, enhancing compatibility with different architectures.
  - Replaced local variable allocations with more efficient constructs and removed unnecessary logging statements for cleaner output.
  - Adjusted the `do_bench` method call to streamline performance profiling.
* lint fix
-
Lei Wang authored
* [Enhancement] Add support for CUDA architecture 8.9 in GEMM template
  - Introduced conditional inclusion of "gemm_sm89.h" for CUDA architectures 8.9 and above, enhancing compatibility with newer hardware.
  - This change ensures that the GEMM template can leverage optimizations specific to the 8.9 architecture, improving performance for users with compatible GPUs.
* lintfix
* [Refactor] Clean up includes in gemm_sm89.h
  - Removed duplicate inclusion of "common.h" and added "cuda_fp8.h" for improved clarity and organization.
  - This change enhances the maintainability of the code by ensuring that header files are included only once and in a logical order.
* [Enhancement] Improve KernelCache with in-memory caching and detailed docstrings
  - Added an in-memory cache to the KernelCache class to enhance performance by reducing disk access.
  - Updated the __new__ method to initialize the memory cache and added logic to check the cache before loading from disk.
  - Enhanced docstrings across multiple methods to provide clearer explanations of parameters and return values, improving code readability and maintainability.
  - Implemented a clear_cache method to clear both in-memory and disk caches, ensuring efficient cache management.
* lint fix
* typofix
* [Refactor] Update matmul and flashattn function calls to return structured results
  - Modified the matmul and flashattn function calls to return a single object containing latency, configuration, and reference latency, improving code clarity and reducing the number of returned variables.
  - Updated all relevant instances in benchmark and example scripts to accommodate the new return structure, ensuring consistent usage across the codebase.
* lint fix
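The in-memory plus on-disk lookup that the KernelCache change describes follows a common two-level pattern; the sketch below is illustrative only, with the method names, key scheme, flat on-disk layout, and pickle format all assumed rather than taken from the codebase.

```python
import os
import pickle

class KernelCache:
    _memory_cache = {}  # key -> compiled kernel, shared for the process lifetime

    def get(self, key: str, cache_dir: str):
        if key in self._memory_cache:                 # 1) fast path: in-memory hit
            return self._memory_cache[key]
        path = os.path.join(cache_dir, key + ".pkl")
        if os.path.exists(path):                      # 2) slow path: load from disk
            with open(path, "rb") as f:
                kernel = pickle.load(f)
            self._memory_cache[key] = kernel          # promote to memory for next time
            return kernel
        return None                                   # miss: caller compiles and stores

    def clear_cache(self, cache_dir: str):
        # Clears both levels; a flat file layout in cache_dir is assumed.
        self._memory_cache.clear()
        for name in os.listdir(cache_dir):
            os.remove(os.path.join(cache_dir, name))
```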
-
- 30 Mar, 2025 2 commits
-
-
Leslin authored
* Update elementwise_add.py
  [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError
* Update rms_norm.py
  [Bugfix] Replace profiler.mod with profiler.adapter to fix AttributeError
* Remove adapter argument from do_bench call
* Remove adapter argument from do_bench call
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
-
Haodong Tian authored
* [Bugfix] Configure autotuner-specific logger for correct level handling
  - Previously, logging relied on basicConfig, which configured the root logger. This caused the named autotuner logger to ignore DEBUG messages.
  - This commit sets up a dedicated logger for the autotuner, correctly routing DEBUG messages to 'autotuner.log' and INFO+ messages to the console.
* [Bugfix] Fix tensor_supply for boolean type
  - Previously `get_tensor_supply` used `torch.randint(-2, 3)` as a fallback, which caused an error when the dtype was `torch.bool`.
  - This commit adds an `is_boolean` check in `KernelParam` and updates `get_tensor_supply` to specifically use `torch.randint(0, 2)` for boolean dtypes.
* [Bugfix] Always regenerate JIT inputs during tuning
  - Removes the caching for `self.jit_input_tensors` within `AutoTuner`. When different autotuning configurations can alter the required input tensor shapes or other properties, reusing cached inputs from a previous configuration leads to errors or incorrect assessments.
  - This change ensures that `profiler._get_inputs()` is called unconditionally for each configuration evaluation. Since `_get_inputs` is assumed to be relatively inexpensive, the potential overhead is considered acceptable.
* [Example] Update example_blocksparse_gemm for autotuner
* Run code formatter
* [Feature] Enable custom tensor supply and input caching control in Autotuner
  - Previously, tensor generation was tied to `supply_type` and input caching behavior across configurations was less explicit/controlled.
  - This commit introduces a `supply_prog` parameter to allow providing a custom function for generating input tensors, overriding the default mechanism.
  - Adds a `cache_input_tensors` flag (default True) to control input tensor caching:
    - If True, tensors are generated once per configuration and reused for repetitions, with a check for potential shape mismatches between configurations.
    - If False, tensors are regenerated for every configuration trial.
  - Refactors internal input tensor handling using supplier functions for clarity.
  - Adds a `check_tensor_list_compatibility` utility for shape comparison.
* [Example] Update example_blocksparse_gemm for autotuner
* Run code formatter
* [Example] Small fix in example_blocksparse_gemm
* [Fix] Raise error if autotuning yields no valid configuration
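The dedicated-logger fix in the first bullet corresponds to the standard-library pattern below: configure the named "autotuner" logger directly instead of relying on basicConfig, so DEBUG records reach autotuner.log while the console only sees INFO and above. The format string and `propagate = False` are assumptions, not lifted from the actual fix.

```python
import logging

logger = logging.getLogger("autotuner")
logger.setLevel(logging.DEBUG)
logger.propagate = False  # keep records out of the root logger (assumption)

file_handler = logging.FileHandler("autotuner.log")
file_handler.setLevel(logging.DEBUG)       # everything, including DEBUG, goes to the file

console_handler = logging.StreamHandler()
console_handler.setLevel(logging.INFO)     # console only sees INFO and above

for handler in (file_handler, console_handler):
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)

logger.debug("written only to autotuner.log")
logger.info("written to both the console and autotuner.log")
```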
-