- 16 Jul, 2025 1 commit
Lei Wang authored
* [Enhancement] Improve memory access condition checks in GlobalMemChecker
  - Updated the condition checks in the GlobalMemChecker to use symbolic bounds in the CanProve method, improving the accuracy of memory access validation.
  - Both upper- and lower-bound conditions are now evaluated with stronger proof strength, contributing to more robust memory access analysis.
* lintfix
* [Enhancement] Add legality checks for shared memory and global range in LowerBulkCopy
  - Implemented checks to ensure that the shared memory range and the global range are legal during the bulk copy operation.
  - Added assertions to validate that the extents of the global and shared ranges match, improving the robustness of memory access validation in the LowerBulkCopy function.
* [Refactor] Update barrier and clear operations in warp specialization examples
  - Replaced `mbarrier_wait_parity` and `mbarrier_arrive` with `barrier_wait` and `barrier_arrive` for clearer, more consistent synchronization.
  - Adjusted the order of `clear` operations for local fragments in `example_warp_specialize_gemm_copy_1_gemm_0` to improve parallel execution efficiency.
* [Enhancement] Implement thread partial synchronization and improve shared memory allocation handling
  - Added support for thread partial barrier synchronization in CUDA, allowing more flexible thread management.
  - Extended the `MergeSharedMemoryAllocations` function to accept alignment bytes, improving memory allocation efficiency based on target requirements.
  - Updated the `Lower` methods in the `Copy` and `Fill` classes to include conditional predicates for thread execution, giving better control over thread behavior.
  - Refactored the `print` function to include warp group and warp IDs for more detailed debugging output.
  - Improved the handling of dynamic shared memory allocations in the `LowerAndLegalize` function to align with target-specific requirements.
* [Enhancement] Add support for disabling TMA in Copy operations (sketched after this entry)
  - Introduced a new `disable_tma` parameter in the `Copy` class to control thread memory access behavior.
  - Updated the `Lower` method to execute bulk copy operations conditionally, based on the `disable_tma` flag.
  - Extended the `copy` function to accept the `disable_tma` argument, allowing more flexible memory copy operations.
  - `coalesced_width` now defaults to -1 when not provided, improving robustness in memory operations.
* [Refactor] Clean up whitespace and formatting in multiple files
  - Removed unnecessary blank lines and adjusted line breaks for readability in `example_mla_decode.py`, `example_warp_specialize_gemm_copy_gemm_0_1.py`, `phase.py`, and `copy.py`.
  - Ensured consistent formatting across functions for maintainability and clarity.
* [Enhancement] Refactor flash attention implementation for improved performance and configurability
  - Split the shared memory allocations for query and key-value pairs to optimize memory usage.
  - Introduced command-line arguments for batch size, number of heads, and dimensions, making the example more flexible to run.
  - Updated kernel execution parameters to improve thread management and synchronization.
  - Improved the overall structure of the flash attention function for readability and maintainability.
* fix
* Update layout inference in ParallelOp to account for thread bounds; remove debug print in OptimizeForTarget
* Refactor barrier handling and update example configurations
  - Replaced commented-out barrier creation with new barrier allocation in the GEMM example.
  - Updated the kernel configuration in the warp specialization example to include async copy settings.
  - Enhanced barrier management in the phase optimization process to improve synchronization handling.
  - Introduced a new barrier allocation function for better memory management in shared contexts.
* Refactor barrier handling in LowerAndLegalize and OptimizeForTarget
  - Reintroduced barrier lowering in OptimizeForTarget to enhance synchronization.
  - Removed commented-out barrier lowering in LowerAndLegalize for cleaner code.
  - Added an exit() call in OptimizeForTarget to halt execution after barrier lowering.
* Enhance CMake configuration and clean up example scripts
  - Enabled compile-command export in CMakeLists.txt for better build integration.
  - Removed an unnecessary print statement in the warp specialization example.
  - Cleaned up commented-out code in the GEMM example for readability.
  - Updated barrier handling in the shared memory allocation transformations for better synchronization.
* Refactor barrier handling in warp specialization examples (see the sketch after this entry)
  - Replaced commented-out mbarrier code with new barrier allocation using T.alloc_barrier for improved synchronization.
  - Updated barrier wait and arrive calls to align with the new allocation method across multiple example scripts.
  - Removed unnecessary comments and ensured consistent barrier management.
* Update lower_shared_barrier.cc
* Update phase.py
* Update warp specialization example and Cython wrapper
  - Removed commented-out pass configuration options in the warp specialization example for clarity.
  - Added functionality to write the generated kernel source to a file named "kernel.cu".
  - Enhanced the Cython wrapper to support boolean type conversion for improved type handling.
* Add storage synchronization call in shared barrier transformation
  - Introduced a new evaluation statement that calls the TVM storage sync function with "shared" as its argument, enhancing synchronization in the shared barrier handling process.
* remove debug files
* Remove kernel source output to file in warp specialization example
* remove comments
* Refactor tensor handling and update test execution in TileLang
  - Changed `Buffer` to `Tensor` in `customize.py` for better type consistency.
  - Updated the `mbarrier_wait_parity` and `mbarrier_arrive` functions in `builtin.py` to use `tir.BufferLoad` instead of `BufferLoad`.
  - Commented out the main testing function in `test_tilelang_language_reshape.py` and replaced it with a direct call to `run_reshape_smem` for streamlined testing.
  - Removed unnecessary NVCC compiler flags in `libgen.py` to reduce verbosity.
* Update test_tilelang_language_reshape.py
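For orientation, here is a minimal sketch of how the two API changes above combine in a warp-specialization example: `T.alloc_barrier` replaces hand-rolled mbarrier setup, `barrier_arrive`/`barrier_wait` replace `mbarrier_arrive`/`mbarrier_wait_parity`, and `disable_tma=True` makes `T.copy` skip the TMA bulk-copy path. The buffer shapes, the `arrive_count` keyword, and the parity argument are illustrative assumptions, not the repository's actual code.

```python
import tilelang.language as T

# Hypothetical fragment combining the renamed barrier API with the new
# disable_tma flag; shapes, arrive_count, and the parity argument are guesses.
@T.prim_func
def stage_tile(A: T.Tensor((1024, 1024), "float16")):
    with T.Kernel(8, threads=256) as bx:
        A_shared = T.alloc_shared((128, 128), "float16")
        bar = T.alloc_barrier(arrive_count=256)      # replaces manual mbarrier setup
        # Skip the TMA bulk-copy lowering and fall back to a plain copy.
        T.copy(A[bx * 128, 0], A_shared, disable_tma=True)
        T.barrier_arrive(bar)                        # was: mbarrier_arrive
        T.barrier_wait(bar, 0)                       # was: mbarrier_wait_parity
```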
- 25 Jun, 2025 1 commit
Cunxiao Ni authored
* [Example] Update kernel compilation in examples to use @tilelang.jit
  - Refactored multiple examples to eliminate the use of `tilelang.compile` for kernel creation, invoking the decorated functions directly instead.
  - Added `@tilelang.jit` decorators with appropriate output indices to improve performance and maintainability.
  - Simplified the kernel invocation process across the examples, ensuring consistency in how kernels are defined and executed.
* format
* Update example_tilelang_sparse_gqa_decode_varlen_indice.py
* Update example_dequant_gemm_fine_grained.py
* Update example_gemm_autotune.py
---------
Co-authored-by: Lei Wang <34334180+LeiWang1999@users.noreply.github.com>
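As a concrete illustration of the migration described above, a GEMM example changes roughly as follows. This is a condensed sketch: `out_idx=[2]` (marking `C` as the allocated output) and the tile sizes are assumptions, not taken from any specific example in the PR.

```python
import tilelang
import tilelang.language as T


# Before: kernel = tilelang.compile(matmul(1024, 1024, 1024), out_idx=[2])
# After: decorating the factory compiles lazily; calling it yields the kernel.
@tilelang.jit(out_idx=[2])  # buffer 2 (C) is allocated and returned as output
def matmul(M, N, K, block_M=128, block_N=128, block_K=32, dtype="float16"):

    @T.prim_func
    def main(A: T.Tensor((M, K), dtype), B: T.Tensor((K, N), dtype),
             C: T.Tensor((M, N), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M),
                      threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            C_local = T.alloc_fragment((block_M, block_N), "float32")
            T.clear(C_local)
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            T.copy(C_local, C[by * block_M, bx * block_N])

    return main


kernel = matmul(1024, 1024, 1024)  # compiled, directly callable
```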
- 16 May, 2025 1 commit
Yu Cheng authored
* [Refactor] Update example_mla_decode.py and add tests for block_sparse_attn_tilelang
  - Refactor example_mla_decode.py to define a main function for better structure and clarity.
  - Introduce test_example_mla_decode.py to validate the functionality of example_mla_decode.
  - Refactor block_sparse_attn_tilelang.py to define a main function and add test_block_sparse_attn_tilelang.py for testing.
  - Ensure all new test files are integrated with the tilelang testing framework.
* [Test] Enhance test_example_mla_decode with argument mocking (see the sketch after this entry)
  - Update test_example_mla_decode.py to mock sys.argv for better test isolation.
  - Ensure the main function of example_mla_decode is called with the correct arguments during testing.
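The sys.argv mocking described above typically looks like the following; the exact argument list handed to the example is an assumption.

```python
import sys
from unittest import mock

import example_mla_decode  # the example module under test


def test_example_mla_decode():
    # Patch sys.argv so the example's argparse parses a known command line
    # instead of pytest's own arguments, isolating the test from the runner.
    with mock.patch.object(sys, "argv", ["example_mla_decode.py"]):
        example_mla_decode.main()
```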
- 26 Mar, 2025 1 commit
Lei Wang authored
* [Refactor] Improve flash attention example and layout comparison logic
  - Removed an unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
  - Updated the handling of `lse_local_split` to use parallel processing for better performance.
  - Refactored kernel compilation and profiling logic for clarity and maintainability in the flash attention example.
  - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
* lint fix
* [Enhancement] Add support for shared memory scope in Fill operation
  - Introduced handling for the `shared.dyn` and `shared` memory scopes in the Fill operation.
  - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
  - Updated thread loop partitioning and vectorization logic to accommodate the new memory scope handling.
* [Refactor] Remove deprecated decorator and enhance Cython kernel handling
  - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
  - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
  - Updated the Cython kernel wrapper to use the new pointer map for handling kernel arguments.
  - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
  - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
* [Feature] Add matrix multiplication test and kernel implementation
  - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
  - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
  - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
  - Updated `proxy.py` to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
  - Minor formatting improvements in `deprecated.py` for readability.
* lint fix
* [Refactor] Update tensor creation in matrix multiplication test (sketched after this entry)
  - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for clarity and consistency.
  - Updated imports in `__init__.py` to include `make_tensor`.
  - Added a `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
* [Refactor] Update tensor definitions across multiple files
  - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files for consistency and clarity.
  - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
  - Updated documentation in README and example files to reflect the changes in tensor usage.
* lint fix
* [Refactor] Update tensor types in attention and matrix multiplication examples
  - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions for consistency and clarity.
  - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
  - Standardized tensor usage across multiple files to improve overall structure and readability.
* lint fix
* [Refactor] Update tensor types in GEMM example and test files
  - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions for consistency and clarity.
  - Standardized tensor usage across multiple files, in line with the recent updates to tensor definitions.
* [Refactor] Update tensor usage in customize.py
  - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions for consistency with the recent tensor definitions.
  - Standardized buffer usage across the file for clarity.
* [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
  - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes, for consistency with the recent tensor definitions.
  - Standardized buffer usage across the test file for clarity.
* [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
  - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files for consistency with the recent tensor definitions.
  - Standardized buffer usage in the attention and matrix multiplication functions for clarity and structure.
* [Refactor] Introduce Tensor alias for Buffer in proxy.py
  - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped to `torch.Tensor`.
  - This improves clarity and consistency in tensor usage across the codebase.
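A minimal sketch of the `T.make_tensor` usage introduced above, rebuilding a typed tensor view from a raw pointer inside a kernel. The `(ptr, shape, dtype)` signature is inferred from the `from_ptr` replacement described in the entry and should be treated as an assumption.

```python
import tilelang.language as T

# Hypothetical kernel that receives a raw pointer plus runtime shape and
# reconstructs a typed tensor view with T.make_tensor before using it.
@T.prim_func
def fill_ones(a_ptr: T.handle, m: T.int32, n: T.int32):
    A = T.make_tensor(a_ptr, (m, n), "float32")
    with T.Kernel(T.ceildiv(m, 128), threads=128) as bx:
        for i, j in T.Parallel(128, n):
            if bx * 128 + i < m:  # guard the tail block
                A[bx * 128 + i, j] = 1.0
```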
- 16 Mar, 2025 1 commit
Yu Cheng authored
- Replaced instances of `tilelang.lower` and `tilelang.Profiler` with `tilelang.compile` and the new profiler interface in multiple example files.
- Updated the kernel compilation process to use the new API, improving consistency and maintainability.
- Adjusted the benchmarking logic to use the new profiler methods for clearer, more functional performance testing.
- Cleaned up whitespace and improved formatting for readability across the modified files.
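In the new API the flow looks roughly like this. A sketch: the `get_profiler`/`do_bench` method names follow the current TileLang interface but should be checked against the version in use, and the tiny kernel exists only to have something to compile.

```python
import tilelang
import tilelang.language as T


def add_one(M=1024, dtype="float32"):
    # Trivial tile-level program, just to have something to compile.
    @T.prim_func
    def main(A: T.Tensor((M,), dtype), B: T.Tensor((M,), dtype)):
        with T.Kernel(T.ceildiv(M, 128), threads=128) as bx:
            for i in T.Parallel(128):
                B[bx * 128 + i] = A[bx * 128 + i] + 1.0

    return main


# Old: mod, params = tilelang.lower(func); profiler = tilelang.Profiler(...)
kernel = tilelang.compile(add_one(), out_idx=[1])  # index 1 (B) is the output
profiler = kernel.get_profiler()
print(f"latency: {profiler.do_bench():.3f} ms")
```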
- 07 Mar, 2025 1 commit
You Jiacheng authored
This approach is slightly faster than T.copy followed by an RS-GEMM, and simpler.
- 06 Mar, 2025 1 commit
Yu Cheng authored
* [Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16
  - Remove the redundant `acc_s_0` fragment in the flash attention kernel
  - Simplify memory copy and reduction operations
  - Reorder memory copy and scaling steps for improved performance
  - Add a Hopper-specific synchronization method in the CUDA reduce template
  - Update the reduce operation to use architecture-specific synchronization
* [Dev] Add DeepSeek MLA Decoding (Paged+Varlen) kernel and Performance Benchmark Script
  - Implement a comprehensive MLA (Multi-Head Latent Attention) decoding benchmark script
  - Add support for multiple implementations: Torch, TileLang, FlashMLA, FlashInfer, and Triton
  - Create a flexible configuration for benchmarking different batch sizes, sequence lengths, and head configurations
  - Implement performance comparison and CSV output for detailed performance analysis
  - Add command-line argument support for targeted benchmarking and comparison
* [Dev] Refactor MLA Paged Decoding Kernel with Improved Block Handling and Precision (the split-KV combine step is sketched after this entry)
  - Replace the `d` parameter with `dv` to clarify the value dimension in MLA decoding
  - Enhance the block distribution logic for split-KV processing
  - Improve the handling of remaining blocks in the split-KV computation
  - Initialize `lse_max_local` to prevent potential precision issues
  - Optimize block start and range calculations for more accurate sequence processing
* lint
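For context, the split-KV combine that the `lse_max_local` fix touches merges per-split partial outputs via their log-sum-exp statistics. A plain PyTorch restatement of that step, assuming each split's partial output is already normalized by its own softmax denominator (the usual flash-decoding convention); this illustrates the math, not the kernel's actual code.

```python
import torch


def combine_split_kv(o_parts: torch.Tensor, lse_parts: torch.Tensor) -> torch.Tensor:
    """Merge split-KV partial attention outputs.

    o_parts:   [num_splits, heads, dv] per-split outputs, each softmax-normalized
    lse_parts: [num_splits, heads]     per-split log-sum-exp of the scores
    Seeding the running max from the data (cf. lse_max_local) instead of 0
    keeps exp() well-scaled even when every score is strongly negative.
    """
    lse_max = lse_parts.max(dim=0).values            # [heads]
    scale = torch.exp(lse_parts - lse_max)           # [num_splits, heads]
    denom = scale.sum(dim=0)                         # [heads]
    return (scale.unsqueeze(-1) * o_parts).sum(dim=0) / denom.unsqueeze(-1)
```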
- 05 Mar, 2025 1 commit
Yu Cheng authored
[Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16 (#141)
- Remove the redundant `acc_s_0` fragment in the flash attention kernel
- Simplify memory copy and reduction operations
- Reorder memory copy and scaling steps for improved performance
- Add a Hopper-specific synchronization method in the CUDA reduce template
- Update the reduce operation to use architecture-specific synchronization
- 03 Mar, 2025 2 commits
Yu Cheng authored
Yu Cheng authored
* [Dev] Add RetNet Linear Attention example
* [Dev] Add WgmmaSync rewriter for pipelined WGMMA operations and add MHA WGMMA pipelined example (FA3-like scheduling)
  This commit introduces a new transformation pass `RewriteWgmmaSync` to optimize warp group matrix multiply accumulate (WGMMA) operations in the TileLang compiler:
  - Implemented `WgmmaSyncRewriter` in `src/transform/wgmma_sync_rewriter.cc`
  - Added pass registration for `RewriteWgmmaSync`
  - Updated `tilelang/engine/phase.py` to include the new transformation pass
  - Updated `tilelang/transform/__init__.py` to expose the new pass
  The rewriter manages synchronization and dependencies between WGMMA operations, improving pipeline efficiency for complex matrix multiplication kernels.
* [Bugfix] Fix bug in ThreadTagChecker for warp specialization
  Improve thread tag validation in the warp specialized rewriter to prevent unintended transformations:
  - Add more precise checks for threadIdx.y and threadIdx.z
  - Validate thread extent so that only single-extent thread bindings are allowed
  - Prevent warp specialization for multi-extent thread bindings in the y and z dimensions
* lint
* [CI] Add TMA descriptor attribute to transformed module in test case
* [Dev] Refactor DeepSeek MLA Decode Example with Non-Split and Split Flash Attention Implementations
  - Add a new `flash_attn` macro for the non-split flash attention implementation
  - Add a swizzled layout for the tile in shared memory
  - Use threadblock swizzle to improve the L2 cache hit rate (sketched after this entry)
* [Dev] Add DeepSeek MLA Decode Example with Documentation and Performance Benchmarks
  - Add a detailed README.md explaining the MLA (Multi-Head Latent Attention) implementation
  - Include performance benchmark images for batch sizes 64 and 128
  - Add layout visualization images for the QK and PV operations
  - Implement torch reference implementations in torch_refs.py
  - Update example_mla_decode.py with command-line argument support and flexible configuration
  - Add performance benchmarking and comparison with other implementations
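The threadblock swizzle mentioned above reorders the block rasterization so neighboring blocks reuse tiles that are still resident in L2; TileLang kernels expose this as `T.use_swizzle(panel_size=...)` inside `T.Kernel` (the keyword name is an assumption from the public examples). The plain-Python function below re-derives one such panel-major mapping as an illustration, not the kernel's actual code.

```python
def swizzle_block(bid: int, grid_m: int, panel: int) -> tuple[int, int]:
    """Map a linear block id to (bm, bn) in panel-major order.

    Blocks sweep all grid_m rows of a `panel`-column panel before moving on
    to the next panel, so blocks launched back-to-back touch only a small
    window of output columns and keep those tiles hot in L2.
    """
    p, r = divmod(bid, panel * grid_m)  # which panel, and offset within it
    return r // panel, p * panel + r % panel


# Example: the first 8 blocks of a 4-row grid, walked in panels of 2 columns.
print([swizzle_block(b, grid_m=4, panel=2) for b in range(8)])
```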
- 26 Feb, 2025 1 commit
Lei Wang authored
* Add DeepSeek MLA decode example with Flash Attention implementation
* Add GEMM SplitK and StreamK example implementations
  This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
  - `example_tilelang_gemm_splitk.py`: implements a Split-K GEMM kernel using TileLang
  - `example_tilelang_gemm_streamk.py`: implements a Stream-K GEMM kernel using TileLang
  Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing against PyTorch reference implementations.
* Refactor GEMM SplitK and StreamK example implementations
  Clean up and improve code formatting in the SplitK and StreamK GEMM example scripts:
  - Remove an unused import (Profiler) in the SplitK example
  - Simplify line breaks and improve code readability
  - Standardize indentation and remove unnecessary whitespace
  - Clarify the atomic add and copy operations
* Add block sparse attention benchmarks for multiple libraries
  This commit introduces comprehensive block sparse attention benchmarks for different libraries:
  - TileLang block sparse FMHA implementation
  - Triton block sparse FMHA implementation
  - PyTorch reference block sparse FMHA implementation
  - FlashAttention dense FMHA reference implementation
  The benchmarks include:
  - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
  - Sparse mask generation using top-k and threshold methods
  - Performance measurement for different sparse attention configurations
  - Utility functions for mask generation and benchmarking
* Refactor block sparse attention benchmarks with code style improvements
  - Add Ruff linter ignore comments to the benchmark files
  - Improve code formatting and line breaks
  - Remove unused imports
  - Standardize print statement formatting
  - Improve code readability across the library benchmarks
* lint fix
* Add CUDA atomic operations for BFLOAT16 and update function naming
  - Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in the CUDA common header
  - Rename the existing atomic add functions to PascalCase (atomicAdd -> AtomicAdd)
  - Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
  - Update kernel and language customization to use the new function names
  - Add return type annotations in the profiler module
* lint fix
* Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang
  This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
  - a Group Query Attention (GQA) implementation
  - the Flash Attention forward pass
  - performance benchmarking
  - configurable parameters for batch, heads, sequence length, and dimension
  - autotuning support
  - comparison against a reference implementation
* Refactor IR lowering pipeline into modular phases (sketched after this entry)
  This commit introduces a new module `phase.py` that modularizes the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
  - `LowerAndLegalize`: handles initial IR legalization and transformation
  - `OptimizeForTarget`: applies target-specific optimizations
  The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.
* lintfix
* nas kernel
* Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates
  - Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
  - Increased the default number of heads and selected blocks in the TileLang NSA example
  - Added Ruff linter ignore comments to reference.py
  - Standardized function signatures and improved code readability across the NSA implementations
* Add utility math functions for integer operations (restated below)
  - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
  - Add a `cdiv()` function for ceiling division of integers
* Add utility math functions for integer operations
  - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
  - Add a `cdiv()` function for ceiling division of integers
* Refactor DeepSeek MLA Decode Example with Enhanced Flash Attention Implementation
  - Update the flash attention kernel to support positional embeddings (PE)
  - Modify the reference implementation to handle PE and group query attention
  - Increase the default batch size and adjust the benchmarking parameters
  - Improve kernel performance and readability
  - Add einops and torch operations for more flexible tensor manipulation
* Update README.md with corrected Flash MLA Decoding example path
  - Point the Flash MLA Decoding example link at the correct directory
  - Ensure accurate navigation to the DeepSeek MLA decoding example
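The two helpers named above are small enough to restate; these are the standard formulations and may differ cosmetically from the repository's versions.

```python
def next_power_of_2(n: int) -> int:
    # Smallest power of two >= n (returns 1 for n <= 1).
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def cdiv(a: int, b: int) -> int:
    # Ceiling division without floating point: cdiv(10, 3) == 4.
    return -(a // -b)
```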
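And the two-phase pipeline from the `phase.py` refactor flagged above reduces the lowering flow to roughly the following; the pass names come from the commit message, while the wrapper's exact signature is an assumption.

```python
from tilelang.engine.phase import LowerAndLegalize, OptimizeForTarget


def lower(mod, target):
    # Phase 1: legalize and normalize the tile-level IR.
    mod = LowerAndLegalize(mod, target)
    # Phase 2: apply target-specific optimizations (pipelining, barriers, ...).
    return OptimizeForTarget(mod, target)
```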
- 23 Feb, 2025 2 commits
Lei Wang authored
* Add DeepSeek MLA decode example with Flash Attention implementation
* Add GEMM SplitK and StreamK example implementations
  This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
  - `example_tilelang_gemm_splitk.py`: implements a Split-K GEMM kernel using TileLang
  - `example_tilelang_gemm_streamk.py`: implements a Stream-K GEMM kernel using TileLang
  Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing against PyTorch reference implementations.
* Refactor GEMM SplitK and StreamK example implementations
  Clean up and improve code formatting in the SplitK and StreamK GEMM example scripts:
  - Remove an unused import (Profiler) in the SplitK example
  - Simplify line breaks and improve code readability
  - Standardize indentation and remove unnecessary whitespace
  - Clarify the atomic add and copy operations
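The Split-K pattern behind `example_tilelang_gemm_splitk.py` can be summarized as: each block reduces one K-slice into a fragment, then accumulates into global `C` with an atomic add, since blocks from different splits race on the same output tile. A condensed sketch modeled on the public examples; treat the exact signatures, and in particular the tile-level `T.atomic_add`, as assumptions.

```python
import tilelang.language as T


def matmul_splitk(M, N, K, split_k=4, block_M=128, block_N=128, block_K=32):
    k_per_split = (K + split_k - 1) // split_k

    @T.prim_func
    def main(A: T.Tensor((M, K), "float16"), B: T.Tensor((K, N), "float16"),
             C: T.Tensor((M, N), "float32")):  # C must be zero-initialized
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), split_k,
                      threads=128) as (bx, by, bz):
            A_shared = T.alloc_shared((block_M, block_K), "float16")
            B_shared = T.alloc_shared((block_K, block_N), "float16")
            C_local = T.alloc_fragment((block_M, block_N), "float32")
            T.clear(C_local)
            for k in T.Pipelined(T.ceildiv(k_per_split, block_K), num_stages=2):
                T.copy(A[by * block_M, bz * k_per_split + k * block_K], A_shared)
                T.copy(B[bz * k_per_split + k * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, C_local)
            # Splits along bz race on the same C tile, hence the atomic add.
            T.atomic_add(C[by * block_M, bx * block_N], C_local)

    return main
```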
Yu Cheng authored
* [CI][Test] Add test cases for the tilelang transforms MultiVersionBuffer and WarpSpecialized
* Relax the mismatch ratio restrictions in the flash_linear_attention and mha tests
* [Dev] Add mha backward example
* [Dev] Add mla decode example
* bug fix
* Add triton impl
* Add gqa decode example
* [Dev] Add GQA decode example
* lint
* delete unused triton example
* set default profiler to 'auto'