1. 26 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Deprecated `T.Buffer` as arguments and rename related calls into `T.Tensor` (#281) · bf8a6fc1
      Lei Wang authored
      * [Refactor] Improve flash attention example and layout comparison logic
      
      - Removed unnecessary annotation for `lse_local_split` in the flash attention example to streamline the code.
      - Updated the handling of `lse_local_split` to utilize parallel processing for better performance.
      - Refactored kernel compilation and profiling logic to enhance clarity and maintainability in the flash attention example.
      - Added a condition in `FragmentNode::IsEqual` to handle broadcast cases, improving the robustness of layout comparisons.
      
      * lint fix
      
      * [Enhancement] Add support for shared memory scope in Fill operation
      
      - Introduced handling for `shared.dyn` and `shared` memory scopes in the Fill operation.
      - Implemented parallel operation and layout inference for improved performance in shared memory scenarios.
      - Updated thread loop partitioning and vectorization logic to accommodate new memory scope handling.
      
      * [Refactor] Remove deprecated decorator and enhance Cython kernel handling
      
      - Removed the deprecated decorator from the main module and added a new implementation in the utils module for better organization.
      - Introduced a pointer map in the Cython kernel adapter to manage pointer arguments, improving runtime shape resolution.
      - Updated the Cython kernel wrapper to utilize the new pointer map for handling kernel arguments.
      - Enhanced error checking in the tensor utility functions to ensure static shapes are enforced.
      - Added a new proxy module for buffer and tensor handling, streamlining the interface for TIR programs.
      
      * [Feature] Add matrix multiplication test and kernel implementation
      
      - Introduced a new test file `test_tilelang_language_ptr.py` that implements a matrix multiplication function using TileLang's primitives.
      - The `matmul_test` function defines a kernel for performing tile-level GEMM operations with customizable block sizes and data types.
      - Added a `run_matmul` function to compile and execute the kernel, along with a test function to validate the implementation.
      - Updated the `proxy.py` file to enhance type handling for buffer and tensor proxies, ensuring compatibility with TIR programs.
      - Minor formatting improvements in `deprecated.py` for better readability.
      
      * lint fix
      
      * [Refactor] Update tensor creation in matrix multiplication test
      
      - Replaced `T.Tensor.from_ptr` with `T.make_tensor` in `matmul_test` for improved clarity and consistency.
      - Updated imports in `__init__.py` to include `make_tensor`.
      - Added `make_tensor` function in `proxy.py` to streamline tensor creation from pointers.
      
      * [Refactor] Update tensor definitions across multiple files
      
      - Replaced instances of `T.Tensor` with updated tensor definitions in various benchmark and example files to enhance consistency and clarity.
      - Adjusted tensor shapes and types in functions related to matrix multiplication, attention mechanisms, and other operations.
      - Improved documentation in README and example files to reflect changes in tensor usage.
      
      * lint fix
      
      * [Refactor] Update tensor types in attention and matrix multiplication examples
      
      - Replaced instances of `T.Tensor` with `T.SharedTensor` and `T.FragmentTensor` in various attention and matrix multiplication functions to improve consistency and clarity.
      - Adjusted tensor definitions in benchmark and example files to align with the new tensor types.
      - Enhanced the overall structure and readability of the code by standardizing tensor usage across multiple files.
      
      * lint fix
      
      * [Refactor] Update tensor types in GEMM example and test files
      
      - Replaced instances of `T.Tensor` with `T.LocalTensor` and `T.Buffer` in the GEMM example and related test functions to improve consistency and clarity.
      - Enhanced the overall structure of the code by standardizing tensor usage across multiple files, aligning with recent updates in tensor definitions.
      
      * [Refactor] Update tensor usage in customize.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `reshape` and `view` functions to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the file.
      
      * [Refactor] Update tensor types in test_tilelang_transform_annotate_device_regions.py
      
      - Replaced instances of `T.Tensor` with `T.Buffer` in the `before` and `expected` methods of the `TestAnnotateThreadExtent` and `TestAnnotateDeviceScope` classes to enhance consistency with recent tensor definitions.
      - Improved code clarity by standardizing buffer usage across the test file.
      
      * [Refactor] Update tensor types to SharedBuffer and FragmentBuffer
      
      - Replaced instances of `T.SharedTensor` and `T.FragmentTensor` with `T.SharedBuffer` and `T.FragmentBuffer` across multiple benchmark, example, and test files to enhance consistency with recent tensor definitions.
      - Improved code clarity and structure by standardizing buffer usage in attention and matrix multiplication functions.
      
      * [Refactor] Introduce Tensor alias for Buffer in proxy.py
      
      - Added a new alias `Tensor` for `Buffer` in `proxy.py` to facilitate JIT compilation, ensuring that inputs and outputs are mapped with `torch.Tensor`.
      - This change enhances clarity and consistency in tensor usage across the codebase.
      bf8a6fc1
  2. 22 Mar, 2025 2 commits
    • Chaofan Lin's avatar
      [Bugfix] Fix Benchmark/Example Code for Autotuning (#254) · 0430cfe7
      Chaofan Lin authored
      
      
      * fix tune args
      
      * lint
      
      * Refactor gemm example and autotuner logging
      
      - Updated `ref_program` in `example_gemm.py` to return the result of matrix multiplication instead of modifying an input parameter.
      - Changed logging filename in `__init__.py` from 'out.log' to 'autotuner.log' for better clarity.
      - Modified JIT kernel compilation process to include `out_idx` directly in the adapter creation, enhancing flexibility.
      - Improved validation of `result_idx` in `BaseKernelAdapter` to ensure it falls within valid bounds.
      
      * Refactor `ref_program` in `benchmark_matmul_intrinsic.py` to use the `@` operator for matrix multiplication instead of `torch.matmul`, simplifying the implementation by removing the unused parameter `C`.
      
      ---------
      Co-authored-by: default avatarLeiWang1999 <leiwang1999@outlook.com>
      0430cfe7
    • Lei Wang's avatar
      [Example] Implement Kernel Example cumsum (#258) · cd9ec62e
      Lei Wang authored
      * Add GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Introduced a new example script `example_tilelang_cumsum.py` that generates a GPU kernel for 2D continuous cumulative sum.
      - Implemented functions to handle kernel configuration, memory allocation, and inclusive scan operations.
      - Added a main execution block to demonstrate the kernel's functionality using PyTorch for tensor operations.
      - Enhanced the example with error handling for power-of-two configurations and validation of results against PyTorch's built-in cumulative sum function.
      
      * Refactor TileLang examples and enhance kernel compilation
      
      - Updated `example_tilelang_cumsum.py` to improve GPU kernel generation for 2D continuous cumulative sum, including better parameter handling and error checking.
      - Refactored `example_mha_bwd.py` to enhance kernel compilation readability and maintainability.
      - Modified `kernel_cache.py` to prevent saving kernels to disk when using the DLPack backend, ensuring proper cache management.
      - Added `get_block_bindings` function to `kernel.py` for improved access to block bindings in kernel launch frames.
      - Cleaned up import statements in `__init__.py` for better organization and clarity.
      
      * Enhance GPU kernel for 2D continuous cumulative sum in TileLang example
      
      - Added additional spacing for improved readability in `example_tilelang_cumsum.py`.
      - Refined kernel structure to enhance clarity and maintainability during GPU kernel generation for cumulative sum operations.
      cd9ec62e
  3. 20 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Refactor] Phaseout LLVM Dependency by Making it Optional (#247) · f2e99180
      Lei Wang authored
      * remove llvm build
      
      * [Refactor] Update kernel compilation and profiling in examples
      
      - Replaced `tilelang.lower` with `tilelang.compile` in multiple example scripts to streamline kernel compilation.
      - Updated profiling calls to utilize the new `get_profiler` method, enhancing performance measurement consistency.
      - Adjusted assertions and benchmarking methods to align with the new profiling structure across various examples, ensuring correctness and clarity in performance evaluations.
      
      * lint fix
      
      * License Update
      
      * [Refactor] Improve code formatting and documentation in CUDA header and HIP runtime files
      
      - Adjusted formatting in `cuda.h` for better readability, including alignment of comments and struct fields.
      - Cleaned up whitespace and improved comment clarity in `rt_mod_hip.cc` to enhance code maintainability.
      
      * [Refactor] Enhance formatting and clarity in CUDA header and HIP runtime files
      
      - Improved comment alignment and readability in `cuda.h`.
      - Cleaned up whitespace and formatting in `rt_mod_hip.cc` to enhance maintainability.
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * lint fix
      
      * fix
      
      * License update
      
      * [Enhancement] Update JITKernel to use artifact for kernel source
      
      - Assigned the generated artifact to `self.artifact` for better management.
      - Updated kernel source references to use `artifact.kernel_source` for consistency in execution backend handling.
      
      * lint fix
      
      * Add @tilelang.testing.requires_llvm decorator to vectorization tests
      
      * Enhance setup.py and env.py for library management
      
      - Added functionality to remove original files after copying in CMakeBuild.
      - Updated TVM_LIBRARY_PATH in env.py to include the PyPI build library path for better integration.
      
      * Refactor TVM_LIBRARY_PATH assignment for improved readability in env.py
      
      * Refactor CMakeBuild file handling in setup.py
      
      - Added a check to ensure the target library directory exists before copying .so files.
      - Improved the logic for creating the target directory and copying files to enhance robustness.
      
      * bugfix
      
      * Rename BuildTLDebug to BuildTileLangCUDAWithoutCompile and update registration. Add @tilelang.testing.requires_llvm decorator to multiple tests for LLVM requirement.
      
      * lint fix
      
      * Enhance TileLang code generation by adding support for device code generation without compilation. Updated `host_codegen` and `device_codegen` functions to include new transformations and registration for `tilelang_hip_without_compile`. Refactored JIT kernel adapters to accommodate host and device modules, improving overall integration and flexibility.
      
      * lint fix
      
      * Add support for C target in device code generation
      
      - Updated `device_codegen_without_compile` to include handling for the C target by registering the `tilelang_cpp` function.
      
      * [Enhancement] Implement auto-clear cache feature based on environment variable
      
      * Added TILELANG_CLEAR_CACHE environment variable to control cache clearing.
      * Updated CI workflow to set TILELANG_CLEAR_CACHE during testing.
      * Modified cache initialization to clear cache if TILELANG_CLEAR_CACHE is set to true.
      
      * [Refactor] Update kernel invocation and import paths in tests and cache
      
      * Changed kernel invocation in `test_tilelang_kernel_dequantize_gemm.py` to return the result.
      * Updated import statements in `test_tilelang_kernel_int4_gemm_mma.py` to use `bitblas` instead of `tilelang`.
      * Refactored paths for artifact and parameters in `kernel_cache.py` for better maintainability.
      
      * [Refactor] Clean up whitespace and improve code formatting in kernel_cache.py
      
      * Removed unnecessary blank lines and adjusted spacing for better readability in the KernelCache class.
      * Enhanced overall code formatting to align with project standards.
      
      * [Enhancement] Add bfloat16 test case and improve kernel caching logic
      
      * Introduced a new test case for bfloat16 matrix multiplication in `test_tilelang_kernel_gemm_mma_intrinsic.py`.
      * Updated `KernelCache` to handle multiple kernel source files and improve error handling during saving and loading.
      * Refactored `JITKernel` to support instantiation from a database, enhancing flexibility in kernel management.
      * Adjusted `CtypesKernelAdapter` and `CythonKernelAdapter` to utilize the new kernel loading mechanism from the database.
      * Improved code formatting and readability across several files.
      
      * lint fix
      
      * Update bfloat16 matrix multiplication test case to use larger dimensions for improved coverage
      f2e99180
  4. 18 Mar, 2025 1 commit
    • Yu Cheng's avatar
      [Dev] Implement FlashAttention3 Backward (#244) · c264f37f
      Yu Cheng authored
      * [BugFix] Fix bug of missing MBarrierExpectTX
      
      * [Dev] Implement FlashAttention3 Backward
      
      - Added a new example for Flash Attention using pipelined WGMMA, including forward and backward pass implementations.
      - Introduced functions for forward and backward processing, leveraging tilelang for optimized tensor operations.
      - Enhanced the attention mechanism with support for both causal and non-causal configurations.
      - Included command-line arguments for batch size, number of heads, context size, and head dimension for flexibility in testing.
      - Updated GEMM operations to support a new `wg_wait` parameter for improved synchronization in kernel execution.
      c264f37f
  5. 16 Mar, 2025 1 commit
    • Yu Cheng's avatar
      [Refactor] Update kernel compilation and profiling in examples (#225) · 889451eb
      Yu Cheng authored
      - Replaced instances of `tilelang.lower` and `tilelang.Profiler` with `tilelang.compile` and the new profiler interface in multiple example files.
      - Enhanced the kernel compilation process to utilize the updated API, improving consistency and maintainability.
      - Adjusted benchmarking logic to use the new profiler methods for better clarity and functionality in performance testing.
      - Cleaned up whitespace and improved formatting for better readability across the modified files.
      889451eb
  6. 14 Mar, 2025 1 commit
  7. 13 Mar, 2025 2 commits
    • Yu Cheng's avatar
      [Dev] Add GQA backward example (#205) · a55f3686
      Yu Cheng authored
      - Introduce `example_gqa_bwd.py` demonstrating the backward pass of FlashAttention with pipelined execution.
      - Implement forward and backward functions for FlashAttention, including preprocessing and postprocessing steps.
      - Enhance argument parsing for batch size, heads, context size, and dimensions.
      - Include a reference implementation for validation and performance benchmarking.
      a55f3686
    • Yu Cheng's avatar
      [Dev] Add new example for FlashAttention with pipelined execution (#200) · c2b9b59d
      Yu Cheng authored
      - Introduce `example_gqa_fwd_bshd_wgmma_pipelined.py` demonstrating a pipelined implementation of FlashAttention.
      - Update sequence length parameter in existing example to 8192 and adjust number of stages for improved performance.
      - Enhance argument parsing to accommodate new configurations for batch size, heads, and groups.
      c2b9b59d
  8. 12 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Feature] Support Async Pipeline inference within if scope (#198) · 7ccec53b
      Lei Wang authored
      * Optimize CMake build process with dynamic job count calculation
      
      - Modify build_csrc function to use 90% of available CPU cores
      - Ensure at least one job is used during compilation
      - Improve build performance by dynamically adjusting parallel job count
      
      * Optimize build_csrc function with multiprocessing module
      
      - Replace os.cpu_count() with multiprocessing.cpu_count()
      - Maintain existing 90% CPU utilization logic
      - Improve CPU core count calculation for build process
      
      * Add dynamic shape support with out_idx in Cython JIT kernel compilation
      
      - Implement `run_cython_dynamic_shape_with_out_idx` function in test_tilelang_jit_gemm_cython.py
      - Update Cython wrapper to handle dynamic symbolic shapes during tensor allocation
      - Add support for resolving dynamic shape dimensions using input tensor references
      - Enhance flexibility of JIT kernel compilation with symbolic shape handling
      
      * Enhance error reporting for dynamic symbolic shape resolution in Cython JIT kernel
      
      - Add detailed error message when a dynamic symbolic dimension is not found in dynamic_symbolic_map
      - Improve debugging by providing context about missing symbolic dimensions
      - Maintain existing dynamic shape resolution logic
      
      * Fix Copy operation handling for scalar and multi-dimensional tensors
      
      - Add special handling for scalar tensor copy operations
      - Enhance error reporting in MakeIndices method with more detailed diagnostic information
      - Improve SIMT loop generation to support zero-dimensional tensors
      - Add explicit check and handling for scalar tensor scenarios
      
      * Refactor Copy operation code formatting and improve readability
      
      - Improve code formatting in MakeIndices and MakeSIMTLoop methods
      - Add line breaks to enhance readability of complex ICHECK statements
      - Simplify code structure in scalar tensor handling
      - Remove unnecessary whitespace and improve code alignment
      
      * Simplify GEMM example with direct kernel compilation
      
      - Update copyright header to Tile-AI Corporation
      - Remove Profiler import and usage
      - Replace tilelang.lower() with tilelang.compile()
      - Simplify kernel execution workflow
      - Update kernel source retrieval method
      
      * Enhance block sparse attention implementation
      
      - Update `blocksparse_flashattn` to use 2 stages for improved performance.
      - Change `block_mask_dtype` from `int8` to `bool` for better memory efficiency.
      - Modify condition checks in the kernel to utilize boolean values.
      - Introduce a new example for top-k sparse attention and a benchmark for native sparse attention.
      - Add support for asynchronous copy in PTX and improve pipeline planning with condition handling.
      
      * Refactor and clean up code formatting across multiple files
      
      - Added whitespace for improved readability in `example_blocksparse_gemm.py`, `example_tilelang_nsa_fwd.py`, and `benchmark_nsa_fwd.py`.
      - Enhanced code structure and alignment in `inject_ptx_async_copy.cc` and `pipeline_planning.cc`.
      - Updated comments and documentation for clarity in `__init__.py` and `phase.py`.
      - Ensured consistent formatting and style across the codebase.
      7ccec53b
  9. 09 Mar, 2025 1 commit
    • Lei Wang's avatar
      [Feat] Introduce new caching mechanism for compiled kernels (#176) · 7bde63d5
      Lei Wang authored
      * Add kernel caching mechanism to TileLang
      
      - Implement a new `cached` function in `tilelang/cache/__init__.py` to cache and reuse compiled kernels
      - Expose the `cached` function in the main `tilelang/__init__.py`
      - Add a test case for cached matrix multiplication in `testing/python/cache/test_tilelang_cache_matmul.py`
      - Provide a `clear_cache()` function to reset the kernel cache when needed
      
      * Refactor kernel caching test and implementation
      
      - Simplify the `cached` function in `tilelang/cache/__init__.py`
      - Update test script `test_tilelang_cache_matmul.py` to use `tilelang.testing.main()`
      - Remove unnecessary whitespace and improve code formatting
      
      * Update import for `cached` function in MHA examples
      
      - Modify import statement in `example_mha_bwd.py` and `test_tilelang_kernel_mha_bwd.py`
      - Change import from `tilelang.profiler import cached` to `tilelang import cached`
      - Align with recent refactoring of kernel caching mechanism
      
      * Refactor `cached` function signature in kernel caching
      
      - Update function signature to use keyword-only arguments for `target` and `target_host`
      - Improve parameter order and readability of the `cached` decorator
      - Maintain existing functionality while enhancing function definition
      7bde63d5
  10. 02 Mar, 2025 2 commits
    • Lei Wang's avatar
      [Kernel] Implement different SEQ Q/KV examples with block sparse (#133) · 159af5df
      Lei Wang authored
      * Change default log level from WARNING to INFO in TileLang initialization
      
      * Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support
      
      - Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
      - Remove unused imports and simplify function signature
      - Modify `flashattn` function to handle max sequence length as a separate argument
      - Update kernel call to include max sequence length parameter
      - Improve code readability and remove commented-out code
      - Add print statement to confirm successful assertion
      
      * Refactor code formatting in TileLang lowering and example files
      
      - Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
      - Simplify line breaks and reduce unnecessary whitespace
      - Enhance code readability by adjusting indentation and line breaks
      - Update example MHA forward pass script with cleaner tensor initialization
      
      * Update TileLang kernel test with import path changes for MMA layout and macro generator
      
      - Modify import statements in test_tilelang_kernel_dequantize_gemm.py
      - Replace bitblas imports with tilelang.intrinsics imports for MMA-related utilities
      - Update main function to use tilelang.testing.main()
      
      * Add Block Sparse Attention Examples for TileLang and Triton
      
      - Implement block sparse attention kernels for both TileLang and Triton
      - Add utility functions for generating sparse attention masks using top-k and threshold methods
      - Support causal and variable-length attention scenarios
      - Include test cases for different sequence length configurations
      - Demonstrate block-level sparse attention with configurable parameters
      
      * Refactor Block Sparse Attention Examples with Code Style Improvements
      
      - Improve code formatting in block_sparse_attn_tilelang.py and block_sparse_attn_triton.py
      - Enhance readability by adjusting line breaks and indentation
      - Simplify kernel and function calls with better formatting
      - Add whitespace and line break improvements for better code clarity
      159af5df
    • Lei Wang's avatar
      [Refactor] Set default log level from waning into info (#132) · 9ba96f19
      Lei Wang authored
      * Change default log level from WARNING to INFO in TileLang initialization
      
      * Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support
      
      - Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
      - Remove unused imports and simplify function signature
      - Modify `flashattn` function to handle max sequence length as a separate argument
      - Update kernel call to include max sequence length parameter
      - Improve code readability and remove commented-out code
      - Add print statement to confirm successful assertion
      
      * Refactor code formatting in TileLang lowering and example files
      
      - Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
      - Simplify line breaks and reduce unnecessary whitespace
      - Enhance code readability by adjusting indentation and line breaks
      - Update example MHA forward pass script with cleaner tensor initialization
      9ba96f19
  11. 28 Feb, 2025 2 commits
    • Lei Wang's avatar
      [Example] Implememt FMHA Varlen Example (#131) · dd5d955c
      Lei Wang authored
      * Add DeepSeek MLA decode example with Flash Attention implementation
      
      * Add GEMM SplitK and StreamK example implementations
      
      This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
      - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
      - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
      
      Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
      
      * Refactor GEMM SplitK and StreamK example implementations
      
      Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
      - Remove unused import (Profiler) in splitk example
      - Simplify line breaks and improve code readability
      - Standardize indentation and remove unnecessary whitespace
      - Optimize atomic add and copy operations for better clarity
      
      * Add block sparse attention benchmarks for multiple libraries
      
      This commit introduces comprehensive block sparse attention benchmarks for different libraries:
      - TileLang block sparse FMHA implementation
      - Triton block sparse FMHA implementation
      - PyTorch reference block sparse FMHA implementation
      - FlashAttention dense FMHA reference implementation
      
      The benchmarks include:
      - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
      - Sparse mask generation using top-k and threshold methods
      - Performance measurement for different sparse attention configurations
      - Utility functions for mask generation and benchmarking
      
      * Refactor block sparse attention benchmarks with code style improvements
      
      - Add Ruff linter ignore comments to benchmark files
      - Improve code formatting and line breaks
      - Remove unused imports
      - Standardize print statement formatting
      - Enhance code readability across multiple library benchmarks
      
      * lint fix
      
      * Add CUDA atomic operations for BFLOAT16 and update function naming
      
      - Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
      - Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
      - Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
      - Update kernel and language customization to use new function names
      - Add return type annotations in profiler module
      
      * lint fix
      
      * Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang
      
      This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
      - Group Query Attention (GQA) implementation
      - Flash Attention forward pass
      - Performance benchmarking
      - Configurable parameters for batch, heads, sequence length, and dimension
      - Autotuning support
      - Reference implementation comparison
      
      * Refactor IR lowering pipeline into modular phases
      
      This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
      - `LowerAndLegalize`: Handles initial IR legalization and transformation
      - `OptimizeForTarget`: Applies target-specific optimizations
      
      The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.
      
      * lintfix
      
      * nas kernel
      
      * Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates
      
      - Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
      - Increased default number of heads and selected blocks in TileLang NSA example
      - Added Ruff linter ignore comments to reference.py
      - Standardized function signatures and improved code readability across NSA implementations
      
      * Add utility math functions for integer operations
      
      - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
      - Add `cdiv()` function for ceiling division of integers
      
      * Add utility math functions for integer operations
      
      - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
      - Add `cdiv()` function for ceiling division of integers
      
      * Refactor DeepSeek MLA Decode Example with Enhanced Flash Attention Implementation
      
      - Update flash attention kernel to support positional embeddings (PE)
      - Modify reference implementation to handle PE and group query attention
      - Increase default batch size and adjust benchmarking parameters
      - Improve kernel performance and readability
      - Add einops and torch operations for more flexible tensor manipulation
      
      * Update README.md with corrected Flash MLA Decoding example path
      
      - Modify the example link for Flash MLA Decoding to point to the correct directory
      - Ensure accurate navigation to the DeepSeek MLA decoding example
      
      * Refactor Native Sparse Attention Kernel and Improve Utility Functions
      
      This commit introduces several improvements:
      - Simplified native sparse attention kernel by inlining macro functions in example_tilelang_nsa.py
      - Enhanced error handling in loop_partition.cc with more informative error messages
      - Updated print.py to support multi-dimensional buffer printing
      - Improved torch_assert_close in testing/__init__.py with more detailed mismatch reporting
      - Reduced default absolute tolerance in torch comparison from 1e-3 to 1e-2
      - Added shape validation and detailed mismatch information in tensor comparison
      
      * Refactor Code Formatting and Improve Utility Functions
      
      This commit introduces several code formatting and utility improvements:
      - Add Ruff linter ignore comment in example_tilelang_nsa.py
      - Enhance code readability in loop_partition.cc and lower_tile_op.cc with improved line breaks
      - Simplify print_flat_buffer_with_condition in print.py
      - Refactor torch_assert_close in testing/__init__.py with improved line formatting
      
      * Enhance Buffer Printing Support for Fragment and Shared Memory Buffers
      
      This commit improves the print functionality in print.py by:
      - Adding support for printing fragment memory buffers
      - Implementing a new print_fragment_buffer_with_condition macro
      - Extending print_shared_buffer_with_condition for shared memory buffers
      - Updating the generic print function to handle different buffer scopes
      
      * Resolve merge conflict in print.py
      
      Remove merge conflict marker and clean up whitespace in the print module
      
      * Add Variable-Length Multi-Head Attention (MHA) Example with Flash Attention Support
      
      Introduce a new example script `example_mha_fwd_varlen.py` that demonstrates:
      - Variable-length Multi-Head Attention (MHA) implementation
      - Flash Attention forward pass with padding mask support
      - Performance benchmarking for variable-length sequences
      - Configurable parameters for batch, heads, sequence length, and dimension
      - Reference implementation comparison with PyTorch and FlashAttention
      
      * Refactor Flash Attention Variable-Length MHA Example
      
      Improve code formatting and readability in the variable-length multi-head attention example:
      - Add Ruff linter ignore comment
      - Enhance code style with consistent formatting
      - Remove unused imports
      - Improve line breaks and indentation
      - Simplify function signatures and lambda expressions
      dd5d955c
    • Yu Cheng's avatar
      [Dev][Bugfix] Fix bug in ThreadTagChecker; Add WgmmaSync rewriter and add MHA... · 0d873fcf
      Yu Cheng authored
      [Dev][Bugfix] Fix bug in ThreadTagChecker; Add WgmmaSync rewriter and add MHA WGMMA pipelined example (#128)
      
      * [Dev] Add RetNet Linear Attention example
      
      * [Dev] Add WgmmaSync rewriter for pipelined WGMMA operations and add MHA WGMMA pipelined example (FA3-like scheduling)
      
      This commit introduces a new transformation pass `RewriteWgmmaSync` to optimize warp group matrix multiply accumulate (WGMMA) operations in the TileLang compiler:
      
      - Implemented `WgmmaSyncRewriter` in `src/transform/wgmma_sync_rewriter.cc`
      - Added pass registration for `RewriteWgmmaSync`
      - Updated `tilelang/engine/phase.py` to include the new transformation pass
      - Updated `tilelang/transform/__init__.py` to expose the new pass
      
      The rewriter intelligently manages synchronization and dependencies between WGMMA operations, improving pipeline efficiency for complex matrix multiplication kernels.
      
      * [Bugfix] Fix bug in ThreadTagChecker for warp specialization
      
      Improve thread tag validation in warp specialized rewriter to prevent unintended transformations:
      - Add more precise checks for threadIdx.y and threadIdx.z
      - Validate thread extent to ensure only single-extent thread bindings are allowed
      - Prevent warp specialization for multi-extent thread bindings in y and z dimensions
      
      * lint
      
      * [CI] Add TMA descriptor attribute to transformed module in test case
      0d873fcf
  12. 25 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Example] Add GQA Example (#118) · 2b97e98a
      Lei Wang authored
      * Add DeepSeek MLA decode example with Flash Attention implementation
      
      * Add GEMM SplitK and StreamK example implementations
      
      This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
      - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
      - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
      
      Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
      
      * Refactor GEMM SplitK and StreamK example implementations
      
      Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
      - Remove unused import (Profiler) in splitk example
      - Simplify line breaks and improve code readability
      - Standardize indentation and remove unnecessary whitespace
      - Optimize atomic add and copy operations for better clarity
      
      * Add block sparse attention benchmarks for multiple libraries
      
      This commit introduces comprehensive block sparse attention benchmarks for different libraries:
      - TileLang block sparse FMHA implementation
      - Triton block sparse FMHA implementation
      - PyTorch reference block sparse FMHA implementation
      - FlashAttention dense FMHA reference implementation
      
      The benchmarks include:
      - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
      - Sparse mask generation using top-k and threshold methods
      - Performance measurement for different sparse attention configurations
      - Utility functions for mask generation and benchmarking
      
      * Refactor block sparse attention benchmarks with code style improvements
      
      - Add Ruff linter ignore comments to benchmark files
      - Improve code formatting and line breaks
      - Remove unused imports
      - Standardize print statement formatting
      - Enhance code readability across multiple library benchmarks
      
      * lint fix
      
      * Add CUDA atomic operations for BFLOAT16 and update function naming
      
      - Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
      - Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
      - Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
      - Update kernel and language customization to use new function names
      - Add return type annotations in profiler module
      
      * lint fix
      
      * Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang
      
      This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
      - Group Query Attention (GQA) implementation
      - Flash Attention forward pass
      - Performance benchmarking
      - Configurable parameters for batch, heads, sequence length, and dimension
      - Autotuning support
      - Reference implementation comparison
      
      * Refactor IR lowering pipeline into modular phases
      
      This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
      - `LowerAndLegalize`: Handles initial IR legalization and transformation
      - `OptimizeForTarget`: Applies target-specific optimizations
      
      The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.
      
      * lintfix
      2b97e98a
  13. 22 Feb, 2025 1 commit
    • Lei Wang's avatar
      [Example] Implement simple block sparse kernel (#106) · c7462abf
      Lei Wang authored
      * Remove Torch CPP backend and update execution backend options
      
      - Remove TorchCPPKernelAdapter and related code from JIT modules
      - Update execution backend options in jit/__init__.py, kernel.py, and adapter/__init__.py
      - Remove "torch_cpp" from supported execution backend literals
      - Simplify backend validation and remove unused torch_cpp-related code
      。
      
      * lint fix
      
      * Add block sparse attention implementations for TileLang and Triton
      
      - Implement block sparse attention kernels for TileLang and Triton
      - Add example scripts for block sparse attention with top-k and threshold-based masking
      - Include utility functions for generating sparse attention masks
      - Demonstrate causal attention with block-level sparsity
      - Add test cases to validate sparse attention implementations against PyTorch reference
      c7462abf
  14. 11 Feb, 2025 1 commit
    • Yu Cheng's avatar
      [Dev] Add mha backward example (#77) · a6fe61e2
      Yu Cheng authored
      * [CI][Test] Add test cases for tilelang transform MultiVersionBuffer and WarpSpecialized
      
      * Relax the mismatch ratio restrictions in the flash_linear_attention and mha tests
      
      * [Dev] Add mha backward example
      a6fe61e2
  15. 25 Jan, 2025 4 commits
    • Yu Cheng's avatar
      [CI][Test] Add test cases for tilelang kernel FlashAttention (#54) · bedab1a0
      Yu Cheng authored
      * [Dev] Add FlashDecoding example
      
      * [CI][Test] Add test cases for tilelang kernel convolution
      
      * [CI][Test] Add test cases for tilelang kernel FlashAttention
      
      * Reduce the number of stages to ensure the shared memory allocation is valid
      
      * Temporarily remove the dim128 case
      
      * lint
      
      * update einops in requirements-dev.txt
      
      * update einops in requirements-test.txt
      
      * remove einops in requirements-dev.txt
      bedab1a0
    • Yu Cheng's avatar
      [CI][Test] Add test cases for tilelang kernel convolution (#51) · 34de04a6
      Yu Cheng authored
      * [CI][Test] Add test cases for tilelang kernel convolution
      34de04a6
    • Lei Wang's avatar
      [Doc] Remove unnecessary layout annotation (#49) · 47ecc791
      Lei Wang authored
      * [Doc] Update documentation structure and content: add overview section, revise project name, and change theme to Furo
      
      * [Feature] Add device-side debug printing functions and integrate into kernel interface
      
      * lint fix
      
      * remove debug print
      
      * implement test for debug
      
      * lint fix
      
      * add some comments
      
      * Enhance fragment design and assert fragment print
      
      * enhance debug print
      
      * add test for msg
      
      * lint fix
      
      * format
      
      * add flash decoding exmaples
      
      * remove comment
      
      * test simplified
      47ecc791
    • Yu Cheng's avatar
      [Dev] Add FlashDecoding example (#46) · cc08ba50
      Yu Cheng authored
      cc08ba50
  16. 11 Jan, 2025 1 commit
    • Lei Wang's avatar
      [Initialization] Migration of Codebase from Dev Branch into Main (#10) · 57ab687c
      Lei Wang authored
      
      
      * Add format.sh script for code formatting and linting
      
      * docs update
      
      * center align the title
      
      * lint fix
      
      * add ignore
      
      * Add .gitignore for 3rdparty directory
      
      * Add requirements-dev.txt, requirements-test.txt, and requirements.txt
      
      * 3rdparty
      
      * Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h
      
      * Refactor CMakeLists.txt and include statements
      
      - Update CMakeLists.txt to use a newer version of CMake and add project name
      - Remove unnecessary include directories
      
      Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc
      
      - Update include paths to use relative paths instead of absolute paths
      
      * Update submodule for 3rdparty/tvm
      
      * update
      
      * load dll first
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * git keep update
      
      * Refactor CMakeLists.txt and include statements
      
      * Refactor CMakeLists.txt and include statements
      
      * refactor code structure
      
      * Update Readme
      
      * CMakeLists Customized
      
      * update readme
      
      * update README
      
      * update readme
      
      * update usage
      
      * with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import
      
      * annotate lower transform global func with `transform` prefix
      
      * Migrate Simplify Pass from tilelang tvm branch
      
      * enhance system environment handling with __init__ and CMake
      
      * Initial commit
      
      * CODE_OF_CONDUCT.md committed
      
      * LICENSE committed
      
      * README.md committed
      
      * SECURITY.md committed
      
      * SUPPORT.md committed
      
      * CODE_OF_CONDUCT Commit
      
      * LICENSE Commit
      
      * SECURITY Commit
      
      * SUPPORT Commit
      
      * Modify Support
      
      * Update README.md
      
      * security ci update
      
      * remove examples
      
      * Update and implement clang-format
      
      * add composable kernel components
      
      * Migrate from latest update
      
      * submodule update
      
      * Test update
      
      * Update License
      
      * Spell check
      
      * lint fix
      
      * add clang-tidy to apply static analysis for c source
      
      * update tilelang examples
      
      * Update Install Docs
      
      * Refactor filetree
      
      * Enhance Install
      
      * conflict resloved
      
      * annotate_version
      
      * Initial Update
      
      * test fix
      
      * install
      
      * Implement setup.py
      
      * lint fix
      
      * Separate Init
      
      * Separate test
      
      * docker file commit
      
      * add logo
      
      * Update Readme and Examples
      
      * update readme
      
      * update logo
      
      * Implement AMD Installation
      
      * Add License
      
      * Update AMD MI300x Benchmark
      
      * update README
      
      * update mi300 benchmark scripts
      
      * update ignore
      
      * enhance build scirpt
      
      * update image
      
      * enhance setup.py to remove duplicated libraries
      
      * remove debug files
      
      * update readme
      
      * update image
      
      * update gemm examples
      
      * update flashattention README
      
      * readme update
      
      * add cmake into requirements
      
      * libinfo fix
      
      * auto update submodule
      
      * lint fix
      
      * Fix AMD Build and Test
      
      * Update check for transpose attribute for CDNA Arch
      
      * typo fix for amd
      
      * Implement Matmul Benchmark
      
      * Refactor Code
      
      * [TypoFix] Fix GEMM Example
      
      * [Docs] Init Linear Attention README
      
      * [TYPO] Typo fix
      
      * [Lint] Lint Fix
      
      * enhance example with intrinsics
      
      * [Enhancement] Improve Buffer Collection during IR Parser
      
      * [Dev] Introduce Current classmethod to get current frame
      
      * submodule update
      
      * fake test pass update
      
      * support thread_extent_api
      
      * code optimize
      
      * Add GEMM function implementation for matrix multiplication
      
      * Update logging format to reflect TileLang in logger messages
      
      * Refactor CMakeLists.txt for improved readability and set default build type to Release
      
      * Support Gemm SS Primitives Implementation
      
      * [README] Upload Tile Language Logo (#5)
      
      * update logo
      
      * Update README.md to enhance formatting and center the title
      
      ---------
      Co-authored-by: default avatarmicrosoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
      Co-authored-by: default avatarMicrosoft Open Source <microsoftopensource@users.noreply.github.com>
      Co-authored-by: default avatarYu Cheng <yu.cheng@pku.edu.cn>
      57ab687c