Commits · 166a9585f8e3a169690107929c1a36c84b3f9807 · OpenDAS / tilelang

07 Mar, 2025 6 commits

[Dev] Use SS-GEMM for PV in mla (#165) · 166a9585
You Jiacheng authored Mar 08, 2025
```
It's slightly faster than T.copy then RS-GEMM, and simpler.
```
166a9585
Add Docker build scripts for local and PyPI distribution (#166) · d3f26ef8
Lei Wang authored Mar 07, 2025

d3f26ef8

[Example] Implement NSA Decode tilelang examples (#168) · 69f35439

Lei Wang authored Mar 07, 2025

* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation

- Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
- Modify roller hints generation using new TileLang Carver template and utility functions
- Update get_roller_hints_from_func to handle None cases and improve return logic
- Adjust DefaultPolicy to handle different codegen dictionary formats

* [Refactor] Update Thread Binding and Import Statements in TileLang Kernels

- Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
- Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
- Move map_torch_type utility function to tilelang.utils.tensor
- Remove unnecessary imports and improve code organization

* Refactor Native Sparse Attention Example with Enhanced Triton Kernel

- Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
- Add support for block counts and offsets in the Triton kernel
- Modify kernel grid and computation logic for improved performance
- Update example script to use naive_nsa_simple reference implementation
- Improve type hints and kernel configuration

* Add Native Sparse Attention Examples with Tilelang and Triton Implementations

- Introduce new example scripts for native sparse attention:
  * example_tilelang_nsa_fwd.py: Forward pass implementation using TileLang
  * example_tilelang_nsa_decode.py: Decoding-specific sparse attention implementation
  * example_triton_nsa_fwd.py: Triton-based sparse attention forward pass
- Update reference.py with naive implementations for sparse attention
- Support different sparse attention scenarios including forward pass and inference
- Add comprehensive testing and validation against reference implementations

* lint fix

69f35439

[Bugfix] Cast bool dtype into int8 in blocksparse examples (#167) · b6c48453

Lei Wang authored Mar 07, 2025

* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation

- Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
- Modify roller hints generation using new TileLang Carver template and utility functions
- Update get_roller_hints_from_func to handle None cases and improve return logic
- Adjust DefaultPolicy to handle different codegen dictionary formats

* [Refactor] Update Thread Binding and Import Statements in TileLang Kernels

- Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
- Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
- Move map_torch_type utility function to tilelang.utils.tensor
- Remove unnecessary imports and improve code organization

* Refactor Native Sparse Attention Example with Enhanced Triton Kernel

- Update parallel_nsa_fwd_kernel to support more flexible sparse attention computation
- Add support for block counts and offsets in the Triton kernel
- Modify kernel grid and computation logic for improved performance
- Update example script to use naive_nsa_simple reference implementation
- Improve type hints and kernel configuration

b6c48453

[Refactor] Replace `T.thread_binding` with `T.get_thread_binding` in examples and test cases (#163) · de1ba1e4

Lei Wang authored Mar 07, 2025

* [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation

- Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
- Modify roller hints generation using new TileLang Carver template and utility functions
- Update get_roller_hints_from_func to handle None cases and improve return logic
- Adjust DefaultPolicy to handle different codegen dictionary formats

* [Refactor] Update Thread Binding and Import Statements in TileLang Kernels

- Replace T.thread_binding() with T.get_thread_binding() across multiple kernel test files
- Update import statements for MMA layout and macro generator in dequantize GEMM and FP8 examples
- Move map_torch_type utility function to tilelang.utils.tensor
- Remove unnecessary imports and improve code organization

de1ba1e4

[Enhancement] Improve CUDA path detection (#157) · 901deae1

Wenhao Xie authored Mar 07, 2025

* [Typo] Fix formatting in installation instructions in README.md

* [Enhancement] Improve CUDA path detection and update configuration handling

* fix typo

* remove IS_WINDOWS constant

* lint fix

* Improve error messages for CUDA detection failure

* lint fix

* lint fix

* Fix .gitignore to correctly include venv directory

901deae1

06 Mar, 2025 8 commits

Refactor MLA decode kernel: Replace T.If with native Python if statement (#162) · cfcbcf1e

Lei Wang authored Mar 07, 2025

Simplify the control flow in the MLA decode kernel by replacing TileLang's T.If construct with a standard Python if statement. This change improves code readability and maintains the existing logic for handling sequence length constraints during block-wise computation.

cfcbcf1e

[Carver] Multi-Threads Compilation for Fast Auto Tuning (#156) · 18be9e07
Chaofan Lin authored Mar 07, 2025
```
* [Carver] Multi-Threads Compilation for Fast Auto Tuning

* Add progress bar for compilation

* lint
```
18be9e07
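
The commit message above does not say how the multi-threaded compilation is organised. As a minimal sketch of the general idea, assuming each candidate configuration can be compiled independently, a thread pool can compile them concurrently and report progress as each job finishes. `compile_config` and the `block_M`/`block_N` keys are hypothetical stand-ins, not the Carver API.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def compile_config(config):
    """Hypothetical stand-in for compiling one candidate kernel configuration."""
    return f"kernel(block_M={config['block_M']}, block_N={config['block_N']})"


def compile_all(configs, max_workers=4):
    """Compile every config on a thread pool, printing progress as jobs finish."""
    results = [None] * len(configs)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its position so results keep the input order.
        futures = {pool.submit(compile_config, cfg): i for i, cfg in enumerate(configs)}
        for done, future in enumerate(as_completed(futures), start=1):
            results[futures[future]] = future.result()
            print(f"compiled {done}/{len(configs)}")
    return results


configs = [{"block_M": m, "block_N": n} for m in (64, 128) for n in (64, 128)]
kernels = compile_all(configs)
```

Threads tend to help here only when the compile step spends its time outside the Python interpreter, for example waiting on an external compiler process.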

Add cpu jit with backend ctypes (#154) · 782ca9f6

xs-keju authored Mar 06, 2025

* Add cpu jit with backend ctypes

* Resolve some lint issues

* Apply PR feedback on head file and kernel example

* Add test cases

* Resolve formatting issues

* Resolve formatting issues

---------
Co-authored-by: xxw <1990389406@qq.con>

782ca9f6

Add libstdcxx-ng-12 to Dockerfiles for CUDA versions (#160) · 3486e27e

Lei Wang authored Mar 06, 2025

Update Dockerfiles for CUDA 118, 120, 121, 123, 124, 125, and 126 to install libstdcxx-ng-12 from conda-forge, ensuring consistent standard library support across different CUDA versions

3486e27e

[Release] Bump Version to v0.1.2 (#155) · 237dab0d

Lei Wang authored Mar 06, 2025

* Remove Torch CPP backend and update execution backend options

- Remove TorchCPPKernelAdapter and related code from JIT modules
- Update execution backend options in jit/__init__.py, kernel.py, and adapter/__init__.py
- Remove "torch_cpp" from supported execution backend literals
- Simplify backend validation and remove unused torch_cpp-related code

* lint fix

* Add block sparse attention implementations for TileLang and Triton

- Implement block sparse attention kernels for TileLang and Triton
- Add example scripts for block sparse attention with top-k and threshold-based masking
- Include utility functions for generating sparse attention masks
- Demonstrate causal attention with block-level sparsity
- Add test cases to validate sparse attention implementations against PyTorch reference
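
The two masking strategies named in the list above (top-k and threshold) can be illustrated at block granularity. This is a plain-Python sketch under stated assumptions, not the utility code from the examples: `scores` is an assumed matrix with one importance score per (query block, key block) pair, and all function names are hypothetical.

```python
def topk_block_mask(scores, k):
    """Keep the k highest-scoring key blocks in each query-block row."""
    mask = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        keep = set(order[:k])
        mask.append([j in keep for j in range(len(row))])
    return mask


def threshold_block_mask(scores, threshold):
    """Keep every key block whose score exceeds the threshold."""
    return [[s > threshold for s in row] for row in scores]


def apply_causal(mask):
    """Drop key blocks that lie after the query block (block-level causality)."""
    return [[keep and j <= i for j, keep in enumerate(row)]
            for i, row in enumerate(mask)]


scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.5],
    [0.4, 0.6, 0.7],
]
topk = topk_block_mask(scores, k=2)
causal_topk = apply_causal(topk)
thresholded = threshold_block_mask(scores, threshold=0.45)
```

Applying the causal constraint after top-k selection, as this sketch does, can leave a row with fewer than `k` active blocks; masking out future blocks before selecting is an alternative design.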

* Bump version to 0.1.1

* Bump version to 0.1.2

237dab0d

[Dev][Benchmark] Add MLA paged decoding example and benchmark script (#158) · be9abf18

Yu Cheng authored Mar 06, 2025

* [Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16

- Remove redundant `acc_s_0` fragment in flash attention kernel
- Simplify memory copy and reduction operations
- Reorder memory copy and scaling steps for improved performance
- Add Hopper-specific synchronization method in CUDA reduce template
- Update reduce operation to use architecture-specific synchronization

* [Dev] Add DeepSeek MLA Decoding (Paged+Varlen) kernel and Performance Benchmark Script

- Implement comprehensive MLA (Multi-Head Latent Attention) decoding benchmark script
- Add support for multiple implementations: Torch, TileLang, FlashMLA, FlashInfer, and Triton
- Create flexible configuration for benchmarking different batch sizes, sequence lengths, and head configurations
- Implement performance comparison and CSV output for detailed performance analysis
- Add command-line argument support for targeted benchmarking and comparison

* [Dev] Refactor MLA Paged Decoding Kernel with Improved Block Handling and Precision

- Replace `d` parameter with `dv` to clarify value dimension in MLA decoding
- Enhance block distribution logic for split KV processing
- Improve handling of remaining blocks in split KV computation
- Add initialization of `lse_max_local` to prevent potential precision issues
- Optimize block start and range calculations for more accurate sequence processing

* lint

be9abf18

[Carver] Enhance Carver Adaptation for MatMul Benchmarking (#153) · 3c53297b

Lei Wang authored Mar 06, 2025

* [Refactor] Consolidate GemmWarpPolicy Enum and Add Utility Method

- Move GemmWarpPolicy from copy.py and gemm.py to primitives/gemm/base.py
- Implement from_warp_partition class method to determine warp policy
- Add docstring with examples for policy determination
- Remove duplicate GemmWarpPolicy class definitions

* [Enhancement] Add TensorCore Intrinsic Matrix Multiplication Benchmarks

- Implement two new matrix multiplication benchmark scripts:
  1. `benchmark_matmul_intrinsic.py`: Uses TensorCore intrinsics with advanced configuration
  2. `benchmark_matmul.py`: Provides a more generic matrix multiplication benchmark

- Add support for roller-based configuration generation in both benchmarks
- Enhance MMA macro generator to handle 2D and 4D output buffer layouts
- Implement flexible autotuning configurations with multiple parameters
- Support different data types and accumulation modes
- Add command-line arguments for matrix dimensions and roller configuration

* lint fix

* Fix roller hints generation in get_roller_hints_from_func

- Simplify roller hints generation logic
- Ensure policy-based configuration is always emitted when a policy is available
- Remove redundant None check for roller hints

* Add shared memory for matrix multiplication in benchmark and quickstart examples

- Modify benchmark_matmul.py and quickstart.py to include C_shared allocation
- Change accumulation dtype from float16 to float in benchmark_matmul.py
- Update matrix multiplication kernels to use shared memory for result storage
- Enable CUDA kernel source printing in quickstart example

3c53297b

[Enhancement] Optimize TileLang Build Process with Dynamic CPU Core Allocation (#152) · e945dae2

Lei Wang authored Mar 06, 2025

- Calculate 75% of available CPU cores for make jobs
- Prevent system unresponsiveness during build
- Dynamically adjust make job count based on system resources
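
The job-count rule above can be sketched as follows. The function name and the floor of one job are assumptions for illustration; the actual build script may compute the value differently.

```python
import os


def make_job_count(fraction=0.75):
    """Return the number of parallel make jobs: a fraction of the cores, at least 1."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, int(cores * fraction))


jobs = make_job_count()
make_command = ["make", f"-j{jobs}"]
print(" ".join(make_command))
```

Leaving roughly a quarter of the cores idle is what keeps the machine responsive while the build runs.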

e945dae2

05 Mar, 2025 6 commits

[Refactor] Remove BitBLAS Import in Benchmark (#150) · 8b9edc3e
Chaofan Lin authored Mar 06, 2025

8b9edc3e

[Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation (#148) · 0e2eae42

Lei Wang authored Mar 05, 2025

- Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
- Modify roller hints generation using new TileLang Carver template and utility functions
- Update get_roller_hints_from_func to handle None cases and improve return logic
- Adjust DefaultPolicy to handle different codegen dictionary formats

0e2eae42

[Enhancement] Enable runtime tensor data type validation (#146) · d0434c3e

Lei Wang authored Mar 05, 2025

* Fix debug print buffer template for unsigned char type

- Update debug_print_buffer_value template specialization for unsigned char
- Modify test_tilelang_debug_print.py to include additional dtype tests
- Add test case for uint8 dtype in debug print buffer function

* Refactor debug print buffer template formatting for unsigned char

- Improve code formatting for debug_print_buffer_value template specialization
- Adjust line breaks and indentation for better readability
- Maintain consistent code style with other template specializations

* Extract map_torch_type utility function to tilelang.utils.tensor

- Move map_torch_type function from multiple test files to a centralized location
- Import map_torch_type from tilelang.utils.tensor in kernel test files
- Improve code reusability by creating a shared utility function for type mapping

* Add buffer dtype mapping for Cython kernel adapter

- Introduce buffer_dtype_map in CythonKernelAdapter to track buffer variable dtypes
- Add _process_buffer_dtype method to extract dtype information from TIR function
- Update CythonKernelWrapper to support setting and validating buffer dtypes
- Enhance type checking during kernel execution with dtype verification
- Improve logging message for Cython JIT adapter compilation

* Add static shape mapping for Cython kernel adapter

- Introduce static_shape_map in CythonKernelAdapter to track buffer variable static shapes
- Add _process_static_shape method to extract static shape information from TIR function
- Update CythonKernelWrapper to support setting and validating static shapes
- Enhance type checking during kernel execution with static shape verification
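
The dtype and static-shape checks described in the two lists above can be sketched in plain Python. The names `buffer_dtype_map` and `static_shape_map` come from the commit message, but the layout of those maps, the `FakeTensor` class, and `validate_args` are assumptions made for illustration, not the adapter's actual code.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a framework tensor; only dtype and shape matter."""
    dtype: str
    shape: tuple


def validate_args(args, buffer_dtype_map, static_shape_map):
    """Check positional arguments against recorded dtypes and static extents.

    Assumed layouts:
      buffer_dtype_map: name -> (argument index, expected dtype)
      static_shape_map: name -> (argument index, [(dim index, expected extent), ...])
    """
    for name, (idx, dtype) in buffer_dtype_map.items():
        if args[idx].dtype != dtype:
            raise TypeError(f"{name}: expected dtype {dtype}, got {args[idx].dtype}")
    for name, (idx, dims) in static_shape_map.items():
        for dim, extent in dims:
            if args[idx].shape[dim] != extent:
                raise ValueError(
                    f"{name}: expected shape[{dim}] == {extent}, got {args[idx].shape[dim]}")


dtype_map = {"A": (0, "float16")}
shape_map = {"A": (0, [(1, 64)])}  # dim 0 is treated as dynamic, dim 1 must be 64
validate_args([FakeTensor("float16", (8, 64))], dtype_map, shape_map)  # passes silently
```

Only dimensions listed in the shape map are checked, so dynamic dimensions can still vary between calls.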

* Add Multi-Head Attention (MHA) Backward Pass Test for TileLang Kernel

- Implement comprehensive test for Multi-Head Attention backward pass
- Support both causal and non-causal attention scenarios
- Add reference implementation for comparing kernel outputs
- Test different batch sizes, head counts, sequence lengths, and head dimensions
- Verify forward and backward pass correctness using torch.testing.assert_close

* Set random seed for MHA backward pass test

- Add random seed initialization for consistent test reproducibility
- Use tilelang.testing.set_random_seed(42) to ensure deterministic test results

d0434c3e

[Enhancement] Support debug print for unsigned char datatype (#145) · bb60f6ce

Lei Wang authored Mar 05, 2025

* Fix debug print buffer template for unsigned char type

- Update debug_print_buffer_value template specialization for unsigned char
- Modify test_tilelang_debug_print.py to include additional dtype tests
- Add test case for uint8 dtype in debug print buffer function

* Refactor debug print buffer template formatting for unsigned char

- Improve code formatting for debug_print_buffer_value template specialization
- Adjust line breaks and indentation for better readability
- Maintain consistent code style with other template specializations

bb60f6ce

[Refactor] Rename gemm fp8 example as we currently lack `T.gemm` support for fp8 (#144) · 37d44f24

Lei Wang authored Mar 05, 2025

* Change default log level from WARNING to INFO in TileLang initialization

* Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support

- Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
- Remove unused imports and simplify function signature
- Modify `flashattn` function to handle max sequence length as a separate argument
- Update kernel call to include max sequence length parameter
- Improve code readability and remove commented-out code
- Add print statement to confirm successful assertion

* Refactor code formatting in TileLang lowering and example files

- Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
- Simplify line breaks and reduce unnecessary whitespace
- Enhance code readability by adjusting indentation and line breaks
- Update example MHA forward pass script with cleaner tensor initialization

* Update TileLang kernel test with import path changes for MMA layout and macro generator

- Modify import statements in test_tilelang_kernel_dequantize_gemm.py
- Replace bitblas imports with tilelang.intrinsics imports for MMA-related utilities
- Update main function to use tilelang.testing.main()

* Add Block Sparse Attention Examples for TileLang and Triton

- Implement block sparse attention kernels for both TileLang and Triton
- Add utility functions for generating sparse attention masks using top-k and threshold methods
- Support causal and variable-length attention scenarios
- Include test cases for different sequence length configurations
- Demonstrate block-level sparse attention with configurable parameters

* Refactor Block Sparse Attention Examples with Code Style Improvements

- Improve code formatting in block_sparse_attn_tilelang.py and block_sparse_attn_triton.py
- Enhance readability by adjusting line breaks and indentation
- Simplify kernel and function calls with better formatting
- Add whitespace and line break improvements for better code clarity

* Enhance Layout Plotting with Multi-Replication and Dynamic Visualization

- Update plot_layout function to support multiple replications in thread and value mapping
- Improve thread and value mapping to handle replicated layouts
- Dynamically adjust figure size and legend positioning
- Add print statements for saved plot file paths
- Modify example fragment_mma_load_a.py to uncomment and enable warp and block layout plotting

* Refactor AtomicAdd functions in CUDA common header

- Implement a generic template for AtomicAdd function
- Specialize templates for half_t, bfloat16_t, and pointer types
- Reorganize and clean up existing AtomicAdd implementations
- Improve type handling and conversion in atomic operations

* Remove unused import in MHA backward test file

- Remove unnecessary argparse import from test_tilelang_kenrel_mha_bwd.py
- Add blank line for improved code formatting
- Minor code cleanup in test file

* Add FP8 GEMM Example with TensorCore Intrinsics

- Implement a new example for FP8 matrix multiplication using TensorCore intrinsics
- Support E4M3 and E5M2 floating-point 8-bit data types
- Add README with notes on current FP8 implementation limitations
- Include correctness test for FP8 GEMM with different configurations
- Demonstrate swizzle layout and pipeline optimizations for FP8 computation

37d44f24

[Dev] Adjust computation logic to avoid precision loss when casting acc_s from... · e1d82bf3

Yu Cheng authored Mar 05, 2025

[Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16 (#141)

- Remove redundant `acc_s_0` fragment in flash attention kernel
- Simplify memory copy and reduction operations
- Reorder memory copy and scaling steps for improved performance
- Add Hopper-specific synchronization method in CUDA reduce template
- Update reduce operation to use architecture-specific synchronization

e1d82bf3

04 Mar, 2025 3 commits

[Dev][Doc] Enhance Flash Attention Implementation in GQA Decoding Example and Fix Typo (#139) · 3d7b2dc5

Yu Cheng authored Mar 04, 2025

- Add non-split flash attention macro for more flexible kernel generation
- Implement `main_no_split` function to handle single-split scenarios
- Modify kernel selection logic to dynamically choose between split and non-split implementations

3d7b2dc5

[Bugfix] Add missing definition for AtomicAdd (#138) · 3960d3d0

Lei Wang authored Mar 04, 2025

* Change default log level from WARNING to INFO in TileLang initialization

* Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support

- Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
- Remove unused imports and simplify function signature
- Modify `flashattn` function to handle max sequence length as a separate argument
- Update kernel call to include max sequence length parameter
- Improve code readability and remove commented-out code
- Add print statement to confirm successful assertion

* Refactor code formatting in TileLang lowering and example files

- Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
- Simplify line breaks and reduce unnecessary whitespace
- Enhance code readability by adjusting indentation and line breaks
- Update example MHA forward pass script with cleaner tensor initialization

* Update TileLang kernel test with import path changes for MMA layout and macro generator

- Modify import statements in test_tilelang_kernel_dequantize_gemm.py
- Replace bitblas imports with tilelang.intrinsics imports for MMA-related utilities
- Update main function to use tilelang.testing.main()

* Add Block Sparse Attention Examples for TileLang and Triton

- Implement block sparse attention kernels for both TileLang and Triton
- Add utility functions for generating sparse attention masks using top-k and threshold methods
- Support causal and variable-length attention scenarios
- Include test cases for different sequence length configurations
- Demonstrate block-level sparse attention with configurable parameters

* Refactor Block Sparse Attention Examples with Code Style Improvements

- Improve code formatting in block_sparse_attn_tilelang.py and block_sparse_attn_triton.py
- Enhance readability by adjusting line breaks and indentation
- Simplify kernel and function calls with better formatting
- Add whitespace and line break improvements for better code clarity

* Enhance Layout Plotting with Multi-Replication and Dynamic Visualization

- Update plot_layout function to support multiple replications in thread and value mapping
- Improve thread and value mapping to handle replicated layouts
- Dynamically adjust figure size and legend positioning
- Add print statements for saved plot file paths
- Modify example fragment_mma_load_a.py to uncomment and enable warp and block layout plotting

* Refactor AtomicAdd functions in CUDA common header

- Implement a generic template for AtomicAdd function
- Specialize templates for half_t, bfloat16_t, and pointer types
- Reorganize and clean up existing AtomicAdd implementations
- Improve type handling and conversion in atomic operations

* Remove unused import in MHA backward test file

- Remove unnecessary argparse import from test_tilelang_kenrel_mha_bwd.py
- Add blank line for improved code formatting
- Minor code cleanup in test file

3960d3d0

[Doc] Add MLA Decoding Performance Benchmarks and Documentation (#137) · e89e8b6c

Yu Cheng authored Mar 04, 2025

- Update news and MLA performance benchmark in README.md
- Move performance benchmark and layout images to a dedicated 'figures' directory
- Improve code formatting and image references in documentation

e89e8b6c

03 Mar, 2025 3 commits

[Debug] Improve Memory Layout Plot (#136) · e32311b2

Lei Wang authored Mar 04, 2025

* Change default log level from WARNING to INFO in TileLang initialization

* Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support

- Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
- Remove unused imports and simplify function signature
- Modify `flashattn` function to handle max sequence length as a separate argument
- Update kernel call to include max sequence length parameter
- Improve code readability and remove commented-out code
- Add print statement to confirm successful assertion

* Refactor code formatting in TileLang lowering and example files

- Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
- Simplify line breaks and reduce unnecessary whitespace
- Enhance code readability by adjusting indentation and line breaks
- Update example MHA forward pass script with cleaner tensor initialization

* Update TileLang kernel test with import path changes for MMA layout and macro generator

- Modify import statements in test_tilelang_kernel_dequantize_gemm.py
- Replace bitblas imports with tilelang.intrinsics imports for MMA-related utilities
- Update main function to use tilelang.testing.main()

* Add Block Sparse Attention Examples for TileLang and Triton

- Implement block sparse attention kernels for both TileLang and Triton
- Add utility functions for generating sparse attention masks using top-k and threshold methods
- Support causal and variable-length attention scenarios
- Include test cases for different sequence length configurations
- Demonstrate block-level sparse attention with configurable parameters

* Refactor Block Sparse Attention Examples with Code Style Improvements

- Improve code formatting in block_sparse_attn_tilelang.py and block_sparse_attn_triton.py
- Enhance readability by adjusting line breaks and indentation
- Simplify kernel and function calls with better formatting
- Add whitespace and line break improvements for better code clarity

* Enhance Layout Plotting with Multi-Replication and Dynamic Visualization

- Update plot_layout function to support multiple replications in thread and value mapping
- Improve thread and value mapping to handle replicated layouts
- Dynamically adjust figure size and legend positioning
- Add print statements for saved plot file paths
- Modify example fragment_mma_load_a.py to uncomment and enable warp and block layout plotting

e32311b2

[Doc] Update MLA Documentation (#135) · b70683b3
Yu Cheng authored Mar 04, 2025

b70683b3

[Dev][Doc] Add DeepSeek MLA Decode Example with Documentation and Performance Benchmarks (#134) · cd94aca1

Yu Cheng authored Mar 04, 2025

* [Dev] Add RetNet Linear Attention example

* [Dev] Add WgmmaSync rewriter for pipelined WGMMA operations and add MHA WGMMA pipelined example (FA3-like scheduling)

This commit introduces a new transformation pass `RewriteWgmmaSync` to optimize warp group matrix multiply accumulate (WGMMA) operations in the TileLang compiler:

- Implemented `WgmmaSyncRewriter` in `src/transform/wgmma_sync_rewriter.cc`
- Added pass registration for `RewriteWgmmaSync`
- Updated `tilelang/engine/phase.py` to include the new transformation pass
- Updated `tilelang/transform/__init__.py` to expose the new pass

The rewriter intelligently manages synchronization and dependencies between WGMMA operations, improving pipeline efficiency for complex matrix multiplication kernels.

* [Bugfix] Fix bug in ThreadTagChecker for warp specialization

Improve thread tag validation in warp specialized rewriter to prevent unintended transformations:
- Add more precise checks for threadIdx.y and threadIdx.z
- Validate thread extent to ensure only single-extent thread bindings are allowed
- Prevent warp specialization for multi-extent thread bindings in y and z dimensions

* lint

* [CI] Add TMA descriptor attribute to transformed module in test case

* [Dev] Refactor DeepSeek MLA Decode Example with Non-Split and Split Flash Attention Implementations

- Add new `flash_attn` macro for non-split flash attention implementation
- Add swizzled layout for tile in shared memory
- Use threadblock swizzle to improve L2 cache hit rate

* [Dev] Add DeepSeek MLA Decode Example with Documentation and Performance Benchmarks

- Add detailed README.md explaining MLA (Multi-Head Latent Attention) implementation
- Include performance benchmark images for batch sizes 64 and 128
- Add layout visualization images for QK and PV operations
- Implement torch reference implementations in torch_refs.py
- Update example_mla_decode.py with command-line argument support and flexible configuration
- Add performance benchmarking and comparison with other implementations

cd94aca1

02 Mar, 2025 2 commits

[Kernel] Implement different SEQ Q/KV examples with block sparse (#133) · 159af5df

Lei Wang authored Mar 02, 2025

* Change default log level from WARNING to INFO in TileLang initialization

* Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support

- Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
- Remove unused imports and simplify function signature
- Modify `flashattn` function to handle max sequence length as a separate argument
- Update kernel call to include max sequence length parameter
- Improve code readability and remove commented-out code
- Add print statement to confirm successful assertion

* Refactor code formatting in TileLang lowering and example files

- Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
- Simplify line breaks and reduce unnecessary whitespace
- Enhance code readability by adjusting indentation and line breaks
- Update example MHA forward pass script with cleaner tensor initialization

* Update TileLang kernel test with import path changes for MMA layout and macro generator

- Modify import statements in test_tilelang_kernel_dequantize_gemm.py
- Replace bitblas imports with tilelang.intrinsics imports for MMA-related utilities
- Update main function to use tilelang.testing.main()

* Add Block Sparse Attention Examples for TileLang and Triton

- Implement block sparse attention kernels for both TileLang and Triton
- Add utility functions for generating sparse attention masks using top-k and threshold methods
- Support causal and variable-length attention scenarios
- Include test cases for different sequence length configurations
- Demonstrate block-level sparse attention with configurable parameters

* Refactor Block Sparse Attention Examples with Code Style Improvements

- Improve code formatting in block_sparse_attn_tilelang.py and block_sparse_attn_triton.py
- Enhance readability by adjusting line breaks and indentation
- Simplify kernel and function calls with better formatting
- Add whitespace and line break improvements for better code clarity

159af5df

[Refactor] Set default log level from warning to info (#132) · 9ba96f19

Lei Wang authored Mar 02, 2025

* Change default log level from WARNING to INFO in TileLang initialization

* Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support

- Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
- Remove unused imports and simplify function signature
- Modify `flashattn` function to handle max sequence length as a separate argument
- Update kernel call to include max sequence length parameter
- Improve code readability and remove commented-out code
- Add print statement to confirm successful assertion

* Refactor code formatting in TileLang lowering and example files

- Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
- Simplify line breaks and reduce unnecessary whitespace
- Enhance code readability by adjusting indentation and line breaks
- Update example MHA forward pass script with cleaner tensor initialization

9ba96f19

28 Feb, 2025 4 commits

[Example] Implement FMHA Varlen Example (#131) · dd5d955c

Lei Wang authored Mar 01, 2025

* Add DeepSeek MLA decode example with Flash Attention implementation

* Add GEMM SplitK and StreamK example implementations

This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
- `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
- `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang

Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
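
For readers unfamiliar with Split-K, the idea can be shown without a GPU. The sketch below is plain Python rather than TileLang, and the function names are hypothetical: the reduction dimension K is cut into equal slices, each slice produces a partial product, and the partial products are summed into the output. On a GPU each slice would typically be handled by a separate thread block, with the accumulation done through atomic adds. Stream-K, which distributes the work differently, is not covered here.

```python
def matmul_slice(A, B, k_start, k_end):
    """Multiply A by B using only the reduction indices in [k_start, k_end)."""
    rows, cols = len(A), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(k_start, k_end))
             for j in range(cols)] for i in range(rows)]


def splitk_matmul(A, B, splits):
    """Compute A @ B by summing partial products over equal slices of K."""
    K = len(B)
    assert K % splits == 0, "this sketch assumes K divides evenly"
    chunk = K // splits
    C = [[0] * len(B[0]) for _ in range(len(A))]
    for s in range(splits):
        partial = matmul_slice(A, B, s * chunk, (s + 1) * chunk)
        for i, row in enumerate(partial):
            for j, value in enumerate(row):
                C[i][j] += value  # stands in for an atomic add on a GPU
    return C


A = [[1, 2, 3, 4], [5, 6, 7, 8]]
B = [[1, 0], [0, 1], [1, 0], [0, 1]]
reference = matmul_slice(A, B, 0, len(B))
result = splitk_matmul(A, B, splits=2)
```

Splitting along K exposes extra parallelism when the output matrix is too small to occupy the whole device, at the cost of the final accumulation step.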

* Refactor GEMM SplitK and StreamK example implementations

Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
- Remove unused import (Profiler) in splitk example
- Simplify line breaks and improve code readability
- Standardize indentation and remove unnecessary whitespace
- Optimize atomic add and copy operations for better clarity

* Add block sparse attention b...

dd5d955c

[Debug] Support `T.print` for `fragment` scope (#130) · d55386d1

Lei Wang authored Mar 01, 2025

* Add DeepSeek MLA decode example with Flash Attention implementation

* Add GEMM SplitK and StreamK example implementations

This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
- `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
- `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang

Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.

* Refactor GEMM SplitK and StreamK example implementations

Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
- Remove unused import (Profiler) in splitk example
- Simplify line breaks and improve code readability
- Standardize indentation and remove unnecessary whitespace
- Optimize atomic add and copy operations for better clarity

* Add block sparse attention benchmarks for multiple libraries

This commit introduces comprehensive block sparse attention benchmarks for different libraries:
- TileLang block sparse FMHA implementation
- Triton block sparse FMHA implementation
- PyTorch reference block sparse FMHA implementation
- FlashAttention dense FMHA reference implementation

The benchmarks include:
- Configurable benchmark parameters (batch size, heads, sequence length, etc.)
- Sparse mask generation using top-k and threshold methods
- Performance measurement for different sparse attention configurations
- Utility functions for mask generation and benchmarking
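
The two mask-generation methods named above can be sketched as follows. This is a pure-Python illustration over a small block-score matrix; the function names `topk_mask` and `threshold_mask` are hypothetical and do not come from the benchmark files.

```python
def topk_mask(scores, k):
    """Keep the k highest-scoring key blocks for each query block."""
    mask = []
    for row in scores:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        keep = set(order[:k])
        mask.append([j in keep for j in range(len(row))])
    return mask


def threshold_mask(scores, threshold):
    """Keep every block whose score exceeds the threshold."""
    return [[s > threshold for s in row] for row in scores]


scores = [[0.1, 0.9, 0.5], [0.7, 0.2, 0.3]]
mask_topk = topk_mask(scores, k=2)
mask_thr = threshold_mask(scores, threshold=0.4)
```

Top-k gives every row the same number of active blocks, while thresholding lets the sparsity vary per row.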

* Refactor block sparse attention benchmarks with code style improvements

- Add Ruff linter ignore comments to benchmark files
- Improve code formatting and line breaks
- Remove unused imports
- Standardize print statement formatting
- Enhance code readability across multiple library benchmarks

* lint fix

* Add CUDA atomic operations for BFLOAT16 and update function naming

- Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
- Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
- Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
- Update kernel and language customization to use new function names
- Add return type annotations in profiler module
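
For readers unfamiliar with packed BFLOAT16 pairs, the sketch below shows the underlying bit manipulation in plain Python. It is an illustration only, not the CUDA helper: the function names are hypothetical, the conversion uses simple truncation (hardware conversions typically round to nearest even), and placing the first value in the low 16 bits is an assumption.

```python
import struct


def float_to_bf16_bits(x):
    """Return the bfloat16 bit pattern of x: the upper 16 bits of its float32 form."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16  # truncation, not round-to-nearest-even


def pack_bf16x2(lo, hi):
    """Pack two bfloat16 values into one 32-bit word (lo in the low half)."""
    return float_to_bf16_bits(lo) | (float_to_bf16_bits(hi) << 16)


packed = pack_bf16x2(1.0, 2.0)
```

Packing two values into one word is what lets a single 32-bit atomic operation update a pair of BFLOAT16 elements.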

* lint fix

* Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang

This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
- Group Query Attention (GQA) implementation
- Flash Attention forward pass
- Performance benchmarking
- Configurable parameters for batch, heads, sequence length, and dimension
- Autotuning support
- Reference implementation comparison
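
The core indexing idea behind GQA is that several query heads share one key/value head. A minimal sketch, with the hypothetical helper name `kv_head_for` and the assumption that the query-head count is a multiple of the KV-head count:

```python
def kv_head_for(q_head, num_q_heads, num_kv_heads):
    """Map a query head index to the KV head its group shares."""
    assert num_q_heads % num_kv_heads == 0, "query heads must divide evenly into groups"
    group_size = num_q_heads // num_kv_heads
    return q_head // group_size


mapping = [kv_head_for(h, num_q_heads=8, num_kv_heads=2) for h in range(8)]
```

With 8 query heads and 2 KV heads, each KV head serves a group of 4 query heads, which shrinks the KV cache by the same factor.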

* Refactor IR lowering pipeline into modular phases

This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
- `LowerAndLegalize`: Handles initial IR legalization and transformation
- `OptimizeForTarget`: Applies target-specific optimizations

The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.
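
The two-phase structure can be pictured with a toy sketch in which a "module" is just a string and each pass is a function. The helper names `run_phase` and `lower` and the lambda passes are invented for illustration; they do not reflect the actual contents of `phase.py`.

```python
def run_phase(passes, module):
    """Apply a list of passes to the module in order."""
    for p in passes:
        module = p(module)
    return module


def lower(module, legalize_passes, target_passes):
    """Two-phase lowering: legalize first, then apply target-specific passes."""
    module = run_phase(legalize_passes, module)  # role played by LowerAndLegalize
    return run_phase(target_passes, module)      # role played by OptimizeForTarget


lowered = lower(
    "ir",
    legalize_passes=[lambda m: m + "|legalized"],
    target_passes=[lambda m: m + "|optimized"],
)
```

Keeping each phase as a reusable function means callers that only need legalization can stop after the first phase.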

* lintfix

* NSA kernel

* Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates

- Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
- Increased default number of heads and selected blocks in TileLang NSA example
- Added Ruff linter ignore comments to reference.py
- Standardized function signatures and improved code readability across NSA implementations

* Add utility math functions for integer operations

- Implement `next_power_of_2()` to calculate the next power of 2 for an integer
- Add `cdiv()` function for ceiling division of integers
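
A minimal sketch of what such helpers typically look like is shown below. The bodies are assumptions, not the repository's code; in particular, this sketch takes "next power of 2" to mean the smallest power of two greater than or equal to the input.

```python
def next_power_of_2(n):
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def cdiv(a, b):
    """Ceiling division for positive integers."""
    return (a + b - 1) // b


padded = next_power_of_2(5)
num_blocks = cdiv(7, 2)
```

Helpers like these are commonly used to pad tile sizes and to compute how many blocks are needed to cover a dimension.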

* Refactor DeepSeek MLA Decode Example with Enhanced Flash Attention Implementation

- Update flash attention kernel to support positional embeddings (PE)
- Modify reference implementation to handle PE and group query attention
- Increase default batch size and adjust benchmarking parameters
- Improve kernel performance and readability
- Add einops and torch operations for more flexible tensor manipulation

* Update README.md with corrected Flash MLA Decoding example path

- Modify the example link for Flash MLA Decoding to point to the correct directory
- Ensure accurate navigation to the DeepSeek MLA decoding example

* Refactor Native Sparse Attention Kernel and Improve Utility Functions

This commit introduces several improvements:
- Simplified native sparse attention kernel by inlining macro functions in example_tilelang_nsa.py
- Enhanced error handling in loop_partition.cc with more informative error messages
- Updated print.py to support multi-dimensional buffer printing
- Improved torch_assert_close in testing/__init__.py with more detailed mismatch reporting
- Changed default absolute tolerance in torch comparison from 1e-3 to 1e-2
- Added shape validation and detailed mismatch information in tensor comparison
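
The kind of comparison described above, with shape validation and a mismatch summary, can be sketched in pure Python over flat lists. The function name `assert_close`, the default `rtol`, and the message format are assumptions for illustration; only the 1e-2 absolute tolerance comes from the commit message.

```python
def assert_close(actual, expected, rtol=1e-2, atol=1e-2):
    """Raise AssertionError with details if any element is out of tolerance."""
    if len(actual) != len(expected):
        raise AssertionError(f"shape mismatch: {len(actual)} vs {len(expected)}")
    bad = [
        (i, a, e)
        for i, (a, e) in enumerate(zip(actual, expected))
        if abs(a - e) > atol + rtol * abs(e)  # same criterion as torch.allclose
    ]
    if bad:
        i, a, e = bad[0]
        raise AssertionError(
            f"{len(bad)}/{len(actual)} elements mismatched; "
            f"first at index {i}: {a} vs {e}"
        )


assert_close([1.0, 2.0], [1.005, 2.0])  # within tolerance, no error
```

Reporting the mismatch count and the first offending index makes a failing kernel test much easier to debug than a bare boolean.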

* Refactor Code Formatting and Improve Utility Functions

This commit introduces several code formatting and utility improvements:
- Add Ruff linter ignore comment in example_tilelang_nsa.py
- Enhance code readability in loop_partition.cc and lower_tile_op.cc with improved line breaks
- Simplify print_flat_buffer_with_condition in print.py
- Refactor torch_assert_close in testing/__init__.py with improved line formatting

* Enhance Buffer Printing Support for Fragment and Shared Memory Buffers

This commit improves the print functionality in print.py by:
- Adding support for printing fragment memory buffers
- Implementing a new print_fragment_buffer_with_condition macro
- Extending print_shared_buffer_with_condition for shared memory buffers
- Updating the generic print function to handle different buffer scopes

* Resolve merge conflict in print.py

Remove merge conflict marker and clean up whitespace in the print module

d55386d1

[Dev] Remove buffer flatten when debug print a shared buffer (#129) · 20bbb91a

Lei Wang authored Feb 28, 2025

* Add DeepSeek MLA decode example with Flash Attention implementation

* Add GEMM SplitK and StreamK example implementations

This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
- `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
- `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang

Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.

* Refactor GEMM SplitK and StreamK example implementations

Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
- Remove unused import (Profiler) in splitk example
- Simplify line breaks and improve code readability
- Standardize indentation and remove unnecessary whitespace
- Optimize atomic add and copy operations for better clarity

* Add block sparse attention benchmarks for multiple libraries

This commit introduces comprehensive block sparse attention benchmarks for different libraries:
- TileLang block sparse FMHA implementation
- Triton block sparse FMHA implementation
- PyTorch reference block sparse FMHA implementation
- FlashAttention dense FMHA reference implementation

The benchmarks include:
- Configurable benchmark parameters (batch size, heads, sequence length, etc.)
- Sparse mask generation using top-k and threshold methods
- Performance measurement for different sparse attention configurations
- Utility functions for mask generation and benchmarking

* Refactor block sparse attention benchmarks with code style improvements

- Add Ruff linter ignore comments to benchmark files
- Improve code formatting and line breaks
- Remove unused imports
- Standardize print statement formatting
- Enhance code readability across multiple library benchmarks

* lint fix

* Add CUDA atomic operations for BFLOAT16 and update function naming

- Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
- Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
- Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
- Update kernel and language customization to use new function names
- Add return type annotations in profiler module

* lint fix

* Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang

This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
- Group Query Attention (GQA) implementation
- Flash Attention forward pass
- Performance benchmarking
- Configurable parameters for batch, heads, sequence length, and dimension
- Autotuning support
- Reference implementation comparison

* Refactor IR lowering pipeline into modular phases

This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
- `LowerAndLegalize`: Handles initial IR legalization and transformation
- `OptimizeForTarget`: Applies target-specific optimizations

The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.

* lintfix

* NSA kernel

* Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates

- Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
- Increased default number of heads and selected blocks in TileLang NSA example
- Added Ruff linter ignore comments to reference.py
- Standardized function signatures and improved code readability across NSA implementations

* Add utility math functions for integer operations

- Implement `next_power_of_2()` to calculate the next power of 2 for an integer
- Add `cdiv()` function for ceiling division of integers

* Refactor DeepSeek MLA Decode Example with Enhanced Flash Attention Implementation

- Update flash attention kernel to support positional embeddings (PE)
- Modify reference implementation to handle PE and group query attention
- Increase default batch size and adjust benchmarking parameters
- Improve kernel performance and readability
- Add einops and torch operations for more flexible tensor manipulation

* Update README.md with corrected Flash MLA Decoding example path

- Modify the example link for Flash MLA Decoding to point to the correct directory
- Ensure accurate navigation to the DeepSeek MLA decoding example

* Refactor Native Sparse Attention Kernel and Improve Utility Functions

This commit introduces several improvements:
- Simplified native sparse attention kernel by inlining macro functions in example_tilelang_nsa.py
- Enhanced error handling in loop_partition.cc with more informative error messages
- Updated print.py to support multi-dimensional buffer printing
- Improved torch_assert_close in testing/__init__.py with more detailed mismatch reporting
- Changed default absolute tolerance in torch comparison from 1e-3 to 1e-2
- Added shape validation and detailed mismatch information in tensor comparison

* Refactor Code Formatting and Improve Utility Functions

This commit introduces several code formatting and utility improvements:
- Add Ruff linter ignore comment in example_tilelang_nsa.py
- Enhance code readability in loop_partition.cc and lower_tile_op.cc with improved line breaks
- Simplify print_flat_buffer_with_condition in print.py
- Refactor torch_assert_close in testing/__init__.py with improved line formatting

20bbb91a

[Dev][Bugfix] Fix bug in ThreadTagChecker; Add WgmmaSync rewriter and add MHA... · 0d873fcf

Yu Cheng authored Feb 28, 2025

[Dev][Bugfix] Fix bug in ThreadTagChecker; Add WgmmaSync rewriter and add MHA WGMMA pipelined example (#128)

* [Dev] Add RetNet Linear Attention example

* [Dev] Add WgmmaSync rewriter for pipelined WGMMA operations and add MHA WGMMA pipelined example (FA3-like scheduling)

This commit introduces a new transformation pass `RewriteWgmmaSync` to optimize warp group matrix multiply accumulate (WGMMA) operations in the TileLang compiler:

- Implemented `WgmmaSyncRewriter` in `src/transform/wgmma_sync_rewriter.cc`
- Added pass registration for `RewriteWgmmaSync`
- Updated `tilelang/engine/phase.py` to include the new transformation pass
- Updated `tilelang/transform/__init__.py` to expose the new pass

The rewriter intelligently manages synchronization and dependencies between WGMMA operations, improving pipeline efficiency for complex matrix multiplication kernels.

* [Bugfix] Fix bug in ThreadTagChecker for warp specialization

Improve thread tag validation in warp specialized rewriter to prevent unintended transformations:
- Add more precise checks for threadIdx.y and threadIdx.z
- Validate thread extent to ensure only single-extent thread bindings are allowed
- Prevent warp specialization for multi-extent thread bindings in y and z dimensions
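
The validation rule described in these bullets can be sketched as a small predicate. This is a Python illustration, not the C++ checker: the function name `can_warp_specialize` and the dictionary representation of thread bindings are assumptions.

```python
def can_warp_specialize(thread_extents):
    """thread_extents maps thread tags to extents, e.g. {"threadIdx.x": 128}.

    Warp specialization is allowed only if threadIdx.y and threadIdx.z
    are absent or bound with extent 1.
    """
    for tag in ("threadIdx.y", "threadIdx.z"):
        if thread_extents.get(tag, 1) != 1:
            return False
    return True


ok = can_warp_specialize({"threadIdx.x": 128, "threadIdx.y": 1})
rejected = can_warp_specialize({"threadIdx.x": 64, "threadIdx.y": 2})
```

Allowing extent-1 bindings in y and z keeps kernels that declare those dimensions only nominally from being excluded.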

* lint

* [CI] Add TMA descriptor attribute to transformed module in test case

0d873fcf

27 Feb, 2025 1 commit

[JIT] Enhance cython/ctypes wrapper for tma descriptor (#126) · 7b74bb01

Lei Wang authored Feb 27, 2025



* refactor code

* enhance tutorial

* Enhance error handling and code generation in CUDA and TileLang components

This commit introduces several improvements across multiple files:
- Added more informative error messages in GEMM layout checks
- Updated CUDA codegen to support more flexible function signature generation
- Improved TMA descriptor initialization and kernel dispatch logic
- Refined library generation and source code parsing utilities
- Enhanced error handling in various adapter and wrapper classes

* Add thread tag validation for warp specialization

Introduce a ThreadTagChecker to validate that a PrimFunc only uses threadIdx.x before applying warp specialization. This prevents unintended transformations on kernels with complex thread binding and provides a clear warning to users about potential issues with warp specialization.

* Update TileLang Profiling and Compilation in Flash Decoding Examples

Refactor the profiling and compilation workflow in two flash decoding example scripts:
- Replace `tilelang.lower()` and `tilelang.Profiler()` with `tilelang.compile()`
- Simplify profiler initialization using `get_profiler()`
- Update method calls to use the new profiler and compiled kernel objects
- Maintain existing performance benchmarking and validation logic

* Refactor and clean up code formatting in TileLang testing and adapter modules

This commit includes several code style and formatting improvements:
- Adjust whitespace and line breaks in test files
- Improve code formatting in CUDA source wrapper and adapter utilities
- Enhance readability of function calls and argument handling
- Remove unnecessary whitespace and standardize indentation
- Simplify function signatures and argument parsing

* Refactor CUDA codegen and improve code formatting

This commit includes several improvements to CUDA code generation and formatting:
- Enhance function signature generation in CodeGenTileLangCUDA
- Improve code formatting and readability in CUDA-related files
- Simplify parameter handling and type annotations
- Clean up whitespace and line breaks in codegen and layout files

---------
Co-authored-by: Ubuntu <dlisuser@h100testl730RPS.xu5snccwrbtejcqqalluoku5hb.xx.internal.cloudapp.net>

7b74bb01

26 Feb, 2025 3 commits

[Dev] Add RetNet Linear Attention example (#124) · ba311311
Yu Cheng authored Feb 26, 2025

ba311311

[Example] Update GEMM FP8 Example (#123) · 13f4b5c6

Lei Wang authored Feb 26, 2025

* Add DeepSeek MLA decode example with Flash Attention implementation

* Add GEMM SplitK and StreamK example implementations

This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
- `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
- `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang

Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.

* Refactor GEMM SplitK and StreamK example implementations

Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
- Remove unused import (Profiler) in splitk example
- Simplify line breaks and improve code readability
- Standardize indentation and remove unnecessary whitespace
- Optimize atomic add and copy operations for better clarity

* Add block sparse attention benchmarks for multiple libraries

This commit introduces comprehensive block sparse attention benchmarks for different libraries:
- TileLang block sparse FMHA implementation
- Triton block sparse FMHA implementation
- PyTorch reference block sparse FMHA implementation
- FlashAttention dense FMHA reference implementation

The benchmarks include:
- Configurable benchmark parameters (batch size, heads, sequence length, etc.)
- Sparse mask generation using top-k and threshold methods
- Performance measurement for different sparse attention configurations
- Utility functions for mask generation and benchmarking

* Refactor block sparse attention benchmarks with code style improvements

- Add Ruff linter ignore comments to benchmark files
- Improve code formatting and line breaks
- Remove unused imports
- Standardize print statement formatting
- Enhance code readability across multiple library benchmarks

* lint fix

* Add CUDA atomic operations for BFLOAT16 and update function naming

- Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
- Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
- Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
- Update kernel and language customization to use new function names
- Add return type annotations in profiler module

* lint fix

* Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang

This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
- Group Query Attention (GQA) implementation
- Flash Attention forward pass
- Performance benchmarking
- Configurable parameters for batch, heads, sequence length, and dimension
- Autotuning support
- Reference implementation comparison

* Refactor IR lowering pipeline into modular phases

This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
- `LowerAndLegalize`: Handles initial IR legalization and transformation
- `OptimizeForTarget`: Applies target-specific optimizations

The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.

* lintfix

* NSA kernel

* Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates

- Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
- Increased default number of heads and selected blocks in TileLang NSA example
- Added Ruff linter ignore comments to reference.py
- Standardized function signatures and improved code readability across NSA implementations

* Add utility math functions for integer operations

- Implement `next_power_of_2()` to calculate the next power of 2 for an integer
- Add `cdiv()` function for ceiling division of integers

* Refactor DeepSeek MLA Decode Example with Enhanced Flash Attention Implementation

- Update flash attention kernel to support positional embeddings (PE)
- Modify reference implementation to handle PE and group query attention
- Increase default batch size and adjust benchmarking parameters
- Improve kernel performance and readability
- Add einops and torch operations for more flexible tensor manipulation

* Update README.md with corrected Flash MLA Decoding example path

- Modify the example link for Flash MLA Decoding to point to the correct directory
- Ensure accurate navigation to the DeepSeek MLA decoding example

13f4b5c6

Update README.md with new example links for Flash MLA Decoding and Native Sparse Attention (#122) · f1fcfe34
Yu Cheng authored Feb 26, 2025

f1fcfe34

25 Feb, 2025 4 commits

[Example] Implement TileLang Native Sparse Attention Kernel (#121) · 3cbf8cbc

Lei Wang authored Feb 26, 2025

* Add DeepSeek MLA decode example with Flash Attention implementation

* Add GEMM SplitK and StreamK example implementations

This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
- `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
- `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang

Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.

* Refactor GEMM SplitK and StreamK example implementations

Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
- Remove unused import (Profiler) in splitk example
- Simplify line breaks and improve code readability
- Standardize indentation and remove unnecessary whitespace
- Optimize atomic add and copy operations for better clarity

* Add block sparse attention benchmarks for multiple libraries

This commit introduces comprehensive block sparse attention benchmarks for different libraries:
- TileLang block sparse FMHA implementation
- Triton block sparse FMHA implementation
- PyTorch reference block sparse FMHA implementation
- FlashAttention dense FMHA reference implementation

The benchmarks include:
- Configurable benchmark parameters (batch size, heads, sequence length, etc.)
- Sparse mask generation using top-k and threshold methods
- Performance measurement for different sparse attention configurations
- Utility functions for mask generation and benchmarking

* Refactor block sparse attention benchmarks with code style improvements

- Add Ruff linter ignore comments to benchmark files
- Improve code formatting and line breaks
- Remove unused imports
- Standardize print statement formatting
- Enhance code readability across multiple library benchmarks

* lint fix

* Add CUDA atomic operations for BFLOAT16 and update function naming

- Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
- Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
- Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
- Update kernel and language customization to use new function names
- Add return type annotations in profiler module

* lint fix

* Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang

This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
- Group Query Attention (GQA) implementation
- Flash Attention forward pass
- Performance benchmarking
- Configurable parameters for batch, heads, sequence length, and dimension
- Autotuning support
- Reference implementation comparison

* Refactor IR lowering pipeline into modular phases

This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
- `LowerAndLegalize`: Handles initial IR legalization and transformation
- `OptimizeForTarget`: Applies target-specific optimizations

The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.

* lintfix

* NSA kernel

* Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates

- Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
- Increased default number of heads and selected blocks in TileLang NSA example
- Added Ruff linter ignore comments to reference.py
- Standardized function signatures and improved code readability across NSA implementations

* Add utility math functions for integer operations

- Implement `next_power_of_2()` to calculate the next power of 2 for an integer
- Add `cdiv()` function for ceiling division of integers

3cbf8cbc

[Example] Add GQA Example (#118) · 2b97e98a

Lei Wang authored Feb 26, 2025

* Add DeepSeek MLA decode example with Flash Attention implementation

* Add GEMM SplitK and StreamK example implementations

This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
- `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
- `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang

Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.

* Refactor GEMM SplitK and StreamK example implementations

Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
- Remove unused import (Profiler) in splitk example
- Simplify line breaks and improve code readability
- Standardize indentation and remove unnecessary whitespace
- Optimize atomic add and copy operations for better clarity

* Add block sparse attention benchmarks for multiple libraries

This commit introduces comprehensive block sparse attention benchmarks for different libraries:
- TileLang block sparse FMHA implementation
- Triton block sparse FMHA implementation
- PyTorch reference block sparse FMHA implementation
- FlashAttention dense FMHA reference implementation

The benchmarks include:
- Configurable benchmark parameters (batch size, heads, sequence length, etc.)
- Sparse mask generation using top-k and threshold methods
- Performance measurement for different sparse attention configurations
- Utility functions for mask generation and benchmarking

* Refactor block sparse attention benchmarks with code style improvements

- Add Ruff linter ignore comments to benchmark files
- Improve code formatting and line breaks
- Remove unused imports
- Standardize print statement formatting
- Enhance code readability across multiple library benchmarks

* lint fix

* Add CUDA atomic operations for BFLOAT16 and update function naming

- Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
- Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
- Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
- Update kernel and language customization to use new function names
- Add return type annotations in profiler module

* lint fix

* Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang

This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
- Group Query Attention (GQA) implementation
- Flash Attention forward pass
- Performance benchmarking
- Configurable parameters for batch, heads, sequence length, and dimension
- Autotuning support
- Reference implementation comparison

* Refactor IR lowering pipeline into modular phases

This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
- `LowerAndLegalize`: Handles initial IR legalization and transformation
- `OptimizeForTarget`: Applies target-specific optimizations

The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.

* lintfix

2b97e98a

[Dev] Update MLA decode kernel (#120) · b7ca76f1
Yu Cheng authored Feb 26, 2025

b7ca76f1

[Bugfix] Bugfix of pass order for hopper (#117) · 524991fe
Yu Cheng authored Feb 25, 2025

524991fe