"src/include/threadwise_2d_tensor_op.hip.hpp" did not exist on "5d2cafcb24097f86d33f7c5243a3c0f3800854ec"
    [Example] Implement TileLang Native Sparse Attention Kernel (#121) · 3cbf8cbc
    Lei Wang authored
    * Add DeepSeek MLA decode example with Flash Attention implementation
    
    * Add GEMM SplitK and StreamK example implementations
    
    This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
    - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
    - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
    
    Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
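    
    Neither script is reproduced here, but the core Split-K idea is easy to state: the K (reduction) dimension is partitioned across workers, each worker computes a partial product over its K-slice, and the partials are summed into the output (the kernel folds them in with atomic adds). A minimal PyTorch sketch of that reduction semantics, not the TileLang kernel itself:
    
    ```python
    import torch
    
    def splitk_gemm_reference(A: torch.Tensor, B: torch.Tensor, split_k: int = 4) -> torch.Tensor:
        """Emulate Split-K: each loop iteration stands in for one parallel worker."""
        M, K = A.shape
        _, N = B.shape
        assert K % split_k == 0
        C = torch.zeros(M, N, dtype=torch.float32)  # fp32 accumulator, as GEMM kernels use
        chunk = K // split_k
        for s in range(split_k):
            ks = slice(s * chunk, (s + 1) * chunk)
            # Partial product over one K-slice; the real kernel merges these
            # partials into C with atomic adds instead of a serial loop.
            C += A[:, ks].float() @ B[ks, :].float()
        return C
    
    A = torch.randn(128, 256, dtype=torch.float16)
    B = torch.randn(256, 64, dtype=torch.float16)
    torch.testing.assert_close(splitk_gemm_reference(A, B), A.float() @ B.float(), rtol=1e-3, atol=1e-3)
    ```
    
    Stream-K differs in how work is assigned (roughly, output tiles and K-slices are flattened into one work stream balanced across SMs), but the partial-sum-then-reduce structure is the same.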
    
    * Refactor GEMM SplitK and StreamK example implementations
    
    Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
    - Remove unused import (Profiler) in splitk example
    - Simplify line breaks and improve code readability
    - Standardize indentation and remove unnecessary whitespace
    - Optimize atomic add and copy operations for better clarity
    
    * Add block sparse attention benchmarks for multiple libraries
    
    This commit introduces comprehensive block sparse attention benchmarks for different libraries:
    - TileLang block sparse FMHA implementation
    - Triton block sparse FMHA implementation
    - PyTorch reference block sparse FMHA implementation
    - FlashAttention dense FMHA reference implementation
    
    The benchmarks include:
    - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
    - Sparse mask generation using top-k and threshold methods (a minimal sketch follows this list)
    - Performance measurement for different sparse attention configurations
    - Utility functions for mask generation and benchmarking
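    
    To make the mask-generation item concrete, here is a hedged sketch of the top-k and threshold approaches; the helper names and block granularity are illustrative, not the benchmarks' actual functions:
    
    ```python
    import torch
    
    def topk_block_mask(block_scores: torch.Tensor, top_count: int) -> torch.Tensor:
        """block_scores: [batch, heads, q_blocks, kv_blocks] importance estimates.
        Keep the top_count highest-scoring KV blocks per query block."""
        idx = block_scores.topk(top_count, dim=-1).indices
        mask = torch.zeros_like(block_scores, dtype=torch.bool)
        return mask.scatter(-1, idx, True)
    
    def threshold_block_mask(block_scores: torch.Tensor, threshold: float) -> torch.Tensor:
        """Alternative: keep every block whose score clears a fixed threshold."""
        return block_scores >= threshold
    
    scores = torch.rand(1, 8, 16, 16)            # e.g. pooled attention logits
    mask = topk_block_mask(scores, top_count=4)  # keep 4 of 16 KV blocks per query block
    assert (mask.sum(-1) == 4).all()
    ```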
    
    * Refactor block sparse attention benchmarks with code style improvements
    
    - Add Ruff linter ignore comments to benchmark files
    - Improve code formatting and line breaks
    - Remove unused imports
    - Standardize print statement formatting
    - Enhance code readability across multiple library benchmarks
    
    * lint fix
    
    * Add CUDA atomic operations for BFLOAT16 and update function naming
    
    - Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
    - Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
    - Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values (its packing semantics are sketched after this list)
    - Update kernel and language customization to use new function names
    - Add return type annotations in profiler module
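    
    For orientation, the packing step can be mirrored on the host. This is a hedged PyTorch sketch of what `__pack_nv_bfloat162` is assumed to do (two 16-bit bfloat16 payloads packed into one 32-bit word so a single 32-bit atomic can update both), not the CUDA intrinsic itself:
    
    ```python
    import torch
    
    def pack_bfloat16x2(lo: float, hi: float) -> int:
        """Pack two bfloat16 values into one 32-bit word. Placing the first
        element in the low 16 bits is an assumption about the device layout."""
        bits = torch.tensor([lo, hi], dtype=torch.bfloat16).view(torch.int16)
        lo_bits = int(bits[0]) & 0xFFFF
        hi_bits = int(bits[1]) & 0xFFFF
        return (hi_bits << 16) | lo_bits
    
    print(hex(pack_bfloat16x2(1.5, -2.0)))  # 0xc0003fc0
    ```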
    
    * lint fix
    
    * Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang
    
    This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
    - Group Query Attention (GQA) implementation
    - Flash Attention forward pass
    - Performance benchmarking
    - Configurable parameters for batch, heads, sequence length, and dimension
    - Autotuning support
    - Reference implementation comparison (a minimal GQA reference is sketched after this list)
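    
    A minimal reference for the GQA pattern in BSHD layout (as the file name suggests): each group of query heads shares one KV head, so expanding KV heads with repeat_interleave lets a standard attention reference be reused. This is a sketch, not the example's own reference code:
    
    ```python
    import math
    import torch
    
    def gqa_reference(Q, K, V):
        # Q: [batch, seq, q_heads, dim]; K, V: [batch, seq, kv_heads, dim]
        groups = Q.shape[2] // K.shape[2]
        K = K.repeat_interleave(groups, dim=2)  # share each KV head across its query group
        V = V.repeat_interleave(groups, dim=2)
        q, k, v = (t.transpose(1, 2) for t in (Q, K, V))  # -> [batch, heads, seq, dim]
        scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])
        return (scores.softmax(dim=-1) @ v).transpose(1, 2)  # back to BSHD
    
    Q = torch.randn(1, 128, 32, 64)      # 32 query heads
    K = V = torch.randn(1, 128, 8, 64)   # 8 KV heads -> group size 4
    print(gqa_reference(Q, K, V).shape)  # torch.Size([1, 128, 32, 64])
    ```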
    
    * Refactor IR lowering pipeline into modular phases
    
    This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
    - `LowerAndLegalize`: Handles initial IR legalization and transformation
    - `OptimizeForTarget`: Applies target-specific optimizations
    
    The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.
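    
    The shape of the split, sketched with placeholder pass lists (the real transforms live in `phase.py` and are not reproduced here):
    
    ```python
    def legalize_passes(target):
        # Placeholder: phase-1 passes would legalize layouts, flatten buffers, etc.
        return [lambda mod: mod]
    
    def optimize_passes(target):
        # Placeholder: phase-2 passes would apply pipelining, vectorization, etc.
        return [lambda mod: mod]
    
    def LowerAndLegalize(mod, target):
        for p in legalize_passes(target):
            mod = p(mod)
        return mod
    
    def OptimizeForTarget(mod, target):
        for p in optimize_passes(target):
            mod = p(mod)
        return mod
    
    def lower(mod, target):
        # Callers now compose two reusable phases instead of one monolithic pipeline.
        return OptimizeForTarget(LowerAndLegalize(mod, target), target)
    ```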
    
    * lint fix
    
    * NSA kernel
    
    * Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates
    
    - Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
    - Increased default number of heads and selected blocks in TileLang NSA example
    - Added Ruff linter ignore comments to reference.py
    - Standardized function signatures and improved code readability across NSA implementations
    
    * Add utility math functions for integer operations
    
    - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
    - Add `cdiv()` function for ceiling division of integers (both helpers are sketched below)
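    
    A plausible shape for the two helpers (the committed implementations may differ in edge-case handling):
    
    ```python
    def next_power_of_2(n: int) -> int:
        """Smallest power of two >= n, for n >= 1."""
        return 1 << (n - 1).bit_length()
    
    def cdiv(a: int, b: int) -> int:
        """Ceiling division: cdiv(10, 3) == 4."""
        return (a + b - 1) // b
    
    assert next_power_of_2(1) == 1 and next_power_of_2(17) == 32
    assert cdiv(10, 3) == 4 and cdiv(9, 3) == 3
    ```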