1. 03 Mar, 2025 2 commits
    • [Doc] Update MLA Documentation (#135) · b70683b3
      Yu Cheng authored
    • [Dev][Doc] Add DeepSeek MLA Decode Example with Documentation and Performance Benchmarks (#134) · cd94aca1
      Yu Cheng authored
      * [Dev] Add RetNet Linear Attention example
      
      * [Dev] Add WgmmaSync rewriter for pipelined WGMMA operations and add MHA WGMMA pipelined example (FA3-like scheduling)
      
      This commit introduces a new transformation pass `RewriteWgmmaSync` to optimize warp group matrix multiply accumulate (WGMMA) operations in the TileLang compiler:
      
      - Implemented `WgmmaSyncRewriter` in `src/transform/wgmma_sync_rewriter.cc`
      - Added pass registration for `RewriteWgmmaSync`
      - Updated `tilelang/engine/phase.py` to include the new transformation pass
      - Updated `tilelang/transform/__init__.py` to expose the new pass
      
      The rewriter tracks dependencies between WGMMA operations and inserts the synchronization they require, improving pipeline efficiency for complex matrix multiplication kernels.
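
      A rough illustration of how a TVM-style pass like this is applied in a pipeline (a minimal sketch; the exact tilelang signatures are assumptions):

      ```python
      # Minimal sketch of applying a TVM-style transform such as RewriteWgmmaSync.
      # The pass name follows this commit; the precise tilelang API is an assumption.
      import tvm
      import tilelang.transform

      def optimize(mod: tvm.IRModule) -> tvm.IRModule:
          pipeline = tvm.transform.Sequential([
              # Reorders/inserts synchronization around pipelined WGMMA ops.
              tilelang.transform.RewriteWgmmaSync(),
          ])
          return pipeline(mod)
      ```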
      
      * [Bugfix] Fix bug in ThreadTagChecker for warp specialization
      
      Improve thread tag validation in the warp-specialized rewriter to prevent unintended transformations:
      - Add more precise checks for threadIdx.y and threadIdx.z
      - Validate thread extent to ensure only single-extent thread bindings are allowed
      - Prevent warp specialization for multi-extent thread bindings in y and z dimensions
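
      The rule reduces to a simple predicate; a standalone Python sketch of the check described above (the real implementation is the C++ ThreadTagChecker):

      ```python
      # Standalone sketch, not the actual C++ ThreadTagChecker: warp
      # specialization is only applied when every threadIdx.y / threadIdx.z
      # binding has extent 1.
      def allows_warp_specialization(thread_extents: dict) -> bool:
          """thread_extents maps a thread tag, e.g. 'threadIdx.y', to its extent."""
          return all(thread_extents.get(tag, 1) == 1
                     for tag in ("threadIdx.y", "threadIdx.z"))

      assert allows_warp_specialization({"threadIdx.x": 128})
      assert not allows_warp_specialization({"threadIdx.x": 128, "threadIdx.y": 2})
      ```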
      
      * lint
      
      * [CI] Add TMA descriptor attribute to transformed module in test case
      
      * [Dev] Refactor DeepSeek MLA Decode Example with Non-Split and Split Flash Attention Implementations
      
      - Add new `flash_attn` macro for non-split flash attention implementation
      - Add a swizzled layout for tiles in shared memory
      - Use threadblock swizzling to improve the L2 cache hit rate (see the sketch below)
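
      For reference, the common grouped threadblock swizzle looks like this (a generic sketch, not the example's exact code):

      ```python
      # Generic grouped swizzle: remap a linear block id to (m, n) tile
      # coordinates so that consecutive blocks reuse the same rows of A,
      # raising the L2 hit rate.
      def swizzle_block_id(bid: int, grid_m: int, grid_n: int, group_m: int = 8):
          width = group_m * grid_n
          group_start = (bid // width) * group_m
          group_size = min(grid_m - group_start, group_m)  # last group may be short
          m = group_start + (bid % width) % group_size
          n = (bid % width) // group_size
          return m, n
      ```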
      
      * [Dev] Add DeepSeek MLA Decode Example with Documentation and Performance Benchmarks
      
      - Add detailed README.md explaining MLA (Multi-Head Latent Attention) implementation
      - Include performance benchmark images for batch sizes 64 and 128
      - Add layout visualization images for QK and PV operations
      - Add torch reference implementations in torch_refs.py
      - Update example_mla_decode.py with command-line argument support and flexible configuration
      - Add performance benchmarking and comparison with other implementations
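
      The command-line surface is roughly as follows (flag names and defaults here are illustrative assumptions, not necessarily the script's exact interface):

      ```python
      # Illustrative argparse setup for an MLA decode example; the flag names
      # and defaults are assumptions, not the script's exact interface.
      import argparse

      parser = argparse.ArgumentParser(description="DeepSeek MLA decode example")
      parser.add_argument("--batch", type=int, default=64, help="batch size")
      parser.add_argument("--heads", type=int, default=128, help="number of query heads")
      parser.add_argument("--kv_ctx", type=int, default=8192, help="KV context length")
      parser.add_argument("--dim", type=int, default=512, help="latent head dimension")
      args = parser.parse_args()
      ```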
  2. 26 Feb, 2025 1 commit
    • [Example] Update GEMM FP8 Example (#123) · 13f4b5c6
      Lei Wang authored
      * Add DeepSeek MLA decode example with Flash Attention implementation
      
      * Add GEMM SplitK and StreamK example implementations
      
      This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
      - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
      - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
      
      Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
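
      Conceptually, Split-K partitions the reduction dimension across blocks and then combines the partial results; a PyTorch illustration of the strategy (not the TileLang kernel itself):

      ```python
      # Split-K in miniature: each split computes a partial product over a
      # slice of K; the partials are then summed (in a kernel this final step
      # is an atomic add or a separate reduction pass).
      import torch

      def splitk_matmul(a: torch.Tensor, b: torch.Tensor, split_k: int = 4) -> torch.Tensor:
          k_slices = torch.chunk(torch.arange(a.shape[1]), split_k)
          return sum(a[:, idx] @ b[idx, :] for idx in k_slices)

      a, b = torch.randn(128, 256), torch.randn(256, 64)
      torch.testing.assert_close(splitk_matmul(a, b), a @ b, rtol=1e-4, atol=1e-4)
      ```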
      
      * Refactor GEMM SplitK and StreamK example implementations
      
      Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
      - Remove unused import (Profiler) in splitk example
      - Simplify line breaks and improve code readability
      - Standardize indentation and remove unnecessary whitespace
      - Optimize atomic add and copy operations for better clarity
      
      * Add block sparse attention benchmarks for multiple libraries
      
      This commit introduces comprehensive block sparse attention benchmarks for different libraries:
      - TileLang block sparse FMHA implementation
      - Triton block sparse FMHA implementation
      - PyTorch reference block sparse FMHA implementation
      - FlashAttention dense FMHA reference implementation
      
      The benchmarks include:
      - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
      - Sparse mask generation using top-k and threshold methods
      - Performance measurement for different sparse attention configurations
      - Utility functions for mask generation and benchmarking
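
      A sketch of the top-k mask generation described above (shapes and names are illustrative):

      ```python
      # Sketch of top-k block-mask generation; shapes and names are
      # illustrative, not the benchmark's exact code.
      import torch

      def topk_block_mask(block_scores: torch.Tensor, k: int) -> torch.Tensor:
          """block_scores: [batch, heads, q_blocks, kv_blocks] importance scores.
          Keeps the k highest-scoring KV blocks per query block."""
          idx = block_scores.topk(k, dim=-1).indices
          mask = torch.zeros_like(block_scores, dtype=torch.bool)
          return mask.scatter(-1, idx, True)

      mask = topk_block_mask(torch.rand(1, 8, 16, 16), k=4)
      assert mask.sum(-1).eq(4).all()
      ```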
      
      * Refactor block sparse attention benchmarks with code style improvements
      
      - Add Ruff linter ignore comments to benchmark files
      - Improve code formatting and line breaks
      - Remove unused imports
      - Standardize print statement formatting
      - Enhance code readability across multiple library benchmarks
      
      * lint fix
      
      * Add CUDA atomic operations for BFLOAT16 and update function naming
      
      - Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
      - Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
      - Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
      - Update kernel and language customization to use new function names
      - Add return type annotations in profiler module
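
      To illustrate the bit layout behind `__pack_nv_bfloat162` (the real helper is a CUDA device function; this NumPy sketch truncates where the intrinsic may round):

      ```python
      # bfloat16 is the high 16 bits of an IEEE float32; packing two of them
      # yields one 32-bit word. Truncating sketch of what __pack_nv_bfloat162
      # does on device (the CUDA helper may round-to-nearest-even instead).
      import numpy as np

      def bf16_bits(x: float) -> int:
          return int(np.float32(x).view(np.uint32)) >> 16

      def pack_bfloat162(lo: float, hi: float) -> int:
          return (bf16_bits(hi) << 16) | bf16_bits(lo)

      assert pack_bfloat162(1.0, 2.0) == 0x40003F80
      ```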
      
      * lint fix
      
      * Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang
      
      This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
      - Group Query Attention (GQA) implementation
      - Flash Attention forward pass
      - Performance benchmarking
      - Configurable parameters for batch, heads, sequence length, and dimension
      - Autotuning support
      - Reference implementation comparison
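
      A hedged reference sketch of GQA in BSHD layout (illustrative, not the example's exact code):

      ```python
      # GQA reference sketch: KV heads are shared across groups of query
      # heads, so repeat them before standard attention.
      import torch
      import torch.nn.functional as F

      def gqa_ref(q, k, v):  # q: [B, S, HQ, D]; k, v: [B, S, HKV, D]
          groups = q.shape[2] // k.shape[2]
          k = k.repeat_interleave(groups, dim=2)
          v = v.repeat_interleave(groups, dim=2)
          q, k, v = (t.transpose(1, 2) for t in (q, k, v))  # BSHD -> BHSD
          o = F.scaled_dot_product_attention(q, k, v, is_causal=True)
          return o.transpose(1, 2)  # back to BSHD
      ```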
      
      * Refactor IR lowering pipeline into modular phases
      
      This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
      - `LowerAndLegalize`: Handles initial IR legalization and transformation
      - `OptimizeForTarget`: Applies target-specific optimizations
      
      The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.
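
      The resulting structure is roughly as follows (function names follow the commit; the exact signatures are assumptions):

      ```python
      # Sketch of the two-phase lowering structure; names follow the commit,
      # signatures are assumptions.
      import tvm
      from tilelang.engine.phase import LowerAndLegalize, OptimizeForTarget

      def lower(mod: tvm.IRModule, target: tvm.target.Target) -> tvm.IRModule:
          mod = LowerAndLegalize(mod, target)   # phase 1: legalize the frontend IR
          mod = OptimizeForTarget(mod, target)  # phase 2: target-specific passes
          return mod
      ```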
      
      * lint fix
      
      * NSA kernel
      
      * Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates
      
      - Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
      - Increased default number of heads and selected blocks in TileLang NSA example
      - Added Ruff linter ignore comments to reference.py
      - Standardized function signatures and improved code readability across NSA implementations
      
      * Add utility math functions for integer operations
      
      - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
      - Add `cdiv()` function for ceiling division of integers
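
      Both helpers are small enough to restate; straightforward implementations matching the described behavior:

      ```python
      def next_power_of_2(n: int) -> int:
          """Smallest power of 2 >= n, for n >= 1."""
          return 1 << (n - 1).bit_length()

      def cdiv(a: int, b: int) -> int:
          """Ceiling division of a by b."""
          return (a + b - 1) // b

      assert next_power_of_2(5) == 8
      assert cdiv(10, 3) == 4
      ```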
      
      * Refactor DeepSeek MLA Decode Example with Enhanced Flash Attention Implementation
      
      - Update flash attention kernel to support positional embeddings (PE)
      - Modify reference implementation to handle PE and group query attention
      - Increase default batch size and adjust benchmarking parameters
      - Improve kernel performance and readability
      - Add einops and torch operations for more flexible tensor manipulation
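
      A hedged sketch of the PE split in MLA-style attention scores (shapes and names are illustrative, not the example's exact code):

      ```python
      # Scores combine a non-positional ("nope") term and a positional ("pe")
      # term computed against the rotary part of the keys.
      import torch

      def mla_scores(q_nope, q_pe, kv, k_pe, scale):
          # q_nope: [B, H, D], kv: [B, S, D]; q_pe: [B, H, Dr], k_pe: [B, S, Dr]
          s = torch.einsum("bhd,bsd->bhs", q_nope, kv)
          s = s + torch.einsum("bhr,bsr->bhs", q_pe, k_pe)
          return s * scale
      ```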
      
      * Update README.md with corrected Flash MLA Decoding example path
      
      - Modify the example link for Flash MLA Decoding to point to the correct directory
      - Ensure accurate navigation to the DeepSeek MLA decoding example
  3. 23 Feb, 2025 1 commit
    • [Example] Add Split-K and Stream-K Examples and move MLA from fld to mla (#110) · 5cea760c
      Lei Wang authored
      * Add DeepSeek MLA decode example with Flash Attention implementation
      
      * Add GEMM SplitK and StreamK example implementations
      
      This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
      - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
      - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
      
      Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
      
      * Refactor GEMM SplitK and StreamK example implementations
      
      Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
      - Remove unused import (Profiler) in splitk example
      - Simplify line breaks and improve code readability
      - Standardize indentation and remove unnecessary whitespace
      - Optimize atomic add and copy operations for better clarity
  4. 10 Feb, 2025 1 commit
    • [Dev] Remove unnecessary python dependencies (#69) · 2411fa28
      Lei Wang authored
      * [Enhancement] Add VectorizeLoop function and update imports for compatibility
      
      * [CI][Test] Improve test cases for vectorization and fix typos in parser comments
      
      * lint fix
      
      * Fix incorrect module reference for VectorizeLoop transformation
      
      * Refactor vectorize_loop transformation by removing unused extent mutation logic
      
      * [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen
      
      * Fix formatting in CUDA FP8 header file for consistency
      
      * Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity
      
      * Update submodule 'tvm' to latest commit for improved functionality
      
      * Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.
      
      * Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.
      
      * Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency
      
      * Add CUDA requirements to FP8 test cases and update references for clarity
      
      * Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py
      
      * Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Add CUDA requirements and FP8 test cases for matmul and gemv simulations
      
      * Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py
      
      * Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py
      
      * Add BF16 support to matrix multiplication and introduce corresponding test cases
      
      * Add a blank line for improved readability in BF16 GEMM test
      
      * Update acknowledgements in README to include supervision by Zhi Yang at Peking University
      
      * enhance acknowledgement
      
      * Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives
      
      * Update subproject commit for TVM dependency
      
      * Update subproject commit for TVM dependency
      
      * Add int4_t type and functions for packing char values in CUDA common header
      
      * Add plot_layout example and implement GetForwardVars method in layout classes
      
      * Refactor code for improved readability by adjusting line breaks and formatting in layout and test files
      
      * Fix formatting by removing unnecessary line break in layout.h
      
      * Refactor make_int4 function for improved readability by adjusting parameter formatting
      
      * Add legend to plot_layout for improved clarity of thread and local IDs
      
      * Remove unnecessary dependencies from requirements files for cleaner setup
      
      * Remove flash_mha.py and add .gitkeep to deepseek_mla directory
      
      * Add build requirements and update installation scripts for improved setup