1. 07 Mar, 2025 1 commit
    • Wenhao Xie's avatar
      [Enhancement] Improve CUDA path detection (#157) · 901deae1
      Wenhao Xie authored
      * [Typo] Fix formatting in installation instructions in README.md
      
      * [Enhancement] Improve CUDA path detection and update configuration handling
      
      * fix typo
      
      * remove IS_WINDOWS constant
      
      * lint fix
      
      * Improve error messages for CUDA detection failure
      
      * lint fix
      
      * lint fix
      
      * Fix .gitignore to correctly include venv directory
      901deae1
  2. 06 Mar, 2025 8 commits
    • Lei Wang's avatar
      Refactor MLA decode kernel: Replace T.If with native Python if statement (#162) · cfcbcf1e
      Lei Wang authored
      Simplify the control flow in the MLA decode kernel by replacing TileLang's T.If construct with a standard Python if statement. This change improves code readability and maintains the existing logic for handling sequence length constraints during block-wise computation.
      cfcbcf1e
    • Chaofan Lin's avatar
      [Carver] Multi-Threads Compilation for Fast Auto Tuning (#156) · 18be9e07
      Chaofan Lin authored
      * [Carver] Multi-Threads Compilation for Fast Auto Tuning
      
      * Add progress bar for compilation
      
      * lint
      18be9e07
    • xs-keju's avatar
      Add cpu jit with backend ctypes (#154) · 782ca9f6
      xs-keju authored
      
      
      * Add cpu jit with backend ctypes
      
      * Resolve some lint issues
      
      * Apply PR feedback on head file and kernel example
      
      * Add test cases
      
      * Resolve formatting issues
      
      * Resolve formatting issues
      
      ---------
      Co-authored-by: default avatarxxw <1990389406@qq.con>
      782ca9f6
    • Lei Wang's avatar
      Add libstdcxx-ng-12 to Dockerfiles for CUDA versions (#160) · 3486e27e
      Lei Wang authored
      Update Dockerfiles for CUDA 118, 120, 121, 123, 124, 125, and 126 to install libstdcxx-ng-12 from conda-forge, ensuring consistent standard library support across different CUDA versions
      3486e27e
    • Lei Wang's avatar
      [Release] Bump Version to v0.1.2 (#155) · 237dab0d
      Lei Wang authored
      * Remove Torch CPP backend and update execution backend options
      
      - Remove TorchCPPKernelAdapter and related code from JIT modules
      - Update execution backend options in jit/__init__.py, kernel.py, and adapter/__init__.py
      - Remove "torch_cpp" from supported execution backend literals
      - Simplify backend validation and remove unused torch_cpp-related code
      。
      
      * lint fix
      
      * Add block sparse attention implementations for TileLang and Triton
      
      - Implement block sparse attention kernels for TileLang and Triton
      - Add example scripts for block sparse attention with top-k and threshold-based masking
      - Include utility functions for generating sparse attention masks
      - Demonstrate causal attention with block-level sparsity
      - Add test cases to validate sparse attention implementations against PyTorch reference
      
      * Bump version to 0.1.1
      
      * Bump version to 0.1.2
      237dab0d
    • Yu Cheng's avatar
      [Dev][Benchmark] Add MLA paged decoding example and benchmark script (#158) · be9abf18
      Yu Cheng authored
      * [Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16
      
      - Remove redundant `acc_s_0` fragment in flash attention kernel
      - Simplify memory copy and reduction operations
      - Reorder memory copy and scaling steps for improved performance
      - Add Hopper-specific synchronization method in CUDA reduce template
      - Update reduce operation to use architecture-specific synchronization
      
      * [Dev] Add DeepSeek MLA Decoding (Paged+Varlen) kernel and Performance Benchmark Script
      
      - Implement comprehensive MLA (Multi-Head Latent Attention) decoding benchmark script
      - Add support for multiple implementations: Torch, TileLang, FlashMLA, FlashInfer, and Triton
      - Create flexible configuration for benchmarking different batch sizes, sequence lengths, and head configurations
      - Implement performance comparison and CSV output for detailed performance analysis
      - Add command-line argument support for targeted benchmarking and comparison
      
      * [Dev] Refactor MLA Paged Decoding Kernel with Improved Block Handling and Precision
      
      - Replace `d` parameter with `dv` to clarify value dimension in MLA decoding
      - Enhance block distribution logic for split KV processing
      - Improve handling of remaining blocks in split KV computation
      - Add initialization of `lse_max_local` to prevent potential precision issues
      - Optimize block start and range calculations for more accurate sequence processing
      
      * lint
      be9abf18
    • Lei Wang's avatar
      [Carver] Enhance Carver Adaptation for MatMul Benchmarking (#153) · 3c53297b
      Lei Wang authored
      * [Refactor] Consolidate GemmWarpPolicy Enum and Add Utility Method
      
      - Move GemmWarpPolicy from copy.py and gemm.py to primitives/gemm/base.py
      - Implement from_warp_partition class method to determine warp policy
      - Add docstring with examples for policy determination
      - Remove duplicate GemmWarpPolicy class definitions
      
      * [Enhancement] Add TensorCore Intrinsic Matrix Multiplication Benchmarks
      
      - Implement two new matrix multiplication benchmark scripts:
        1. `benchmark_matmul_intrinsic.py`: Uses TensorCore intrinsics with advanced configuration
        2. `benchmark_matmul.py`: Provides a more generic matrix multiplication benchmark
      
      - Add support for roller-based configuration generation in both benchmarks
      - Enhance MMA macro generator to handle 2D and 4D output buffer layouts
      - Implement flexible autotuning configurations with multiple parameters
      - Support different data types and accumulation modes
      - Add command-line arguments for matrix dimensions and roller configuration
      
      * lint fix
      
      * Fix roller hints generation in get_roller_hints_from_func
      
      - Simplify roller hints generation logic
      - Ensure policy-based configuration is always emitted when a policy is available
      - Remove redundant None check for roller hints
      
      * Add shared memory for matrix multiplication in benchmark and quickstart examples
      
      - Modify benchmark_matmul.py and quickstart.py to include C_shared allocation
      - Change accumulation dtype from float16 to float in benchmark_matmul.py
      - Update matrix multiplication kernels to use shared memory for result storage
      - Enable CUDA kernel source printing in quickstart example
      3c53297b
    • Lei Wang's avatar
      [Enhancement] Optimize TileLang Build Process with Dynamic CPU Core Allocation (#152) · e945dae2
      Lei Wang authored
      - Calculate 75% of available CPU cores for make jobs
      - Prevent system unresponsiveness during build
      - Dynamically adjust make job count based on system resources
      e945dae2
  3. 05 Mar, 2025 6 commits
    • Chaofan Lin's avatar
      8b9edc3e
    • Lei Wang's avatar
      [Refactor] Update BitBLAS Benchmark with TileLang Carver Imports and Roller Hints Generation (#148) · 0e2eae42
      Lei Wang authored
      - Replace BitBLAS imports with TileLang Carver imports in benchmark_matmul.py
      - Modify roller hints generation using new TileLang Carver template and utility functions
      - Update get_roller_hints_from_func to handle None cases and improve return logic
      - Adjust DefaultPolicy to handle different codegen dictionary formats
      0e2eae42
    • Lei Wang's avatar
      [Enhancement] Enable runtime tensor data type validation (#146) · d0434c3e
      Lei Wang authored
      * Fix debug print buffer template for unsigned char type
      
      - Update debug_print_buffer_value template specialization for unsigned char
      - Modify test_tilelang_debug_print.py to include additional dtype tests
      - Add test case for uint8 dtype in debug print buffer function
      
      * Refactor debug print buffer template formatting for unsigned char
      
      - Improve code formatting for debug_print_buffer_value template specialization
      - Adjust line breaks and indentation for better readability
      - Maintain consistent code style with other template specializations
      
      * Extract map_torch_type utility function to tilelang.utils.tensor
      
      - Move map_torch_type function from multiple test files to a centralized location
      - Import map_torch_type from tilelang.utils.tensor in kernel test files
      - Improve code reusability by creating a shared utility function for type mapping
      
      * Add buffer dtype mapping for Cython kernel adapter
      
      - Introduce buffer_dtype_map in CythonKernelAdapter to track buffer variable dtypes
      - Add _process_buffer_dtype method to extract dtype information from TIR function
      - Update CythonKernelWrapper to support setting and validating buffer dtypes
      - Enhance type checking during kernel execution with dtype verification
      - Improve logging message for Cython JIT adapter compilation
      
      * Add static shape mapping for Cython kernel adapter
      
      - Introduce static_shape_map in CythonKernelAdapter to track buffer variable static shapes
      - Add _process_static_shape method to extract static shape information from TIR function
      - Update CythonKernelWrapper to support setting and validating static shapes
      - Enhance type checking during kernel execution with static shape verification
      
      * Add Multi-Head Attention (MHA) Backward Pass Test for TileLang Kernel
      
      - Implement comprehensive test for Multi-Head Attention backward pass
      - Support both causal and non-causal attention scenarios
      - Add reference implementation for comparing kernel outputs
      - Test different batch sizes, head counts, sequence lengths, and head dimensions
      - Verify forward and backward pass correctness using torch.testing.assert_close
      
      * Set random seed for MHA backward pass test
      
      - Add random seed initialization for consistent test reproducibility
      - Use tilelang.testing.set_random_seed(42) to ensure deterministic test results
      d0434c3e
    • Lei Wang's avatar
      [Enhancement] Support debug print for unsigned char datatype (#145) · bb60f6ce
      Lei Wang authored
      * Fix debug print buffer template for unsigned char type
      
      - Update debug_print_buffer_value template specialization for unsigned char
      - Modify test_tilelang_debug_print.py to include additional dtype tests
      - Add test case for uint8 dtype in debug print buffer function
      
      * Refactor debug print buffer template formatting for unsigned char
      
      - Improve code formatting for debug_print_buffer_value template specialization
      - Adjust line breaks and indentation for better readability
      - Maintain consistent code style with other template specializations
      bb60f6ce
    • Lei Wang's avatar
      [Refactor] Rename gemm fp8 example as we currently lack `T.gemm` support for fp8 (#144) · 37d44f24
      Lei Wang authored
      * Change default log level from WARNING to INFO in TileLang initialization
      
      * Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support
      
      - Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
      - Remove unused imports and simplify function signature
      - Modify `flashattn` function to handle max sequence length as a separate argument
      - Update kernel call to include max sequence length parameter
      - Improve code readability and remove commented-out code
      - Add print statement to confirm successful assertion
      
      * Refactor code formatting in TileLang lowering and example files
      
      - Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
      - Simplify line breaks and reduce unnecessary whitespace
      - Enhance code readability by adjusting indentation and line breaks
      - Update example MHA forward pass script with cleaner tensor initialization
      
      * Update TileLang kernel test with import path changes for MMA layout and macro generator
      
      - Modify import statements in test_tilelang_kernel_dequantize_gemm.py
      - Replace bitblas imports with tilelang.intrinsics imports for MMA-related utilities
      - Update main function to use tilelang.testing.main()
      
      * Add Block Sparse Attention Examples for TileLang and Triton
      
      - Implement block sparse attention kernels for both TileLang and Triton
      - Add utility functions for generating sparse attention masks using top-k and threshold methods
      - Support causal and variable-length attention scenarios
      - Include test cases for different sequence length configurations
      - Demonstrate block-level sparse attention with configurable parameters
      
      * Refactor Block Sparse Attention Examples with Code Style Improvements
      
      - Improve code formatting in block_sparse_attn_tilelang.py and block_sparse_attn_triton.py
      - Enhance readability by adjusting line breaks and indentation
      - Simplify kernel and function calls with better formatting
      - Add whitespace and line break improvements for better code clarity
      
      * Enhance Layout Plotting with Multi-Replication and Dynamic Visualization
      
      - Update plot_layout function to support multiple replications in thread and value mapping
      - Improve thread and value mapping to handle replicated layouts
      - Dynamically adjust figure size and legend positioning
      - Add print statements for saved plot file paths
      - Modify example fragment_mma_load_a.py to uncomment and enable warp and block layout plotting
      
      * Refactor AtomicAdd functions in CUDA common header
      
      - Implement a generic template for AtomicAdd function
      - Specialize templates for half_t, bfloat16_t, and pointer types
      - Reorganize and clean up existing AtomicAdd implementations
      - Improve type handling and conversion in atomic operations
      
      * Remove unused import in MHA backward test file
      
      - Remove unnecessary argparse import from test_tilelang_kenrel_mha_bwd.py
      - Add blank line for improved code formatting
      - Minor code cleanup in test file
      
      * Add FP8 GEMM Example with TensorCore Intrinsics
      
      - Implement a new example for FP8 matrix multiplication using TensorCore intrinsics
      - Support E4M3 and E5M2 floating-point 8-bit data types
      - Add README with notes on current FP8 implementation limitations
      - Include correctness test for FP8 GEMM with different configurations
      - Demonstrate swizzle layout and pipeline optimizations for FP8 computation
      37d44f24
    • Yu Cheng's avatar
      [Dev] Adjust computation logic to avoid precision loss when casting acc_s from... · e1d82bf3
      Yu Cheng authored
      [Dev] Adjust computation logic to avoid precision loss when casting acc_s from float to float16 (#141)
      
      - Remove redundant `acc_s_0` fragment in flash attention kernel
      - Simplify memory copy and reduction operations
      - Reorder memory copy and scaling steps for improved performance
      - Add Hopper-specific synchronization method in CUDA reduce template
      - Update reduce operation to use architecture-specific synchronization
      e1d82bf3
  4. 04 Mar, 2025 3 commits
    • Yu Cheng's avatar
      [Dev][Doc] Enhance Flash Attention Implementation in GQA Decoding Example and Fix Typo (#139) · 3d7b2dc5
      Yu Cheng authored
      - Add non-split flash attention macro for more flexible kernel generation
      - Implement `main_no_split` function to handle single-split scenarios
      - Modify kernel selection logic to dynamically choose between split and non-split implementations
      3d7b2dc5
    • Lei Wang's avatar
      [Bugfix] Add missing definition for AtomicAdd (#138) · 3960d3d0
      Lei Wang authored
      * Change default log level from WARNING to INFO in TileLang initialization
      
      * Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support
      
      - Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
      - Remove unused imports and simplify function signature
      - Modify `flashattn` function to handle max sequence length as a separate argument
      - Update kernel call to include max sequence length parameter
      - Improve code readability and remove commented-out code
      - Add print statement to confirm successful assertion
      
      * Refactor code formatting in TileLang lowering and example files
      
      - Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
      - Simplify line breaks and reduce unnecessary whitespace
      - Enhance code readability by adjusting indentation and line breaks
      - Update example MHA forward pass script with cleaner tensor initialization
      
      * Update TileLang kernel test with import path changes for MMA layout and macro generator
      
      - Modify import statements in test_tilelang_kernel_dequantize_gemm.py
      - Replace bitblas imports with tilelang.intrinsics imports for MMA-related utilities
      - Update main function to use tilelang.testing.main()
      
      * Add Block Sparse Attention Examples for TileLang and Triton
      
      - Implement block sparse attention kernels for both TileLang and Triton
      - Add utility functions for generating sparse attention masks using top-k and threshold methods
      - Support causal and variable-length attention scenarios
      - Include test cases for different sequence length configurations
      - Demonstrate block-level sparse attention with configurable parameters
      
      * Refactor Block Sparse Attention Examples with Code Style Improvements
      
      - Improve code formatting in block_sparse_attn_tilelang.py and block_sparse_attn_triton.py
      - Enhance readability by adjusting line breaks and indentation
      - Simplify kernel and function calls with better formatting
      - Add whitespace and line break improvements for better code clarity
      
      * Enhance Layout Plotting with Multi-Replication and Dynamic Visualization
      
      - Update plot_layout function to support multiple replications in thread and value mapping
      - Improve thread and value mapping to handle replicated layouts
      - Dynamically adjust figure size and legend positioning
      - Add print statements for saved plot file paths
      - Modify example fragment_mma_load_a.py to uncomment and enable warp and block layout plotting
      
      * Refactor AtomicAdd functions in CUDA common header
      
      - Implement a generic template for AtomicAdd function
      - Specialize templates for half_t, bfloat16_t, and pointer types
      - Reorganize and clean up existing AtomicAdd implementations
      - Improve type handling and conversion in atomic operations
      
      * Remove unused import in MHA backward test file
      
      - Remove unnecessary argparse import from test_tilelang_kenrel_mha_bwd.py
      - Add blank line for improved code formatting
      - Minor code cleanup in test file
      3960d3d0
    • Yu Cheng's avatar
      [Doc] Add MLA Decoding Performance Benchmarks and Documentation (#137) · e89e8b6c
      Yu Cheng authored
      - Update news and MLA performance benchmark in README.md
      - Move performance benchmark and layout images to a dedicated 'figures' directory
      - Improve code formatting and image references in documentation
      e89e8b6c
  5. 03 Mar, 2025 3 commits
    • Lei Wang's avatar
      [Debug] Improve Memory Layout Plot (#136) · e32311b2
      Lei Wang authored
      * Change default log level from WARNING to INFO in TileLang initialization
      
      * Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support
      
      - Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
      - Remove unused imports and simplify function signature
      - Modify `flashattn` function to handle max sequence length as a separate argument
      - Update kernel call to include max sequence length parameter
      - Improve code readability and remove commented-out code
      - Add print statement to confirm successful assertion
      
      * Refactor code formatting in TileLang lowering and example files
      
      - Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
      - Simplify line breaks and reduce unnecessary whitespace
      - Enhance code readability by adjusting indentation and line breaks
      - Update example MHA forward pass script with cleaner tensor initialization
      
      * Update TileLang kernel test with import path changes for MMA layout and macro generator
      
      - Modify import statements in test_tilelang_kernel_dequantize_gemm.py
      - Replace bitblas imports with tilelang.intrinsics imports for MMA-related utilities
      - Update main function to use tilelang.testing.main()
      
      * Add Block Sparse Attention Examples for TileLang and Triton
      
      - Implement block sparse attention kernels for both TileLang and Triton
      - Add utility functions for generating sparse attention masks using top-k and threshold methods
      - Support causal and variable-length attention scenarios
      - Include test cases for different sequence length configurations
      - Demonstrate block-level sparse attention with configurable parameters
      
      * Refactor Block Sparse Attention Examples with Code Style Improvements
      
      - Improve code formatting in block_sparse_attn_tilelang.py and block_sparse_attn_triton.py
      - Enhance readability by adjusting line breaks and indentation
      - Simplify kernel and function calls with better formatting
      - Add whitespace and line break improvements for better code clarity
      
      * Enhance Layout Plotting with Multi-Replication and Dynamic Visualization
      
      - Update plot_layout function to support multiple replications in thread and value mapping
      - Improve thread and value mapping to handle replicated layouts
      - Dynamically adjust figure size and legend positioning
      - Add print statements for saved plot file paths
      - Modify example fragment_mma_load_a.py to uncomment and enable warp and block layout plotting
      e32311b2
    • Yu Cheng's avatar
      [Doc] Update MLA Documentation (#135) · b70683b3
      Yu Cheng authored
      b70683b3
    • Yu Cheng's avatar
      [Dev][Doc] Add DeepSeek MLA Decode Example with Documentation and Performance Benchmarks (#134) · cd94aca1
      Yu Cheng authored
      * [Dev] Add RetNet Linear Attention example
      
      * [Dev] Add WgmmaSync rewriter for pipelined WGMMA operations and add MHA WGMMA pipelined example (FA3-like scheduling)
      
      This commit introduces a new transformation pass `RewriteWgmmaSync` to optimize warp group matrix multiply accumulate (WGMMA) operations in the TileLang compiler:
      
      - Implemented `WgmmaSyncRewriter` in `src/transform/wgmma_sync_rewriter.cc`
      - Added pass registration for `RewriteWgmmaSync`
      - Updated `tilelang/engine/phase.py` to include the new transformation pass
      - Updated `tilelang/transform/__init__.py` to expose the new pass
      
      The rewriter intelligently manages synchronization and dependencies between WGMMA operations, improving pipeline efficiency for complex matrix multiplication kernels.
      
      * [Bugfix] Fix bug in ThreadTagChecker for warp specialization
      
      Improve thread tag validation in warp specialized rewriter to prevent unintended transformations:
      - Add more precise checks for threadIdx.y and threadIdx.z
      - Validate thread extent to ensure only single-extent thread bindings are allowed
      - Prevent warp specialization for multi-extent thread bindings in y and z dimensions
      
      * lint
      
      * [CI] Add TMA descriptor attribute to transformed module in test case
      
      * [Dev] Refactor DeepSeek MLA Decode Example with Non-Split and Split Flash Attention Implementations
      
      - Add new `flash_attn` macro for non-split flash attention implementation
      - Add swizzled layout for tile in shared memory
      - Use threadblock swizzle to imporve L2 cache hit rate
      
      * [Dev] Add DeepSeek MLA Decode Example with Documentation and Performance Benchmarks
      
      - Add detailed README.md explaining MLA (Multi-Head Latent Attention) implementation
      - Include performance benchmark images for batch sizes 64 and 128
      - Add layout visualization images for QK and PV operations
      - Implement torch reference implementations in torch_refs.py
      - Update example_mla_decode.py with command-line argument support and flexible configuration
      - Add performance benchmarking and comparison with other implementations
      cd94aca1
  6. 02 Mar, 2025 2 commits
    • Lei Wang's avatar
      [Kernel] Implement different SEQ Q/KV examples with block sparse (#133) · 159af5df
      Lei Wang authored
      * Change default log level from WARNING to INFO in TileLang initialization
      
      * Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support
      
      - Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
      - Remove unused imports and simplify function signature
      - Modify `flashattn` function to handle max sequence length as a separate argument
      - Update kernel call to include max sequence length parameter
      - Improve code readability and remove commented-out code
      - Add print statement to confirm successful assertion
      
      * Refactor code formatting in TileLang lowering and example files
      
      - Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
      - Simplify line breaks and reduce unnecessary whitespace
      - Enhance code readability by adjusting indentation and line breaks
      - Update example MHA forward pass script with cleaner tensor initialization
      
      * Update TileLang kernel test with import path changes for MMA layout and macro generator
      
      - Modify import statements in test_tilelang_kernel_dequantize_gemm.py
      - Replace bitblas imports with tilelang.intrinsics imports for MMA-related utilities
      - Update main function to use tilelang.testing.main()
      
      * Add Block Sparse Attention Examples for TileLang and Triton
      
      - Implement block sparse attention kernels for both TileLang and Triton
      - Add utility functions for generating sparse attention masks using top-k and threshold methods
      - Support causal and variable-length attention scenarios
      - Include test cases for different sequence length configurations
      - Demonstrate block-level sparse attention with configurable parameters
      
      * Refactor Block Sparse Attention Examples with Code Style Improvements
      
      - Improve code formatting in block_sparse_attn_tilelang.py and block_sparse_attn_triton.py
      - Enhance readability by adjusting line breaks and indentation
      - Simplify kernel and function calls with better formatting
      - Add whitespace and line break improvements for better code clarity
      159af5df
    • Lei Wang's avatar
      [Refactor] Set default log level from waning into info (#132) · 9ba96f19
      Lei Wang authored
      * Change default log level from WARNING to INFO in TileLang initialization
      
      * Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support
      
      - Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
      - Remove unused imports and simplify function signature
      - Modify `flashattn` function to handle max sequence length as a separate argument
      - Update kernel call to include max sequence length parameter
      - Improve code readability and remove commented-out code
      - Add print statement to confirm successful assertion
      
      * Refactor code formatting in TileLang lowering and example files
      
      - Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
      - Simplify line breaks and reduce unnecessary whitespace
      - Enhance code readability by adjusting indentation and line breaks
      - Update example MHA forward pass script with cleaner tensor initialization
      9ba96f19
  7. 28 Feb, 2025 4 commits
    • Lei Wang's avatar
      [Example] Implememt FMHA Varlen Example (#131) · dd5d955c
      Lei Wang authored
      * Add DeepSeek MLA decode example with Flash Attention implementation
      
      * Add GEMM SplitK and StreamK example implementations
      
      This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
      - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
      - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
      
      Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
      
      * Refactor GEMM SplitK and StreamK example implementations
      
      Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
      - Remove unused import (Profiler) in splitk example
      - Simplify line breaks and improve code readability
      - Standardize indentation and remove unnecessary whitespace
      - Optimize atomic add and copy operations for better clarity
      
      * Add block sparse attention benchmarks for multiple libraries
      
      This commit introduces comprehensive block sparse attention benchmarks for different libraries:
      - TileLang block sparse FMHA implementation
      - Triton block sparse FMHA implementation
      - PyTorch reference block sparse FMHA implementation
      - FlashAttention dense FMHA reference implementation
      
      The benchmarks include:
      - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
      - Sparse mask generation using top-k and threshold methods
      - Performance measurement for different sparse attention configurations
      - Utility functions for mask generation and benchmarking
      
      * Refactor block sparse attention benchmarks with code style improvements
      
      - Add Ruff linter ignore comments to benchmark files
      - Improve code formatting and line breaks
      - Remove unused imports
      - Standardize print statement formatting
      - Enhance code readability across multiple library benchmarks
      
      * lint fix
      
      * Add CUDA atomic operations for BFLOAT16 and update function naming
      
      - Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
      - Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
      - Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
      - Update kernel and language customization to use new function names
      - Add return type annotations in profiler module
      
      * lint fix
      
      * Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang
      
      This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
      - Group Query Attention (GQA) implementation
      - Flash Attention forward pass
      - Performance benchmarking
      - Configurable parameters for batch, heads, sequence length, and dimension
      - Autotuning support
      - Reference implementation comparison
      
      * Refactor IR lowering pipeline into modular phases
      
      This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
      - `LowerAndLegalize`: Handles initial IR legalization and transformation
      - `OptimizeForTarget`: Applies target-specific optimizations
      
      The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.
      
      * lintfix
      
      * nas kernel
      
      * Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates
      
      - Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
      - Increased default number of heads and selected blocks in TileLang NSA example
      - Added Ruff linter ignore comments to reference.py
      - Standardized function signatures and improved code readability across NSA implementations
      
      * Add utility math functions for integer operations
      
      - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
      - Add `cdiv()` function for ceiling division of integers
      
      * Add utility math functions for integer operations
      
      - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
      - Add `cdiv()` function for ceiling division of integers
      
      * Refactor DeepSeek MLA Decode Example with Enhanced Flash Attention Implementation
      
      - Update flash attention kernel to support positional embeddings (PE)
      - Modify reference implementation to handle PE and group query attention
      - Increase default batch size and adjust benchmarking parameters
      - Improve kernel performance and readability
      - Add einops and torch operations for more flexible tensor manipulation
      
      * Update README.md with corrected Flash MLA Decoding example path
      
      - Modify the example link for Flash MLA Decoding to point to the correct directory
      - Ensure accurate navigation to the DeepSeek MLA decoding example
      
      * Refactor Native Sparse Attention Kernel and Improve Utility Functions
      
      This commit introduces several improvements:
      - Simplified native sparse attention kernel by inlining macro functions in example_tilelang_nsa.py
      - Enhanced error handling in loop_partition.cc with more informative error messages
      - Updated print.py to support multi-dimensional buffer printing
      - Improved torch_assert_close in testing/__init__.py with more detailed mismatch reporting
      - Reduced default absolute tolerance in torch comparison from 1e-3 to 1e-2
      - Added shape validation and detailed mismatch information in tensor comparison
      
      * Refactor Code Formatting and Improve Utility Functions
      
      This commit introduces several code formatting and utility improvements:
      - Add Ruff linter ignore comment in example_tilelang_nsa.py
      - Enhance code readability in loop_partition.cc and lower_tile_op.cc with improved line breaks
      - Simplify print_flat_buffer_with_condition in print.py
      - Refactor torch_assert_close in testing/__init__.py with improved line formatting
      
      * Enhance Buffer Printing Support for Fragment and Shared Memory Buffers
      
      This commit improves the print functionality in print.py by:
      - Adding support for printing fragment memory buffers
      - Implementing a new print_fragment_buffer_with_condition macro
      - Extending print_shared_buffer_with_condition for shared memory buffers
      - Updating the generic print function to handle different buffer scopes
      
      * Resolve merge conflict in print.py
      
      Remove merge conflict marker and clean up whitespace in the print module
      
      * Add Variable-Length Multi-Head Attention (MHA) Example with Flash Attention Support
      
      Introduce a new example script `example_mha_fwd_varlen.py` that demonstrates:
      - Variable-length Multi-Head Attention (MHA) implementation
      - Flash Attention forward pass with padding mask support
      - Performance benchmarking for variable-length sequences
      - Configurable parameters for batch, heads, sequence length, and dimension
      - Reference implementation comparison with PyTorch and FlashAttention
      
      * Refactor Flash Attention Variable-Length MHA Example
      
      Improve code formatting and readability in the variable-length multi-head attention example:
      - Add Ruff linter ignore comment
      - Enhance code style with consistent formatting
      - Remove unused imports
      - Improve line breaks and indentation
      - Simplify function signatures and lambda expressions
      dd5d955c
    • Lei Wang's avatar
      [Debug] Support `T.print` for `fragment` scope (#130) · d55386d1
      Lei Wang authored
      * Add DeepSeek MLA decode example with Flash Attention implementation
      
      * Add GEMM SplitK and StreamK example implementations
      
      This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
      - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
      - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
      
      Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
      
      * Refactor GEMM SplitK and StreamK example implementations
      
      Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
      - Remove unused import (Profiler) in splitk example
      - Simplify line breaks and improve code readability
      - Standardize indentation and remove unnecessary whitespace
      - Optimize atomic add and copy operations for better clarity
      
      * Add block sparse attention benchmarks for multiple libraries
      
      This commit introduces comprehensive block sparse attention benchmarks for different libraries:
      - TileLang block sparse FMHA implementation
      - Triton block sparse FMHA implementation
      - PyTorch reference block sparse FMHA implementation
      - FlashAttention dense FMHA reference implementation
      
      The benchmarks include:
      - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
      - Sparse mask generation using top-k and threshold methods
      - Performance measurement for different sparse attention configurations
      - Utility functions for mask generation and benchmarking
      
      * Refactor block sparse attention benchmarks with code style improvements
      
      - Add Ruff linter ignore comments to benchmark files
      - Improve code formatting and line breaks
      - Remove unused imports
      - Standardize print statement formatting
      - Enhance code readability across multiple library benchmarks
      
      * lint fix
      
      * Add CUDA atomic operations for BFLOAT16 and update function naming
      
      - Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
      - Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
      - Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
      - Update kernel and language customization to use new function names
      - Add return type annotations in profiler module
      
      * lint fix
      
      * Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang
      
      This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
      - Group Query Attention (GQA) implementation
      - Flash Attention forward pass
      - Performance benchmarking
      - Configurable parameters for batch, heads, sequence length, and dimension
      - Autotuning support
      - Reference implementation comparison
      
      * Refactor IR lowering pipeline into modular phases
      
      This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
      - `LowerAndLegalize`: Handles initial IR legalization and transformation
      - `OptimizeForTarget`: Applies target-specific optimizations
      
      The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.
      
      * lintfix
      
      * nas kernel
      
      * Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates
      
      - Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
      - Increased default number of heads and selected blocks in TileLang NSA example
      - Added Ruff linter ignore comments to reference.py
      - Standardized function signatures and improved code readability across NSA implementations
      
      * Add utility math functions for integer operations
      
      - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
      - Add `cdiv()` function for ceiling division of integers
      
      * Add utility math functions for integer operations
      
      - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
      - Add `cdiv()` function for ceiling division of integers
      
      * Refactor DeepSeek MLA Decode Example with Enhanced Flash Attention Implementation
      
      - Update flash attention kernel to support positional embeddings (PE)
      - Modify reference implementation to handle PE and group query attention
      - Increase default batch size and adjust benchmarking parameters
      - Improve kernel performance and readability
      - Add einops and torch operations for more flexible tensor manipulation
      
      * Update README.md with corrected Flash MLA Decoding example path
      
      - Modify the example link for Flash MLA Decoding to point to the correct directory
      - Ensure accurate navigation to the DeepSeek MLA decoding example
      
      * Refactor Native Sparse Attention Kernel and Improve Utility Functions
      
      This commit introduces several improvements:
      - Simplified native sparse attention kernel by inlining macro functions in example_tilelang_nsa.py
      - Enhanced error handling in loop_partition.cc with more informative error messages
      - Updated print.py to support multi-dimensional buffer printing
      - Improved torch_assert_close in testing/__init__.py with more detailed mismatch reporting
      - Reduced default absolute tolerance in torch comparison from 1e-3 to 1e-2
      - Added shape validation and detailed mismatch information in tensor comparison
      
      * Refactor Code Formatting and Improve Utility Functions
      
      This commit introduces several code formatting and utility improvements:
      - Add Ruff linter ignore comment in example_tilelang_nsa.py
      - Enhance code readability in loop_partition.cc and lower_tile_op.cc with improved line breaks
      - Simplify print_flat_buffer_with_condition in print.py
      - Refactor torch_assert_close in testing/__init__.py with improved line formatting
      
      * Enhance Buffer Printing Support for Fragment and Shared Memory Buffers
      
      This commit improves the print functionality in print.py by:
      - Adding support for printing fragment memory buffers
      - Implementing a new print_fragment_buffer_with_condition macro
      - Extending print_shared_buffer_with_condition for shared memory buffers
      - Updating the generic print function to handle different buffer scopes
      
      * Resolve merge conflict in print.py
      
      Remove merge conflict marker and clean up whitespace in the print module
      d55386d1
    • Lei Wang's avatar
      [Dev] Remove buffer flatten when debug print a shared buffer (#129) · 20bbb91a
      Lei Wang authored
      * Add DeepSeek MLA decode example with Flash Attention implementation
      
      * Add GEMM SplitK and StreamK example implementations
      
      This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
      - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
      - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
      
      Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
      
      * Refactor GEMM SplitK and StreamK example implementations
      
      Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
      - Remove unused import (Profiler) in splitk example
      - Simplify line breaks and improve code readability
      - Standardize indentation and remove unnecessary whitespace
      - Optimize atomic add and copy operations for better clarity
      
      * Add block sparse attention benchmarks for multiple libraries
      
      This commit introduces comprehensive block sparse attention benchmarks for different libraries:
      - TileLang block sparse FMHA implementation
      - Triton block sparse FMHA implementation
      - PyTorch reference block sparse FMHA implementation
      - FlashAttention dense FMHA reference implementation
      
      The benchmarks include:
      - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
      - Sparse mask generation using top-k and threshold methods
      - Performance measurement for different sparse attention configurations
      - Utility functions for mask generation and benchmarking
      
      * Refactor block sparse attention benchmarks with code style improvements
      
      - Add Ruff linter ignore comments to benchmark files
      - Improve code formatting and line breaks
      - Remove unused imports
      - Standardize print statement formatting
      - Enhance code readability across multiple library benchmarks
      
      * lint fix
      
      * Add CUDA atomic operations for BFLOAT16 and update function naming
      
      - Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
      - Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
      - Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
      - Update kernel and language customization to use new function names
      - Add return type annotations in profiler module
      
      * lint fix
      
      * Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang
      
      This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
      - Group Query Attention (GQA) implementation
      - Flash Attention forward pass
      - Performance benchmarking
      - Configurable parameters for batch, heads, sequence length, and dimension
      - Autotuning support
      - Reference implementation comparison
      
      * Refactor IR lowering pipeline into modular phases
      
      This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
      - `LowerAndLegalize`: Handles initial IR legalization and transformation
      - `OptimizeForTarget`: Applies target-specific optimizations
      
      The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.
      
      * lintfix
      
      * nas kernel
      
      * Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates
      
      - Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
      - Increased default number of heads and selected blocks in TileLang NSA example
      - Added Ruff linter ignore comments to reference.py
      - Standardized function signatures and improved code readability across NSA implementations
      
      * Add utility math functions for integer operations
      
      - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
      - Add `cdiv()` function for ceiling division of integers
      
      * Add utility math functions for integer operations
      
      - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
      - Add `cdiv()` function for ceiling division of integers
      
      * Refactor DeepSeek MLA Decode Example with Enhanced Flash Attention Implementation
      
      - Update flash attention kernel to support positional embeddings (PE)
      - Modify reference implementation to handle PE and group query attention
      - Increase default batch size and adjust benchmarking parameters
      - Improve kernel performance and readability
      - Add einops and torch operations for more flexible tensor manipulation
      
      * Update README.md with corrected Flash MLA Decoding example path
      
      - Modify the example link for Flash MLA Decoding to point to the correct directory
      - Ensure accurate navigation to the DeepSeek MLA decoding example
      
      * Refactor Native Sparse Attention Kernel and Improve Utility Functions
      
      This commit introduces several improvements:
      - Simplified native sparse attention kernel by inlining macro functions in example_tilelang_nsa.py
      - Enhanced error handling in loop_partition.cc with more informative error messages
      - Updated print.py to support multi-dimensional buffer printing
      - Improved torch_assert_close in testing/__init__.py with more detailed mismatch reporting
      - Reduced default absolute tolerance in torch comparison from 1e-3 to 1e-2
      - Added shape validation and detailed mismatch information in tensor comparison
      
      * Refactor Code Formatting and Improve Utility Functions
      
      This commit introduces several code formatting and utility improvements:
      - Add Ruff linter ignore comment in example_tilelang_nsa.py
      - Enhance code readability in loop_partition.cc and lower_tile_op.cc with improved line breaks
      - Simplify print_flat_buffer_with_condition in print.py
      - Refactor torch_assert_close in testing/__init__.py with improved line formatting
      20bbb91a
    • Yu Cheng's avatar
      [Dev][Bugfix] Fix bug in ThreadTagChecker; Add WgmmaSync rewriter and add MHA... · 0d873fcf
      Yu Cheng authored
      [Dev][Bugfix] Fix bug in ThreadTagChecker; Add WgmmaSync rewriter and add MHA WGMMA pipelined example (#128)
      
      * [Dev] Add RetNet Linear Attention example
      
      * [Dev] Add WgmmaSync rewriter for pipelined WGMMA operations and add MHA WGMMA pipelined example (FA3-like scheduling)
      
      This commit introduces a new transformation pass `RewriteWgmmaSync` to optimize warp group matrix multiply accumulate (WGMMA) operations in the TileLang compiler:
      
      - Implemented `WgmmaSyncRewriter` in `src/transform/wgmma_sync_rewriter.cc`
      - Added pass registration for `RewriteWgmmaSync`
      - Updated `tilelang/engine/phase.py` to include the new transformation pass
      - Updated `tilelang/transform/__init__.py` to expose the new pass
      
      The rewriter intelligently manages synchronization and dependencies between WGMMA operations, improving pipeline efficiency for complex matrix multiplication kernels.
      
      * [Bugfix] Fix bug in ThreadTagChecker for warp specialization
      
      Improve thread tag validation in warp specialized rewriter to prevent unintended transformations:
      - Add more precise checks for threadIdx.y and threadIdx.z
      - Validate thread extent to ensure only single-extent thread bindings are allowed
      - Prevent warp specialization for multi-extent thread bindings in y and z dimensions
      
      * lint
      
      * [CI] Add TMA descriptor attribute to transformed module in test case
      0d873fcf
  8. 27 Feb, 2025 1 commit
    • Lei Wang's avatar
      [JIT] Enhance cython/ctypes wrapper for tma descriptor (#126) · 7b74bb01
      Lei Wang authored
      
      
      * refactor code
      
      * enhance tutorial
      
      * Enhance error handling and code generation in CUDA and TileLang components
      
      This commit introduces several improvements across multiple files:
      - Added more informative error messages in GEMM layout checks
      - Updated CUDA codegen to support more flexible function signature generation
      - Improved TMA descriptor initialization and kernel dispatch logic
      - Refined library generation and source code parsing utilities
      - Enhanced error handling in various adapter and wrapper classes
      
      * Add thread tag validation for warp specialization
      
      Introduce a ThreadTagChecker to validate that a PrimFunc only uses threadIdx.x before applying warp specialization. This prevents unintended transformations on kernels with complex thread binding and provides a clear warning to users about potential issues with warp specialization.
      
      * Update TileLang Profiling and Compilation in Flash Decoding Examples
      
      Refactor the profiling and compilation workflow in two flash decoding example scripts:
      - Replace `tilelang.lower()` and `tilelang.Profiler()` with `tilelang.compile()`
      - Simplify profiler initialization using `get_profiler()`
      - Update method calls to use the new profiler and compiled kernel objects
      - Maintain existing performance benchmarking and validation logic
      
      * Refactor and clean up code formatting in TileLang testing and adapter modules
      
      This commit includes several code style and formatting improvements:
      - Adjust whitespace and line breaks in test files
      - Improve code formatting in CUDA source wrapper and adapter utilities
      - Enhance readability of function calls and argument handling
      - Remove unnecessary whitespace and standardize indentation
      - Simplify function signatures and argument parsing
      
      * Refactor CUDA codegen and improve code formatting
      
      This commit includes several improvements to CUDA code generation and formatting:
      - Enhance function signature generation in CodeGenTileLangCUDA
      - Improve code formatting and readability in CUDA-related files
      - Simplify parameter handling and type annotations
      - Clean up whitespace and line breaks in codegen and layout files
      
      ---------
      Co-authored-by: default avatarUbuntu <dlisuser@h100testl730RPS.xu5snccwrbtejcqqalluoku5hb.xx.internal.cloudapp.net>
      7b74bb01
  9. 26 Feb, 2025 3 commits
    • Yu Cheng's avatar
      ba311311
    • Lei Wang's avatar
      [Example] Update GEMM FP8 Example (#123) · 13f4b5c6
      Lei Wang authored
      * Add DeepSeek MLA decode example with Flash Attention implementation
      
      * Add GEMM SplitK and StreamK example implementations
      
      This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
      - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
      - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
      
      Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
      
      * Refactor GEMM SplitK and StreamK example implementations
      
      Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
      - Remove unused import (Profiler) in splitk example
      - Simplify line breaks and improve code readability
      - Standardize indentation and remove unnecessary whitespace
      - Optimize atomic add and copy operations for better clarity
      
      * Add block sparse attention benchmarks for multiple libraries
      
      This commit introduces comprehensive block sparse attention benchmarks for different libraries:
      - TileLang block sparse FMHA implementation
      - Triton block sparse FMHA implementation
      - PyTorch reference block sparse FMHA implementation
      - FlashAttention dense FMHA reference implementation
      
      The benchmarks include:
      - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
      - Sparse mask generation using top-k and threshold methods
      - Performance measurement for different sparse attention configurations
      - Utility functions for mask generation and benchmarking
      
      * Refactor block sparse attention benchmarks with code style improvements
      
      - Add Ruff linter ignore comments to benchmark files
      - Improve code formatting and line breaks
      - Remove unused imports
      - Standardize print statement formatting
      - Enhance code readability across multiple library benchmarks
      
      * lint fix
      
      * Add CUDA atomic operations for BFLOAT16 and update function naming
      
      - Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
      - Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
      - Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
      - Update kernel and language customization to use new function names
      - Add return type annotations in profiler module
      
      * lint fix
      
      * Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang
      
      This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
      - Group Query Attention (GQA) implementation
      - Flash Attention forward pass
      - Performance benchmarking
      - Configurable parameters for batch, heads, sequence length, and dimension
      - Autotuning support
      - Reference implementation comparison
      
      * Refactor IR lowering pipeline into modular phases
      
      This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
      - `LowerAndLegalize`: Handles initial IR legalization and transformation
      - `OptimizeForTarget`: Applies target-specific optimizations
      
      The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.
      
      * lintfix
      
      * nas kernel
      
      * Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates
      
      - Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
      - Increased default number of heads and selected blocks in TileLang NSA example
      - Added Ruff linter ignore comments to reference.py
      - Standardized function signatures and improved code readability across NSA implementations
      
      * Add utility math functions for integer operations
      
      - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
      - Add `cdiv()` function for ceiling division of integers
      
      * Add utility math functions for integer operations
      
      - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
      - Add `cdiv()` function for ceiling division of integers
      
      * Refactor DeepSeek MLA Decode Example with Enhanced Flash Attention Implementation
      
      - Update flash attention kernel to support positional embeddings (PE)
      - Modify reference implementation to handle PE and group query attention
      - Increase default batch size and adjust benchmarking parameters
      - Improve kernel performance and readability
      - Add einops and torch operations for more flexible tensor manipulation
      
      * Update README.md with corrected Flash MLA Decoding example path
      
      - Modify the example link for Flash MLA Decoding to point to the correct directory
      - Ensure accurate navigation to the DeepSeek MLA decoding example
      13f4b5c6
    • Yu Cheng's avatar
  10. 25 Feb, 2025 4 commits
    • Lei Wang's avatar
      [Example] Implement TileLang Native Sparse Attention Kernel (#121) · 3cbf8cbc
      Lei Wang authored
      * Add DeepSeek MLA decode example with Flash Attention implementation
      
      * Add GEMM SplitK and StreamK example implementations
      
      This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
      - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
      - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
      
      Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
      
      * Refactor GEMM SplitK and StreamK example implementations
      
      Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
      - Remove unused import (Profiler) in splitk example
      - Simplify line breaks and improve code readability
      - Standardize indentation and remove unnecessary whitespace
      - Optimize atomic add and copy operations for better clarity
      
      * Add block sparse attention benchmarks for multiple libraries
      
      This commit introduces comprehensive block sparse attention benchmarks for different libraries:
      - TileLang block sparse FMHA implementation
      - Triton block sparse FMHA implementation
      - PyTorch reference block sparse FMHA implementation
      - FlashAttention dense FMHA reference implementation
      
      The benchmarks include:
      - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
      - Sparse mask generation using top-k and threshold methods
      - Performance measurement for different sparse attention configurations
      - Utility functions for mask generation and benchmarking
      
      * Refactor block sparse attention benchmarks with code style improvements
      
      - Add Ruff linter ignore comments to benchmark files
      - Improve code formatting and line breaks
      - Remove unused imports
      - Standardize print statement formatting
      - Enhance code readability across multiple library benchmarks
      
      * lint fix
      
      * Add CUDA atomic operations for BFLOAT16 and update function naming
      
      - Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
      - Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
      - Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
      - Update kernel and language customization to use new function names
      - Add return type annotations in profiler module
      
      * lint fix
      
      * Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang
      
      This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
      - Group Query Attention (GQA) implementation
      - Flash Attention forward pass
      - Performance benchmarking
      - Configurable parameters for batch, heads, sequence length, and dimension
      - Autotuning support
      - Reference implementation comparison
      
      * Refactor IR lowering pipeline into modular phases
      
      This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
      - `LowerAndLegalize`: Handles initial IR legalization and transformation
      - `OptimizeForTarget`: Applies target-specific optimizations
      
      The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.
      
      * lintfix
      
      * nas kernel
      
      * Enhance Native Sparse Attention Examples with Code Improvements and Parameter Updates
      
      - Updated example_tilelang_nsa.py and example_triton_nsa.py with code formatting and style improvements
      - Increased default number of heads and selected blocks in TileLang NSA example
      - Added Ruff linter ignore comments to reference.py
      - Standardized function signatures and improved code readability across NSA implementations
      
      * Add utility math functions for integer operations
      
      - Implement `next_power_of_2()` to calculate the next power of 2 for an integer
      - Add `cdiv()` function for ceiling division of integers
      3cbf8cbc
    • Lei Wang's avatar
      [Example] Add GQA Example (#118) · 2b97e98a
      Lei Wang authored
      * Add DeepSeek MLA decode example with Flash Attention implementation
      
      * Add GEMM SplitK and StreamK example implementations
      
      This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
      - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
      - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
      
      Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
      
      * Refactor GEMM SplitK and StreamK example implementations
      
      Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
      - Remove unused import (Profiler) in splitk example
      - Simplify line breaks and improve code readability
      - Standardize indentation and remove unnecessary whitespace
      - Optimize atomic add and copy operations for better clarity
      
      * Add block sparse attention benchmarks for multiple libraries
      
      This commit introduces comprehensive block sparse attention benchmarks for different libraries:
      - TileLang block sparse FMHA implementation
      - Triton block sparse FMHA implementation
      - PyTorch reference block sparse FMHA implementation
      - FlashAttention dense FMHA reference implementation
      
      The benchmarks include:
      - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
      - Sparse mask generation using top-k and threshold methods
      - Performance measurement for different sparse attention configurations
      - Utility functions for mask generation and benchmarking
      
      * Refactor block sparse attention benchmarks with code style improvements
      
      - Add Ruff linter ignore comments to benchmark files
      - Improve code formatting and line breaks
      - Remove unused imports
      - Standardize print statement formatting
      - Enhance code readability across multiple library benchmarks
      
      * lint fix
      
      * Add CUDA atomic operations for BFLOAT16 and update function naming
      
      - Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
      - Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
      - Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
      - Update kernel and language customization to use new function names
      - Add return type annotations in profiler module
      
      * lint fix
      
      * Add example for Group Query Attention (GQA) forward pass using Flash Attention in TileLang
      
      This commit introduces a new example script `example_gqa_fwd_bshd.py` that demonstrates:
      - Group Query Attention (GQA) implementation
      - Flash Attention forward pass
      - Performance benchmarking
      - Configurable parameters for batch, heads, sequence length, and dimension
      - Autotuning support
      - Reference implementation comparison
      
      * Refactor IR lowering pipeline into modular phases
      
      This commit introduces a new module `phase.py` to modularize the IR lowering process by splitting the complex lowering pipeline into two distinct phases:
      - `LowerAndLegalize`: Handles initial IR legalization and transformation
      - `OptimizeForTarget`: Applies target-specific optimizations
      
      The changes simplify the lowering logic in multiple files by extracting the transformation steps into reusable functions, improving code readability and maintainability.
      
      * lintfix
      2b97e98a
    • Yu Cheng's avatar
      [Dev] Update MLA decode kernel (#120) · b7ca76f1
      Yu Cheng authored
      b7ca76f1
    • Yu Cheng's avatar
      [Bugfix] Bugfix of pass order for hopper (#117) · 524991fe
      Yu Cheng authored
      524991fe
  11. 24 Feb, 2025 5 commits
    • Lei Wang's avatar
      [Dev] Support vectorized value pack and atomicAdd for BFloat16 DType (#116) · 62843b88
      Lei Wang authored
      * Add DeepSeek MLA decode example with Flash Attention implementation
      
      * Add GEMM SplitK and StreamK example implementations
      
      This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
      - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
      - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
      
      Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
      
      * Refactor GEMM SplitK and StreamK example implementations
      
      Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
      - Remove unused import (Profiler) in splitk example
      - Simplify line breaks and improve code readability
      - Standardize indentation and remove unnecessary whitespace
      - Optimize atomic add and copy operations for better clarity
      
      * Add block sparse attention benchmarks for multiple libraries
      
      This commit introduces comprehensive block sparse attention benchmarks for different libraries:
      - TileLang block sparse FMHA implementation
      - Triton block sparse FMHA implementation
      - PyTorch reference block sparse FMHA implementation
      - FlashAttention dense FMHA reference implementation
      
      The benchmarks include:
      - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
      - Sparse mask generation using top-k and threshold methods
      - Performance measurement for different sparse attention configurations
      - Utility functions for mask generation and benchmarking
      
      * Refactor block sparse attention benchmarks with code style improvements
      
      - Add Ruff linter ignore comments to benchmark files
      - Improve code formatting and line breaks
      - Remove unused imports
      - Standardize print statement formatting
      - Enhance code readability across multiple library benchmarks
      
      * lint fix
      
      * Add CUDA atomic operations for BFLOAT16 and update function naming
      
      - Implement AtomicAdd functions for BFLOAT16 and BFLOAT16x2 in CUDA common header
      - Rename existing atomic add functions to use PascalCase (atomicAdd -> AtomicAdd)
      - Add a new __pack_nv_bfloat162 function for packing BFLOAT16 values
      - Update kernel and language customization to use new function names
      - Add return type annotations in profiler module
      
      * lint fix
      62843b88
    • Lei Wang's avatar
      [Benchmark] Add benchmark scripts for block sparse attention (#114) · f2f67571
      Lei Wang authored
      * Add DeepSeek MLA decode example with Flash Attention implementation
      
      * Add GEMM SplitK and StreamK example implementations
      
      This commit introduces two new example scripts demonstrating advanced GEMM (matrix multiplication) techniques:
      - `example_tilelang_gemm_splitk.py`: Implements a Split-K GEMM kernel using TileLang
      - `example_tilelang_gemm_streamk.py`: Implements a Stream-K GEMM kernel using TileLang
      
      Both examples showcase different parallel computation strategies for matrix multiplication, with comprehensive testing using PyTorch reference implementations.
      
      * Refactor GEMM SplitK and StreamK example implementations
      
      Clean up and improve code formatting for the SplitK and StreamK GEMM example scripts:
      - Remove unused import (Profiler) in splitk example
      - Simplify line breaks and improve code readability
      - Standardize indentation and remove unnecessary whitespace
      - Optimize atomic add and copy operations for better clarity
      
      * Add block sparse attention benchmarks for multiple libraries
      
      This commit introduces comprehensive block sparse attention benchmarks for different libraries:
      - TileLang block sparse FMHA implementation
      - Triton block sparse FMHA implementation
      - PyTorch reference block sparse FMHA implementation
      - FlashAttention dense FMHA reference implementation
      
      The benchmarks include:
      - Configurable benchmark parameters (batch size, heads, sequence length, etc.)
      - Sparse mask generation using top-k and threshold methods
      - Performance measurement for different sparse attention configurations
      - Utility functions for mask generation and benchmarking
      
      * Refactor block sparse attention benchmarks with code style improvements
      
      - Add Ruff linter ignore comments to benchmark files
      - Improve code formatting and line breaks
      - Remove unused imports
      - Standardize print statement formatting
      - Enhance code readability across multiple library benchmarks
      
      * lint fix
      f2f67571
    • Wenhao Xie's avatar
    • Wenhao Xie's avatar
    • Lei Wang's avatar
      34a94d42