Commits · 37d44f243572952768cce11cf29ed094045d6656 · OpenDAS / tilelang

05 Mar, 2025 1 commit
[Refactor] Rename gemm fp8 example as we currently lack `T.gemm` support for fp8 (#144) · 37d44f24
Lei Wang authored Mar 05, 2025
* Change default log level from WARNING to INFO in TileLang initialization

* Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support

- Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
- Remove unused imports and simplify function signature
- Modify `flashattn` function to handle max sequence length as a separate argument
- Update kernel call to include max sequence length parameter
- Improve code readability and remove commented-out code
- Add print statement to confirm successful assertion
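
As an illustration of why the max sequence length is passed separately: variable-length attention kernels commonly pack all sequences back to back and describe them with cumulative offsets plus a single maximum length used to size the launch grid. The sketch below assumes that convention; `pack_varlen_metadata`, `cu_seqlens`, and `max_seqlen` are hypothetical names, not taken from `example_mha_fwd_varlen.py`.

```python
from itertools import accumulate


def pack_varlen_metadata(seq_lens):
    """Return (cu_seqlens, max_seqlen) for sequences packed back to back.

    cu_seqlens[i] is the start offset of sequence i in the packed buffer and
    cu_seqlens[-1] is the total token count. Hypothetical helper for illustration.
    """
    if not seq_lens or any(n < 0 for n in seq_lens):
        raise ValueError("seq_lens must be a non-empty list of non-negative ints")
    cu_seqlens = [0] + list(accumulate(seq_lens))
    max_seqlen = max(seq_lens)
    return cu_seqlens, max_seqlen


cu_seqlens, max_seqlen = pack_varlen_metadata([3, 5, 2])
```

Computing `max_seqlen` once on the host keeps it available as a plain scalar argument rather than something the kernel has to derive from the offsets.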

* Refactor code formatting in TileLang lowering and example files

- Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
- Simplify line breaks and reduce unnecessary whitespace
- Enhance code readability by adjusting indentation and line breaks
- Update example MHA forward pass script with cleaner tensor initialization

* Update TileLang kernel test with import path changes for MMA layout and macro generator

- Modify import statements in `test_tilelang_kernel_dequantize_gemm.py`
- Replace `bitblas` imports with `tilelang.intrinsics` imports for MMA-related utilities
- Update main function to use `tilelang.testing.main()`

* Add Block Sparse Attention Examples for TileLang and Triton

- Implement block sparse attention kernels for both TileLang and Triton
- Add utility functions for generating sparse attention masks using top-k and threshold methods
- Support causal and variable-length attention scenarios
- Include test cases for different sequence length configurations
- Demonstrate block-level sparse attention with configurable parameters
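
The two mask-generation strategies mentioned above can be sketched in plain Python. This is a minimal illustration, assuming a per-block score matrix indexed as `scores[query_block][key_block]`; the function names and the `causal` flag are hypothetical and do not mirror the utilities added in this commit.

```python
def topk_block_mask(scores, k, causal=False):
    """Keep the k highest-scoring key blocks for each query block."""
    mask = []
    for q, row in enumerate(scores):
        # Under causal masking a query block may only attend to key blocks <= q.
        allowed = [j for j in range(len(row)) if not causal or j <= q]
        keep = set(sorted(allowed, key=lambda j: row[j], reverse=True)[:k])
        mask.append([j in keep for j in range(len(row))])
    return mask


def threshold_block_mask(scores, threshold, causal=False):
    """Keep every key block whose score is at least `threshold`."""
    return [
        [(s >= threshold) and (not causal or j <= q) for j, s in enumerate(row)]
        for q, row in enumerate(scores)
    ]


scores = [
    [0.9, 0.1, 0.5],
    [0.2, 0.8, 0.7],
    [0.3, 0.6, 0.4],
]
topk_mask = topk_block_mask(scores, k=2)
causal_mask = topk_block_mask(scores, k=2, causal=True)
thresh_mask = threshold_block_mask(scores, threshold=0.5)
```

Top-k gives every query block the same amount of work, while a threshold lets the sparsity vary with the data; which one suits a kernel depends on how it balances load across blocks.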

* Refactor Block Sparse Attention Examples with Code Style Improvements

- Improve code formatting in `block_sparse_attn_tilelang.py` and `block_sparse_attn_triton.py`
- Enhance readability by adjusting line breaks and indentation
- Simplify kernel and function calls with better formatting
- Add whitespace and line break improvements for better code clarity

* Enhance Layout Plotting with Multi-Replication and Dynamic Visualization

- Update `plot_layout` function to support multiple replications in thread and value mapping
- Improve thread and value mapping to handle replicated layouts
- Dynamically adjust figure size and legend positioning
- Add print statements for saved plot file paths
- Modify example `fragment_mma_load_a.py` to enable warp and block layout plotting by uncommenting the relevant code

* Refactor AtomicAdd functions in CUDA common header

- Implement a generic template for the `AtomicAdd` function
- Specialize templates for `half_t`, `bfloat16_t`, and pointer types
- Reorganize and clean up existing AtomicAdd implementations
- Improve type handling and conversion in atomic operations

* Remove unused import in MHA backward test file

- Remove unnecessary `argparse` import from `test_tilelang_kenrel_mha_bwd.py`
- Add blank line for improved code formatting
- Minor code cleanup in test file

* Add FP8 GEMM Example with TensorCore Intrinsics

- Implement a new example for FP8 matrix multiplication using TensorCore intrinsics
- Support E4M3 and E5M2 floating-point 8-bit data types
- Add README with notes on current FP8 implementation limitations
- Include correctness test for FP8 GEMM with different configurations
- Demonstrate swizzle layout and pipeline optimizations for FP8 computation
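
For readers unfamiliar with the two 8-bit formats, the sketch below decodes a raw byte into a Python float. It assumes the OCP FP8 definitions (E4M3 with bias 7, no infinities, and a single NaN pattern per sign; E5M2 with bias 15 and IEEE-style infinities and NaNs). The commit does not state which variant the example targets, and the function names here are hypothetical.

```python
import math


def decode_fp8(byte, exp_bits, man_bits, bias, ieee_like):
    """Decode one FP8 byte given its field widths and exponent bias."""
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp = (byte >> man_bits) & ((1 << exp_bits) - 1)
    man = byte & ((1 << man_bits) - 1)
    exp_max = (1 << exp_bits) - 1
    if ieee_like and exp == exp_max:
        # E5M2 (assumed IEEE-like): all-ones exponent encodes inf or NaN.
        return sign * math.inf if man == 0 else math.nan
    if not ieee_like and exp == exp_max and man == (1 << man_bits) - 1:
        # E4M3 (assumed OCP variant): only the all-ones pattern is NaN.
        return math.nan
    if exp == 0:
        # Subnormal: no implicit leading one.
        return sign * (man / (1 << man_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + man / (1 << man_bits)) * 2.0 ** (exp - bias)


def decode_e4m3(byte):
    return decode_fp8(byte, exp_bits=4, man_bits=3, bias=7, ieee_like=False)


def decode_e5m2(byte):
    return decode_fp8(byte, exp_bits=5, man_bits=2, bias=15, ieee_like=True)


e4m3_max = decode_e4m3(0x7E)
e5m2_max = decode_e5m2(0x7B)
```

Under these assumptions E4M3 trades range for an extra mantissa bit, while E5M2 covers a much wider range with coarser steps.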