Commits · e89e8b6cec478cae89928660905443fd5955fd8d · OpenDAS / tilelang

04 Mar, 2025 1 commit

[Doc] Add MLA Decoding Performance Benchmarks and Documentation (#137) · e89e8b6c

Yu Cheng authored Mar 04, 2025

- Update news and MLA performance benchmark in README.md
- Move performance benchmark and layout images to a dedicated 'figures' directory
- Improve code formatting and image references in documentation

e89e8b6c

03 Mar, 2025 1 commit

[Debug] Improve Memory Layout Plot (#136) · e32311b2

Lei Wang authored Mar 04, 2025

* Change default log level from WARNING to INFO in TileLang initialization

* Refactor Flash Attention Variable-Length MHA Example with Cython Backend Support

- Update `example_mha_fwd_varlen.py` to use Cython backend for kernel compilation
- Remove unused imports and simplify function signature
- Modify `flashattn` function to handle max sequence length as a separate argument
- Update kernel call to include max sequence length parameter
- Improve code readability and remove commented-out code
- Add print statement to confirm successful assertion

* Refactor code formatting in TileLang lowering and example files

- Improve line breaks and code formatting in `lower.py`, `wrapper.py`, and `tensor.py`
- Simplify line breaks and reduce unnecessary whitespace
- Enhance code readability by adjusting indentation and line breaks
- Update example MHA forward pass script with cleaner tensor initialization

* Update TileLang kernel test with import path changes for MMA layout and macro generator

- Modify import statements in test_tilelang_kernel_dequantize_gemm.py
- Replace bitblas imports with tilelang.intrinsics imports for MMA-related utilities
- Update main function to use tilelang.testing.main()

* Add Block Sparse Attention Examples for TileLang and Triton

- Implement block sparse attention kernels for both TileLang and Triton
- Add utility functions for generating sparse attention masks using top-k and threshold methods
- Support causal and variable-length attention scenarios
- Include test cases for different sequence length configurations
- Demonstrate block-level sparse attention with configurable parameters

* Refactor Block Sparse Attention Examples with Code Style Improvements

- Improve code formatting in block_sparse_attn_tilelang.py and block_sparse_attn_triton.py
- Enhance readability by adjusting line breaks and indentation
- Simplify kernel and function calls with better formatting
- Add whitespace and line break improvements for better code clarity

* Enhance Layout Plotting with Multi-Replication and Dynamic Visualization

- Update plot_layout function to support multiple replications in thread and value mapping
- Improve thread and value mapping to handle replicated layouts
- Dynamically adjust figure size and legend positioning
- Add print statements for saved plot file paths
- Modify example fragment_mma_load_a.py to uncomment and enable warp and block layout plotting

e32311b2

09 Feb, 2025 1 commit

[Tools] Introduce `plot_layout` to visualize the fragment layout (#68) · f9b6a92e

Lei Wang authored Feb 10, 2025

* [Enhancement] Add VectorizeLoop function and update imports for compatibility

* [CI][Test] Improve test cases for vectorization and fix typos in parser comments

* lint fix

* Fix incorrect module reference for VectorizeLoop transformation

* Refactor vectorize_loop transformation by removing unused extent mutation logic

* [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen

* Fix formatting in CUDA FP8 header file for consistency

* Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity

* Update submodule 'tvm' to latest commit for improved functionality

* Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.

* Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.

* Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency

* Add CUDA requirements to FP8 test cases and update references for clarity

* Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py

* Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py

* Add CUDA requirements and FP8 test cases for matmul and gemv simulations

* Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py

* Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py

* Add BF16 support to matrix multiplication and introduce corresponding test cases

* Add a blank line for improved readability in BF16 GEMM test

* Update acknowledgements in README to include supervision by Zhi Yang at Peking University

* enhance acknowledgement

* Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives

* Update subproject commit for TVM dependency

* Update subproject commit for TVM dependency

* Add int4_t type and functions for packing char values in CUDA common header

* Add plot_layout example and implement GetForwardVars method in layout classes

* Refactor code for improved readability by adjusting line breaks and formatting in layout and test files

* Fix formatting by removing unnecessary line break in layout.h

* Refactor make_int4 function for improved readability by adjusting parameter formatting

f9b6a92e