Commits · 40faabb1f20843482c16fdfabf8b6d145180a452 · OpenDAS / tilelang

23 Feb, 2025 2 commits

[Dev] Add MLA and GQA decode examples (#109) · 40faabb1

Yu Cheng authored Feb 24, 2025

* [CI][Test] Add test cases for tilelang transform MultiVersionBuffer and WarpSpecialized

* Relax the mismatch ratio restrictions in the flash_linear_attention and mha tests

* [Dev] Add mha backward example

* [Dev] Add mla decode example

* bug fix

* Add triton impl

* Add gqa decode example

* [Dev] Add GQA decode example

* lint

* delete unused triton example

* set default profiler to 'auto'


[Release] Bump version to v0.1.1 (#107) · d79204e5

Lei Wang authored Feb 23, 2025

* Remove Torch CPP backend and update execution backend options

- Remove TorchCPPKernelAdapter and related code from JIT modules
- Update execution backend options in jit/__init__.py, kernel.py, and adapter/__init__.py
- Remove "torch_cpp" from supported execution backend literals
- Simplify backend validation and remove unused torch_cpp-related code

* lint fix

* Add block sparse attention implementations for TileLang and Triton

- Implement block sparse attention kernels for TileLang and Triton
- Add example scripts for block sparse attention with top-k and threshold-based masking
- Include utility functions for generating sparse attention masks
- Demonstrate causal attention with block-level sparsity
- Add test cases to validate sparse attention implementations against PyTorch reference

* Bump version to 0.1.1

* Refactor block sparse attention examples for improved code quality

- Apply consistent code formatting and style in TileLang and Triton block sparse attention implementations
- Add ruff linter ignore comment for specific line in Triton implementation
- Improve readability by adjusting indentation and line breaks
- Standardize sparse mask generation and test function implementations
- Minor optimizations in test case configurations

* lint


22 Feb, 2025 1 commit

[Example] Implement simple block sparse kernel (#106) · c7462abf

Lei Wang authored Feb 23, 2025

* Remove Torch CPP backend and update execution backend options

- Remove TorchCPPKernelAdapter and related code from JIT modules
- Update execution backend options in jit/__init__.py, kernel.py, and adapter/__init__.py
- Remove "torch_cpp" from supported execution backend literals
- Simplify backend validation and remove unused torch_cpp-related code

* lint fix

* Add block sparse attention implementations for TileLang and Triton

- Implement block sparse attention kernels for TileLang and Triton
- Add example scripts for block sparse attention with top-k and threshold-based masking
- Include utility functions for generating sparse attention masks
- Demonstrate causal attention with block-level sparsity
- Add test cases to validate sparse attention implementations against PyTorch reference
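
The top-k and threshold-based masking mentioned above can be illustrated with a small sketch. This is not the code from the example scripts: the function names `topk_block_mask` and `threshold_block_mask`, and the use of plain Python lists for per-block scores, are assumptions made for illustration. Each mask entry says whether a query block may attend to a key block, and the causal constraint hides key blocks that lie after the query block.

```python
from typing import List

Mask = List[List[bool]]


def topk_block_mask(scores: List[List[float]], k: int) -> Mask:
    """Keep the k highest-scoring causal key blocks for each query block.

    scores[i][j] is a (hypothetical) importance score of key block j
    for query block i.
    """
    mask: Mask = []
    for i, row in enumerate(scores):
        # Causal constraint: query block i only sees key blocks 0..i.
        visible = range(i + 1)
        keep = set(sorted(visible, key=lambda j: row[j], reverse=True)[:k])
        mask.append([j in keep for j in range(len(row))])
    return mask


def threshold_block_mask(scores: List[List[float]], threshold: float) -> Mask:
    """Keep every causal key block whose score reaches the threshold."""
    return [
        [j <= i and s >= threshold for j, s in enumerate(row)]
        for i, row in enumerate(scores)
    ]


scores = [
    [0.9, 0.8, 0.7],
    [0.2, 0.6, 0.9],
    [0.5, 0.1, 0.4],
]
topk = topk_block_mask(scores, k=2)
thresh = threshold_block_mask(scores, threshold=0.5)
```

A kernel would then skip every (query block, key block) pair whose mask entry is false, which is where the block-level sparsity saves work.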


21 Feb, 2025 1 commit

[JIT] Support Cython jit and make cython a default execution backend (#102) · 3471904f

Lei Wang authored Feb 21, 2025

* [Feature] Add CTypes JIT kernel support for dynamic shapes and multi-stream execution

- Enhance CtypesKernelAdapter to handle dynamic symbolic shapes
- Add support for multi-stream kernel execution in CTypes backend
- Implement dynamic shape handling in test_tilelang_jit_gemm_ctypes.py
- Add symbolic shape utility function in tilelang.language
- Update profiler to improve flexibility in benchmark selection

* Remove redundant thread binding in GEMM kernel implementations

- Remove unnecessary `thread_binding` line in GEMM kernel functions
- Clean up code in `examples/gemm/README.md` and `testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py`
- Enhance code readability by removing redundant thread binding annotation

* Fix indentation in int4 GEMM kernel test file

- Correct indentation for function calls in `test_tilelang_kernel_int4_gemm_mma.py`
- Remove extra indentation in `mma_emitter.ldmatrix_a()` and `mma_emitter.ldmatrix_b()` calls
- Improve code formatting for better readability

* [Feature] Add Cython JIT kernel support for dynamic shapes and multi-stream execution

- Implement CythonKernelAdapter to handle dynamic symbolic shapes
- Add support for multi-stream kernel execution in Cython backend
- Create comprehensive test suite for Cython GEMM kernel in test_tilelang_jit_gemm_cython.py
- Update JITKernel to include "cython" as a valid execution backend
- Add Cython-specific wrapper and library generation modules
- Update .gitignore to exclude Cython cache directory
- Modify setup.py to include Cython source files in package data

* lint fix

* [Refactor] Replace JITKernel with compile() function for kernel compilation

- Add new `compile()` function in tilelang/jit/__init__.py as a wrapper for JITKernel
- Update multiple test files and examples to use `tilelang.compile()` instead of `tilelang.JITKernel()`
- Modify kernel adapters to support optional kernel-only source retrieval
- Update `__init__.py` to import the new `compile()` function
- Improve kernel source retrieval for different execution backends

* lint fix

* remove debug print

* Add C/C++ compiler utility module and update Cython JIT kernel support

- Introduce new `tilelang/contrib/cc.py` module with cross-platform C/C++ compiler utilities
- Add functions to detect and retrieve system C/C++ compilers
- Implement cross-compilation and shared library creation support
- Update Cython JIT kernel to validate C++ compiler availability
- Modify Cython adapter to use detected C++ compiler for library generation

* Refactor float8 dtype mapping in tensor utility module

- Move float8_dtype_map inside adapt_torch2tvm function
- Simplify global scope by localizing the dtype mapping
- Maintain existing functionality for converting torch float8 tensors to TVM ndarray

* Refactor float8 dtype mapping in tensor utility module

- Move float8_dtype_map inside adapt_torch2tvm function
- Simplify global scope by localizing the dtype mapping
- Maintain existing functionality for converting torch float8 tensors to TVM ndarray

* revert

* Enhance Cython JIT adapter with Cython compiler detection

- Add `get_cython_compiler()` function to dynamically locate Cython executable
- Update Cython adapter to use detected Cython compiler instead of hardcoded command
- Raise an exception if no Cython compiler is found
- Update requirements.txt to specify minimum PyTorch version (>=2.2.0)

* Fix Cython kernel wrapper stream handling and type annotations

- Update stream parameter type to int64_t for better compatibility
- Directly use torch.cuda.current_stream().cuda_stream instead of casting
- Improve type safety and precision in Cython kernel wrapper
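
The compiler-detection steps listed above (locate an executable instead of hardcoding a command, and raise an exception if none is found) can be sketched as follows. This is a minimal illustration, not the repository's `get_cython_compiler()` implementation: the helper name `find_first_executable` and the candidate names in the usage comment are assumptions. It relies only on the standard-library `shutil.which`.

```python
import shutil
from typing import Iterable, Optional


def find_first_executable(candidates: Iterable[str],
                          path: Optional[str] = None) -> str:
    """Return the full path of the first candidate found on the search path.

    `path` is an optional search path string; when it is None,
    shutil.which falls back to the PATH environment variable.
    """
    tried = []
    for name in candidates:
        tried.append(name)
        found = shutil.which(name, path=path)
        if found is not None:
            return found
    # Fail loudly rather than falling back to a hardcoded command.
    raise RuntimeError(f"None of these executables were found: {tried}")


# Hypothetical usage; the candidate names are illustrative only:
# cython_bin = find_first_executable(["cython", "cython3"])
```

Raising early gives a clear error at setup time instead of a confusing failure later when the missing tool is invoked.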


20 Feb, 2025 1 commit

[Feature] Add CTypes JIT kernel support (#100) · 7c817d51

Lei Wang authored Feb 20, 2025

* [Feature] Add CTypes JIT kernel support for dynamic shapes and multi-stream execution

- Enhance CtypesKernelAdapter to handle dynamic symbolic shapes
- Add support for multi-stream kernel execution in CTypes backend
- Implement dynamic shape handling in test_tilelang_jit_gemm_ctypes.py
- Add symbolic shape utility function in tilelang.language
- Update profiler to improve flexibility in benchmark selection

* Remove redundant thread binding in GEMM kernel implementations

- Remove unnecessary `thread_binding` line in GEMM kernel functions
- Clean up code in `examples/gemm/README.md` and `testing/python/kernel/test_tilelang_kernel_int4_gemm_mma.py`
- Enhance code readability by removing redundant thread binding annotation

* Fix indentation in int4 GEMM kernel test file

- Correct indentation for function calls in `test_tilelang_kernel_int4_gemm_mma.py`
- Remove extra indentation in `mma_emitter.ldmatrix_a()` and `mma_emitter.ldmatrix_b()` calls
- Improve code formatting for better readability


11 Feb, 2025 1 commit

[Dev] Add mha backward example (#77) · a6fe61e2

Yu Cheng authored Feb 12, 2025

* [CI][Test] Add test cases for tilelang transform MultiVersionBuffer and WarpSpecialized

* Relax the mismatch ratio restrictions in the flash_linear_attention and mha tests

* [Dev] Add mha backward example


10 Feb, 2025 1 commit

[Dev] Remove unnecessary python dependencies (#69) · 2411fa28

Lei Wang authored Feb 10, 2025

* [Enhancement] Add VectorizeLoop function and update imports for compatibility

* [CI][Test] Improve test cases for vectorization and fix typos in parser comments

* lint fix

* Fix incorrect module reference for VectorizeLoop transformation

* Refactor vectorize_loop transformation by removing unused extent mutation logic

* [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen

* Fix formatting in CUDA FP8 header file for consistency

* Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity

* Update submodule 'tvm' to latest commit for improved functionality

* Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.

* Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.

* Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency

* Add CUDA requirements to FP8 test cases and update references for clarity

* Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py

* Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py

* Add CUDA requirements and FP8 test cases for matmul and gemv simulations

* Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py

* Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py

* Add BF16 support to matrix multiplication and introduce corresponding test cases

* Add a blank line for improved readability in BF16 GEMM test

* Update acknowledgements in README to include supervision by Zhi Yang at Peking University

* enhance acknowledgement

* Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives

* Update subproject commit for TVM dependency

* Update subproject commit for TVM dependency

* Add int4_t type and functions for packing char values in CUDA common header

* Add plot_layout example and implement GetForwardVars method in layout classes

* Refactor code for improved readability by adjusting line breaks and formatting in layout and test files

* Fix formatting by removing unnecessary line break in layout.h

* Refactor make_int4 function for improved readability by adjusting parameter formatting

* Add legend to plot_layout for improved clarity of thread and local IDs

* Remove unnecessary dependencies from requirements files for cleaner setup

* Remove flash_mha.py and add .gitkeep to deepseek_mla directory

* Add build requirements and update installation scripts for improved setup


09 Feb, 2025 1 commit

[Tools] Introduce `plot_layout` to visualize the fragment layout (#68) · f9b6a92e

Lei Wang authored Feb 10, 2025

* [Enhancement] Add VectorizeLoop function and update imports for compatibility

* [CI][Test] Improve test cases for vectorization and fix typos in parser comments

* lint fix

* Fix incorrect module reference for VectorizeLoop transformation

* Refactor vectorize_loop transformation by removing unused extent mutation logic

* [Enhancement] Add support for FP8 data types and global barriers in CUDA codegen

* Fix formatting in CUDA FP8 header file for consistency

* Refactor CI workflow to use 'tilelang_ci' virtual environment and update CUDA type printing for better clarity

* Update submodule 'tvm' to latest commit for improved functionality

* Refactor execution backend references from 'dl_pack' to 'dlpack' for consistency and clarity; add apply_simplify function to simplify PrimFunc or IRModule.

* Refactor CUDA code for improved readability; clean up formatting and remove unnecessary whitespace in multiple files.

* Refactor import statement in test_tilelang_kernel_dequantize_gemm.py to use 'tilelang.language' for consistency

* Add CUDA requirements to FP8 test cases and update references for clarity

* Add a blank line for improved readability in test_tilelang_kernel_fp8_gemm_mma.py

* Fix data type in reference result calculation for consistency in test_tilelang_kernel_gemm_mma_intrinsic.py

* Add CUDA requirements and FP8 test cases for matmul and gemv simulations

* Remove debug print statements and use tilelang's testing assertion for result validation in test_tilelang_kernel_gemm_mma_intrinsic.py

* Remove outdated comment regarding FP8 tests in test_tilelang_kernel_gemv_simt.py

* Add BF16 support to matrix multiplication and introduce corresponding test cases

* Add a blank line for improved readability in BF16 GEMM test

* Update acknowledgements in README to include supervision by Zhi Yang at Peking University

* enhance acknowledgement

* Replace tutorial on memory layout optimization with new tutorial on writing high-performance kernels with thread primitives

* Update subproject commit for TVM dependency

* Update subproject commit for TVM dependency

* Add int4_t type and functions for packing char values in CUDA common header

* Add plot_layout example and implement GetForwardVars method in layout classes

* Refactor code for improved readability by adjusting line breaks and formatting in layout and test files

* Fix formatting by removing unnecessary line break in layout.h

* Refactor make_int4 function for improved readability by adjusting parameter formatting


25 Jan, 2025 5 commits

[CI][Test] Add test cases for element_add (#47) · f944b79e

Cunxiao Ni authored Jan 26, 2025

* [CI][Test] Add test cases for element_add

* [Doc] fix typo

* Parallelization

* format

* remove useless condition

* format


[CI][Test] Add test cases for tilelang kernel FlashAttention (#54) · bedab1a0

Yu Cheng authored Jan 26, 2025

* [Dev] Add FlashDecoding example

* [CI][Test] Add test cases for tilelang kernel convolution

* [CI][Test] Add test cases for tilelang kernel FlashAttention

* Reduce the number of stages to ensure the shared memory allocation is valid

* Temporarily remove the dim128 case

* lint

* update einops in requirements-dev.txt

* update einops in requirements-test.txt

* remove einops in requirements-dev.txt


[CI][Test] Add test cases for tilelang kernel convolution (#51) · 34de04a6

Yu Cheng authored Jan 25, 2025

* [CI][Test] Add test cases for tilelang kernel convolution

[Doc] Remove unnecessary layout annotation (#49) · 47ecc791

Lei Wang authored Jan 25, 2025

* [Doc] Update documentation structure and content: add overview section, revise project name, and change theme to Furo

* [Feature] Add device-side debug printing functions and integrate into kernel interface

* lint fix

* remove debug print

* implement test for debug

* lint fix

* add some comments

* Enhance fragment design and assert fragment print

* enhance debug print

* add test for msg

* lint fix

* format

* add flash decoding examples

* remove comment

* test simplified


[Dev] Add FlashDecoding example (#46) · cc08ba50

Yu Cheng authored Jan 25, 2025

23 Jan, 2025 2 commits

[Refactor] Simplify interface via replacing argument thread binding of intrinsics with `KernelFrame.Current` (#34) · 362b3520

Lei Wang authored Jan 23, 2025

* installation script fix

* readme typo fix

* doc fix for dequantize gemm

* [Doc] remove CODE_OF_CONDUCT.md and SECURITY.md; update references in CONTRIBUTING.md

* [Doc] add unit tests for AnnotateDeviceRegions transform; remove SUPPORT.md

* update license

* [Enhancement] add tensor supply handling for unsigned integers; improve error message for execution backend assertion

* [Refactor] improve code readability by reformatting function signatures and assertions

* [Refactor] replace torch.manual_seed with tilelang.testing.set_random_seed for consistency in random seed handling

* [Refactor] unify thread binding variable naming across kernel and example files

* [Refactor] remove unused thread binding parameter from matrix multiplication functions

* [Refactor] remove unused thread binding parameter from matrix multiplication functions

* [Refactor] enable main testing function in tilelang kernel gemm test

* bug fix


[CI] Comprehensive Test cases Implementation of Matmul Dequantize (#32) · 7959d786

Lei Wang authored Jan 23, 2025

* installation script fix

* readme typo fix

* doc fix for dequantize gemm

* [Doc] remove CODE_OF_CONDUCT.md and SECURITY.md; update references in CONTRIBUTING.md

* [Doc] add unit tests for AnnotateDeviceRegions transform; remove SUPPORT.md

* update license

* [Enhancement] add tensor supply handling for unsigned integers; improve error message for execution backend assertion

* [Refactor] improve code readability by reformatting function signatures and assertions

* [Refactor] replace torch.manual_seed with tilelang.testing.set_random_seed for consistency in random seed handling


20 Jan, 2025 2 commits

[Doc] Fix installation scripts and docs for dequantize gemm (#20) · 7b777b38

Lei Wang authored Jan 21, 2025

* installation script fix

* readme typo fix

* doc fix for dequantize gemm

[Dev][jit] Introduce jit for kernel functions (#12) · 39fc5a6d

Lei Wang authored Jan 20, 2025

* instruction update

* replace link with TileLang/tile-lang

* [Dev][Adapter] Implement Torch DLPack Kernel Adapter and related utilities

* lint fix

* Implement JIT Compiler Components

* Documents update

* lint fix

* update logo

* install script fix


11 Jan, 2025 2 commits

[Lint] Overall Typo and Linting Fixes (#13) · fa511857

Lei Wang authored Jan 11, 2025

* README.md fixed

* update test ci

* Lint and Typo Fix

* Clang Format Lint Fix

[Initialization] Migration of Codebase from Dev Branch into Main (#10) · 57ab687c

Lei Wang authored Jan 11, 2025



* Add format.sh script for code formatting and linting

* docs update

* center align the title

* lint fix

* add ignore

* Add .gitignore for 3rdparty directory

* Add requirements-dev.txt, requirements-test.txt, and requirements.txt

* 3rdparty

* Add gemm.h, CMakeLists.txt, _ffi_api.py, __init__.py, runtime.h, reduce.h, loop_partition.h, utils.h, and loop_vectorize.h

* Refactor CMakeLists.txt and include statements

- Update CMakeLists.txt to use a newer version of CMake and add project name
- Remove unnecessary include directories

Fix include paths in layout.cc, codegen.cc, codegen.h, rt_mod.cc, frontend_legalize.cc, inject_pipeline.cc, layout_inference.cc, loop_vectorize.cc, and lower_tile_op.cc

- Update include paths to use relative paths instead of absolute paths

* Update submodule for 3rdparty/tvm

* update

* load dll first

* Refactor CMakeLists.txt and include statements

* Refactor CMakeLists.txt and include statements

* git keep update

* Refactor CMakeLists.txt and include statements

* Refactor CMakeLists.txt and include statements

* refactor code structure

* Update Readme

* CMakeLists Customized

* update readme

* update README

* update readme

* update usage

* with TVM_IMPORT_PYTHON_PATH to handle own tvm build python import

* annotate lower transform global func with `transform` prefix

* Migrate Simplify Pass from tilelang tvm branch

* enhance system environment handling with __init__ and CMake

* Initial commit

* CODE_OF_CONDUCT.md committed

* LICENSE committed

* README.md committed

* SECURITY.md committed

* SUPPORT.md committed

* CODE_OF_CONDUCT Commit

* LICENSE Commit

* SECURITY Commit

* SUPPORT Commit

* Modify Support

* Update README.md

* security ci update

* remove examples

* Update and implement clang-format

* add composable kernel components

* Migrate from latest update

* submodule update

* Test update

* Update License

* Spell check

* lint fix

* add clang-tidy to apply static analysis for c source

* update tilelang examples

* Update Install Docs

* Refactor filetree

* Enhance Install

* conflict resolved

* annotate_version

* Initial Update

* test fix

* install

* Implement setup.py

* lint fix

* Separate Init

* Separate test

* docker file commit

* add logo

* Update Readme and Examples

* update readme

* update logo

* Implement AMD Installation

* Add License

* Update AMD MI300x Benchmark

* update README

* update mi300 benchmark scripts

* update ignore

* enhance build script

* update image

* enhance setup.py to remove duplicated libraries

* remove debug files

* update readme

* update image

* update gemm examples

* update flashattention README

* readme update

* add cmake into requirements

* libinfo fix

* auto update submodule

* lint fix

* Fix AMD Build and Test

* Update check for transpose attribute for CDNA Arch

* typo fix for amd

* Implement Matmul Benchmark

* Refactor Code

* [TypoFix] Fix GEMM Example

* [Docs] Init Linear Attention README

* [TYPO] Typo fix

* [Lint] Lint Fix

* enhance example with intrinsics

* [Enhancement] Improve Buffer Collection during IR Parser

* [Dev] Introduce Current classmethod to get current frame

* submodule update

* fake test pass update

* support thread_extent_api

* code optimize

* Add GEMM function implementation for matrix multiplication

* Update logging format to reflect TileLang in logger messages

* Refactor CMakeLists.txt for improved readability and set default build type to Release

* Support Gemm SS Primitives Implementation

* [README] Upload Tile Language Logo (#5)

* update logo

* Update README.md to enhance formatting and center the title

---------
Co-authored-by: microsoft-github-operations[bot] <55726097+microsoft-github-operations[bot]@users.noreply.github.com>
Co-authored-by: Microsoft Open Source <microsoftopensource@users.noreply.github.com>
Co-authored-by: Yu Cheng <yu.cheng@pku.edu.cn>
