Tile Language (**tile-lang**) is a concise domain-specific language designed to streamline the development of high-performance GPU/CPU kernels (e.g., GEMM, Dequant GEMM, FlashAttention, LinearAttention). By employing a Pythonic syntax with an underlying compiler infrastructure on top of [TVM](https://tvm.apache.org/), tile-lang allows developers to focus on productivity without sacrificing the low-level optimizations necessary for state-of-the-art performance.
## Tested Devices
Although tile-lang aims to be portable across a range of devices, it has been specifically tested and validated on the following:
- **NVIDIA GPUs**:
  - H100 (**with Auto TMA/WGMMA support**)
  - A100
  - V100
  - RTX 4090
  - RTX 3090
  - RTX A6000
- **AMD GPUs**:
  - MI250 (**with Auto MatrixCore support**)
  - MI300X (**with Async Copy support**)
## OP Implementation Examples
**tile-lang** provides the building blocks to implement a wide variety of operators. Some examples include:
...
@@ -35,16 +25,16 @@ Within the `examples` repository, you will also find additional complex kernels
TileLang achieves exceptional performance across a variety of computational patterns. Below are selected results showcasing its capabilities:
- Operator Performance Vs. Baselines on H100
- Flash Attention Performance on H100
<div>
<div>
<img src="./images/op_benchmark_h100.png" alt="operator performance on H100" />
<img src="./images/mha_performance_h100.png" alt="operator performance on H100" />
</div>
</div>
- MatrixCore FP16 GEMM Performance Vs. Baselines on MI300X
- Matmul Performance on GPUs (RTX 4090, A100, H100, MI300X)
<div>
<div>
<img src="./images/op_benchmark_mi300_fp16_gemm_normalized_latency.png" alt="gemm fp16 performance on MI300X" />
<img src="./images/op_benchmark_consistent_gemm_fp16.png" alt="gemm fp16 performance on Gpus" />