Auto-tuning a Tile Language program involves three main steps:
1. Implement the target program using Tile Language with reserved optimization parameters
2. Provide candidate configurations through manual search or [auto-generation using Carver](#using-carver-to-auto-generate-candidate-configurations)
3. Parallel compile and benchmark candidate configurations to identify the best performance
## Matrix Multiplication Example
The following example demonstrates auto-tuning matrix multiplication. The code has been simplified for readability; see `examples/gemm/example_gemm.py` for the complete implementation.
### Step 1: Implement with Reserved Parameters
Users can implement matrix multiplication in Tile Language while reserving parameters for optimization:
```python
# Reserved parameters for optimization
def kernel(
    block_M=None,
    block_N=None,
    block_K=None,
    num_stages=None,
    thread_num=None,
    enable_rasteration=None,
):
    dtype = "float16"
    accum_dtype = "float"

    # Matrix multiplication implementation
    @T.prim_func
    def main(
        A: T.Buffer((M, K), dtype),
        B: T.Buffer((N, K), dtype),
        C: T.Buffer((M, N), dtype),
    ):
        # ...existing code...

    return main
```
### Step 2: Generate Candidate Configurations
Manually define configurations or use combinatorial generation:
```python
configs = [
    {
        "block_M": 128,
        "block_N": 128,
        "block_K": 128,
        "num_stages": 3,
        "thread_num": 128,
        "enable_rasteration": True,
    },
    {
        "block_M": 32,
        "block_N": 32,
        "block_K": 32,
        "num_stages": 0,
        "thread_num": 32,
        "enable_rasteration": False,
    },
    # ...additional configurations...
]
```
The configuration list can also be generated by a combinatorial traversal of the parameter space:
```python
import itertools

block_M = [64, 128, 256]
block_N = [64, 128, 256]
block_K = [32, 64]
num_stages = [0, 1, 2, 3]
thread_num = [128, 256]
enable_rasterization = [True, False]

_configs = list(
    itertools.product(
        block_M,
        block_N,
        block_K,
        num_stages,
        thread_num,
        enable_rasterization,
    ))

configs = [
    {
        "block_M": c[0],
        "block_N": c[1],
        "block_K": c[2],
        "num_stages": c[3],
        "thread_num": c[4],
        "enable_rasteration": c[5],
    } for c in _configs
]
```
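Not every point in this Cartesian product is worth benchmarking. A common refinement is to prune configurations that cannot possibly fit the hardware before compiling anything. The sketch below assumes a 48 KB shared-memory budget, float16 operands, and double-buffered A/B tiles; these numbers are illustrative, not TileLang constants:

```python
from itertools import product

block_M = [64, 128, 256]
block_N = [64, 128, 256]
block_K = [32, 64]

# Hypothetical pruning rule: reject tiles whose operand staging would
# exceed the shared-memory budget. 2 bytes per float16 element, times 2
# for double buffering, for one A tile plus one B tile.
def fits_shared_memory(m, n, k, budget_bytes=48 * 1024):
    tile_bytes = 2 * 2 * (m * k + n * k)
    return tile_bytes <= budget_bytes

configs = [
    {"block_M": m, "block_N": n, "block_K": k}
    for m, n, k in product(block_M, block_N, block_K)
    if fits_shared_memory(m, n, k)
]
```

This keeps the candidate set small enough that exhaustive compilation and benchmarking stays cheap.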
### Step 3: Compile and Benchmark
Configure JIT compilation and benchmarking settings:
The result object contains the optimized kernel implementation, which can be used directly.
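The compile-and-benchmark loop itself follows a standard pattern: build each candidate, time it over several iterations after a warmup, and keep the fastest. The sketch below uses a hypothetical `build_kernel` as a stand-in for the Tile Language JIT compiler, so only the harness structure is meaningful:

```python
import time

# Hypothetical stand-in for JIT-compiling one candidate configuration;
# in a real flow this would invoke the Tile Language compiler.
def build_kernel(config):
    def kernel():
        # Simulated workload whose cost depends on the tile size.
        return sum(range(config["block_M"] * 64))
    return kernel

def benchmark(kernel, warmup=2, iters=10):
    # Warm up to exclude one-time costs (JIT, caches) from the timing.
    for _ in range(warmup):
        kernel()
    start = time.perf_counter()
    for _ in range(iters):
        kernel()
    return (time.perf_counter() - start) / iters  # mean latency in seconds

configs = [{"block_M": m} for m in (64, 128, 256)]
results = [(benchmark(build_kernel(c)), c) for c in configs]
best_latency, best_config = min(results, key=lambda r: r[0])
```

In practice the candidate builds are independent, so they can be compiled in parallel worker processes before the sequential benchmarking pass.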
## Using Carver to Auto-Generate Candidate Configurations
Carver is a lightweight framework for generating and ranking tile configurations (also known as tiling strategies, blocking schemes, or scheduling hints) for common GPU, CPU, and accelerator backends. It helps you explore efficient mappings of loops for operations such as matrix multiplication, elementwise transforms, and other reduction-oriented kernels.
For common operators, Carver provides pre-built templates (e.g., `MatmulTemplate`):
A Tile Language program (hereafter referred to as a *program*) is transformed into a hardware-executable file through several stages:
1. The user writes a Tile Language program.
2. The program undergoes multiple *Passes* for transformation and optimization (the *lower* stage, see `tilelang/engine/lower.py`), finally producing an intermediate representation (e.g., LLVM or C for CPU, CUDA for NVIDIA GPUs, etc.).
3. The generated code is compiled by the respective compiler (e.g., nvcc) into a hardware-executable file.
```{figure} ../_static/img/overview.png
:width: 300
:alt: Overview of the compilation process
:align: center
```
During this process, users may encounter roughly three categories of issues:
- **Generation issues**: The Tile Language program fails to generate a valid hardware-executable file (i.e., errors during the lowering process).
- **Correctness issues**: The resulting executable runs, but produces incorrect results.
- **Performance issues**: The executable runs with performance significantly below the expected theoretical hardware limits.
This tutorial focuses on the first two issues—how to debug generation and correctness problems. Performance tuning often requires using vendor-provided profiling tools (e.g., **Nsight Compute**, **rocProf**, etc.) for further hardware-level analysis, which we will address in future materials.
Below, we take matrix multiplication (GEMM) as an example to demonstrate how to write and debug a Tile Language program.
## Matrix Multiplication Example
In **Tile Language**, you can use the **Tile Library** to implement matrix multiplication. Here's a complete example:
```python
# 1. Define the kernel (matmul) with the desired dimensions
func = matmul(1024, 1024, 1024, 128, 128, 32)

# 2. Compile the kernel into a torch function
# ...existing code...
```
## Debugging Generation Issues
TileLang essentially performs *progressive lowering*. For example, a `T.copy` may first be expanded into `T.Parallel` (see the pass `LowerTileOP`), which is then expanded again, eventually resulting in lower-level statements that can be translated to CUDA C code.
When the code fails to generate (for instance, a compilation error occurs), you do **not** necessarily need to jump directly into C++ passes to debug. Instead, you can first inspect the intermediate representations (IR) in Python by printing them.
For example, consider a case where a simple `T.copy` in 1D causes the lowering process to fail. The snippet below illustrates a simplified version of the problem (based on community Issue #35):
```python
@T.prim_func
def main(Q: T.Tensor(shape_q, dtype)):
# ...existing code...
```
The TileLang lower process might yield an error such as:
```text
File "/root/TileLang/src/target/codegen_cuda.cc", line 1257
ValueError: Check failed: lanes <= 4 (8 vs. 4) : Ramp of more than 4 lanes is not allowed.
```
This indicates that somewhere during code generation, an unsupported vectorization pattern was introduced (a ramp of 8 lanes). Before diving into the underlying C++ code, it is helpful to print the IR right before code generation. For instance:
## Debugging Correctness Issues
Sometimes, the kernel compiles and runs but produces incorrect results. In such cases, there are two main strategies to help debug:
1. **Use post-processing callbacks to inspect or modify the generated CUDA code.**
2. **Use the built-in `T.print` debugging primitive to inspect values at runtime.**
### Post-Processing Callbacks for Generated Source
After code generation (in the codegen pass), TileLang calls a callback function (if registered) to allow post-processing of the generated source code. In `src/target/rt_mod_cuda.cc`:
```python
# ...existing code (define and register the postproc callback)...
    code = "// modified by tilelang_callback_cuda_postproc\n" + code
    return code

kernel = tilelang.compile(matmul, target="cuda")
kernel_source = kernel.get_kernel_source()
print(kernel_source)
'''
// modified by tilelang_callback_cuda_postproc
#include "cuda_runtime.h"
...
'''
```
### Runtime Debug Prints with `T.print`
TileLang provides a built-in debugging primitive called `T.print` for printing within kernels. Be mindful of concurrency and thread synchronization when using it in GPU code. Below are some examples showing how to print buffers, variables, and other data inside TileLang programs.
The **Visual Layout Inference** tool automatically generates visual diagrams that illustrate the mapping between logical indices, thread IDs, and register file locations.
When TileLang performs layout inference, it determines how fragment buffers are distributed across threads. The visual layout tool captures this information and generates:
1. **Textual output**: A human-readable description of the layout mapping
2. **Visual diagrams**: Color-coded plots showing the thread-to-data mapping
The visual layout inference tool is controlled through the `TL_LAYOUT_VISUALIZATION_ENABLE` and `TL_LAYOUT_VISUALIZATION_FORMATS` pass configuration. By default, `TL_LAYOUT_VISUALIZATION_ENABLE` is **disabled** to avoid performance overhead during compilation.
When enabled, `TL_LAYOUT_VISUALIZATION_FORMATS` accepts string values to control output formats:
- "txt": Text output only (same as default)
- "all": Generates all formats (TXT, PDF, PNG, SVG)
- "png": Generate PNG format only
- "pdf": Generate PDF format only
- "svg": Generate SVG format only
- "txt,svg": Generate multiple formats (comma-separated) in addition to text output
The "txt" output will include something like:
By carefully examining intermediate representations (IR) before final code generation—and by leveraging runtime printing through `T.print`—one can quickly diagnose where index calculations, copy logic, or other kernel operations deviate from the intended behavior. This two-pronged approach (inspecting IR transformations and using runtime prints) is often sufficient for resolving generation and correctness issues in TileLang programs.
For advanced performance tuning (e.g., analyzing memory bandwidth or occupancy), more specialized profiling tools such as **Nsight Compute**, **rocProf**, or vendor-specific profilers may be required. Those aspects will be covered in future documents.
The design style is inspired by [Google's glog](https://google.github.io/glog/stable/).
## Logging Categories
There are three primary macro types:
```c++
LOG(INFO) << "aaa";
DLOG(INFO) << "aaa";
VLOG(1) << "aaa";
```
- **LOG**: Standard logging preserved in code for displaying necessary information at different levels during runtime. Most Tilelang C++ error reporting is implemented via `LOG(FATAL) << "error msg"`.
- **DLOG**: Debug logging for developer debugging output. DLOG is controlled at build time by the `TVM_LOG_DEBUG` macro and is **eliminated in Release builds through dead code elimination**.
  - The key difference between LOG(DEBUG) and DLOG is this build-time elimination. We recommend using DLOG over LOG(DEBUG), as the latter has overlapping functionality and gets compiled into the release runtime.
- **VLOG**: [Verbose logging](https://google.github.io/glog/stable/logging/#verbose-logging), primarily for debugging. Its main feature is customizable verbosity levels: VLOG(n), where n can be 1 through 6, enables complex tracing requirements. In contrast, LOG and DLOG use predefined levels such as INFO and DEBUG.
  - In practical Tilelang development, VLOG is used less frequently.
  - TVM's VLOG is implemented using DLOG, thus inheriting DLOG's characteristics.
Additional useful macros include various **CHECK** variants:
- **DCHECK**: Debug-mode CHECK, only compiled in debug builds
- **ICHECK**: Internal check that remains in Release builds. When an ICHECK fails, the entire system should report an error.
## Logging Verbose Levels
TVM defines 5 levels for LOG and DLOG (adding DEBUG compared to glog):
```c++
#define TVM_LOG_LEVEL_DEBUG 0
#define TVM_LOG_LEVEL_INFO 1
#define TVM_LOG_LEVEL_WARNING 2
#define TVM_LOG_LEVEL_ERROR 3
#define TVM_LOG_LEVEL_FATAL 4
```
## Using Logging in TileLang Development
### Guidelines
For temporary debugging output in your code, there are no restrictions (you can even use std::cout). Just remember to remove it before submitting a PR.
For meaningful logging that should remain in the Tilelang codebase:
- Critical correctness checks: Use ICHECK with sufficient error messages to facilitate debugging when issues arise.
- Complex Pass debugging: For passes requiring intermediate output that may need future review (e.g., LayoutInference), use DLOG.
- General INFO/WARNING messages: Use standard LOG.
### Enabling Log Output in Tilelang
To specify the log level at runtime, set the environment variable `TVM_LOG_DEBUG`. An example usage is:
```bash
TVM_LOG_DEBUG=1 python3 code.py
```
which enables all DEBUG/INFO (level <= 1) logs for all files.
#### Detailed Rules for TVM_LOG_DEBUG Specification
The parsing logic is in `logging.cc`. Reference: [HyperAI Zhihu Article](https://zhuanlan.zhihu.com/p/1933106843468665163).
Launch Python with `TVM_LOG_DEBUG=<spec>`, where `<spec>` is a comma-separated list of level assignments in the form `<file_name>=<level>`. Important notes:
- The special filename `DEFAULT` sets the LOG level for all files.
- `<level>` can be set to -1 to disable LOG for that file.
- `<file_name>` is the C++ source filename (e.g., .cc, not .h) relative to the `src/` directory in the TVM repository. The `src/` prefix is optional when specifying file paths.
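To make these rules concrete, here is a small, hypothetical re-implementation of the spec semantics in Python (for illustration only; the real parser lives in `logging.cc`), showing how `DEFAULT`, per-file levels, and the -1 disable value interact:

```python
# Hypothetical re-implementation of the TVM_LOG_DEBUG spec semantics.
def parse_log_spec(spec):
    levels = {}
    for item in spec.split(","):
        # rpartition tolerates '=' only at the level position.
        name, _, level = item.rpartition("=")
        levels[name] = int(level)
    return levels

def log_enabled(levels, file_name, level):
    # Per-file level wins; otherwise fall back to DEFAULT; -1 disables logging.
    threshold = levels.get(file_name, levels.get("DEFAULT", -1))
    return threshold >= 0 and level <= threshold

levels = parse_log_spec("DEFAULT=1,tir/transform.cc=2,driver.cc=-1")
```

Under this spec, level-2 messages from `tir/transform.cc` are shown, all logging from `driver.cc` is disabled, and every other file allows levels up to 1 (DEBUG/INFO).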
### Enabling Debug Mode
To enable DLOG/DCHECK, developers need to first build Tilelang in Debug mode:
```bash
cmake .. -DCMAKE_BUILD_TYPE=Debug -DUSE_CUDA=ON
```
Tilelang's CMake logic automatically adds the `TVM_LOG_DEBUG` macro, compiling in all DLOG statements.
Then you also need to specify the runtime environment variables. For example, to use `DLOG(INFO) << "xxx"` for debugging, run your code with INFO level (1): `TVM_LOG_DEBUG=1`.
:::{note}
**Important**: There are two `TVM_LOG_DEBUG` variables:
1. **Compile-time macro**: Determines whether debug content (like DLOG) is compiled into the `.so` file. Referenced in C++ source via `#ifdef TVM_LOG_DEBUG`. This is automatically enabled when using the Debug build mode in CMake.
2. **Runtime environment variable**: Controls the logging level at runtime. TVM provides a specification for this variable, allowing control over per-file logging levels.

These two should ideally have different names, but TVM uses the same name for both, which can cause confusion.
:::
A performance analysis toolkit for TVM IR modules that provides hardware-aware performance metrics, including FLOPs, memory bandwidth utilization, and execution time estimation.
## Features
- **Operation Analysis**: Supports arbitrary operations expressed in TVM IR (including GEMM and convolution)
- **Memory Traffic Calculation**: Tracks global memory transfers
- **Architecture-aware Metrics**: Pre-configured with NVIDIA GPU architectures (Ampere, Ada Lovelace)
- **Performance Estimation**: Predicts execution time using a roofline model
- **TVM Integration**: Works with TVM IRModule and PrimFunc
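As a sketch of the roofline estimate (with illustrative peak numbers for an A100-class GPU, not values read from the toolkit's architecture tables): the predicted time is the maximum of the compute-bound and memory-bound lower bounds.

```python
# Roofline model: time >= max(FLOPs / peak_FLOPs, bytes / peak_bandwidth).
def roofline_time_s(flops, bytes_moved, peak_flops, peak_bw_bytes_s):
    return max(flops / peak_flops, bytes_moved / peak_bw_bytes_s)

# Example: 1024x1024x1024 FP16 GEMM with illustrative A100-like peaks
# (312 TFLOP/s tensor-core compute, 1.555 TB/s HBM bandwidth).
M = N = K = 1024
flops = 2 * M * N * K                      # one multiply and one add per MAC
bytes_moved = 2 * (M * K + K * N + M * N)  # float16, single pass over each operand
t = roofline_time_s(flops, bytes_moved, 312e12, 1.555e12)
# At these peaks the problem is compute-bound: flops/312e12 > bytes/1.555e12.
```

Whichever bound dominates tells you whether tuning effort should target arithmetic intensity or memory traffic.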
We compare with an optimized version of the official Triton implementation [here](https://github.com/openai/gpt-oss/blob/main/gpt_oss/triton/attention.py).
## Algorithm
### Forward
The only change from vanilla FlashAttention is that `sinks` should be taken into consideration in the softmax, which requires an extra rescaling at the epilogue stage.
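Concretely (a sketch in FlashAttention notation, where $m_i$ is the running row max and $\ell_i$ the running row sum of exponentials; the term shown here follows the description above rather than a verbatim formula from the implementation): the per-head sink logit enters the softmax denominator without contributing a value vector, so

$$
p_{ij} = \frac{e^{s_{ij} - m_i}}{e^{\mathrm{sink} - m_i} + \ell_i},
\qquad
\ell_i = \sum_{k} e^{s_{ik} - m_i},
$$

and the epilogue rescaling multiplies each accumulated output row by $\ell_i / \left(\ell_i + e^{\mathrm{sink} - m_i}\right)$ (equivalently, the sink can be folded into the stored `lse`).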
### Backward
Based on detailed mathematical derivation, the backward computation of `dQ`, `dK`, and `dV` is, interestingly, almost identical to that in vanilla FlashAttention, except that the specific meaning of `lse` differs. We only need to additionally compute `dsinks`, which is given by: