Unverified Commit 08262bce authored by Lei Wang, committed by GitHub

[Doc] Minor documentation update (#1410)

parent ede9eaa3
/* Reduce the displayed size of the sidebar logo in Furo */
.sidebar-logo {
max-height: 125px;
width: auto;
}
/* Optional: keep container from growing too tall due to spacing */
.sidebar-logo-container {
line-height: 0;
}
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
# General information about the project.
project = "Tile Language <br>"
project = "TileLang <br>"
author = "Tile Lang Contributors"
copyright = f"2025-2025, {author}"
......@@ -62,12 +62,13 @@ todo_include_todos = False
html_theme = "furo"
templates_path = []
html_static_path = ["_static"]
footer_copyright = "© 2025-2025 Tile Language"
html_css_files = ["custom.css"]
footer_copyright = "© 2025-2026 TileLang"
footer_note = " "
html_theme_options = {
"light_logo": "img/logo-row.svg",
"dark_logo": "img/logo-row.svg",
"light_logo": "img/logo-v2.png",
"dark_logo": "img/logo-v2.png",
}
header_links = [
......
......@@ -27,6 +27,18 @@ tutorials/auto_tuning
tutorials/logging
:::
:::{toctree}
:maxdepth: 1
:caption: PROGRAMMING GUIDES
programming_guides/overview
programming_guides/language_basics
programming_guides/instructions
programming_guides/control_flow
programming_guides/autotuning
programming_guides/type_system
:::
:::{toctree}
:maxdepth: 1
:caption: DEEP LEARNING OPERATORS
......
# Autotuning
TileLang includes a built‑in autotuner that searches configuration spaces
for the best performing kernel, compiles candidates in parallel, validates
correctness, benchmarks them, and caches the best result for reuse.
This guide covers two workflows:
- Decorator‑based: `@tilelang.autotune(configs=...)` stacked on `@tilelang.jit`
- Programmatic: `AutoTuner.from_kernel(...).set_*().run()`
It also explains input tensor supply, validation, caching, and environment
variables that affect parallelism and cache behavior.
## 1) Decorator‑based Autotune
Use `@tilelang.autotune` above `@tilelang.jit` and expose tunable parameters as
function arguments with defaults. The autotuner overrides these parameters with
values from your config space.
```python
import tilelang
import tilelang.language as T
def matmul_configs(M, N, K):
# Example space — tailor to your target
tiles = [64, 128]
stages = [2, 3]
threads = [128, 256]
return [
dict(block_M=BM, block_N=BN, block_K=BK, num_stages=S, threads=TH)
for BM in tiles
for BN in tiles
for BK in [32, 64]
for S in stages
for TH in threads
]
@tilelang.autotune(configs=matmul_configs, warmup=25, rep=100, timeout=60)
@tilelang.jit(out_idx=[-1])
def matmul(M: int, N: int, K: int,
block_M: int = 128, block_N: int = 128, block_K: int = 32,
threads: int = 128, num_stages: int = 3,
dtype: str = 'float16', accum_dtype: str = 'float32'):
@T.prim_func
def kernel(A: T.Tensor((M, K), dtype),
B: T.Tensor((K, N), dtype),
C: T.Tensor((M, N), dtype)):
with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=threads) as (bx, by):
A_s = T.alloc_shared((block_M, block_K), dtype)
B_s = T.alloc_shared((block_K, block_N), dtype)
C_f = T.alloc_fragment((block_M, block_N), accum_dtype)
T.clear(C_f)
for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=num_stages):
T.copy(A[by * block_M, ko * block_K], A_s)
T.copy(B[ko * block_K, bx * block_N], B_s)
T.gemm(A_s, B_s, C_f)
T.copy(C_f, C[by * block_M, bx * block_N])
return kernel
# Usage
# Provide inputs via context (recommended for reproducibility across configs)
import torch
M = N = K = 1024
A = torch.randn(M, K, device='cuda', dtype=torch.float16)
B = torch.randn(K, N, device='cuda', dtype=torch.float16)
C = torch.empty(M, N, device='cuda', dtype=torch.float16)
from tilelang.autotuner import set_autotune_inputs
with set_autotune_inputs(A, B, C):
tuned_kernel = matmul(M, N, K) # compiles, tunes, returns best kernel
tuned_kernel(A, B, C) # run best kernel
```
Notes
- `configs` can be a list of dicts or a callable `(args...) -> list[dict]`. Each
dict’s keys must match the tunable function arguments (e.g., `block_M`).
- The decorator returns a callable that runs autotune once per argument tuple
and caches the resulting best kernel in‑process (see the sketch after these notes).
- For explicit input control during tuning, wrap the call with
`set_autotune_inputs(...)`. Otherwise, `supply_type` (below) is used.
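A hedged sketch of the in‑process caching behavior noted above, reusing the `matmul` function, tensors, and shapes from the example:
```python
# Hedged sketch: each distinct argument tuple triggers one tuning run;
# repeated calls with the same arguments reuse the in-process cached best kernel.
with set_autotune_inputs(A, B, C):
    k1 = matmul(1024, 1024, 1024)   # compiles, tunes, caches
    k2 = matmul(1024, 1024, 1024)   # served from the in-process cache
```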
## 2) Programmatic Autotune
Use the `AutoTuner` class to manage configs and arguments more explicitly.
```python
from tilelang.autotuner import AutoTuner
kernel_factory = matmul # the function above (already @tilelang.jit)
tuner = AutoTuner.from_kernel(kernel_factory(M, N, K), configs=matmul_configs(M, N, K))
tuner.set_profile_args(
warmup=25, rep=100, timeout=60,
supply_type=tilelang.TensorSupplyType.Auto, # or provide supply_prog/ref_prog
ref_prog=lambda A, B, C: torch.allclose(C, (A @ B).to(C.dtype), rtol=1e-2, atol=1e-2),
)
tuner.set_compile_args(
target='auto', # or 'cuda'/'hip'/'metal'
execution_backend='auto', # resolves per-target
out_idx=[-1], # which outputs to return if multiple
pass_configs={ # optional TVM passes/flags
# tilelang.PassConfigKey.EXAMPLE_KEY: value,
},
)
artifact = tuner.run() # compiles + runs + validates all configs
best_kernel = artifact.kernel # JITKernel
best_latency = artifact.latency
best_config = artifact.config
# Reuse best kernel
best_kernel(A, B, C)
```
### Example Gallery (in repo)
- examples/gdn/example_chunk_delta_h.py:101 — uses `@autotune` to sweep configs
- examples/deepseek_nsa/benchmark/benchmark_nsa_fwd.py:451 — uses `@tilelang.autotune`
- examples/quickstart.py:84 — profiles a tuned kernel with `get_profiler`
- examples/hadamard_transform/example_hadamard.py:152 — profiler with custom warmup
- examples/dynamic_shape/example_dynamic.py:94 — profiler for dynamic shapes
- examples/gemm/example_gemm_persistent.py:135 — compare persistent vs non‑persistent
Click any path to open the code and compare patterns.
## Input Tensor Supply
The tuner needs inputs to compile and benchmark kernels. Provide them in one of
three ways (priority order):
1) Context manager (fixed inputs across configs)
```python
with set_autotune_inputs(A, B, C):
tuned = matmul(M, N, K)
```
2) Custom supplier program
```python
def supply_prog(signature):
# signature holds KernelParam objects describing shapes/dtypes
# Return a list of torch tensors matching the kernel’s arguments
return [A, B, C]
tuner.set_profile_args(supply_prog=supply_prog)
```
3) Built‑in generators via `supply_type`
- `TensorSupplyType.Auto` (default): heuristic per dtype (uniform ints / fp ranges)
- `Integer`, `Uniform`, `Normal`, `Randn`, `Zero`, `One`
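A minimal sketch of selecting one of the built‑in generators on the programmatic path:
```python
# Hedged sketch: ask the tuner to generate normally distributed inputs for every config
tuner.set_profile_args(supply_type=tilelang.TensorSupplyType.Normal)
```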
Important
- Built‑in generators require static shapes; if your PrimFunc uses symbolic
dimensions (T.dyn), supply concrete inputs via (1) or (2).
- Float8 dtypes require PyTorch 2.1+ for `torch.float8_*` support.
## Correctness Checking and Tolerances
Use one of the following validation methods:
- `ref_prog`: Provide a reference program that receives the same inputs and
checks results. You can return a boolean or raise on mismatch.
- `manual_check_prog`: A callable that inspects outputs and raises on mismatch.
- `skip_check=True`: Skip correctness checks (faster, use with caution).
Control numeric drift via:
- `rtol` and `atol` (defaults 1e‑2)
- `max_mismatched_ratio` (default 1%)
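A hedged sketch combining a reference check with explicit tolerances on the programmatic path from section 2; the keyword names follow the lists above:
```python
import torch

def ref_prog(A, B, C):
    # Compare the tuned kernel's output C against a torch.matmul reference
    return torch.allclose(C, (A @ B).to(C.dtype), rtol=1e-2, atol=1e-2)

tuner.set_profile_args(
    ref_prog=ref_prog,
    rtol=1e-2,
    atol=1e-2,
    max_mismatched_ratio=0.01,  # tolerate up to 1% mismatched elements
)
```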
## Configuration Spaces and Best Practices
What to tune
- Tile sizes: `block_M`, `block_N`, `block_K`
- Software pipelining: `num_stages`
- Threads per block: `threads` (or (x, y) tuple)
- Optional: dtype variants, epilogues, small scheduling knobs
Tips
- Start from a working baseline. Tune a small, meaningful space first.
- Respect hardware limits (shared memory bytes, registers per thread/block,
max threads per block). Eliminate impossible configs up‑front (see the sketch after these tips).
- Keep block sizes multiples of vector widths and warp sizes when relevant.
- Use `set_autotune_inputs` to ensure each config is measured on identical data.
- Record your best configs and bake them as defaults when stable.
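A minimal sketch of pruning a config space against a shared‑memory budget; `smem_limit_bytes` and `elem_bytes` are illustrative assumptions, not TileLang APIs:
```python
def pruned_matmul_configs(M, N, K, smem_limit_bytes=96 * 1024, elem_bytes=2):
    configs = []
    for BM in (64, 128):
        for BN in (64, 128):
            for BK in (32, 64):
                for S in (2, 3):
                    # Rough footprint of the staged A and B tiles in shared memory
                    smem = (BM * BK + BK * BN) * elem_bytes * S
                    if smem > smem_limit_bytes:
                        continue  # impossible on this target, drop up-front
                    configs.append(dict(block_M=BM, block_N=BN, block_K=BK,
                                        num_stages=S, threads=128))
    return configs
```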
## Parallel Compilation/Benchmarking and Timeouts
The tuner compiles configurations in parallel using a thread pool and benchmarks
them with a per‑config timeout. On CUDA, each worker sets the current device to
avoid context issues.
Notes
- `timeout` uses POSIX signals; on non‑Unix systems, it may not take effect.
- Logs are written to `autotuner.log` in the working directory.
## Caching
The autotuner caches best artifacts both in‑memory (per process) and on disk under
`$TILELANG_CACHE_DIR/autotuner`. The cache key includes:
- TileLang version, function source, closure free‑vars
- Config list, compile args, profile args
Disk cache contents (per key)
- Best config and latency: `best_config.json`, `latency.json`
- Kernel sources and library: `device_kernel.cu`, `host_kernel.cu`, `kernel_lib.so` (or `kernel.cubin`/`executable.so` depending on backend)
- Function and params: `function.pkl`, `params.pkl`
Control via env vars (tilelang.env)
- `TILELANG_CACHE_DIR` (default `~/.tilelang/cache`)
- `TILELANG_TMP_DIR` (default `$TILELANG_CACHE_DIR/tmp`)
- Disable all kernel caches: `TILELANG_DISABLE_CACHE=1`
- Disable autotune disk cache only: `TILELANG_AUTO_TUNING_DISABLE_CACHE=1`
CPU worker control
- `TILELANG_AUTO_TUNING_CPU_UTILITIES` (fraction, default 0.9)
- `TILELANG_AUTO_TUNING_CPU_COUNTS` (int, `-1` auto)
- `TILELANG_AUTO_TUNING_MAX_CPU_COUNT` (int, `-1` unlimited)
Backend notes
- NVRTC backend persists `.cubin` and a Python launcher.
- Torch/DLPack backend may not save artifacts to disk; in this case, only
in‑memory caching applies and a warning is logged.
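A minimal sketch of setting these variables from Python; the values and paths are illustrative, and they generally need to be set before `tilelang` is imported:
```python
import os

os.environ["TILELANG_CACHE_DIR"] = "/scratch/tilelang_cache"      # illustrative path
os.environ["TILELANG_AUTO_TUNING_CPU_UTILITIES"] = "0.5"          # use half the CPUs for tuning workers
os.environ["TILELANG_AUTO_TUNING_DISABLE_CACHE"] = "1"            # skip the autotune disk cache only

import tilelang  # import after setting so tilelang.env can pick them up
```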
## Alternative: Manual Sweeps with par_compile
If you prefer manual control, use `JITImpl.par_compile` to compile a batch of
configs and drive your own benchmarking:
```python
@tilelang.jit
def factory(M, N, K, block_M=128, block_N=128, block_K=32):
@T.prim_func
def k(A: T.Tensor((M, K), 'float16'),
B: T.Tensor((K, N), 'float16'),
C: T.Tensor((M, N), 'float16')):
...
return k
impl = factory # JITImpl
cfgs = [
dict(block_M=64, block_N=128, block_K=32),
dict(block_M=128, block_N=128, block_K=64),
]
kernels = impl.par_compile(cfgs, num_workers=4)
# Now benchmark kernels[i](A, B, C) yourself
```
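To drive the benchmarking yourself, a simple CUDA‑event timing loop works; this is a hedged sketch, with `A`, `B`, `C` being suitably shaped device tensors:
```python
import torch

def bench_ms(kernel, *args, iters=50):
    # Warm up once, then time `iters` launches with CUDA events
    kernel(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        kernel(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average milliseconds per launch

latencies = [bench_ms(k, A, B, C) for k in kernels]
best = kernels[min(range(len(kernels)), key=latencies.__getitem__)]
```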
## Recording and Reusing Best Configs
The programmatic path returns an `AutotuneResult` that can be saved and later
reloaded. This is useful for CI, multi‑host workflows, or shipping tuned configs.
```python
artifact = tuner.run() # AutotuneResult
# Save to disk
from pathlib import Path
save_dir = Path('out/best/matmul_1024')
artifact.save_to_disk(save_dir, verbose=True)
# Reload later
from tilelang.autotuner.param import AutotuneResult, CompileArgs
restored = AutotuneResult.load_from_disk(save_dir, CompileArgs())
best = restored.kernel
best(A, B, C)
```
Notes
- DLPack/Torch execution backend may not persist compiled binaries; in that
case, re‑compilation is needed on load or use a different backend.
- The directory contains human‑readable JSONs (best config/latency) and sources.
## Advanced: Config Space Callables
Derive config spaces from problem sizes to keep searches targeted and legal:
```python
def matmul_configs(M, N, K):
large = min(M, N, K) >= 1024
tiles = [128] if large else [64, 128]
for BM in tiles:
for BN in tiles:
for BK in [32, 64]:
for S in [2, 3]:
for TH in [128, 256]:
yield dict(block_M=BM, block_N=BN, block_K=BK,
num_stages=S, threads=TH)
```
## Device and Backend Selection
Tune compile‑time options explicitly:
- `target='auto'|'cuda'|'hip'|'metal'` (normalized to a TVM Target)
- `execution_backend='auto'|'tvm_ffi'|'ctypes'|'cython'|'nvrtc'|'torch'`
- `pass_configs={...}` to toggle TileLang/TVM passes for experiments
On CUDA with multiple GPUs, the tuner sets the current device per worker thread
to avoid context mixups.
## Troubleshooting
- “No configurations to tune”: Ensure `configs` is a non‑empty list or callable.
- Timeouts: Increase `timeout`; ensure inputs fit device memory; verify that
your reference check isn’t the bottleneck.
- Dynamic shapes: Provide concrete inputs via `set_autotune_inputs` or a custom
`supply_prog`.
- Disk cache disabled: Check `TILELANG_AUTO_TUNING_DISABLE_CACHE` and backend.
# Control Flow
This guide covers the control‑flow primitives in TileLang and how they lower to
efficient GPU code. You will use these to structure loops, handle boundaries,
and express pipelined compute.
## Overview
- Conditionals: `if` / `elif` / `else`, ternary (`x if c else y`)
- Loops: `T.serial`, `T.unroll`, `T.Parallel`, `T.Pipelined`
- While loops: `while` with a TIR condition
- Flow control: Python `break` / `continue`
- Safety: automatic OOB guards via the LegalizeSafeMemoryAccess pass
The examples assume `import tilelang.language as T`.
## Conditionals
Standard Python `if`/`elif`/`else` is supported inside `@T.prim_func` kernels.
Conditions should be TIR expressions (e.g., `i < N`); plain Python booleans are
treated as compile‑time constants and folded away.
```python
for i in T.serial(N):
if i < N: # TIR condition
C[i] = A[i] + B[i]
else:
pass
# Ternary
x = (A[i] if i < N else 0)
```
Short‑circuit boolean ops are supported. For multi‑dimensional bounds, use
`T.any_of` / `T.all_of` for clarity:
```python
if T.all_of(i < M, j < N):
C[i, j] = A[i, j] + B[i, j]
```
Boundary handling note
- The LegalizeSafeMemoryAccess pass automatically inserts guards when an access
may be out‑of‑bounds, and elides them when proven safe. You can often omit
explicit `if` checks for simple edge handling, but keep them when you need
custom logic or clarity.
## Loops
### Serial
`T.serial` creates a plain for‑loop. Common forms:
```python
for i in T.serial(N):
... # 0..N-1
for i in T.serial(0, N, 2):
... # 0, 2, 4, ...
```
### Unroll
`T.unroll` requests loop unrolling for small trip counts.
```python
for k in T.unroll(K_TILE):
acc += a[k] * b[k]
```
Advanced: TileLang forwards unroll hints to TIR; factor/explicit knobs are
available for expert tuning.
### Parallel (elementwise)
`T.Parallel(ext0, ext1, ...)` builds nested loops that map well to elementwise
operations. The body receives all indices in one `for` header:
```python
for i, j in T.Parallel(M, N):
C[i, j] = A[i, j] + B[i, j]
```
Optional: `coalesced_width=` can hint memory coalescing for the innermost loop.
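For example (a hedged sketch of the hint):
```python
# Hint 4-wide coalesced access on the innermost (j) loop
for i, j in T.Parallel(M, N, coalesced_width=4):
    C[i, j] = A[i, j] + B[i, j]
```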
### Pipelined (software pipelining)
`T.Pipelined(iters, num_stages=...)` overlaps producer/consumer stages (e.g.,
Global→Shared copies with compute). This is the backbone of GEMM/attention
pipelines.
```python
for ko in T.Pipelined(T.ceildiv(K, BK), num_stages=3):
T.copy(A[by * BM, ko * BK], A_s) # stage: copy A tile
T.copy(B[ko * BK, bx * BN], B_s) # stage: copy B tile
T.gemm(A_s, B_s, C_f) # stage: compute
```
### Persistent (advanced)
`T.Persistent(domain, wave_size, index, group_size=...)` exposes persistent
thread‑block style looping. It is an advanced construct that TileLang lowers in
later passes and is typically used by specialized templates.
## While Loops
`while` is supported when the condition is a TIR expression. Avoid infinite
loops; TileLang will error if it detects a constant‑true condition.
```python
i = 0
while i < N:
...
if done:
break
i += 1
```
## Break and Continue
Use Python `break`/`continue` to exit or skip within `T.serial`/`T.unroll`/
`T.Parallel`/`while` loops. Keep the body clean after a `break`/`continue` for
readability; the compiler will ignore the dead path.
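A minimal sketch (`N`, `limit`, and the buffers are assumed to exist in the enclosing kernel):
```python
for i in T.serial(N):
    if A[i] == 0:
        continue              # skip entries that would divide by zero
    if i >= limit:
        break                 # stop once the per-tile limit is reached
    C[i] = B[i] / A[i]
```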
## Putting It Together: Residual Tile Handling
Below is a typical edge pattern for a 2D kernel. With LegalizeSafeMemoryAccess,
the explicit guard can be omitted when you don’t need a custom edge path.
```python
for i, j in T.Parallel(BM, BN):
gi = by * BM + i
gj = bx * BN + j
if T.all_of(gi < M, gj < N): # optional in many cases
C[gi, gj] = A[gi, gj] + B[gi, gj]
```
## Debugging Conditions
Use `T.print` to inspect values under predicates. For buffers, TileLang prints
from a single thread to avoid duplicate outputs.
```python
if i == 0:
T.print(C, msg='C tile:')
```
# Instructions
This page summarizes the core TileLang “instructions” available at the DSL
level, how they map to hardware concepts, and how to use them correctly.
## Quick Categories
- Data movement: `T.copy`, `T.c2d_im2col`, staging Global ↔ Shared ↔ Fragment
- Compute primitives: `T.gemm`/`T.gemm_sp`, elementwise math (`T.exp`, `T.max`),
reductions (`T.reduce_sum`, `T.cumsum`, warp reducers)
- Control helpers: `T.clear`/`T.fill`, `T.reshape`/`T.view`
- Diagnostics: `T.print`, `T.device_assert`
- Advanced: atomics, memory barriers, warp‑group ops
## Data Movement
Use `T.copy(src, dst, coalesced_width=None, disable_tma=False, eviction_policy=None)`
to move tiles between memory scopes. It accepts `tir.Buffer`, `BufferLoad`, or
`BufferRegion`; extents are inferred or broadcast when possible.
```python
# Global → Shared tiles (extents inferred from dst)
T.copy(A[by * BM, ko * BK], A_s)
T.copy(B[ko * BK, bx * BN], B_s)
# Fragment/Register → Global (store result)
T.copy(C_f, C[by * BM, bx * BN])
```
Semantics
- Extents are deduced from arguments; missing sides broadcast to the other’s rank.
- Access patterns are legalized and coalesced during lowering. Explicit
vectorization is not required in high‑level (HL) mode.
- Safety: the LegalizeSafeMemoryAccess pass inserts boundary guards when an
access may be out‑of‑bounds and drops them when proven safe.
Other helpers
- `T.c2d_im2col(img, col, ...)`: convenience for conv‑style transforms.
## Compute Primitives
GEMM and sparse GEMM
- `T.gemm(A_shared, B_shared, C_fragment)`: computes a tile GEMM using shared
inputs and a fragment accumulator; lowered to target‑specific tensor cores.
- `T.gemm_sp(...)`: 2:4 sparse tensor core variant (see examples and README).
Reductions and scans
- `T.reduce_sum`, `T.reduce_max`, `T.reduce_min`, `T.cumsum`, plus warp
reducers (`T.warp_reduce_sum`, etc.).
- Allocate and initialize accumulators via `T.alloc_fragment` + `T.clear` or
`T.fill`.
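A hedged sketch of a row‑wise reduction into a fragment; the `dim` keyword mirrors common TileLang examples, and `BM`, `BK`, `A_s` are assumed from a GEMM‑style tile:
```python
# Sum each row of a (BM, BK) shared tile into a length-BM fragment
row_sum = T.alloc_fragment((BM,), 'float32')
T.clear(row_sum)
T.reduce_sum(A_s, row_sum, dim=1)
```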
Elementwise math
- Most math ops mirror TVM TIR: `T.exp`, `T.log`, `T.max`, `T.min`, `T.rsqrt`,
`T.sigmoid`, etc. Compose freely inside loops.
Reshape/view (no copy)
- `T.reshape(buf, new_shape)` and `T.view(buf, shape=None, dtype=None)` create
new views that share storage, with shape/dtype checks enforced.
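For example (a hedged sketch; the total element count must stay consistent):
```python
# Flatten a (BM, BN) fragment into a 1-D view over the same storage
C_flat = T.reshape(C_f, (BM * BN,))
# Reinterpret a float16 shared tile as uint16 without copying (same element size)
A_bits = T.view(A_s, dtype='uint16')
```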
## Synchronization (HL usage)
In HL pipelines, you usually don’t need to write explicit barriers. Passes such
as PipelinePlanning/InjectSoftwarePipeline/InjectTmaBarrier orchestrate
producer/consumer ordering and thread synchronization behind the scenes.
If you need debugging or explicit checks:
- `T.device_assert(cond, msg='')` emits device‑side asserts on CUDA targets.
- `T.print(obj, msg='...')` prints scalars or buffers safely from one thread.
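For example (a hedged sketch inside a kernel body, reusing names from the GEMM tile below):
```python
# Fail fast on CUDA if a loop-bound assumption is violated, and trace one value
T.device_assert(ko < T.ceildiv(K, BK), msg='pipeline index out of range')
if bx == 0 and by == 0:
    T.print(C_f[0, 0], msg='first accumulator element:')
```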
## Putting It Together: GEMM Tile
```python
@T.prim_func
def gemm(
A: T.Tensor((M, K), 'float16'),
B: T.Tensor((K, N), 'float16'),
C: T.Tensor((M, N), 'float16'),
):
with T.Kernel(T.ceildiv(N, BN), T.ceildiv(M, BM), threads=128) as (bx, by):
A_s = T.alloc_shared((BM, BK), 'float16')
B_s = T.alloc_shared((BK, BN), 'float16')
C_f = T.alloc_fragment((BM, BN), 'float32')
T.clear(C_f)
for ko in T.Pipelined(T.ceildiv(K, BK), num_stages=3):
T.copy(A[by * BM, ko * BK], A_s) # Global → Shared
T.copy(B[ko * BK, bx * BN], B_s)
T.gemm(A_s, B_s, C_f) # compute into fragment
T.copy(C_f, C[by * BM, bx * BN]) # store back
```
## Instruction Reference (Concise)
Below is a concise list of TileLang instructions grouped by category. For full
signatures, behaviors, constraints, and examples, refer to API Reference
(`autoapi/tilelang/index`).
Data movement
- `T.copy(src, dst, ...)`: Move tiles between Global/Shared/Fragment.
- `T.c2d_im2col(img, col, ...)`: 2D im2col transform for conv.
Memory allocation and descriptors
- `T.alloc_shared(shape, dtype, scope='shared.dyn')`: Allocate shared buffer.
- `T.alloc_fragment(shape, dtype, scope='local.fragment')`: Allocate fragment.
- `T.alloc_var(dtype, [init], scope='local.var')`: Scalar var buffer (1 elem).
- `T.alloc_barrier(arrive_count)`: Shared barrier buffer.
- `T.alloc_tmem(shape, dtype)`: Tensor memory (TMEM) buffer (Hopper+).
- `T.alloc_reducer(shape, dtype, op='sum', replication=None)`: Reducer buf.
- `T.alloc_descriptor(kind, dtype)`: Generic descriptor allocator.
- `T.alloc_wgmma_desc(dtype='uint64')`
- `T.alloc_tcgen05_smem_desc(dtype='uint64')`
- `T.alloc_tcgen05_instr_desc(dtype='uint32')`
- `T.empty(shape, dtype='float32')`: Declare function output tensors.
Compute primitives
- `T.gemm(A_s, B_s, C_f)`: Tile GEMM into fragment accumulator.
- `T.gemm_sp(...)`: Sparse (2:4) tensor core GEMM.
- Reductions: `T.reduce_sum/max/min/abssum/absmax`, bitwise `and/or/xor`.
- Scans: `T.cumsum`, finalize: `T.finalize_reducer`.
- Warp reducers: `T.warp_reduce_sum/max/min/bitand/bitor`.
- Elementwise math: TIR ops (`T.exp`, `T.log`, `T.max`, `T.min`, `T.rsqrt`, ...).
- Fast math: `T.__log/__log2/__log10/__exp/__exp2/__exp10/__sin/__cos/__tan`.
- IEEE math: `T.ieee_add/sub/mul/fmaf` (configurable rounding).
- Helpers: `T.clear(buf)`, `T.fill(buf, value)`.
- Views: `T.reshape(buf, shape)`, `T.view(buf, shape=None, dtype=None)`.
Diagnostics
- `T.print(obj, msg='')`: Print scalar/buffer from one thread.
- `T.device_assert(cond, msg='')`: Device-side assert (CUDA).
Logical helpers
- `T.any_of(a, b, ...)`, `T.all_of(a, b, ...)`: Multi-term predicates.
Annotation helpers
- `T.use_swizzle(panel_size=..., enable=True)`: Rasterization hint.
- `T.annotate_layout({...})`: Attach explicit layouts to buffers.
- `T.annotate_safe_value(var, ...)`: Safety/const hints.
- `T.annotate_l2_hit_ratio(buf, ratio)`: Cache behavior hint.
Atomics
- `T.atomic_add(dst, value, memory_order=None, return_prev=False, use_tma=False)`.
- `T.atomic_addx2(dst, value, return_prev=False)`; `T.atomic_addx4(...)`.
- `T.atomic_max(dst, value, memory_order=None, return_prev=False)`.
- `T.atomic_min(dst, value, memory_order=None, return_prev=False)`.
- `T.atomic_load(dst)`, `T.atomic_store(dst, value)`.
Custom intrinsics
- `T.dp4a(A, B, C)`: 4‑element dot‑product accumulate.
- `T.clamp(x, lo, hi)`: Clamp to [lo, hi].
- `T.loop_break()`: Break from current loop via intrinsic.
Barriers, TMA, warp‑group
- Barriers: `T.create_list_of_mbarrier(...)`, `T.get_mbarrier(i)`.
- Parity ops: `T.mbarrier_wait_parity(barrier, parity)`, `T.mbarrier_arrive(barrier)`.
- Expect tx: `T.mbarrier_expect_tx(...)`; sugar: `T.barrier_wait(id, parity=None)`.
- TMA: `T.create_tma_descriptor(...)`, `T.tma_load(...)`,
`T.tma_store_arrive(...)`, `T.tma_store_wait(...)`.
- Proxy/fences: `T.fence_proxy_async(...)`, `T.warpgroup_fence_operand(...)`.
- Warp‑group: `T.warpgroup_arrive()`, `T.warpgroup_commit_batch()`,
`T.warpgroup_wait(num_mma)`, `T.wait_wgmma(id)`.
Lane/warp index
- `T.get_lane_idx(warp_size=None)`: Lane id in warp.
- `T.get_warp_idx_sync(warp_size=None)`: Canonical warp id (sync).
- `T.get_warp_idx(warp_size=None)`: Canonical warp id (no sync).
- `T.get_warp_group_idx(warp_size=None, warps_per_group=None)`: Group id.
Register control
- `T.set_max_nreg(reg_count, is_inc)`, `T.inc_max_nreg(n)`, `T.dec_max_nreg(n)`.
- `T.annotate_producer_reg_dealloc(n=24)`, `T.annotate_consumer_reg_alloc(n=240)`.
- `T.no_set_max_nreg()`, `T.disable_warp_group_reg_alloc()`.
## Notes on Dtypes
Dtypes accept three equivalent forms:
- String: `'float32'`
- TileLang dtype: `T.float32`
- Framework dtype: `torch.float32`
All are normalized internally. See Type System for details.
# Language Basics
This page introduces the core TileLang (tile‑lang) DSL that you’ll use to write
high‑performance kernels. It focuses on how to define a kernel, express
iteration, move data across memory scopes, and run it with JIT.
The examples use the conventional aliases:
```python
import tilelang
import tilelang.language as T
from tilelang import jit
```
## 1. Defining a Kernel with `@T.prim_func`
TileLang kernels are TIR (TVM IR) functions produced by the `@T.prim_func`
decorator. Arguments are annotated with shapes and dtypes via `T.Tensor` or
`T.Buffer`.
Note on dtypes
- You can pass dtypes as a string (e.g., 'float32'), a TileLang dtype (e.g., `T.float32`),
or a framework dtype (e.g., `torch.float32`). TileLang normalizes all of these.
See Type System for details.
```python
@T.prim_func
def add_kernel(
A: T.Tensor((N,), dtype), # dtype could be 'float32' | T.float32 | torch.float32
B: T.Tensor((N,), dtype),
C: T.Tensor((N,), dtype),
):
... # kernel body
```
- Shapes may be concrete integers or symbolic. For symbolic, you can pass
Python ints through the outer `@jit` wrapper (shown below), or annotate with
`T.dyn` when you want a named symbolic dimension.
```python
# Named symbolic dimension (optional)
K = T.dyn['K']
@T.prim_func
def uses_dyn(A: T.Tensor((K,), 'float32')):
...
```
### Dynamic symbolic dimensions: two ways
TileLang supports two complementary ways to introduce symbolic (dynamic) dims:
- Type-level annotations via `T.dyn[...]` (recommended for function signatures)
- Use in `T.Tensor((T.dyn['K'], ...), dtype)` or bind once then reuse (as above).
- Inside the kernel body, prefer reading from the buffer’s shape, e.g. `M = A.shape[0]`.
- Term-level variables via `T.dynamic(name, dtype)`
- Creates a TIR `tir.Var` you can use directly in expressions/loops.
- Handy when you need to reference the dimension symbol in the body.
```python
# 1) Annotation-only symbol; read the bound size via shape
K = T.dyn['K'] # dtype defaults to int32
@T.prim_func
def foo(A: T.Tensor((K,), 'float32')):
N = A.shape[0]
for i in T.serial(N):
...
# 2) Explicit Var symbol usable in the body
K = T.dynamic('K', 'int32') # or T.dynamic('K') defaults to int32
@T.prim_func
def bar(A: T.Tensor((K,), 'float32')):
for i in T.serial(K):
...
```
Notes
- `T.symbolic(name, dtype)` is a deprecated alias of `T.dynamic`; prefer `T.dynamic`.
- Under `@jit`, concrete sizes come from the actual tensor arguments at the first call.
- Symbols in annotations do not need to be separate kernel arguments; TileLang binds them from argument shapes.
## 2. Launching Work with `T.Kernel`
`with T.Kernel(...)` declares a launch context and creates block/thread
bindings. For GPU backends, specify a grid and threads per block.
```python
with T.Kernel(grid_x, grid_y, threads=128) as (bx, by):
... # bx/by are blockIdx.x/y
```
You rarely need raw thread indices; most kernels use structured loops
(`T.serial`, `T.unroll`, `T.Parallel`, `T.Pipelined`) inside a `T.Kernel`.
## 3. Loops and Control Flow
Core loop constructs map to familiar hardware patterns:
- `T.serial(start, stop[, step])`: plain for‑loop
- `T.unroll(start, stop[, step])`: unrolled loop
- `T.Parallel(ext0, ext1, ...)`: nested parallel loops (elementwise‑friendly)
- `T.Pipelined(iters, num_stages=N)`: software pipelining for producer/consumer
```python
for i in T.serial(N):
...
for i, j in T.Parallel(M, N):
C[i, j] = A[i, j] + B[i, j]
for k in T.Pipelined(T.ceildiv(K, BK), num_stages=3):
# overlap copy/compute across stages
...
```
Conditionals use standard Python `if`/`else`. Guard edges with predicates when
tile sizes do not divide problem sizes evenly.
## 4. Memory Scopes and Allocation
TileLang exposes key software‑managed scopes:
- Global: device memory (default for `T.Tensor` arguments)
- Shared: on‑chip, block‑visible (`T.alloc_shared(shape, dtype)`)
- Fragment and scalars: per‑thread register storage (fragments and scalar vars)
that TileLang presents as a tile‑level view (`T.alloc_fragment`, `T.alloc_var`)
```python
A_shared = T.alloc_shared((BM, BK), 'float16')
B_shared = T.alloc_shared((BK, BN), 'float16')
C_local = T.alloc_fragment((BM, BN), 'float32')
T.clear(C_local) # zero accumulators
```
## 5. Moving Data: `T.copy`
Use `T.copy(src, dst)` to move tiles between scopes. It accepts buffers,
buffer regions, or buffer loads; extents are inferred or can be broadcast.
```python
# Global -> Shared (tile copy), extents inferred from dst
T.copy(A[by * BM, ko * BK], A_shared)
T.copy(B[ko * BK, bx * BN], B_shared)
# Fragment -> Global (store back)
T.copy(C_local, C[by * BM, bx * BN])
```
`T.copy` performs coalescing and scope‑specific lowering during compilation.
## 6. A Minimal End‑to‑End Example (Vector Add)
```python
import tilelang
import tilelang.language as T
from tilelang import jit
@jit # infers target from tensors at first call
def add(N: int, block: int = 256, dtype: str = 'float32'):
@T.prim_func
def add_kernel(
A: T.Tensor((N,), dtype),
B: T.Tensor((N,), dtype),
C: T.Tensor((N,), dtype),
):
with T.Kernel(T.ceildiv(N, block), threads=block) as bx:
for i in T.Parallel(block):
gi = bx * block + i
# Optional — LegalizeSafeMemoryAccess inserts a guard when an access may be OOB
C[gi] = A[gi] + B[gi]
return add_kernel
# Host side (PyTorch shown; NumPy/DLPack also supported)
import torch
N = 1 << 20
A = torch.randn(N, device='cuda', dtype=torch.float32)
B = torch.randn(N, device='cuda', dtype=torch.float32)
C = torch.empty(N, device='cuda', dtype=torch.float32)
kernel = add(N)
kernel(A, B, C) # runs on GPU
torch.testing.assert_close(C, A + B)
```
Notes
- The `@jit` wrapper returns a callable kernel after the first compilation.
- You can pass compile‑time tunables (tile sizes, dtypes) through the outer
Python function and bake them into the generated TIR.
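For instance, the `add` factory above already exposes `block` and `dtype` as compile‑time parameters (a minimal sketch):
```python
# Each distinct set of compile-time arguments yields its own compiled kernel
kernel_small = add(N, block=128)        # smaller thread blocks
kernel_half = add(N, dtype='float16')   # half-precision variant (host tensors must match)
```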
## 7. Tiled GEMM Skeleton
Below is a minimal pattern for a tiled GEMM using shared memory staging and a
fragment accumulator. It mirrors the quickstart style found in the repository.
```python
@T.prim_func
def gemm(
A: T.Tensor((M, K), 'float16'),
B: T.Tensor((K, N), 'float16'),
C: T.Tensor((M, N), 'float16'),
):
with T.Kernel(T.ceildiv(N, BN), T.ceildiv(M, BM), threads=128) as (bx, by):
A_s = T.alloc_shared((BM, BK), 'float16')
B_s = T.alloc_shared((BK, BN), 'float16')
C_f = T.alloc_fragment((BM, BN), 'float32')
T.clear(C_f)
for ko in T.Pipelined(T.ceildiv(K, BK), num_stages=3):
T.copy(A[by * BM, ko * BK], A_s)
T.copy(B[ko * BK, bx * BN], B_s)
T.gemm(A_s, B_s, C_f) # lowered to tensor‑core/ISA specific kernels
T.copy(C_f, C[by * BM, bx * BN])
```
## 8. Debugging and Printing
Use `T.print` inside a kernel for quick introspection. TileLang emits printing
from a single thread for shared/fragment scopes to avoid floods.
```python
T.print(C_f, msg='accumulator:')
T.print(A_s, msg='A tile:')
T.print(C[0], msg='C[0] = ')
```
## 9. Where to Go Next
- Control flow details: see Programming Guides → Control Flow
- Memory topics: the dedicated cache/layout guides have been removed; the basics are covered inline in this page
- Autotuning tile sizes and mappings: Programming Guides → Autotuning
- Operator examples (GEMM, GEMV, attention): see Deep Learning Operators
# Programming Guides Overview
This section provides a practical guide to writing high‑performance kernels with Tile Language (tile‑lang).
It mirrors the structure of similar guides in other projects, adapted to tile‑lang concepts and APIs.
- Audience: Developers implementing custom GPU/CPU kernels with tile‑lang
- Prereqs: Basic Python, NumPy/Tensor concepts, and familiarity with GPU programming notions
- Scope: Language basics, control flow, instructions, autotuning, and type system
## What You’ll Learn
- How to structure kernels with TileLang’s core DSL constructs
- How to move data across global/shared/fragment and pipeline compute
- How to apply autotuning to tile sizes and schedules
- How to specify and work with dtypes in kernels
## Suggested Reading Order
1. Language Basics
2. Control Flow
3. Instructions
4. Autotuning
5. Type System
## Related Docs
- Tutorials: see existing guides in `tutorials/`
- Operators: examples in `deeplearning_operators/`
> NOTE: This is a draft scaffold. Fill in code snippets and benchmarks as APIs evolve.
# Type System
This page lists the data types supported by TileLang and how to specify them in
kernels. For full details and the authoritative list, see the API Reference
(`autoapi/tilelang/index`) and `tilelang.language.v2.dtypes`.
How to specify dtypes
- Use any of the following forms; TileLang normalizes them internally:
- String: `'float32'`, `'int8'`, `'bfloat16'`, ...
- TileLang dtype object: `T.float32`, `T.int8`, `T.bfloat16`, ...
- Framework dtype: `torch.float32`, `torch.int8`, `torch.bfloat16`, ...
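For example, inside a kernel body all three forms request the same element type (a minimal sketch):
```python
import torch
import tilelang.language as T

# Equivalent float16 allocations using the three accepted dtype forms
A_s = T.alloc_shared((128, 32), 'float16')        # string
B_s = T.alloc_shared((128, 32), T.float16)        # TileLang dtype object
C_s = T.alloc_shared((128, 32), torch.float16)    # framework dtype
```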
Common scalar types
- Boolean: `bool`
- Signed integers: `int8`, `int16`, `int32`, `int64`
- Unsigned integers: `uint8`, `uint16`, `uint32`, `uint64`
- Floating‑point: `float16` (half), `bfloat16`, `float32`, `float64`
Float8 and low‑precision families
- Float8: `float8_e3m4`, `float8_e4m3`, `float8_e4m3b11fnuz`, `float8_e4m3fn`,
`float8_e4m3fnuz`, `float8_e5m2`, `float8_e5m2fnuz`, `float8_e8m0fnu`
- Float6: `float6_e2m3fn`, `float6_e3m2fn`
- Float4: `float4_e2m1fn`
Vectorized element types (SIMD packs)
- For many base types, vector‑packed variants are available by lane count:
`x2`, `x4`, `x8`, `x16`, `x32`, `x64`.
- Examples:
- Integers: `int8x2`, `int8x4`, ..., `int32x2`, `int32x4`, ...
- Unsigned: `uint8x2`, `uint8x4`, ...
- Floats: `float16x2`, `float16x4`, `float32x2`, `float32x4`, ...
- Float8/6/4 families also provide `x2/x4/x8/x16/x32/x64` where applicable,
e.g., `float8_e4m3x2`, `float8_e4m3x4`, `float6_e2m3fnx8`, `float4_e2m1fnx16`.
Notes
- Availability of certain low‑precision formats (float8/6/4) depends on target
architecture and backend support.
- Choose accumulation dtypes explicitly for mixed‑precision compute (e.g.,
GEMM with `float16` inputs and `float32` accumulators).
- The complete, up‑to‑date list is exposed in
`tilelang.language.v2.dtypes` and rendered in the API Reference.