# InjectFenceProxy Pass
`tl.InjectFenceProxy` is a TIR-level transform that keeps the GPU proxy state consistent on NVIDIA Hopper (SM90+) by inserting `fence.proxy.async` instructions when control flow switches from generic memory operations to asynchronous proxy operations.
## Why Fences Are Needed
Hopper separates memory instructions into generic and asynchronous proxy paths. When an asynchronous instruction (for example, `cp.async` or `tma.load`) issues after generic traffic (like `ldmatrix` or plain buffer stores), the hardware requires a `fence.proxy.async` to guarantee ordering. Missing fences can lead to race conditions or undefined behavior.
## What the Pass Does
- Walks every statement in the `PrimFunc`, tracking whether it behaves as a **generic**, **async**, or **neutral** proxy (neutral statements reset the state, such as an explicit fence).
- Automatically lowers `tma_store` intrinsics into the required `arrive`/`wait` handshake so that TMA stores participate correctly in synchronization.
- Injects an explicit `fence.proxy.async` whenever a generic statement is followed by an async statement without an intervening neutral barrier.
The pass is conservative: unknown extern calls are treated as async so that the fence is inserted rather than accidentally omitted.
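In pseudocode, the tracker reduces to a small state machine. The sketch below is plain Python with illustrative names, not the actual C++ implementation:
```python
GENERIC, ASYNC, NEUTRAL = "generic", "async", "neutral"

def inject_fences(stmts, classify):
    """classify(s) returns GENERIC, ASYNC, or NEUTRAL for a statement s."""
    out, state = [], NEUTRAL
    for s in stmts:
        kind = classify(s)
        if state == GENERIC and kind == ASYNC:
            out.append("fence.proxy.async")  # generic -> async boundary
        out.append(s)
        state = kind  # a neutral statement (e.g. an explicit fence) resets the tracker
    return out

# e.g. inject_fences(["initialize_descriptor", "shared-store", "wgmma"], ...)
# yields [..., "shared-store", "fence.proxy.async", "wgmma"].
```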
### Timeline View
```
initialize_descriptor ──→ shared-store ──→ wgmma
   (generic proxy)        (generic proxy)  (async proxy)
                                 │             ↑
                                 └─────────────┘
                          fence.proxy.async inserted here
```
The proxy tracker scans the sequence from left to right. The moment it detects a transition from generic to async (between the shared store and the `wgmma` above), it synthesizes a `fence.proxy.async` to reset the hardware proxy state before the async path runs.
## Coverage of Intrinsics
The tracker understands the TileLang intrinsics for TMA load/store, shared-memory MMA (`wgmma`), and TVM/PTX async copy intrinsics (`cp.async` variants). Generic operations currently include `ldmatrix`, `stmatrix`, and descriptor initialization. Other IR nodes (loops, blocks, attributes) receive a proxy kind derived from their bodies so that the analysis survives structured control flow.
## Usage
The pass is part of the default TileLang lowering pipeline. To apply it manually:
```python
import tvm
from tvm import IRModule
from tilelang import tl

mod = IRModule({"main": prim_func})
with tvm.transform.PassContext():
    mod = tl.transform.InjectFenceProxy()(mod)
```
## End-to-End Example
Before the pass:
```python
@T.prim_func
def kernel():
    with T.Kernel(1):
        desc = T.decl_buffer((1,), "uint64", scope="local.descriptor")
        smem = T.decl_buffer((128,), "float16", scope="shared")
        T.initialize_descriptor(desc, T.uint64(0), 2, 1, 32)
        smem[0] = T.float16(0)
        T.ptx_wgmma_ss(
            "float16",
            "m64n64k16",
            T.bool(True),
            T.bool(True),
            "fp16",
            "fp16",
            "fp16",
            desc.data,
            T.int32(0),
            desc.data,
            T.int32(0),
            smem.data,
            T.int32(0),
            T.bool(True),
            1,
            1,
        )
```
After `tl.transform.InjectFenceProxy`:
```python
@T.prim_func
def kernel():
    with T.Kernel(1):
        desc = T.decl_buffer((1,), "uint64", scope="local.descriptor")
        smem = T.decl_buffer((128,), "float16", scope="shared")
        T.initialize_descriptor(desc, T.uint64(0), 2, 1, 32)
        smem[0] = T.float16(0)
        T.fence_proxy_async()
        T.ptx_wgmma_ss(
            "float16",
            "m64n64k16",
            T.bool(True),
            T.bool(True),
            "fp16",
            "fp16",
            "fp16",
            desc.data,
            T.int32(0),
            desc.data,
            T.int32(0),
            smem.data,
            T.int32(0),
            T.bool(True),
            1,
            1,
        )
```
The only change is the `fence_proxy_async` between the generic descriptor setup / shared-memory write and the async `wgmma`. In larger kernels the pass performs the same operation across nested blocks, loops, and conditional branches.
## Extending the Pass
If you introduce a new intrinsic that behaves like an async proxy, add it to `IsAsyncIntrinsic` in `src/transform/inject_fence_proxy.cc`. Likewise, extend `IsKnownGeneric` for additional generic operations. When adding new neutral barriers, make sure they set the proxy kind to `kNeutral` so the state resets correctly.
# LetStmt Inlining in TileLang
This document explains how `LetStmt` inlining works in TileLang's simplification pipeline, which is an important optimization that affects code generation and performance.
## Overview
A `LetStmt` (Let Statement) is a temporary variable binding in the IR (Intermediate Representation). During compilation, TileLang's simplifier may choose to inline these temporary variables to simplify the code. TileLang also provides a standalone `LetInline` pass that performs eager substitution before the main legalization pipeline. However, not all `LetStmt` nodes can be safely inlined.
## When Does LetStmt Get Inlined?
The inlining logic is implemented in `src/transform/simplify.cc`. A `LetStmt` will be inlined if **both** of the following conditions are met:
### 1. The value satisfies `CanInlineLetStmt`
The `CanInlineLetStmt` helper returns `true` when:
- **The value is a constant** (`is_const_number(op->value)` returns true)
- **The value is a variable** (`op->value.as<VarNode>()` returns a node)
- **The value is an integer expression without side effects**:
- The value has `int` dtype
- The side effect level is `kPure` or lower (no observable side effects)
```cpp
bool CanInlineLetStmt(const LetStmtNode *op) {
  if (is_const_number(op->value))
    return true;
  if (op->value.as<VarNode>())
    return true;
  // Won't face the deep expression explosion problem as in Let expression.
  // attempt to inline as much as possible if the value integer type(can be
  // index).
  if (!op->value.dtype().is_int())
    return false;
  return SideEffect(op->value) <= CallEffectKind::kPure;
}
```
### 2. The variable is NOT used in buffer definitions
Even if `CanInlineLetStmt` returns true, the variable will **not** be inlined if it's used in a buffer's definition (shape, strides, elem_offset, or data fields).
This protection exists because:
- Buffer definitions are not updated during the simplification pass
- If a variable used in a buffer definition is inlined, later references to that buffer would fail to find the variable definition
- This would cause compilation errors or incorrect behavior
The mutator checks this before dropping the binding:
```cpp
bool used_in_buffer_def = used_in_buffer_def_.count(op->var.get());
if (can_inline && !used_in_buffer_def) {
  return body; // Inline: remove LetStmt and return body directly
}
```
## Example: Why Buffer Definition Variables Are Protected
Consider this code:
```python
let stride = M * 16
let buffer_a = Buffer(data, shape=[M, N], strides=[stride, 1])
buffer_a[i, j] = ...
```
- `stride` satisfies `CanInlineLetStmt` (it's an int expression with no side effects)
- However, `stride` is used in `buffer_a`'s `strides` field
- If we inline it, the buffer definition becomes `strides=[M*16, 1]`
- But the Buffer object's fields are not updated during simplification
- Later code accessing `buffer_a` would fail to find the `stride` variable
Therefore, `stride` is added to `used_in_buffer_def_` and will **not** be inlined.
## How Variables Are Collected
The `CollectVarsUsedInBufferDefinition` helper traverses all `BufferLoad` and `BufferStore` nodes and collects variables used in their buffer definitions:
```cpp
void VisitBuffer(const Buffer &buf) {
  // Collect variables that should remain defined
  VarUseDefAnalyzer usage(Array<Var>{});
  usage(buf->data);
  for (const auto &dim : buf->shape) {
    usage(dim);
  }
  for (const auto &dim : buf->strides) {
    usage(dim);
  }
  usage(buf->elem_offset);
  // Track for use in LetStmtNode mutator
  for (const auto &var : usage.undefined_) {
    used_in_buffer_def_.insert(var.get());
  }
}
```
## Practical Example: Temporary Variable Issue
Consider this TileLang code:
```python
for i in T.Parallel(block_N):
    idx = bx * block_N + i
    tmp = T.max(A[idx], 1)
    B[idx] = tmp / 2
    A[idx] = tmp * 2
```
In this case:
- `tmp` is an integer-like temporary variable
- It satisfies `CanInlineLetStmt` (pure int expression)
- It's **not** used in any buffer definition
- Therefore, `tmp` **will be inlined**
This means the IR becomes:
```python
for i in T.Parallel(block_N):
    idx = bx * block_N + i
    B[idx] = T.max(A[idx], 1) / 2
    A[idx] = T.max(A[idx], 1) * 2
```
If this causes issues (e.g., `A[idx]` being read twice with different values due to the first write), it indicates a potential problem with the inlining heuristic or the code pattern.
## Controlling Let Inlining via Pass Config
TileLang exposes an explicit pass configuration key, `tilelang.PassConfigKey.TL_FORCE_LET_INLINE` (`"tl.force_let_inline"`), that allows users to force the eager `LetInline` pass to run before the legalization pipeline begins. When enabled, the pipeline invokes `tilelang.transform.LetInline()` at the start of `LowerAndLegalize` (see `tilelang/engine/phase.py`). This knob is useful when debugging LetStmt-related issues or when deterministic inlining behavior is desired across different environments.
```python
import tilelang
from tilelang import transform
from tilelang.engine.phase import LowerAndLegalize

with transform.PassContext(
        config={tilelang.PassConfigKey.TL_FORCE_LET_INLINE: True}):
    lowered_mod = LowerAndLegalize(input_mod, target)
```
If the flag is left unset (the default), the eager pass is only applied when downstream transforms opt in (for example, by calling `_Simplify(..., inline_let=True)` inside Tile operators). The guard in `tilelang/engine/phase.py` ensures the eager pass is only triggered when the user explicitly requests it.
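For completeness, a minimal sketch of invoking the standalone pass directly on an `IRModule` (assuming `LetInline` follows the usual module-pass calling convention, as the pipeline call above suggests):
```python
import tilelang

# Apply eager LetStmt substitution as a standalone module pass.
mod = tilelang.transform.LetInline()(mod)
```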
## Summary
The LetStmt inlining mechanism is a **conservative optimization** that:
1. Aggressively inlines simple, pure integer expressions to simplify the IR
2. Protects variables used in buffer definitions to avoid breaking buffer access
3. Helps reduce IR complexity and improve code generation
4. Can be forced through `TL_FORCE_LET_INLINE` when deterministic eager inlining is required
Understanding when inlining happens is crucial for:
- Debugging compilation issues
- Understanding generated code
- Writing efficient TileLang programs
- Identifying potential optimization opportunities or bugs
## Related Files
- `src/transform/simplify.cc`: Main Simplify implementation
- `src/transform/frontend_legalize.cc`: Standalone LetInline pass
- `tilelang/engine/phase.py`: Pipeline integration for eager LetInlining
- `testing/python/transform/test_tilelang_transform_let_inline.py`: Regression coverage for the pass
# General information about the project.
project = "Tile Language <br>"
author = "Tile Lang Contributors"
copyright = f"2025-2025, {author}"
# Version information.
with open("../VERSION") as f:
    version = f.read().strip()
release = version

extensions = [
    "sphinx_tabs.tabs",
    "sphinx_toolbox.collapse",
    "sphinxcontrib.httpdomain",
    "sphinx.ext.napoleon",
    "sphinx.ext.intersphinx",
    "sphinx_reredirects",
    "sphinx.ext.mathjax",
    "myst_parser",
    "autoapi.extension",
]

autoapi_type = 'python'
autoapi_dirs = ['../tilelang']
autoapi_options = [
    'members',
    'undoc-members',
    'show-inheritance',
    'show-module-summary',
    'special-members',
]
autoapi_keep_files = False  # Set to True to keep the generated rst files for debugging
autoapi_generate_api_docs = True
autodoc_typehints = 'description'
autoapi_ignore = ["*language/ast*", "*version*", "*libinfo*", "*parser*"]

source_suffix = {
    '.rst': 'restructuredtext',
    '.md': 'markdown',
}
myst_enable_extensions = [
    "colon_fence",
    "deflist",
]
redirects = {"get_started/try_out": "../index.html#getting-started"}
language = "en"
exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "README.md", "**/*libinfo*", "**/*version*"]
pygments_style = "sphinx"
todo_include_todos = False

# -- Options for HTML output ----------------------------------------------
html_theme = "furo"
templates_path = []
html_static_path = ["_static"]
footer_copyright = "© 2025-2025 Tile Language"
footer_note = " "
html_theme_options = {
    "light_logo": "img/logo-row.svg",
    "dark_logo": "img/logo-row.svg",
}
header_links = [
    ("Home", "https://github.com/tile-ai/tilelang"),
    ("Github", "https://github.com/tile-ai/tilelang"),
]
html_context = {
    "footer_copyright": footer_copyright,
    "footer_note": footer_note,
    "header_links": header_links,
    "display_github": True,
    "github_user": "tile-ai",
    "github_repo": "tilelang",
    "github_version": "main/docs/",
    "theme_vcs_pageview_mode": "edit",
}
# 🚀 Write High Performance FlashMLA with TileLang on Hopper
<div style="text-align: left;">
<em>Author:</em> <a href="https://github.com/chengyupku">Yu Cheng</a>
<em>Author:</em> <a href="https://github.com/LeiWang1999">Lei Wang</a>
</div>
TileLang is a user-friendly AI programming language that significantly lowers the barrier to kernel programming, helping users quickly build customized operators. However, users still need to master certain programming techniques to better leverage TileLang's powerful capabilities. Here, we'll use MLA as an example to demonstrate how to write high-performance kernels with TileLang.
## Introduction to MLA
DeepSeek's MLA (Multi-Head Latent Attention) is a novel attention mechanism known for its hardware efficiency and significant improvements in model inference speed. Several deep learning compilers (such as [Triton](https://github.com/triton-lang/triton)) and libraries (such as [FlashInfer](https://github.com/flashinfer-ai/flashinfer)) have developed their own implementations of MLA. In February 2025, [FlashMLA](https://github.com/deepseek-ai/FlashMLA) was open-sourced on GitHub. FlashMLA utilizes [CUTLASS](https://github.com/NVIDIA/cutlass) templates and incorporates optimization techniques from [FlashAttention](https://github.com/Dao-AILab/flash-attention), achieving impressive performance.
## Benchmark Results
We benchmarked the performance of FlashMLA, TileLang, Torch, Triton, and FlashInfer under batch sizes of 64 and 128, with float16 data type, as shown in the figures below.
```{figure} ../_static/img/mla_hopper/bs64_float16.png
:width: 50%
:alt: Overview
:align: center
Figure 1: Performance under batch size=64
```
```{figure} ../_static/img/mla_hopper/bs128_float16.png
:width: 50%
:alt: Overview
:align: center
Figure 2: Performance under batch size=128
```
As shown in the results, TileLang achieves performance comparable to FlashMLA in most cases, significantly outperforming both FlashInfer and Triton.
Notably, **TileLang accomplishes this with just around 80 lines of Python code**, demonstrating its exceptional ease of use and efficiency. Let's dive in and see how TileLang achieves this.
## Implementation
First, let's review the core computation logic of traditional FlashAttention:
```python
# acc_s: [block_M, block_N]
# scores_max: [block_M]
# scores_scale: [block_M]
# acc_o: [block_M, dim]
for i in range(loop_range):
    acc_s = Q @ K[i]
    scores_max_prev = scores_max
    scores_max = max(acc_s, dim=1)
    scores_scale = exp(scores_max_prev - scores_max)
    acc_o *= scores_scale
    acc_s = exp(acc_s - scores_max)
    acc_o += acc_s @ V[i]
    ...
```
Here, `acc_s` represents the `Q @ K` result in each iteration with dimensions `[block_M, block_N]`, while `acc_o` represents the current iteration's output with dimensions `[block_M, dim]`. Both `acc_s` and `acc_o` need to be stored in registers to reduce latency.
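The correctness of this incremental rescaling is easy to check outside the kernel. Below is a minimal NumPy sketch (not TileLang code; the block size and names are illustrative) that mirrors the loop above, additionally tracking the running row sum that the elided `...` lines would maintain, and verifies the result against a full softmax:
```python
import numpy as np

def online_softmax_matmul(S, V, block_N=4):
    """Blocked softmax(S) @ V using the running-max rescaling above."""
    M, N = S.shape
    acc_o = np.zeros((M, V.shape[1]))
    row_max = np.full((M, 1), -np.inf)
    row_sum = np.zeros((M, 1))
    for s0 in range(0, N, block_N):
        s = S[:, s0:s0 + block_N]
        new_max = np.maximum(row_max, s.max(axis=1, keepdims=True))
        scale = np.exp(row_max - new_max)   # scores_scale in the pseudocode
        acc_o *= scale
        row_sum *= scale
        p = np.exp(s - new_max)             # exp(acc_s - scores_max)
        acc_o += p @ V[s0:s0 + block_N]
        row_sum += p.sum(axis=1, keepdims=True)
        row_max = new_max
    return acc_o / row_sum

rng = np.random.default_rng(0)
S, V = rng.standard_normal((8, 16)), rng.standard_normal((16, 4))
ref = np.exp(S - S.max(axis=1, keepdims=True))
ref = ref / ref.sum(axis=1, keepdims=True) @ V
assert np.allclose(online_softmax_matmul(S, V), ref)
```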
Compared to traditional attention operators like MHA (Multi-Headed Attention) or GQA (Grouped Query Attention), a major challenge in optimizing MLA is its large head dimensions - `query` and `key` have head dimensions of 576 (512 + 64), while `value` has a head dimension of 512. This raises a significant issue: `acc_o` becomes too large, and with insufficient threads (e.g., 128 threads), register spilling occurs, severely impacting performance.
This raises the question of how to partition the matrix multiplication operation. On the Hopper architecture, most computation kernels use [`wgmma.mma_async`](https://docs.nvidia.com/cuda/parallel-thread-execution/#asynchronous-warpgroup-level-matrix-instructions) instructions for optimal performance. The `wgmma.mma_async` instruction organizes 4 warps (128 threads) into a warpgroup for collective MMA operations. However, `wgmma.mma_async` instructions require a minimum M dimension of 64. This means each warpgroup's minimum M dimension can only be reduced to 64, but a tile size of 64*512 is too large for a single warpgroup, leading to register spilling.
Therefore, our only option is to partition `acc_o` along the `dim` dimension, with two warpgroups computing the left and right part of `acc_o` respectively. However, this introduces another challenge: both warpgroups require the complete `acc_s` result as input.
Our solution is to have each warpgroup compute half of `acc_s` during `Q @ K` computation, then obtain the other half computed by the other warpgroup through shared memory.
### Layout Inference
While the above process may seem complex, don't worry - TileLang handles all of these intricacies for you.
Figure 3 and Figure 4 illustrate the frontend TileLang script and its corresponding execution plan for MLA. Here, `T.gemm` represents matrix multiplication operations, `transpose_B=True` indicates transposition of matrix B, and `policy=FullCol` specifies that each warpgroup computes one column (i.e., the result matrix is split along the vertical dimension). `T.copy` represents buffer-to-buffer copying operations.
```{figure} ../_static/img/mla_hopper/qk_layout.jpg
:width: 50%
:alt: Overview
:align: center
Figure 3: Buffer shapes in Q @ K
```
```{figure} ../_static/img/mla_hopper/pv_layout.jpg
:width: 50%
:alt: Overview
:align: center
Figure 4: Buffer shapes in acc_s @ V
```
The mapping from TileLang frontend code to execution plan is accomplished through Layout Inference. Layout inference is a core optimization technique in TileLang. It automatically deduces the required buffer shapes and optimal layouts based on Tile-Operators (like `T.gemm`, `T.copy`, etc.), then generates the corresponding code. Here, we demonstrate a concrete example of buffer shape inference in MLA.
For instance, when computing `Q @ K`, TileLang infers that each warpgroup's `acc_s_0` shape should be `[blockM, blockN / 2]` based on the `policy=FullCol` annotation in `T.gemm`. Since this is followed by an `acc_s @ V` operation with `policy=FullCol`, which requires each warpgroup to have the complete `acc_s` result, TileLang deduces that `acc_s`'s shape at this point should be `[blockM, blockN]`. Consequently, TileLang can continue the inference process forward, determining that both `S_shared` and `acc_s` in `T.copy(S_shared, acc_s)` should have shapes of `[blockM, blockN]`.
It's worth noting that our scheduling approach differs from FlashMLA's implementation strategy. In FlashMLA, `Q @ K` is assigned to a single warpgroup, while the `acc_o` partitioning scheme remains consistent with ours. Nevertheless, our scheduling approach still achieves comparable performance.
### Threadblock Swizzling
Threadblock swizzling is a common performance optimization technique in GPU kernel optimization. In GPU architecture, the L2 cache is a high-speed cache shared among multiple SMs (Streaming Multiprocessors). Threadblock swizzling optimizes data access patterns by remapping the scheduling order of threadblocks, thereby improving L2 cache hit rates. Traditional scheduling typically executes threadblocks in the natural order of the grid, which can lead to non-contiguous data access patterns between adjacent threadblocks, resulting in inefficient utilization of cached data. The swizzle technique employs mathematical mapping methods (such as diagonal or interleaved mapping) to adjust the execution order of threadblocks, ensuring that consecutively scheduled threadblocks access adjacent or overlapping data regions.
In TileLang, threadblock swizzling optimization can be implemented with just a single line of Python code:
```python
T.use_swizzle(panel_size: int, order: str = "row")
```
Here, `panel_size` specifies the width of the swizzled threadblock group, and `order` determines the swizzling pattern, which can be either "row" or "col".
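To make the remapping concrete, here is an illustrative plain-Python model of a row-ordered panel swizzle (not TileLang's exact mapping): consecutive linear block indices walk down a panel of `panel_size` rows before moving to the next column, so neighboring blocks touch adjacent tiles.
```python
def swizzle_block(bx, by, grid_x, panel_size):
    """Map the natural (bx, by) launch order to a panel-swizzled order."""
    linear = by * grid_x + bx
    panel_area = panel_size * grid_x
    panel = linear // panel_area               # which horizontal panel of rows
    within = linear % panel_area
    new_bx = within // panel_size              # column inside the panel
    new_by = panel * panel_size + within % panel_size
    return new_bx, new_by

# With panel_size=2 on a 4-wide grid, blocks 0..3 now cover a 2x2 data patch
# instead of a 1x4 strip, improving L2 reuse between neighbors.
print([swizzle_block(b % 4, b // 4, 4, 2) for b in range(8)])
```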
### Shared Memory Swizzling
In CUDA programming, shared memory is divided into multiple memory banks, with each bank capable of servicing one thread request per clock cycle in parallel. Bank conflicts occur when multiple threads simultaneously access different addresses mapped to the same bank, forcing these accesses to be serialized and degrading performance.
One common strategy to address bank conflicts is shared memory swizzling. This technique rearranges how data is stored in shared memory by remapping addresses that would originally fall into the same bank to different banks, thereby reducing conflicts. For example, XOR operations or other bit manipulations can be incorporated into address calculations to alter the data layout, resulting in more evenly distributed memory accesses across consecutive threads. This approach is particularly crucial for implementing high-performance computing tasks like matrix multiplication and convolution, as it can significantly improve memory access parallelism and overall execution efficiency.
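As a toy illustration of the XOR idea (plain Python, assuming 32 banks and a 32-wide tile; not TileLang's actual swizzle function): permuting the column index by the row index spreads a column walk across all banks.
```python
BANKS = 32

def bank(addr_in_words):
    return addr_in_words % BANKS

def swizzled_col(row, col, width=32):
    return col ^ (row % width)  # XOR row bits into the column index

# 32 threads each read column 0 of consecutive rows of a 32-wide tile:
plain = {bank(row * 32 + 0) for row in range(32)}
swz = {bank(row * 32 + swizzled_col(row, 0)) for row in range(32)}
assert len(plain) == 1    # every access hits bank 0: a 32-way conflict
assert len(swz) == BANKS  # accesses spread across all 32 banks
```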
Similarly, TileLang also supports shared memory swizzling. Users only need to add a single line of Python code:
```python
T.annotate_layout({
    S_shared: tilelang.layout.make_swizzled_layout(S_shared),
})
```
Here, `T.annotate_layout` allows users to specify any desired layout for a buffer. For convenience, TileLang provides the `make_swizzled_layout` primitive to automatically generate a swizzled layout.
### Warp-Specialization
The Hopper architecture commonly employs warp specialization for performance optimization. A typical approach is to designate one warpgroup as a producer that handles data movement using TMA (Tensor Memory Accelerator), while the remaining warpgroups serve as consumers performing computations. However, this programming pattern is complex, requiring developers to manually manage the execution logic for producers and consumers, including synchronization through the `mbarrier` objects.
In TileLang, users are completely shielded from these implementation details. The frontend script is automatically transformed into a warp-specialized form, where TileLang handles all producer-consumer synchronization automatically, enabling efficient computation.
### Pipeline
Pipelining is a technique that improves memory access efficiency by overlapping memory access with computation. In TileLang, a pipelined loop can be created through the `T.Pipelined` annotation:
```python
T.Pipelined(iterations: int, num_stages: int)
```
Here, `iterations` specifies the loop range to pipeline, and `num_stages` specifies the number of pipeline stages. Multi-stage pipelining enables overlapping of computation and memory access, which can significantly improve performance for memory-intensive operators. However, more stages consume more shared memory, so the optimal configuration needs to be determined for each use case.
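As a usage sketch, the K-loop of a GEMM can be pipelined as follows (following the public quickstart; tile sizes here are illustrative):
```python
import tilelang.language as T

def matmul(M, N, K, block_M=128, block_N=128, block_K=32, dtype="float16"):

    @T.prim_func
    def main(A: T.Tensor((M, K), dtype), B: T.Tensor((K, N), dtype),
             C: T.Tensor((M, N), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_K), dtype)
            B_shared = T.alloc_shared((block_K, block_N), dtype)
            acc = T.alloc_fragment((block_M, block_N), "float")
            T.clear(acc)
            # Three stages: the copies for upcoming iterations overlap with
            # the gemm of the current iteration.
            for ko in T.Pipelined(T.ceildiv(K, block_K), num_stages=3):
                T.copy(A[by * block_M, ko * block_K], A_shared)
                T.copy(B[ko * block_K, bx * block_N], B_shared)
                T.gemm(A_shared, B_shared, acc)
            T.copy(acc, C[by * block_M, bx * block_N])

    return main
```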
### Split-KV
We have also implemented Split-KV optimization similar to [FlashDecoding](https://pytorch.org/blog/flash-decoding/). Specifically, when the batch size is small, parallel SM resources cannot be fully utilized due to low parallelism. In such cases, we can split the kv_ctx dimension across multiple SMs for parallel computation and then merge the results.
In our implementation, we have developed both split and combine kernels, allowing users to control the split size through a `num_split` parameter.
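Conceptually, each split produces a partial output together with its log-sum-exp (LSE), and the combine kernel merges them with a softmax over the per-split LSEs. A NumPy sketch of the combine step (not the actual kernels; array layouts are illustrative):
```python
import numpy as np

def combine_splits(partial_o, lse):
    """partial_o: [num_split, M, dim] per-split outputs (each normalized
    within its own kv range); lse: [num_split, M] per-split log-sum-exp."""
    m = lse.max(axis=0)                  # [M] running max for stability
    w = np.exp(lse - m)                  # [num_split, M]
    w = w / w.sum(axis=0)                # softmax over the split axis
    return (w[..., None] * partial_o).sum(axis=0)   # [M, dim]
```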
## 🚀 On AMD MI300X Accelerators
Following our previous demonstration of [high-performance FlashMLA implementation on NVIDIA Hopper architectures using TileLang](https://github.com/tile-ai/tilelang/blob/main/examples/deepseek_mla/README.md), this work presents an optimized implementation for AMD MI300X accelerators. We examine architectural differences and corresponding optimization strategies between these platforms.
### Architectural Considerations and Optimization Strategies
Key implementation differences between Hopper and MI300X architectures include:
1. **Instruction Set Variations**: The MI300X architecture does not use explicit Tensor Memory Accelerator (TMA) instructions or warp specialization. Since the compiler handles both automatically on Hopper, the source code remains identical across the two platforms.
2. **Shared Memory Constraints**: With 64KB of shared memory compared to Hopper's 228KB, MI300X implementations require careful memory management. Our optimization strategy includes:
- Reducing software pipeline stages
- Register-based caching of Q matrices instead of shared memory utilization:
```python
# Original shared memory allocation
Q_shared = T.alloc_shared([block_H, dim], dtype)
Q_pe_shared = T.alloc_shared([block_H, pe_dim], dtype)
# Optimized register allocation
Q_local = T.alloc_fragment([block_H, dim], dtype)
Q_pe_local = T.alloc_fragment([block_H, pe_dim], dtype)
```
3. **Tile Size Flexibility**: The absence of WGMMA instructions on MI300X permits more flexible tile size selection, removing the requirement for block_m to be multiples of 64.
4. **Memory Bank Conflict Swizzling**: MI300X has different memory bank conflict rules than NVIDIA GPUs, so different swizzling strategies are required. This is also handled automatically by TileLang, resulting in no visible differences in the code.
### Performance Evaluation
We conducted comparative performance analysis across multiple frameworks using float16 precision with batch sizes 64 and 128. The experimental results demonstrate:
<figure style="text-align: center">
<a href="../figures/flashmla-amd.png">
<img src="../figures/flashmla-amd.png" alt="AMD FlashMLA Performance Comparison">
</a>
<figcaption style="text-align: center;">Figure 1: Computational throughput comparison across frameworks (Batch sizes 64 and 128)</figcaption>
</figure>
Notably, TileLang achieves performance parity with hand-optimized assembly kernels (aiter-asm) in most test cases, while significantly outperforming both Triton (1.98×) and PyTorch (3.76×) implementations. This performance is achieved through a concise 80-line Python implementation, demonstrating TileLang's efficiency and programmability advantages.
### Future Optimization Opportunities
1. **Memory Bank Conflict Mitigation**: Current implementations primarily address bank conflicts in NT layouts through TileLang's automatic optimization. Further investigation of swizzling techniques for alternative memory layouts remains an open research direction.
2. **Dimension Parallelization**: For large MLA dimensions (e.g., 576 elements), we propose investigating head dimension partitioning strategies to:
- Reduce shared memory pressure
- Improve compute-to-memory access ratios
- Enhance parallelism through dimension-wise task distribution
# ElementWise Operators
<div style="text-align: left;">
<em>Author:</em> <a href="https://github.com/chenghuaWang">Chenghua Wang</a>
</div>
:::{warning}
:class: myclass1 myclass2
:name: a-tip-reference
This document is still **experimental** and may be incomplete.
Suggestions and improvements are highly encouraged—please submit a PR!
:::
Elementwise operators are widely used in deep learning and often serve as the first example encountered by those beginning to explore parallel programming. This tutorial will analyze several implementations of the elementwise addition operator using TileLang and compare them with the corresponding CUDA implementation. By the end of this tutorial, you will learn:
1. How to implement an elementwise operator using TileLang.
2. How to compile operators with dynamic shapes.
3. How TileLang addresses boundary-related issues.
4. The similarities and differences between operators implemented in TileLang and those implemented in CUDA/CuTe.
Please note that this tutorial does not delve deeply into the design principles of TileLang. For a broader understanding of TileLang, we recommend consulting the [Overview](../get_started/overview.md).
## Elementwise add in TileLang
```python
import tilelang.language as T

def elementwise_add(N, threads=256, dtype="bfloat16"):

    @T.prim_func
    def main(A: T.Tensor((N), dtype), B: T.Tensor((N), dtype), C: T.Tensor((N), dtype)):
        with T.Kernel(T.ceildiv(N, threads), threads=threads) as (b_x):
            # vector add.
            for i in T.Parallel(threads):
                C[b_x * threads + i] = A[b_x * threads + i] + B[b_x * threads + i]

    return main
```
All logic for TileLang kernels must be implemented within the `T.Kernel(...)` scope. In this example, initializing `T.Kernel(...)` requires specifying both the grid size and the number of threads per block. The returned value `b_x` corresponds to `blockIdx.x` in CUDA. In the provided implementation, `T.Parallel` is used to process the data tile (of size `1 x threads`) assigned to the block for computation.
Those familiar with CUDA programming might wonder where `threadIdx` fits into this. Note that the code inside `T.Kernel` operates at the **block level**, not the **thread level**. In this example, your focus is solely on defining the block-level logic. During compilation, TileLang automatically maps computations to the corresponding threads and applies further optimizations. The optimized code generated by TileLang may closely align with carefully handcrafted computational logic, as demonstrated in Section 2 with a concrete example. While TileLang also supports thread-level programming semantics, this will be covered in subsequent discussions.
The program can be compiled using the following code:
```python
program = elementwise_add(1024, threads=256, dtype="bfloat16")
kernel = tilelang.compile(program, out_idx=-1, target="cuda", execution_backend="cython")
```
Launching the kernel is straightforward; just call it directly like a function:
```python
C = kernel(A, B)
```
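For example, with PyTorch tensors on a CUDA device (a usage sketch; `out_idx=-1` above tells the runtime to allocate and return the last parameter, `C`):
```python
import torch

A = torch.randn(1024, dtype=torch.bfloat16, device="cuda")
B = torch.randn(1024, dtype=torch.bfloat16, device="cuda")
C = kernel(A, B)  # the output buffer is allocated and returned automatically
torch.testing.assert_close(C, A + B, rtol=1e-2, atol=1e-2)
```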
The vector add operation can also be extended to two-dimensional cases, where both implementations demonstrate comparable efficiency in practice. Below is an example from the test section that readers can refer to: [example](https://github.com/tile-ai/tilelang/blob/main/testing/python/kernel/test_tilelang_kernel_element_wise_add.py). The code for this kernel is provided below:
```python
import tilelang.language as T

def elementwise_add(
    M,
    N,
    block_M,
    block_N,
    in_dtype,
    out_dtype,
    threads,
):

    @T.prim_func
    def main(
        A: T.Tensor((M, N), in_dtype),
        B: T.Tensor((M, N), in_dtype),
        C: T.Tensor((M, N), out_dtype),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=threads) as (bx, by):
            start_x = bx * block_N
            start_y = by * block_M
            for (local_y, local_x) in T.Parallel(block_M, block_N):
                y = start_y + local_y
                x = start_x + local_x
                C[y, x] = A[y, x] + B[y, x]

    return main
```
### How to compile operators with dynamic shapes?
In the compilation process above, a fixed shape was used. However, in practical usage, we often want the kernel to support dynamic shapes. So, how can we compile a kernel in TileLang to handle dynamic shapes? In TileLang, we can replace the target size with a dynamic symbolic value, making the dimension dynamic. The following example illustrates this:
```python
program = elementwise_add(T.dynamic("N"), threads=256, dtype="bfloat16")
kernel = tilelang.compile(program, out_idx=-1, target="cuda", execution_backend="cython")
```
The resulting CUDA code for the kernel will include an additional `int N` parameter after the `bfloat16_t* __restrict__ A`, `bfloat16_t* __restrict__ B`, and `bfloat16_t* __restrict__ C` parameters.
### How TileLang addresses boundary-related issues.
TileLang automatically incorporates boundary-checking conditions; however, this comes at a cost. These boundary conditions may prevent TileLang from performing more advanced optimizations. I will introduce an example from the next section in advance. The corresponding code is also provided below, but note that it involves the associated CUDA code. Readers are encouraged to first review the next section before returning to this paragraph for a clearer understanding.
When compiling the example below, let's set `N` to 2047:
```python
def elementwise_add(N, num_per_thread=8, threads=256, dtype="bfloat16"):

    @T.prim_func
    def main(A: T.Tensor((N), dtype), B: T.Tensor((N), dtype), C: T.Tensor((N), dtype)):
        with T.Kernel(T.ceildiv(N, threads * num_per_thread), threads=threads) as (b_x):
            # vector add.
            for i, j in T.Parallel(threads, num_per_thread):
                offsets = (b_x * threads + i) * num_per_thread
                C[offsets + j] = A[offsets + j] + B[offsets + j]

    return main
```
TileLang will generate the following CUDA code:
```c++
extern "C" __global__ void __launch_bounds__(256) main_kernel(bfloat16_t* __restrict__ A, bfloat16_t* __restrict__ B, bfloat16_t* __restrict__ C) {
#pragma unroll
for (int i = 0; i < 8; ++i) {
if (((i * 256) + ((int)threadIdx.x)) < 2047) {
C[((i * 256) + ((int)threadIdx.x))] = (A[((i * 256) + ((int)threadIdx.x))] + B[((i * 256) + ((int)threadIdx.x))]);
}
}
}
```
We can observe that TileLang did not apply vectorization here: the boundary check forces each thread to issue eight separate scalar accesses. In fact, except for the tail group of data, all other threads could have executed more optimized code.
## Comparison of TileLang, CUDA, and CuTe
For the subsequent examples, this tutorial will use the vector add operation for simplicity and brevity.
Typically, those new to CUDA programming often write CUDA code in a style similar to this:
```c++
// vector add
__global__ void elementwise_add(float* a, float* b, float* c, int N) {
  int idx = threadIdx.x + blockIdx.x * blockDim.x;
  if (idx < N) {
    c[idx] = a[idx] + b[idx];
  }
}
```
The code above assigns each thread to compute a single element, which is inefficient since acceleration techniques such as vectorized 128-bit memory access are not utilized. However, TileLang code written with similar logic (e.g., loop-based traversal) can be optimized by the compiler into highly efficient implementations, making it more accessible for beginners. Additionally, the final generated code from the compiler remains observable, providing transparency into the optimization process.
The CUDA code generated by TileLang for the compiled kernel can be retrieved using the `kernel.get_kernel_source()` method. Below is the CUDA code produced for the vector addition example from Section 1:
```c++
extern "C" __global__ void __launch_bounds__(256) main_kernel(bfloat16_t* __restrict__ A, bfloat16_t* __restrict__ B, bfloat16_t* __restrict__ C) {
  if (((int)threadIdx.x) < 32) {
    uint4 __1;
    uint4 v_ = *(uint4*)(A + ((((int)blockIdx.x) * 256) + (((int)threadIdx.x) * 8)));
    uint4 v__1 = *(uint4*)(B + ((((int)blockIdx.x) * 256) + (((int)threadIdx.x) * 8)));
    ((nv_bfloat162*)(&(__1.x)))->x = (((nv_bfloat162*)(&(v_.x)))->x+((nv_bfloat162*)(&(v__1.x)))->x);
    ((nv_bfloat162*)(&(__1.x)))->y = (((nv_bfloat162*)(&(v_.x)))->y+((nv_bfloat162*)(&(v__1.x)))->y);
    ((nv_bfloat162*)(&(__1.y)))->x = (((nv_bfloat162*)(&(v_.y)))->x+((nv_bfloat162*)(&(v__1.y)))->x);
    ((nv_bfloat162*)(&(__1.y)))->y = (((nv_bfloat162*)(&(v_.y)))->y+((nv_bfloat162*)(&(v__1.y)))->y);
    ((nv_bfloat162*)(&(__1.z)))->x = (((nv_bfloat162*)(&(v_.z)))->x+((nv_bfloat162*)(&(v__1.z)))->x);
    ((nv_bfloat162*)(&(__1.z)))->y = (((nv_bfloat162*)(&(v_.z)))->y+((nv_bfloat162*)(&(v__1.z)))->y);
    ((nv_bfloat162*)(&(__1.w)))->x = (((nv_bfloat162*)(&(v_.w)))->x+((nv_bfloat162*)(&(v__1.w)))->x);
    ((nv_bfloat162*)(&(__1.w)))->y = (((nv_bfloat162*)(&(v_.w)))->y+((nv_bfloat162*)(&(v__1.w)))->y);
    *(uint4*)(C + ((((int)blockIdx.x) * 256) + (((int)threadIdx.x) * 8))) = __1;
  }
}
```
In the code above, TileLang not only automatically maps block-level parallelism to threads but also applies optimizations such as vectorization and coalesced memory access.
While TileLang incorporates various optimizations for the aforementioned case, its behavior may sometimes appear counterintuitive. For example, when targeting 256 threads for task processing, applying vectorization can result in each thread computing 8 data elements—effectively utilizing only 32 active threads. Interestingly, the kernel launch configuration still retains the original allocation of 256 threads.
In such scenarios, explicitly specifying the number of elements computed per thread can help "guide" TileLang's code generation process, leading to implementations that are more closely aligned with the intended design.
```python
def elementwise_add(N, num_per_thread=8, threads=256, dtype="bfloat16"):

    @T.prim_func
    def main(A: T.Tensor((N), dtype), B: T.Tensor((N), dtype), C: T.Tensor((N), dtype)):
        with T.Kernel(T.ceildiv(N, threads * num_per_thread), threads=threads) as (b_x):
            # vector add.
            for i, j in T.Parallel(threads, num_per_thread):
                offsets = (b_x * threads + i) * num_per_thread
                C[offsets + j] = A[offsets + j] + B[offsets + j]

    return main
```
The corresponding CUDA code generated for the above example is presented below:
```c++
extern "C" __global__ void __launch_bounds__(256) main_kernel(bfloat16_t* __restrict__ A, bfloat16_t* __restrict__ B, bfloat16_t* __restrict__ C) {
uint4 __1;
uint4 v_ = *(uint4*)(A + (((int)threadIdx.x) * 8));
uint4 v__1 = *(uint4*)(B + (((int)threadIdx.x) * 8));
((nv_bfloat162*)(&(__1.x)))->x = (((nv_bfloat162*)(&(v_.x)))->x+((nv_bfloat162*)(&(v__1.x)))->x);
((nv_bfloat162*)(&(__1.x)))->y = (((nv_bfloat162*)(&(v_.x)))->y+((nv_bfloat162*)(&(v__1.x)))->y);
((nv_bfloat162*)(&(__1.y)))->x = (((nv_bfloat162*)(&(v_.y)))->x+((nv_bfloat162*)(&(v__1.y)))->x);
((nv_bfloat162*)(&(__1.y)))->y = (((nv_bfloat162*)(&(v_.y)))->y+((nv_bfloat162*)(&(v__1.y)))->y);
((nv_bfloat162*)(&(__1.z)))->x = (((nv_bfloat162*)(&(v_.z)))->x+((nv_bfloat162*)(&(v__1.z)))->x);
((nv_bfloat162*)(&(__1.z)))->y = (((nv_bfloat162*)(&(v_.z)))->y+((nv_bfloat162*)(&(v__1.z)))->y);
((nv_bfloat162*)(&(__1.w)))->x = (((nv_bfloat162*)(&(v_.w)))->x+((nv_bfloat162*)(&(v__1.w)))->x);
((nv_bfloat162*)(&(__1.w)))->y = (((nv_bfloat162*)(&(v_.w)))->y+((nv_bfloat162*)(&(v__1.w)))->y);
*(uint4*)(C + (((int)threadIdx.x) * 8)) = __1;
}
```
Aha, this CUDA code aligns closely with conventional programming practices, making it more familiar and intuitive.
But what happens if we provide additional hints to TileLang? For instance, by explicitly specifying register copies using the `T.copy(...)` operation. The example below demonstrates a vector addition implementation. Unlike the previous examples, this code explicitly loads data into registers before performing computations.
```python
def elementwise_add(N, NUM_ELE_PER_THREAD=8, threads=256, dtype="bfloat16"):

    @T.prim_func
    def main(A: T.Tensor((N), dtype), B: T.Tensor((N), dtype), C: T.Tensor((N), dtype)):
        with T.Kernel(T.ceildiv(N, threads * NUM_ELE_PER_THREAD), threads=threads) as (b_x):
            A_register = T.alloc_fragment((threads * NUM_ELE_PER_THREAD), dtype)
            B_register = T.alloc_fragment((threads * NUM_ELE_PER_THREAD), dtype)
            C_register = T.alloc_fragment((threads * NUM_ELE_PER_THREAD), dtype)
            s_start = b_x * threads * NUM_ELE_PER_THREAD
            s_end = (b_x + 1) * threads * NUM_ELE_PER_THREAD
            # LDG.128
            T.copy(A[s_start:s_end], A_register)
            T.copy(B[s_start:s_end], B_register)
            # vector add.
            for tid, i in T.Parallel(threads, NUM_ELE_PER_THREAD):
                C_register[tid * NUM_ELE_PER_THREAD + i] = (
                    A_register[tid * NUM_ELE_PER_THREAD + i] +
                    B_register[tid * NUM_ELE_PER_THREAD + i])
            # STG.128
            T.copy(C_register, C[s_start:s_end])

    return main
```
In the example above, each thread is responsible for computing 8 elements. The `T.copy(...)` method functions at the block level, and TileLang automatically maps data movement operations to individual threads. This design may resonate more intuitively with CUDA developers. Let us now analyze the CUDA code generated from this implementation.
```c++
// N is set to 8192 * 8192 when compiling
extern "C" __global__ void __launch_bounds__(256) main_kernel(bfloat16_t* __restrict__ A, bfloat16_t* __restrict__ B, bfloat16_t* __restrict__ C) {
  bfloat16_t A_register[8];
  bfloat16_t B_register[8];
  *(uint4*)(A_register + 0) = *(uint4*)(A + ((((int)blockIdx.x) * 2048) + (((int)threadIdx.x) * 8)));
  *(uint4*)(B_register + 0) = *(uint4*)(B + ((((int)blockIdx.x) * 2048) + (((int)threadIdx.x) * 8)));
  uint4 __1;
  uint4 v_ = *(uint4*)(A_register + 0);
  uint4 v__1 = *(uint4*)(B_register + 0);
  ((nv_bfloat162*)(&(__1.x)))->x = (((nv_bfloat162*)(&(v_.x)))->x+((nv_bfloat162*)(&(v__1.x)))->x);
  ((nv_bfloat162*)(&(__1.x)))->y = (((nv_bfloat162*)(&(v_.x)))->y+((nv_bfloat162*)(&(v__1.x)))->y);
  ((nv_bfloat162*)(&(__1.y)))->x = (((nv_bfloat162*)(&(v_.y)))->x+((nv_bfloat162*)(&(v__1.y)))->x);
  ((nv_bfloat162*)(&(__1.y)))->y = (((nv_bfloat162*)(&(v_.y)))->y+((nv_bfloat162*)(&(v__1.y)))->y);
  ((nv_bfloat162*)(&(__1.z)))->x = (((nv_bfloat162*)(&(v_.z)))->x+((nv_bfloat162*)(&(v__1.z)))->x);
  ((nv_bfloat162*)(&(__1.z)))->y = (((nv_bfloat162*)(&(v_.z)))->y+((nv_bfloat162*)(&(v__1.z)))->y);
  ((nv_bfloat162*)(&(__1.w)))->x = (((nv_bfloat162*)(&(v_.w)))->x+((nv_bfloat162*)(&(v__1.w)))->x);
  ((nv_bfloat162*)(&(__1.w)))->y = (((nv_bfloat162*)(&(v_.w)))->y+((nv_bfloat162*)(&(v__1.w)))->y);
  *(uint4*)(A_register + 0) = __1;
  *(uint4*)(C + ((((int)blockIdx.x) * 2048) + (((int)threadIdx.x) * 8))) = *(uint4*)(A_register + 0);
}
```
We observed the emergence of two additional register arrays, `A_register` and `B_register`. However, during the actual computation, their contents are simply loaded into `v_` and `v__1`, respectively.
To evaluate complexity, one could implement the same elementwise addition operator using CuTe and compare it with the TileLang version. The corresponding CuTe code is provided below:
```c++
template <int NUM_ELE_PER_THREAD = 8>
__global__ void elementwise_add(nv_bfloat16* C,
                                const nv_bfloat16* A,
                                const nv_bfloat16* B,
                                int N) {
  using namespace cute;
  const int idx = threadIdx.x + blockIdx.x * blockDim.x;

  Tensor t_C = make_tensor(make_gmem_ptr(C), make_shape(N));
  Tensor t_A = make_tensor(make_gmem_ptr(A), make_shape(N));
  Tensor t_B = make_tensor(make_gmem_ptr(B), make_shape(N));

  Tensor t_C_tile = local_tile(t_C, make_shape(Int<NUM_ELE_PER_THREAD>{}), make_coord(idx));
  Tensor t_A_tile = local_tile(t_A, make_shape(Int<NUM_ELE_PER_THREAD>{}), make_coord(idx));
  Tensor t_B_tile = local_tile(t_B, make_shape(Int<NUM_ELE_PER_THREAD>{}), make_coord(idx));

  Tensor reg_buffer_A = make_tensor_like(t_A_tile);
  Tensor reg_buffer_B = make_tensor_like(t_B_tile);
  Tensor reg_buffer_C = make_tensor_like(t_C_tile);

  // LDG.128
  copy(t_A_tile, reg_buffer_A);
  copy(t_B_tile, reg_buffer_B);

  auto reg_C_vector = recast<nv_bfloat162>(reg_buffer_C);
  auto reg_A_vector = recast<nv_bfloat162>(reg_buffer_A);
  auto reg_B_vector = recast<nv_bfloat162>(reg_buffer_B);

  // Perform vectorized addition
  #pragma unroll
  for (int vec_idx = 0; vec_idx < size(reg_C_vector); ++vec_idx) {
    reg_C_vector(vec_idx) = reg_A_vector(vec_idx) + reg_B_vector(vec_idx);
  }

  auto reg_C_flat = recast<nv_bfloat16>(reg_C_vector);

  // STG.128
  copy(reg_C_flat, t_C_tile);
}
```
## Conclusion
This tutorial showcases the implementation of the elementwise addition operator using TileLang, while also comparing various design approaches. TileLang significantly reduces the complexity of CUDA programming, enabling high performance with minimal code. Nevertheless, working with TileLang demands careful attention to specific implementation details. To ensure computational efficiency, it is essential to thoroughly examine the generated CUDA kernels.
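As shown earlier, that inspection is a single call:
```python
# Dump the generated CUDA source of a compiled kernel for review.
print(kernel.get_kernel_source())
```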
---
**Reference:**
[1] The CuTe code implementation draws inspiration from the techniques discussed in this blog: https://zhuanlan.zhihu.com/p/690703999
# The Tile Language: A Brief Introduction
## Programming Interface
The figure below depicts how **TileLang** programs are progressively lowered from a high-level description to hardware-specific executables. We provide three different programming interfaces—targeted at **Beginner**, **Developer**, and **Expert** users—that each reside at different levels in this lowering pipeline. The **Tile Language** also allows mixing these interfaces within the same kernel, enabling users to work at whichever level of abstraction best suits their needs.
```{figure} ../_static/img/overview.png
:width: 50%
:alt: Overview
:align: center
Figure 1: High-level overview of the TileLang compilation flow.
```
## Programming Interfaces
1. **Beginner Level (Hardware-Unaware)**
- Intended for users who need to write code that is independent of specific hardware details.
- The goal is to let developers focus on the basic logic without worrying about memory hierarchies or hardware-specific optimizations.
- *Note:* This interface is not yet fully implemented.
2. **Developer Level (Hardware-Aware with Tile Library)**
- Designed for developers who have a basic understanding of GPU memory hierarchies and performance considerations.
- Provides a **Tile Library**, containing predefined operations and patterns optimized for various hardware architectures.
- Users at this level can leverage these ready-made primitives without diving into low-level threading details.
3. **Expert Level (Hardware-Aware with Thread Primitives)**
- For highly experienced users who have an in-depth understanding of low-level hardware characteristics (e.g., threading models, memory coalescing).
- Offers direct access to **thread primitives** and other low-level constructs, allowing for fine-grained control of performance-critical kernels.
- This level grants maximum flexibility for specialized optimizations tailored to specific GPU or multi-core architectures.
## Compilation Flow
1. **Tile Program**
A high-level specification of the computation. Depending on the user’s expertise, they may write a purely hardware-unaware tile program or incorporate constructs from the Tile Library or thread primitives.
2. **Tile Program with Tile Library**
When developers choose from the Tile Library, the original Tile Program is expanded with specialized library calls. These calls encapsulate efficient implementation patterns for different operations.
3. **Tile Program with Thread Primitives**
Expert-level developers can explicitly use low-level threading constructs to hand-optimize data layout, synchronization, and memory usage.
4. **IRModule**
After the program is composed with libraries or thread primitives, it is lowered to an intermediate representation (IR) that captures the necessary hardware details.
5. **Source Code Generation (C/CUDA/HIP/LLVM/…)**
From the IR, the system generates target-specific source code. This source code is tuned for the desired backends or GPU architectures (e.g., NVIDIA, AMD).
6. **Hardware-Specific Executable/Runtime**
Finally, the generated source is compiled into hardware-specific executables, ready to run on the corresponding devices. The pipeline supports multiple GPU backends and can be extended to additional architectures.
## Tile-based Programming Model
[Figure 2](#fig-overview-gemm) provides a concise matrix multiplication (GEMM) example in ``TileLang``,
illustrating how developers can employ high-level constructs such as tiles, memory placement, pipelining,
and operator calls to manage data movement and computation with fine-grained control.
In particular, this snippet ([Figure 2](#fig-overview-gemm) (a)) demonstrates how multi-level tiling
leverages different memory hierarchies (global, shared, and registers) to optimize bandwidth utilization
and reduce latency.
Overall, [Figure 2](#fig-overview-gemm) (b) showcases how the Python-like syntax of ``TileLang``
allows developers to reason about performance-critical optimizations within a user-friendly programming model.
```{figure} ../_static/img/MatmulExample.png
:align: center
:width: 100%
:alt: GEMM with Multi-Level Tiling on GPUs
:name: fig-overview-gemm
Figure 2: Optimizing GEMM with Multi-Level Tiling on GPUs via ``TileLang``.
```
### Tile declarations
At the heart of our approach is the notion of *tiles* as first-class objects in the programming model. A tile represents a shaped portion of data, which can be owned and manipulated by a warp, thread block, or equivalent parallel unit. In the `Matmul` example, the `A` and `B` buffers are read in tiled chunks (determined by `block_M`, `block_N`, `block_K`) inside the kernel loop. With `T.Kernel`, `TileLang` defines the execution context, which includes the thread block index (`bx` and `by`) and the number of threads. These contexts can help compute the index for each thread block and make it easier for `TileLang` to automatically infer and optimize memory access and computation. Additionally, these contexts allow users to manually control the behavior of each independent thread within a thread block.
### Explicit Hardware Memory Allocation
A hallmark of `TileLang` is the ability to explicitly place these tile buffers in the hardware memory hierarchy. Rather than leaving it to a compiler's opaque optimization passes, `TileLang` exposes user-facing intrinsics that map directly to physical memory spaces or accelerator-specific constructs. In particular:
- `T.alloc_shared`: Allocates memory in a fast, on-chip storage space, which corresponds to shared memory on NVIDIA GPUs. Shared memory is ideal for caching intermediate data during computations, as it is significantly faster than global memory and allows for efficient data sharing between threads in the same thread block. For example, in matrix multiplication, tiles of matrices can be loaded into shared memory to reduce global memory bandwidth demands and improve performance.
- `T.alloc_fragment`: Allocates accumulators in fragment memory, which corresponds to register files on NVIDIA GPUs. By keeping inputs and partial sums in registers or hardware-level caches, latency is further minimized. Note that in this tile program, each tile allocates the same local buffers as shared memory, which might seem counterintuitive, as shared memory is generally faster but more abundant, whereas register file space is limited. This is because the allocation here refers to the register files for an entire thread block. `TileLang` uses a Layout Inference Pass during compilation to derive a Layout object `T.Fragment`, which determines how to allocate the corresponding register files for each thread. This process will be discussed in detail in subsequent sections.
Data transfer between global memory and hardware-specific memory can be managed using `T.copy`. Furthermore, hardware-specific buffers can be initialized using `T.clear` or `T.fill`. For data assignments, operations can also be performed in parallel using `T.Parallel`, as demonstrated in Layout Inference Pass in the following sections.
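The snippet below ties these constructs together in a compact sketch (tile sizes and names are illustrative, in the spirit of the `Matmul` example in Figure 2): a tile is staged through shared memory, transformed in registers with `T.Parallel`, and written back with `T.copy`.
```python
import tilelang.language as T

def scale_tile(M, N, block_M=64, block_N=64, dtype="float16"):

    @T.prim_func
    def main(A: T.Tensor((M, N), dtype), B: T.Tensor((M, N), dtype)):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=128) as (bx, by):
            A_shared = T.alloc_shared((block_M, block_N), dtype)  # on-chip tile
            acc = T.alloc_fragment((block_M, block_N), dtype)     # block-level registers
            T.clear(acc)                                          # initialize accumulators
            T.copy(A[by * block_M, bx * block_N], A_shared)       # global -> shared
            for i, j in T.Parallel(block_M, block_N):             # parallel assignment
                acc[i, j] = A_shared[i, j] * 2
            T.copy(acc, B[by * block_M, bx * block_N])            # registers -> global

    return main
```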
```{figure} ../_static/img/LayoutInference.png
:align: center
:width: 100%
:alt: GEMM with Multi-Level Tiling on GPUs
```