# TileLang SM100 Support (Preview)

This directory contains examples for TileLang's experimental SM100 architecture support. **This is a preview version** with limited functionality.

## Current Limitations (Manual Implementation Required)

### 1. Manual TCGEN5MMA Management
Users must manually orchestrate TCGEN5MMA operations (see the sketch after this list) using:
- `T.alloc_tmem()` - Allocate Tensor Memory
- `T.gemm()` with `wg_wait=-1` - Launch TCGEN5MMA without waiting
- Manual synchronization with mbarrier
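
A minimal sketch of how these pieces fit together, inside a kernel where `A_shared` and `B_shared` have already been staged (buffer names and shapes are illustrative; the full kernel appears below):

```python
C_tmem = T.alloc_tmem([block_M, block_N], "float")  # accumulator lives in Tensor Memory
mbar = T.alloc_barrier(1)                           # mbarrier signalled when the MMA finishes

# Launch TCGEN5MMA asynchronously (wg_wait=-1), then block on the mbarrier
T.gemm(A_shared, B_shared, C_tmem, trans_A=False, trans_B=True,
       mbar=mbar, wg_wait=-1, clear_accum=True)
T.mbarrier_wait_parity(mbar, 0)  # parity 0: first completion of this barrier
```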

### 2. Manual mbarrier Synchronization
TCGEN5MMA is asynchronous, so completion must be awaited explicitly. The mbarrier's phase bit flips each time its expected arrival count is reached, so the waiting code has to track that parity by hand (`k % 2` below):
```python
mbar = T.alloc_barrier(1)  # expected-arrive-count = 1
T.gemm(A_shared, B_shared, C_tmem, trans_A, trans_B, mbar=mbar, wg_wait=-1, clear_accum=(k == 0))
T.mbarrier_wait_parity(mbar, k % 2)  # manual phase (parity) tracking required
```

## Examples

### TCGEN5MMA Example (`gemm_tcgen5mma.py`)
Demonstrates TCGEN5MMA operations with:
- Tensor Memory allocation
- Manual mbarrier synchronization
- TCGEN5MMA gemm operations

### Traditional MMA Example (`gemm_mma.py`)
Shows the standard MMA pattern that works across architectures, included for comparison with the TCGEN5MMA path; a minimal sketch follows.
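
For contrast, a minimal sketch of the conventional register-accumulating GEMM loop, written inside a `T.Kernel` region like the one in the full example below (illustrative only; `gemm_mma.py` may differ in details such as tile sizes and pipeline depth):

```python
C_local = T.alloc_fragment((block_M, block_N), "float")  # accumulate in registers
T.clear(C_local)
for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=2):
    # Load tiles into shared memory
    T.copy(A[by * block_M, k * block_K], A_shared)
    T.copy(B[bx * block_N, k * block_K], B_shared)
    # Standard MMA: accumulates into the register fragment,
    # no Tensor Memory and no manual mbarrier handling
    T.gemm(A_shared, B_shared, C_local, trans_A=False, trans_B=True)
# Write the accumulated tile back to global memory
T.copy(C_local, C[by * block_M, bx * block_N])
```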

## Code Example

The following code, based on `gemm_tcgen5mma.py`, demonstrates a TCGEN5MMA matrix multiplication. The kernel is built by a small factory function parameterized by the problem size and tile sizes:

```python
import torch
import tilelang
import tilelang.language as T

def matmul(M, N, K, block_M, block_N, block_K):

    @T.prim_func
    def main(
            A: T.Tensor((M, K), "bfloat16"),
            B: T.Tensor((N, K), "bfloat16"),
            C: T.Tensor((M, N), "bfloat16"),
    ):
        with T.Kernel(T.ceildiv(N, block_N), T.ceildiv(M, block_M), threads=256) as (bx, by):
            # 1. Allocate memory buffers
            A_shared = T.alloc_shared((block_M, block_K), "bfloat16")  # A tile in shared memory
            B_shared = T.alloc_shared((block_N, block_K), "bfloat16")  # B tile in shared memory
            C_tmem = T.alloc_tmem([block_M, block_N], "float")         # TCGEN5MMA accumulator in Tensor Memory
            mbar = T.alloc_barrier(1)                                  # mbarrier synchronization primitive

            C_local = T.alloc_fragment((block_M, block_N), "float")    # register storage
            C_shared = T.alloc_shared((block_M, block_N), "bfloat16")  # output staging in shared memory

            # 2. Main reduction loop over K
            for k in T.Pipelined(T.ceildiv(K, block_K), num_stages=1):
                # Data loading: global memory → shared memory
                T.copy(A[by * block_M, k * block_K], A_shared)
                T.copy(B[bx * block_N, k * block_K], B_shared)

                # TCGEN5MMA computation: asynchronous launch, accumulating into Tensor Memory
                T.gemm(A_shared, B_shared, C_tmem, trans_A=False, trans_B=True,
                       mbar=mbar, wg_wait=-1, clear_accum=(k == 0))

                # Critical: wait for this k-step's TCGEN5MMA to complete
                T.mbarrier_wait_parity(mbar, k % 2)

            # 3. Output processing (only a subset of threads participates)
            T.copy(C_tmem, C_local)      # Tensor Memory → registers
            T.copy(C_local, C_shared)    # registers → shared memory

            # 4. Write back to global memory
            T.copy(C_shared, C[by * block_M, bx * block_N])

    return main
```

### Compilation and Usage

```python
# Parameter setup
M, N, K = 4096, 4096, 8192
block_M, block_N, block_K = 128, 256, 128

# Build and compile the kernel
func = matmul(M, N, K, block_M, block_N, block_K)
jit_kernel = tilelang.compile(func, out_idx=[2], target="cuda", pass_configs={
    tilelang.PassConfigKey.TL_DISABLE_TMA_LOWER: True,        # Required
    tilelang.PassConfigKey.TL_DISABLE_WARP_SPECIALIZED: True, # Required
})

# Run test
a = torch.randn(M, K, device="cuda", dtype=torch.bfloat16)
b = torch.randn(N, K, device="cuda", dtype=torch.bfloat16)
c = jit_kernel(a, b)

# Verify correctness
ref_c = (a.to(torch.float) @ b.T.to(torch.float)).to(torch.bfloat16)
torch.testing.assert_close(c, ref_c, rtol=1e-2, atol=1e-2)

# Performance benchmark
profiler = jit_kernel.get_profiler()
latency = profiler.do_bench()
print(f"Latency: {latency} ms")
print(f"Performance: {2 * M * N * K / (latency/1e3) / 1e12:.2f} TFLOPS")
```
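
To see what was actually generated, you can dump the kernel source and look for `tcgen05.mma` instructions (assuming your TileLang build exposes `get_kernel_source()` on JIT kernels):

```python
# Print the generated CUDA source for inspection
# (method name assumed from common TileLang usage; adjust if your build differs)
print(jit_kernel.get_kernel_source())
```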