README.md 3.81 KB
Newer Older
root's avatar
init  
root committed
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
# TVM IR Performance Analyzer

A performance analysis toolkit for TVM IR modules, Provides hardware-aware performance metrics including FLOPs, memory bandwidth utilization, and execution time estimation.

## Features

-**Operation Analysis**: Supports arbitrary operations expressed in TVM IR (including GEMM and convolution)
-**Memory Traffic Calculation**: Tracks global memory transfers
-**Architecture-aware Metrics**: Pre-configured with NVIDIA GPU architectures (Ampere, Ada Lovelace)
-**Performance Estimation**: Predicts execution time using roofline model
-**TVM Integration**: Works with TVM IRModule and PrimFunc

## Quick Start
### GEMM Analysis Example
```python
import tilelang.language as T
from tilelang.tools import Analyzer
from tilelang.carver.arch import CUDA

M = N = K = 1024

def kernel(block_M=128, block_N=128, block_K=32, num_stages=3, thread_num=128):
    @T.prim_func
    def main(A: T.Tensor((M, K), "float16"),
             B: T.Tensor((N, K), "float16"),
             C: T.Tensor((M, N), "float")):
        # ... (kernel definition)
    return main

cuda_device = CUDA("cuda")
result = Analyzer.analysis(kernel(), cuda_device)
print(result)
```

### Convolution Analysis Example
```python
import tilelang.language as T
from tilelang.tools import Analyzer
from tilelang.carver.arch import CUDA

def kernel(N=64, C=256, H=512, W=512, F=512, K=3, block_M=64, block_N=128):
    @T.prim_func
    def main(data: T.Tensor((N, H, W, C), "float16"),
             kernel: T.Tensor((K, K, C, F), "float16"),
             out: T.Tensor((N, (H-K+1), (W-K+1), F), "float")):
        # ... (convolution kernel definition)
    return main

cuda_device = CUDA("cuda")
result = Analyzer.analysis(kernel(), cuda_device)
print(result)
```

## API Documentation
### `AnalysisResult` Class
```python
@dataclass(frozen=True)
class AnalysisResult:
    total_flops: int          # Total floating-point operations
    total_global_bytes: int   # Global memory traffic in bytes
    estimated_time: float     # Predicted execution time (seconds)
    tflops: float             # Achieved TFLOPS
    bandwidth_GBps: float     # Memory bandwidth utilization
```
### `Analyzer` Class Methods
#### `analysis(fn, device)`
* ​Parameters:
    * fn: TVM IRModule or PrimFunc
    * device: Device configuration object
* Returns: AnalysisResult
#### Supported Architectures
```python
# Extendable to custom hardware via: "compute_capability": (cores_per_SM, clock_GHz, flops_per_cycle, max_SM_count)
ARCH_CONFIGS = {
    "80": (128, 1.41, 2, 108),  # A100
    "86": (128, 1.70, 2, 84),   # RTX 3080
    "89": (128, 2.52, 2, 128)  # RTX 4090
}
```

## Implementation Details

### Performance Model
Uses roofline model with two constraints:
1.**Compute Bound**: `Time = Total FLOPs / (SM Count × Cores/SM × Clock × FLOPs/Cycle)`
2.**Memory Bound**: `Time = Memory Bytes / (Bandwidth × Utilization)`

### IR Analysis Pass
1.**Traversal**: Walks through TVM IR using `ir_transform`
2.**Operation Detection**:
   - Counts FLOPs for all compute operations
   - Calculates memory traffic for all memory operations
3.**Loop Handling**:
   - Tracks nested loops for operation scaling
   - Accounts for block/grid dimensions

## Key Metrics Calculation

| Metric                  | Formula                                 |
|-------------------------|-----------------------------------------|
| FLOPs per GEMM          | `2 × M × N × K`                         |
| Memory Traffic per Copy | `elements × dtype_size × loop_product` |
| Achieved TFLOPS         | `total_flops / estimated_time / 1e12`  |
| Memory Bandwidth        | `total_global_bytes / estimated_time`  |

## Limitations
1. Requires memory operations to be properly annotated in the IR
2. Assumes perfect memory coalescing and no bank conflicts

## Supported Operations
Any operation expressed in TVM IR