README.md 10.3 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
### div

| Implementation                | Max Abs Error      | Mean Abs Error     | Max Rel Error     | Mean Rel Error    |
|-------------------------------|--------------------|--------------------|-------------------|-------------------|
| FP32 Precise vs Double        | 1.219e-04          | 8.916e-08          | 5.952e-08         | 2.152e-08         |
| Triton LibDevice vs Double    | 1.219e-04          | 8.916e-08          | 5.952e-08         | 2.152e-08         |
| TileLang vs Double            | 1.219e-04          | 8.916e-08          | 5.952e-08         | 2.152e-08         |
| PyTorch vs Double             | 1.219e-04          | 8.916e-08          | 5.952e-08         | 2.152e-08         |
| Triton vs Double              | 2.605e-04          | 1.285e-07          | 1.455e-07         | 3.175e-08         |
| TileLang Fastmath vs Double   | 2.605e-04          | 1.285e-07          | 1.455e-07         | 3.175e-08         |
| CUDA Fast vs Double           | 2.605e-04          | 1.285e-07          | 1.455e-07         | 3.175e-08         |

### reciprocal

| Implementation                | Max Abs Error      | Mean Abs Error     | Max Rel Error     | Mean Rel Error    |
|-------------------------------|--------------------|--------------------|-------------------|-------------------|
| FP32 Precise vs Double        | 3.039e-05          | 4.418e-08          | 5.960e-08         | 2.235e-08         |
| Triton LibDevice vs Double    | 3.039e-05          | 4.418e-08          | 5.960e-08         | 2.235e-08         |
| TileLang vs Double            | 3.039e-05          | 4.418e-08          | 5.960e-08         | 2.235e-08         |
| PyTorch vs Double             | 3.039e-05          | 4.418e-08          | 5.960e-08         | 2.235e-08         |
| Triton vs Double              | 4.470e-05          | 4.886e-08          | 9.699e-08         | 2.461e-08         |
| TileLang Fastmath vs Double   | 4.470e-05          | 4.886e-08          | 9.699e-08         | 2.461e-08         |
| CUDA Fast vs Double           | 4.470e-05          | 4.886e-08          | 9.699e-08         | 2.461e-08         |

### exp

| Implementation                | Max Abs Error      | Mean Abs Error     | Max Rel Error     | Mean Rel Error    |
|-------------------------------|--------------------|--------------------|-------------------|-------------------|
| FP32 Precise vs Double        | 5.494e-06          | 2.153e-07          | 1.483e-07         | 3.200e-08         |
| Triton LibDevice vs Double    | 5.494e-06          | 2.153e-07          | 1.483e-07         | 3.200e-08         |
| TileLang vs Double            | 5.494e-06          | 2.153e-07          | 1.483e-07         | 3.200e-08         |
| PyTorch vs Double             | 5.494e-06          | 2.153e-07          | 1.483e-07         | 3.200e-08         |
| Triton vs Double              | 1.338e-05          | 5.023e-07          | 2.641e-07         | 5.564e-08         |
| TileLang Fastmath vs Double   | 1.338e-05          | 5.023e-07          | 2.641e-07         | 5.564e-08         |
| CUDA Fast vs Double           | 1.338e-05          | 5.023e-07          | 2.641e-07         | 5.564e-08         |

### log

| Implementation                | Max Abs Error      | Mean Abs Error     | Max Rel Error     | Mean Rel Error    |
|-------------------------------|--------------------|--------------------|-------------------|-------------------|
| FP32 Precise vs Double        | 2.684e-07          | 2.051e-08          | 7.886e-08         | 2.297e-08         |
| Triton LibDevice vs Double    | 2.684e-07          | 2.051e-08          | 7.886e-08         | 2.297e-08         |
| TileLang vs Double            | 2.684e-07          | 2.051e-08          | 7.886e-08         | 2.297e-08         |
| PyTorch vs Double             | 2.684e-07          | 2.051e-08          | 7.886e-08         | 2.297e-08         |
| Triton vs Double              | 2.684e-07          | 2.051e-08          | 7.886e-08         | 2.297e-08         |
| TileLang Fastmath vs Double   | 9.087e-07          | 4.760e-08          | 2.019e-02         | 3.183e-07         |
| CUDA Fast vs Double           | 9.087e-07          | 4.760e-08          | 2.019e-02         | 3.183e-07         |

### sin

| Implementation                | Max Abs Error      | Mean Abs Error     | Max Rel Error     | Mean Rel Error    |
|-------------------------------|--------------------|--------------------|-------------------|-------------------|
| FP32 Precise vs Double        | 7.731e-08          | 1.401e-08          | 1.148e-07         | 2.492e-08         |
| Triton LibDevice vs Double    | 7.731e-08          | 1.401e-08          | 1.148e-07         | 2.492e-08         |
| TileLang vs Double            | 7.731e-08          | 1.401e-08          | 1.148e-07         | 2.492e-08         |
| PyTorch vs Double             | 7.731e-08          | 1.401e-08          | 1.148e-07         | 2.492e-08         |
| Triton vs Double              | 7.731e-08          | 1.401e-08          | 1.148e-07         | 2.492e-08         |
| TileLang Fastmath vs Double   | 6.463e-07          | 1.251e-07          | 7.111e-02         | 1.425e-06         |
| CUDA Fast vs Double           | 6.463e-07          | 1.251e-07          | 7.111e-02         | 1.425e-06         |

### cos

| Implementation                | Max Abs Error      | Mean Abs Error     | Max Rel Error     | Mean Rel Error    |
|-------------------------------|--------------------|--------------------|-------------------|-------------------|
| FP32 Precise vs Double        | 8.668e-08          | 1.587e-08          | 1.199e-07         | 2.513e-08         |
| Triton LibDevice vs Double    | 8.668e-08          | 1.587e-08          | 1.199e-07         | 2.513e-08         |
| TileLang vs Double            | 8.668e-08          | 1.587e-08          | 1.199e-07         | 2.513e-08         |
| PyTorch vs Double             | 8.668e-08          | 1.587e-08          | 1.199e-07         | 2.513e-08         |
| Triton vs Double              | 8.668e-08          | 1.587e-08          | 1.199e-07         | 2.513e-08         |
| TileLang Fastmath vs Double   | 4.006e-07          | 9.249e-08          | 5.275e-02         | 7.307e-07         |
| CUDA Fast vs Double           | 4.006e-07          | 9.249e-08          | 5.275e-02         | 7.307e-07         |

### sqrt

| Implementation                | Max Abs Error      | Mean Abs Error     | Max Rel Error     | Mean Rel Error    |
|-------------------------------|--------------------|--------------------|-------------------|-------------------|
| FP32 Precise vs Double        | 5.960e-08          | 2.554e-08          | 5.960e-08         | 1.986e-08         |
| Triton LibDevice vs Double    | 5.960e-08          | 2.554e-08          | 5.960e-08         | 1.986e-08         |
| TileLang vs Double            | 5.960e-08          | 2.554e-08          | 5.960e-08         | 1.986e-08         |
| PyTorch vs Double             | 5.960e-08          | 2.554e-08          | 5.960e-08         | 1.986e-08         |
| Triton vs Double              | 1.114e-07          | 2.947e-08          | 9.962e-08         | 2.291e-08         |
| TileLang Fastmath vs Double   | 1.114e-07          | 2.947e-08          | 9.962e-08         | 2.291e-08         |
| CUDA Fast vs Double           | 1.114e-07          | 2.947e-08          | 9.962e-08         | 2.291e-08         |

### tanh

| Implementation                | Max Abs Error      | Mean Abs Error     | Max Rel Error     | Mean Rel Error    |
|-------------------------------|--------------------|--------------------|-------------------|-------------------|
| FP32 Precise vs Double        | 1.056e-07          | 1.636e-08          | 1.966e-07         | 2.359e-08         |
| Triton LibDevice vs Double    | 1.056e-07          | 1.636e-08          | 1.966e-07         | 2.359e-08         |
| TileLang vs Double            | 1.056e-07          | 1.636e-08          | 1.966e-07         | 2.359e-08         |
| PyTorch vs Double             | 1.056e-07          | 1.636e-08          | 1.966e-07         | 2.359e-08         |
| Triton vs Double              | 2.293e-07          | 3.965e-08          | 6.204e-04         | 1.100e-07         |
| TileLang Fastmath vs Double   | 7.826e-06          | 1.384e-06          | 1.081e-05         | 1.906e-06         |
| CUDA Fast vs Double           | 7.826e-06          | 1.384e-06          | 1.081e-05         | 1.906e-06         |

### rsqrt

| Implementation                | Max Abs Error      | Mean Abs Error     | Max Rel Error     | Mean Rel Error    |
|-------------------------------|--------------------|--------------------|-------------------|-------------------|
| FP32 Precise vs Double        | 2.057e-06          | 2.798e-08          | 1.224e-07         | 2.918e-08         |
| Triton LibDevice vs Double    | 9.535e-07          | 2.199e-08          | 5.960e-08         | 2.315e-08         |
| TileLang vs Double            | 2.057e-06          | 2.798e-08          | 1.224e-07         | 2.918e-08         |
| PyTorch vs Double             | 2.057e-06          | 2.798e-08          | 1.224e-07         | 2.918e-08         |
| Triton vs Double              | 2.057e-06          | 2.798e-08          | 1.224e-07         | 2.918e-08         |
| TileLang Fastmath vs Double   | 2.057e-06          | 2.798e-08          | 1.224e-07         | 2.918e-08         |
| CUDA Fast vs Double           | 2.057e-06          | 2.798e-08          | 1.224e-07         | 2.918e-08         |

### inv_sqrt

| Implementation                | Max Abs Error      | Mean Abs Error     | Max Rel Error     | Mean Rel Error    |
|-------------------------------|--------------------|--------------------|-------------------|-------------------|
| FP32 Precise vs Double        | 2.501e-06          | 2.911e-08          | 8.939e-08         | 2.963e-08         |
| Triton LibDevice vs Double    | 2.501e-06          | 2.911e-08          | 8.939e-08         | 2.963e-08         |
| TileLang vs Double            | 2.501e-06          | 2.911e-08          | 8.939e-08         | 2.963e-08         |
| PyTorch vs Double             | 2.501e-06          | 2.911e-08          | 8.939e-08         | 2.963e-08         |
| Triton vs Double              | 2.876e-06          | 3.443e-08          | 1.536e-07         | 3.503e-08         |
| TileLang Fastmath vs Double   | 2.876e-06          | 3.443e-08          | 1.536e-07         | 3.503e-08         |
| CUDA Fast vs Double           | 2.876e-06          | 3.171e-08          | 1.250e-07         | 3.211e-08         |