===============
CK Hello world
===============

-------------------------------------
Motivation
-------------------------------------

This tutorial is aimed at engineers working with artificial intelligence and machine learning who
would like to optimize their pipelines and squeeze out every last drop of performance by adding the
Composable Kernel (CK) library to their projects. We want to make the CK library approachable, so
the tutorial is not based on the latest release and doesn't cover all the bleeding-edge features,
but it will remain reproducible now and in the future.

In this tutorial we will introduce the CK library, build it, and run some examples and tests;
in other words, we will run a "Hello world" example. In future tutorials we will go deeper and
broader, getting familiar with other tools and ways to integrate CK into your project.

-------------------------------------
Description
-------------------------------------

Modern AI technology solves more and more problems in all imaginable fields, but crafting fast and
efficient workflows is still challenging. CK is one of the tools that makes AI heavy lifting as fast
and efficient as possible. CK is a collection of optimized AI operator kernels and tools to create
new ones. The library has the components required for the majority of modern neural network
architectures, including matrix multiplication, convolution, contraction, reduction, attention
modules, a variety of activation functions, fused operators, and many more.

So how do we (almost) reach the speed of light? CK's acceleration abilities are based on:

* A layered structure.
* A tile-based computation model.
* Tensor coordinate transformation.
* Use of hardware acceleration.
* Support for low-precision data types, including fp16, bf16, int8 and int4.

If you are intrigued and want more technical details and benchmarking results, read this
`blog post <https://community.amd.com/t5/instinct-accelerators/amd-composable-kernel-library-efficient-fused-kernels-for-ai/ba-p/553224>`_.

For more details, visit our `GitHub repository <https://github.com/ROCmSoftwarePlatform/composable_kernel>`_.

-------------------------------------
Hardware targets
-------------------------------------

The CK library fully supports the `gfx908` and `gfx90a` GPU architectures, while only some operators
are supported on `gfx1030`. Let's check the hardware you have at hand and decide on the target
GPU architecture.
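
One way to check which architecture your device reports (assuming ROCm is already installed on the
host, or once you are inside the docker container described below) is to query the ROCm tools::

    /opt/rocm/bin/rocm_agent_enumerator
    # or
    /opt/rocm/bin/rocminfo | grep gfx

The reported ``gfx`` code should match one of the targets in the table below.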

==========     =========
GPU Target     AMD GPU
==========     =========
gfx908         AMD Instinct MI100
gfx90a         AMD Instinct MI210, MI250, MI250X
gfx1030        Radeon PRO V620, W6800, W6800X, W6800X Duo, W6900X, RX 6800, RX 6800 XT, RX 6900 XT, RX 6900 XTX, RX 6950 XT
==========     =========

If you don't have an AMD GPU at hand, there are also
`cloud options <https://aws.amazon.com/ec2/instance-types/g4/>`_ you can use.

-------------------------------------
Build the library
-------------------------------------

First, let's clone the library and check out the tested version::

    git clone https://github.com/ROCmSoftwarePlatform/composable_kernel.git
    cd composable_kernel/
    git checkout tutorial_hello_world

To make our lives easier, we prepared
`docker images <https://hub.docker.com/r/rocm/composable_kernel>`_ with all the necessary
dependencies. Pick the right image and create a container. In this tutorial we use the
``rocm/composable_kernel:ck_ub20.04_rocm6.0`` image, which is based on Ubuntu 20.04 and
ROCm v6.0.

If your current folder is ``${HOME}``, start the docker container with::

    docker run  \
    -it  \
    --privileged  \
    --group-add sudo  \
    -w /root/workspace  \
    -v ${HOME}:/root/workspace  \
    rocm/composable_kernel:ck_ub20.04_rocm6.0  \
    /bin/bash

If your current folder is different from ``${HOME}``, adjust the line ``-v ${HOME}:/root/workspace``
to fit your folder structure.
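For example, if you cloned the library into a hypothetical ``/data/work`` directory, you would
mount it with ``-v /data/work:/root/workspace`` instead.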

Inside the docker container the current folder is ``~/workspace`` and the library path is
``~/workspace/composable_kernel``. Navigate to the library::

    cd composable_kernel/

Create and go to the ``build`` directory::

    mkdir build && cd build

In the previous section we talked about the target GPU architecture. Once you have decided which one
is right for you, run CMake with the corresponding ``GPU_TARGETS`` flag::

    cmake  \
    -D CMAKE_PREFIX_PATH=/opt/rocm  \
    -D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc  \
    -D CMAKE_CXX_FLAGS="-O3"  \
    -D CMAKE_BUILD_TYPE=Release  \
    -D BUILD_DEV=OFF  \
    -D GPU_TARGETS="gfx908;gfx90a;gfx1030" ..
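
The command above builds for all three supported architectures at once. If you only need one of
them, you can shorten compilation time by restricting the flag to a single target, for example
``-D GPU_TARGETS="gfx90a"``.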

If everything went well, the CMake run will finish with::

    -- Configuring done
    -- Generating done
    -- Build files have been written to: "/root/workspace/composable_kernel/build"

Finally, we can build examples and tests::

    make -j examples tests
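
Note that ``make -j`` without a number spawns as many parallel jobs as it can; on a machine with
limited memory you may want to cap the parallelism, for example with ``make -j$(nproc) examples tests``.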

If everything is smooth, you'll see::

    Scanning dependencies of target tests
    [100%] Built target tests

---------------------------
Run examples and tests
---------------------------

Examples are listed as test cases as well, so we can run all examples and tests with::

    ctest

You can check the list of all tests by running::

    ctest -N

We can also run them individually; here is an example execution::

    ./bin/example_gemm_xdl_fp16 1 1 1

The arguments ``1 1 1`` mean that we want to run this example in the following mode: verify the
results on the CPU, initialize the matrices with integers, and benchmark the kernel execution. You
can play around with these parameters and see how the output and execution results change.
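
For instance, assuming the usual CK example convention where the first argument toggles verification
(``0`` disables it), the following run would skip the CPU check and only time the kernel (the exact
meaning of each value may vary slightly between examples)::

    ./bin/example_gemm_xdl_fp16 0 1 1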

If everything goes well and you have a device based on the `gfx908` or `gfx90a` architecture, you
should see something like::

    a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
    b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
    c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
    launch_and_time_kernel: grid_dim {480, 1, 1}, block_dim {256, 1, 1}
    Warm up 1 time
    Start running 10 times...
    Perf: 1.10017 ms, 117.117 TFlops, 87.6854 GB/s, DeviceGemmXdl<256, 256, 128, 4, 8, 32, 32, 4, 2> NumPrefetch: 1, LoopScheduler: Default, PipelineVersion: v1

Meanwhile, running it on a `gfx1030` device should result in::

    a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
    b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
    c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
    DeviceGemmXdl<256, 256, 128, 4, 8, 32, 32, 4, 2> NumPrefetch: 1, LoopScheduler: Default, PipelineVersion: v1 does not support this problem

But don't panic: some of the operators are supported on the `gfx1030` architecture, so you can run a
different example, such as::

    ./bin/example_gemm_dl_fp16 1 1 1

and it should produce output similar to::

    a_m_k: dim 2, lengths {3840, 4096}, strides {1, 4096}
    b_k_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
    c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
    arg.a_grid_desc_k0_m0_m1_k1_{2048, 3840, 2}
    arg.b_grid_desc_k0_n0_n1_k1_{2048, 4096, 2}
    arg.c_grid_desc_m_n_{ 3840, 4096}
    launch_and_time_kernel: grid_dim {960, 1, 1}, block_dim {256, 1, 1}
    Warm up 1 time
    Start running 10 times...
    Perf: 3.65695 ms, 35.234 TFlops, 26.3797 GB/s, DeviceGemmDl<256, 128, 128, 16, 2, 4, 4, 1>

.. note::

    A new CMake flag, ``DL_KERNELS``, was added in recent versions of CK. If you use one of the
    newer versions of the library and do not see the above results when running
    ``example_gemm_dl_fp16``, you may need to add ``-D DL_KERNELS=ON`` to your CMake command
    in order to build the operators supported on the `gfx1030` architecture.
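
    For example, the configure step from above would then become (the only change is the extra
    ``DL_KERNELS`` flag)::

        cmake  \
        -D CMAKE_PREFIX_PATH=/opt/rocm  \
        -D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc  \
        -D CMAKE_CXX_FLAGS="-O3"  \
        -D CMAKE_BUILD_TYPE=Release  \
        -D BUILD_DEV=OFF  \
        -D DL_KERNELS=ON  \
        -D GPU_TARGETS="gfx908;gfx90a;gfx1030" ..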

We can also run an individual test::

    ctest -R test_gemm_fp16

If everything goes well, you should see something like::

    Start 121: test_gemm_fp16
    1/1 Test #121: test_gemm_fp16 ...................   Passed   51.81 sec

    100% tests passed, 0 tests failed out of 1

-----------
Summary
-----------

In this tutorial we took a first look at the Composable Kernel library, built it on your system, and
ran some examples and tests. Stay tuned: in the next tutorial we will run kernels with different
configurations to find the best one for your hardware and task.

P.S.: Don't forget to switch off the cloud instance if you have launched one; you can surely find
better ways to spend your money!