.. meta::
  :description: Composable Kernel documentation and API reference library
  :keywords: composable kernel, CK, ROCm, API, documentation

.. _hello-world:

********************************************************************
Hello World Tutorial
********************************************************************

This tutorial is intended for engineers working in artificial intelligence and machine learning who
would like to optimize pipelines and improve performance using the Composable
Kernel (CK) library. It introduces the CK library: you will build the library and run a "Hello World" example along with other examples and tests.

Description
===========

Modern AI technology solves an ever-growing range of problems in a variety of fields, but crafting fast and
efficient workflows remains challenging. CK helps make AI workflows fast
and efficient: it is a collection of optimized AI operator kernels along with tools for creating
new kernels. The library provides the components required for modern neural network architectures,
including matrix multiplication, convolution, contraction, reduction, attention modules, a variety of activation functions, and fused operators.

The CK library's acceleration features are based on:

* Layered structure
* Tile-based computation model
* Tensor coordinate transformation
* Use of hardware acceleration
* Support for low-precision data types, including fp16, bf16, int8, and int4

If you need more technical details and benchmarking results, read the following
`blog post <https://community.amd.com/t5/instinct-accelerators/amd-composable-kernel-library-efficient-fused-kernels-for-ai/ba-p/553224>`_.

To download the library, visit the `composable_kernel repository <https://github.com/ROCm/composable_kernel>`_.

Hardware targets
================

The CK library fully supports the `gfx908` and `gfx90a` GPU architectures, while only some operators are
supported on `gfx1030` devices. Check your hardware to determine the target GPU architecture.
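If you are not sure which architecture your GPU uses, you can query it with the tools that ship with ROCm. The snippet below is a sketch: on a real system you would pipe ``rocminfo`` directly (or run ``rocm_agent_enumerator``), while here the same filter runs on captured sample output so it works even without a GPU.

```shell
# On a system with ROCm installed, list the GPU architectures with:
#     rocminfo | grep -o 'gfx[0-9a-f]*' | sort -u
# or simply run rocm_agent_enumerator.
# Below, the same grep filter is demonstrated on captured sample output.
sample='  Name:                    gfx90a
  Name:                    amdgcn-amd-amdhsa--gfx90a'
printf '%s\n' "$sample" | grep -o 'gfx[0-9a-f]*' | sort -u
```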

==========     =========
GPU Target     AMD GPU
==========     =========
gfx908         AMD Instinct MI100
gfx90a         AMD Instinct MI210, MI250, MI250X
gfx1030        Radeon PRO V620, W6800, W6800X, W6800X Duo, W6900X, RX 6800, RX 6800 XT, RX 6900 XT, RX 6900 XTX, RX 6950 XT
==========     =========

If you don't have an AMD GPU at hand, you can also find
`cloud options <https://aws.amazon.com/ec2/instance-types/g4/>`_.

Build the library
=================

This tutorial is based on the use of Docker images as explained in :ref:`docker-hub`. Download a Docker image suitable for your OS and ROCm release, run or start the Docker container, and then resume the tutorial from this point.

.. note::

   You can also `install ROCm <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/>`_ on your system, clone the `Composable Kernel repository <https://github.com/ROCm/composable_kernel.git>`_ on GitHub, and use that to build and run the examples using the commands described below.

Both the Docker container and the GitHub repository include the Composable Kernel library. Navigate to the library directory::

    cd composable_kernel/

Create and change to a ``build`` directory::

    mkdir build && cd build

The previous section discussed the supported GPU architectures. Once you decide which hardware targets you need, run CMake with the ``GPU_TARGETS`` flag::

    cmake  \
    -D CMAKE_PREFIX_PATH=/opt/rocm  \
    -D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc  \
    -D CMAKE_CXX_FLAGS="-O3"  \
    -D CMAKE_BUILD_TYPE=Release  \
    -D BUILD_DEV=OFF  \
    -D GPU_TARGETS="gfx908;gfx90a;gfx1030" ..
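
If the configure step fails, one quick sanity check is to confirm that the HIP compiler passed via ``CMAKE_CXX_COMPILER`` actually exists on your system. This is a minimal sketch, assuming the standard ``/opt/rocm`` install location:

```shell
# Check that the ROCm HIP compiler referenced above is present and runnable.
HIPCC=/opt/rocm/bin/hipcc
if [ -x "$HIPCC" ]; then
    "$HIPCC" --version
else
    echo "hipcc not found at $HIPCC; check your ROCm installation"
fi
```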

If everything goes well, the CMake command will return::

    -- Configuring done
    -- Generating done
    -- Build files have been written to: "/root/workspace/composable_kernel/build"

Finally, you can build examples and tests::

    make -j examples tests

When complete, you should see::

    Scanning dependencies of target tests
    [100%] Built target tests

Run examples and tests
======================

Examples are also registered as test cases, so you can run all examples and tests with::

    ctest

You can check the list of all tests by running::

    ctest -N

You can also run examples individually, as shown in the following example::

    ./bin/example_gemm_xdl_fp16 1 1 1

The arguments ``1 1 1`` specify the run mode: verify results on the CPU, initialize matrices with integers, and benchmark the kernel execution. You can experiment with these parameters and see how the output and execution results change.
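For instance, to benchmark without the (slow) CPU verification, you could pass ``0`` as the first argument. The exact argument semantics can vary between CK versions, so treat this as a sketch; the guard below simply skips the run when the binary has not been built:

```shell
BIN=./bin/example_gemm_xdl_fp16
if [ -x "$BIN" ]; then
    # 0 1 1: skip CPU verification, keep integer initialization and timing
    "$BIN" 0 1 1
else
    echo "$BIN not found; build the examples first"
fi
```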

If you have a device based on the `gfx908` or `gfx90a` architecture and the example runs as expected, you should see something like::

    a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
    b_k_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
    c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
    Perf: 1.08153 ms, 119.136 TFlops, 89.1972 GB/s, DeviceGemm_Xdl_CShuffle<Default, 256, 256, 128, 32, 8, 2, 32, 32, 4, 2, 8, 4, 1, 2> LoopScheduler: Interwave, PipelineVersion: v1

However, running it on a `gfx1030` device should result in the following::

    a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
    b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
    c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
    DeviceGemmXdl<256, 256, 128, 4, 8, 32, 32, 4, 2> NumPrefetch: 1, LoopScheduler: Default, PipelineVersion: v1 does not support this problem

Don't worry: some operators are supported on the `gfx1030` architecture, so you can run a
separate example like::

    ./bin/example_gemm_dl_fp16 1 1 1

and it should return something like::

    a_m_k: dim 2, lengths {3840, 4096}, strides {1, 4096}
    b_k_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
    c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
    arg.a_grid_desc_k0_m0_m1_k1_{2048, 3840, 2}
    arg.b_grid_desc_k0_n0_n1_k1_{2048, 4096, 2}
    arg.c_grid_desc_m_n_{ 3840, 4096}
    launch_and_time_kernel: grid_dim {960, 1, 1}, block_dim {256, 1, 1}
    Warm up 1 time
    Start running 10 times...
    Perf: 3.65695 ms, 35.234 TFlops, 26.3797 GB/s, DeviceGemmDl<256, 128, 128, 16, 2, 4, 4, 1>

.. note::

    A new CMake flag ``DL_KERNELS`` has been added to the latest versions of CK. If you do not see the above results when running ``example_gemm_dl_fp16``, you might need to add ``-D DL_KERNELS=ON`` to your CMake command to build the operators supported on the `gfx1030` architecture.

You can also run an individual test::

    ctest -R test_gemm_fp16

If everything goes well, you should see something like::

    Start 121: test_gemm_fp16
    1/1 Test #121: test_gemm_fp16 ...................   Passed   51.81 sec

    100% tests passed, 0 tests failed out of 1

Summary
=======

In this tutorial, you took a first look at the Composable Kernel library, built it on your system, and ran some examples and tests. In the next tutorial, you will run kernels with different configurations to find the best one for your hardware and task.

P.S. If you are running on a cloud instance, don't forget to switch it off when you are finished.