RUN curl -fsSL https://repo.radeon.com/rocm/rocm.gpg.key | gpg --dearmor-o /etc/apt/trusted.gpg.d/rocm-keyring.gpg
RUN wget https://repo.radeon.com/amdgpu-install/6.0/ubuntu/focal/amdgpu-install_6.0.60000-1_all.deb --no-check-certificate
RUN apt-get update &&DEBIAN_FRONTEND=noninteractive apt-get install-y--allow-unauthenticated ./amdgpu-install_6.0.60000-1_all.deb
RUN wget -qO - http://repo.radeon.com/rocm/rocm.gpg.key | apt-key add - &&\
sh -c"echo deb [arch=amd64 signed-by=/etc/apt/trusted.gpg.d/rocm-keyring.gpg] $DEB_ROCM_REPO focal main > /etc/apt/sources.list.d/rocm.list"&&\
sh -c'echo deb [arch=amd64 signed-by=/etc/apt/trusted.gpg.d/rocm-keyring.gpg] https://repo.radeon.com/amdgpu/$ROCMVERSION/ubuntu focal main > /etc/apt/sources.list.d/amdgpu.list'
RUN if["$ROCMVERSION"!="6.0.1"];then\
sh -c"wget https://repo.radeon.com/amdgpu-install/6.0/ubuntu/focal/amdgpu-install_6.0.60000-1_all.deb --no-check-certificate"&&\
sh -c"echo deb [arch=amd64 signed-by=/etc/apt/trusted.gpg.d/rocm-keyring.gpg] $DEB_ROCM_REPO focal main > /etc/apt/sources.list.d/rocm.list"&&\
sh -c'echo deb [arch=amd64 signed-by=/etc/apt/trusted.gpg.d/rocm-keyring.gpg] https://repo.radeon.com/amdgpu/$ROCMVERSION/ubuntu focal main > /etc/apt/sources.list.d/amdgpu.list';\
sh -c'echo deb [arch=amd64 trusted=yes] http://compute-artifactory.amd.com/artifactory/list/rocm-release-archive-20.04-deb/ 6.0.1 rel-95 > /etc/apt/sources.list.d/rocm-build.list'&&\
amdgpu-repo --amdgpu-build=1704947;\
fi
RUN sh -c"echo deb http://mirrors.kernel.org/ubuntu focal main universe | tee -a /etc/apt/sources.list"
#. **Additional reading:** We also recommend reading a `blog post
#. **Additional reading:** The blog post `AMD Composable Kernel library: efficient fused kernels for AI apps with just a few lines of code <https://community.amd.com/t5/instinct-accelerators/amd-composable-kernel-library-efficient-fused-kernels-for-ai/ba-p/553224>`_ provides a deeper understanding of the CK library and showcases its performance capabilities.
This document contains details of supported primitives in Composable Kernel (CK). In contrast to the
API Reference Guide, the Supported Primitives Guide is an introduction to the math which underpins
the algorithms implemented in CK.
This document contains details of supported primitives in Composable Kernel (CK). In contrast to the API Reference Guide, the Supported Primitives Guide is an introduction to the math which underpins the algorithms implemented in CK.
------------
Softmax
------------
For vectors :math:`x^{(1)}, x^{(2)}, \ldots, x^{(T)}` of size :math:`B` we can decompose the
For vectors :math:`x^{(1)}, x^{(2)}, \ldots, x^{(T)}` of size :math:`B` you can decompose the
To make our lives easier and bring Composable Kernel dependencies together, we recommend using
docker images that can be found on `Docker Hub <https://hub.docker.com/r/rocm/composable_kernel>`_.
To make things simpler, and bring Composable Kernel and its dependencies together,
docker images can be found on `Docker Hub <https://hub.docker.com/r/rocm/composable_kernel/tags>`_. Docker images provide a complete image of the OS, the Composable Kernel library, and its dependencies in a single downloadable file.
-------------------------------------
So what is Composable Kernel?
-------------------------------------
Refer to `Docker Overview <https://docs.docker.com/get-started/overview/>`_ for more information on Docker images and containers.
Composable Kernel (CK) library aims to provide a programming model for writing performance critical
kernels for machine learning workloads across multiple architectures including GPUs, CPUs, etc,
through general purpose kernel languages, like HIP C++.
Which image is right for me?
============================
To get the CK library::
The image naming includes information related to the docker image.
For example ``ck_ub20.04_rocm6.0`` indicates the following:
Download a docker image suitable for your OS and ROCm release, run or start the docker container, and then resume the tutorial from this point. Use the ``docker pull`` command to download the file::
After downloading the docker image, you can start the container using one of a number of commands. Start with the ``docker run`` command as shown below::
docker run \
-it \
...
...
@@ -30,70 +52,50 @@ run a docker container::
--group-add sudo \
-w /root/workspace \
-v ${PATH_TO_LOCAL_WORKSPACE}:/root/workspace \
rocm/composable_kernel:ck_ub20.04_rocm5.6 \
rocm/composable_kernel:ck_ub20.04_rocm6.0 \
/bin/bash
and build the CK::
After starting the bash shell, the docker container current folder is `~/workspace`. The library path is ``~/workspace/composable_kernel``. Navigate to the library to begin the tutorial as explained in :ref:`hello-world`:
mkdir build && cd build
# Need to specify target ID, example below is for gfx908 and gfx90a
cmake \
-D CMAKE_PREFIX_PATH=/opt/rocm \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-D CMAKE_CXX_FLAGS="-O3" \
-D CMAKE_BUILD_TYPE=Release \
-D GPU_TARGETS="gfx908;gfx90a" \
..
.. note::
and::
If your current folder is different from `${HOME}`, adjust the line ``-v ${HOME}:/root/workspace`` in the ``docker run`` command to fit your folder structure.
make -j examples tests
Stop and restart the docker image
=================================
To run all the test cases including tests and examples run::
After finishing the tutorial, or just when you have completed your work session, you can close the docker container, or stop the docker container to restart it at another time. Closing the docker container means that it is still in the active state, and can be resumed from where you left it. Stopping the container closes it, and returns the image to its initial state.
make test
Use the ``Ctrl-D`` option to exit the container, while leaving it active, so you can return to the container in its current state to resume the tutorial, or pickup your project where you left off.
We can also run specific examples or tests like::
To restart the active container use the ``docker exec`` command to specify the container name and options as follows::
./bin/example_gemm_xdl_fp16
./bin/test_gemm_fp16
docker exec -it <container_name> bash
For more details visit `CK github repository <https://github.com/ROCmSoftwarePlatform/composable_kernel>`_,
The Composable Kernel (CK) library provides a programming model for writing performance critical kernels for machine learning workloads across multiple architectures including GPUs and CPUs, through general purpose kernel languages like HIP C++. This document contains instructions for installing, using, and contributing to the Composable Kernel project. To learn more see :ref:`what-is-ck`.
------------
Introduction
------------
The CK documentation is structured as follows:
This document contains instructions for installing, using, and contributing to Composable Kernel (CK).
.. card:: Conceptual
-----------
Methodology
-----------
* :ref:`what-is-ck`
Composable Kernel (CK) library aims to provide a programming model for writing performance critical
kernels for machine learning workloads across multiple architectures including GPUs, CPUs, etc,
through general purpose kernel languages, like HIP C++.
.. card:: Installation
CK utilizes two concepts to achieve performance portability and code maintainability:
* :ref:`docker-hub`
* A tile-based programming model
* Algorithm complexity reduction for complex ML operators, using innovative technique we call
"Tensor Coordinate Transformation".
.. card:: Tutorial
.. image:: data/ck_component.png
:alt: CK Components
* :ref:`hello-world`
--------------
Code Structure
--------------
.. card:: API reference
Current CK library are structured into 4 layers:
* :ref:`supported-primitives`
* :ref:`api-reference`
* :ref:`wrapper`
* "Templated Tile Operators" layer
* "Templated Kernel and Invoker" layer
* "Instantiated Kernel and Invoker" layer
* "Wrapper for tensor transform operations"
* "Client API" layer
.. card:: Contributing to CK
.. image:: data/ck_layer.png
:alt: CK Layers
Documentation Roadmap
^^^^^^^^^^^^^^^^^^^^^
The following is a list of CK documents in the suggested reading order:
* :ref:`contributing-to`
.. toctree::
:maxdepth: 5
:caption: Contents:
:numbered:
To contribute to the documentation refer to `Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/index.md>`_.
tutorial_hello_world
dockerhub
wrapper
Supported_Primitives_Guide
API_Reference_Guide
Contributors_Guide
You can find licensing information at the `Licensing <https://rocm.docs.amd.com/en/latest/about/license.md>`_ page.
During this tutorial we will have an introduction to the CK library, we will build it and run some
examples and tests, so to say we will run a "Hello world" example. In future tutorials we will go
in depth and breadth and get familiar with other tools and ways to integrate CK into your project.
This tutorial is for engineers dealing with artificial intelligence and machine learning who
would like to optimize pipelines and improve performance using the Composable
Kernel (CK) library. This tutorial provides an introduction to the CK library. You will build the library and run some examples using a "Hello World" example.
-------------------------------------
Description
-------------------------------------
===========
Modern AI technology solves more and more problems in all imaginable fields, but crafting fast and
efficient workflows is still challenging. CK is one of the tools to make AI heavy lifting as fast
and efficient as possible. CK is a collection of optimized AI operator kernels and tools to create
new ones. The library has components required for majority of modern neural networks architectures
including matrix multiplication, convolution, contraction, reduction, attention modules, variety of
activation functions, fused operators and many more.
Modern AI technology solves more and more problems in a variety of fields, but crafting fast and
efficient workflows is still challenging. CK can make the AI workflow fast
and efficient. CK is a collection of optimized AI operator kernels with tools to create
new kernels. The library has components required for modern neural network architectures
including matrix multiplication, convolution, contraction, reduction, attention modules, a variety of activation functions, and fused operators.
So how do we (almost) reach the speed of light? CK acceleration abilities are based on:
CK library acceleration features are based on:
* Layered structure.
* Tile-based computation model.
* Tensor coordinate transformation.
* Hardware acceleration use.
* Support of low precision data types including fp16, bf16, int8 and int4.
* Layered structure
* Tile-based computation model
* Tensor coordinate transformation
* Hardware acceleration use
* Support of low precision data types including fp16, bf16, int8 and int4
If you are excited and need more technical details and benchmarking results - read this awesome
If you need more technical details and benchmarking results read the following
`blog post <https://community.amd.com/t5/instinct-accelerators/amd-composable-kernel-library-efficient-fused-kernels-for-ai/ba-p/553224>`_.
For more details visit our `github repository <https://github.com/ROCmSoftwarePlatform/composable_kernel>`_.
To download the library visit the `composable_kernel repository <https://github.com/ROCmSoftwarePlatform/composable_kernel>`_.
-------------------------------------
Hardware targets
-------------------------------------
================
CK library fully supports `gfx908` and `gfx90a` GPU architectures and only some operators are
supported for `gfx1030`. Let's check the hardware you have at hand and decide on the target
GPU architecture.
CK library fully supports `gfx908` and `gfx90a` GPU architectures, while only some operators are
supported for `gfx1030` devices. Check your hardware to determine the target GPU architecture.
There are also `cloud options <https://aws.amazon.com/ec2/instance-types/g4/>`_ you can find if
you don't have an AMD GPU at hand.
-------------------------------------
Build the library
-------------------------------------
=================
First let's clone the library and rebase to the tested version::
This tutorial is based on the use of docker images as explained in :ref:`docker-hub`. Download a docker image suitable for your OS and ROCm release, run or start the docker container, and then resume the tutorial from this point.
`docker images <https://hub.docker.com/r/rocm/composable_kernel>`_ with all the necessary
dependencies. Pick the right image and create a container. In this tutorial we use
``rocm/composable_kernel:ck_ub20.04_rocm5.6`` image, it is based on Ubuntu 20.04 and
ROCm v5.6.
If your current folder is ``${HOME}``, start the docker container with::
docker run \
-it \
--privileged \
--group-add sudo \
-w /root/workspace \
-v ${HOME}:/root/workspace \
rocm/composable_kernel:ck_ub20.04_rocm5.6 \
/bin/bash
.. note::
If your current folder is different from ``${HOME}``, adjust the line ``-v ${HOME}:/root/workspace``
to fit your folder structure.
You can also `install ROCm <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/>`_ on your system, clone the `Composable Kernel repository <https://github.com/ROCmSoftwarePlatform/composable_kernel.git>`_ on GitHub, and use that to build and run the examples using the commands described below.
Inside the docker container current folder is ``~/workspace``, library path is
``~/workspace/composable_kernel``, navigate to the library::
Both the docker container and GitHub repository include the Composable Kernel library. Navigate to the library::
cd composable_kernel/
Create and go to the ``build`` directory::
Create and change to a ``build`` directory::
mkdir build && cd build
In the previous section we talked about target GPU architecture. Once you decide which one is right
for you, run CMake using the right ``GPU_TARGETS`` flag::
The previous section discussed supported GPU architecture. Once you decide which hardware targets are needed, run CMake using the ``GPU_TARGETS`` flag::
cmake \
-D CMAKE_PREFIX_PATH=/opt/rocm \
...
...
@@ -109,26 +78,25 @@ for you, run CMake using the right ``GPU_TARGETS`` flag::
-D BUILD_DEV=OFF \
-D GPU_TARGETS="gfx908;gfx90a;gfx1030" ..
If everything went well the CMake run will end up with::
If everything goes well the CMake command will return::
-- Configuring done
-- Generating done
-- Build files have been written to: "/root/workspace/composable_kernel/build"
Finally, we can build examples and tests::
Finally, you can build examples and tests::
make -j examples tests
If everything is smooth, you'll see::
When complete you should see::
Scanning dependencies of target tests
[100%] Built target tests
---------------------------
Run examples and tests
---------------------------
======================
Examples are listed as test cases as well, so we can run all examples and tests with::
Examples are listed as test cases as well, so you can run all examples and tests with::
ctest
...
...
@@ -136,38 +104,32 @@ You can check the list of all tests by running::
ctest -N
We can also run them separately, here is a separate example execution::
You can also run examples separately as shown in the following example execution::
./bin/example_gemm_xdl_fp16 1 1 1
The arguments ``1 1 1`` mean that we want to run this example in the mode: verify results with CPU,
initialize matrices with integers and benchmark the kernel execution. You can play around with
these parameters and see how output and execution results change.
The arguments ``1 1 1`` mean that you want to run this example in the mode: verify results with CPU, initialize matrices with integers, and benchmark the kernel execution. You can play around with these parameters and see how output and execution results change.
If everything goes well and you have a device based on `gfx908` or `gfx90a` architecture you should see
something like::
If you have a device based on `gfx908` or `gfx90a` architecture, and if the example runs as expected, you should see something like::
a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
b_k_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
Meanwhile, running it on a `gfx1030` device should result in::
However, running it on a `gfx1030` device should result in the following::
a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
DeviceGemmXdl<256, 256, 128, 4, 8, 32, 32, 4, 2> NumPrefetch: 1, LoopScheduler: Default, PipelineVersion: v1 does not support this problem
But don't panic, some of the operators are supported on `gfx1030` architecture, so you can run a
Don't worry, some operators are supported on `gfx1030` architecture, so you can run a
separate example like::
./bin/example_gemm_dl_fp16 1 1 1
and it should result in something nice similar to::
and it should return something like::
a_m_k: dim 2, lengths {3840, 4096}, strides {1, 4096}
b_k_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
...
...
@@ -182,12 +144,9 @@ and it should result in something nice similar to::
.. note::
There was a new CMake flag ``DL_KERNELS`` added in the latest versions of CK. If you use one of
the newest versions of the library and do not see the above results when running
``example_gemm_dl_fp16``, it might be necessary to add ``-D DL_KERNELS=ON`` to your CMake command
in order to build the operators supported on the `gfx1030` architecture.
A new CMake flag ``DL_KERNELS`` has been added to the latest versions of CK. If you do not see the above results when running ``example_gemm_dl_fp16``, you might need to add ``-D DL_KERNELS=ON`` to your CMake command to build the operators supported on the `gfx1030` architecture.
We can also run a separate test::
You can also run a separate test::
ctest -R test_gemm_fp16
...
...
@@ -198,13 +157,9 @@ If everything goes well you should see something like::
100% tests passed, 0 tests failed out of 1
-----------
Summary
-----------
=======
In this tutorial we took the first look at the Composable Kernel library, built it on your system
and ran some examples and tests. Stay tuned, in the next tutorial we will run kernels with different
configs to find out the best one for your hardware and task.
In this tutorial you took the first look at the Composable Kernel library, built it on your system and ran some examples and tests. In the next tutorial you will run kernels with different configurations to find out the best one for your hardware and task.
P.S.: Don't forget to switch off the cloud instance if you have launched one, you can find better
ways to spend your money for sure!
P.S.: If you are running on a cloud instance, don't forget to switch off the cloud instance.
The Composable Kernel (CK) library provides a programming model for writing performance critical kernels for machine learning workloads across multiple architectures including GPUs and CPUs, through general purpose kernel languages like HIP C++.
CK utilizes two concepts to achieve performance portability and code maintainability:
* A tile-based programming model
* Algorithm complexity reduction for complex ML operators using an innovative technique called
"Tensor Coordinate Transformation".
.. image:: data/ck_component.png
:alt: CK Components
Code Structure
==============
The CK library is structured into 4 layers:
* "Templated Tile Operators" layer
* "Templated Kernel and Invoker" layer
* "Instantiated Kernel and Invoker" layer
* "Client API" layer
It also includes a simple wrapper component used to perform tensor transform operations more easily and with fewer lines of code.