Commit ef326c73 authored by Alan Turner

Merge remote-tracking branch 'origin/develop' into migraphx-update

parents b7775add e4dfe4d8
===================
CK Docker Hub
===================
-------------------------------------
Why do I need this?
-------------------------------------
To make our lives easier and bring Composable Kernel dependencies together, we recommend using
docker images that can be found on `Docker Hub <https://hub.docker.com/r/rocm/composable_kernel>`_.
-------------------------------------
So what is Composable Kernel?
-------------------------------------
The Composable Kernel (CK) library aims to provide a programming model for writing performance-critical
kernels for machine learning workloads across multiple architectures, including GPUs and CPUs,
through general-purpose kernel languages like HIP C++.
To get the CK library::
git clone https://github.com/ROCmSoftwarePlatform/composable_kernel.git
run a docker container::
docker run \
-it \
--privileged \
--group-add sudo \
-w /root/workspace \
-v ${PATH_TO_LOCAL_WORKSPACE}:/root/workspace \
rocm/composable_kernel:ck_ub20.04_rocm5.6 \
/bin/bash
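Here ``${PATH_TO_LOCAL_WORKSPACE}`` is assumed to be an environment variable pointing at the host
folder you want mounted inside the container, for example::
export PATH_TO_LOCAL_WORKSPACE=${HOME}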
Then build the CK library::
mkdir build && cd build
# You need to specify the target ID; the example below is for gfx908 and gfx90a
cmake \
-D CMAKE_PREFIX_PATH=/opt/rocm \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-D CMAKE_CXX_FLAGS="-O3" \
-D CMAKE_BUILD_TYPE=Release \
-D GPU_TARGETS="gfx908;gfx90a" \
..
and::
make -j examples tests
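If unbounded ``-j`` parallelism overloads your machine, you can cap the job count (a common
variant; ``nproc`` assumed available)::
make -j$(nproc) examples tests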
To run all the test cases, including both tests and examples, run::
make test
We can also run specific examples or tests like::
./bin/example_gemm_xdl_fp16
./bin/test_gemm_fp16
For more details visit `CK github repository <https://github.com/ROCmSoftwarePlatform/composable_kernel>`_,
`CK examples <https://github.com/ROCmSoftwarePlatform/composable_kernel/tree/develop/example>`_,
`even more CK examples <https://github.com/ROCmSoftwarePlatform/composable_kernel/tree/develop/client_example>`_.
-------------------------------------
And what is inside?
-------------------------------------
The docker images have everything you need for running CK, including:
* `ROCm <https://www.amd.com/en/graphics/servers-solutions-rocm>`_
* `CMake <https://cmake.org/>`_
* `Compiler <https://github.com/RadeonOpenCompute/llvm-project>`_
-------------------------------------
Which image is right for me?
-------------------------------------
Let's take a look at the image naming, for example ``ck_ub20.04_rocm5.6``. The image specs are:
* ``ck`` - made for running Composable Kernel;
* ``ub20.04`` - based on Ubuntu 20.04;
* ``rocm5.6`` - ROCm platform version 5.6.
So just pick the right image for your project dependencies and you're all set.
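For example, to download the image discussed above::
docker pull rocm/composable_kernel:ck_ub20.04_rocm5.6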
-------------------------------------
DIY starts here
-------------------------------------
If you need to customize a docker image or just can't stop tinkering, feel free to adjust the
`Dockerfile <https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/develop/Dockerfile>`_
for your needs.
-------------------------------------
License
-------------------------------------
CK is released under the MIT `license <https://github.com/ROCmSoftwarePlatform/composable_kernel/blob/develop/LICENSE>`_.
......@@ -58,7 +58,7 @@ PROJECT_LOGO =
# entered, it will be relative to the location where doxygen was started. If
# left blank the current directory will be used.
OUTPUT_DIRECTORY = docBin
OUTPUT_DIRECTORY = .
# If the CREATE_SUBDIRS tag is set to YES then doxygen will create 4096 sub-
# directories (in 2 levels) under the output directory of each output format and
......@@ -778,7 +778,9 @@ WARN_LOGFILE =
INPUT = ../../include/ck/tensor_operation/gpu/grid \
../../include/ck/tensor_operation/gpu/block \
../../include/ck/tensor_operation/gpu/thread \
../../library/include/ck/library/utility
../../library/include/ck/library/utility \
../../include/ck/wrapper
# This tag can be used to specify the character encoding of the source files
# that doxygen parses. Internally doxygen uses the UTF-8 encoding. Doxygen uses
......
============================
.. meta::
:description: Composable Kernel documentation and API reference library
:keywords: composable kernel, CK, ROCm, API, documentation
.. _composable-kernel:
********************************************************************
Composable Kernel User Guide
============================
********************************************************************
------------
Introduction
------------
The Composable Kernel (CK) library provides a programming model for writing performance critical kernels for machine learning workloads across multiple architectures including GPUs and CPUs, through general purpose kernel languages like HIP C++. This document contains instructions for installing, using, and contributing to the Composable Kernel project. To learn more see :ref:`what-is-ck`.
This document contains instructions for installing, using, and contributing to Composable Kernel (CK).
The CK documentation is structured as follows:
-----------
Methodology
-----------
.. grid:: 2
:gutter: 3
Composable Kernel (CK) library aims to provide a programming model for writing performance critical
kernels for machine learning workloads across multiple architectures including GPUs, CPUs, etc,
through general purpose kernel languages, like HIP C++.
.. grid-item-card:: Installation
CK utilizes two concepts to achieve performance portability and code maintainability:
* :ref:`docker-hub`
* A tile-based programming model
* Algorithm complexity reduction for complex ML operators, using an innovative technique we call
"Tensor Coordinate Transformation".
.. grid-item-card:: Conceptual
.. image:: data/ck_component.png
:alt: CK Components
* :ref:`what-is-ck`
--------------
Code Structure
--------------
.. grid-item-card:: API reference
The current CK library is structured into four layers:
* :ref:`supported-primitives`
* :ref:`api-reference`
* :ref:`wrapper`
* "Templated Tile Operators" layer
* "Templated Kernel and Invoker" layer
* "Instantiated Kernel and Invoker" layer
* "Client API" layer
.. grid-item-card:: Tutorial
.. image:: data/ck_layer.png
:alt: CK Layers
Documentation Roadmap
^^^^^^^^^^^^^^^^^^^^^
The following is a list of CK documents in the suggested reading order:
* :ref:`hello-world`
.. toctree::
:maxdepth: 5
:caption: Contents:
:numbered:
To contribute to the documentation refer to `Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.
tutorial_hello_world
dockerhub
Supported_Primitives_Guide
API_Reference_Guide
Contributors_Guide
You can find licensing information on the `Licensing <https://rocm.docs.amd.com/en/latest/about/license.html>`_ page.
.. meta::
:description: Composable Kernel documentation and API reference library
:keywords: composable kernel, CK, ROCm, API, documentation
.. _docker-hub:
********************************************************************
CK Docker Hub
********************************************************************
Why do I need this?
===================
To make things simpler and to bring Composable Kernel and its dependencies together,
docker images are available on `Docker Hub <https://hub.docker.com/r/rocm/composable_kernel/tags>`_. A docker image bundles the OS, the Composable Kernel library, and its dependencies in a single downloadable file.
Refer to `Docker Overview <https://docs.docker.com/get-started/overview/>`_ for more information on Docker images and containers.
Which image is right for me?
============================
The image name encodes information about the docker image.
For example ``ck_ub20.04_rocm6.0`` indicates the following:
* ``ck`` - made for running Composable Kernel;
* ``ub20.04`` - based on Ubuntu 20.04;
* ``rocm6.0`` - ROCm platform version 6.0.
Download a docker image suitable for your OS and ROCm release, run or start the docker container, and then resume the tutorial from this point. Use the ``docker pull`` command to download the file::
docker pull rocm/composable_kernel:ck_ub20.04_rocm6.0
What is inside the image?
-------------------------
The docker images have everything you need for running CK, including:
* `ROCm <https://rocm.docs.amd.com/en/latest/index.html>`_
* `CMake <https://cmake.org/getting-started/>`_
* `Compiler <https://github.com/ROCm/llvm-project>`_
* `Composable Kernel library <https://github.com/ROCm/composable_kernel>`_
Running the docker container
============================
After downloading the docker image, you can start the container using one of a number of commands. Start with the ``docker run`` command as shown below::
docker run \
-it \
--privileged \
--group-add sudo \
-w /root/workspace \
-v ${PATH_TO_LOCAL_WORKSPACE}:/root/workspace \
rocm/composable_kernel:ck_ub20.04_rocm6.0 \
/bin/bash
After starting the bash shell, the docker container's current folder is ``~/workspace``. The library path is ``~/workspace/composable_kernel``. Navigate to the library to begin the tutorial as explained in :ref:`hello-world`, as shown in the example after the note below.
.. note::
If your current folder is different from `${HOME}`, adjust the line ``-v ${HOME}:/root/workspace`` in the ``docker run`` command to fit your folder structure.
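For example, from the container's starting folder::
cd composable_kernel/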
Stop and restart the docker image
=================================
After finishing the tutorial, or when you have completed your work session, you can either exit the docker container, leaving it active so you can resume from where you left off, or stop the container to restart it at another time. Stopping the container closes it and returns the image to its initial state.
Use ``Ctrl-D`` to exit the container while leaving it active, so you can return to the container in its current state to resume the tutorial or pick up your project where you left off.
To restart the active container use the ``docker exec`` command to specify the container name and options as follows::
docker exec -it <container_name> bash
Where:
* `exec` is the docker command
* `-it` is the interactive option for `exec`
* `<container_name>` specifies an active container on the system
* `bash` specifies the command to run in the interactive shell
.. note::
You can use the ``docker container ls`` command to list the active containers on the system.
To start a container from the image, use the ``docker start`` command::
docker start <container_name>
Then use the ``docker exec`` command as shown above to start the bash shell.
Use the ``docker stop`` command to stop the container and restore the image to its initial state::
docker stop <container_name>
Editing the docker image
========================
If you want to customize the docker image, edit the
`Dockerfile <https://github.com/ROCm/composable_kernel/blob/develop/Dockerfile>`_
from the GitHub repository to suit your needs.
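After editing the Dockerfile you can build a local image from it; for example (the tag
``my_ck_image`` is just a placeholder)::
docker build -t my_ck_image .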
=======
.. meta::
:description: Composable Kernel documentation and API reference library
:keywords: composable kernel, CK, ROCm, API, documentation
.. _license:
********************************************************************
License
=======
********************************************************************
.. include:: ../LICENSE
:literal:
.. include:: ../LICENSE
\ No newline at end of file
.. meta::
:description: Composable Kernel documentation and API reference library
:keywords: composable kernel, CK, ROCm, API, documentation
*******************
API Reference Guide
*******************
.. _api-reference:
********************************************************************
API reference guide
********************************************************************
=================
Introduction
=================
This document contains details of the APIs for the Composable Kernel (CK) library and introduces
some of the key design principles that are used to write new classes that extend CK functionality.
=================
Using CK API
=================
This section describes how to use the CK library API.
=================
CK Datatypes
=================
......@@ -30,7 +26,7 @@ DeviceMem
Kernels For Flashattention
---------------------------
The Flashattention algorithm is defined in :cite:t:`dao2022flashattention`. This sections lists
The Flashattention algorithm is defined in :cite:t:`dao2022flashattention`. This section lists
the classes that are used in the CK GPU implementation of Flashattention.
**Gridwise classes**
......
==========================
.. meta::
:description: Composable Kernel documentation and API reference library
:keywords: composable kernel, CK, ROCm, API, documentation
.. _supported-primitives:
********************************************************************
Supported Primitives Guide
==========================
********************************************************************
This document contains details of supported primitives in Composable Kernel (CK). In contrast to the
API Reference Guide, the Supported Primitives Guide is an introduction to the math which underpins
the algorithms implemented in CK.
This document contains details of supported primitives in Composable Kernel (CK). In contrast to the API Reference Guide, the Supported Primitives Guide is an introduction to the math which underpins the algorithms implemented in CK.
------------
Softmax
------------
For vectors :math:`x^{(1)}, x^{(2)}, \ldots, x^{(T)}` of size :math:`B` we can decompose the
For vectors :math:`x^{(1)}, x^{(2)}, \ldots, x^{(T)}` of size :math:`B` you can decompose the
softmax of concatenated :math:`x = [ x^{(1)}\ | \ \ldots \ | \ x^{(T)} ]` as,
.. math::
......@@ -27,7 +31,7 @@ where :math:`f(x^{(j)}) = \exp( x^{(j)} - m(x^{(j)}) )` is of size :math:`B` and
:math:`z(x^{(j)}) = f(x_1^{(j)})+ \ldots+ f(x_B^{(j)})` is a scalar.
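For reference, here is a sketch of how the combined statistics are assembled (assumed from the
standard online-softmax formulation; :math:`m(x^{(j)})` denotes the maximum of :math:`x^{(j)}`):

.. math::

   m(x) = \max\left( m(x^{(1)}), \ldots, m(x^{(T)}) \right), \qquad
   z(x) = \sum_{j=1}^{T} e^{m(x^{(j)}) - m(x)} z(x^{(j)}),

.. math::

   \mathrm{softmax}(x) = \frac{1}{z(x)} \left[ e^{m(x^{(1)}) - m(x)} f(x^{(1)}) \ | \ \ldots \ | \ e^{m(x^{(T)}) - m(x)} f(x^{(T)}) \right].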
For a matrix :math:`X` composed of :math:`T_r \times T_c` tiles, :math:`X_{ij}`, of size
:math:`B_r \times B_c` we can compute the row-wise softmax as follows.
:math:`B_r \times B_c` you can compute the row-wise softmax as follows.
For :math:`j` from :math:`1` to :math:`T_c`, and :math:`i` from :math:`1` to :math:`T_r` calculate,
......
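A sketch of the per-tile update this loop performs (assumed from the standard online-softmax
recurrence; :math:`m_i` and :math:`z_i` denote the running row-wise maximum and normalizer):

.. math::

   m_i \leftarrow \max\left( m_i, m(X_{ij}) \right), \qquad
   z_i \leftarrow e^{m_i^{\mathrm{old}} - m_i} z_i + e^{m(X_{ij}) - m_i} z(X_{ij}),

where :math:`m(X_{ij})` and :math:`z(X_{ij})` are computed row-wise within the tile.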
.. meta::
:description: Composable Kernel documentation and API reference library
:keywords: composable kernel, CK, ROCm, API, documentation
.. _wrapper:
********************************************************************
Wrapper
********************************************************************
-------------------------------------
Description
-------------------------------------
The CK library provides a lightweight wrapper for more complex operations implemented in
the library.
Example:
.. code-block:: cpp
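// Includes assumed for this sketch: the wrapper utility headers documented
// below, plus the standard headers used by this snippet.
#include "ck/wrapper/utils/layout_utils.hpp"
#include "ck/wrapper/utils/tensor_utils.hpp"
#include <array>
#include <iostream>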
const auto shape_4x2x4 = ck::make_tuple(4, ck::make_tuple(2, 4));
const auto strides_s2x1x8 = ck::make_tuple(2, ck::make_tuple(1, 8));
const auto layout = ck::wrapper::make_layout(shape_4x2x4, strides_s2x1x8);
std::array<ck::index_t, 32> data;
auto tensor = ck::wrapper::make_tensor<ck::wrapper::MemoryTypeEnum::Generic>(&data[0], layout);
for(ck::index_t w = 0; w < size(tensor); w++) {
tensor(w) = w;
}
// slice() == slice(0, -1) (whole dimension)
auto tensor_slice = tensor(ck::wrapper::slice(1, 3), ck::make_tuple(ck::wrapper::slice(), ck::wrapper::slice()));
std::cout << "dims:2,(2,4) strides:2,(1,8)" << std::endl;
for(ck::index_t h = 0; h < ck::wrapper::size<0>(tensor_slice); h++)
{
for(ck::index_t w = 0; w < ck::wrapper::size<1>(tensor_slice); w++)
{
std::cout << tensor_slice(h, w) << " ";
}
std::cout << std::endl;
}
Output::
dims:2,(2,4) strides:2,(1,8)
1 5 9 13 17 21 25 29
2 6 10 14 18 22 26 30
Tutorials:
* `GEMM tutorial <https://github.com/ROCm/composable_kernel/blob/develop/client_example/25_wrapper/README.md>`_
Advanced examples:
* `Image to column <https://github.com/ROCm/composable_kernel/blob/develop/client_example/25_wrapper/wrapper_img2col.cpp>`_
* `Basic gemm <https://github.com/ROCm/composable_kernel/blob/develop/client_example/25_wrapper/wrapper_basic_gemm.cpp>`_
* `Optimized gemm <https://github.com/ROCm/composable_kernel/blob/develop/client_example/25_wrapper/wrapper_optimized_gemm.cpp>`_
-------------------------------------
Layout
-------------------------------------
.. doxygenstruct:: Layout
-------------------------------------
Layout helpers
-------------------------------------
.. doxygenfile:: include/ck/wrapper/utils/layout_utils.hpp
-------------------------------------
Tensor
-------------------------------------
.. doxygenstruct:: Tensor
-------------------------------------
Tensor helpers
-------------------------------------
.. doxygenfile:: include/ck/wrapper/utils/tensor_utils.hpp
.. doxygenfile:: include/ck/wrapper/utils/tensor_partition.hpp
-------------------------------------
Operations
-------------------------------------
.. doxygenfile:: include/ck/wrapper/operations/copy.hpp
.. doxygenfile:: include/ck/wrapper/operations/gemm.hpp
# Anywhere {branch} is used, the branch name will be substituted.
# These comments will also be removed.
defaults:
numbered: False
maxdepth: 6
root: index
subtrees:
- caption: About
entries:
- file: license
- caption: Conceptual
entries:
- file: conceptual/what-is-ck.rst
title: What is Composable Kernel?
- caption: Install
entries:
- file: install/dockerhub.rst
title: Docker Hub
- caption: CK API Reference
entries:
- file: reference/Supported_Primitives_Guide.rst
title: Supported Primitives
- file: reference/API_Reference_Guide.rst
title: API Reference
- file: reference/wrapper.rst
title: Wrapper
- caption: Tutorial
entries:
- file: tutorial/tutorial_hello_world.rst
title: Hello World Tutorial
- caption: About
entries:
- file: Contributors_Guide.rst
title: Contributing to CK
- file: license.rst
title: License
\ No newline at end of file
rocm-docs-core>=0.20.0
sphinxcontrib-bibtex==2.5.0
rocm-docs-core==1.8.5
sphinxcontrib-bibtex==2.6.3
#
# This file is autogenerated by pip-compile with Python 3.8
# This file is autogenerated by pip-compile with Python 3.10
# by the following command:
#
# pip-compile requirements.in
#
accessible-pygments==0.0.3
accessible-pygments==0.0.5
# via pydata-sphinx-theme
alabaster==0.7.13
alabaster==0.7.16
# via sphinx
babel==2.12.1
babel==2.15.0
# via
# pydata-sphinx-theme
# sphinx
beautifulsoup4==4.11.2
beautifulsoup4==4.12.3
# via pydata-sphinx-theme
breathe==4.34.0
breathe==4.35.0
# via rocm-docs-core
certifi==2022.12.7
certifi==2024.7.4
# via requests
cffi==1.15.1
cffi==1.16.0
# via
# cryptography
# pynacl
charset-normalizer==3.1.0
charset-normalizer==3.3.2
# via requests
click==8.1.3
click==8.1.7
# via sphinx-external-toc
cryptography==40.0.2
cryptography==43.0.0
# via pyjwt
deprecated==1.2.13
deprecated==1.2.14
# via pygithub
docutils==0.16
docutils==0.21.2
# via
# breathe
# myst-parser
......@@ -38,35 +38,35 @@ docutils==0.16
# pydata-sphinx-theme
# sphinx
# sphinxcontrib-bibtex
fastjsonschema==2.18.0
fastjsonschema==2.20.0
# via rocm-docs-core
gitdb==4.0.10
gitdb==4.0.11
# via gitpython
gitpython==3.1.31
gitpython==3.1.43
# via rocm-docs-core
idna==3.4
idna==3.7
# via requests
imagesize==1.4.1
# via sphinx
jinja2==3.1.2
jinja2==3.1.4
# via
# myst-parser
# sphinx
latexcodec==2.0.1
latexcodec==3.0.0
# via pybtex
markdown-it-py==2.2.0
markdown-it-py==3.0.0
# via
# mdit-py-plugins
# myst-parser
markupsafe==2.1.2
markupsafe==2.1.5
# via jinja2
mdit-py-plugins==0.3.5
mdit-py-plugins==0.4.1
# via myst-parser
mdurl==0.1.2
# via markdown-it-py
myst-parser==1.0.0
myst-parser==3.0.1
# via rocm-docs-core
packaging==23.0
packaging==24.1
# via
# pydata-sphinx-theme
# sphinx
......@@ -74,48 +74,46 @@ pybtex==0.24.0
# via
# pybtex-docutils
# sphinxcontrib-bibtex
pybtex-docutils==1.0.2
pybtex-docutils==1.0.3
# via sphinxcontrib-bibtex
pycparser==2.21
pycparser==2.22
# via cffi
pydata-sphinx-theme==0.13.3
pydata-sphinx-theme==0.15.4
# via
# rocm-docs-core
# sphinx-book-theme
pygithub==1.58.2
pygithub==2.3.0
# via rocm-docs-core
pygments==2.14.0
pygments==2.18.0
# via
# accessible-pygments
# pydata-sphinx-theme
# sphinx
pyjwt[crypto]==2.6.0
pyjwt[crypto]==2.8.0
# via pygithub
pynacl==1.5.0
# via pygithub
pyyaml==6.0
pyyaml==6.0.1
# via
# myst-parser
# pybtex
# rocm-docs-core
# sphinx-external-toc
requests==2.28.2
requests==2.32.3
# via
# pygithub
# sphinx
rocm-docs-core>=0.20.0
rocm-docs-core==1.8.5
# via -r requirements.in
six==1.16.0
# via
# latexcodec
# pybtex
smmap==5.0.0
# via pybtex
smmap==5.0.1
# via gitdb
snowballstemmer==2.2.0
# via sphinx
soupsieve==2.4
soupsieve==2.5
# via beautifulsoup4
sphinx==5.3.0
sphinx==7.4.7
# via
# breathe
# myst-parser
......@@ -127,33 +125,39 @@ sphinx==5.3.0
# sphinx-external-toc
# sphinx-notfound-page
# sphinxcontrib-bibtex
sphinx-book-theme==1.0.1
sphinx-book-theme==1.1.3
# via rocm-docs-core
sphinx-copybutton==0.5.1
sphinx-copybutton==0.5.2
# via rocm-docs-core
sphinx-design==0.3.0
sphinx-design==0.6.0
# via rocm-docs-core
sphinx-external-toc==0.3.1
sphinx-external-toc==1.0.1
# via rocm-docs-core
sphinx-notfound-page==0.8.3
sphinx-notfound-page==1.0.3
# via rocm-docs-core
sphinxcontrib-applehelp==1.0.4
sphinxcontrib-applehelp==2.0.0
# via sphinx
sphinxcontrib-bibtex==2.5.0
sphinxcontrib-bibtex==2.6.3
# via -r requirements.in
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-devhelp==2.0.0
# via sphinx
sphinxcontrib-htmlhelp==2.0.1
sphinxcontrib-htmlhelp==2.1.0
# via sphinx
sphinxcontrib-jsmath==1.0.1
# via sphinx
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-qthelp==2.0.0
# via sphinx
sphinxcontrib-serializinghtml==1.1.5
sphinxcontrib-serializinghtml==2.0.0
# via sphinx
typing-extensions==4.5.0
# via pydata-sphinx-theme
urllib3==1.26.15
# via requests
wrapt==1.15.0
tomli==2.0.1
# via sphinx
typing-extensions==4.12.2
# via
# pydata-sphinx-theme
# pygithub
urllib3==2.2.2
# via
# pygithub
# requests
wrapt==1.16.0
# via deprecated
.. meta::
:description: Composable Kernel documentation and API reference library
:keywords: composable kernel, CK, ROCm, API, documentation
.. _hello-world:
********************************************************************
Hello World Tutorial
********************************************************************
This tutorial is for engineers working in artificial intelligence and machine learning who
would like to optimize pipelines and improve performance using the Composable
Kernel (CK) library. It introduces the CK library: you will build the library and run a "Hello World" example.
Description
===========
Modern AI technology solves more and more problems in a variety of fields, but crafting fast and
efficient workflows is still challenging. CK can make the AI workflow fast
and efficient. CK is a collection of optimized AI operator kernels, along with tools to create
new kernels. The library has the components required for modern neural network architectures,
including matrix multiplication, convolution, contraction, reduction, attention modules, a variety of activation functions, and fused operators.
The CK library's acceleration features are based on:
* Layered structure
* Tile-based computation model
* Tensor coordinate transformation
* Hardware acceleration use
* Support of low precision data types including fp16, bf16, int8 and int4
If you need more technical details and benchmarking results, read the following
`blog post <https://community.amd.com/t5/instinct-accelerators/amd-composable-kernel-library-efficient-fused-kernels-for-ai/ba-p/553224>`_.
To download the library visit the `composable_kernel repository <https://github.com/ROCm/composable_kernel>`_.
Hardware targets
================
The CK library fully supports the `gfx908` and `gfx90a` GPU architectures, while only some operators are
supported for `gfx1030` devices. Check your hardware to determine the target GPU architecture.
========== =========
GPU Target AMD GPU
========== =========
gfx908     Radeon Instinct MI100
gfx90a     Radeon Instinct MI210, MI250, MI250X
gfx1030    Radeon PRO V620, W6800, W6800X, W6800X Duo, W6900X, RX 6800, RX 6800 XT, RX 6900 XT, RX 6900 XTX, RX 6950 XT
========== =========
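One way to check the architecture name (assuming a working ROCm install with the ``rocminfo``
utility) is::
rocminfo | grep gfx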
There are also `cloud options <https://aws.amazon.com/ec2/instance-types/g4/>`_ available if
you don't have an AMD GPU at hand.
Build the library
=================
This tutorial is based on the use of docker images as explained in :ref:`docker-hub`. Download a docker image suitable for your OS and ROCm release, run or start the docker container, and then resume the tutorial from this point.
.. note::
You can also `install ROCm <https://rocm.docs.amd.com/projects/install-on-linux/en/latest/>`_ on your system, clone the `Composable Kernel repository <https://github.com/ROCm/composable_kernel.git>`_ on GitHub, and use that to build and run the examples using the commands described below.
Both the docker container and GitHub repository include the Composable Kernel library. Navigate to the library::
cd composable_kernel/
Create and change to a ``build`` directory::
mkdir build && cd build
The previous section discussed supported GPU architectures. Once you decide which hardware targets you need, run CMake with the ``GPU_TARGETS`` flag::
cmake \
-D CMAKE_PREFIX_PATH=/opt/rocm \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-D CMAKE_CXX_FLAGS="-O3" \
-D CMAKE_BUILD_TYPE=Release \
-D BUILD_DEV=OFF \
-D GPU_TARGETS="gfx908;gfx90a;gfx1030" ..
If everything goes well, the CMake command will return::
-- Configuring done
-- Generating done
-- Build files have been written to: "/root/workspace/composable_kernel/build"
Finally, you can build examples and tests::
make -j examples tests
When complete, you should see::
Scanning dependencies of target tests
[100%] Built target tests
Run examples and tests
======================
Examples are listed as test cases as well, so you can run all examples and tests with::
ctest
You can check the list of all tests by running::
ctest -N
You can also run examples separately as shown in the following example execution::
./bin/example_gemm_xdl_fp16 1 1 1
The arguments ``1 1 1`` mean that you want to run this example in the mode: verify results with CPU, initialize matrices with integers, and benchmark the kernel execution. You can play around with these parameters and see how output and execution results change.
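For instance, to skip verification, initialize matrices with decimal values, and still time the
kernel, you could run (argument meanings as above)::
./bin/example_gemm_xdl_fp16 0 2 1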
If you have a device based on `gfx908` or `gfx90a` architecture, and if the example runs as expected, you should see something like::
a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
Perf: 1.08153 ms, 119.136 TFlops, 89.1972 GB/s, DeviceGemm_Xdl_CShuffle<Default, 256, 256, 128, 32, 8, 2, 32, 32, 4, 2, 8, 4, 1, 2> LoopScheduler: Interwave, PipelineVersion: v1
However, running it on a `gfx1030` device should result in the following::
a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
DeviceGemmXdl<256, 256, 128, 4, 8, 32, 32, 4, 2> NumPrefetch: 1, LoopScheduler: Default, PipelineVersion: v1 does not support this problem
Don't worry: some operators are supported on the `gfx1030` architecture, so you can run a
separate example like::
./bin/example_gemm_dl_fp16 1 1 1
and it should return something like::
a_m_k: dim 2, lengths {3840, 4096}, strides {1, 4096}
b_k_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
arg.a_grid_desc_k0_m0_m1_k1_{2048, 3840, 2}
arg.b_grid_desc_k0_n0_n1_k1_{2048, 4096, 2}
arg.c_grid_desc_m_n_{ 3840, 4096}
launch_and_time_kernel: grid_dim {960, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
Perf: 3.65695 ms, 35.234 TFlops, 26.3797 GB/s, DeviceGemmDl<256, 128, 128, 16, 2, 4, 4, 1>
.. note::
A new CMake flag ``DL_KERNELS`` has been added to the latest versions of CK. If you do not see the above results when running ``example_gemm_dl_fp16``, you might need to add ``-D DL_KERNELS=ON`` to your CMake command to build the operators supported on the `gfx1030` architecture.
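For example, this is the configuration from earlier in this tutorial with the extra flag added
(a sketch; adjust ``GPU_TARGETS`` to your device)::
cmake \
-D CMAKE_PREFIX_PATH=/opt/rocm \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-D CMAKE_CXX_FLAGS="-O3" \
-D CMAKE_BUILD_TYPE=Release \
-D BUILD_DEV=OFF \
-D DL_KERNELS=ON \
-D GPU_TARGETS="gfx1030" ..
Then rebuild with ``make -j examples tests``.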
You can also run a separate test::
ctest -R test_gemm_fp16
If everything goes well you should see something like::
Start 121: test_gemm_fp16
1/1 Test #121: test_gemm_fp16 ................... Passed 51.81 sec
100% tests passed, 0 tests failed out of 1
Summary
=======
In this tutorial you took a first look at the Composable Kernel library, built it on your system, and ran some examples and tests. In the next tutorial you will run kernels with different configurations to find the best one for your hardware and task.
P.S. If you are running on a cloud instance, don't forget to switch it off.
===============
CK Hello world
===============
-------------------------------------
Motivation
-------------------------------------
This tutorial is aimed at engineers dealing with artificial intelligence and machine learning who
would like to optimize their pipelines and squeeze out every drop of performance by adding the
Composable Kernel (CK) library to their projects. We would like to make the CK library approachable,
so the tutorial is not based on the latest release and doesn't have all the bleeding-edge features,
but it will be reproducible now and forever.
During this tutorial we will get an introduction to the CK library, build it, and run some
examples and tests: a "Hello world" example, so to speak. In future tutorials we will go
into depth and breadth and get familiar with other tools and ways to integrate CK into your project.
-------------------------------------
Description
-------------------------------------
Modern AI technology solves more and more problems in all imaginable fields, but crafting fast and
efficient workflows is still challenging. CK is one of the tools to make AI heavy lifting as fast
and efficient as possible. CK is a collection of optimized AI operator kernels and tools to create
new ones. The library has the components required for the majority of modern neural network
architectures, including matrix multiplication, convolution, contraction, reduction, attention
modules, a variety of activation functions, fused operators, and many more.
So how do we (almost) reach the speed of light? CK's acceleration abilities are based on:
* Layered structure.
* Tile-based computation model.
* Tensor coordinate transformation.
* Hardware acceleration use.
* Support of low precision data types including fp16, bf16, int8 and int4.
If you are excited and need more technical details and benchmarking results, read this awesome
`blog post <https://community.amd.com/t5/instinct-accelerators/amd-composable-kernel-library-efficient-fused-kernels-for-ai/ba-p/553224>`_.
For more details visit our `github repository <https://github.com/ROCmSoftwarePlatform/composable_kernel>`_.
-------------------------------------
Hardware targets
-------------------------------------
The CK library fully supports the `gfx908` and `gfx90a` GPU architectures, while only some operators
are supported for `gfx1030`. Let's check the hardware you have at hand and decide on the target
GPU architecture.
========== =========
GPU Target AMD GPU
========== =========
gfx908     Radeon Instinct MI100
gfx90a     Radeon Instinct MI210, MI250, MI250X
gfx1030    Radeon PRO V620, W6800, W6800X, W6800X Duo, W6900X, RX 6800, RX 6800 XT, RX 6900 XT, RX 6900 XTX, RX 6950 XT
========== =========
There are also `cloud options <https://aws.amazon.com/ec2/instance-types/g4/>`_ you can find if
you don't have an AMD GPU at hand.
-------------------------------------
Build the library
-------------------------------------
First, let's clone the library and check out the tested version::
git clone https://github.com/ROCmSoftwarePlatform/composable_kernel.git
cd composable_kernel/
git checkout tutorial_hello_world
To make our lives easier we prepared
`docker images <https://hub.docker.com/r/rocm/composable_kernel>`_ with all the necessary
dependencies. Pick the right image and create a container. In this tutorial we use the
``rocm/composable_kernel:ck_ub20.04_rocm5.6`` image, which is based on Ubuntu 20.04 and
ROCm v5.6.
If your current folder is ``${HOME}``, start the docker container with::
docker run \
-it \
--privileged \
--group-add sudo \
-w /root/workspace \
-v ${HOME}:/root/workspace \
rocm/composable_kernel:ck_ub20.04_rocm5.6 \
/bin/bash
If your current folder is different from ``${HOME}``, adjust the line ``-v ${HOME}:/root/workspace``
to fit your folder structure.
Inside the docker container the current folder is ``~/workspace`` and the library path is
``~/workspace/composable_kernel``. Navigate to the library::
cd composable_kernel/
Create and go to the ``build`` directory::
mkdir build && cd build
In the previous section we talked about target GPU architectures. Once you decide which one is right
for you, run CMake with the appropriate ``GPU_TARGETS`` flag::
cmake \
-D CMAKE_PREFIX_PATH=/opt/rocm \
-D CMAKE_CXX_COMPILER=/opt/rocm/bin/hipcc \
-D CMAKE_CXX_FLAGS="-O3" \
-D CMAKE_BUILD_TYPE=Release \
-D BUILD_DEV=OFF \
-D GPU_TARGETS="gfx908;gfx90a;gfx1030" ..
If everything went well, the CMake run will end with::
-- Configuring done
-- Generating done
-- Build files have been written to: "/root/workspace/composable_kernel/build"
Finally, we can build examples and tests::
make -j examples tests
If everything is smooth, you'll see::
Scanning dependencies of target tests
[100%] Built target tests
---------------------------
Run examples and tests
---------------------------
Examples are listed as test cases as well, so we can run all examples and tests with::
ctest
You can check the list of all tests by running::
ctest -N
We can also run them separately; here is an example execution::
./bin/example_gemm_xdl_fp16 1 1 1
The arguments ``1 1 1`` mean that we want to run this example in the mode: verify results with CPU,
initialize matrices with integers and benchmark the kernel execution. You can play around with
these parameters and see how output and execution results change.
If everything goes well and you have a device based on `gfx908` or `gfx90a` architecture, you should see
something like::
a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
launch_and_time_kernel: grid_dim {480, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
Perf: 1.10017 ms, 117.117 TFlops, 87.6854 GB/s, DeviceGemmXdl<256, 256, 128, 4, 8, 32, 32, 4, 2> NumPrefetch: 1, LoopScheduler: Default, PipelineVersion: v1
Meanwhile, running it on a `gfx1030` device should result in::
a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
DeviceGemmXdl<256, 256, 128, 4, 8, 32, 32, 4, 2> NumPrefetch: 1, LoopScheduler: Default, PipelineVersion: v1 does not support this problem
But don't panic: some of the operators are supported on the `gfx1030` architecture, so you can run a
separate example like::
./bin/example_gemm_dl_fp16 1 1 1
and it should result in something similar to::
a_m_k: dim 2, lengths {3840, 4096}, strides {1, 4096}
b_k_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
arg.a_grid_desc_k0_m0_m1_k1_{2048, 3840, 2}
arg.b_grid_desc_k0_n0_n1_k1_{2048, 4096, 2}
arg.c_grid_desc_m_n_{ 3840, 4096}
launch_and_time_kernel: grid_dim {960, 1, 1}, block_dim {256, 1, 1}
Warm up 1 time
Start running 10 times...
Perf: 3.65695 ms, 35.234 TFlops, 26.3797 GB/s, DeviceGemmDl<256, 128, 128, 16, 2, 4, 4, 1>
.. note::
A new CMake flag, ``DL_KERNELS``, was added in recent versions of CK. If you use one of
the newest versions of the library and do not see the above results when running
``example_gemm_dl_fp16``, you might need to add ``-D DL_KERNELS=ON`` to your CMake command
to build the operators supported on the `gfx1030` architecture.
We can also run a separate test::
ctest -R test_gemm_fp16
If everything goes well you should see something like::
Start 121: test_gemm_fp16
1/1 Test #121: test_gemm_fp16 ................... Passed 51.81 sec
100% tests passed, 0 tests failed out of 1
-----------
Summary
-----------
In this tutorial we took a first look at the Composable Kernel library, built it on your system,
and ran some examples and tests. Stay tuned: in the next tutorial we will run kernels with different
configs to find out the best one for your hardware and task.
P.S.: Don't forget to switch off the cloud instance if you have launched one; you can surely find
better ways to spend your money!
if(DL_KERNELS)
add_custom_target(example_gemm_dl)
add_example_executable(example_gemm_dl_fp32 gemm_dl_fp32.cpp)
add_dependencies(example_gemm_dl example_gemm_dl_fp32)
if(DTYPES MATCHES "fp16" OR NOT DEFINED DTYPES)
add_example_executable(example_gemm_dl_fp16 gemm_dl_fp16.cpp)
add_dependencies(example_gemm_dl example_gemm_dl_fp16)
add_example_executable(example_gemm_dl_dpp8_fp16 gemm_dl_dpp8_fp16.cpp)
add_dependencies(example_gemm_dl example_gemm_dl_dpp8_fp16)
endif()
if(DTYPES MATCHES "int8" OR NOT DEFINED DTYPES)
add_example_executable(example_gemm_dl_int8 gemm_dl_int8.cpp)
add_dependencies(example_gemm_dl example_gemm_dl_int8)
endif()
if(USE_BITINT_EXTENSION_INT4)
add_custom_target(example_gemm_dl)
add_example_executable(example_gemm_dl_fp32 gemm_dl_fp32.cpp)
add_example_dependencies(example_gemm_dl example_gemm_dl_fp32)
add_example_executable(example_gemm_dl_fp16 gemm_dl_fp16.cpp)
add_example_dependencies(example_gemm_dl example_gemm_dl_fp16)
add_example_executable(example_gemm_dpp_fp16 gemm_dpp_fp16.cpp)
add_example_executable(example_gemm_dl_int8 gemm_dl_int8.cpp)
add_example_dependencies(example_gemm_dl example_gemm_dl_int8)
if(USE_BITINT_EXTENSION_INT4)
add_example_executable(example_gemm_dl_int4 gemm_dl_int4.cpp)
add_dependencies(example_gemm_dl example_gemm_dl_int4)
endif(USE_BITINT_EXTENSION_INT4)
endif()
add_example_dependencies(example_gemm_dl example_gemm_dl_int4)
endif(USE_BITINT_EXTENSION_INT4)
add_custom_target(example_gemm_xdl)
if(DTYPES MATCHES "fp16" OR NOT DEFINED DTYPES)
add_example_executable(example_gemm_xdl_fp16 gemm_xdl_fp16.cpp)
add_example_executable(example_gemm_xdl_wavelet_fp16 gemm_xdl_wavelet_fp16.cpp)
add_dependencies(example_gemm_xdl example_gemm_xdl_fp16)
add_dependencies(example_gemm_xdl example_gemm_xdl_wavelet_fp16)
add_example_executable(example_gemm_xdl_skip_b_lds_fp16 gemm_xdl_skip_b_lds_fp16.cpp)
add_dependencies(example_gemm_xdl example_gemm_xdl_skip_b_lds_fp16)
if(GPU_TARGETS MATCHES "gfx1100" OR GPU_TARGETS MATCHES "gfx1101" OR GPU_TARGETS MATCHES "gfx1102")
add_custom_target(example_gemm_wmma)
add_example_executable(example_gemm_wmma_fp16 gemm_wmma_fp16.cpp)
add_dependencies(example_gemm_wmma example_gemm_wmma_fp16)
endif()
endif()
if(DTYPES MATCHES "bf16" OR NOT DEFINED DTYPES)
add_example_executable(example_gemm_xdl_bf16 gemm_xdl_bf16.cpp)
add_dependencies(example_gemm_xdl example_gemm_xdl_bf16)
endif()
if(DTYPES MATCHES "int8" OR NOT DEFINED DTYPES)
add_example_executable(example_gemm_xdl_int8 gemm_xdl_int8.cpp)
add_dependencies(example_gemm_xdl example_gemm_xdl_int8)
endif()
add_example_executable(example_gemm_xdl_fp16 gemm_xdl_fp16.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_fp16)
add_example_executable(example_gemm_xdl_fp16_v2 gemm_xdl_fp16_v2.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_fp16_v2)
add_example_executable(example_gemm_xdl_fp16_streamk_v3 gemm_xdl_fp16_streamk_v3.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_fp16_streamk_v3)
add_example_executable(example_gemm_xdl_fp16_v3 gemm_xdl_fp16_v3.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_fp16_v3)
add_example_executable(example_gemm_xdl_fp8_v3 gemm_xdl_fp8_v3.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_fp8_v3)
add_example_executable(example_gemm_xdl_fp16_fp8_v3 gemm_xdl_fp16_fp8_v3.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_fp16_fp8_v3)
add_example_executable(example_gemm_xdl_bf16_v3 gemm_xdl_bf16_v3.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_bf16_v3)
add_example_executable(example_gemm_xdl_wavelet_fp16 gemm_xdl_wavelet_fp16.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_wavelet_fp16)
add_example_executable(example_gemm_xdl_skip_b_lds_fp16 gemm_xdl_skip_b_lds_fp16.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_skip_b_lds_fp16)
add_example_executable(example_gemm_xdl_bf16 gemm_xdl_bf16.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_bf16)
add_example_executable(example_gemm_xdl_bf16_rtn gemm_xdl_bf16_rtn.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_bf16_rtn)
add_example_executable(example_gemm_xdl_int8 gemm_xdl_int8.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_int8)
if(USE_BITINT_EXTENSION_INT4)
add_example_executable(example_gemm_xdl_int4 gemm_xdl_int4.cpp)
add_dependencies(example_gemm_xdl example_gemm_xdl_int4)
add_example_executable(example_gemm_xdl_int4 gemm_xdl_int4.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_int4)
endif(USE_BITINT_EXTENSION_INT4)
if(DTYPES MATCHES "fp64" OR NOT DEFINED DTYPES)
# FIXME: re-enable this example as test when SWDEV-335738 is fixed
add_example_executable_no_testing(example_gemm_xdl_fp64 gemm_xdl_fp64.cpp)
add_dependencies(example_gemm_xdl example_gemm_xdl_fp64)
endif()
add_example_executable(example_gemm_xdl_fp64 gemm_xdl_fp64.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_fp64)
add_example_executable(example_gemm_xdl_streamk gemm_xdl_streamk.cpp)
if(DTYPES MATCHES "fp8" OR NOT DEFINED DTYPES)
if(GPU_TARGETS MATCHES "gfx940" OR GPU_TARGETS MATCHES "gfx941" OR GPU_TARGETS MATCHES "gfx942")
add_example_executable(example_gemm_xdl_f8 gemm_xdl_f8.cpp)
add_dependencies(example_gemm_xdl example_gemm_xdl_f8)
endif()
endif()
list(APPEND gpu_list gfx90a gfx940 gfx941 gfx942)
set(target 0)
foreach(gpu IN LISTS GPU_TARGETS)
if(gpu IN_LIST gpu_list AND target EQUAL 0)
add_example_executable(example_gemm_xdl_lds_direct_load_fp32 gemm_xdl_lds_direct_load_fp32.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_lds_direct_load_fp32)
add_example_executable(example_gemm_xdl_lds_direct_load_fp16 gemm_xdl_lds_direct_load_fp16.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_lds_direct_load_fp16)
set(target 1)
endif()
endforeach()
add_example_executable(example_gemm_xdl_fp8 gemm_xdl_fp8.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_fp8)
add_example_executable(example_gemm_xdl_fp8_bf8 gemm_xdl_fp8_bf8.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_fp8_bf8)
add_example_executable(example_gemm_xdl_fp16_fp8 gemm_xdl_fp16_fp8.cpp)
add_example_dependencies(example_gemm_xdl example_gemm_xdl_fp16_fp8)
add_example_executable(example_gemm_xdl_fp16_f8 gemm_xdl_fp16_f8.cpp)
add_dependencies(example_gemm_xdl example_gemm_xdl_fp16_f8)
add_custom_target(example_gemm_wmma)
add_example_executable(example_gemm_wmma_fp16 gemm_wmma_fp16.cpp)
add_example_dependencies(example_gemm_wmma example_gemm_wmma_fp16)
add_example_executable(example_gemm_wmma_bf16 gemm_wmma_bf16.cpp)
add_example_dependencies(example_gemm_wmma example_gemm_wmma_bf16)
add_example_executable(example_gemm_wmma_int8 gemm_wmma_int8.cpp)
add_example_dependencies(example_gemm_wmma example_gemm_wmma_int8)
......@@ -8,16 +8,20 @@
./bin/example_gemm_xdl 0 1 5
```
Result (MI100 @ 1087 MHz, 133.5 TFlops peak FP16)
```
# Instructions for ```example_gemm_xdl_fp16_streamk_v3```
## Run ```example_gemm_xdl_fp16_streamk_v3```
```bash
arg1: verification (0=no, 1=yes)
arg2: initialization (0=no init, 1=integer value, 2=decimal value)
arg3: time kernel (0=no, 1=yes)
arg4 to 9: M (256x), N(128x), K(32x), StrideA, StrideB, StrideC
arg10: stream-k select (-1: default config, 0: all DP, 1: 1-tile SK, 2: 2-tile SK)
arg11: Grid_size(-1 for max occupancy)
bin/example_gemm_xdl_fp16_streamk_v3 1 2 1 3840 4096 4096 4096 4096 4096 1 -1
a_m_k: dim 2, lengths {3840, 4096}, strides {4096, 1}
b_k_n: dim 2, lengths {4096, 4096}, strides {1, 4096}
b_k_n: dim 2, lengths {4096, 4096}, strides {4096, 1}
c_m_n: dim 2, lengths {3840, 4096}, strides {4096, 1}
arg.a_grid_desc_k0_m_k1_{512, 3840, 8}
arg.b_grid_desc_k0_n_k1_{512, 4096, 8}
arg.c_grid_desc_m_n_{ 3840, 4096}
launch_and_time_kernel: grid_dim {480, 1, 1}, block_dim {256, 1, 1}
Warm up
Start running 5 times...
Perf: 1.19685 ms, 107.657 TFlops, 78.8501 GB/s
problem {M:3840, N:4096, K:4096, SA:4096, SB:4096, SC:4096, MP:4032, NP:4096, KRead:4096, KP:4096, AK0:512, BK0:2048, MBlock: 18, NBlock: 16, Stream-K Selection:1, Grid size:-1}
Perf: 0.292022 ms, 441.23 TFlops, 330.348 GB/s, DeviceGemmXdlUniversal<MNPadding, RRR> BlkSize: 256, BlkTile: 224x256x64, WaveTile: 16x16, WaveMap: 7x8, VmemReadVec: 8x8, BlkGemmPipelineScheduler: Intrawave, BlkGemmPipelineVersion: v3, BlkGemmPipelinePrefetchStages: 2
```
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2023, Advanced Micro Devices, Inc. All rights reserved.
// Copyright (c) 2018-2024, Advanced Micro Devices, Inc. All rights reserved.
#pragma once
......@@ -21,6 +21,7 @@
#include "ck/library/utility/host_tensor_generator.hpp"
#include "ck/library/utility/literals.hpp"
#include "ck/library/reference_tensor_operation/cpu/reference_gemm.hpp"
#include "ck/library/reference_tensor_operation/gpu/reference_gemm.hpp"
struct ProblemSize final
{
......@@ -28,9 +29,9 @@ struct ProblemSize final
ck::index_t N = 4096;
ck::index_t K = 4096;
ck::index_t StrideA = 4096;
ck::index_t StrideB = 4096;
ck::index_t StrideC = 4096;
ck::index_t StrideA = -1;
ck::index_t StrideB = -1;
ck::index_t StrideC = -1;
};
struct ProblemSizeStreamK final
......@@ -39,18 +40,45 @@ struct ProblemSizeStreamK final
ck::index_t N = 4096;
ck::index_t K = 4096;
ck::index_t StrideA = 4096;
ck::index_t StrideB = 4096;
ck::index_t StrideC = 4096;
ck::index_t StrideA = -1;
ck::index_t StrideB = -1;
ck::index_t StrideC = -1;
ck::index_t NumSKBlocks = -1;
};
struct ProblemSizeStreamK_universal final
{
ck::index_t M = 3840;
ck::index_t N = 4096;
ck::index_t K = 4096;
ck::index_t StrideA = -1;
ck::index_t StrideB = -1;
ck::index_t StrideC = -1;
ck::index_t Grid_size = -1; // defaults to max occupancy
ck::index_t Streamk_sel = 1; // defaults to 1-tile SK
};
struct ProblemSizeSplitK final
{
ck::index_t M = 3840;
ck::index_t N = 4096;
ck::index_t K = 4096;
ck::index_t StrideA = -1;
ck::index_t StrideB = -1;
ck::index_t StrideC = -1;
ck::index_t KBatch = 1;
};
struct ExecutionConfig final
{
bool do_verification = true;
int init_method = 1;
bool time_kernel = false;
// 0 - no verification, 1 - CPU, 2 - GPU, 3 - CPU + GPU
int do_verification = 3;
int init_method = 2;
bool time_kernel = false;
};
template <ck::index_t... Is>
......@@ -99,7 +127,7 @@ bool parse_cmd_args<ProblemSize>(int argc,
}
else
{
std::cerr << "arg1: verification (0=no, 1=yes)" << std::endl
std::cerr << "arg1: verification (0=no, 1=CPU, 2=GPU, 3=CPU and GPU)" << std::endl
<< "arg2: initialization (0=no init, 1=integer value, 2=decimal value)"
<< std::endl
<< "arg3: time kernel (0=no, 1=yes)" << std::endl
......@@ -110,6 +138,57 @@ bool parse_cmd_args<ProblemSize>(int argc,
return true;
}
template <>
bool parse_cmd_args<ProblemSizeStreamK_universal>(int argc,
char* argv[],
ProblemSizeStreamK_universal& problem_size,
ExecutionConfig& config)
{
if(argc == 1)
{
// use default case
}
else if(argc == 4)
{
config.do_verification = std::stoi(argv[1]);
config.init_method = std::stoi(argv[2]);
config.time_kernel = std::stoi(argv[3]);
}
else if(argc >= 10)
{
config.do_verification = std::stoi(argv[1]);
config.init_method = std::stoi(argv[2]);
config.time_kernel = std::stoi(argv[3]);
problem_size.M = std::stoi(argv[4]);
problem_size.N = std::stoi(argv[5]);
problem_size.K = std::stoi(argv[6]);
problem_size.StrideA = std::stoi(argv[7]);
problem_size.StrideB = std::stoi(argv[8]);
problem_size.StrideC = std::stoi(argv[9]);
if(argc >= 12) // both arg10 (stream-k select) and arg11 (grid size) must be present
{
problem_size.Streamk_sel = std::stoi(argv[10]);
problem_size.Grid_size = std::stoi(argv[11]);
}
}
else
{
std::cerr
<< "arg1: verification (0=no, 1=CPU, 2=GPU, 3=CPU and GPU)" << std::endl
<< "arg2: initialization (0=no init, 1=integer value, 2=decimal value)" << std::endl
<< "arg3: time kernel (0=no, 1=yes)" << std::endl
<< "arg4 to 9: M (256x), N(128x), K(32x), StrideA, StrideB, StrideC" << std::endl
<< "arg10: stream-k select (-1: default config, 0: all DP, 1: 1-tile SK, 2: 2-tile SK)"
<< "\narg11: Grid_size(-1 for max occupancy)" << std::endl;
return false;
}
return true;
}
template <>
bool parse_cmd_args<ProblemSizeStreamK>(int argc,
char* argv[],
......@@ -147,12 +226,62 @@ bool parse_cmd_args<ProblemSizeStreamK>(int argc,
}
else
{
std::cerr << "arg1: verification (0=no, 1=yes)" << std::endl
std::cerr << "arg1: verification (0=no, 1=CPU, 2=GPU, 3=CPU and GPU)" << std::endl
<< "arg2: initialization (0=no init, 1=integer value, 2=decimal value)"
<< std::endl
<< "arg3: time kernel (0=no, 1=yes)" << std::endl
<< "arg4 to 9: M (256x), N(128x), K(32x), StrideA, StrideB, StrideC" << std::endl
<< "arg10: stream-k select (0: all DP, 1: 1-tile SK, 2: 2-tile SK)"
<< "\narg11: Grid_size(-1 for max occupancy)" << std::endl;
return false;
}
return true;
}
template <>
bool parse_cmd_args<ProblemSizeSplitK>(int argc,
char* argv[],
ProblemSizeSplitK& problem_size,
ExecutionConfig& config)
{
if(argc == 1)
{
// use default case
}
else if(argc == 4)
{
config.do_verification = std::stoi(argv[1]);
config.init_method = std::stoi(argv[2]);
config.time_kernel = std::stoi(argv[3]);
}
else if(argc >= 10)
{
config.do_verification = std::stoi(argv[1]);
config.init_method = std::stoi(argv[2]);
config.time_kernel = std::stoi(argv[3]);
problem_size.M = std::stoi(argv[4]);
problem_size.N = std::stoi(argv[5]);
problem_size.K = std::stoi(argv[6]);
problem_size.StrideA = std::stoi(argv[7]);
problem_size.StrideB = std::stoi(argv[8]);
problem_size.StrideC = std::stoi(argv[9]);
if(argc >= 11)
{
problem_size.KBatch = std::stoi(argv[10]);
}
}
else
{
std::cerr << "arg1: verification (0=no, 1=CPU, 2=GPU, 3=CPU and GPU)" << std::endl
<< "arg2: initialization (0=no init, 1=integer value, 2=decimal value)"
<< std::endl
<< "arg3: time kernel (0=no, 1=yes)" << std::endl
<< "arg4 to 9: M (256x), N(128x), K(32x), StrideA, StrideB, StrideC" << std::endl
<< "arg10: NumSKBlocks(optional)" << std::endl;
<< "arg10: KBatch" << std::endl;
return false;
}
......
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2023, Advanced Micro Devices, Inc. All rights reserved.
#include "common.hpp"
#include "ck/tensor_operation/gpu/device/impl/device_gemm_dl_dpp8.hpp"
using ADataType = ck::half_t;
using BDataType = ck::half_t;
using CDataType = ck::half_t;
using AccDataType = float;
using ALayout = Col;
using BLayout = Row;
using CLayout = Row;
using AElementOp = PassThrough;
using BElementOp = PassThrough;
using CElementOp = PassThrough;
static constexpr auto GemmDefault = ck::tensor_operation::device::GemmSpecialization::Default;
// clang-format off
using DeviceGemmInstance = ck::tensor_operation::device::DeviceGemmDlDpp8
// ######| AData| BData| CData| AccData| ALayout| BLayout| CLayout| A| B| C| GEMM| Block| MPer| NPer| K0Per| K1| M1Per| N1Per| KPer| M11N11Thread| M11N11Thread| ABlockTransfer| ABlockTransfer| ABlockTransfer| ABlockTransfer| ABlockTransfer| ABlockTransfer| ABlockTransfer| BBlockTransfer| BBlockTransfer| BBlockTransfer| BBlockTransfer| BBlockTransfer| BBlockTransfer| BBlockTransfer| CThreadTransfer| CThreadTransfer| CThreadTransfer|
// ######| Type| Type| Type| Type| | | | Elementwise| Elementwise| Elementwise| Spacialization| Size| Block| Block| Block| | ThreadM111| ThreadN111| Thread| ClusterM110Xs| ClusterN110Xs| ThreadSliceLengths| ThreadClusterLengths| ThreadCluster| SrcAccess| SrcVectorTensor| SrcVectorTensor| DstVectorTensor| ThreadSliceLengths| ThreadClusterLengths| ThreadCluster| SrcAccess| SrcVectorTensor| SrcVectorTensor| DstVectorTensor| SrcDstAccess| SrcDstVectorDim| DstScalarPerVector|
// ######| | | | | | | | Operation| Operation| Operation| | | | | | | | | | | | K0_M0_M1_K1| K0_M0_M1_K1| ArrangeOrder| Order| Lengths_K0_M0_M1_K1| ContiguousDimOrder| Lengths_K0_M0_M1_K1| K0_N0_N1_K1| K0_N0_N1_K1| ArrangeOrder| Order| Lengths_K0_N0_N1_K1| ContiguousDimOrder| Lengths_K0_N0_N1_K1| Order| | |
// ######| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
< ADataType, BDataType, CDataType, AccDataType, ALayout, BLayout, CLayout, AElementOp, BElementOp, CElementOp, GemmDefault, 256, 128, 128, 16, 2, 1, 8, 8, S<8, 8>, S<4, 1>, S<2, 1, 4, 2>, S<8, 1, 32, 1>, S<0, 3, 1, 2>, S<0, 3, 1, 2>, S<1, 1, 4, 1>, S<0, 3, 1, 2>, S<1, 1, 4, 2>, S<2, 1, 4, 2>, S<8, 1, 32, 1>, S<0, 3, 1, 2>, S<0, 3, 1, 2>, S<1, 1, 4, 1>, S<0, 3, 1, 2>, S<1, 1, 4, 2>, S<0, 1, 2, 3, 4, 5>, 5, 4>;
// clang-format on
using ReferenceGemmInstance = ck::tensor_operation::host::
ReferenceGemm<ADataType, BDataType, CDataType, AccDataType, AElementOp, BElementOp, CElementOp>;
#include "run_gemm_example.inc"
int main(int argc, char* argv[]) { return !run_gemm_example(argc, argv); }
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2023, Advanced Micro Devices, Inc. All rights reserved.
// Copyright (c) 2018-2024, Advanced Micro Devices, Inc. All rights reserved.
#include "common.hpp"
......@@ -32,6 +32,17 @@ using DeviceGemmInstance = ck::tensor_operation::device::DeviceGemmDl
using ReferenceGemmInstance = ck::tensor_operation::host::
ReferenceGemm<ADataType, BDataType, CDataType, AccDataType, AElementOp, BElementOp, CElementOp>;
using ReferenceGemmInstanceGPU = ck::tensor_operation::device::ReferenceGemm<ALayout,
BLayout,
CLayout,
ADataType,
BDataType,
CDataType,
AccDataType,
AElementOp,
BElementOp,
CElementOp>;
#include "run_gemm_example.inc"
int main(int argc, char* argv[]) { return !run_gemm_example(argc, argv); }
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2023, Advanced Micro Devices, Inc. All rights reserved.
// Copyright (c) 2018-2024, Advanced Micro Devices, Inc. All rights reserved.
#include "common.hpp"
......@@ -32,6 +32,17 @@ using DeviceGemmInstance = ck::tensor_operation::device::DeviceGemmDl
using ReferenceGemmInstance = ck::tensor_operation::host::
ReferenceGemm<ADataType, BDataType, CDataType, AccDataType, AElementOp, BElementOp, CElementOp>;
using ReferenceGemmInstanceGPU = ck::tensor_operation::device::ReferenceGemm<ALayout,
BLayout,
CLayout,
ADataType,
BDataType,
CDataType,
AccDataType,
AElementOp,
BElementOp,
CElementOp>;
#include "run_gemm_example.inc"
int main(int argc, char* argv[]) { return !run_gemm_example(argc, argv); }
// SPDX-License-Identifier: MIT
// Copyright (c) 2018-2023, Advanced Micro Devices, Inc. All rights reserved.
#ifndef CK_EXPERIMENTAL_BIT_INT_EXTENSION_INT4
#error Should compile this file with ck::int4_t support
#endif
#ifdef CK_EXPERIMENTAL_BIT_INT_EXTENSION_INT4
#include "common.hpp"
......@@ -43,3 +41,4 @@ using ReferenceGemmInstance = ck::tensor_operation::host::
#include "run_gemm_example.inc"
int main(int argc, char* argv[]) { return !run_gemm_example(argc, argv); }
#endif
\ No newline at end of file