"...include/git@developer.sourcefind.cn:gaoqiong/migraphx.git" did not exist on "5aa4e6868f0a260603e521c47650c909ad0a2d8f"
Commit 9b0fc65a authored by Alan Turner

Merge remote-tracking branch 'origin/ck-host-lib' into ck-gemm-api

parents 5cef60b8 031ccf5f
@@ -7,17 +7,17 @@ This workflow adds pull requests and issues to a specific GitHub project board w
</p>

## Trigger

The workflow is triggered by the following events:

- A pull request being opened.
- An issue being opened.

## Jobs

The workflow has a single job named `add-to-project`. The following step is executed in this job:

- The `add-to-project` job uses the `actions/add-to-project@v0.4.0` action to add pull requests and issues to a specific project board. The `with` parameters are `project-url` and `github-token`, which specify the URL of the project board and the GitHub token used to authenticate the action.
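The trigger and step described above can be sketched as a workflow file along these lines (a minimal sketch; the project URL and secret name are placeholders, not values from the repository):

```yaml
name: Add to project
on:
  pull_request:
    types: [opened]
  issues:
    types: [opened]

jobs:
  add-to-project:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/add-to-project@v0.4.0
        with:
          # Placeholder board URL; the real one lives in the repository's workflow file.
          project-url: https://github.com/orgs/EXAMPLE-ORG/projects/1
          github-token: ${{ secrets.EXAMPLE_PROJECT_TOKEN }}
```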
For more details, please refer to the [add-to-project.yaml](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/.github/workflows/add-to-project.yaml) file in the repository.

---

## `benchmark.yaml`
@@ -27,21 +27,21 @@ This workflow runs the `MiGraphX performance benchmarks` and generates reports b
</p>

## Trigger

TODO: Update [benchmarks.yml (archived)](https://github.com/ROCmSoftwarePlatform/actions/blob/main/.github/workflows/benchmarks.yml) link after workflow is updated

- The workflow is triggered manually through the "Run workflow" button in the Actions tab of the repository, and it runs the reusable workflow [benchmarks.yml (archived)](https://github.com/ROCmSoftwarePlatform/actions/blob/main/.github/workflows/benchmarks.yml).

## Input Parameters

The workflow uses the following input parameters:

- `rocm_version`: the version of ROCm to use for running the benchmarks.
- `script_repo`: the repository that contains the benchmark scripts.
- `result_path`: the path where benchmark results will be stored.
- `result_repo`: the repository where the benchmark results will be pushed for comparison.
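A manually triggered workflow with these inputs that dispatches to a reusable workflow might look roughly like the following (a sketch only; the input wiring and the `@main` ref are illustrative assumptions):

```yaml
name: Benchmarks
on:
  workflow_dispatch:
    inputs:
      rocm_version:
        description: ROCm version to use for running the benchmarks
        required: true
      script_repo:
        description: Repository that contains the benchmark scripts
        required: true
      result_path:
        description: Path where benchmark results will be stored
        required: true
      result_repo:
        description: Repository where results are pushed for comparison
        required: true

jobs:
  benchmarks:
    # Calls the archived reusable workflow referenced above.
    uses: ROCmSoftwarePlatform/actions/.github/workflows/benchmarks.yml@main
    with:
      rocm_version: ${{ github.event.inputs.rocm_version }}
    secrets: inherit
```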
For more details, please refer to the [benchmark.yaml](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/.github/workflows/benchmark.yaml) file in the repository.

---
@@ -52,36 +52,35 @@ Overall, this workflow automates the process of building and testing the AMDMIGr
</p>

## Trigger

The workflow is triggered by the following events:

- A pull request being opened, synchronized or closed.
- On push to the `develop`, `master`, and `release/**` branches.

## Jobs

The following jobs are executed in the workflow:

- `cancel`: This job is responsible for canceling any previous runs of the workflow that may still be running. It runs on an `ubuntu-latest` runner and uses the `styfle/cancel-workflow-action` action.
- `tidy`: It runs on an `ubuntu-20.04` runner and runs `clang-tidy` over the codebase in a Docker container with the MIGraphX build environment.
- `cppcheck`: It runs on an `ubuntu-20.04` runner, performs static analysis on the code in a Docker container, and caches the results for faster subsequent runs.
- `format`: It runs on an `ubuntu-20.04` runner and includes steps for freeing up disk space, caching Docker layers, and checking code formatting.
- `pyflakes`: It runs on an `ubuntu-20.04` runner and runs the Pyflakes static analysis tool to detect and report Python code issues.
- `licensing`: It runs on an `ubuntu-20.04` runner and includes steps to free up space, check out the code, set up Python, and run a license check using a Python script.

We have two jobs with multiple matrix configurations, both running on an `ubuntu-20.04` runner. Right now only `linux` works with all three configurations (debug, release, codecov); `linux-fpga` works only with debug.

- `linux`: this job runs continuous integration tests for AMDMIGraphX on a Linux operating system. It tests a variety of build configurations to ensure code quality and compatibility.
- `linux-fpga`: this job builds and tests AMDMIGraphX on a Linux operating system with support for FPGA acceleration. It includes additional steps to verify FPGA functionality and performance.
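The matrix split described above — `linux` on three configurations, `linux-fpga` on debug only — could be expressed with a strategy matrix along these lines (a sketch; the build steps are placeholders, not the repository's actual commands):

```yaml
jobs:
  linux:
    runs-on: ubuntu-20.04
    strategy:
      matrix:
        configuration: [debug, release, codecov]
    steps:
      - uses: actions/checkout@v3
      # Placeholder: build and test for ${{ matrix.configuration }}

  linux-fpga:
    runs-on: ubuntu-20.04
    strategy:
      matrix:
        configuration: [debug]
    steps:
      - uses: actions/checkout@v3
      # Placeholder: FPGA-enabled build and test
```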
For more details, please refer to the [ci.yaml](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/.github/workflows/ci.yaml) file in the repository.

---
## `clean-closed-pr-caches.yaml`

@@ -91,17 +90,17 @@ This workflow has purpose to clean up any cached data related to the pull reques
</p>

## Trigger

The workflow is triggered by the following events:

- A pull request being closed.

## Jobs

The workflow has a single job named `cleanup`. The following steps are executed in this job:

- `Check out code`: this step checks out the codebase from the repository.
- `Cleanup`: this step performs the actual cache cleanup using a series of commands.
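One common way to implement such a cleanup step is with the `gh actions-cache` extension; the sketch below is an assumption about the general shape, not the repository's actual commands:

```yaml
name: Clean closed PR caches
on:
  pull_request:
    types: [closed]

jobs:
  cleanup:
    runs-on: ubuntu-latest
    steps:
      - name: Check out code
        uses: actions/checkout@v3
      - name: Cleanup
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # Illustrative: delete every cache scoped to the closed PR's merge ref.
          gh extension install actions/gh-actions-cache
          REF="refs/pull/${{ github.event.pull_request.number }}/merge"
          for key in $(gh actions-cache list -R ${{ github.repository }} -B "$REF" | cut -f1); do
            gh actions-cache delete "$key" -R ${{ github.repository }} -B "$REF" --confirm
          done
```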
For more details, please refer to the [clean-closed-pr-caches.yaml](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/.github/workflows/clean-closed-pr-caches.yaml) file in the repository.

---
@@ -112,22 +111,22 @@ This workflow generates a report of the MiGraphX benchmark results between two d
</p>

## Trigger

- The workflow is triggered manually through the "Run workflow" button in the Actions tab of the repository, and it runs the reusable workflow [history.yml](https://github.com/ROCmSoftwarePlatform/migraphx-benchmark/blob/main/.github/workflows/history.yml).

## Input Parameters

The workflow requires the following inputs:

- `start_date`: Start date for results analysis.
- `end_date`: End date for results analysis.
- `history_repo`: Repository for history results between dates.
- `benchmark_utils_repo`: Repository where benchmark utils are stored.
- `organization`: Organization that determines the location of files.
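The inputs above would be declared under `workflow_dispatch` and forwarded to the reusable workflow, roughly like this (a sketch; the forwarding and `@main` ref are assumptions):

```yaml
on:
  workflow_dispatch:
    inputs:
      start_date:
        description: Start date for results analysis
      end_date:
        description: End date for results analysis
      history_repo:
        description: Repository for history results between dates
      benchmark_utils_repo:
        description: Repository where benchmark utils are stored
      organization:
        description: Organization that determines the location of files

jobs:
  history:
    uses: ROCmSoftwarePlatform/migraphx-benchmark/.github/workflows/history.yml@main
    with:
      start_date: ${{ github.event.inputs.start_date }}
      end_date: ${{ github.event.inputs.end_date }}
    secrets: inherit
```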
For more details, please refer to the [history.yaml](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/.github/workflows/history.yaml) file in the repository.

---
## `performance.yaml`

@@ -137,32 +136,32 @@ This workflow runs performance tests on the MIGraphX repository and generates a
</p>

## Trigger

The workflow runs the reusable workflow [perf-test.yml](https://github.com/ROCmSoftwarePlatform/migraphx-benchmark/blob/main/.github/workflows/perf-test.yml) on the following events:

- Pull requests opened, synchronized or closed on the `develop` branch.
- Schedule: runs every day of the week from Monday to Saturday at 6:00 AM.
- Manual trigger through the "Run workflow" button in the Actions tab of the repository.

## Input Parameters

The workflow requires the following inputs:

- `rocm_release`: ROCm version to use for the performance tests.
- `performance_reports_repo`: Repository where the performance reports are stored.
- `benchmark_utils_repo`: Repository where the benchmark utilities are stored.
- `organization`: Organization that determines the location of files.
- `result_number`: Number of most recent results to include (last N results).
- `model_timeout`: If a model in the performance test script exceeds this threshold, it is skipped.
- `flags`: Command line arguments to be passed to the performance test script. Default is `-r`.
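The three triggers listed above map onto the standard `on:` events; in cron syntax, "Monday to Saturday at 6:00 AM" is `0 6 * * 1-6`. A sketch of the trigger block (the default value shown for `flags` comes from the text; the rest of the layout is illustrative):

```yaml
on:
  pull_request:
    branches: [develop]
    types: [opened, synchronize, closed]
  schedule:
    - cron: '0 6 * * 1-6'   # Monday through Saturday at 6:00 AM
  workflow_dispatch:
    inputs:
      rocm_release:
        description: ROCm version to use for the performance tests
      flags:
        description: Command line arguments for the performance test script
        default: '-r'
```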
For more details, please refer to the [performance.yaml](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/.github/workflows/performance.yaml) file in the repository.

---
## `rocm-image-release.yaml`

@@ -172,26 +171,26 @@ This workflow builds a Docker image for a specified ROCm release version and pus
</p>

## Trigger

- The workflow is triggered manually through the "Run workflow" button in the Actions tab of the repository, and it runs the reusable workflow [rocm-release.yml](https://github.com/ROCmSoftwarePlatform/migraphx-benchmark/blob/main/.github/workflows/rocm-release.yml).

## Input Parameters

The workflow requires the following inputs:

- `rocm_release`: ROCm release version to build the Docker image for.
- `benchmark_utils_repo`: Repository where benchmark utils are stored.
- `base_image`: Base image for the ROCm Docker build.
- `docker_image`: Docker image name for the ROCm Docker build.
- `build_navi`: Build number for the Navi architecture.
- `organization`: The organization name used to determine the location of files.
- `overwrite`: Whether to overwrite the Docker image if it already exists.
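A manually dispatched image-release workflow with these inputs could be declared roughly as follows (a sketch; input types and the `@main` ref are assumptions):

```yaml
on:
  workflow_dispatch:
    inputs:
      rocm_release:
        description: ROCm release version to build the Docker image for
      base_image:
        description: Base image for the ROCm Docker build
      docker_image:
        description: Docker image name for the ROCm Docker build
      build_navi:
        description: Build number for the Navi architecture
      overwrite:
        description: Overwrite the Docker image if it already exists
        type: boolean
        default: false

jobs:
  release:
    uses: ROCmSoftwarePlatform/migraphx-benchmark/.github/workflows/rocm-release.yml@main
    secrets: inherit
```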
For more details, please refer to the [rocm-image-release.yaml](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/.github/workflows/rocm-image-release.yaml) file in the repository.

---
@@ -202,24 +201,24 @@ This workflow updates a file with the latest commit hash then creates a pull req
</p>

## Trigger

The workflow is triggered by the following events:

- Schedule: runs every week on Friday at 05:07 PM.

## Jobs

The workflow has a single job named `Update and create pull request`. The following steps are executed in this job:

- `get_date`: this step sets an environment variable to the current date in the format 'YYYY-MM-DD'.
- `extract_sha1`: this step fetches the latest SHA1 commit hash of the HEAD branch of the `microsoft/onnxruntime` repository and sets it as an environment variable.
- `echo_sha1`: this step prints the SHA1 commit hash set in step `extract_sha1`.
- `actions/checkout@v3`: this step checks out the codebase from the repository.
- `update_file`: this step updates a file in the repository with the SHA1 commit hash fetched in step `extract_sha1`.
- `Make changes to pull request`: this step uses the `peter-evans/create-pull-request` action to create a pull request.
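The steps above can be sketched as follows ("Friday at 05:07 PM" is `7 17 * * 5` in cron syntax; the target file path and PR title are placeholders, not taken from the repository):

```yaml
on:
  schedule:
    - cron: '7 17 * * 5'   # Fridays at 05:07 PM

jobs:
  sync:
    name: Update and create pull request
    runs-on: ubuntu-latest
    steps:
      - name: get_date
        run: echo "TODAY=$(date +'%Y-%m-%d')" >> $GITHUB_ENV
      - name: extract_sha1
        # Fetch the SHA1 of onnxruntime's HEAD without cloning the repository.
        run: echo "SHA1=$(git ls-remote https://github.com/microsoft/onnxruntime HEAD | cut -f1)" >> $GITHUB_ENV
      - name: echo_sha1
        run: echo "${{ env.SHA1 }}"
      - uses: actions/checkout@v3
      - name: update_file
        # Placeholder path; the real workflow writes the hash into its pinned-version file.
        run: echo "${{ env.SHA1 }}" > onnxruntime-version.txt
      - name: Make changes to pull request
        uses: peter-evans/create-pull-request@v4
        with:
          title: "Sync onnxruntime to ${{ env.SHA1 }}"
```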
For more details, please refer to the [sync-onnxrt-main.yaml](https://github.com/ROCmSoftwarePlatform/AMDMIGraphX/blob/develop/.github/workflows/sync-onnxrt-main.yaml) file in the repository.

---
@@ -28,4 +28,4 @@ ROCmSoftwarePlatform/half@rocm-5.4.2
 pybind/pybind11@d159a563383d10c821ba7b2a71905d1207db6de4 --build
 msgpack/msgpack-c@cpp-3.3.0 -DMSGPACK_BUILD_TESTS=Off
 sqlite3@3.17 -DCMAKE_POSITION_INDEPENDENT_CODE=On
-ROCmSoftwarePlatform/composable_kernel@bef0cb20dba0d9b315df46899310478a81c21852 -X header
+#ROCmSoftwarePlatform/composable_kernel@bef0cb20dba0d9b315df46899310478a81c21852 -X header
@@ -46,6 +46,7 @@ def shape_type_wrap(p):
 def auto_handle(*args, **kwargs):
     def with_handle(f):
         return api.handle('migraphx_' + f.__name__, 'migraphx::' + f.__name__,
                           *args, **kwargs)(f)
@@ -213,28 +213,22 @@ void from_value_impl(rank<6>, const value& v, optional<T>& x)
     x = from_value<T>(v);
 }
-template <class T, MIGRAPHX_REQUIRES(std::is_arithmetic<T>{})>
+template <class T, MIGRAPHX_REQUIRES(std::is_arithmetic<T>{} or std::is_enum<T>{})>
 void from_value_impl(rank<7>, const value& v, T& x)
 {
     x = v.to<T>();
 }
-template <class T, MIGRAPHX_REQUIRES(std::is_enum<T>{})>
-void from_value_impl(rank<8>, const value& v, T& x)
-{
-    x = v.to<T>();
-}
-inline void from_value_impl(rank<9>, const value& v, std::string& x) { x = v.to<std::string>(); }
+inline void from_value_impl(rank<8>, const value& v, std::string& x) { x = v.to<std::string>(); }
 template <class T>
-auto from_value_impl(rank<10>, const value& v, T& x) -> decltype(x.from_value(v), void())
+auto from_value_impl(rank<9>, const value& v, T& x) -> decltype(x.from_value(v), void())
 {
     x.from_value(v);
 }
 template <class T>
-auto from_value_impl(rank<11>, const value& v, T& x) -> decltype(migraphx_from_value(v, x), void())
+auto from_value_impl(rank<10>, const value& v, T& x) -> decltype(migraphx_from_value(v, x), void())
 {
     migraphx_from_value(v, x);
 }
@@ -250,7 +244,7 @@ value to_value(const T& x)
 template <class T>
 void from_value(const value& v, T& x)
 {
-    detail::from_value_impl(rank<11>{}, v, x);
+    detail::from_value_impl(rank<10>{}, v, x);
 }
 } // namespace MIGRAPHX_INLINE_NS
@@ -243,8 +243,10 @@ else()
     message(STATUS "MIOpen does not have find mode api")
 endif()
+find_package(composable_kernel 1.0.0 COMPONENTS jit_library REQUIRED)
 target_link_libraries(migraphx_gpu PUBLIC migraphx MIOpen roc::rocblas)
-target_link_libraries(migraphx_gpu PRIVATE migraphx_device migraphx_kernels)
+target_link_libraries(migraphx_gpu PRIVATE migraphx_device migraphx_kernels composable_kernel::jit_library)
 add_subdirectory(driver)
 add_subdirectory(hiprtc)
@@ -112,7 +112,8 @@ const std::vector<std::string>& compiler_warnings()
     "-Wno-sign-compare",
     "-Wno-unused-command-line-argument",
     "-Wno-weak-vtables",
-    "-Wno-c99-extensions"};
+    "-Wno-c99-extensions",
+    "-Wno-global-constructors"};
     return warnings;
 }
@@ -171,6 +172,19 @@ operation compile_hip_code_object(const std::string& content, hip_compile_option
         auto path = fs::path{"migraphx"} / "kernels" / name;
         return src_file{path, c};
     });
+    if(not options.embedded_headers.empty())
+    {
+        std::transform(options.embedded_headers.begin(),
+                       options.embedded_headers.end(),
+                       std::back_inserter(srcs),
+                       [](auto&& p) {
+                           auto&& name = p.first;
+                           auto&& c    = p.second;
+                           auto path   = fs::path{"migraphx"} / "kernels" / name;
+                           return src_file{path, c};
+                       });
+    }
     srcs.push_back(src_file{fs::path{"main.cpp"},
                             std::make_pair(content.data(), content.data() + content.size())});
     auto args_hpp =
@@ -42,6 +42,7 @@ struct hip_compile_options
     std::string kernel_name        = "kernel";
     std::string params             = "";
     std::vector<shape> virtual_inputs = {};
+    std::unordered_map<std::string, std::pair<const char*, const char*>> embedded_headers;
     /**
      * @brief Set the launch parameters but allow v to override the values
@@ -38,8 +38,7 @@
 #include <migraphx/env.hpp>
 #include <migraphx/file_buffer.hpp>
-const std::vector<std::string>&
-get_instance(std::size_t i, const std::function<bool(const std::vector<std::string>&)>& pred);
+#include "ck/include/device_gemm_multiple_d.hpp"
 namespace migraphx {
 inline namespace MIGRAPHX_INLINE_NS {
@@ -58,6 +57,7 @@ static const char* const ck_gemm_kernel = R"__migraphx__(
 #include <args.hpp>
 #include <migraphx/kernels/ck_gemm.hpp>
 #include <migraphx/kernels/pointwise.hpp>
+#include <migraphx/kernels/${include}>
 namespace migraphx {
@@ -68,7 +68,7 @@ extern "C" {
 __global__ void ${kernel}(${params})
 {
     transform_args(make_tensors(), rotate_last())(${args})([](auto... xs) {
-        ck_gemm<CK_DeviceGemmMultipleD<${instance}>, ${blocks_per_batch}>(xs...);
+        ck_gemm<${solution}, ${blocks_per_batch}>(xs...);
     });
 }
@@ -78,88 +78,6 @@ __global__ void ${kernel}(${params})
 )__migraphx__";
-static std::size_t int_div_ceil(std::size_t x, std::size_t y) { return (x + y - 1) / y; }
-struct instance
-{
-    std::vector<std::string> params;
-    static const std::size_t block_size_index = 15;
-    std::size_t int_at(std::size_t i) const { return std::stoull(params[i]); }
-    std::size_t get_block_size() const { return int_at(block_size_index); }
-    std::size_t get_pb(std::size_t i) const
-    {
-        assert(i < 4);
-        return int_at(block_size_index + 1 + i);
-    }
-    std::array<std::size_t, 3> get_pad(const std::array<std::size_t, 3>& config) const
-    {
-        std::array<std::size_t, 3> result{};
-        for(auto i : range(config.size()))
-        {
-            result[i] = int_div_ceil(config[i], get_pb(i)) * get_pb(i) - config[i];
-        }
-        return result;
-    }
-    std::size_t get_grid_size(const std::array<std::size_t, 3>& config) const
-    {
-        return int_div_ceil(config[0], get_pb(0)) * int_div_ceil(config[1], get_pb(1));
-    }
-    void set_ds_layout(const std::string& s)
-    {
-        assert(params[2] == "ck::Tuple<>");
-        params[2] = s;
-    }
-    void set_ds_type(const std::string& s)
-    {
-        assert(params[8] == "ck::Tuple<>");
-        params[8] = s;
-    }
-    void set_e_type(const std::string& s)
-    {
-        // assert(params[9] == "ck::Tuple<>");
-        params[9] = s;
-    }
-    void set_ds_op(const std::string& s)
-    {
-        assert(params[12] == "ck_passthrough");
-        params[12] = s;
-    }
-    void set_gemm(const std::string& s)
-    {
-        assert(params[13] == "ck::tensor_operation::device::GemmSpecialization::Default");
-        params[13] = s;
-    }
-    void set_a_scalar_per_vec(const std::string& s)
-    {
-        params[block_size_index + 14] = s;
-        params[block_size_index + 15] = s;
-    }
-    void set_b_scalar_per_vec(const std::string& s)
-    {
-        params[block_size_index + 20] = s;
-        params[block_size_index + 21] = s;
-    }
-    void set_c_scalar_per_vec(const std::string& s)
-    {
-        params[params.size() - 3] = s;
-    }
-    std::string str() const { return join_strings(params, ","); }
-};
 static bool transposed_matrix(const shape& s) { return s.strides().back() != 1; }
 template <class F, class Action>
@@ -309,9 +227,10 @@ struct ck_gemm_compiler : compiler<ck_gemm_compiler>
     operation compile_op(context& /* ctx */, const std::vector<shape>& inputs, const value& v) const
     {
         auto a_shape = inputs[0];
         auto b_shape = inputs[1];
         auto c_shape = inputs.back();
+        auto tuning_value = get_tuning_for({a_shape, b_shape, c_shape});
         auto rank      = a_shape.lens().size();
         auto b_strides = b_shape.strides();
...@@ -322,55 +241,61 @@ struct ck_gemm_compiler : compiler<ck_gemm_compiler> ...@@ -322,55 +241,61 @@ struct ck_gemm_compiler : compiler<ck_gemm_compiler>
m = can_fold_batch ? m * batch_count : m; m = can_fold_batch ? m * batch_count : m;
auto n = c_shape.lens().back(); auto n = c_shape.lens().back();
auto k = a_shape.lens().back(); auto k = a_shape.lens().back();
std::array<char, 3> keys{'M', 'N', 'K'};
std::array<std::size_t, 3> config{m, n, k}; const bool transA = transposed_matrix(a_shape);
auto tuning_val = v.get("tuning_val", get_tuning_for({a_shape, b_shape, c_shape.with_type(a_shape.type())})); const bool transB = transposed_matrix(b_shape);
auto ip = instance{get_instance(tuning_val, [&](const auto& x) -> bool { const bool transE = transposed_matrix(c_shape);
return get_layout(a_shape) == x[0] and get_layout(b_shape) == x[1] and const auto a_type = get_type(a_shape);
get_layout(c_shape) == x[3] and get_type(a_shape) == x[4] and const auto b_type = get_type(b_shape);
get_type(b_shape) == x[5]; const auto e_type = get_type(c_shape);
})}; std::vector<bool> ds_layout;
std::transform(inputs.begin() + 2,
inputs.end() - 1,
std::back_inserter(ds_layout),
[](const auto& i) { return transposed_matrix(i); });
std::vector<std::string> ds_type;
std::transform(inputs.begin() + 2,
inputs.end() - 1,
std::back_inserter(ds_type),
[](const auto& i) { return get_type(i); });
std::string ck_passthrough = "ck_passthrough";
std::string cde_op = ck_passthrough;
    assert(inputs.size() < 4 or v.contains("post"));
    if(v.contains("post"))
    {
        cde_op = v.at("post").to<std::string>();
    }
    auto problem = ck::tensor_operation::device::device_gemm_multiple_d::Problem{
        static_cast<ck::index_t>(m),
        static_cast<ck::index_t>(n),
        static_cast<ck::index_t>(k),
        transA,
        transB,
        transE,
        ds_layout,
        a_type,
        b_type,
        e_type,
        ds_type,
        ck_passthrough,
        ck_passthrough,
        cde_op};
    const auto include_header   = problem.GetIncludeHeader();
    const auto ck_headers       = problem.GetHeaders();
    const auto solutions        = problem.GetSolutions("gfx90a");
    const auto solution         = solutions.at(tuning_value);
    const auto template_str     = solution.template_str;
    const auto blocks_per_batch = solution.grid_size;
    const auto block_size       = solution.block_size;
    hip_compile_options options;
    options.embedded_headers = ck_headers;
    auto grid_size = can_fold_batch ? blocks_per_batch : batch_count * blocks_per_batch;
    options.set_launch_params(v, grid_size * block_size, block_size);
    options.inputs = inputs;
    options.output = c_shape;
@@ -389,7 +314,8 @@ struct ck_gemm_compiler : compiler<ck_gemm_compiler>
    options.params += " -DMIGRAPHX_CK_CHECK=1";
    auto src = interpolate_string(ck_gemm_kernel,
                                  {{"solution", template_str},
                                   {"include", include_header},
                                   {"params", enum_params(inputs.size(), "void * private_p")},
                                   {"args", enum_params(inputs.size(), "private_p")},
                                   {"blocks_per_batch", to_string(blocks_per_batch)},
...
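The kernel source above is produced by substituting the selected solution string into a kernel template. A minimal sketch of that kind of `${...}` substitution, using Python's `string.Template` (the template text and names here are illustrative, not the actual MIGraphX `interpolate_string` API or the real `ck_gemm_kernel` template):

```python
import string

# Hypothetical kernel template with ${...} placeholders, in the style of
# the ck_gemm_kernel template substituted above (not the real template).
kernel_template = "launch<${solution}>(${args});"

src = string.Template(kernel_template).substitute(
    solution="CK_DeviceGemmMultipleD<...>",
    args="a, b, c")

print(src)  # launch<CK_DeviceGemmMultipleD<...>>(a, b, c);
```

The compiler pass does the same thing at a larger scale: every placeholder is filled from values computed from the selected tuning solution.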
@@ -29,7 +29,6 @@
#include <migraphx/kernels/integral_constant.hpp>
#include <migraphx/kernels/tensor_view.hpp>
#include <migraphx/kernels/ck.hpp>
#include <migraphx/kernels/gemm_batcher.hpp>
namespace migraphx {
@@ -48,52 +47,15 @@ using ck_transposeb = decltype(make_shape(ck_transposeb_dims(get_shape_c<Tensor>
template <class G, class E, class A, class B, class... Ds>
__device__ void ck_gemm_matrix(E e, A a, B b, Ds... ds)
{
    constexpr auto desc = G::make_descriptor(to_ck_tensor<A>(),
                                             to_ck_tensor<ck_transposeb<B>>(),
                                             ck::make_tuple(to_ck_tensor<Ds>()...),
                                             to_ck_tensor<E>());
    G::Run(desc,
           to_ck_const_pointer(a.data()),
           to_ck_const_pointer(b.data()),
           ck::make_tuple(to_ck_const_pointer(ds.data())...),
           to_ck_pointer(e.data()));
}
template <class G, index_int BlocksPerBatch, class... Ts>
...
/*
* The MIT License (MIT)
*
* Copyright (c) 2015-2022 Advanced Micro Devices, Inc. All rights reserved.
*
* Permission is hereby granted, free of charge, to any person obtaining a copy
* of this software and associated documentation files (the "Software"), to deal
* in the Software without restriction, including without limitation the rights
* to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
* copies of the Software, and to permit persons to whom the Software is
* furnished to do so, subject to the following conditions:
*
* The above copyright notice and this permission notice shall be included in
* all copies or substantial portions of the Software.
*
* THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
* IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
* FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
* AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
* LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
* OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
* THE SOFTWARE.
*/
#ifndef MIGRAPHX_GUARD_KERNELS_CK_INCLUDES_HPP
#define MIGRAPHX_GUARD_KERNELS_CK_INCLUDES_HPP
#include <migraphx/kernels/index.hpp>
#include <migraphx/kernels/algorithm.hpp>
#include <migraphx/kernels/integral_constant.hpp>
#include <migraphx/kernels/tensor_view.hpp>
#include <ck/utility/common_header.hpp>
#include <ck/tensor_description/tensor_descriptor.hpp>
#include <ck/tensor_description/tensor_descriptor_helper.hpp>
#include <ck/tensor_operation/gpu/device/tensor_layout.hpp>
#include <ck/tensor_operation/gpu/device/device_gemm.hpp>
#include <ck/tensor_operation/gpu/device/gemm_specialization.hpp>
#include <ck/tensor_operation/gpu/grid/gridwise_gemm_xdl_cshuffle_v1.hpp>
#include <ck/tensor_operation/gpu/device/matrix_padder.hpp>
#include <ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_d_xdl_cshuffle.hpp>
namespace migraphx {
template <ck::index_t MPerBlock, ck::index_t NPerBlock, typename CGridDesc_M_N>
struct BlockToCTileMap_M00_N0_M01Adapt
{
static constexpr auto I0 = ck::Number<0>{};
static constexpr auto I1 = ck::Number<1>{};
static constexpr auto I2 = ck::Number<2>{};
static constexpr auto I3 = ck::Number<3>{};
__host__ __device__ constexpr BlockToCTileMap_M00_N0_M01Adapt() = default;
__host__
__device__ constexpr BlockToCTileMap_M00_N0_M01Adapt(const CGridDesc_M_N& c_grid_desc_m_n,
ck::index_t M01 = 8)
: M01_(M01), c_grid_desc_m_n_(c_grid_desc_m_n)
{
}
__host__ __device__ constexpr ck::index_t
CalculateGridSize(const CGridDesc_M_N& c_grid_desc_m_n) const
{
const auto M0 = ck::math::integer_divide_ceil(c_grid_desc_m_n.GetLength(I0), MPerBlock);
const auto N0 = ck::math::integer_divide_ceil(c_grid_desc_m_n.GetLength(I1), NPerBlock);
const ck::index_t grid_size = M0 * N0;
return grid_size;
}
template <typename TopIdx>
__host__ __device__ constexpr auto CalculateBottomIndex(const TopIdx& idx_top) const
{
auto block_1d_id = idx_top[I0];
const auto M0 = ck::math::integer_divide_ceil(c_grid_desc_m_n_.GetLength(I0), MPerBlock);
const auto N0 = ck::math::integer_divide_ceil(c_grid_desc_m_n_.GetLength(I1), NPerBlock);
block_1d_id = block_1d_id % (M0 * N0); // swallow batch index
ck::index_t idx_N0 = block_1d_id % N0;
ck::index_t idx_M0 = block_1d_id / N0;
const auto M01_adapt = (idx_M0 < M0 - M0 % M01_) ? M01_ : M0 % M01_;
ck::index_t idx_M00 = idx_M0 / M01_;
ck::index_t idx_M01 = idx_M0 % M01_;
ck::index_t idx_N0_M01_local = idx_N0 + idx_M01 * N0;
return ck::make_tuple(idx_N0_M01_local % M01_adapt + idx_M00 * M01_,
idx_N0_M01_local / M01_adapt);
}
template <typename CTileIdx, typename CTileDim>
__host__ __device__ bool constexpr ValidCTileIndex(const CTileIdx& /* c_tile_idx */,
const CTileDim& /* c_tile_dim */) const
{
return true; // always valid provided that user gets grid size from CalculateGridSize()
}
__host__ __device__ constexpr bool
CheckValidity(const CGridDesc_M_N& /* c_grid_desc_m_n */) const
{
return true;
}
private:
ck::index_t M01_;
CGridDesc_M_N c_grid_desc_m_n_;
};
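The index arithmetic in `CalculateBottomIndex` above groups `M01` consecutive tile rows so that neighbouring thread blocks work on nearby output tiles. The remapping can be checked on the host; this Python sketch mirrors the same arithmetic and is purely illustrative, not part of the library:

```python
def calculate_bottom_index(block_1d_id, M0, N0, M01=8):
    """Map a linear block id to an (m0, n0) output-tile index, grouping
    M01 consecutive tile rows (mirrors BlockToCTileMap_M00_N0_M01Adapt)."""
    block_1d_id %= M0 * N0  # swallow batch index
    idx_N0 = block_1d_id % N0
    idx_M0 = block_1d_id // N0
    # The last, possibly partial, group of rows uses a smaller adapt factor
    M01_adapt = M01 if idx_M0 < M0 - M0 % M01 else M0 % M01
    idx_M00 = idx_M0 // M01
    idx_M01 = idx_M0 % M01
    local = idx_N0 + idx_M01 * N0
    return (local % M01_adapt + idx_M00 * M01, local // M01_adapt)
```

A quick host-side check confirms the mapping is a permutation of the tile grid: every block id in `[0, M0 * N0)` lands on a distinct in-range `(m0, n0)` pair, which is why `ValidCTileIndex` can always return true when the grid size comes from `CalculateGridSize`.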
template <typename ALayout,
typename BLayout,
typename DsLayout,
typename ELayout,
typename ADataType,
typename BDataType,
typename AccDataType,
typename CShuffleDataType,
typename DsDataType,
typename EDataType,
typename AElementwiseOperation,
typename BElementwiseOperation,
typename CDEElementwiseOperation,
ck::tensor_operation::device::GemmSpecialization GemmSpec,
ck::index_t NumGemmKPrefetchStage,
ck::index_t BlockSize,
ck::index_t MPerBlock,
ck::index_t NPerBlock,
ck::index_t KPerBlock,
ck::index_t AK1,
ck::index_t BK1,
ck::index_t MPerXDL,
ck::index_t NPerXDL,
ck::index_t MXdlPerWave,
ck::index_t NXdlPerWave,
typename ABlockTransferThreadClusterLengths_AK0_M_AK1,
typename ABlockTransferThreadClusterArrangeOrder,
typename ABlockTransferSrcAccessOrder,
ck::index_t ABlockTransferSrcVectorDim,
ck::index_t ABlockTransferSrcScalarPerVector,
ck::index_t ABlockTransferDstScalarPerVector_AK1,
ck::index_t ABlockLdsExtraM,
typename BBlockTransferThreadClusterLengths_BK0_N_BK1,
typename BBlockTransferThreadClusterArrangeOrder,
typename BBlockTransferSrcAccessOrder,
ck::index_t BBlockTransferSrcVectorDim,
ck::index_t BBlockTransferSrcScalarPerVector,
ck::index_t BBlockTransferDstScalarPerVector_BK1,
ck::index_t BBlockLdsExtraN,
ck::index_t CShuffleMXdlPerWavePerShuffle,
ck::index_t CShuffleNXdlPerWavePerShuffle,
typename CDEBlockTransferClusterLengths_MBlock_MPerBlock_NBlock_NPerBlock,
ck::index_t CDEBlockTransferScalarPerVector_NPerBlock,
ck::LoopScheduler LoopSched = ck::make_default_loop_scheduler(),
ck::PipelineVersion PipelineVer = ck::PipelineVersion::v1>
struct CK_DeviceGemmMultipleD
{
static constexpr auto I0 = ck::Number<0>{};
static constexpr auto I1 = ck::Number<1>{};
// static constexpr auto I2 = ck::Number<2>{};
// static constexpr auto I3 = ck::Number<3>{};
// static constexpr auto I4 = ck::Number<4>{};
// static constexpr auto I5 = ck::Number<5>{};
// static constexpr auto I6 = ck::Number<6>{};
// static constexpr auto I7 = ck::Number<7>{};
ck::tensor_operation::device::MatrixPadder<GemmSpec, ck::index_t, ck::index_t, ck::index_t>
matrix_padder{MPerBlock, NPerBlock, KPerBlock};
// GridwiseGemm
using GridwiseGemm = ck::GridwiseGemmMultipleD_xdl_cshuffle<
ADataType, // TODO: distinguish A/B datatype
AccDataType,
CShuffleDataType,
DsDataType,
EDataType,
AElementwiseOperation,
BElementwiseOperation,
CDEElementwiseOperation,
ck::InMemoryDataOperationEnum::Set,
NumGemmKPrefetchStage,
BlockSize,
MPerBlock,
NPerBlock,
KPerBlock,
AK1,
BK1,
MPerXDL,
NPerXDL,
MXdlPerWave,
NXdlPerWave,
ABlockTransferThreadClusterLengths_AK0_M_AK1,
ABlockTransferThreadClusterArrangeOrder,
ABlockTransferSrcAccessOrder,
ABlockTransferSrcVectorDim,
ABlockTransferSrcScalarPerVector,
ABlockTransferDstScalarPerVector_AK1,
false,
ABlockLdsExtraM,
BBlockTransferThreadClusterLengths_BK0_N_BK1,
BBlockTransferThreadClusterArrangeOrder,
BBlockTransferSrcAccessOrder,
BBlockTransferSrcVectorDim,
BBlockTransferSrcScalarPerVector,
BBlockTransferDstScalarPerVector_BK1,
false,
BBlockLdsExtraN,
CShuffleMXdlPerWavePerShuffle,
CShuffleNXdlPerWavePerShuffle,
CDEBlockTransferClusterLengths_MBlock_MPerBlock_NBlock_NPerBlock,
CDEBlockTransferScalarPerVector_NPerBlock,
LoopSched,
PipelineVer>;
// return block_id to E matrix tile idx (m0, n0) mapping
template <class EGridDesc_M_N>
__device__ static constexpr auto
MakeDefaultBlock2ETileMap(const EGridDesc_M_N& e_grid_desc_m_n_)
{
return BlockToCTileMap_M00_N0_M01Adapt<MPerBlock, NPerBlock, EGridDesc_M_N>(
e_grid_desc_m_n_);
}
// block_id to matrix tile idx (m0, n0) mapping are controlled by {M01, N01}
template <typename AGridDesc_M_K,
typename BGridDesc_N_K,
typename DsGridDesc_M_N,
typename EGridDesc_M_N,
typename Block2ETileMap>
static constexpr bool CheckValidity(const AGridDesc_M_K& a_grid_desc_m_k,
const BGridDesc_N_K& b_grid_desc_n_k,
const DsGridDesc_M_N& ds_grid_desc_m_n,
const EGridDesc_M_N& e_grid_desc_m_n,
const Block2ETileMap& block_2_etile_map)
{
const auto M = a_grid_desc_m_k.GetLength(I0);
const auto N = b_grid_desc_n_k.GetLength(I0);
const auto K = a_grid_desc_m_k.GetLength(I1);
// check consistency of desc
MIGRAPHX_CHECK(M == e_grid_desc_m_n.GetLength(I0) && N == e_grid_desc_m_n.GetLength(I1));
// check tile size
MIGRAPHX_CHECK(M % MPerBlock == 0 && N % NPerBlock == 0 && K % KPerBlock == 0);
// check block-to-E-tile
MIGRAPHX_CHECK(block_2_etile_map.CheckValidity(e_grid_desc_m_n));
return GridwiseGemm::CheckValidity(
a_grid_desc_m_k, b_grid_desc_n_k, ds_grid_desc_m_n, e_grid_desc_m_n, block_2_etile_map);
}
AElementwiseOperation a_element_op{};
BElementwiseOperation b_element_op{};
CDEElementwiseOperation cde_element_op{};
};
} // namespace migraphx
#endif
@@ -161,8 +161,7 @@ MIGRAPHX_DEVICE_MATH_HALF(fmod, ::fmod)
// Map math functions to hip half2 functions
// The half2 type is defined in include/hip/amd_detail/hip_fp16_gcc.h and is 2 16-bit floats
// packed into a 32-bit number. See include/hip/amd_detail/hip_fp16_math_fwd.h for the HIP names
// Most but not all of these math ops have operators of the same names.
MIGRAPHX_DEVICE_MATH_HALF2(abs, ::__habs2)
MIGRAPHX_DEVICE_MATH_HALF2(ceil, ::h2ceil)
MIGRAPHX_DEVICE_MATH_HALF2(cos, ::h2cos)
@@ -189,7 +188,8 @@ MIGRAPHX_DEVICE_MATH_BINARY_FOR(float, max, ::max)
MIGRAPHX_DEVICE_MATH_BINARY_FOR(float, min, ::min)
MIGRAPHX_DEVICE_MATH_BINARY_FOR(double, max, ::max)
MIGRAPHX_DEVICE_MATH_BINARY_FOR(double, min, ::min)
// Add overloads for half that call the float version; this should use "hmax" and "hmin" once
// the perf CI docker is upgraded to rocm-5.5
MIGRAPHX_DEVICE_MATH_BINARY_FOR(migraphx::half, max, ::fmaxf)
MIGRAPHX_DEVICE_MATH_BINARY_FOR(migraphx::half, min, ::fminf)
...
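The comment above describes `half2` as two 16-bit floats packed into one 32-bit word. That layout can be sketched on the host with Python's `struct` module (`'e'` is IEEE binary16); this illustrates only the packing, not the HIP intrinsics themselves:

```python
import struct

def pack_half2(lo: float, hi: float) -> int:
    """Pack two fp16 values into one 32-bit word (lo in the low 16 bits)."""
    (word,) = struct.unpack('<I', struct.pack('<ee', lo, hi))
    return word

def unpack_half2(word: int) -> tuple:
    """Split a 32-bit word back into its two fp16 halves."""
    return struct.unpack('<ee', struct.pack('<I', word))

w = pack_half2(1.0, -2.0)
print(hex(w))  # 0xc0003c00: low half 0x3c00 (1.0), high half 0xc000 (-2.0)
```

The `h2*` functions listed above operate on both packed halves at once, which is why mapping a scalar math name to its `half2` counterpart doubles throughput for half-precision element-wise ops.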
@@ -164,10 +164,10 @@ std::string mlir_print(F f, T x)
    return ss.str();
}
bool has_xdlops(const std::string& target_arch)
{
    const auto device_name = trim(split_string(target_arch, ':').front());
    return (starts_with(device_name, "gfx9") and device_name >= "gfx908");
}
struct mlir_program
@@ -560,9 +560,7 @@ struct mlir_program
            pp =
                problem_params{ins->get_operator(), to_shapes(ins->inputs()), ins->get_shape()};
            // check if HW supports xdlops
            if(has_xdlops(target_arch))
                ops.add_attributes({{"xdlopsV2", true}});
        }
...
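The same architecture gate appears in both the MLIR and rocBLAS changes: instead of a hard-coded set, any gfx9 device at or above gfx908 qualifies. Because gfx9 device names share the `gfx9` prefix and length, a plain lexicographic comparison is sufficient. A host-side sketch of the check (the helper name mirrors the C++ one but is purely illustrative):

```python
def has_xdlops(target_arch: str) -> bool:
    # Strip any feature suffix, e.g. "gfx90a:sramecc+:xnack-" -> "gfx90a"
    device_name = target_arch.split(':')[0].strip()
    # gfx9 names share a common prefix and length, so a lexicographic
    # comparison against "gfx908" selects gfx908 and newer gfx9 parts
    # (note that 'a' > '8', so "gfx90a" passes).
    return device_name.startswith("gfx9") and device_name >= "gfx908"
```

The prefix test also rejects gfx10/gfx11 parts, whose names would otherwise compare greater than `gfx908` under a naive string comparison.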
@@ -47,16 +47,10 @@ rocblas_handle_ptr create_rocblas_handle_ptr(hipStream_t s)
    return rb;
}
bool get_compute_fp32_flag()
{
    const auto device_name = trim(split_string(get_device_name(), ':').front());
    return (starts_with(device_name, "gfx9") and device_name >= "gfx908");
}
bool get_int8_x4_format(context& ctx)
...
#include "verify_program.hpp"
#include <migraphx/program.hpp>
#include <migraphx/generate.hpp>
#include <migraphx/op/max.hpp>
#include <migraphx/op/min.hpp>
template <class Op, migraphx::shape::type_t T>
struct test_min_max : verify_program<test_min_max<Op, T>>
{
migraphx::program create_program() const
{
migraphx::program p;
auto* mm = p.get_main_module();
migraphx::shape s{T, {128}};
auto x = mm->add_parameter("x", s);
auto y = mm->add_parameter("y", s);
mm->add_instruction(Op{}, x, y);
return p;
}
};
template struct test_min_max<migraphx::op::max, migraphx::shape::float_type>;
template struct test_min_max<migraphx::op::max, migraphx::shape::half_type>;
template struct test_min_max<migraphx::op::max, migraphx::shape::double_type>;
template struct test_min_max<migraphx::op::min, migraphx::shape::float_type>;
template struct test_min_max<migraphx::op::min, migraphx::shape::half_type>;
template struct test_min_max<migraphx::op::min, migraphx::shape::double_type>;
@@ -50,6 +50,7 @@ class Template(string.Template):
class Type:
    def __init__(self, name: str) -> None:
        self.name = name.strip()
@@ -144,6 +145,7 @@ extern "C" ${error_type} ${name}(${params})
class CFunction:
    def __init__(self, name: str) -> None:
        self.name = name
        self.params: List[str] = []
@@ -188,12 +190,14 @@ class CFunction:
class BadParam:
    def __init__(self, cond: str, msg: str) -> None:
        self.cond = cond
        self.msg = msg


class Parameter:
    def __init__(self,
                 name: str,
                 type: str,
@@ -250,7 +254,8 @@ class Parameter:
                 size=self.size_name,
                 result=result or '')

    def add_param(self,
                  t: Union[str, Type],
                  name: Optional[str] = None) -> None:
        if not isinstance(t, str):
            t = t.str()
@@ -409,6 +414,7 @@ def to_template_vars(params: List[Union[Any, Parameter]]) -> str:
class Function:
    def __init__(self,
                 name: str,
                 params: Optional[List[Parameter]] = None,
@@ -545,6 +551,7 @@ cpp_class_constructor_template = Template('''
class CPPMember:
    def __init__(self,
                 name: str,
                 function: Function,
@@ -621,6 +628,7 @@ class CPPMember:
class CPPClass:
    def __init__(self, name: str, ctype: str) -> None:
        self.name = name
        self.ctype = ctype
@@ -677,6 +685,7 @@ def add_function(name: str, *args, **kwargs) -> Function:
def once(f: Callable) -> Any:
    @wraps(f)
    def decorated(*args, **kwargs):
        if not decorated.has_run:
@@ -722,6 +731,7 @@ c_type_map: Dict[str, Type] = {}
def cwrap(name: str, c_type: Optional[str] = None) -> Callable:
    def with_cwrap(f):
        type_map[name] = f
        if c_type:
@@ -1015,6 +1025,7 @@ def string_c_wrap(p: Parameter) -> None:
class Handle:
    def __init__(self, name: str, ctype: str, cpptype: str, **kwargs) -> None:
        self.name = name
        self.ctype = ctype
@@ -1140,6 +1151,7 @@ def generate_virtual_impl(f: Function, fname: str) -> str:
class Interface(Handle):
    def __init__(self, name: str, ctype: str, cpptype: str) -> None:
        super().__init__(name, ctype, cpptype, skip_def=True)
        self.ifunctions: List[Function] = []
@@ -1234,6 +1246,7 @@ def handle(ctype: str,
           cpptype: str,
           name: Optional[str] = None,
           ref: Optional[bool] = None) -> Callable:
    def with_handle(f):
        n = name or f.__name__
        h = Handle(n, ctype, cpptype, ref=ref)
@@ -1249,8 +1262,10 @@ def handle(ctype: str,
    return with_handle


def interface(ctype: str,
              cpptype: str,
              name: Optional[str] = None) -> Callable:
    def with_interface(f):
        n = name or f.__name__
        h = Interface(n, ctype, cpptype)
...
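Among the reformatted api.py pieces is the `once` decorator, which stores a flag as an attribute on the wrapper so the wrapped function body runs exactly once. A self-contained sketch of that pattern, simplified from the generator script above:

```python
from functools import wraps

def once(f):
    """Run f only on the first call; subsequent calls are no-ops."""
    @wraps(f)
    def decorated(*args, **kwargs):
        if not decorated.has_run:
            decorated.has_run = True
            return f(*args, **kwargs)
    decorated.has_run = False
    return decorated

calls = []

@once
def setup():
    calls.append("ran")

setup()
setup()
print(calls)  # ['ran']
```

Keeping the flag on the wrapper (rather than in a global) means each decorated function gets its own independent run-once state.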