* Added support for ONNX version 1.14.1
* Added new operators: `QLinearAdd`, `QLinearGlobalAveragePool`, `QLinearConv`, `Shrink`, `CastLike`,
and `RandomUniform`
* Added an error message for when `gpu_targets` is not set during MIGraphX compilation
* Added parameter to set tolerances with `migraphx-driver` verify
* Added support for MXR files > 4 GB
* Added `MIGRAPHX_TRACE_MLIR` flag
* BETA added capability for using ROCm Composable Kernels via the `MIGRAPHX_ENABLE_CK=1`
environment variable
### Optimizations
* Improved performance support for INT8
* Improved time precision while benchmarking candidate kernels from CK or MLIR
* Removed contiguous from reshape parsing
* Updated the `ConstantOfShape` operator to support Dynamic Batch
* Simplified dynamic shapes-related operators to their static versions, where possible
* Improved debugging tools for accuracy issues
* Included a print warning about `miopen_fusion` while generating `mxr`
* General reduction in system memory usage during model compilation
* Created additional fusion opportunities during model compilation
* Improved debugging for matchers
* Improved general debug messages
### Fixes
* Fixed scatter operator for nonstandard shapes with some models from ONNX Model Zoo
* Provided a compile option to improve the accuracy of some models by disabling Fast-Math
* Improved layernorm + pointwise fusion matching to ignore argument order
* Fixed accuracy issue with `ROIAlign` operator
* Fixed computation logic for the `Trilu` operator
* Fixed support for the DETR model
### Changes
* Changed MIGraphX version to 2.8
* Extracted the test packages into a separate deb file when building MIGraphX from source
### Removals
* Removed building Python 2.7 bindings
## MIGraphX 2.7 for ROCm 5.7.0
### Additions
* hipRTC no longer requires dev packages for MIGraphX runtime and allows the ROCm install to be in a
different directory than it was at build time
* Added support for multi-target execution
* Added Dynamic Batch support with C++/Python APIs
* Added `migraphx.create_argument` to Python API
* Added dockerfile example for Ubuntu 22.04
* Added TensorFlow supported ops in the driver, similar to the existing ONNX operator list
* Added a `MIGRAPHX_TRACE_MATCHES_FOR` environment variable to filter the matcher trace
* Improved debugging by printing max, min, mean, and stddev values for `TRACE_EVAL=2`
* You can now use the `fast_math` flag instead of an environment flag for GELU
* The driver now prints a message if offload copy is set for a compiled program
### Optimizations
* Optimized for ONNX Runtime 1.14.0
* Improved compile times by only building for the GPU on the system
* Improved performance of pointwise/reduction kernels when using NHWC layouts
* Loaded specific version of the `migraphx_py` library
* Annotated functions with the block size so the compiler can do a better job of optimizing
* Enabled reshape on nonstandard shapes
* Used half HIP APIs to compute max and min
* Added support for broadcasted scalars to unsqueeze operator
* Improved multiplies with dot operator
* Handled broadcasts across dot and concat
* Added verify namespace for better symbol resolution
### Fixes
* Resolved accuracy issues with FP16 resnet50
* Updated cpp generator to handle inf from float
* Fixed assertion error during verify and made DCE work with tuples
* Fixed convert operation for NaNs
* Fixed shape typo in API test
* Fixed compile warnings for shadowing variable names
* Added missing specialization for the `nullptr` hash function
### Changes
* Bumped version of half library to 5.6.0
* Bumped CI to support ROCm 5.6
* Made building tests optional
* Replaced `np.bool` with `bool` per NumPy request
### Removals
* Removed int8x4 rocBlas calls due to deprecation
* Removed `std::reduce` usage because not all operating systems support it
## MIGraphX 2.5 for ROCm 5.5.0
### Additions
* Added the Y-Model feature to store tuning information with the optimized model
* Added Python 3.10 bindings
* Added an accuracy checker tool based on ONNX Runtime
* Added the ONNX operators `parse_split` and `Trilu`
* Added build support for ROCm MLIR
* Added the `migraphx-driver` flag to print optimizations in Python (`--python`)
* Added JIT implementation of the Gather and Pad operators, which results in better handling for
larger tensor sizes
### Optimizations
### Optimizations
* Improved performance of Transformer-based models
* Improved performance of the `Pad`, `Concat`, `Gather`, and `Pointwise` operators
* Improved ONNX/pb file loading speed
* Added a general optimize pass that runs several passes, such as `simplify_reshapes`, algebra, and DCE
in a loop
### Fixes
* Improved parsing for TensorFlow Protobuf files
* Resolved various accuracy issues with some ONNX models
* Resolved a gcc-12 issue with MIVisionX
* Improved support for larger sized models and batches
* Used `--offload-arch` instead of `--cuda-gpu-arch` for the HIP compiler
* Changed JIT to use a float accumulator for large reduce ops of half type to avoid overflow
* Changed JIT to temporarily use cosine to compute the sine function
### Changes
* Changed version and location of third-party build dependencies in order to pick up fixes
MIGraphX provides an optimized execution engine for deep learning neural networks.
...
Adding Two Literals
--------------------
A program is a collection of modules, which are collections of instructions to be executed when calling :cpp:any:`eval <migraphx::internal::program::eval>`.
Each instruction has an associated :cpp:any:`operation <migraphx::internal::operation>` which represents the computation to be performed by the instruction.
We start with a snippet of the simple ``add_two_literals()`` function::
...
We start by creating a simple :cpp:any:`migraphx::program <migraphx::internal::program>` object and then getting a pointer to the main module of it.
The program is a collection of ``modules`` that start executing from the main module, so instructions are added to the modules rather than directly onto the program object.
We then use the :cpp:any:`add_literal <migraphx::internal::program::add_literal>` function to add an instruction that stores the literal number ``1`` while returning an :cpp:any:`instruction_ref <migraphx::internal::instruction_ref>`.
The returned :cpp:any:`instruction_ref <migraphx::internal::instruction_ref>` can be used in another instruction as an input.
We use the same :cpp:any:`add_literal <migraphx::internal::program::add_literal>` function to add a ``2`` to the program.
After creating the literals, we then create the instruction to add the numbers together.
This is done by using the :cpp:any:`add_instruction <migraphx::internal::program::add_instruction>` function with the ``"add"`` :cpp:any:`operation <migraphx::internal::program::operation>` created by :cpp:any:`make_op <migraphx::internal::program::make_op>` along with the previous `add_literal` :cpp:any:`instruction_ref <migraphx::internal::instruction_ref>` for the input arguments of the instruction.
Finally, we can run this :cpp:any:`program <migraphx::internal::program>` by compiling it for the reference target (CPU) and then running it with :cpp:any:`eval <migraphx::internal::program::eval>`.
The result is then retrieved and printed to the console.
We can compile the program for the GPU as well, but the file will have to be moved to the ``test/gpu/`` directory and the correct target must be included::
...
p.compile(migraphx::ref::target{});
This adds a parameter of type ``int32``, and compiles it for the CPU.
To run the program, we need to pass the parameter as a ``parameter_map`` when we call :cpp:any:`eval <migraphx::internal::program::eval>`.
We create the ``parameter_map`` by setting the ``x`` key to an :cpp:any:`argument <migraphx::internal::argument>` object with an ``int`` data type::
// create a parameter_map object for passing a value to the "x" parameter
std::vector<int> data = {4};
...
Handling Tensor Data
---------------------
In the previous examples we have only been dealing with scalars, but the :cpp:any:`shape <migraphx::internal::shape>` class can describe multi-dimensional tensors.
For example, we can compute a simple convolution::
migraphx::program p;
...
Here we create two parameters for both the ``input`` and ``weights``.
In the previous examples, we created simple literals; however, most programs take data from allocated buffers (usually on the GPU).
In this case, we can create :cpp:any:`argument <migraphx::internal::argument>` objects directly from the pointers to the buffers::
// Compile the program
p.compile(migraphx::ref::target{});
...
Set to "1", "enable", "enabled", "yes", or "true" to use.
Disables the conversion from fp16 to fp32 for the InstanceNormalization ONNX operator, which MIGraphX performs as a workaround for accuracy issues with reduce_mean/variance.
See ``parse_instancenorm.cpp`` for more details.
Matchers
------------
.. envvar:: MIGRAPHX_TRACE_MATCHES
Set to "1" to print the matcher that matches an instruction and the matched instruction.
Set to "2" and use the ``MIGRAPHX_TRACE_MATCHES_FOR`` flag to filter out results.
.. envvar:: MIGRAPHX_TRACE_MATCHES_FOR
Set to the name of any matcher and only traces for that matcher will be printed out.
.. envvar:: MIGRAPHX_VALIDATE_MATCHES
Set to "1", "enable", "enabled", "yes", or "true" to use.
Validate the module after finding the matches (runs ``module.validate()``).
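As an illustration, the matcher variables above can be combined to trace a single matcher. The matcher name and the commented-out driver invocation below are placeholders, not values taken from this document:

```shell
# Trace matches for one matcher only, and validate the module after matching.
export MIGRAPHX_TRACE_MATCHES=2                  # "2" enables filtering via the _FOR flag
export MIGRAPHX_TRACE_MATCHES_FOR=find_softmax   # hypothetical matcher name
export MIGRAPHX_VALIDATE_MATCHES=1
# migraphx-driver compile model.onnx             # placeholder model path; uncomment to run
echo "tracing matcher: $MIGRAPHX_TRACE_MATCHES_FOR"
```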
Program Execution
---------------------
.. envvar:: MIGRAPHX_TRACE_EVAL
Set to "1", "2", or "3" to use.
"1" prints the instruction run and the time taken.
"2" prints everything in "1" and a snippet of the output argument and some statistics (ex. min, max, mean) of the output.
"3" prints everything in "1" and the full output buffers.
Program Verification
------------------------
.. envvar:: MIGRAPHX_VERIFY_ENABLE_ALLCLOSE
Set to "1", "enable", "enabled", "yes", or "true" to use.
Uses ``allclose`` with the given ``atol`` and ``rtol`` for verifying ranges with ``driver verify`` or the tests that use ``migraphx/verify.hpp``.
Pass debugging and controls
-----------------------------------
.. envvar:: MIGRAPHX_TRACE_ELIMINATE_CONTIGUOUS
Set to "1", "enable", "enabled", "yes", or "true" to use.
Debug print the instructions that have input ``contiguous`` instructions removed.
.. envvar:: MIGRAPHX_DISABLE_POINTWISE_FUSION
Set to "1", "enable", "enabled", "yes", or "true" to use.
Disables the ``fuse_pointwise`` compile pass.
.. envvar:: MIGRAPHX_DEBUG_MEMORY_COLORING
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print debug statements for the ``memory_coloring`` pass.
.. envvar:: MIGRAPHX_TRACE_SCHEDULE
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print debug statements for the ``schedule`` pass.
.. envvar:: MIGRAPHX_TRACE_PROPAGATE_CONSTANT
Set to "1", "enable", "enabled", "yes", or "true" to use.
Traces instructions replaced with a constant.
.. envvar:: MIGRAPHX_8BITS_QUANTIZATION_PARAMS
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print the quantization parameters in only the main module.
Set to "1", "enable", "enabled", "yes", or "true" to use.
Disable the DNNL post ops workaround.
.. envvar:: MIGRAPHX_DISABLE_MIOPEN_FUSION
Set to "1", "enable", "enabled", "yes", or "true" to use.
Disable MIOpen fusions.
.. envvar:: MIGRAPHX_DISABLE_SCHEDULE_PASS
Set to "1", "enable", "enabled", "yes", or "true" to use.
Disable the ``schedule`` pass.
.. envvar:: MIGRAPHX_DISABLE_REDUCE_FUSION
Set to "1", "enable", "enabled", "yes", or "true" to use.
Disable the ``fuse_reduce`` pass.
.. envvar:: MIGRAPHX_ENABLE_NHWC
Set to "1", "enable", "enabled", "yes", or "true" to use.
Enable the ``layout_nhwc`` pass.
.. envvar:: MIGRAPHX_ENABLE_CK
Set to "1", "enable", "enabled", "yes", or "true" to use.
Enable using the Composable Kernels library.
Should be used in conjunction with ``MIGRAPHX_DISABLE_MLIR=1``.
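A minimal sketch of enabling Composable Kernels together with ``MIGRAPHX_DISABLE_MLIR=1``, as the note above suggests. The driver call is a commented-out placeholder:

```shell
# Prefer Composable Kernels over rocMLIR for applicable kernels.
export MIGRAPHX_ENABLE_CK=1
export MIGRAPHX_DISABLE_MLIR=1
# migraphx-driver perf model.onnx   # placeholder model path; uncomment to run
echo "CK enabled=$MIGRAPHX_ENABLE_CK, MLIR disabled=$MIGRAPHX_DISABLE_MLIR"
```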
.. envvar:: MIGRAPHX_DISABLE_MLIR
Set to "1", "enable", "enabled", "yes", or "true" to use.
Disable using the rocMLIR library.
.. envvar:: MIGRAPHX_ENABLE_EXTRA_MLIR
Set to "1", "enable", "enabled", "yes", or "true" to use.
Enables additional opportunities to use MLIR that may improve performance.
.. envvar:: MIGRAPHX_COPY_LITERALS
Set to "1", "enable", "enabled", "yes", or "true" to use.
Use ``hip_copy_to_gpu`` with a new ``literal`` instruction rather than use ``hip_copy_literal{}``.
Compilation traces
----------------------
.. envvar:: MIGRAPHX_TRACE_FINALIZE
Set to "1", "enable", "enabled", "yes", or "true" to use.
Debug print instructions during the ``module.finalize()`` step.
.. envvar:: MIGRAPHX_TRACE_COMPILE
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print trace information for the graph compilation process.
.. envvar:: MIGRAPHX_TRACE_PASSES
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print the compile pass and the program after the pass.
.. envvar:: MIGRAPHX_TIME_PASSES
Set to "1", "enable", "enabled", "yes", or "true" to use.
Time the compile passes.
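For example, the two pass-related variables above might be combined when profiling compilation; the driver invocation and log path are illustrative placeholders:

```shell
# Print the program after each compile pass and time every pass.
export MIGRAPHX_TRACE_PASSES=1
export MIGRAPHX_TIME_PASSES=1
# migraphx-driver compile model.onnx 2> passes.log   # placeholder invocation
echo "trace passes=$MIGRAPHX_TRACE_PASSES, time passes=$MIGRAPHX_TIME_PASSES"
```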
GPU kernel JIT compilation debugging (applies to both hipRTC and hip-clang)
---------------------------------------------------------------------------
.. envvar:: MIGRAPHX_TRACE_CMD_EXECUTE
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print commands executed by the MIGraphX ``process``.
.. envvar:: MIGRAPHX_TRACE_HIPRTC
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print HIPRTC options and C++ file executed.
.. envvar:: MIGRAPHX_DEBUG_SAVE_TEMP_DIR
Set to "1", "enable", "enabled", "yes", or "true" to use.
Prevents the created temporary directories from being deleted.
.. envvar:: MIGRAPHX_GPU_DEBUG
Set to "1", "enable", "enabled", "yes", or "true" to use.
Internally, this adds the option ``-DMIGRAPHX_DEBUG`` when compiling GPU kernels. It enables assertions and capture of source locations for the errors.
.. envvar:: MIGRAPHX_GPU_DEBUG_SYM
Set to "1", "enable", "enabled", "yes", or "true" to use.
Adds the option ``-g`` when compiling HIPRTC.
.. envvar:: MIGRAPHX_GPU_DUMP_SRC
Set to "1", "enable", "enabled", "yes", or "true" to use.
Dump the HIPRTC source files compiled.
.. envvar:: MIGRAPHX_GPU_DUMP_ASM
Set to "1", "enable", "enabled", "yes", or "true" to use.
Dump the hip-clang assembly.
.. envvar:: MIGRAPHX_GPU_OPTIMIZE
Set the optimization mode for GPU compile (``-O`` option).
Defaults to ``-O3``.
.. envvar:: MIGRAPHX_GPU_COMPILE_PARALLEL
Set to the number of threads to use.
Compile GPU code in parallel with the given number of threads.
.. envvar:: MIGRAPHX_TRACE_NARY
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print the ``nary`` device functions used.
.. envvar:: MIGRAPHX_ENABLE_HIPRTC_WORKAROUNDS
Set to "1", "enable", "enabled", "yes", or "true" to use.
Enable HIPRTC workarounds for bugs in HIPRTC.
.. envvar:: MIGRAPHX_USE_FAST_SOFTMAX
Set to "1", "enable", "enabled", "yes", or "true" to use.
Use the fast softmax optimization.
.. envvar:: MIGRAPHX_ENABLE_NULL_STREAM
Set to "1", "enable", "enabled", "yes", or "true" to use.
Allows using a null stream for MIOpen and hipStream.
.. envvar:: MIGRAPHX_NSTREAMS
Set to the number of streams to use.
Defaults to 1.
.. envvar:: MIGRAPHX_TRACE_BENCHMARKING
Set to "1" to print the benchmarking trace.
Set to "2" to print the benchmarking trace with more detail.
MLIR vars
-------------
.. envvar:: MIGRAPHX_TRACE_MLIR
Set to "1" to trace MLIR and print any failures.
Set to "2" to additionally print all MLIR operations.
.. envvar:: MIGRAPHX_MLIR_USE_SPECIFIC_OPS
Set to the names of the operations that should always use MLIR, regardless of GPU architecture.
Accepts a list of operators separated by commas (ex: "fused", "convolution", "dot").
.. envvar:: MIGRAPHX_MLIR_TUNING_DB
Set to the path of the MLIR tuning database to load.
.. envvar:: MIGRAPHX_MLIR_TUNING_CFG
Set to the path of the tuning configuration.
Appends to a tuning configuration file that can be used with rocMLIR tuning scripts.
.. envvar:: MIGRAPHX_MLIR_TUNE_EXHAUSTIVE
Set to "1", "enable", "enabled", "yes", or "true" to use.
Do exhaustive tuning for MLIR.
.. envvar:: MIGRAPHX_MLIR_TUNE_LIMIT
Set to an integer greater than 1.
Limits the number of solutions that MLIR will use for tuning.
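A sketch combining the MLIR variables above to force MLIR for selected operations and tune them; the operation list mirrors the example given for ``MIGRAPHX_MLIR_USE_SPECIFIC_OPS``, and the database path is a placeholder:

```shell
# Force MLIR for selected operations and enable exhaustive tuning.
export MIGRAPHX_MLIR_USE_SPECIFIC_OPS="fused,convolution,dot"
export MIGRAPHX_MLIR_TUNE_EXHAUSTIVE=1
export MIGRAPHX_MLIR_TUNING_DB=/tmp/mlir_tuning.db   # placeholder database path
echo "MLIR ops: $MIGRAPHX_MLIR_USE_SPECIFIC_OPS"
```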
CK vars
-----------
.. envvar:: MIGRAPHX_LOG_CK_GEMM
Set to "1", "enable", "enabled", "yes", or "true" to use.
Print Composable Kernels GEMM traces.
.. envvar:: MIGRAPHX_CK_DEBUG
Set to "1", "enable", "enabled", "yes", or "true" to use.
Always adds ``-DMIGRAPHX_CK_CHECK=1`` when compiling Composable Kernels operators.
.. envvar:: MIGRAPHX_TUNE_CK
Set to "1", "enable", "enabled", "yes", or "true" to use.
Use tuning for Composable Kernels.
Testing
------------
.. envvar:: MIGRAPHX_TRACE_TEST_COMPILE
Set to the target that you want to trace the compilation of (ex. "gpu", "cpu").
Prints the compile trace for the given target for the verify tests.
This flag shouldn't be used in conjunction with ``MIGRAPHX_TRACE_COMPILE``.
For the verify tests only use ``MIGRAPHX_TRACE_TEST_COMPILE``.
.. envvar:: MIGRAPHX_TRACE_TEST
Set to "1", "enable", "enabled", "yes", or "true" to use.
Prints the reference and target programs even if the verify passed successfully.
.. envvar:: MIGRAPHX_DUMP_TEST
Set to "1", "enable", "enabled", "yes", or "true" to use.
The MIGraphX driver is a tool that allows you to utilize many of the core functions of MIGraphX without having to write your own program. It can read, compile, run, and test the performance of a model with randomized data.
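A sketch of a typical driver workflow using the subcommands documented in this section. The model path and flags are illustrative placeholders, so the driver calls are left commented out:

```shell
# migraphx-driver read model.onnx          # parse and print the graph
# migraphx-driver compile model.onnx       # compile and print the graph
# migraphx-driver perf model.onnx          # benchmark with randomized data
workflow="read compile perf"
echo "driver workflow sketch: $workflow"
```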
read
----
...
Compiles and prints input graph.
.. include:: ./driver/read.rst
.. include:: ./driver/compile.rst
run
...
Loads and prints input graph.
.. include:: ./driver/read.rst
.. include:: ./driver/compile.rst
perf
...
Compiles and runs input graph then prints performance report.
.. include:: ./driver/read.rst
.. include:: ./driver/compile.rst
.. option:: --iterations, -n [unsigned int]
...
Runs reference and CPU or GPU implementations and checks outputs for consistency.
.. include:: ./driver/read.rst
.. include:: ./driver/compile.rst
.. option:: --rms-tol [double]
...
Reduce program and verify
roctx
-----
.. program:: migraphx-driver roctx
...
After ``rocprof`` is run, the output directory will contain trace information for HIP, HCC, and ROCTX in separate ``.txt`` files.
To understand the interactions between API calls, use the ``roctx.py`` helper script as described in the :ref:`dev/tools:rocTX` section.