Merge branch 'develop' of https://github.com/ROCmSoftwarePlatform/AMDMIGraphX into mlir-attention

f69d828d · Manupa Karunaratne · fe36d210 · 24148857 · f69d828d · f69d828d
Commit f69d828d authored Nov 22, 2023 by Manupa Karunaratne
20 changed files
--- a/.gitignore
+++ b/.gitignore
@@ -81,5 +81,7 @@ cmake-build*/
 build*/

 # Recommended location to install rbuild dependencies from README.md
-depend
+depend*/

+# local Python virtual environment
+.venv/
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
-# Change Log for MIGraphX
+# Changelog for MIGraphX

-Full documentation for MIGraphX is available at [MIGraphX Documentation](https://rocmdocs.amd.com/projects/AMDMIGraphX/en/latest/).
+Full documentation for MIGraphX is available at
+[https://rocmdocs.amd.com/projects/AMDMIGraphX/en/latest/](https://rocmdocs.amd.com/projects/AMDMIGraphX/en/latest/).

 ## MIGraphX 2.8 for ROCm 6.0.0
-### Added
- Support for MI300 GPUs
- Support for TorchMIGraphX via PyTorch
- Boosted overall performance by integrating rocMLIR 
- INT8 support for ONNX Runtime
- Support for ONNX version 1.14.1
- Added operators Qlinearadd, QlinearGlobalAveragePool, Qlinearconv, Shrink, CastLike, and RandomUniform operators
- Added an error message when gpu_targets is not set when compiling migraphx
- Added parameter to set tolerances with migraphx-driver verify 
- Added support for MXR files >4 GB 
- Added MIGRAPHX_TRACE_MLIR flag
- BETA added capability to use ROCm Composable Kernels via environment variable MIGRAPHX_ENABLE_CK=1
+
+### Additions
+
+* Support for MI300 GPUs
+* Support for TorchMIGraphX via PyTorch
+* Boosted overall performance by integrating rocMLIR
+* INT8 support for ONNX Runtime
+* Support for ONNX version 1.14.1
+* Added new operators: `Qlinearadd`, `QlinearGlobalAveragePool`, `Qlinearconv`, `Shrink`, `CastLike`,
+  and `RandomUniform`
+* Added an error message for when `gpu_targets` is not set during MIGraphX compilation
+* Added parameter to set tolerances with `migraphx-driver` verify
+* Added support for MXR files > 4 GB
+* Added `MIGRAPHX_TRACE_MLIR` flag
+* BETA added capability for using ROCm Composable Kernels via the `MIGRAPHX_ENABLE_CK=1`
+  environment variable

 ### Optimizations
- Improved performance support for INT8
- Improved time percision while benchmarking candidate kernels from CK or MLIR 
- Remove contiguous from reshape parsing
- Updated ConstantOfShape operator to support Dynamic Batch
- Simplifies dynamic shapes related operators to their static versions if possible
- Improved debugging tools for accuracy issues
- Print warning about miopen_fusion while generating mxr 
- General reduction in system memory usage during model compilation
- Created additional fusion opportunities during model compilation
- Improved debugging for matchers
- Improved general debug messages 
-
-### Fixed
- Fixed scatter operator for nonstandard shapes with some models from ONNX Model Zoo
- Provided a compile option to improve accuracy of some models by disabling Fast-Math
- Improved layernorm + pointwise fusion matching to ignore arguments order
- Fixed accuracy issue with ROIAlign operator 
- Fixed Trilu operator computation logic
- Fixed support for the DETR model 
-
-### Changed
- Changed migraphx version to 2.8
- Extracted test packages as its own separate deb file when building migraphx from source
-
-### Removed
- Removed building Python 2.7 bindings

+* Improved performance support for INT8
+* Improved time precision while benchmarking candidate kernels from CK or MLIR
+* Removed contiguous from reshape parsing
+* Updated the `ConstantOfShape` operator to support Dynamic Batch
+* Simplified dynamic shapes-related operators to their static versions, where possible
+* Improved debugging tools for accuracy issues
+* Included a print warning about `miopen_fusion` while generating `mxr`
+* General reduction in system memory usage during model compilation
+* Created additional fusion opportunities during model compilation
+* Improved debugging for matchers
+* Improved general debug messages
+
+### Fixes
+
+* Fixed scatter operator for nonstandard shapes with some models from ONNX Model Zoo
+* Provided a compile option to improve the accuracy of some models by disabling Fast-Math
+* Improved layernorm + pointwise fusion matching to ignore argument order
+* Fixed accuracy issue with `ROIAlign` operator
+* Fixed computation logic for the `Trilu` operator
+* Fixed support for the DETR model
+
+### Changes
+
+* Changed MIGraphX version to 2.8
+* Extracted the test packages into a separate deb file when building MIGraphX from source
+
+### Removals
+
+* Removed building Python 2.7 bindings

 ## MIGraphX 2.7 for ROCm 5.7.0
-### Added
- Enabled hipRTC to not require dev packages for migraphx runtime and allow the ROCm install to be in a different directory than it was during build time
- Add support for multi-target execution
- Added Dynamic Batch support with C++/Python APIs
- Add migraphx.create_argument to python API
- Added dockerfile example for Ubuntu 22.04
- Add TensorFlow supported ops in driver similar to exist onnx operator list
- Add a MIGRAPHX_TRACE_MATCHES_FOR env variable to filter the matcher trace
- Improved debugging by printing max,min,mean and stddev values for TRACE_EVAL = 2
- use fast_math flag instead of ENV flag for GELU
- Print message from driver if offload copy is set for compiled program
+
+### Additions
+
+* hipRTC no longer requires dev packages for MIGraphX runtime and allows the ROCm install to be in a
+   different directory than build time
+* Added support for multi-target execution
+* Added Dynamic Batch support with C++/Python APIs
+* Added `migraphx.create_argument` to Python API
+* Added dockerfile example for Ubuntu 22.04
+* Added TensorFlow supported ops in driver similar to the existing ONNX operator list
+* Added a `MIGRAPHX_TRACE_MATCHES_FOR` env variable to filter the matcher trace
+* Improved debugging by printing max, min, mean, and stddev values for TRACE_EVAL = 2
+* You can now use the `fast_math` flag instead of `ENV` for GELU
+* Print message from driver if offload copy is set for compiled program
+
 ### Optimizations
- Optimized for ONNX Runtime 1.14.0
- Improved compile times by only building for the GPU on the system
- Improve performance of pointwise/reduction kernels when using NHWC layouts
- Load specific version of the migraphx_py library
- Annotate functions with the block size so the compiler can do a better job of optimizing 
- Enable reshape on nonstandard shapes
- Use half HIP APIs to compute max and min
- Added support for broadcasted scalars to unsqueeze operator
- Improved multiplies with dot operator
- Handle broadcasts across dot and concat
- Add verify namespace for better symbol resolution
-### Fixed
- Resolved accuracy issues with FP16 resnet50
- Update cpp generator to handle inf from  float
- Fix assertion error during verify and make DCE work with tuples
- Fix convert operation for NaNs
- Fix shape typo in API test
- Fix compile warnings for shadowing variable names
- Add missing specialization for the `nullptr` for the hash function
-### Changed
- Bumped version of half library to 5.6.0
- Bumped CI to support rocm 5.6
- Make building tests optional
- replace np.bool with bool as per numpy request
-### Removed
- Removed int8x4 rocBlas calls due to deprecation
- removed std::reduce usage since not all OS' support it

+* Optimized for ONNX Runtime 1.14.0
+* Improved compile times by only building for the GPU on the system
+* Improved performance of pointwise/reduction kernels when using NHWC layouts
+* Loaded specific version of the `migraphx_py` library
+* Annotated functions with the block size so the compiler can do a better job of optimizing
+* Enabled reshape on nonstandard shapes
+* Used half HIP APIs to compute max and min
+* Added support for broadcasted scalars to unsqueeze operator
+* Improved multiplies with dot operator
+* Handled broadcasts across dot and concat
+* Added verify namespace for better symbol resolution
+
+### Fixes
+
+* Resolved accuracy issues with FP16 resnet50
+* Updated cpp generator to handle inf from float
+* Fixed assertion error during verify and made DCE work with tuples
+* Fixed convert operation for NaNs
+* Fixed shape typo in API test
+* Fixed compile warnings for shadowing variable names
+* Added missing specialization for the `nullptr` hash function
+
+### Changes
+
+* Bumped version of half library to 5.6.0
+* Bumped CI to support ROCm 5.6
+* Made building tests optional
+* Replaced `np.bool` with `bool` per NumPy request
+
+### Removals
+
+* Removed int8x4 rocBlas calls due to deprecation
+* Removed `std::reduce` usage because not all operating systems support it

 ## MIGraphX 2.5 for ROCm 5.5.0
-### Added
- Y-Model feature to store tuning information with the optimized model
- Added Python 3.10 bindings 
- Accuracy checker tool based on ONNX Runtime
- ONNX Operators parse_split, and Trilu 
- Build support for ROCm MLIR
- Added migraphx-driver flag to print optimizations in python (--python)
- Added JIT implementation of the Gather and Pad operator which results in better handling of larger tensor sizes.
+
+### Additions
+
+* Y-Model feature will store tuning information with the optimized model
+* Added Python 3.10 bindings
+* Accuracy checker tool based on ONNX Runtime
+* ONNX operators `parse_split` and `Trilu`
+* Build support for ROCm MLIR
+* Added the `migraphx-driver` flag to print optimizations in Python (--python)
+* Added JIT implementation of the Gather and Pad operators, which results in better handling for
+  larger tensor sizes
+
 ### Optimizations
- Improved performance of Transformer based models
- Improved performance of the Pad, Concat, Gather, and Pointwise operators
- Improved onnx/pb file loading speed
- Added general optimize pass which runs several passes such as simplify_reshapes/algebra and DCE in loop.
-### Fixed
- Improved parsing Tensorflow Protobuf files 
- Resolved various accuracy issues with some onnx models
- Resolved a gcc-12 issue with mivisionx
- Improved support for larger sized models and batches
- Use --offload-arch instead of --cuda-gpu-arch for the HIP compiler
- Changes inside JIT to use float accumulator for large reduce ops of half type to avoid overflow.
- Changes inside JIT to temporarily use cosine to compute sine function.
-### Changed
- Changed version/location of 3rd party build dependencies to pick up fixes
+
+* Improved performance of Transformer-based models
+* Improved performance of the `Pad`, `Concat`, `Gather`, and `Pointwise` operators
+* Improved ONNX/pb file loading speed
+* Added a general optimize pass that runs several passes, such as `simplify_reshapes`, algebra, and DCE
+  in a loop
+
+### Fixes
+
+* Improved parsing for TensorFlow Protobuf files
+* Resolved various accuracy issues with some ONNX models
+* Resolved a gcc-12 issue with MIVisionX
+* Improved support for larger sized models and batches
+* Use `--offload-arch` instead of `--cuda-gpu-arch` for the HIP compiler
+* Changes inside JIT to use float accumulator for large reduce ops of half type to avoid overflow
+* Changes inside JIT to temporarily use cosine to compute sine function
+
+### Changes
+
+* Changed version and location of third-party build dependencies in order to pick up fixes
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -41,9 +41,12 @@ if(NOT MIGRAPHX_GENERATOR_IS_MULTI_CONFIG)
    set_property(CACHE CMAKE_BUILD_TYPE PROPERTY STRINGS ${CMAKE_CONFIGURATION_TYPES})
 endif()

-set(CMAKE_INSTALL_PREFIX "/opt/rocm" CACHE PATH "")
+if(NOT WIN32)
+    set(CMAKE_INSTALL_PREFIX "/opt/rocm" CACHE PATH "")
+    set(CMAKE_BUILD_RPATH "${CMAKE_BINARY_DIR}/lib")
+endif()

-set(CMAKE_BUILD_RPATH "${CMAKE_BINARY_DIR}/lib")
+list(APPEND CMAKE_PREFIX_PATH /opt/rocm /opt/rocm/llvm $ENV{ROCM_PATH} $ENV{HIP_PATH})

 project(migraphx LANGUAGES C CXX)
 include(CTest)
@@ -57,6 +60,9 @@ else()
 option(MIGRAPHX_ENABLE_PYTHON "Enable python bindings" ON)
 endif()

+# By default build shared libraries
+option(BUILD_SHARED_LIBS "Create shared libraries" ON)
+
 if(WIN32) # CK is not yet ported to Windows
 option(MIGRAPHX_USE_COMPOSABLEKERNEL "Enable MIGraphX to use composable kernel JIT library" OFF)
 else()
@@ -67,7 +73,7 @@ find_path(HALF_INCLUDE_DIR half.hpp PATH_SUFFIXES half)
 if (NOT HALF_INCLUDE_DIR)
    message(FATAL_ERROR "Could not find half.hpp - Please check that the install path of half.hpp has been added to CMAKE_PREFIX_PATH")
 else()
-	message(STATUS "half.hpp is at ${HALF_INCLUDE_DIR}")
+    message(STATUS "half.hpp is at ${HALF_INCLUDE_DIR}")
 endif()

 include(CheckTypeSize)
@@ -102,13 +108,21 @@ set(MIGRAPHX_ENABLE_CPU Off CACHE BOOL "")
 # Disable fpga backend by default
 set(MIGRAPHX_ENABLE_FPGA Off CACHE BOOL "")

+if(WIN32)
+    add_compile_definitions("$<$<COMPILE_LANGUAGE:C,CXX>:_CRT_SECURE_NO_WARNINGS;_USE_MATH_DEFINES>")
+endif()
+
 set(CMAKE_CXX_STANDARD_DEFAULT "")
-add_compile_options($<$<COMPILE_LANGUAGE:CXX>:-std=c++17>)
+if(MSVC)
+    add_compile_options($<$<COMPILE_LANGUAGE:CXX>:/std:c++17>)
+else()
+    add_compile_options($<$<COMPILE_LANGUAGE:CXX>:-std=c++17>)
+endif()

 list(APPEND CMAKE_MODULE_PATH ${CMAKE_CURRENT_SOURCE_DIR}/cmake)
 include(EnableCompilerWarnings)
 include(ROCMClangTidy)
-if(CMAKE_CXX_COMPILER MATCHES ".*clang\\+\\+")
+if(CMAKE_CXX_COMPILER MATCHES ".*clang\\+\\+.*")
    set(MIGRAPHX_TIDY_ERRORS ERRORS * -readability-inconsistent-declaration-parameter-name)
 # Enable tidy on hip
 elseif(MIGRAPHX_ENABLE_GPU)
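
The Windows-only `add_compile_definitions` call in this file's diff changes how ordinary C and C++ sources compile under MSVC. As a minimal sketch (the functions `circle_area` and `file_exists` are hypothetical and not taken from MIGraphX): MSVC's `<cmath>` exposes constants such as `M_PI` only when `_USE_MATH_DEFINES` is defined before the header is included, and `_CRT_SECURE_NO_WARNINGS` silences the deprecation warnings MSVC emits for functions such as `std::fopen`. Defining the macros at the top of one file, as below, should have roughly the same effect for that file as the build-wide definitions have for every target.

```cpp
// Hypothetical standalone example; not taken from the MIGraphX sources.
// Both macros must be defined before the first standard header is included.
#define _CRT_SECURE_NO_WARNINGS
#define _USE_MATH_DEFINES

#include <cmath>
#include <cstdio>

// Uses M_PI, which MSVC hides unless _USE_MATH_DEFINES is defined.
double circle_area(double radius) { return M_PI * radius * radius; }

// Uses std::fopen, which MSVC flags as deprecated unless
// _CRT_SECURE_NO_WARNINGS is defined.
bool file_exists(const char* path)
{
    std::FILE* f = std::fopen(path, "r");
    if(f == nullptr)
        return false;
    std::fclose(f);
    return true;
}
```

Setting the definitions once in CMake avoids repeating them in every source file and keeps the non-Windows build unchanged.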

--- a/Dockerfile
+++ b/Dockerfile
@@ -80,6 +80,10 @@ ADD rbuild.ini /rbuild.ini
 # Temporarily install a new cmake until switching to ubuntu 22.04
 RUN pip3 install cmake==3.22.1

+# Location where onnx unit tests models are cached
+ENV ONNX_HOME=/.onnx
+RUN mkdir -p $ONNX_HOME/models && chmod 777 $ONNX_HOME/models
+
 COPY ./tools/install_prereqs.sh /
 RUN /install_prereqs.sh /usr/local / && rm /install_prereqs.sh
 RUN test -f /usr/local/hash || exit 1
@@ -91,11 +95,6 @@ RUN pip3 install yapf==0.28.0
 ADD docs/.sphinx/requirements.txt /doc-requirements.txt
 RUN pip3 install -r /doc-requirements.txt

-# Download real models to run onnx unit tests
-ENV ONNX_HOME=/.onnx
-COPY ./tools/download_models.sh /
-RUN /download_models.sh && rm /download_models.sh
-
 # Install latest ccache version
 RUN cget -p $PREFIX install facebook/zstd@v1.4.5 -X subdir -DCMAKE_DIR=build/cmake
 RUN cget -p $PREFIX install ccache@v4.1 -DENABLE_TESTING=OFF

--- a/README.md
+++ b/README.md
 # AMD MIGraphX

-AMD MIGraphX is AMD's graph inference engine that accelerates machine learning model inference. AMD MIGraphX can be used by
-installing binaries directly or building from source code.
+AMD MIGraphX is AMD's graph inference engine, which accelerates machine learning model inference.
+To use MIGraphX, you can install the binaries or build from source code. Refer to the following sections
+for Ubuntu installation instructions (we'll provide instructions for other Linux distributions in the future).

-In the following, instructions of how to build and install MIGraphX are described with Ubuntu as the OS
-(Instructions of installation on other Linux OSes will come later). Note that all the following instructions assume 
-ROCm has been installed successfully. ROCm installation instructions are explained in the [ROCm installation
-guide](https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html).
+```note
+You must [install ROCm](https://rocm.docs.amd.com/en/latest/deploy/linux/quick_start.html) before
+installing MIGraphX.
+```

 ## Installing from binaries
-With ROCm installed correctly, MIGraphX binaries can be installed on Ubuntu with the following command:
-```
+
+Install binaries using:
+
+```bash
 sudo apt update && sudo apt install -y migraphx
 ```
-then the header files and libs are installed under `/opt/rocm-<version>`, where `<version>` is the ROCm version.
+
+Header files and libraries are installed under `/opt/rocm-<version>`, where `<version>` is the ROCm
+version.

 ## Building from source

-There are three ways to build the MIGraphX sources. 
-* [Use the ROCm build tool](#use-the-rocm-build-tool-rbuild)
-    
-    This approach uses [rbuild](https://github.com/RadeonOpenCompute/rbuild) to install the prerequisites and
-build the libs with just one command. 
+You have three options for building from source:

-* [Use cmake](#use-cmake-to-build-migraphx)
-    
-    This approach uses a script to install the prerequisites, then use cmake to build the source.
-      
-* [Use docker](#use-docker)
-    
-    This approach builds a docker image with all prerequisites installed, then build the MIGraphX sources inside a docker container. 
+* [ROCm build tool](#use-the-rocm-build-tool-rbuild): Uses
+  [rbuild](https://github.com/RadeonOpenCompute/rbuild) to install prerequisites, then you can build
+  the libraries with a single command.

-In the following, we will first list the prerequisites required to build MIGraphX source code, then describe 
-each of the three approaches.
+* [CMake](#use-cmake-to-build-migraphx): Uses a script to install prerequisites, then you can use
+  CMake to build the source.

-### List of prerequisites
-The following is a list of prerequisites required to build MIGraphX source. 
+* [Docker](#use-docker): Builds a Docker image with all prerequisites installed, then you can build the
+  MIGraphX sources inside a Docker container.

-* [ROCm cmake modules](https://github.com/RadeonOpenCompute/rocm-cmake) **required**
+### Build prerequisites
+
+The following is a list of prerequisites for building MIGraphX.
+
+* [ROCm CMake modules](https://github.com/RadeonOpenCompute/rocm-cmake) **required**
 * [MIOpen](https://github.com/ROCmSoftwarePlatform/MIOpen) for running on the GPU
 * [rocBLAS](https://github.com/ROCmSoftwarePlatform/rocBLAS) for running on the GPU
 * [HIP](https://github.com/ROCm-Developer-Tools/HIP) for running on the GPU
-* [Protobuf](https://github.com/google/protobuf) for reading [onnx](https://github.com/onnx/onnx) files
-* [Half](http://half.sourceforge.net/) - IEEE 754-based half-precision floating point library
-* [pybind11](https://pybind11.readthedocs.io/en/stable/) - for python bindings
-* [JSON](https://github.com/nlohmann/json) - for model serialization to json string format
-* [MessagePack](https://msgpack.org/index.html) - for model serialization to binary format
-* [SQLite3](https://www.sqlite.org/index.html) - to create database of kernels' tuning information or execute queries on existing database
+* [Protobuf](https://github.com/google/protobuf) for reading [onnx](https://github.com/onnx/onnx)
+  files
+* [Half](http://half.sourceforge.net/), an IEEE 754-based half-precision floating point library
+* [pybind11](https://pybind11.readthedocs.io/en/stable/) for Python bindings
+* [JSON](https://github.com/nlohmann/json) for model serialization to JSON string format
+* [MessagePack](https://msgpack.org/index.html) for model serialization to binary format
+* [SQLite3](https://www.sqlite.org/index.html) to create a database of kernels' tuning information or run queries on an existing database

-#### Use the ROCm build tool [rbuild](https://github.com/RadeonOpenCompute/rbuild).
+### Use the ROCm build tool [rbuild](https://github.com/RadeonOpenCompute/rbuild)

-In this approach, we use the [rbuild](https://github.com/RadeonOpenCompute/rbuild) build tool to
-build MIGraphX. The specific steps are as follows:
+1. Install `rocm-cmake`, `pip3`, `rocblas`, and `miopen-hip`:

-1) Install rocm-cmake, pip3, rocblas, and miopen-hip with the command
+    ```bash
+    sudo apt install -y rocm-cmake python3-pip rocblas miopen-hip
+    ```

-```
-sudo apt install -y rocm-cmake python3-pip rocblas miopen-hip
-```
+2. Install [rbuild](https://github.com/RadeonOpenCompute/rbuild) (sudo may be required):

-2) Install [rbuild](https://github.com/RadeonOpenCompute/rbuild) (sudo may be required here.)
+    ```bash
+    pip3 install https://github.com/RadeonOpenCompute/rbuild/archive/master.tar.gz
+    ```

-```
-pip3 install https://github.com/RadeonOpenCompute/rbuild/archive/master.tar.gz
-```
+3. Build MIGraphX source code:

-3) Build MIGraphX source code
+    ```bash
+    rbuild build -d depend -B build -DGPU_TARGETS=$(/opt/rocm/bin/rocminfo | grep -o -m1 'gfx.*')
+    ```

-```
-rbuild build -d depend -B build
+Once completed, all prerequisites are in the `depend` folder and MIGraphX is in the `build` directory.
+
+```note
+If you get an `rbuild: command not found` error, it's because `rbuild` is installed in `$HOME/.local/bin`,
+which is not in `PATH`. You can either export PATH as `export PATH=$HOME/.local/bin:$PATH` to add
+the folder to `PATH`, or add the option `--prefix /usr/local` in the pip3 command when installing `rbuild`.
 ```

-then all the prerequisites are in the folder `depend`, and MIGraphX is built in the `build` directory.
+### Use CMake to build MIGraphX

-Also note that you may meet the error of `rbuild: command not found`. It is because rbuild is installed 
-at `$HOME/.local/bin`, which is not in `PATH`. You can either export PATH as `export PATH=$HOME/.local/bin:$PATH` 
-to add the folder to `PATH` or add the option `--prefix /usr/local` in the pip3 command when installing rbuild.
+1. Install the prerequisites:

-#### Use cmake to build MIGraphX
+    ```bash
+    rbuild prepare -d depend
+    ```

-If using this approach, we need to install the prerequisites, configure the cmake, and then build the source.
+    This puts all the prerequisites in the `depend` folder. They can be used in the `cmake`
+    configuration as `-DCMAKE_PREFIX_PATH=depend`.

-##### Installing the prerequisites
+    If you have sudo access, as an alternative to the `rbuild` command, you can install the prerequisites
+    in the same way as a Dockerfile, by calling `./tools/install_prereqs.sh`.

-For convenience, the prerequisites can be built automatically with rbuild as:
+    By default, all prerequisites are installed at the default location (`/usr/local`) and are accessible by all
+    users. For the default location, `sudo` is required to run the script. You can also specify a different
+    location using `./tools/install_prereqs.sh $custom_location`.

-```
-rbuild prepare -d depend
-```
+2. Go to the project folder and create a `build` directory:

-then all the prerequisites are in the folder `depend`, and they can be used in the `cmake` configuration
-as `-DCMAKE_PREFIX_PATH=depend`.
+    ```bash
+    mkdir build
+    cd build
+    ```

-If you have sudo access, as an alternative to the rbuild command, you can install the prerequisites just 
-like in the dockerfile by calling `./tools/install_prereqs.sh`.
+3. Configure CMake. If the prerequisites are installed at the default location `/usr/local`, use:

-(Note that this script is for Ubuntu. By default, all prerequisites are installed at the default location `/usr/local` 
-and are accessible by all users. For the default location, `sudo` is required to run the script.
-You can also specify a location at which the prerequisites are installed with `./tools/install_prereqs.sh $your_loc`.)
+    ```bash
+    CXX=/opt/rocm/llvm/bin/clang++ cmake .. -DGPU_TARGETS=$(/opt/rocm/bin/rocminfo | grep -o -m1 'gfx.*')
+    ```

-##### Building MIGraphX source and install libs
+    Otherwise, you need to set `-DCMAKE_PREFIX_PATH=$custom_location` to configure CMake.

-With the above prerequisites installed, we can build source as:
+4. Build MIGraphX source code:

-1) Go to the project folder and create a `build` directory:
+    ```bash
+    make -j$(nproc)
+    ```

+    You can verify the build using:

-```
-mkdir build
-cd build
-```
+    ```bash
+    make -j$(nproc) check
+    ```

-2) Configure the cmake. If the prerequisites are installed at the default location `/usr/local`, the command is:
+5. Install MIGraphX libraries:

-```
-CXX=/opt/rocm/llvm/bin/clang++ cmake ..
-```
-Otherwise, you need to set `-DCMAKE_PREFIX_PATH=$your_loc` to configure the cmake. 
+    ```bash
+    make install
+    ```

-3) Build MIGraphX source code
+### Use Docker

-```
-make -j$(nproc)
-```
+The easiest way to set up the development environment is to use Docker.

-Correctness can be verified as:
+1. With the Dockerfile, build a Docker image:

-```
-make -j$(nproc) check
-```
+    ```bash
+    docker build -t migraphx .
+    ```

-MIGraphX libs can be installed as:
+2. Enter the development environment using `docker run`:

-```
-make install
-```
+    ```bash
+    docker run --device='/dev/kfd' --device='/dev/dri' -v=`pwd`:/code/AMDMIGraphX -w /code/AMDMIGraphX --group-add video -it migraphx
+    ```

-#### Use docker
+3. In the Docker container, all required prerequisites are already installed, so you can go to the folder
+    `/code/AMDMIGraphX` and follow the steps (starting from step 2) in the
+    [Use CMake to build MIGraphX](#use-cmake-to-build-migraphx) section.

-The easiest way to setup the development environment is to use docker. With the dockerfile, you can build a docker image as:
+## Using the MIGraphX Python module

-    docker build -t migraphx .
+To use MIGraphX's Python module, you can set `PYTHONPATH` or use the `.deb` package:

-Then to enter the developement environment use `docker run`:
+* Setting `PYTHONPATH`:

-    docker run --device='/dev/kfd' --device='/dev/dri' -v=`pwd`:/code/AMDMIGraphX -w /code/AMDMIGraphX --group-add video -it migraphx
+    ```bash
+    export PYTHONPATH=/opt/rocm/lib:$PYTHONPATH
+    ```

-In the docker container, all the required prerequisites are already installed, so users can just go to the folder 
-`/code/AMDMIGraphX` and follow the steps in the above [Build MIGraphX source and install
-libs](#building-migraphx-source-and-install-libs)
-section to build MIGraphX source.
+* Creating the `deb` package:

-### Using MIGraphX Python Module
-To use MIGraphX's Python module, please either set `PYTHONPATH` or use `.deb` package as explained below:
+    ```bash
+    make package
+    ```

- Setting `PYTHONPATH` :
-```
-export PYTHONPATH=/opt/rocm/lib:$PYTHONPATH
-```
- Creating and installing the package:
+    This provides the path to the `.deb` package.

-To create deb package:
-```
-make package
-```
-This will provide the path of .deb package.
+    To install:

-To install:
-```
-dpkg -i <path_to_deb_file>
-```
+    ```bash
+    dpkg -i <path_to_deb_file>
+    ```

-### Calling MIGraphX APIs
-To use MIGraphX's C/C++ API in your cmake project, we need to set `CMAKE_PREFIX_PATH` to the MIGraphX
-installation location and then do 
-```
+## Calling MIGraphX APIs
+
+To use MIGraphX's C/C++ API in your CMake project, you must set `CMAKE_PREFIX_PATH` to the
+MIGraphX installation location and add the following to your `CMakeLists.txt`:
+
+```cmake
 find_package(migraphx)
 target_link_libraries(myApp migraphx::c)
 ```
-Where `myApp` is the cmake target in your project.
+
+Where `myApp` is the CMake target in your project.

 ## Building for development

-Using rbuild, the dependencies for development can be installed with:
+Using `rbuild`, you can install the dependencies for development with:

-```
+```bash
 rbuild develop
 ```

-This will install the dependencies for development into the `deps` directory and
-configure `cmake` to use those dependencies in the `build` directory. These
-directories can be changed by passing the `--deps-dir` and `--build-dir` flags
-to `rbuild` command:
+This installs development dependencies in the `deps` directory and configures `cmake` to use those
+dependencies in the `build` directory. You can change these directories by passing the `--deps-dir` and
+`--build-dir` flags to the `rbuild` command:

-```
+```bash
 rbuild develop --build-dir build_rocm_55 --deps-dir /home/user/deps_dir
 ```

@@ -223,12 +227,12 @@ Depending on your setup `sudo` may be required for the pip install.

 All the code is formatted using clang-format. To format a file, use:

-```
+```bash
 clang-format-10 -style=file -i <path-to-source-file>
 ```

 Also, githooks can be installed to format the code per-commit:

-```
+```bash
 ./.githooks/install
 ```
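
The build commands in the README diff above now pass `-DGPU_TARGETS=$(/opt/rocm/bin/rocminfo | grep -o -m1 'gfx.*')`. As a rough illustration of what that pipeline selects, the sketch below mimics `grep -o -m1 'gfx.*'` in C++: it returns the text from the first occurrence of `gfx` to the end of that line. The function name `first_gfx_target` is an assumption made for this example, and no real `rocminfo` output is reproduced here.

```cpp
#include <string>
#include <string_view>

// Mimics `grep -o -m1 'gfx.*'`: return the text from the first "gfx"
// to the end of the line it appears on, or an empty string if absent.
std::string first_gfx_target(std::string_view text)
{
    const auto start = text.find("gfx");
    if(start == std::string_view::npos)
        return {};
    const auto end = text.find('\n', start);
    // find returns npos when no newline follows; substr clamps the count.
    const auto count = (end == std::string_view::npos) ? end : end - start;
    return std::string(text.substr(start, count));
}
```

Because the pattern runs to the end of the line, anything that follows the architecture name on that line is included in the result, and only the first matching line is used even when several GPUs are listed.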
--- a/cmake/Embed.cmake
+++ b/cmake/Embed.cmake
@@ -21,17 +21,25 @@
 # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
 # THE SOFTWARE.
 #####################################################################################
-find_program(EMBED_LD ld)
-find_program(EMBED_OBJCOPY objcopy)

-option(EMBED_USE_LD "Use ld to embed data files" OFF)
+if(WIN32)
+    set(EMBED_USE RC CACHE STRING "Use RC or CArrays to embed data files")
+    set_property(CACHE EMBED_USE PROPERTY STRINGS "RC;CArrays")
+else()
+    set(EMBED_USE CArrays CACHE STRING "Use LD or CArrays to embed data files")
+    set_property(CACHE EMBED_USE PROPERTY STRINGS "LD;CArrays")
+endif()
+
+if(EMBED_USE STREQUAL "LD")
+    find_program(EMBED_LD ld REQUIRED)
+    find_program(EMBED_OBJCOPY objcopy REQUIRED)
+endif()

 function(wrap_string)
    set(options)
    set(oneValueArgs VARIABLE AT_COLUMN)
    set(multiValueArgs)
    cmake_parse_arguments(PARSE "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})
-    cmake_parse_arguments(WRAP_STRING "${options}" "${oneValueArgs}" "" ${ARGN})

    string(LENGTH ${${PARSE_VARIABLE}} string_length)
    math(EXPR offset "0")
@@ -54,112 +62,124 @@ function(wrap_string)
    set(${PARSE_VARIABLE} "${lines}" PARENT_SCOPE)
 endfunction()

-function(generate_embed_source EMBED_NAME)
+function(generate_embed_source EMBED_NAME EMBED_DIR BASE_DIRECTORY)
    set(options)
-    set(oneValueArgs SRC HEADER RELATIVE)
-    set(multiValueArgs OBJECTS SYMBOLS FILES)
-
+    set(oneValueArgs)
+    set(multiValueArgs SYMBOLS FILES)
    cmake_parse_arguments(PARSE "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})

-    set(EXTERNS)
-    set(INIT_KERNELS)
-
-    list(LENGTH PARSE_SYMBOLS SYMBOLS_LEN)
-    list(LENGTH PARSE_OBJECTS OBJECTS_LEN)
-    if(NOT ${SYMBOLS_LEN} EQUAL ${OBJECTS_LEN})
-        message(FATAL_ERROR "Symbols and objects dont match: ${SYMBOLS_LEN} != ${OBJECTS_LEN}")
-    endif()
-    math(EXPR LEN "${SYMBOLS_LEN} - 1")
-
-    foreach(idx RANGE ${LEN})
-        list(GET PARSE_SYMBOLS ${idx} SYMBOL)
-        list(GET PARSE_OBJECTS ${idx} OBJECT)
-        list(GET PARSE_FILES ${idx} FILE)
-
-        set(START_SYMBOL "_binary_${SYMBOL}_start")
-        set(LENGTH_SYMBOL "_binary_${SYMBOL}_length")
-        if(EMBED_USE_LD)
-            string(APPEND EXTERNS "
+    set(RESOURCE_ID 100)
+    foreach(SYMBOL FILE IN ZIP_LISTS PARSE_SYMBOLS PARSE_FILES)
+        cmake_path(RELATIVE_PATH FILE BASE_DIRECTORY ${BASE_DIRECTORY} OUTPUT_VARIABLE BASE_NAME)
+        if(EMBED_USE STREQUAL "RC")
+            string(TOUPPER "${SYMBOL}" SYMBOL)
+            string(APPEND FILE_IDS "#define IDR_${SYMBOL} ${RESOURCE_ID}\n")
+            cmake_path(NATIVE_PATH FILE NORMALIZE NATIVE_FILE)
+            string(REPLACE "\\" "\\\\" NATIVE_FILE "${NATIVE_FILE}")
+            string(APPEND RC_FILE_MAPPING "IDR_${SYMBOL} TEXTFILE \"${NATIVE_FILE}\"\n")
+            string(APPEND INIT_KERNELS "\n        {\"${BASE_NAME}\", resource::read(IDR_${SYMBOL})},")
+            math(EXPR RESOURCE_ID "${RESOURCE_ID} + 1" OUTPUT_FORMAT DECIMAL)
+        else()
+            set(START_SYMBOL "_binary_${SYMBOL}_start")
+            set(LENGTH_SYMBOL "_binary_${SYMBOL}_length")
+            if(EMBED_USE STREQUAL "LD")
+                string(APPEND EXTERNS "
 extern const char ${START_SYMBOL}[];
 extern const size_t _binary_${SYMBOL}_size;
 const auto ${LENGTH_SYMBOL} = reinterpret_cast<size_t>(&_binary_${SYMBOL}_size);
-            ")
-        else()
-            string(APPEND EXTERNS "
+")
+            else()
+                string(APPEND EXTERNS "
 extern const char ${START_SYMBOL}[];
 extern const size_t ${LENGTH_SYMBOL};
-            ")
+")
+            endif()
+            string(APPEND INIT_KERNELS "
+        { \"${BASE_NAME}\", { ${START_SYMBOL}, ${LENGTH_SYMBOL}} },")
        endif()
+    endforeach()
+    if(EMBED_USE STREQUAL "RC")
+       file(WRITE "${EMBED_DIR}/include/resource.h" "
+#define TEXTFILE 256

-        if(PARSE_RELATIVE)
-            file(RELATIVE_PATH BASE_NAME ${PARSE_RELATIVE} "${FILE}")
-        else()
-            get_filename_component(BASE_NAME "${FILE}" NAME)
-        endif()
+${FILE_IDS}
+")
+        file(WRITE "${EMBED_DIR}/resource.rc" "
+#include \"resource.h\"

-        string(APPEND INIT_KERNELS "
-            { \"${BASE_NAME}\", { ${START_SYMBOL}, ${LENGTH_SYMBOL}} },")
-    endforeach()
+${RC_FILE_MAPPING}
+")
+        set(EXTERNS "
+#include <Windows.h>
+#include \"resource.h\"

-    file(WRITE "${PARSE_HEADER}" "
+namespace resource {
+std::string_view read(int id)
+{
+    HMODULE handle = GetModuleHandle(nullptr);
+    HRSRC rc = FindResource(handle, MAKEINTRESOURCE(id), MAKEINTRESOURCE(TEXTFILE));
+    HGLOBAL data = LoadResource(handle, rc);
+    return {static_cast<const char*>(LockResource(data)), SizeofResource(handle, rc)};
+}
+}
+")
+        set(EMBED_FILES ${EMBED_DIR}/include/resource.h ${EMBED_DIR}/resource.rc)
+    endif()
+    file(WRITE "${EMBED_DIR}/include/${EMBED_NAME}.hpp" "
 #include <string_view>
 #include <unordered_map>
 #include <utility>
 std::unordered_map<std::string_view, std::string_view> ${EMBED_NAME}();
 ")

-    file(WRITE "${PARSE_SRC}" "
+    file(WRITE "${EMBED_DIR}/${EMBED_NAME}.cpp" "
 #include <${EMBED_NAME}.hpp>
 ${EXTERNS}
 std::unordered_map<std::string_view, std::string_view> ${EMBED_NAME}()
 {
-    static std::unordered_map<std::string_view, std::string_view> result = {${INIT_KERNELS}};
+    static std::unordered_map<std::string_view, std::string_view> result = {${INIT_KERNELS}
+    };
    return result;
 }
 ")
+    list(APPEND EMBED_FILES ${EMBED_DIR}/${EMBED_NAME}.cpp ${EMBED_DIR}/include/${EMBED_NAME}.hpp)
+    set(EMBED_FILES ${EMBED_FILES} PARENT_SCOPE)
 endfunction()

-function(embed_file OUTPUT_FILE OUTPUT_SYMBOL FILE)
-    set(WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR})
-    # Glob is used to compute the relative path
-    file(GLOB FILES RELATIVE ${WORKING_DIRECTORY} ${FILE})
-    foreach(REL_FILE ${FILES})
-        string(MAKE_C_IDENTIFIER "${REL_FILE}" SYMBOL)
-        get_filename_component(OUTPUT_FILE_DIR "${REL_FILE}" DIRECTORY)
-        file(MAKE_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/${OUTPUT_FILE_DIR}")
-        if(EMBED_USE_LD)
-            set(OUT_FILE "${CMAKE_CURRENT_BINARY_DIR}/${REL_FILE}.o")
-        else()
-            set(OUT_FILE "${CMAKE_CURRENT_BINARY_DIR}/${REL_FILE}.cpp")
-        endif()
-        set(${OUTPUT_SYMBOL} ${SYMBOL} PARENT_SCOPE)
-        set(${OUTPUT_FILE} "${OUT_FILE}" PARENT_SCOPE)
-        if(EMBED_USE_LD)
-            add_custom_command(
-                OUTPUT "${OUT_FILE}"
-                COMMAND ${EMBED_LD} -r -o "${OUT_FILE}" -z noexecstack --format=binary "${REL_FILE}" 
-                COMMAND ${EMBED_OBJCOPY} --rename-section .data=.rodata,alloc,load,readonly,data,contents "${OUT_FILE}"
-                WORKING_DIRECTORY ${WORKING_DIRECTORY}
-                DEPENDS ${FILE}
-                VERBATIM
-            )
-        else()
-            set_property(DIRECTORY APPEND PROPERTY CMAKE_CONFIGURE_DEPENDS ${FILE})
-            # reads source file contents as hex string
-            file(READ ${FILE} HEX_STRING HEX)
-            # wraps the hex string into multiple lines
-            wrap_string(VARIABLE HEX_STRING AT_COLUMN 80)
-            # adds '0x' prefix and comma suffix before and after every byte respectively
-            string(REGEX REPLACE "([0-9a-f][0-9a-f])" "0x\\1, " ARRAY_VALUES ${HEX_STRING})
-            # removes trailing comma
-            string(REGEX REPLACE ", $" "" ARRAY_VALUES ${ARRAY_VALUES})
-            file(WRITE "${OUT_FILE}" "
+function(embed_file FILE BASE_DIRECTORY)
+    message(STATUS "    ${FILE}")
+    cmake_path(RELATIVE_PATH FILE BASE_DIRECTORY "${BASE_DIRECTORY}" OUTPUT_VARIABLE REL_FILE)
+    string(MAKE_C_IDENTIFIER "${REL_FILE}" OUTPUT_SYMBOL)
+    get_filename_component(OUTPUT_FILE_DIR "${REL_FILE}" DIRECTORY)
+    file(MAKE_DIRECTORY "${CMAKE_CURRENT_BINARY_DIR}/${OUTPUT_FILE_DIR}")
+    if(EMBED_USE STREQUAL "LD")
+        set(OUTPUT_FILE "${CMAKE_CURRENT_BINARY_DIR}/${REL_FILE}.o")
+        add_custom_command(
+            OUTPUT "${OUTPUT_FILE}"
+            COMMAND ${EMBED_LD} -r -o "${OUTPUT_FILE}" -z noexecstack --format=binary "${REL_FILE}"
+            COMMAND ${EMBED_OBJCOPY} --rename-section .data=.rodata,alloc,load,readonly,data,contents "${OUTPUT_FILE}"
+            WORKING_DIRECTORY "${BASE_DIRECTORY}"
+            DEPENDS "${FILE}"
+            VERBATIM)
+        set(OUTPUT_FILE ${OUTPUT_FILE} PARENT_SCOPE)
+    elseif(EMBED_USE STREQUAL "CArrays")
+        set(OUTPUT_FILE "${CMAKE_CURRENT_BINARY_DIR}/${REL_FILE}.cpp")
+        # reads source file contents as hex string
+        file(READ ${FILE} HEX_STRING HEX)
+        # wraps the hex string into multiple lines
+        wrap_string(VARIABLE HEX_STRING AT_COLUMN 80)
+        # adds '0x' prefix and comma suffix before and after every byte respectively
+        string(REGEX REPLACE "([0-9a-f][0-9a-f])" "0x\\1, " ARRAY_VALUES ${HEX_STRING})
+        # removes trailing comma
+        string(REGEX REPLACE ", $" "" ARRAY_VALUES ${ARRAY_VALUES})
+        file(WRITE "${OUTPUT_FILE}" "
 #include <cstddef>
-extern const char _binary_${SYMBOL}_start[] = { ${ARRAY_VALUES} };
-extern const size_t _binary_${SYMBOL}_length = sizeof(_binary_${SYMBOL}_start);
+extern const char _binary_${OUTPUT_SYMBOL}_start[] = { ${ARRAY_VALUES} };
+extern const size_t _binary_${OUTPUT_SYMBOL}_length = sizeof(_binary_${OUTPUT_SYMBOL}_start);
 ")
-        endif()
-    endforeach()
+        set(OUTPUT_FILE ${OUTPUT_FILE} PARENT_SCOPE)
+    endif()
+    set(OUTPUT_SYMBOL ${OUTPUT_SYMBOL} PARENT_SCOPE)
 endfunction()

 function(add_embed_library EMBED_NAME)
@@ -168,35 +188,32 @@ function(add_embed_library EMBED_NAME)
    set(multiValueArgs)
    cmake_parse_arguments(PARSE "${options}" "${oneValueArgs}" "${multiValueArgs}" ${ARGN})

-    file(MAKE_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/embed)
-    file(MAKE_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}/embed/${EMBED_NAME})
    set(EMBED_DIR ${CMAKE_CURRENT_BINARY_DIR}/embed/${EMBED_NAME})
-    set(SRC_FILE "${EMBED_DIR}/${EMBED_NAME}.cpp")
-    set(HEADER_FILE "${EMBED_DIR}/include/${EMBED_NAME}.hpp")
-    set(WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR})
-    set(OUTPUT_FILES)
-    set(SYMBOLS)
-    message(STATUS "Embedding files")
+    file(MAKE_DIRECTORY ${EMBED_DIR})
+    message(STATUS "Embedding kernel files:")
    foreach(FILE ${PARSE_UNPARSED_ARGUMENTS})
-        embed_file(OUTPUT_FILE OUTPUT_SYMBOL ${FILE})
+        embed_file(${FILE} ${PARSE_RELATIVE})
        list(APPEND OUTPUT_FILES ${OUTPUT_FILE})
        list(APPEND SYMBOLS ${OUTPUT_SYMBOL})
    endforeach()
-    message(STATUS "Generating embedding library ${EMBED_NAME}")
-    generate_embed_source(${EMBED_NAME} SRC ${SRC_FILE} HEADER ${HEADER_FILE} OBJECTS ${OUTPUT_FILES} SYMBOLS ${SYMBOLS} RELATIVE ${PARSE_RELATIVE} FILES ${PARSE_UNPARSED_ARGUMENTS})
-    
+    message(STATUS "Generating embedding library '${EMBED_NAME}'")
+    generate_embed_source(${EMBED_NAME} ${EMBED_DIR} "${PARSE_RELATIVE}" SYMBOLS ${SYMBOLS} FILES ${PARSE_UNPARSED_ARGUMENTS})
    set(INTERNAL_EMBED_LIB embed_lib_${EMBED_NAME})
-    add_library(${INTERNAL_EMBED_LIB} OBJECT "${SRC_FILE}")
+    add_library(${INTERNAL_EMBED_LIB} OBJECT ${EMBED_FILES})
+    if(EMBED_USE STREQUAL "CArrays")
+        target_sources(${INTERNAL_EMBED_LIB} PRIVATE ${OUTPUT_FILES})
+    endif()
    target_include_directories(${INTERNAL_EMBED_LIB} PRIVATE "${EMBED_DIR}/include")
    target_compile_options(${INTERNAL_EMBED_LIB} PRIVATE -Wno-reserved-identifier -Wno-extern-initializer -Wno-missing-variable-declarations)
    set_target_properties(${INTERNAL_EMBED_LIB} PROPERTIES POSITION_INDEPENDENT_CODE On)
-    
    add_library(${EMBED_NAME} INTERFACE)
-    if(EMBED_USE_LD)
+    if(EMBED_USE STREQUAL "LD")
        target_sources(${EMBED_NAME} INTERFACE ${OUTPUT_FILES})
-    else()
-        target_sources(${INTERNAL_EMBED_LIB} PRIVATE ${OUTPUT_FILES})
+    endif()
+    if(EMBED_USE STREQUAL "RC")
+        target_link_libraries(${EMBED_NAME} INTERFACE $<TARGET_OBJECTS:${INTERNAL_EMBED_LIB}>)
    endif()
    target_sources(${EMBED_NAME} INTERFACE $<TARGET_OBJECTS:${INTERNAL_EMBED_LIB}>)
    target_include_directories(${EMBED_NAME} INTERFACE "${EMBED_DIR}/include")
 endfunction()
+
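Note on the `CArrays` path in `embed_file` above: the function reads each file as a hex string and rewrites every byte as a `0x..` literal inside a generated `.cpp` file. The Python sketch below mirrors that transformation so the generated output is easier to picture. It is an illustration only and not part of the build. The helper names `make_c_identifier` and `carray_source` are hypothetical, the 80-column wrapping done by `wrap_string` is omitted, and the identifier rule is assumed to match CMake's `string(MAKE_C_IDENTIFIER)` (non-alphanumeric characters become underscores, and a leading digit gets an underscore prefix).

```python
import re


def make_c_identifier(name: str) -> str:
    """Approximate CMake's string(MAKE_C_IDENTIFIER) (assumed behavior)."""
    ident = re.sub(r"[^0-9A-Za-z]", "_", name)
    if ident and ident[0].isdigit():
        ident = "_" + ident
    return ident


def carray_source(rel_file: str, data: bytes) -> str:
    """Build the text of the generated .cpp file for one embedded file.

    Hypothetical helper: it follows the file(WRITE ...) template in the
    CArrays branch, without the line wrapping applied by wrap_string.
    """
    symbol = make_c_identifier(rel_file)
    # Each byte becomes a two-digit hex literal, separated by ", ".
    values = ", ".join(f"0x{byte:02x}" for byte in data)
    return (
        "#include <cstddef>\n"
        f"extern const char _binary_{symbol}_start[] = {{ {values} }};\n"
        f"extern const size_t _binary_{symbol}_length = "
        f"sizeof(_binary_{symbol}_start);\n"
    )


if __name__ == "__main__":
    print(carray_source("kernels/add.hpp", b"\x00\xff"))
```

For a file `kernels/add.hpp`, the sketch produces the symbols `_binary_kernels_add_hpp_start` and `_binary_kernels_add_hpp_length`, which follow the same naming pattern that `generate_embed_source` declares as `extern`.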
--- a/docs/.sphinx/requirements.txt
+++ b/docs/.sphinx/requirements.txt
@@ -75,7 +75,9 @@ pygments==2.15.0
    #   pydata-sphinx-theme
    #   sphinx
 pyjwt[crypto]==2.6.0
-    # via pygithub
+    # via
+    #   pygithub
+    #   pyjwt
 pynacl==1.5.0
    # via pygithub
 pyyaml==6.0
@@ -87,7 +89,7 @@ requests==2.28.2
    # via
    #   pygithub
    #   sphinx
-rocm-docs-core==0.26.0
+rocm-docs-core==0.28.0
    # via -r requirements.in
 smmap==5.0.0
    # via gitdb

--- a/docs/contributor_guide.rst
+++ b/docs/contributor_guide.rst
 Contributor Guide
-===============
+=================

 .. toctree::
   :maxdepth: 2
   :caption: Contents:

-   dev_intro
+   dev/dev_intro
   dev/data
   dev/operators
   dev/program
@@ -14,3 +14,4 @@ Contributor Guide
   dev/pass
   dev/matchers
   dev/tools
+   dev/env_vars
--- a/docs/dev_intro.rst
+++ b/docs/dev_intro.rst
-MIGraphX Fundamentals
+Developer Introduction
 ======================

 MIGraphX provides an optimized execution engine for deep learning neural networks.

--- a/docs/dev/env_vars.rst
+++ b/docs/dev/env_vars.rst
+Environment Variables
+=====================
+
+For parsing
+---------------
+
+**MIGRAPHX_TRACE_ONNX_PARSER**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Print debugging traces for the ONNX parser.
+Prints: initializers (if used), ONNX node operators, and added MIGraphX instructions.
+
+**MIGRAPHX_DISABLE_FP16_INSTANCENORM_CONVERT**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Disables the conversion from fp16 to fp32 for the InstanceNormalization ONNX operator that MIGraphX does as a workaround for accuracy issues with reduce_mean/variance.
+See ``parse_instancenorm.cpp`` for more details.
+
+
+Matchers
+------------
+
+**MIGRAPHX_TRACE_MATCHES**
+
+Set to "1" to print the matcher that matches an instruction and the matched instruction.
+Set to "2" and use the ``MIGRAPHX_TRACE_MATCHES_FOR`` flag to filter the results.
+
+**MIGRAPHX_TRACE_MATCHES_FOR**
+
+Set to the name of any matcher and only traces for that matcher will be printed out.
+
+**MIGRAPHX_VALIDATE_MATCHES**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Validate the module after finding the matches (runs ``module.validate()``).
+
+Program Execution 
+---------------------
+
+**MIGRAPHX_TRACE_EVAL**
+
+Set to "1", "2", or "3" to use.
+"1" prints the instruction run and the time taken.
+"2" prints everything in "1" and a snippet of the output argument and some statistics (ex. min, max, mean) of the output.
+"3" prints everything in "1" and the full output buffers.
+
+
+Program Verification
+------------------------
+
+**MIGRAPHX_VERIFY_ENABLE_ALLCLOSE**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Uses ``allclose`` with the given ``atol`` and ``rtol`` for verifying ranges with ``driver verify`` or the tests that use ``migraphx/verify.hpp``.
+
+
+Pass debugging or Pass controls
+-----------------------------------
+
+**MIGRAPHX_TRACE_ELIMINATE_CONTIGUOUS**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Debug print the instructions that have input ``contiguous`` instructions removed.
+
+**MIGRAPHX_DISABLE_POINTWISE_FUSION**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Disables the ``fuse_pointwise`` compile pass.
+
+**MIGRAPHX_DEBUG_MEMORY_COLORING**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Print debug statements for the ``memory_coloring`` pass.
+
+**MIGRAPHX_TRACE_SCHEDULE**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Print debug statements for the ``schedule`` pass.
+
+**MIGRAPHX_TRACE_PROPAGATE_CONSTANT**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Traces instructions replaced with a constant.
+
+**MIGRAPHX_INT8_QUANTIZATION_PARAMS**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Print the quantization parameters of the main module only.
+
+**MIGRAPHX_DISABLE_DNNL_POST_OPS_WORKAROUND**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Disable the DNNL post ops workaround.
+
+**MIGRAPHX_DISABLE_MIOPEN_FUSION**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Disable MIOpen fusions.
+
+**MIGRAPHX_DISABLE_SCHEDULE_PASS**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Disable the ``schedule`` pass.
+
+**MIGRAPHX_DISABLE_REDUCE_FUSION**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Disable the ``fuse_reduce`` pass.
+
+**MIGRAPHX_ENABLE_NHWC**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Enable the ``layout_nhwc`` pass.
+
+**MIGRAPHX_ENABLE_CK**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Enable using the Composable Kernels library.
+Should be used in conjunction with ``MIGRAPHX_DISABLE_MLIR=1``.
+
+**MIGRAPHX_DISABLE_MLIR**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Disable using the rocMLIR library.
+
+**MIGRAPHX_ENABLE_EXTRA_MLIR**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Enables additional opportunities to use MLIR that may improve performance.
+
+**MIGRAPHX_COPY_LITERALS**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Use ``hip_copy_to_gpu`` with a new ``literal`` instruction rather than use ``hip_copy_literal{}``.
+
+Compilation traces
+----------------------
+
+**MIGRAPHX_TRACE_FINALIZE**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Debug print instructions during the ``module.finalize()`` step.
+
+**MIGRAPHX_TRACE_COMPILE**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Print trace information for the graph compilation process.
+
+**MIGRAPHX_TRACE_PASSES**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Print the compile pass and the program after the pass.
+
+**MIGRAPHX_TIME_PASSES**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Time the compile passes.
+
+
+GPU Kernels JIT compilation debugging (applicable for both hiprtc and hipclang)
+--------------------------------------------------------------------------------
+
+**MIGRAPHX_TRACE_CMD_EXECUTE**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Print commands executed by the MIGraphX ``process``.
+
+**MIGRAPHX_TRACE_HIPRTC**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Print HIPRTC options and C++ file executed.
+
+**MIGRAPHX_DEBUG_SAVE_TEMP_DIR**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Keep the created temporary directories instead of deleting them.
+
+**MIGRAPHX_GPU_DEBUG**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Internally, this adds the option ``-DMIGRAPHX_DEBUG`` when compiling GPU kernels. It enables assertions and capture of source locations for the errors. 
+
+**MIGRAPHX_GPU_DEBUG_SYM**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Adds the option ``-g`` when compiling HIPRTC.
+
+**MIGRAPHX_GPU_DUMP_SRC**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Dump the HIPRTC source files compiled.
+
+**MIGRAPHX_GPU_DUMP_ASM**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Dump the hip-clang assembly.
+
+**MIGRAPHX_GPU_OPTIMIZE**
+
+Set the optimization mode for GPU compile (``-O`` option).
+Defaults to ``-O3``.
+
+**MIGRAPHX_GPU_COMPILE_PARALLEL**
+
+Set to the number of threads to use.
+Compile GPU code in parallel with the given number of threads.
+
+**MIGRAPHX_TRACE_NARY**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Print the ``nary`` device functions used.
+
+**MIGRAPHX_ENABLE_HIPRTC_WORKAROUNDS**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Enable HIPRTC workarounds for bugs in HIPRTC.
+
+**MIGRAPHX_USE_FAST_SOFTMAX**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Use the fast softmax optimization.
+
+**MIGRAPHX_ENABLE_NULL_STREAM**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Allow using the null stream for MIOpen and hipStream.
+
+**MIGRAPHX_NSTREAMS**
+
+Set to the number of streams to use.
+Defaults to 1.
+
+**MIGRAPHX_TRACE_BENCHMARKING**
+
+Set to "1" to print the benchmarking trace.
+Set to "2" to print the benchmarking trace with more detail.
+
+MLIR vars
+-------------
+
+**MIGRAPHX_TRACE_MLIR**
+
+Set to "1" to trace MLIR and print any failures.
+Set to "2" to additionally print all MLIR operations.
+
+**MIGRAPHX_MLIR_USE_SPECIFIC_OPS**
+
+Set to the names of the operations that should always use MLIR, regardless of GPU architecture.
+Accepts a list of operators separated by commas (ex: "fused", "convolution", "dot").
+
+**MIGRAPHX_MLIR_TUNING_DB**
+
+Set to the path of the MLIR tuning database to load.
+
+**MIGRAPHX_MLIR_TUNING_CFG**
+
+Set to the path of the tuning configuration.
+Appends to the tuning configuration file, which can be used with the rocMLIR tuning scripts.
+
+**MIGRAPHX_MLIR_TUNE_EXHAUSTIVE**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Do exhaustive tuning for MLIR.
+
+
+CK vars
+-----------
+
+**MIGRAPHX_LOG_CK_GEMM**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Print Composable Kernels GEMM traces.
+
+**MIGRAPHX_CK_DEBUG**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Always add ``-DMIGRAPHX_CK_CHECK=1`` when compiling Composable Kernels operators.
+
+**MIGRAPHX_TUNE_CK**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Use tuning for Composable Kernels.
+
+Testing 
+------------
+
+**MIGRAPHX_TRACE_TEST_COMPILE**
+
+Set to the target that you want to trace the compilation of (ex. "gpu", "cpu").
+Prints the compile trace for the given target for the verify tests.
+This flag shouldn't be used in conjunction with ``MIGRAPHX_TRACE_COMPILE``.
+For the verify tests only use ``MIGRAPHX_TRACE_TEST_COMPILE``.
+
+**MIGRAPHX_TRACE_TEST**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Prints the reference and target programs even if the verify passed successfully.
+
+**MIGRAPHX_DUMP_TEST**
+
+Set to "1", "enable", "enabled", "yes", or "true" to use.
+Dumps verify tests to ``.mxr`` files.
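Note on the flag values documented in `env_vars.rst` above: most variables are switched on by one of the strings "1", "enable", "enabled", "yes", or "true", while a few (for example `MIGRAPHX_TRACE_EVAL` and `MIGRAPHX_TRACE_MLIR`) take a numeric level. The Python sketch below shows one way such values could be interpreted. It is not MIGraphX's implementation: `flag_enabled` and `flag_level` are hypothetical helpers, and exact, case-sensitive matching is an assumption, since the documentation does not say whether values are normalized.

```python
import os

# Values the documentation lists as turning a boolean flag on.
ENABLED_VALUES = frozenset({"1", "enable", "enabled", "yes", "true"})


def flag_enabled(name, environ=None):
    """Return True if the variable holds one of the documented 'on' values.

    Assumption: matching is exact and case-sensitive.
    """
    env = os.environ if environ is None else environ
    return env.get(name, "") in ENABLED_VALUES


def flag_level(name, environ=None):
    """Return the integer level of a leveled flag, or 0 if unset or invalid."""
    env = os.environ if environ is None else environ
    try:
        return int(env.get(name, "0"))
    except ValueError:
        return 0


if __name__ == "__main__":
    demo = {"MIGRAPHX_TRACE_COMPILE": "yes", "MIGRAPHX_TRACE_EVAL": "2"}
    print(flag_enabled("MIGRAPHX_TRACE_COMPILE", demo))
    print(flag_level("MIGRAPHX_TRACE_EVAL", demo))
```

Passing a dictionary instead of reading `os.environ` directly keeps the helpers easy to test without touching the process environment.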
--- a/docs/driver.rst
+++ b/docs/driver.rst
 MIGraphX Driver
 ===============

+The MIGraphX driver is a tool that allows you to utilize many of the core functions of MIGraphX without having to write your own program. It can read, compile, run, and test the performance of a model with randomized data.
+
 read
 ----

@@ -17,6 +19,7 @@ compile

 Compiles and prints input graph.

+.. include:: ./driver/read.rst
 .. include:: ./driver/compile.rst

 run
@@ -26,6 +29,7 @@ run

 Loads and prints input graph.

+.. include:: ./driver/read.rst
 .. include:: ./driver/compile.rst

 perf
@@ -35,6 +39,7 @@ perf

 Compiles and runs input graph then prints performance report.

+.. include:: ./driver/read.rst
 .. include:: ./driver/compile.rst

 .. option::  --iterations, -n [unsigned int]
@@ -48,6 +53,7 @@ verify

 Runs reference and CPU or GPU implementations and checks outputs for consistency.

+.. include:: ./driver/read.rst
 .. include:: ./driver/compile.rst

 .. option::  --rms-tol [double]
@@ -71,7 +77,7 @@ Verify each instruction
 Reduce program and verify

 roctx
----
+-----

 .. program:: migraphx-driver roctx

@@ -86,4 +92,5 @@ An example command line combined with rocprof for tracing purposes is given belo
 After `rocprof` is run, the output directory will contain trace information for HIP, HCC and ROCTX in separate `.txt` files.
 To understand the interactions between API calls, it is recommended to utilize `roctx.py` helper script as described in :ref:`dev/tools:rocTX` section.

-.. include:: ./driver/compile.rst
\ No newline at end of file
+.. include:: ./driver/read.rst
+.. include:: ./driver/compile.rst
--- a/docs/driver/compile.rst
+++ b/docs/driver/compile.rst
-.. include:: ./driver/read.rst
-
 .. option::  --fill0 [std::vector<std::string>]

 Fill parameter with 0s

--- a/docs/driver/read.rst
+++ b/docs/driver/read.rst
@@ -46,11 +46,11 @@ Trim instructions from the end (Default: 0)

 Dim of a parameter (format: "@name d1 d2 dn")

-.. options:: --dyn-input-dim [std::vector<std::string>]
+.. option:: --dyn-input-dim [std::vector<std::string>]

 Set dynamic dimensions of a parameter using JSON formatting (format "@name" "dynamic_dimension_json")

-.. options:: --default-dyn-dim
+.. option:: --default-dyn-dim

 Set the default dynamic dimension (format {min:x, max:y, optimals:[o1,o2,...]})


--- a/docs/reference/py.rst
+++ b/docs/reference/py.rst
@@ -95,7 +95,7 @@ shape
    :rtype: bool

 dynamic_dimension
--------
+-----------------

 .. py:class:: dynamic_dimension(min, max, optimals)

@@ -326,7 +326,7 @@ op
 parse_onnx
 ----------

-.. py:function:: parse_onnx(filename, default_dim_value=1, map_input_dims={}, skip_unknown_operators=false, print_program_on_error=false, max_loop_iterations=10)
+.. py:function:: parse_onnx(filename, default_dim_value=1, map_input_dims={}, skip_unknown_operators=false, print_program_on_error=false, max_loop_iterations=10, limit_max_iterations=65535)

    Load and parse an onnx file.

@@ -337,7 +337,8 @@ parse_onnx
    :param list[dynamic_dimension] map_dyn_input_dims: Explicitly specify the dynamic_dimensions of an input.
    :param str skip_unknown_operators: Continue parsing onnx file if an unknown operator is found.
    :param str print_program_on_error: Print program if an error occurs.
-    :param int max_loop_iterations: Maximum iteration number for the loop operator.
+    :param int max_loop_iterations: Maximum iteration number for the loop operator if trip count is not set.
+    :param int limit_max_iterations: Maximum iteration limit for the loop operator.
    :rtype: program

 parse_tf

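Note on the two loop parameters of `parse_onnx` documented above: `max_loop_iterations` applies when the loop's trip count is not set, and `limit_max_iterations` is described as an overall limit. The Python sketch below shows one plausible reading of how the two could combine. It is an assumption based only on the parameter descriptions, not a copy of the parser logic, and `effective_max_iterations` is a hypothetical helper name.

```python
def effective_max_iterations(trip_count=None,
                             max_loop_iterations=10,
                             limit_max_iterations=65535):
    """Pick the iteration bound for a loop operator (assumed semantics).

    trip_count: the loop's trip count if known at parse time, else None.
    """
    # Fall back to the default only when no trip count is available.
    requested = max_loop_iterations if trip_count is None else trip_count
    # Assumption: the limit caps whichever value was chosen.
    return min(requested, limit_max_iterations)


if __name__ == "__main__":
    print(effective_max_iterations())                   # no trip count
    print(effective_max_iterations(trip_count=5))       # small trip count
    print(effective_max_iterations(trip_count=100000))  # capped by the limit
```

Under this reading, a model with a very large constant trip count would be bounded by `limit_max_iterations`, while a model without a trip count would use `max_loop_iterations`.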
--- a/requirements.txt
+++ b/requirements.txt
@@ -29,4 +29,4 @@ pybind/pybind11@d159a563383d10c821ba7b2a71905d1207db6de4 --build
 msgpack/msgpack-c@cpp-3.3.0 -DMSGPACK_BUILD_TESTS=Off
 sqlite3@3.43.2 -DCMAKE_POSITION_INDEPENDENT_CODE=On
 ROCmSoftwarePlatform/composable_kernel@70eefcf4f263aa5c25f3c9ff0db8f6f199ef0fb9 -DCK_BUILD_JIT_LIB=On -DCMAKE_POSITION_INDEPENDENT_CODE=On
-ROCmSoftwarePlatform/rocMLIR@3700afd2564e21267a4d1fd8f1f80465f45daa93 -DBUILD_FAT_LIBROCKCOMPILER=On
+ROCmSoftwarePlatform/rocMLIR@13f6c2a69cfe80a575c6b241ec7353d1e953cb12 -DBUILD_FAT_LIBROCKCOMPILER=On
--- a/src/CMakeLists.txt
+++ b/src/CMakeLists.txt
@@ -28,9 +28,8 @@ include(ROCMInstallTargets)
 include(ROCMPackageConfigHelpers)
 include(RegisterOp)
 include(CheckCXXLinkerFlag)
- 

-add_library(migraphx 
+add_library(migraphx
    adjust_allocation.cpp
    analyze_streams.cpp
    apply_alpha_beta.cpp
@@ -104,6 +103,12 @@ add_library(migraphx
    value.cpp
    verify_args.cpp
 )
+
+if(WIN32)
+    # Due to compilation crashing, we need to use type-erased matchers on Windows.
+    target_compile_definitions(migraphx PUBLIC MIGRAPHX_USE_TYPE_ERASED_MATCHERS=1)
+endif()
+
 configure_file(version.h.in include/migraphx/version.h)
 rocm_set_soversion(migraphx ${MIGRAPHX_SO_VERSION})
 function(register_migraphx_ops)
@@ -155,6 +160,7 @@ register_migraphx_ops(
    identity
    if_op
    im2col
+    isinf
    isnan
    layout
    leaky_relu
@@ -174,6 +180,7 @@ register_migraphx_ops(
    mul
    multibroadcast
    multinomial
+    nearbyint
    neg
    nonmaxsuppression
    nonzero
@@ -204,7 +211,6 @@ register_migraphx_ops(
    rnn_last_hs_output
    rnn_var_sl_last_output
    roialign
-    round
    rsqrt
    run_on_target
    scalar
@@ -246,14 +252,14 @@ rocm_install_targets(
    ${CMAKE_CURRENT_BINARY_DIR}/include
 )

-
-check_cxx_linker_flag(-lstdc++fs HAS_LIB_STD_FILESYSTEM)
-if(HAS_LIB_STD_FILESYSTEM)
-target_link_libraries(migraphx PRIVATE -lstdc++fs)
+if(NOT WIN32)
+    check_cxx_linker_flag(-lstdc++fs HAS_LIB_STD_FILESYSTEM)
+    if(HAS_LIB_STD_FILESYSTEM)
+        target_link_libraries(migraphx PRIVATE -lstdc++fs)
+    endif()
+    target_link_libraries(migraphx PRIVATE -ldl)
 endif()

-target_link_libraries(migraphx PRIVATE -ldl)
-
 target_include_directories(migraphx SYSTEM PUBLIC $<BUILD_INTERFACE:${HALF_INCLUDE_DIR}>)
 target_link_libraries(migraphx PUBLIC Threads::Threads)

@@ -274,8 +280,6 @@ target_link_libraries(migraphx INTERFACE $<BUILD_INTERFACE:msgpackc-cxx>)

 add_library(migraphx_all_targets INTERFACE)

-set(PACKAGE_DEPENDS)
-
 add_subdirectory(api)
 add_subdirectory(driver)
 add_subdirectory(onnx)

--- a/src/api/api.cpp
+++ b/src/api/api.cpp
@@ -164,6 +164,11 @@ void set_default_loop_iterations(onnx_options& options, int64_t value)
    options.max_loop_iterations = value;
 }

+void set_limit_loop_iterations(onnx_options& options, int64_t value)
+{
+    options.limit_max_iterations = value;
+}
+
 void set_nhwc(tf_options& options, bool is_nhwc) { options.is_nhwc = is_nhwc; }

 void set_default_dim_value(tf_options& options, size_t value) { options.batch_size = value; }
@@ -1904,6 +1909,17 @@ migraphx_onnx_options_set_default_loop_iterations(migraphx_onnx_options_t onnx_o
    return api_error_result;
 }

+extern "C" migraphx_status
+migraphx_onnx_options_set_limit_loop_iterations(migraphx_onnx_options_t onnx_options, int64_t value)
+{
+    auto api_error_result = migraphx::try_([&] {
+        if(onnx_options == nullptr)
+            MIGRAPHX_THROW(migraphx_status_bad_param, "Bad parameter onnx_options: Null pointer");
+        migraphx::set_limit_loop_iterations((onnx_options->object), (value));
+    });
+    return api_error_result;
+}
+
 extern "C" migraphx_status migraphx_file_options_destroy(migraphx_file_options_t file_options)
 {
    auto api_error_result = migraphx::try_([&] { destroy((file_options)); });

--- a/src/api/include/migraphx/migraphx.h
+++ b/src/api/include/migraphx/migraphx.h
@@ -44,7 +44,8 @@
    m(int32_type, int32_t) \
    m(int64_type, int64_t) \
    m(uint32_type, uint32_t) \
-    m(uint64_type, uint64_t)
+    m(uint64_type, uint64_t) \
+    m(fp8e4m3fnuz_type, migraphx::fp8::fp8e4m3fnuz)
 // clang-format on

 #ifdef __cplusplus
@@ -514,6 +515,9 @@ MIGRAPHX_C_EXPORT migraphx_status migraphx_onnx_options_set_default_dyn_dim_valu
 MIGRAPHX_C_EXPORT migraphx_status migraphx_onnx_options_set_default_loop_iterations(
    migraphx_onnx_options_t onnx_options, int64_t value);

+MIGRAPHX_C_EXPORT migraphx_status migraphx_onnx_options_set_limit_loop_iterations(
+    migraphx_onnx_options_t onnx_options, int64_t value);
+
 MIGRAPHX_C_EXPORT migraphx_status
 migraphx_file_options_destroy(migraphx_file_options_t file_options);


--- a/src/api/include/migraphx/migraphx.hpp
+++ b/src/api/include/migraphx/migraphx.hpp
@@ -1321,6 +1321,12 @@ struct onnx_options : MIGRAPHX_HANDLE_BASE(onnx_options)
    {
        call(&migraphx_onnx_options_set_default_loop_iterations, this->get_handle_ptr(), value);
    }
+
+    /// Set max iteration limit for the loop operator
+    void set_limit_loop_iterations(int64_t value)
+    {
+        call(&migraphx_onnx_options_set_limit_loop_iterations, this->get_handle_ptr(), value);
+    }
 };

 /// Parse an onnx file into a migraphx program

--- a/src/api/migraphx.py
+++ b/src/api/migraphx.py
@@ -349,6 +349,11 @@ def onnx_options(h):
        api.params(value='int64_t'),
        invoke='migraphx::set_default_loop_iterations($@)',
    )
+    h.method(
+        'set_limit_loop_iterations',
+        api.params(value='int64_t'),
+        invoke='migraphx::set_limit_loop_iterations($@)',
+    )


 @auto_handle()