Release - SuperBench v0.10.0 (#607)

**Description**

Cherry-pick bug fixes from v0.10.0 to main.

**Major Revisions**

* Benchmarks: Microbenchmark - Support different hipblasLt data types in dist_inference #590
* Benchmarks: Microbenchmark - Support in-place for NCCL/RCCL benchmark #591
* Bug Fix - Fix NUMA Domains Swap Issue in NDv4 Topology File #592
* Benchmarks: Microbenchmark - Add data type option for NCCL and RCCL tests #595
* Benchmarks: Bug Fix - Make metrics of dist-inference-cpp aligned with PyTorch version #596
* CI/CD - Add ndv5 topo file #597
* Benchmarks: Microbenchmark - Improve AMD GPU P2P performance with fine-grained GPU memory #593
* Benchmarks: Build Pipeline - fix nccl and nccl test version to 2.18.3 to resolve hang issue in cuda12.2 docker #599
* Dockerfile - Bug fix for rocm docker build and deploy #598
* Benchmarks: Microbenchmark - Adapt to hipblasLt data type changes #603
* Benchmarks: Micro benchmarks - Update hipblaslt metric unit to tflops #604
* Monitor - U...

Unverified Commit 2c88db90 authored Jan 07, 2024 by Yifan Xiong Committed by GitHub Jan 08, 2024
20 changed files
--- a/.github/workflows/build-image.yml
+++ b/.github/workflows/build-image.yml
@@ -18,6 +18,7 @@ jobs:
  docker:
    name: Docker build ${{ matrix.name }}
    runs-on: ${{ matrix.runner }}
+    timeout-minutes: 600
    permissions:
      contents: read
      packages: write
@@ -27,15 +28,23 @@ jobs:
        - name: cuda12.2
          dockerfile: cuda12.2
          tags: superbench/main:cuda12.2
-          runner: ubuntu-latest
+          runner: [self-hosted, rocm-build]
+          build_args: "NUM_MAKE_JOBS=64"
        - name: cuda11.1.1
          dockerfile: cuda11.1.1
          tags: superbench/main:cuda11.1.1,superbench/superbench:latest
          runner: ubuntu-latest
+          build_args: "NUM_MAKE_JOBS=8"
        - name: rocm5.7
          dockerfile: rocm5.7.x
          tags: superbench/main:rocm5.7
          runner: [self-hosted, rocm-build]
+          build_args: "NUM_MAKE_JOBS=64"
+        - name: rocm6.0
+          dockerfile: rocm6.0.x
+          tags: superbench/main:rocm6.0
+          runner: [self-hosted, rocm-build]
+          build_args: "NUM_MAKE_JOBS=64"
    steps:
      - name: Checkout
        uses: actions/checkout@v2
@@ -75,7 +84,7 @@ jobs:
          fi
          DOCKERFILE=dockerfile/${{ matrix.dockerfile }}.dockerfile

-          BUILD_ARGS="NUM_MAKE_JOBS=8"
+          BUILD_ARGS=${{ matrix.build_args }}
          if [[ "${{ matrix.extra_args }}" ]]; then
            BUILD_ARGS="${BUILD_ARGS} ${{ matrix.extra_args }}"
          fi
@@ -87,11 +96,11 @@ jobs:
            CACHE_TO="type=inline,mode=max"
          fi

-          echo ::set-output name=dockerfile::${DOCKERFILE}
-          echo ::set-output name=build_args::${BUILD_ARGS}
-          echo ::set-output name=tags::${TAGS}
-          echo ::set-output name=cache_from::${CACHE_FROM}
-          echo ::set-output name=cache_to::${CACHE_TO}
+          echo "dockerfile=${DOCKERFILE}" >> "$GITHUB_OUTPUT"
+          echo "build_args=${BUILD_ARGS}" >> "$GITHUB_OUTPUT"
+          echo "tags=${TAGS}" >> "$GITHUB_OUTPUT"
+          echo "cache_from=${CACHE_FROM}" >> "$GITHUB_OUTPUT"
+          echo "cache_to=${CACHE_TO}" >> "$GITHUB_OUTPUT"
      - name: Echo build args
        run: echo ${{ steps.metadata.outputs.build_args }}
      - name: Echo image tag
@@ -106,6 +115,9 @@ jobs:
        with:
          username: ${{ secrets.DOCKERHUB_USERNAME }}
          password: ${{ secrets.DOCKERHUB_TOKEN }}
+      - name: Pull cache image
+        run: sudo docker pull ${{ steps.metadata.outputs.tags }}
+        continue-on-error: true
      - name: Login to the GitHub Container Registry
        uses: docker/login-action@v1
        if: ${{ github.event_name == 'release' }}
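The workflow hunk above replaces the deprecated `::set-output` command with appends to the file named by `$GITHUB_OUTPUT`. A minimal Python sketch of the same `name=value` append, assuming single-line values only; the helper name `set_output` and the temporary file standing in for the runner-provided path are illustrative and not part of this commit:

```python
import os
import tempfile


def set_output(name: str, value: str, output_path: str) -> None:
    """Append one `name=value` line, like `echo "name=value" >> "$GITHUB_OUTPUT"`."""
    if "\n" in value:
        # Multi-line values need the delimiter-based syntax; out of scope for this sketch.
        raise ValueError("single-line values only in this sketch")
    with open(output_path, "a", encoding="utf-8") as fh:
        fh.write(f"{name}={value}\n")


# Hypothetical stand-in for the file GitHub Actions exposes through $GITHUB_OUTPUT.
fd, github_output = tempfile.mkstemp()
os.close(fd)

set_output("dockerfile", "dockerfile/rocm6.0.x.dockerfile", github_output)
set_output("build_args", "NUM_MAKE_JOBS=64", github_output)

with open(github_output, encoding="utf-8") as fh:
    written = fh.read()
os.remove(github_output)
print(written)
```

Because each call appends rather than overwrites, later steps can read every key through `steps.<id>.outputs.<name>`, which is how the workflow consumes `build_args` and `tags`.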

--- a/.gitmodules
+++ b/.gitmodules
@@ -24,3 +24,9 @@
 [submodule "third_party/msccl"]
 	path = third_party/msccl
 	url = https://github.com/Azure/msccl
+[submodule "third_party/Megatron/Megatron-LM"]
+	path = third_party/Megatron/Megatron-LM
+	url = https://github.com/NVIDIA/Megatron-LM.git
+[submodule "third_party/Megatron/Megatron-DeepSpeed"]
+	path = third_party/Megatron/Megatron-DeepSpeed
+	url = https://github.com/microsoft/Megatron-DeepSpeed.git
--- a/README.md
+++ b/README.md
@@ -15,7 +15,7 @@

 __SuperBench__ is a validation and profiling tool for AI infrastructure.

-📢 [v0.9.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.9.0) has been released!
+📢 [v0.10.0](https://github.com/microsoft/superbenchmark/releases/tag/v0.10.0) has been released!

 ## _Check [aka.ms/superbench](https://aka.ms/superbench) for more details._


--- a/dockerfile/cuda12.2.dockerfile
+++ b/dockerfile/cuda12.2.dockerfile
@@ -7,7 +7,7 @@ FROM nvcr.io/nvidia/pytorch:23.10-py3
 # NVIDIA:
 #   - CUDA: 12.2.2
 #   - cuDNN: 8.9.5
-#   - NCCL: v2.19.3-1
+#   - NCCL: v2.18.3-1
 # Mellanox:
 #   - OFED: 23.07-0.5.1.2
 #   - HPC-X: v2.16
@@ -113,6 +113,13 @@ RUN cd /tmp && \
    mv amd-blis /opt/AMD && \
    rm -rf aocl-blis-linux-aocc-4.0.tar.gz

+# Install NCCL 2.18.3
+RUN cd /tmp && \
+    git clone -b v2.18.3-1 https://github.com/NVIDIA/nccl.git && \
+    cd nccl && \
+    make -j src.build && \
+    make install && \
+    rm -rf /tmp/nccl

 ENV PATH="${PATH}" \
    LD_LIBRARY_PATH="/usr/local/lib:${LD_LIBRARY_PATH}" \

--- a/dockerfile/directx12.dockerfile
+++ b/dockerfile/directx12.dockerfile
@@ -54,6 +54,8 @@ RUN curl -s -L https://dist.nuget.org/win-x86-commandline/latest/nuget.exe -o "%
 # Run the setup script to install the visual studio components
 RUN "%SB_HOME%\\dockerfile\\directx\\install-components.bat"

+RUN powershell -Command "Set-ItemProperty -Path HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem -Name LongPathsEnabled -Value 1;"
+RUN git config --system core.longpaths true
 # Install Superbench
 RUN python -m pip install setuptools==65.0.0 && \
    python -m pip install --no-cache-dir .[amdworker] && \

--- a/dockerfile/etc/ndv4-topo.xml
+++ b/dockerfile/etc/ndv4-topo.xml
 <system version="1">
  <cpu numaid="0" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
    <pci busid="ffff:ff:01.0" class="0x060400" link_speed="16 GT/s" link_width="16">
-      <pci busid="0001:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
-      <pci busid="0101:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
-      <pci busid="0002:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
-      <pci busid="0102:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
-    </pci>
-  </cpu>
-  <cpu numaid="1" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
-    <pci busid="ffff:ff:02.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="0003:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0103:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0004:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0104:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
-  <cpu numaid="2" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
-      <pci busid="ffff:ff:03.0" class="0x060400" link_speed="16 GT/s" link_width="16">
-      <pci busid="000b:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
-      <pci busid="0105:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
-      <pci busid="000c:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
-      <pci busid="0106:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
+  <cpu numaid="1" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
+    <pci busid="ffff:ff:02.0" class="0x060400" link_speed="16 GT/s" link_width="16">
+      <pci busid="0001:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
+      <pci busid="0101:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
+      <pci busid="0002:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
+      <pci busid="0102:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
-  <cpu numaid="3" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
-    <pci busid="ffff:ff:04.0" class="0x060400" link_speed="16 GT/s" link_width="16">
+  <cpu numaid="2" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
+    <pci busid="ffff:ff:03.0" class="0x060400" link_speed="16 GT/s" link_width="16">
      <pci busid="000d:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0107:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
      <pci busid="000e:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
      <pci busid="0108:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
    </pci>
  </cpu>
+  <cpu numaid="3" affinity="0000ffff,0000ffff" arch="x86_64" vendor="AuthenticAMD" familyid="23" modelid="49">
+    <pci busid="ffff:ff:04.0" class="0x060400" link_speed="16 GT/s" link_width="16">
+      <pci busid="000b:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
+      <pci busid="0105:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
+      <pci busid="000c:00:00.0" class="0x030200" link_speed="16 GT/s" link_width="16"/>
+      <pci busid="0106:00:00.0" class="0x020700" link_speed="16 GT/s" link_width="16"/>
+    </pci>
+  </cpu>
 </system>
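To sanity-check a topology file like the one above, it can help to list which GPUs and NICs sit under each NUMA node. The sketch below parses a trimmed excerpt of the post-fix NDv4 layout (first two NUMA nodes, attributes reduced to the ones used). It assumes that PCI class `0x030200` marks a GPU and `0x020700` an InfiniBand device; the function name `devices_by_numa` is hypothetical.

```python
import xml.etree.ElementTree as ET

GPU_CLASS = "0x030200"  # assumed: 3D controller (GPU)
IB_CLASS = "0x020700"   # assumed: InfiniBand controller

# Trimmed excerpt of the post-fix NDv4 topology (NUMA nodes 0 and 1 only).
TOPO_XML = """
<system version="1">
  <cpu numaid="0">
    <pci busid="ffff:ff:01.0" class="0x060400">
      <pci busid="0003:00:00.0" class="0x030200"/>
      <pci busid="0103:00:00.0" class="0x020700"/>
      <pci busid="0004:00:00.0" class="0x030200"/>
      <pci busid="0104:00:00.0" class="0x020700"/>
    </pci>
  </cpu>
  <cpu numaid="1">
    <pci busid="ffff:ff:02.0" class="0x060400">
      <pci busid="0001:00:00.0" class="0x030200"/>
      <pci busid="0101:00:00.0" class="0x020700"/>
      <pci busid="0002:00:00.0" class="0x030200"/>
      <pci busid="0102:00:00.0" class="0x020700"/>
    </pci>
  </cpu>
</system>
"""


def devices_by_numa(xml_text: str) -> dict:
    """Map each NUMA id to the GPU and NIC bus ids found beneath it."""
    root = ET.fromstring(xml_text.strip())
    mapping = {}
    for cpu in root.findall("cpu"):
        gpus, nics = [], []
        # iter() walks all nested <pci> elements, including the bridge itself.
        for dev in cpu.iter("pci"):
            if dev.get("class") == GPU_CLASS:
                gpus.append(dev.get("busid"))
            elif dev.get("class") == IB_CLASS:
                nics.append(dev.get("busid"))
        mapping[int(cpu.get("numaid"))] = {"gpus": gpus, "nics": nics}
    return mapping


mapping = devices_by_numa(TOPO_XML)
for numa_id, devs in sorted(mapping.items()):
    print(numa_id, devs["gpus"], devs["nics"])
```

Running the same function over the pre-fix file would show the `0001`/`0002` GPUs under NUMA node 0 instead of node 1, which is the swap this commit corrects.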
--- a/dockerfile/etc/ndv5-topo.xml
+++ b/dockerfile/etc/ndv5-topo.xml
+<system version="1">
+  <cpu numaid="0" affinity="ffffffff,ffff0000,00000000" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="143">
+    <pci busid="ffff:ff:01.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
+      <pci busid="0001:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
+      <pci busid="0101:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
+    </pci>
+    <pci busid="ffff:ff:02.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
+      <pci busid="0002:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
+      <pci busid="0102:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
+    </pci>
+    <pci busid="ffff:ff:03.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
+      <pci busid="0003:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
+      <pci busid="0103:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
+    </pci>
+    <pci busid="ffff:ff:04.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
+      <pci busid="0008:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
+      <pci busid="0104:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
+    </pci>
+  </cpu>
+  <cpu numaid="1" affinity="00000000,0000ffff,ffffffff" arch="x86_64" vendor="GenuineIntel" familyid="6" modelid="143">
+    <pci busid="ffff:ff:05.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
+      <pci busid="0009:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
+      <pci busid="0105:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
+    </pci>
+    <pci busid="ffff:ff:06.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
+      <pci busid="000a:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
+      <pci busid="0106:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
+    </pci>
+    <pci busid="ffff:ff:07.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
+      <pci busid="000b:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
+      <pci busid="0107:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
+    </pci>
+    <pci busid="ffff:ff:08.0" class="0x060400" link_speed="32.0 GT/s PCIe" link_width="16" vendor="0x0000" device="0x0000" subsystem_vendor="0x0000" subsystem_device="0x0000">
+      <pci busid="000c:00:00.0" class="0x030200" link_speed="32.0 GT/s PCIe" link_width="16"/>
+      <pci busid="0108:00:00.0" class="0x020700" link_speed="32.0 GT/s PCIe" link_width="16"/>
+    </pci>
+  </cpu>
+</system>
\ No newline at end of file
--- a/dockerfile/rocm5.7.x.dockerfile
+++ b/dockerfile/rocm5.7.x.dockerfile
@@ -17,6 +17,7 @@ RUN apt-get update && \
    apt-get -q install -y --no-install-recommends  \
    autoconf \
    automake \
+    bc \
    build-essential \
    curl \
    dmidecode \
@@ -27,6 +28,7 @@ RUN apt-get update && \
    libaio-dev \
    libboost-program-options-dev \
    libcap2 \
+    libcurl4-openssl-dev \
    libnuma-dev \
    libpci-dev \
    libssl-dev \
@@ -38,6 +40,7 @@ RUN apt-get update && \
    openssh-client \
    openssh-server \
    pciutils \
+    python3-mpi4py \
    rsync \
    sudo \
    util-linux \
@@ -46,11 +49,11 @@ RUN apt-get update && \
    && \
    rm -rf /tmp/*

-ARG NUM_MAKE_JOBS=16
+ARG NUM_MAKE_JOBS=

 # Check if CMake is installed and its version
 RUN cmake_version=$(cmake --version 2>/dev/null | grep -oP "(?<=cmake version )(\d+\.\d+)" || echo "0.0") && \
-    required_version="3.26.4" && \
+    required_version="3.24.1" && \
    if [ "$(printf "%s\n" "$required_version" "$cmake_version" | sort -V | head -n 1)" != "$required_version" ]; then \
    echo "existing cmake version is ${cmake_version}" && \
    cd /tmp && \
@@ -100,40 +103,26 @@ RUN if ! command -v ofed_info >/dev/null 2>&1; then \
    rm -rf MLNX_OFED_LINUX-${OFED_VERSION}* ; \
    fi

-# Install UCX
-ENV UCX_VERSION=1.14.1
-RUN if [ -z "$(ls -A /opt/ucx)" ]; then \
-    echo "/opt/ucx is empty. Installing UCX..."; \
-    cd /tmp && \
-    git clone https://github.com/openucx/ucx.git -b v${UCX_VERSION} && \
-    cd ucx && \
-    ./autogen.sh && \
-    mkdir build && \
-    cd build && \
-    ../configure -prefix=$UCX_DIR --with-rocm=/opt/rocm --without-knem && \
-    make -j $(nproc) && make -j $(nproc) install && rm -rf /tmp/ucx-${UCX_VERSION} ; \
-    else \
-    echo "/opt/ucx is not empty. Skipping UCX installation."; \
-    fi
+# Add target file to help determine which device(s) to build for
+ENV ROCM_PATH=/opt/rocm
+RUN bash -c 'echo -e "gfx90a:xnack-\ngfx90a:xnac+\ngfx940\ngfx941\ngfx942\ngfx1030\ngfx1100\ngfx1101\ngfx1102\n" >> ${ROCM_PATH}/bin/target.lst'

 # Install OpenMPI
 ENV OPENMPI_VERSION=4.1.x
+ENV MPI_HOME=/usr/local/mpi
 # Check if Open MPI is installed
-RUN [ -d /usr/local/bin/mpirun ] || { \
-    echo "Open MPI not found. Installing Open MPI..." && \
-    cd /tmp && \
+RUN cd /tmp && \
    git clone --recursive https://github.com/open-mpi/ompi.git -b v${OPENMPI_VERSION}  && \
    cd ompi && \
    ./autogen.pl && \
    mkdir build && \
    cd build && \
-    ../configure --prefix=/usr/local  --enable-orterun-prefix-by-default  --enable-mpirun-prefix-by-default  --enable-prte-prefix-by-default --enable-mca-no-build=btl-uct --with-ucx=/opt/ucx --with-rocm=/opt/rocm && \
+    ../configure --prefix=/usr/local/mpi  --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default  --enable-prte-prefix-by-default --with-rocm=/opt/rocm && \
    make -j $(nproc) && \
    make -j $(nproc) install && \
    ldconfig && \
    cd / && \
-    rm -rf /tmp/openmpi-${OPENMPI_VERSION}* ;\
-    }
+    rm -rf /tmp/openmpi-${OPENMPI_VERSION}*

 # Install Intel MLC
 RUN cd /tmp && \
@@ -148,12 +137,18 @@ RUN cd /opt/ &&  \
    cd rccl && \
    mkdir build && \
    cd build && \
-    CXX=/opt/rocm/bin/hipcc cmake -DCMAKE_PREFIX_PATH=/opt/rocm/ .. && \
+    CXX=/opt/rocm/bin/hipcc cmake -DHIP_COMPILER=clang -DCMAKE_BUILD_TYPE=Release -DCMAKE_VERBOSE_MAKEFILE=1 \
+    -DCMAKE_PREFIX_PATH="${ROCM_PATH}/hsa;${ROCM_PATH}/hip;${ROCM_PATH}/share/rocm/cmake/;${ROCM_PATH}" \
+    .. && \
    make -j${NUM_MAKE_JOBS}

-ENV PATH="/opt/superbench/bin:/usr/local/bin/:/opt/rocm/hip/bin/:/opt/rocm/bin/:${PATH}" \
+# Install AMD SMI Python Library
+RUN cd /opt/rocm/share/amd_smi && \
+    python3 -m pip install --user .
+
+ENV PATH="/usr/local/mpi/bin:/opt/superbench/bin:/usr/local/bin/:/opt/rocm/hip/bin/:/opt/rocm/bin/:${PATH}" \
    LD_PRELOAD="/opt/rccl/build/librccl.so:$LD_PRELOAD" \
-    LD_LIBRARY_PATH="/opt/ucx/lib:/usr/local/lib/:/opt/rocm/lib:${LD_LIBRARY_PATH}" \
+    LD_LIBRARY_PATH="/usr/local/mpi/lib:/usr/lib/x86_64-linux-gnu/:/usr/local/lib/:/opt/rocm/lib:${LD_LIBRARY_PATH}" \
    SB_HOME=/opt/superbench \
    SB_MICRO_PATH=/opt/superbench \
    ANSIBLE_DEPRECATION_WARNINGS=FALSE \
@@ -163,13 +158,19 @@ RUN echo PATH="$PATH" > /etc/environment && \
    echo LD_LIBRARY_PATH="$LD_LIBRARY_PATH" >> /etc/environment && \
    echo SB_MICRO_PATH="$SB_MICRO_PATH" >> /etc/environment

+RUN apt install rocm-cmake -y && \
+    python3 -m pip install --upgrade pip wheel setuptools==65.7
+
 WORKDIR ${SB_HOME}

+ADD third_party third_party
+# Apply patch
+RUN cd third_party/perftest && \
+    git apply ../perftest_rocm6.patch
+RUN make RCCL_HOME=/opt/rccl/build/ ROCBLAS_BRANCH=release/rocm-rel-5.7.1.1 HIPBLASLT_BRANCH=release/rocm-rel-5.7 ROCM_VER=rocm-5.5.0 -C third_party rocm -o cpu_hpl -o cpu_stream -o megatron_lm
+
 ADD . .
-RUN apt install rocm-cmake -y && \
-    python3 -m pip install --upgrade pip wheel setuptools==65.7 && \
-    python3 -m pip install .[amdworker]  && \
+#ENV USE_HIPBLASLT_DATATYPE=1
+RUN python3 -m pip install .[amdworker]  && \
+    CXX=/opt/rocm/bin/hipcc make cppbuild  && \
    make postinstall
-RUN make cppbuild
-ADD third_party third_party
-RUN make RCCL_HOME=/opt/rccl/build/ ROCBLAS_BRANCH=release/rocm-rel-5.7.1.1 HIPBLASLT_BRANCH=release-staging/rocm-rel-5.7 ROCM_VER=rocm-5.5.0 -C third_party rocm -o cpu_hpl -o cpu_stream -o megatron_lm
--- a/dockerfile/rocm6.0.x.dockerfile
+++ b/dockerfile/rocm6.0.x.dockerfile
+ARG BASE_IMAGE=rocm/pytorch:rocm6.0_ubuntu22.04_py3.9_pytorch_2.0.1
+
+FROM ${BASE_IMAGE}
+
+# OS:
+#   - Ubuntu: 22.04
+#   - Docker Client: 20.10.8
+# ROCm:
+#   - ROCm: 6.0
+# Lib:
+#   - torch: 2.0.1
+#   - rccl: 2.18.3+hip6.0 develop:7e1cbb4
+#   - hipblaslt: release/rocm-rel-6.0
+#   - openmpi: 4.1.x
+#   - apex: 1.0.0
+# Intel:
+#   - mlc: v3.10
+
+LABEL maintainer="SuperBench"
+
+ENV DEBIAN_FRONTEND=noninteractive
+RUN apt-get update && \
+    apt-get -q install -y --no-install-recommends  \
+    autoconf \
+    automake \
+    bc \
+    build-essential \
+    curl \
+    dmidecode \
+    git \
+    hipify-clang \
+    iproute2 \
+    jq \
+    libaio-dev \
+    libboost-program-options-dev \
+    libcap2 \
+    libcurl4-openssl-dev \
+    libnuma-dev \
+    libpci-dev \
+    libssl-dev \
+    libtinfo5 \
+    libtool \
+    lshw \
+    net-tools \
+    numactl \
+    openssh-client \
+    openssh-server \
+    pciutils \
+    python3-mpi4py \
+    rsync \
+    sudo \
+    util-linux \
+    vim \
+    wget \
+    && \
+    rm -rf /tmp/*
+
+ARG NUM_MAKE_JOBS=64
+
+# Check if CMake is installed and its version
+RUN cmake_version=$(cmake --version 2>/dev/null | grep -oP "(?<=cmake version )(\d+\.\d+)" || echo "0.0") && \
+    required_version="3.24.1" && \
+    if [ "$(printf "%s\n" "$required_version" "$cmake_version" | sort -V | head -n 1)" != "$required_version" ]; then \
+    echo "existing cmake version is ${cmake_version}" && \
+    cd /tmp && \
+    wget -q https://github.com/Kitware/CMake/releases/download/v${required_version}/cmake-${required_version}.tar.gz && \
+    tar xzf cmake-${required_version}.tar.gz && \
+    cd cmake-${required_version} && \
+    ./bootstrap --prefix=/usr --no-system-curl --parallel=16 && \
+    make -j ${NUM_MAKE_JOBS} && \
+    make install && \
+    rm -rf /tmp/cmake-${required_version}* \
+    else \
+    echo "CMake version is greater than or equal to 3.23"; \
+    fi
+
+# Install Docker
+ENV DOCKER_VERSION=20.10.8
+RUN cd /tmp && \
+    wget -q https://download.docker.com/linux/static/stable/x86_64/docker-${DOCKER_VERSION}.tgz -O docker.tgz && \
+    tar --extract --file docker.tgz --strip-components 1 --directory /usr/local/bin/ && \
+    rm docker.tgz
+
+# Update system config
+RUN mkdir -p /root/.ssh && \
+    touch /root/.ssh/authorized_keys && \
+    mkdir -p /var/run/sshd && \
+    sed -i "s/[# ]*PermitRootLogin prohibit-password/PermitRootLogin yes/" /etc/ssh/sshd_config && \
+    sed -i "s/[# ]*PermitUserEnvironment no/PermitUserEnvironment yes/" /etc/ssh/sshd_config && \
+    sed -i "s/[# ]*Port.*/Port 22/" /etc/ssh/sshd_config && \
+    echo "* soft nofile 1048576\n* hard nofile 1048576" >> /etc/security/limits.conf && \
+    echo "root soft nofile 1048576\nroot hard nofile 1048576" >> /etc/security/limits.conf
+
+
+# Get Ubuntu version and set as an environment variable
+RUN export UBUNTU_VERSION=$(lsb_release -r -s)
+RUN echo "Ubuntu version: $UBUNTU_VERSION"
+ENV UBUNTU_VERSION=${UBUNTU_VERSION}
+
+# Install OFED
+ENV OFED_VERSION=5.9-0.5.6.0
+# Check if ofed_info is present and has a version
+RUN if ! command -v ofed_info >/dev/null 2>&1; then \
+    echo "OFED not found. Installing OFED..."; \
+    cd /tmp && \
+    wget -q http://content.mellanox.com/ofed/MLNX_OFED-${OFED_VERSION}/MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu${UBUNTU_VERSION}-x86_64.tgz && \
+    tar xzf MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu${UBUNTU_VERSION}-x86_64.tgz && \
+    PATH=/usr/bin:${PATH} MLNX_OFED_LINUX-${OFED_VERSION}-ubuntu${UBUNTU_VERSION}-x86_64/mlnxofedinstall --user-space-only --without-fw-update --force --all && \
+    rm -rf MLNX_OFED_LINUX-${OFED_VERSION}* ; \
+    fi
+
+# Add target file to help determine which device(s) to build for
+ENV ROCM_PATH=/opt/rocm
+RUN bash -c 'echo -e "gfx90a:xnack-\ngfx90a:xnac+\ngfx940\ngfx941\ngfx942:sramecc+:xnack-\n" >> ${ROCM_PATH}/bin/target.lst'
+
+# Install OpenMPI
+ENV OPENMPI_VERSION=4.1.x
+ENV MPI_HOME=/usr/local/mpi
+# Check if Open MPI is installed
+RUN cd /tmp && \
+    git clone --recursive https://github.com/open-mpi/ompi.git -b v${OPENMPI_VERSION}  && \
+    cd ompi && \
+    ./autogen.pl && \
+    mkdir build && \
+    cd build && \
+    ../configure --prefix=/usr/local/mpi  --enable-orterun-prefix-by-default --enable-mpirun-prefix-by-default  --enable-prte-prefix-by-default --with-rocm=/opt/rocm && \
+    make -j $(nproc) && \
+    make -j $(nproc) install && \
+    ldconfig && \
+    cd / && \
+    rm -rf /tmp/openmpi-${OPENMPI_VERSION}*
+
+# Install Intel MLC
+RUN cd /tmp && \
+    wget -q https://downloadmirror.intel.com/763324/mlc_v3.10.tgz -O mlc.tgz && \
+    tar xzf mlc.tgz Linux/mlc && \
+    cp ./Linux/mlc /usr/local/bin/ && \
+    rm -rf ./Linux mlc.tgz
+
+# Install RCCL
+RUN cd /opt/ &&  \
+    git clone https://github.com/ROCmSoftwarePlatform/rccl.git && \
+    cd rccl && \
+    mkdir build && \
+    cd build && \
+    CXX=/opt/rocm/bin/hipcc cmake -DHIP_COMPILER=clang -DCMAKE_BUILD_TYPE=Release -DCMAKE_VERBOSE_MAKEFILE=1 \
+    -DCMAKE_PREFIX_PATH="${ROCM_PATH}/hsa;${ROCM_PATH}/hip;${ROCM_PATH}/share/rocm/cmake/;${ROCM_PATH}" \
+    .. && \
+    make -j${NUM_MAKE_JOBS}
+
+ENV PATH="/usr/local/mpi/bin:/opt/superbench/bin:/usr/local/bin/:/opt/rocm/hip/bin/:/opt/rocm/bin/:${PATH}" \
+    LD_PRELOAD="/opt/rccl/build/librccl.so:$LD_PRELOAD" \
+    LD_LIBRARY_PATH="/usr/local/mpi/lib:/usr/lib/x86_64-linux-gnu/:/usr/local/lib/:/opt/rocm/lib:${LD_LIBRARY_PATH}" \
+    SB_HOME=/opt/superbench \
+    SB_MICRO_PATH=/opt/superbench \
+    ANSIBLE_DEPRECATION_WARNINGS=FALSE \
+    ANSIBLE_COLLECTIONS_PATH=/usr/share/ansible/collections
+
+RUN echo PATH="$PATH" > /etc/environment && \
+    echo LD_LIBRARY_PATH="$LD_LIBRARY_PATH" >> /etc/environment && \
+    echo SB_MICRO_PATH="$SB_MICRO_PATH" >> /etc/environment
+
+RUN apt install rocm-cmake -y && \
+    python3 -m pip install --upgrade pip wheel setuptools==65.7
+
+WORKDIR ${SB_HOME}
+
+ADD third_party third_party
+# Apply patch
+RUN cd third_party/perftest && \
+    git apply ../perftest_rocm6.patch
+RUN make RCCL_HOME=/opt/rccl/build/ ROCBLAS_BRANCH=release/rocm-rel-6.0 HIPBLASLT_BRANCH=release/rocm-rel-6.0 ROCM_VER=rocm-5.5.0 -C third_party rocm -o cpu_hpl -o cpu_stream -o megatron_lm
+RUN cd third_party/Megatron/Megatron-DeepSpeed && \
+    git apply ../megatron_deepspeed_rocm6.patch
+
+ADD . .
+ENV USE_HIP_DATATYPE=1
+ENV USE_HIPBLAS_COMPUTETYPE=1
+RUN python3 -m pip install .[amdworker]  && \
+    CXX=/opt/rocm/bin/hipcc make cppbuild  && \
+    make postinstall
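The CMake check in the Dockerfile above compares versions with `sort -V | head -n 1` and rebuilds when the smaller of the two is not the required one. A rough Python sketch of that decision follows, assuming plain dotted numeric versions and that tuple ordering matches `sort -V` for such inputs; `parse_version` and `needs_rebuild` are illustrative names, not part of the repository.

```python
def parse_version(text: str) -> tuple:
    """Turn '3.24.1' into (3, 24, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def needs_rebuild(found: str, required: str) -> bool:
    """Mirror the shell test: rebuild unless `required` is the smaller (or equal) version."""
    smallest = min(parse_version(found), parse_version(required))
    return smallest != parse_version(required)


REQUIRED = "3.24.1"
for found in ("0.0", "3.22", "3.24", "3.26"):
    print(found, needs_rebuild(found, REQUIRED))
```

Note that the `grep -oP` pattern in the Dockerfile captures only `major.minor`, so under these assumptions an installed CMake 3.24.x is seen as `3.24`, which orders before `3.24.1` and therefore still triggers a rebuild.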
--- a/docs/developer-guides/using-docker.mdx
+++ b/docs/developer-guides/using-docker.mdx
@@ -29,7 +29,7 @@ You need to [clone the code](./development.md#set-up) first before building the
 export DOCKER_BUILDKIT=1
 docker buildx build \
  --platform linux/amd64 --cache-to type=inline,mode=max \
-  --tag superbench-dev --file dockerfile/cuda12.1.dockerfile .
+  --tag superbench-dev --file dockerfile/cuda12.2.dockerfile .
 ```

 </TabItem>

--- a/docs/getting-started/installation.mdx
+++ b/docs/getting-started/installation.mdx
@@ -61,7 +61,7 @@ You can clone the source from GitHub and build it.
 :::note Note
 You should checkout corresponding tag to use release version, for example,

-`git clone -b v0.9.0 https://github.com/microsoft/superbenchmark`
+`git clone -b v0.10.0 https://github.com/microsoft/superbenchmark`
 :::

 ```bash

--- a/docs/getting-started/run-superbench.md
+++ b/docs/getting-started/run-superbench.md
@@ -27,7 +27,7 @@ sb deploy -f remote.ini --host-password [password]
 :::note Note
 You should deploy corresponding Docker image to use release version, for example,

-`sb deploy -f local.ini -i superbench/superbench:v0.9.0-cuda12.1`
+`sb deploy -f local.ini -i superbench/superbench:v0.10.0-cuda12.2`

 You should note that version of git repo only determines version of sb CLI, and not the sb container. You should define the container version even if you specified a release version for the git clone.


--- a/docs/superbench-config.mdx
+++ b/docs/superbench-config.mdx
@@ -70,7 +70,7 @@ superbench:
 <TabItem value='example'>

 ```yaml
-version: v0.9
+version: v0.10
 superbench:
  enable: benchmark_1
  monitor:

--- a/docs/user-tutorial/benchmarks/micro-benchmarks.md
+++ b/docs/user-tutorial/benchmarks/micro-benchmarks.md
@@ -58,17 +58,18 @@ Large scale matmul operation using `torch.matmul` with one GPU.
 |--------------------------------|-----------|--------------------------------|
 | pytorch-matmul/nosharding_time | time (ms) | Time of pure matmul operation. |

-### `cublaslt-gemm`
+### `cublaslt-gemm` / `hipblaslt-gemm`

 #### Introduction

-Measure the GEMM performance of [`cublasLtMatmul`](https://docs.nvidia.com/cuda/cublas/#cublasltmatmul).
+Measure the GEMM performance of [`cublasLtMatmul`](https://docs.nvidia.com/cuda/cublas/#cublasltmatmul) or [`hipblasLt-bench`](https://github.com/ROCm/hipBLASLt/blob/develop/clients/benchmarks/README.md).

 #### Metrics

 | Name                                                      | Unit           | Description                     |
-|----------------------------------------------------------|----------------|---------------------------------|
+|-----------------------------------------------------------|----------------|---------------------------------|
 | cublaslt-gemm/${dtype}\_${batch}\_${m}\_${n}\_${k}_flops  | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |
+| hipblaslt-gemm/${dtype}\_${batch}\_${m}\_${n}\_${k}_flops | FLOPS (TFLOPS) | TFLOPS of measured GEMM kernel. |

 ### `cublas-function`

@@ -243,6 +244,7 @@ or [AMD](https://github.com/ROCm-Developer-Tools/HIP/tree/master/samples/1_Utils
 ### `gpu-copy-bw`

 Measure the memory copy bandwidth performed by GPU SM/DMA engine, including device-to-host, host-to-device and device-to-device.
+For measurements of peer-to-peer communication performance between AMD GPUs, GPU memory buffers are allocated in `hipDeviceMallocUncached` (previously `hipDeviceMallocFinegrained`) mode to maximize performance.

 #### Metrics

@@ -283,6 +285,7 @@ Measure the performance of NCCL/RCCL operations under multi nodes' traffic patte
 performed by [nccl-tests](https://github.com/NVIDIA/nccl-tests/tree/44df0bf010dcc95e840ca0fb7466c67cff3f1f0f)
 or [rccl-tests](https://github.com/ROCmSoftwarePlatform/rccl-tests/tree/dc1ad4853d7ec738387d42a75a58a98d7af00c7b).
 Support the following operations currently: allreduce, allgather, broadcast, reduce, reducescatter, alltoall.
+Support both in-place and out-of-place measurements.

 Support the following traffic patterns:
 * `all-nodes`, validate the NCCL/RCCL performance across all VM nodes simultaneously.

--- a/docs/user-tutorial/container-images.mdx
+++ b/docs/user-tutorial/container-images.mdx
@@ -29,7 +29,9 @@ available tags are listed below for all stable versions.
 <TabItem value='cuda'>

 | Tag                | Description                         |
-|-------------------|------------------------------------|
+|--------------------|-------------------------------------|
+| v0.10.0-cuda12.2   | SuperBench v0.10.0 with CUDA 12.2   |
+| v0.10.0-cuda11.1.1 | SuperBench v0.10.0 with CUDA 11.1.1 |
 | v0.9.0-cuda12.1    | SuperBench v0.9.0 with CUDA 12.1    |
 | v0.9.0-cuda11.1.1  | SuperBench v0.9.0 with CUDA 11.1.1  |
 | v0.8.0-cuda12.1    | SuperBench v0.8.0 with CUDA 12.1    |
@@ -48,6 +50,7 @@ available tags are listed below for all stable versions.

 | Tag                           | Description                                      |
 |-------------------------------|--------------------------------------------------|
+| v0.10.0-rocm5.7               | SuperBench v0.10.0 with ROCm 5.7                 |
 | v0.9.0-rocm5.1.3              | SuperBench v0.9.0 with ROCm 5.1.3                |
 | v0.9.0-rocm5.1.1              | SuperBench v0.9.0 with ROCm 5.1.1                |
 | v0.9.0-rocm5.0.1              | SuperBench v0.9.0 with ROCm 5.0.1                |

--- a/docs/user-tutorial/data-diagnosis.md
+++ b/docs/user-tutorial/data-diagnosis.md
@@ -65,7 +65,7 @@ superbench:
 example:
 ```yaml
 # SuperBench rules
-version: v0.9
+version: v0.10
 superbench:
  rules:
    failure-rule:

--- a/docs/user-tutorial/result-summary.md
+++ b/docs/user-tutorial/result-summary.md
@@ -58,7 +58,7 @@ superbench:

 ```yaml title="Example"
 # SuperBench rules
-version: v0.9
+version: v0.10
 superbench:
  rules:
    kernel_launch:

--- a/superbench/__init__.py
+++ b/superbench/__init__.py
@@ -6,5 +6,5 @@
 Provide hardware and software benchmarks for AI systems.
 """

-__version__ = '0.9.0'
+__version__ = '0.10.0'
 __author__ = 'Microsoft'
--- a/superbench/benchmarks/micro_benchmarks/cuda_nccl_bw_performance.py
+++ b/superbench/benchmarks/micro_benchmarks/cuda_nccl_bw_performance.py
@@ -94,6 +94,17 @@ def add_parser_arguments(self):
            default=0,
            help='Number of graph launch iterations. Set to 0 to disable graph mode. Default: 0.',
        )
+        self._parser.add_argument(
+            '--in_place',
+            action='store_true',
+            help='If specified, collect in-place numbers, else collect out-of-place numbers.',
+        )
+        self._parser.add_argument(
+            '--data_type',
+            type=str,
+            default='float',
+            help='Data type used in NCCL operations. Default: float.',
+        )

    def _preprocess(self):
        """Preprocess/preparation operations before the benchmarking.
@@ -123,9 +134,10 @@ def _preprocess(self):
                return False

            command = os.path.join(self._args.bin_dir, self._bin_name)
-            command += ' -b {} -e {} -f {} -g {} -c {} -n {} -w {} -G {}'.format(
+            command += ' -b {} -e {} -f {} -g {} -c {} -n {} -w {} -G {} -d {}'.format(
                self._args.minbytes, self._args.maxbytes, str(self._args.stepfactor), str(self._args.ngpus),
-                str(self._args.check), str(self._args.iters), str(self._args.warmup_iters), str(self._args.graph_iters)
+                str(self._args.check), str(self._args.iters), str(self._args.warmup_iters), str(self._args.graph_iters),
+                self._args.data_type
            )
            self._commands.append(command)
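
A minimal sketch of how the two new options could feed the command line, assuming a standalone `argparse` parser rather than the benchmark's own `self._parser`. The helper `build_command`, the binary name `all_reduce_perf`, and the numeric values are placeholders for illustration and are not taken from this change. In the hunks shown here only `--data_type` is appended to the command (as `-d`); `--in_place` is used later, when the output is parsed.

```python
import argparse

# Stand-in parser; the real benchmark registers these on self._parser.
parser = argparse.ArgumentParser()
parser.add_argument(
    '--in_place',
    action='store_true',
    help='If specified, collect in-place numbers, else collect out-of-place numbers.',
)
parser.add_argument(
    '--data_type',
    type=str,
    default='float',
    help='Data type used in NCCL operations. Default: float.',
)


def build_command(binary, args):
    """Hypothetical helper: format the flags in the same order as the diff."""
    # All values except data_type are illustrative placeholders.
    return '{} -b {} -e {} -f {} -g {} -c {} -n {} -w {} -G {} -d {}'.format(
        binary, '8', '8G', '2', '1', '0', '20', '5', '0', args.data_type
    )


args = parser.parse_args(['--data_type', 'half'])
print(build_command('all_reduce_perf', args))
```

Because `--in_place` is a `store_true` flag, omitting it leaves `args.in_place` as `False`, which keeps the out-of-place behavior as the default.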

@@ -171,9 +183,9 @@ def _process_raw_result(self, cmd_idx, raw_output):    # noqa: C901
            content = content[out_of_place_index + 1:out_of_bound_index]
            # Parse max out of bound bus bw as the result
            size_index = -1
-            time_index = -1
-            busbw_index = -1
-            algbw_index = -1
+            time_index = None
+            busbw_index = None
+            algbw_index = None
            for line in content:
                if 'time' in line and 'busbw' in line:
                    # Get index of selected column
@@ -181,11 +193,17 @@ def _process_raw_result(self, cmd_idx, raw_output):    # noqa: C901
                    line = re.sub(r' +', ' ', line).split(' ')
                    # Get first index of condition in list, if it not existing, raise exception
                    size_index = line.index('size')
+                    # Need index from the end because sometimes previous fields (like redop) can be empty
+                    if self._args.in_place:
+                        time_index = -1 - list(reversed(line)).index('time')
+                        busbw_index = -1 - list(reversed(line)).index('busbw')
+                        algbw_index = -1 - list(reversed(line)).index('algbw')
+                    else:
                        time_index = line.index('time') - len(line)
                        busbw_index = line.index('busbw') - len(line)
                        algbw_index = line.index('algbw') - len(line)
                    break
-            if size_index != -1 and busbw_index != -1 and time_index != -1 and algbw_index != -1:
+            if size_index != -1 and busbw_index is not None and time_index is not None and algbw_index is not None:
                for line in content:
                    line = line.strip(' ')
                    line = re.sub(r' +', ' ', line).split(' ')
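
The index arithmetic above can be sketched in isolation. This assumes a simplified header in which the `time`, `algbw`, and `busbw` columns appear twice, out-of-place first and in-place second; the header string, the data row, and the helper `column_indices` are made up for illustration and are not real benchmark output. Indices are counted from the end of the line, so they stay valid when an earlier field such as `redop` is empty in a data row.

```python
import re


def column_indices(header, in_place):
    """Hypothetical helper: negative indices of time/algbw/busbw columns."""
    fields = re.sub(r' +', ' ', header.strip()).split(' ')
    indices = {}
    for name in ('time', 'algbw', 'busbw'):
        if in_place:
            # Last occurrence of the name, counted from the end.
            indices[name] = -1 - list(reversed(fields)).index(name)
        else:
            # First occurrence of the name, converted to a negative index.
            indices[name] = fields.index(name) - len(fields)
    return indices


# Simplified, made-up header and data row; redop is empty in the row.
header = '# size count type redop root time algbw busbw #wrong time algbw busbw #wrong'
row = '1024 256 float -1 10.5 0.10 0.09 0 11.2 0.09 0.08 0'.split(' ')

out_of_place = column_indices(header, in_place=False)
in_place = column_indices(header, in_place=True)
print(row[out_of_place['time']], row[in_place['time']])
```

Switching the initial values from `-1` to `None` matters here because `-1` is now a legitimate index: a column that happens to be last in the header would otherwise be indistinguishable from "not found".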

--- a/superbench/benchmarks/micro_benchmarks/dist_inference.py
+++ b/superbench/benchmarks/micro_benchmarks/dist_inference.py
@@ -493,13 +493,12 @@ def _process_raw_result(self, cmd_idx, raw_output):

        try:
            output_lines = [x.strip() for x in raw_output.strip().splitlines()]
-            step_time = None
+            step_times = []
            for output_line in output_lines:
-                if ' ms per iteration' in output_line:
-                    step_time = float(output_line.split(' ms per iteration')[0].split()[-1])
-                    break
+                if output_line.startswith('Latency of step'):
+                    step_times.append(float(output_line.split(' ms')[0].split()[-1]))
            return self._process_numeric_result(
-                'step_times', [step_time], reduce_type=ReduceType.MAX, cal_percentile=True
+                'step_times', step_times, reduce_type=ReduceType.MAX, cal_percentile=True
            )
        except BaseException as e:
            return self._set_error_code_and_print_error_msg(