Unverified commit 4fa10f4d authored by one, committed by GitHub

Benchmarks: Add gpu-hpl and gpu-hpl-mxp micro benchmarks (#15)

Add gpu-hpl and gpu-hpl-mxp micro benchmarks backed by rocHPL and rocHPL-MxP.

Implemented a shared GPU HPL base that:
- Generates per-workload HPL dat files and parses the corresponding output files.
- Supports common HPL inputs such as process grid, matrix size, block size, broadcast topology, warmup, iterations, and reduce operator.
- Adds rocHPL-specific tuning parameters for gpu-hpl.
- Formats metric keys from input-derived workload attributes.
- Reports `flops`, `time`, and `tests_pass` metrics with warmup-aware aggregation.

Add benchmark registrations, parser tests, sample output fixtures, documentation, and recommended configurations for gpu-hpl and gpu-hpl-mxp.

Update rocHPL and rocHPL-MxP third-party integration with build patches, install targets, and SuperBench run helper scripts.

Also update gpu-hpcg metric naming to use flops instead of gflops, remove standalone domain/verification-style metrics from the documented metric surface, and refresh Hygon HPCG documentation/config references accordingly.
parent 88cd18df
@@ -39,3 +39,9 @@
[submodule "third_party/rocHPCG"]
path = third_party/rocHPCG
url = https://github.com/ROCm/rocHPCG.git
[submodule "third_party/rocHPL"]
path = third_party/rocHPL
url = https://github.com/ROCm/rocHPL.git
[submodule "third_party/rocHPL-MxP"]
path = third_party/rocHPL-MxP
url = https://github.com/ROCm/rocHPL-MxP.git
@@ -189,6 +189,107 @@ Performed by [High-Performance Linpack Benchmark for Distributed-Memory Computer
| cpu-hpl/throughput | bandwidth (GFlops) | Compute bandwidth. |
| cpu-hpl/time | time (s) | Time elapsed during HPL run. |
### `gpu-hpl`
#### Introduction
Measure GPU HPL performance for dense linear algebra workloads.
Performed by [rocHPL](https://github.com/ROCm/rocHPL).
#### Parameters
`gpu-hpl` always generates an HPL input `.dat` file from the command-line parameters.
The generated file name and output file name are derived from the same workload prefix used in metric keys.
| Parameter | Default | Description |
|------------------------|---------|-----------------------------------------------------------------------------|
| `--p` | `1` | Number of rows in the MPI process grid. |
| `--q` | `1` | Number of columns in the MPI process grid. |
| `--local-p` | | Optional number of rows in the node-local MPI process grid. |
| `--local-q` | | Optional number of columns in the node-local MPI process grid. |
| `--n` | `45312` | Global matrix size. |
| `--nb` | `384` | Panel/block size. |
| `--warmup` | `0` | Number of warmup HPL runs to exclude from result aggregation. |
| `--iterations` | `1` | Number of measured HPL runs to include in result aggregation. |
| `--reduce-op`          | `max`   | Reduce operator applied to measured FLOPS values: `mean`, `median`, `max`, `min`. |
| `--pmap` | `1` | Process mapping: `0` for row-major, `1` for column-major. |
| `--bcast` | `0` | Broadcast topology: `0` for 1rg, `1` for 1rM, `2` for 2rg, `3` for 2rM, `4` for Lng, `5` for LnM. |
| `--threshold` | `16.0` | Residual check threshold. |
| `--pfact` | `2` | Panel factorization: `0` for left, `1` for Crout, `2` for right. |
| `--nbmin` | `32` | Recursive stopping criterion. |
| `--ndiv` | `2` | Number of panels in recursion. |
| `--rfact` | `2` | Recursive panel factorization: `0` for left, `1` for Crout, `2` for right. |
| `--depth` | `1` | Lookahead depth. |
| `--swap` | `1` | Swapping algorithm: `0` for binary exchange, `1` for long, `2` for mix. |
| `--swapping-threshold` | `64` | Swapping threshold. |
| `--l1` | `0` | L1 storage form: `0` for transposed, `1` for non-transposed. |
| `--u` | `0` | U storage form: `0` for transposed, `1` for non-transposed. |
| `--equilibration` | `0` | Equilibration: `0` for no, `1` for yes. |
| `--memory-alignment` | `8` | Memory alignment in doubles. |
`--warmup` runs are excluded from result aggregation. `--reduce-op` is applied to the measured FLOPS values.
The reported `_time` metric is reduced in the same performance direction by applying `--reduce-op` to `1 / time`
and then converting the result back to seconds.
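The aggregation semantics above can be sketched as follows. This is a minimal illustration of the documented behavior, not the benchmark's actual implementation, and the helper name `reduce_runs` is hypothetical:

```python
import statistics

def reduce_runs(flops_values, time_values, reduce_op='max'):
    """Aggregate measured HPL runs as documented: the reduce operator is
    applied to FLOPS directly, and to 1/time for the time metric, so both
    metrics are reduced in the same performance direction."""
    ops = {'max': max, 'min': min, 'mean': statistics.mean, 'median': statistics.median}
    op = ops[reduce_op]
    flops = op(flops_values)
    # Reduce reciprocal times, then convert back to seconds.
    time = 1 / op(1 / t for t in time_values)
    return flops, time

# Three measured runs with default --reduce-op max:
# the best run gives 120 GFLOPS, and the fastest run took ~9 s.
print(reduce_runs([100, 120, 110], [10, 9, 9.5]))
```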
#### Metrics
rocHPL reports performance, time, and correctness metrics.
The metric key includes the configured HPL variant, process grid, matrix size, and block size:
`${tv}_P${P}_Q${Q}_N${N}_NB${NB}`.
The `tv` field is based on the rocHPL `T/V` value and includes an extended suffix for `L1`, `U`,
`Equilibration`, and `memory-alignment`. For example, `WC11R2R32_TTN8` uses transposed `L1`,
transposed `U`, no equilibration, and memory alignment `8`.
| Name | Unit | Description |
|-------------------------------------------------------|----------------|--------------------------------------------------|
| `gpu-hpl/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_flops` | FLOPS (GFLOPS) | Throughput for the specified rocHPL run. |
| `gpu-hpl/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_time` | time (s) | Time elapsed during the specified HPL run. |
| `gpu-hpl/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_tests_pass` | | Whether residual checks passed (1: pass, 0: fail). |
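As an illustration, the `tv` key layout described above can be reproduced with a small standalone helper. This sketch mirrors the `format_hpl_extended_tv` logic added in this change; the function name `format_tv` here is ours, not part of the benchmark:

```python
def format_tv(pmap, depth, bcast, rfact, ndiv, pfact, nbmin, l1, u, equilibration, alignment):
    """Build the rocHPL T/V field plus the extended SuperBench suffix."""
    pmap_token = 'R' if pmap == 0 else 'C'      # process mapping
    fact = {0: 'L', 1: 'C', 2: 'R'}             # left / Crout / right
    transpose = lambda v: 'T' if v == 0 else 'N'
    eq = 'N' if equilibration == 0 else 'Y'
    return (f'W{pmap_token}{depth}{bcast}{fact[rfact]}{ndiv}{fact[pfact]}{nbmin}'
            f'_{transpose(l1)}{transpose(u)}{eq}{alignment}')

# The documented example: column-major mapping, depth 1, bcast 1,
# right/right factorization, NBMIN 32, transposed L1/U,
# no equilibration, memory alignment 8.
print(format_tv(1, 1, 1, 2, 2, 2, 32, 0, 0, 0, 8))  # WC11R2R32_TTN8
```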
### `gpu-hpl-mxp`
#### Introduction
Measure GPU HPL-MxP performance for mixed-precision dense linear algebra workloads.
Performed by [rocHPL-MxP](https://github.com/ROCm/rocHPL-MxP).
#### Parameters
`gpu-hpl-mxp` always generates an HPL-MxP input `.dat` file from the command-line parameters.
The generated file name and output file name are derived from the same workload prefix used in metric keys.
| Parameter | Default | Description |
|----------------|---------|-----------------------------------------------------------------------------|
| `--p` | `1` | Number of rows in the MPI process grid. |
| `--q` | `1` | Number of columns in the MPI process grid. |
| `--local-p` | | Optional number of rows in the node-local MPI process grid. |
| `--local-q` | | Optional number of columns in the node-local MPI process grid. |
| `--n` | `61440` | Global matrix size. |
| `--nb` | `2560` | Panel/block size. |
| `--warmup` | `0` | Number of warmup HPL-MxP runs to exclude from result aggregation. |
| `--iterations` | `1` | Number of measured HPL-MxP runs to include in result aggregation. |
| `--reduce-op`  | `max`   | Reduce operator applied to measured FLOPS values: `mean`, `median`, `max`, `min`. |
| `--pmap` | `1` | Process mapping: `0` for row-major, `1` for column-major. |
| `--bcast` | `0` | Broadcast topology: `0` for 1rg, `1` for 1rM, `2` for 2rg, `3` for 2rM, `4` for Lng, `5` for LnM. |
| `--threshold` | `16.0` | Residual check threshold. |
`--warmup` runs are excluded from result aggregation. `--reduce-op` is applied to the measured FLOPS values.
The reported `_time` metric is reduced in the same performance direction by applying `--reduce-op` to `1 / time`
and then converting the result back to seconds.
#### Metrics
rocHPL-MxP reports performance, time, and correctness metrics.
The metric key includes the configured HPL-MxP variant, process grid, matrix size, and block size:
`${tv}_P${P}_Q${Q}_N${N}_NB${NB}`.
The `tv` field is based on the rocHPL-MxP `T/V` value, for example `WC1`.
| Name | Unit | Description |
|-----------------------------------------------------------|----------------|--------------------------------------------------|
| `gpu-hpl-mxp/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_flops` | FLOPS (GFLOPS) | Throughput for the specified rocHPL-MxP run. |
| `gpu-hpl-mxp/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_time` | time (s) | Time elapsed during the specified HPL-MxP run. |
| `gpu-hpl-mxp/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_tests_pass` | | Whether residual checks passed (1: pass, 0: fail). |
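The HPL-MxP `tv` field is much simpler than the rocHPL one; a sketch mirroring the `format_hpl_mxp_tv` helper added in this change:

```python
def format_mxp_tv(pmap, bcast):
    """Build the rocHPL-MxP T/V key: 'W', the process mapping token
    ('R' row-major, 'C' column-major), then the bcast index."""
    return f"W{'R' if pmap == 0 else 'C'}{bcast}"

print(format_mxp_tv(1, 1))  # WC1, the example from the docs above
```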
### `gpu-hpcg`
#### Introduction
@@ -196,6 +297,22 @@ Performed by [High-Performance Linpack Benchmark for Distributed-Memory Computer
Measure GPU HPCG performance for sparse linear algebra and multigrid-style workloads.
Performed by [rocHPCG](https://github.com/ROCm/rocHPCG).
#### Parameters
| Parameter | Default | Description |
|-----------|---------|-----------------------------------------------------------------------------|
| `--npx` | `1` | Number of MPI processes in the x dimension. |
| `--npy` | `1` | Number of MPI processes in the y dimension. |
| `--npz` | `1` | Number of MPI processes in the z dimension. |
| `--nx` | `560` | Local problem size in the x dimension. |
| `--ny` | `280` | Local problem size in the y dimension. |
| `--nz` | `280` | Local problem size in the z dimension. |
| `--rt` | `60` | Benchmark runtime in seconds. |
| `--tol` | `1.0` | Verification control: `0` runs reference verification; non-zero skips it. |
| `--pz` | `0` | Partition boundary in the z process dimension. |
| `--zl` | `--nz` | Local `nz` value for processes with z rank lower than `--pz`. |
| `--zu` | `--nz` | Local `nz` value for processes with z rank greater than or equal to `--pz`. |
#### Metrics
rocHPCG reports performance and time metrics.
@@ -205,9 +322,9 @@ The metric key includes the configured process domain and local problem size:
| Name | Unit | Description |
|--------------------------------------------------------------------------------------------------|------------------|---------------------------------------------------------|
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_gflops` | FLOPS (GFLOPS) | Throughput for the specified rocHPCG operation. |
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_flops` | FLOPS (GFLOPS) | Throughput for the specified rocHPCG operation. |
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_bandwidth` | bandwidth (GB/s) | Bandwidth for the specified rocHPCG operation. |
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_gflops_per_process` | FLOPS (GFLOPS) | Per-process throughput for the specified operation. |
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_flops_per_process` | FLOPS (GFLOPS) | Per-process throughput for the specified operation. |
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_bandwidth_per_process` | bandwidth (GB/s) | Per-process bandwidth for the specified operation. |
| `gpu-hpcg/setup_time_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}` | time (s) | Setup phase duration. |
| `gpu-hpcg/optimization_time_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}` | time (s) | Optimization phase duration. |
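The gpu-hpcg metric key layout in the table above can be sketched with a small helper; `hpcg_metric_name` is a hypothetical name, only the key format itself comes from the documentation:

```python
def hpcg_metric_name(operation, npx, npy, npz, nx, ny, nz, metric_type):
    """Build a gpu-hpcg metric key from the process domain (npx/npy/npz),
    the local problem size (nx/ny/nz), and the metric type suffix."""
    return f'gpu-hpcg/{operation}_p{npx}x{npy}x{npz}_n{nx}x{ny}x{nz}_{metric_type}'

print(hpcg_metric_name('spmv', 4, 4, 2, 560, 280, 280, 'flops'))
# gpu-hpcg/spmv_p4x4x2_n560x280x280_flops
```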
@@ -16,7 +16,7 @@
r'^(?P<subject>final|ddot|waxpby|spmv|mg|total)_'
r'p(?P<npx>\d+)x(?P<npy>\d+)x(?P<npz>\d+)_'
r'n(?P<nx>\d+)x(?P<ny>\d+)x(?P<nz>\d+)_'
r'(?P<type>gflops|bandwidth|gflops_per_process|bandwidth_per_process)$'
r'(?P<type>flops|bandwidth|flops_per_process|bandwidth_per_process)$'
)
_HPCG_TIME_PATTERN = re.compile(
r'^(?P<subject>setup_time|optimization_time|total_time)_'
@@ -37,9 +37,9 @@
}
_HPCG_PERF_TYPE_ORDER = {
'gflops': 0,
'flops': 0,
'bandwidth': 1,
'gflops_per_process': 2,
'flops_per_process': 2,
'bandwidth_per_process': 3,
}
@@ -7,6 +7,7 @@
from superbench.benchmarks.micro_benchmarks.gemm_flops_performance_base import GemmFlopsBenchmark
from superbench.benchmarks.micro_benchmarks.memory_bw_performance_base import MemBwBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_hpcg_performance_base import GpuHpcgBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_hpl_performance_base import GpuHplBenchmark
from superbench.benchmarks.micro_benchmarks.computation_communication_overlap import ComputationCommunicationOverlap
from superbench.benchmarks.micro_benchmarks.cublas_function import CublasBenchmark
@@ -17,6 +18,8 @@
from superbench.benchmarks.micro_benchmarks.dtk_memory_bw_performance import DtkMemBwBenchmark
from superbench.benchmarks.micro_benchmarks.dtk_gemm_flops_performance import DtkGemmFlopsBenchmark
from superbench.benchmarks.micro_benchmarks.dtk_hpcg_performance import DtkHpcgBenchmark
from superbench.benchmarks.micro_benchmarks.dtk_hpl_performance import DtkHplBenchmark
from superbench.benchmarks.micro_benchmarks.dtk_hpl_mxp_performance import DtkHplMxpBenchmark
from superbench.benchmarks.micro_benchmarks.cuda_gemm_flops_performance import CudaGemmFlopsBenchmark
from superbench.benchmarks.micro_benchmarks.cuda_memory_bw_performance import CudaMemBwBenchmark
from superbench.benchmarks.micro_benchmarks.cuda_nccl_bw_performance import CudaNcclBwBenchmark
@@ -60,6 +63,8 @@
'DiskBenchmark',
'DistInference',
'DtkGemmFlopsBenchmark',
'DtkHplBenchmark',
'DtkHplMxpBenchmark',
'RocmHipBlasLtBenchmark',
'DtkHipBlasLtBenchmark',
'DtkMemBwBenchmark',
@@ -68,6 +73,7 @@
'GpuBurnBenchmark',
'GpuCopyBwBenchmark',
'GpuHpcgBenchmark',
'GpuHplBenchmark',
'GpuStreamBenchmark',
'IBBenchmark',
'IBLoopbackBenchmark',
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
"""Module of the DTK HPL-MxP benchmark."""
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.benchmarks.micro_benchmarks import GpuHplBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_hpl_performance_base import format_hpl_mxp_tv
class DtkHplMxpBenchmark(GpuHplBenchmark):
"""The DTK HPL-MxP benchmark class."""
_default_bin_name = 'run_rochplmxp'
_default_dat_name = 'HPL-MxP.dat'
_default_out_name = 'HPL-MxP.out'
_file_label = 'HPL-MxP'
_default_n = 61440
_default_nb = 2560
def _format_tv(self):
"""Format the expected rocHPL-MxP T/V field from input arguments."""
return format_hpl_mxp_tv(self._args.pmap, self._args.bcast)
def _match_output_n(self):
"""Return whether parsed rocHPL-MxP output N must match the input N."""
return False
def _format_dat_content(self):
"""Format generated rocHPL-MxP input file content."""
return (
'HPLinpack benchmark input file\n'
'Innovative Computing Laboratory, University of Tennessee\n'
f'{self._out_file_name} output file name (if any)\n'
'0 device out (6=stdout,7=stderr,file)\n'
'1 # of problems sizes (N)\n'
f'{self._args.n} Ns\n'
'1 # of NBs\n'
f'{self._args.nb} NBs\n'
f'{self._args.pmap} PMAP process mapping (0=Row-,1=Column-major)\n'
f'{self._args.p} P\n'
f'{self._args.q} Q\n'
f'{self._args.threshold} threshold\n'
'1 # of broadcast\n'
f'{self._args.bcast} BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)\n'
)
BenchmarkRegistry.register_benchmark('gpu-hpl-mxp', DtkHplMxpBenchmark, platform=Platform.DTK)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
"""Module of the DTK HPL benchmark."""
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.benchmarks.micro_benchmarks import GpuHplBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_hpl_performance_base import format_hpl_extended_tv, format_hpl_tv
class DtkHplBenchmark(GpuHplBenchmark):
"""The DTK HPL benchmark class."""
_default_bin_name = 'run_rochpl'
_default_dat_name = 'HPL.dat'
_default_out_name = 'HPL.out'
_file_label = 'HPL'
def _add_variant_parser_arguments(self):
"""Add rocHPL-specific arguments."""
self._parser.add_argument(
'--pfact',
dest='pfact',
type=int,
default=2,
choices=[0, 1, 2],
required=False,
help='Panel factorization: 0 for left, 1 for Crout, 2 for right.',
)
self._parser.add_argument(
'--nbmin',
dest='nbmin',
type=int,
default=32,
required=False,
help='Recursive stopping criterion.',
)
self._parser.add_argument(
'--ndiv',
dest='ndiv',
type=int,
default=2,
required=False,
help='Number of panels in recursion.',
)
self._parser.add_argument(
'--rfact',
dest='rfact',
type=int,
default=2,
choices=[0, 1, 2],
required=False,
help='Recursive panel factorization: 0 for left, 1 for Crout, 2 for right.',
)
self._parser.add_argument(
'--depth',
dest='depth',
type=int,
default=1,
required=False,
help='Lookahead depth.',
)
self._parser.add_argument(
'--swap',
dest='swap',
type=int,
default=1,
choices=[0, 1, 2],
required=False,
help='Swapping algorithm: 0 for binary exchange, 1 for long, 2 for mix.',
)
self._parser.add_argument(
'--swapping-threshold',
dest='swapping_threshold',
type=int,
default=64,
required=False,
help='Swapping threshold.',
)
self._parser.add_argument(
'--l1',
dest='l1',
type=int,
default=0,
choices=[0, 1],
required=False,
help='L1 storage form: 0 for transposed, 1 for non-transposed.',
)
self._parser.add_argument(
'--u',
dest='u',
type=int,
default=0,
choices=[0, 1],
required=False,
help='U storage form: 0 for transposed, 1 for non-transposed.',
)
self._parser.add_argument(
'--equilibration',
dest='equilibration',
type=int,
default=0,
choices=[0, 1],
required=False,
help='Equilibration: 0 for no, 1 for yes.',
)
self._parser.add_argument(
'--memory-alignment',
dest='memory_alignment',
type=int,
default=8,
required=False,
help='Memory alignment in double.',
)
def _format_tv(self):
"""Format the expected rocHPL T/V field from input arguments."""
return format_hpl_extended_tv(
self._args.pmap,
self._args.depth,
self._args.bcast,
self._args.rfact,
self._args.ndiv,
self._args.pfact,
self._args.nbmin,
self._args.l1,
self._args.u,
self._args.equilibration,
self._args.memory_alignment,
)
def _format_output_tv(self):
"""Format the expected rocHPL T/V field in generated output."""
return format_hpl_tv(
self._args.pmap,
self._args.depth,
self._args.bcast,
self._args.rfact,
self._args.ndiv,
self._args.pfact,
self._args.nbmin,
)
def _format_dat_content(self):
"""Format generated rocHPL input file content."""
return (
'HPLinpack benchmark input file\n'
'Innovative Computing Laboratory, University of Tennessee\n'
f'{self._out_file_name} output file name (if any)\n'
'0 device out (6=stdout,7=stderr,file)\n'
'1 # of problems sizes (N)\n'
f'{self._args.n} Ns\n'
'1 # of NBs\n'
f'{self._args.nb} NBs\n'
f'{self._args.pmap} PMAP process mapping (0=Row-,1=Column-major)\n'
'1 # of process grids (P x Q)\n'
f'{self._args.p} Ps\n'
f'{self._args.q} Qs\n'
f'{self._args.threshold} threshold\n'
'1 # of panel fact\n'
f'{self._args.pfact} PFACTs (0=left, 1=Crout, 2=Right)\n'
'1 # of recursive stopping criterium\n'
f'{self._args.nbmin} NBMINs (>= 1)\n'
'1 # of panels in recursion\n'
f'{self._args.ndiv} NDIVs\n'
'1 # of recursive panel fact.\n'
f'{self._args.rfact} RFACTs (0=left, 1=Crout, 2=Right)\n'
'1 # of broadcast\n'
f'{self._args.bcast} BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)\n'
'1 # of lookahead depth\n'
f'{self._args.depth} DEPTHs (>=0)\n'
f'{self._args.swap} SWAP (0=bin-exch,1=long,2=mix)\n'
f'{self._args.swapping_threshold} swapping threshold\n'
f'{self._args.l1} L1 in (0=transposed,1=no-transposed) form\n'
f'{self._args.u} U in (0=transposed,1=no-transposed) form\n'
f'{self._args.equilibration} Equilibration (0=no,1=yes)\n'
f'{self._args.memory_alignment} memory alignment in double (> 0)\n'
)
BenchmarkRegistry.register_benchmark('gpu-hpl', DtkHplBenchmark, platform=Platform.DTK)
@@ -169,29 +169,29 @@ def _process_raw_result(self, cmd_idx, raw_output):
parsed_results = {}
required_metrics = {
'final_gflops',
'final_flops',
'final_bandwidth',
'final_gflops_per_process',
'final_flops_per_process',
'final_bandwidth_per_process',
'ddot_gflops',
'ddot_flops',
'ddot_bandwidth',
'ddot_gflops_per_process',
'ddot_flops_per_process',
'ddot_bandwidth_per_process',
'waxpby_gflops',
'waxpby_flops',
'waxpby_bandwidth',
'waxpby_gflops_per_process',
'waxpby_flops_per_process',
'waxpby_bandwidth_per_process',
'spmv_gflops',
'spmv_flops',
'spmv_bandwidth',
'spmv_gflops_per_process',
'spmv_flops_per_process',
'spmv_bandwidth_per_process',
'mg_gflops',
'mg_flops',
'mg_bandwidth',
'mg_gflops_per_process',
'mg_flops_per_process',
'mg_bandwidth_per_process',
'total_gflops',
'total_flops',
'total_bandwidth',
'total_gflops_per_process',
'total_flops_per_process',
'total_bandwidth_per_process',
'setup_time',
'optimization_time',
@@ -228,9 +228,9 @@ def _process_raw_result(self, cmd_idx, raw_output):
def _format_metric_name(self, metric):
"""Format a rocHPCG metric with the configured process domain and local problem size."""
metric_suffixes = (
'gflops_per_process',
'flops_per_process',
'bandwidth_per_process',
'gflops',
'flops',
'bandwidth',
)
workload = (
@@ -268,8 +268,8 @@ def _parse_operation_line(self, line, parsed_results):
if len(gflops_values) < 2 or len(bandwidth_values) < 2:
return False
parsed_results[f'{prefix}_gflops'] = gflops_values[0]
parsed_results[f'{prefix}_gflops_per_process'] = gflops_values[1]
parsed_results[f'{prefix}_flops'] = gflops_values[0]
parsed_results[f'{prefix}_flops_per_process'] = gflops_values[1]
parsed_results[f'{prefix}_bandwidth'] = bandwidth_values[0]
parsed_results[f'{prefix}_bandwidth_per_process'] = bandwidth_values[1]
return True
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
"""Module of the GPU HPL benchmark base class."""
import os
import re
import statistics
from typing import Optional
from superbench.common.utils import logger
from superbench.benchmarks.micro_benchmarks import MicroBenchmarkWithInvoke
_HPL_RESULT_PATTERN = re.compile(
r'^\s*(?P<tv>W\S+)\s+'
r'(?P<n>\d+)\s+'
r'(?P<nb>\d+)\s+'
r'(?P<p>\d+)\s+'
r'(?P<q>\d+)\s+'
r'(?P<time>[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)\s+'
r'(?P<flops>[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)\s*$'
)
_HPL_RESIDUAL_PATTERN = re.compile(r'\.\.\.\.\.\.\s+(?P<status>PASSED|FAILED)\s*$', re.IGNORECASE)
def _format_pmap(pmap):
"""Format HPL process mapping token used in the T/V field."""
return 'R' if pmap == 0 else 'C'
def _format_fact(fact):
"""Format HPL panel factorization token used in the T/V field."""
fact_tokens = {
0: 'L',
1: 'C',
2: 'R',
}
return fact_tokens[fact]
def _format_transpose(value):
"""Format HPL transposed/no-transposed token used in the extended T/V field."""
return 'T' if value == 0 else 'N'
def _format_equilibration(value):
"""Format HPL equilibration token used in the extended T/V field."""
return 'N' if value == 0 else 'Y'
def format_hpl_mxp_tv(pmap, bcast):
"""Format the rocHPL-MxP T/V field from input parameters."""
return f'W{_format_pmap(pmap)}{bcast}'
def format_hpl_tv(pmap, depth, bcast, rfact, ndiv, pfact, nbmin):
"""Format the rocHPL T/V field from input parameters."""
return (
f'W{_format_pmap(pmap)}'
f'{depth}'
f'{bcast}'
f'{_format_fact(rfact)}'
f'{ndiv}'
f'{_format_fact(pfact)}'
f'{nbmin}'
)
def format_hpl_extended_tv(pmap, depth, bcast, rfact, ndiv, pfact, nbmin, l1, u, equilibration, alignment):
"""Format the rocHPL T/V field plus SuperBench variant suffix from input parameters."""
return (
f'{format_hpl_tv(pmap, depth, bcast, rfact, ndiv, pfact, nbmin)}'
f'_{_format_transpose(l1)}'
f'{_format_transpose(u)}'
f'{_format_equilibration(equilibration)}'
f'{alignment}'
)
class GpuHplBenchmark(MicroBenchmarkWithInvoke):
"""The GPU HPL benchmark base class."""
_default_bin_name: Optional[str] = None
_default_dat_name: Optional[str] = None
_default_out_name: Optional[str] = None
_file_label: Optional[str] = None
_default_n = 45312
_default_nb = 384
def __init__(self, name, parameters=''):
"""Constructor.
Args:
name (str): benchmark name.
parameters (str): benchmark parameters.
"""
super().__init__(name, parameters)
self._bin_name = self._default_bin_name
self._dat_path = None
self._out_path = None
self._tv = None
self._workload = None
self._dat_file_name = None
self._out_file_name = None
def add_parser_arguments(self):
"""Add the specified arguments."""
super().add_parser_arguments()
self._parser.add_argument(
'--p',
dest='p',
type=int,
default=1,
required=False,
help='Specific MPI grid size: the number of rows in MPI grid.',
)
self._parser.add_argument(
'--q',
dest='q',
type=int,
default=1,
required=False,
help='Specific MPI grid size: the number of columns in MPI grid.',
)
self._parser.add_argument(
'--local-p',
dest='local_p',
type=int,
required=False,
help='Specific node-local MPI grid size: the number of rows in node-local MPI grid.',
)
self._parser.add_argument(
'--local-q',
dest='local_q',
type=int,
required=False,
help='Specific node-local MPI grid size: the number of columns in node-local MPI grid.',
)
self._parser.add_argument(
'--n',
dest='n',
type=int,
default=self._default_n,
required=False,
help='Specific matrix size: the number of rows/columns in global matrix.',
)
self._parser.add_argument(
'--nb',
dest='nb',
type=int,
default=self._default_nb,
required=False,
help='Specific panel size: the number of rows/columns in panels.',
)
self._parser.add_argument(
'--warmup',
type=int,
default=0,
required=False,
help='Number of warmup runs to exclude from result aggregation.',
)
self._parser.add_argument(
'--iterations',
type=int,
default=1,
required=False,
help='Number of measurement runs to include in result aggregation.',
)
self._parser.add_argument(
'--reduce-op',
dest='reduce_op',
type=str,
default='max',
choices=['mean', 'median', 'max', 'min'],
required=False,
help='Reduce operator for aggregating measurement runs by FLOPS.',
)
self._parser.add_argument(
'--pmap',
dest='pmap',
type=int,
default=1,
choices=[0, 1],
required=False,
help='Process mapping: 0 for row-major, 1 for column-major.',
)
self._parser.add_argument(
'--bcast',
dest='bcast',
type=int,
default=0,
choices=[0, 1, 2, 3, 4, 5],
required=False,
help='Broadcast topology: 0 for 1rg, 1 for 1rM, 2 for 2rg, 3 for 2rM, 4 for Lng, 5 for LnM.',
)
self._parser.add_argument(
'--threshold',
type=float,
default=16.0,
required=False,
help='Residual check threshold.',
)
self._add_variant_parser_arguments()
def _add_variant_parser_arguments(self):
"""Add benchmark variant-specific arguments."""
pass
def _preprocess(self):
"""Preprocess/preparation operations before benchmarking."""
if not super()._preprocess():
return False
if self._args.warmup < 0:
logger.error('warmup should be non-negative, while {} is set.'.format(self._args.warmup))
return False
if self._args.iterations <= 0:
logger.error('iterations should be positive, while {} is set.'.format(self._args.iterations))
return False
self._tv = self._format_tv()
self._workload = self._format_workload()
file_prefix = self._format_file_prefix()
self._dat_file_name = f'{file_prefix}.dat'
self._out_file_name = f'{file_prefix}.out'
self._dat_path = os.path.join(self._args.bin_dir, self._dat_file_name)
self._out_path = os.path.join(self._args.bin_dir, self._out_file_name)
with open(self._dat_path, 'w') as dat_file:
dat_file.write(self._format_dat_content())
bin_path = os.path.join(self._args.bin_dir, self._bin_name)
command = (
f'{bin_path}'
f' -P {self._args.p}'
f' -Q {self._args.q}'
f' --it {self._args.warmup + self._args.iterations}'
f' -i {self._dat_file_name}'
)
if self._args.local_p is not None:
command += f' -p {self._args.local_p}'
if self._args.local_q is not None:
command += f' -q {self._args.local_q}'
self._commands = [command]
return True
def _process_raw_result(self, cmd_idx, raw_output):
"""Parse HPL stdout and generated output file."""
self._result.add_raw_data('raw_output_' + str(cmd_idx), raw_output, self._args.log_raw_data)
if self._out_path is None or not os.path.exists(self._out_path):
logger.error('HPL output file does not exist - path: {}.'.format(self._out_path))
return False
with open(self._out_path, 'r') as output_file:
output = output_file.read()
self._result.add_raw_data('hpl_output_' + str(cmd_idx), output, self._args.log_raw_data)
rows = self._parse_result_rows(output)
end = self._args.warmup + self._args.iterations
if len(rows) < end:
logger.error(
'Insufficient HPL result rows - benchmark: {}, expected: {}, found: {}.'.format(
self._name, end, len(rows)
)
)
return False
measured_rows = rows[self._args.warmup:end]
flops, time = self._reduce_rows(measured_rows)
tests_pass = 1 if all(row['passed'] for row in measured_rows) else 0
self._result.add_result(f'{self._workload}_flops', flops)
self._result.add_result(f'{self._workload}_time', time)
self._result.add_result(f'{self._workload}_tests_pass', tests_pass)
return True
def _parse_result_rows(self, output):
"""Parse matching HPL result rows from generated output content."""
rows = []
pending_row = None
output_tv = self._format_output_tv()
for line in output.splitlines():
result_match = _HPL_RESULT_PATTERN.match(line)
if result_match:
pending_row = {
'tv': result_match.group('tv'),
'n': int(result_match.group('n')),
'nb': int(result_match.group('nb')),
'p': int(result_match.group('p')),
'q': int(result_match.group('q')),
'time': float(result_match.group('time')),
'flops': float(result_match.group('flops')),
}
if pending_row['time'] <= 0:
logger.error(
'Invalid HPL result time - benchmark: {}, time: {}.'.format(self._name, pending_row['time'])
)
pending_row = None
continue
residual_match = _HPL_RESIDUAL_PATTERN.search(line)
if residual_match and pending_row is not None:
if self._is_expected_result_row(pending_row, output_tv):
pending_row['passed'] = residual_match.group('status').upper() == 'PASSED'
rows.append(pending_row)
pending_row = None
return rows
def _is_expected_result_row(self, row, output_tv):
"""Return whether a parsed output row matches the current benchmark input."""
if row['tv'] != output_tv:
return False
if row['nb'] != self._args.nb or row['p'] != self._args.p or row['q'] != self._args.q:
return False
if self._match_output_n() and row['n'] != self._args.n:
return False
return True
def _reduce_rows(self, rows):
"""Reduce measured rows according to FLOPS-oriented reduce semantics."""
flops = self._reduce_values(row['flops'] for row in rows)
reciprocal_time = self._reduce_values(1 / row['time'] for row in rows)
return flops, 1 / reciprocal_time
def _reduce_values(self, values):
"""Reduce values with the configured performance-oriented reduce operator."""
values = list(values)
reduce_op = self._args.reduce_op
if reduce_op == 'max':
return max(values)
if reduce_op == 'min':
return min(values)
if reduce_op == 'mean':
return statistics.mean(values)
return statistics.median(values)
def _format_tv(self):
"""Format the expected T/V field from benchmark input arguments."""
raise NotImplementedError
def _format_output_tv(self):
"""Format the expected T/V field in generated HPL output."""
return self._format_tv()
def _match_output_n(self):
"""Return whether parsed output N must match the input N."""
return True
def _format_workload(self):
"""Format the metric workload suffix from benchmark input arguments."""
return f'{self._tv}_P{self._args.p}_Q{self._args.q}_N{self._args.n}_NB{self._args.nb}'
def _format_file_prefix(self):
"""Format generated HPL.dat/HPL.out file prefix."""
return f'{self._file_label or self._name}-{self._workload}'
def _format_dat_content(self):
"""Format generated HPL.dat content."""
raise NotImplementedError
@@ -154,6 +154,179 @@ superbench:
maxbytes: 16G
ngpus: 1
operation: alltoall
gpu-hpl:r32:
enable: false
modes:
- name: mpi
proc_num: 8
node_num: 4
bind_to: none
mca:
pml: ucx
btl: ^openib
btl_tcp_if_exclude: lo,docker0
coll_hcoll_enable: 0
env:
ROCM_PATH: /opt/dtk
HSA_FORCE_FINE_GRAIN_PCIE: 1
UCX_RNDV_SCHEME: put_zcopy
UCX_RNDV_FRAG_MEM_TYPE: rocm
UCX_MEMTYPE_CACHE: n
parameters:
p: 8
q: 4
n: 512000
nb: 512
bcast: 5
warmup: 1
iterations: 5
gpu-hpl:r16:
enable: false
modes:
- name: mpi
proc_num: 8
node_num: 2
bind_to: none
mca:
pml: ucx
btl: ^openib
btl_tcp_if_exclude: lo,docker0
coll_hcoll_enable: 0
env:
ROCM_PATH: /opt/dtk
HSA_FORCE_FINE_GRAIN_PCIE: 1
UCX_RNDV_SCHEME: put_zcopy
UCX_RNDV_FRAG_MEM_TYPE: rocm
UCX_MEMTYPE_CACHE: n
parameters:
p: 8
q: 2
n: 360448
nb: 512
bcast: 1
warmup: 1
iterations: 5
gpu-hpl:r8:
enable: false
modes:
- name: mpi
proc_num: 8
node_num: 1
bind_to: none
parameters:
p: 4
q: 2
n: 254976
nb: 512
bcast: 1
warmup: 1
iterations: 5
gpu-hpl:r4:
enable: false
modes:
- name: mpi
proc_num: 4
node_num: 1
bind_to: none
parameters:
p: 4
q: 1
n: 180224
nb: 512
bcast: 1
warmup: 1
iterations: 5
gpu-hpl:r2:
enable: false
modes:
- name: mpi
proc_num: 2
node_num: 1
bind_to: none
parameters:
p: 2
q: 1
n: 128000
nb: 512
bcast: 1
warmup: 1
iterations: 5
gpu-hpl:r1:
enable: false
modes:
- name: mpi
proc_num: 1
node_num: 1
bind_to: none
parameters:
p: 1
q: 1
n: 90624
nb: 512
nbmin: 16
bcast: 1
warmup: 1
iterations: 5
gpu-hpl-mxp:r8:
enable: false
modes:
- name: mpi
proc_num: 8
node_num: 1
bind_to: none
parameters:
p: 4
q: 2
n: 344064
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpl-mxp:r4:
enable: false
modes:
- name: mpi
proc_num: 4
node_num: 1
bind_to: none
parameters:
p: 4
q: 1
n: 245760
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpl-mxp:r2:
enable: false
modes:
- name: mpi
proc_num: 2
node_num: 1
bind_to: none
parameters:
p: 2
q: 1
n: 172032
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpl-mxp:r1:
enable: false
modes:
- name: mpi
proc_num: 1
node_num: 1
bind_to: none
parameters:
p: 1
q: 1
n: 122880
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpcg:r32:
enable: false
modes:
......
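The recommended `n` values in the configurations above scale the HPL matrix with the GPU count so that each GPU's share of the double-precision matrix stays roughly constant. A back-of-the-envelope check (a sketch, assuming 8 bytes per element and ignoring panel workspace):

```python
def hpl_matrix_gib_per_gpu(n: int, gpus: int) -> float:
    """Per-GPU share of the N x N double-precision HPL matrix in GiB."""
    return n * n * 8 / gpus / 2**30

# Single-node gpu-hpl sizes from the recommended configuration above.
for n, gpus in [(90624, 1), (128000, 2), (180224, 4), (254976, 8)]:
    print(f'N={n} on {gpus} GPU(s): {hpl_matrix_gib_per_gpu(n, gpus):.1f} GiB/GPU')
```

All four land near 60–61 GiB per GPU, which is the usual HPL practice of filling most of device memory with the matrix.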
......@@ -160,48 +160,127 @@ superbench:
maxbytes: 16G
ngpus: 1
operation: alltoall
gpu-hpcg:r32:
gpu-hpl:r8:
enable: false
modes:
- name: mpi
proc_num: 8
node_num: 4
host_list: [node01, node02, node03, node04]
node_num: 1
bind_to: none
mca:
pml: ob1
btl: ^openib
btl_tcp_if_include: p14p2
coll_hcoll_enable: 0
parameters:
nx: 560
ny: 280
nz: 280
rt: 10
npx: 4
npy: 4
npz: 2
gpu-hpcg:r16:
p: 4
q: 2
n: 256000
nb: 512
bcast: 1
warmup: 1
iterations: 5
gpu-hpl:r4:
enable: false
modes:
- name: mpi
proc_num: 4
node_num: 1
bind_to: none
parameters:
p: 4
q: 1
n: 180224
nb: 512
bcast: 1
warmup: 1
iterations: 5
gpu-hpl:r2:
enable: false
modes:
- name: mpi
proc_num: 2
node_num: 1
bind_to: none
parameters:
p: 2
q: 1
n: 128000
nb: 512
bcast: 1
warmup: 1
iterations: 5
gpu-hpl:r1:
enable: false
modes:
- name: mpi
proc_num: 1
node_num: 1
bind_to: none
parameters:
p: 1
q: 1
n: 90624
nb: 512
bcast: 1
nbmin: 16
warmup: 1
iterations: 5
gpu-hpl-mxp:r8:
enable: false
modes:
- name: mpi
proc_num: 8
node_num: 2
host_list: [node01, node02]
node_num: 1
bind_to: none
mca:
pml: ob1
btl: ^openib
btl_tcp_if_include: p14p2
coll_hcoll_enable: 0
parameters:
nx: 560
ny: 280
nz: 280
rt: 10
npx: 4
npy: 2
npz: 2
p: 4
q: 2
n: 344064
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpl-mxp:r4:
enable: false
modes:
- name: mpi
proc_num: 4
node_num: 1
bind_to: none
parameters:
p: 4
q: 1
n: 245760
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpl-mxp:r2:
enable: false
modes:
- name: mpi
proc_num: 2
node_num: 1
bind_to: none
parameters:
p: 2
q: 1
n: 172032
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpl-mxp:r1:
enable: false
modes:
- name: mpi
proc_num: 1
node_num: 1
bind_to: none
parameters:
p: 1
q: 1
n: 122880
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpcg:r8:
enable: false
modes:
......
......@@ -91,39 +91,39 @@ superbench:
categories: HPCG gpu-hpcg:r1
metrics:
- gpu-hpcg:r1/(setup_time|optimization_time|total_time)_p1x1x1_n560x280x280
- gpu-hpcg:r1/(ddot|waxpby|spmv|mg|total|final)_p1x1x1_n560x280x280_(gflops_per_process|bandwidth_per_process|gflops|bandwidth)
- gpu-hpcg:r1/(ddot|waxpby|spmv|mg|total|final)_p1x1x1_n560x280x280_(flops_per_process|bandwidth_per_process|flops|bandwidth)
gpu_hpcg_r2:
statistics: mean
categories: HPCG gpu-hpcg:r2
metrics:
- gpu-hpcg:r2/(setup_time|optimization_time|total_time)_p2x1x1_n560x280x280
- gpu-hpcg:r2/(ddot|waxpby|spmv|mg|total|final)_p2x1x1_n560x280x280_(gflops_per_process|bandwidth_per_process|gflops|bandwidth)
- gpu-hpcg:r2/(ddot|waxpby|spmv|mg|total|final)_p2x1x1_n560x280x280_(flops_per_process|bandwidth_per_process|flops|bandwidth)
gpu_hpcg_r4:
statistics: mean
categories: HPCG gpu-hpcg:r4
metrics:
- gpu-hpcg:r4/(setup_time|optimization_time|total_time)_p2x2x1_n560x280x280
- gpu-hpcg:r4/(ddot|waxpby|spmv|mg|total|final)_p2x2x1_n560x280x280_(gflops_per_process|bandwidth_per_process|gflops|bandwidth)
- gpu-hpcg:r4/(ddot|waxpby|spmv|mg|total|final)_p2x2x1_n560x280x280_(flops_per_process|bandwidth_per_process|flops|bandwidth)
gpu_hpcg_r8:
statistics: mean
categories: HPCG gpu-hpcg:r8
metrics:
- gpu-hpcg:r8/(setup_time|optimization_time|total_time)_p2x2x2_n560x280x280
- gpu-hpcg:r8/(ddot|waxpby|spmv|mg|total|final)_p2x2x2_n560x280x280_(gflops_per_process|bandwidth_per_process|gflops|bandwidth)
- gpu-hpcg:r8/(ddot|waxpby|spmv|mg|total|final)_p2x2x2_n560x280x280_(flops_per_process|bandwidth_per_process|flops|bandwidth)
gpu_hpcg_r16:
statistics: mean
categories: HPCG gpu-hpcg:r16
metrics:
- gpu-hpcg:r16/(setup_time|optimization_time|total_time)_p4x2x2_n560x280x280
- gpu-hpcg:r16/(ddot|waxpby|spmv|mg|total|final)_p4x2x2_n560x280x280_(gflops_per_process|bandwidth_per_process|gflops|bandwidth)
- gpu-hpcg:r16/(ddot|waxpby|spmv|mg|total|final)_p4x2x2_n560x280x280_(flops_per_process|bandwidth_per_process|flops|bandwidth)
gpu_hpcg_r32:
statistics: mean
categories: HPCG gpu-hpcg:r32
metrics:
- gpu-hpcg:r32/(setup_time|optimization_time|total_time)_p4x4x2_n560x280x280
- gpu-hpcg:r32/(ddot|waxpby|spmv|mg|total|final)_p4x4x2_n560x280x280_(gflops_per_process|bandwidth_per_process|gflops|bandwidth)
- gpu-hpcg:r32/(ddot|waxpby|spmv|mg|total|final)_p4x4x2_n560x280x280_(flops_per_process|bandwidth_per_process|flops|bandwidth)
......@@ -103,29 +103,29 @@ def test_dtk_hpcg_result_parsing_with_wrapper_noise(self):
workload = 'p4x4x2_n560x280x280'
expected_results = {
f'final_{workload}_gflops': 6904.9,
f'final_{workload}_gflops_per_process': 215.8,
f'final_{workload}_flops': 6904.9,
f'final_{workload}_flops_per_process': 215.8,
f'final_{workload}_bandwidth': 52359.0,
f'final_{workload}_bandwidth_per_process': 1636.2,
f'ddot_{workload}_gflops': 5849.4,
f'ddot_{workload}_flops': 5849.4,
f'ddot_{workload}_bandwidth': 46794.9,
f'ddot_{workload}_gflops_per_process': 182.8,
f'ddot_{workload}_flops_per_process': 182.8,
f'ddot_{workload}_bandwidth_per_process': 1462.3,
f'waxpby_{workload}_gflops': 3052.0,
f'waxpby_{workload}_flops': 3052.0,
f'waxpby_{workload}_bandwidth': 36623.8,
f'waxpby_{workload}_gflops_per_process': 95.4,
f'waxpby_{workload}_flops_per_process': 95.4,
f'waxpby_{workload}_bandwidth_per_process': 1144.5,
f'spmv_{workload}_gflops': 5473.9,
f'spmv_{workload}_flops': 5473.9,
f'spmv_{workload}_bandwidth': 34468.8,
f'spmv_{workload}_gflops_per_process': 171.1,
f'spmv_{workload}_flops_per_process': 171.1,
f'spmv_{workload}_bandwidth_per_process': 1077.1,
f'mg_{workload}_gflops': 7716.9,
f'mg_{workload}_flops': 7716.9,
f'mg_{workload}_bandwidth': 59557.1,
f'mg_{workload}_gflops_per_process': 241.2,
f'mg_{workload}_flops_per_process': 241.2,
f'mg_{workload}_bandwidth_per_process': 1861.2,
f'total_{workload}_gflops': 6971.0,
f'total_{workload}_flops': 6971.0,
f'total_{workload}_bandwidth': 52859.9,
f'total_{workload}_gflops_per_process': 217.8,
f'total_{workload}_flops_per_process': 217.8,
f'total_{workload}_bandwidth_per_process': 1651.9,
f'setup_time_{workload}': 0.12,
f'optimization_time_{workload}': 0.25,
......
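The metric changes in the test above are a uniform `gflops` → `flops` key rename; the values and their GFLOPS unit are unchanged, only the key suffix moves. A sketch of the mapping (hypothetical helper name, shown for illustration):

```python
def rename_metric_keys(results: dict) -> dict:
    """Apply the gflops -> flops metric key rename; values keep
    their GFLOPS unit, only the key suffix changes."""
    return {k.replace('_gflops', '_flops'): v for k, v in results.items()}

old = {'final_p4x4x2_n560x280x280_gflops': 6904.9,
       'final_p4x4x2_n560x280x280_gflops_per_process': 215.8}
print(rename_metric_keys(old))
```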
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
"""Tests for DTK gpu-hpl benchmark family."""
import os
import unittest
from tests.helper.testcase import BenchmarkTestCase
from superbench.benchmarks.micro_benchmarks.dtk_hpl_mxp_performance import DtkHplMxpBenchmark
from superbench.benchmarks.micro_benchmarks.dtk_hpl_performance import DtkHplBenchmark
class DtkHplFamilyBenchmarkTest(BenchmarkTestCase, unittest.TestCase):
"""Tests for DTK gpu-hpl and gpu-hpl-mxp benchmarks."""
@classmethod
def setUpClass(cls):
"""Hook method for setting up class fixture before running tests in the class."""
super().setUpClass()
cls.createMockEnvs(cls)
cls.createMockFiles(cls, ['bin/run_rochpl', 'bin/run_rochplmxp'])
def _parse_args(self, benchmark):
"""Parse benchmark arguments without running preprocess."""
benchmark.add_parser_arguments()
ret, args, _ = benchmark.parse_args()
if ret:
benchmark._args = args
return ret, args
def _write_output_file(self, benchmark, output):
"""Write generated HPL output for result parsing tests."""
with open(benchmark._out_path, 'w') as output_file:
output_file.write(output)
def _load_data_file(self, file_name):
"""Load test data file content."""
data_path = os.path.join(os.path.dirname(__file__), '..', '..', 'data', file_name)
with open(data_path, 'r') as data_file:
return data_file.read()
def test_dtk_hpl_default_metric_workload(self):
"""Test DTK gpu-hpl default workload formatting."""
benchmark = DtkHplBenchmark('gpu-hpl')
ret, args = self._parse_args(benchmark)
self.assertTrue(ret)
self.assertEqual(1, args.p)
self.assertEqual(1, args.q)
self.assertEqual(45312, args.n)
self.assertEqual(384, args.nb)
self.assertEqual(0, args.bcast)
self.assertEqual(0, args.warmup)
self.assertEqual(1, args.iterations)
self.assertEqual('max', args.reduce_op)
benchmark._tv = benchmark._format_tv()
self.assertEqual('WC10R2R32_TTN8', benchmark._tv)
self.assertEqual('WC10R2R32_TTN8_P1_Q1_N45312_NB384', benchmark._format_workload())
def test_dtk_hpl_mxp_default_metric_workload(self):
"""Test DTK gpu-hpl-mxp default workload formatting."""
benchmark = DtkHplMxpBenchmark('gpu-hpl-mxp')
ret, args = self._parse_args(benchmark)
self.assertTrue(ret)
self.assertEqual(1, args.p)
self.assertEqual(1, args.q)
self.assertEqual(61440, args.n)
self.assertEqual(2560, args.nb)
self.assertEqual(0, args.bcast)
benchmark._tv = benchmark._format_tv()
self.assertEqual('WC0', benchmark._tv)
self.assertEqual('WC0_P1_Q1_N61440_NB2560', benchmark._format_workload())
def test_dtk_hpl_sample_metric_workload(self):
"""Test DTK gpu-hpl workload formatting with sample parameters."""
benchmark = DtkHplBenchmark('gpu-hpl', parameters='--p 4 --q 1 --n 8192 --nb 512 --bcast 1')
ret, _ = self._parse_args(benchmark)
self.assertTrue(ret)
benchmark._tv = benchmark._format_tv()
self.assertEqual('WC11R2R32_TTN8', benchmark._tv)
self.assertEqual('WC11R2R32_TTN8_P4_Q1_N8192_NB512', benchmark._format_workload())
def test_dtk_hpl_mxp_sample_metric_workload(self):
"""Test DTK gpu-hpl-mxp workload formatting with sample parameters."""
benchmark = DtkHplMxpBenchmark('gpu-hpl-mxp', parameters='--p 4 --q 1 --n 8192 --nb 4096 --bcast 1')
ret, _ = self._parse_args(benchmark)
self.assertTrue(ret)
benchmark._tv = benchmark._format_tv()
self.assertEqual('WC1', benchmark._tv)
self.assertEqual('WC1_P4_Q1_N8192_NB4096', benchmark._format_workload())
def test_dtk_hpl_only_arguments_are_not_accepted_by_mxp(self):
"""Test rocHPL-only arguments are not accepted by gpu-hpl-mxp."""
hpl_benchmark = DtkHplBenchmark('gpu-hpl', parameters='--pfact 2')
hpl_mxp_benchmark = DtkHplMxpBenchmark('gpu-hpl-mxp', parameters='--pfact 2')
hpl_ret, _ = self._parse_args(hpl_benchmark)
hpl_mxp_ret, _ = self._parse_args(hpl_mxp_benchmark)
self.assertTrue(hpl_ret)
self.assertFalse(hpl_mxp_ret)
def test_dtk_hpl_invalid_sampling_arguments(self):
"""Test invalid HPL sampling arguments are rejected."""
self.assertFalse(DtkHplBenchmark('gpu-hpl', parameters='--warmup -1')._preprocess())
self.assertFalse(DtkHplBenchmark('gpu-hpl', parameters='--iterations 0')._preprocess())
def test_dtk_hpl_preprocess_generates_dat_file(self):
"""Test DTK gpu-hpl dat file and command generation."""
benchmark = DtkHplBenchmark('gpu-hpl')
self.assertTrue(benchmark._preprocess())
dat_file_name = 'HPL-WC10R2R32_TTN8_P1_Q1_N45312_NB384.dat'
out_file_name = 'HPL-WC10R2R32_TTN8_P1_Q1_N45312_NB384.out'
self.assertEqual(os.path.join(self._tmp_dir, 'bin', dat_file_name), benchmark._dat_path)
self.assertEqual(os.path.join(self._tmp_dir, 'bin', out_file_name), benchmark._out_path)
self.assertEqual(1, len(benchmark._commands))
self.assertIn(f'run_rochpl -P 1 -Q 1 --it 1 -i {dat_file_name}', benchmark._commands[0])
with open(benchmark._dat_path, 'r') as dat_file:
dat_content = dat_file.read()
self.assertIn(f'{out_file_name} output file name (if any)', dat_content)
self.assertIn('45312 Ns', dat_content)
self.assertIn('384 NBs', dat_content)
self.assertIn('0 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)', dat_content)
self.assertIn('2 PFACTs (0=left, 1=Crout, 2=Right)', dat_content)
self.assertIn('8 memory alignment in double (> 0)', dat_content)
def test_dtk_hpl_mxp_preprocess_generates_dat_file(self):
"""Test DTK gpu-hpl-mxp dat file and command generation."""
benchmark = DtkHplMxpBenchmark(
'gpu-hpl-mxp', parameters='--p 4 --q 1 --n 8192 --nb 4096 --bcast 1 --warmup 1 --iterations 5'
)
self.assertTrue(benchmark._preprocess())
dat_file_name = 'HPL-MxP-WC1_P4_Q1_N8192_NB4096.dat'
out_file_name = 'HPL-MxP-WC1_P4_Q1_N8192_NB4096.out'
self.assertEqual(os.path.join(self._tmp_dir, 'bin', dat_file_name), benchmark._dat_path)
self.assertEqual(os.path.join(self._tmp_dir, 'bin', out_file_name), benchmark._out_path)
self.assertEqual(1, len(benchmark._commands))
self.assertIn(f'run_rochplmxp -P 4 -Q 1 --it 6 -i {dat_file_name}', benchmark._commands[0])
with open(benchmark._dat_path, 'r') as dat_file:
dat_content = dat_file.read()
self.assertIn(f'{out_file_name} output file name (if any)', dat_content)
self.assertIn('8192 Ns', dat_content)
self.assertIn('4096 NBs', dat_content)
self.assertIn('4 P', dat_content)
self.assertIn('1 Q', dat_content)
self.assertIn('1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)', dat_content)
def test_dtk_hpl_result_parsing_with_warmup_and_max_reduce(self):
"""Test DTK gpu-hpl parses generated output and reduces by max FLOPS."""
benchmark = DtkHplBenchmark(
'gpu-hpl', parameters='--p 4 --q 1 --n 8192 --nb 512 --bcast 1 --warmup 1 --iterations 3 --reduce-op max'
)
self.assertTrue(benchmark._preprocess())
self._write_output_file(
benchmark, """
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.71 5.167e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.69 5.338e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.67 5.437e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.67 5.450e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
"""
)
self.assertTrue(benchmark._process_raw_result(0, 'stdout noise'))
workload = 'WC11R2R32_TTN8_P4_Q1_N8192_NB512'
self.assertEqual(545.0, benchmark.result[f'{workload}_flops'][0])
self.assertEqual(0.67, benchmark.result[f'{workload}_time'][0])
self.assertEqual(1, benchmark.result[f'{workload}_tests_pass'][0])
self.assertIn('raw_output_0', benchmark.raw_data)
self.assertIn('hpl_output_0', benchmark.raw_data)
def test_dtk_hpl_mxp_result_parsing_does_not_match_output_n(self):
"""Test DTK gpu-hpl-mxp parses output when output N differs from input N."""
benchmark = DtkHplMxpBenchmark(
'gpu-hpl-mxp',
parameters='--p 4 --q 1 --n 8192 --nb 4096 --bcast 1 --warmup 1 --iterations 2 --reduce-op min'
)
self.assertTrue(benchmark._preprocess())
self._write_output_file(
benchmark, """
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.78 3.742e+03
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.80 3.665e+03
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.78 3.767e+03
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... FAILED
"""
)
self.assertTrue(benchmark._process_raw_result(0, 'stdout noise'))
workload = 'WC1_P4_Q1_N8192_NB4096'
self.assertEqual(3665.0, benchmark.result[f'{workload}_flops'][0])
self.assertEqual(0.80, benchmark.result[f'{workload}_time'][0])
self.assertEqual(0, benchmark.result[f'{workload}_tests_pass'][0])
def test_dtk_hpl_result_parsing_with_median_reduce(self):
"""Test DTK gpu-hpl median reduce uses reciprocal time."""
benchmark = DtkHplBenchmark(
'gpu-hpl', parameters='--p 4 --q 1 --n 8192 --nb 512 --bcast 1 --iterations 4 --reduce-op median'
)
self.assertTrue(benchmark._preprocess())
self._write_output_file(
benchmark, """
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.80 5.000e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.40 6.000e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.20 7.000e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.10 8.000e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
"""
)
self.assertTrue(benchmark._process_raw_result(0, 'stdout noise'))
workload = 'WC11R2R32_TTN8_P4_Q1_N8192_NB512'
self.assertEqual(650.0, benchmark.result[f'{workload}_flops'][0])
self.assertEqual(0.26666666666666666, benchmark.result[f'{workload}_time'][0])
self.assertEqual(1, benchmark.result[f'{workload}_tests_pass'][0])
def test_dtk_hpl_result_parsing_with_sample_output_file(self):
"""Test DTK gpu-hpl parses a full sample output file."""
benchmark = DtkHplBenchmark(
'gpu-hpl', parameters='--p 4 --q 1 --n 8192 --nb 512 --bcast 1 --warmup 1 --iterations 5'
)
self.assertTrue(benchmark._preprocess())
self._write_output_file(benchmark, self._load_data_file('gpu_hpl_sample.out'))
self.assertTrue(benchmark._process_raw_result(0, 'stdout noise'))
workload = 'WC11R2R32_TTN8_P4_Q1_N8192_NB512'
self.assertEqual(545.0, benchmark.result[f'{workload}_flops'][0])
self.assertEqual(0.67, benchmark.result[f'{workload}_time'][0])
self.assertEqual(1, benchmark.result[f'{workload}_tests_pass'][0])
def test_dtk_hpl_mxp_result_parsing_with_sample_output_file(self):
"""Test DTK gpu-hpl-mxp parses a full sample output file."""
benchmark = DtkHplMxpBenchmark(
'gpu-hpl-mxp', parameters='--p 4 --q 1 --n 8192 --nb 4096 --bcast 1 --iterations 6'
)
self.assertTrue(benchmark._preprocess())
self._write_output_file(benchmark, self._load_data_file('gpu_hpl_mxp_sample.out'))
self.assertTrue(benchmark._process_raw_result(0, 'stdout noise'))
workload = 'WC1_P4_Q1_N8192_NB4096'
self.assertEqual(3767.0, benchmark.result[f'{workload}_flops'][0])
self.assertEqual(0.78, benchmark.result[f'{workload}_time'][0])
self.assertEqual(1, benchmark.result[f'{workload}_tests_pass'][0])
def test_dtk_hpl_result_parsing_fails_when_output_file_is_missing(self):
"""Test DTK gpu-hpl parsing fails when generated output file is missing."""
benchmark = DtkHplBenchmark('gpu-hpl')
self.assertTrue(benchmark._preprocess())
self.assertFalse(benchmark._process_raw_result(0, 'stdout noise'))
if __name__ == '__main__':
unittest.main()
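The reduce expectations in the tests above (e.g. median FLOPS 650.0 with time ≈ 0.267 s, and max FLOPS 545.0 paired with the minimum time 0.67 s) are consistent with reducing FLOPS directly while reducing time through its reciprocal rate. A sketch of that aggregation, with semantics inferred from the asserted test values rather than taken from the implementation:

```python
from statistics import median

def reduce_runs(flops, times, op=median):
    """Reduce per-iteration HPL results: apply `op` to the FLOPS
    samples, and derive time from the reduced reciprocal rate so
    that faster runs dominate (max FLOPS pairs with min time)."""
    reduced_flops = op(flops)
    reduced_time = 1 / op(1 / t for t in times)
    return reduced_flops, reduced_time

print(reduce_runs([500.0, 600.0, 700.0, 800.0], [0.80, 0.40, 0.20, 0.10]))
```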
================================================================================
HPLinpack 2.2 -- High-Performance Linpack benchmark -- February 24, 2016
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 8192
NB : 4096
PMAP : Column-major process mapping
P : 4
Q : 1
BCAST : 1ringM
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.78 3.742e+03
HPLMXP_pdgesv() start time Wed Apr 22 00:00:00 2026
HPLMXP_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time D bcast . . : 0.46
Max aggregated wall time L bcast . . : 0.00
Max aggregated wall time U bcast . . : 0.40
Max aggregated wall time update . . : 0.02
Max aggregated wall time Iter Refine : 0.01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.80 3.665e+03
HPLMXP_pdgesv() start time Wed Apr 22 00:00:00 2026
HPLMXP_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time D bcast . . : 0.48
Max aggregated wall time L bcast . . : 0.00
Max aggregated wall time U bcast . . : 0.41
Max aggregated wall time update . . : 0.02
Max aggregated wall time Iter Refine : 0.01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.78 3.767e+03
HPLMXP_pdgesv() start time Wed Apr 22 00:00:00 2026
HPLMXP_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time D bcast . . : 0.46
Max aggregated wall time L bcast . . : 0.00
Max aggregated wall time U bcast . . : 0.39
Max aggregated wall time update . . : 0.02
Max aggregated wall time Iter Refine : 0.01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.79 3.727e+03
HPLMXP_pdgesv() start time Wed Apr 22 00:00:00 2026
HPLMXP_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time D bcast . . : 0.47
Max aggregated wall time L bcast . . : 0.00
Max aggregated wall time U bcast . . : 0.40
Max aggregated wall time update . . : 0.02
Max aggregated wall time Iter Refine : 0.01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.79 3.724e+03
HPLMXP_pdgesv() start time Wed Apr 22 00:00:00 2026
HPLMXP_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time D bcast . . : 0.47
Max aggregated wall time L bcast . . : 0.00
Max aggregated wall time U bcast . . : 0.40
Max aggregated wall time update . . : 0.02
Max aggregated wall time Iter Refine : 0.01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.79 3.713e+03
HPLMXP_pdgesv() start time Wed Apr 22 00:00:00 2026
HPLMXP_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time D bcast . . : 0.47
Max aggregated wall time L bcast . . : 0.00
Max aggregated wall time U bcast . . : 0.40
Max aggregated wall time update . . : 0.02
Max aggregated wall time Iter Refine : 0.01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
================================================================================
Finished 6 tests with the following results:
6 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
================================================================================
HPLinpack 2.2 -- High-Performance Linpack benchmark -- February 24, 2016
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 8192
NB : 512
PMAP : Column-major process mapping
P : 4
Q : 1
PFACT : Right
NBMIN : 32
NDIV : 2
RFACT : Right
BCAST : 1ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : transposed form
U : transposed form
EQUIL : no
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.71 5.167e+02
HPL_pdgesv() start time Wed Apr 22 00:00:00 2026
HPL_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
+ Max aggregated wall time rfact . . : 0.32
+ + Max aggregated wall time pfact . : 0.30
+ + Max aggregated wall time mxswp . : 0.19
Max aggregated wall time laswp . . . : 0.37
Max aggregated wall time update . . : 0.00
Max aggregated wall time up tr sv . : 0.00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.69 5.338e+02
HPL_pdgesv() start time Wed Apr 22 00:00:00 2026
HPL_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
+ Max aggregated wall time rfact . . : 0.30
+ + Max aggregated wall time pfact . : 0.30
+ + Max aggregated wall time mxswp . : 0.18
Max aggregated wall time laswp . . . : 0.36
Max aggregated wall time update . . : 0.00
Max aggregated wall time up tr sv . : 0.00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.67 5.437e+02
HPL_pdgesv() start time Wed Apr 22 00:00:00 2026
HPL_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
+ Max aggregated wall time rfact . . : 0.30
+ + Max aggregated wall time pfact . : 0.29
+ + Max aggregated wall time mxswp . : 0.18
Max aggregated wall time laswp . . . : 0.36
Max aggregated wall time update . . : 0.00
Max aggregated wall time up tr sv . : 0.00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.67 5.450e+02
HPL_pdgesv() start time Wed Apr 22 00:00:00 2026
HPL_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
+ Max aggregated wall time rfact . . : 0.30
+ + Max aggregated wall time pfact . : 0.29
+ + Max aggregated wall time mxswp . : 0.18
Max aggregated wall time laswp . . . : 0.36
Max aggregated wall time update . . : 0.00
Max aggregated wall time up tr sv . : 0.00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.68 5.391e+02
HPL_pdgesv() start time Wed Apr 22 00:00:00 2026
HPL_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
+ Max aggregated wall time rfact . . : 0.30
+ + Max aggregated wall time pfact . : 0.29
+ + Max aggregated wall time mxswp . : 0.18
Max aggregated wall time laswp . . . : 0.36
Max aggregated wall time update . . : 0.00
Max aggregated wall time up tr sv . : 0.00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.69 5.330e+02
HPL_pdgesv() start time Wed Apr 22 00:00:00 2026
HPL_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
+ Max aggregated wall time rfact . . : 0.31
+ + Max aggregated wall time pfact . : 0.30
+ + Max aggregated wall time mxswp . : 0.19
Max aggregated wall time laswp . . . : 0.36
Max aggregated wall time update . . : 0.00
Max aggregated wall time up tr sv . : 0.00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
================================================================================
Finished 6 tests with the following results:
6 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
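The result rows in the sample outputs above follow a fixed `T/V N NB P Q Time Gflops` layout, so each iteration can be recovered with a simple line pattern. A minimal sketch of such a parser (not the benchmark's actual implementation; headers, separators, and residual-check lines are skipped because they never match the numeric column pattern):

```python
import re

# Matches result rows like 'WC11R2R32  8192  512  4  1  0.67  5.450e+02'.
_RESULT_RE = re.compile(
    r'^(?P<tv>\S+)\s+(?P<n>\d+)\s+(?P<nb>\d+)\s+(?P<p>\d+)\s+(?P<q>\d+)'
    r'\s+(?P<time>\d+\.\d+)\s+(?P<gflops>\d+\.\d+[eE][+-]\d+)\s*$'
)

def parse_hpl_results(text):
    """Return (time_seconds, gflops) for each HPL/HPL-MxP result row."""
    results = []
    for line in text.splitlines():
        match = _RESULT_RE.match(line.strip())
        if match:
            results.append((float(match['time']), float(match['gflops'])))
    return results

SAMPLE = """\
 T/V                N    NB     P     Q         Time          Gflops
--------------------------------------------------------------------------------
 WC11R2R32       8192   512     4     1         0.67       5.450e+02
"""
print(parse_hpl_results(SAMPLE))
# → [(0.67, 545.0)]
```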
......@@ -16,7 +16,7 @@ ROCM_VER ?= $(shell hipconfig -R | grep -oP '\d+\.\d+\.\d+' || echo "0.0.0")
NUM_MAKE_JOBS ?= $(shell nproc --ignore=2)
.PHONY: all cuda_with_msccl cuda rocm dtk common cuda_cutlass cuda_bandwidthTest cuda_nccl_tests cuda_perftest cuda_msccl rocm_perftest fio rocm_rccl_tests rocm_rocblas rocm_bandwidthTest gpcnet cuda_gpuburn cpu_stream cpu_hpl directx_amf_encoding_latency directx_amd rocm_hipblaslt rocm_babelstream_hip megatron_lm megatron_deepspeed apex_rocm nvbandwidth rocm_megatron_lm rocm_hpcg
.PHONY: all cuda_with_msccl cuda rocm dtk common cuda_cutlass cuda_bandwidthTest cuda_nccl_tests cuda_perftest cuda_msccl rocm_perftest fio rocm_rccl_tests rocm_rocblas rocm_bandwidthTest gpcnet cuda_gpuburn cpu_stream cpu_hpl directx_amf_encoding_latency directx_amd rocm_hipblaslt rocm_babelstream_hip megatron_lm megatron_deepspeed apex_rocm nvbandwidth rocm_megatron_lm rocm_hpcg rocm_hpl rocm_hpl_mxp
# Build targets.
all: cuda rocm
@@ -24,7 +24,7 @@ all: cuda rocm
cuda_with_msccl: cuda cuda_msccl
cuda: common cuda_cutlass cuda_bandwidthTest cuda_nccl_tests cuda_perftest gpcnet cuda_gpuburn megatron_lm megatron_deepspeed nvbandwidth
rocm: common rocm_perftest rocm_rccl_tests rocm_rocblas rocm_bandwidthTest rocm_hipblaslt megatron_deepspeed apex_rocm rocm_megatron_lm
dtk: common rocm_perftest rocm_rccl_tests rocm_babelstream_hip megatron_deepspeed apex_rocm rocm_megatron_lm rocm_hpcg
dtk: common rocm_perftest rocm_rccl_tests rocm_babelstream_hip megatron_deepspeed apex_rocm rocm_megatron_lm rocm_hpcg rocm_hpl rocm_hpl_mxp
cpu: common cpu_perftest
common: fio cpu_stream
@@ -202,6 +202,37 @@ ifneq (,$(wildcard rocHPCG/install.sh))
chmod +x $(SB_MICRO_PATH)/bin/rochpcg $(SB_MICRO_PATH)/bin/run_rochpcg
endif
# Build rocHPL and stage the binary for SuperBench DTK image.
rocm_hpl: sb_micro_path
ifneq (,$(wildcard rocHPL/install.sh))
cd ./rocHPL && \
git apply ../rochpl_dtk26.patch && \
./install.sh --with-rocm=$(ROCM_PATH) \
--with-rocblas=$(ROCM_PATH)/rocblas \
--with-mpi=$(MPI_HOME) \
--arch=$$(paste -sd ',' $(ROCM_PATH)/bin/target.lst)
cp -v ./rocHPL/build/bin/rochpl $(SB_MICRO_PATH)/bin/
cp -v ./rocHPL/build/HPL.dat $(SB_MICRO_PATH)/bin/
cp -v ./run_rochpl.sh $(SB_MICRO_PATH)/bin/run_rochpl
chmod +x $(SB_MICRO_PATH)/bin/rochpl $(SB_MICRO_PATH)/bin/run_rochpl
endif
# Build rocHPL-MxP and stage the binary and run script for SuperBench DTK image.
rocm_hpl_mxp: sb_micro_path
ifneq (,$(wildcard rocHPL-MxP/install.sh))
cd ./rocHPL-MxP && \
git apply ../rochplmxp_dtk26.patch && \
./install.sh --with-rocm=$(ROCM_PATH) \
--with-rocblas=$(ROCM_PATH)/rocblas \
--with-rocsolver=$(ROCM_PATH)/rocsolver \
--with-mpi=$(MPI_HOME) \
--arch=$$(paste -sd ',' $(ROCM_PATH)/bin/target.lst)
cp -v ./rocHPL-MxP/build/bin/rochplmxp $(SB_MICRO_PATH)/bin/
cp -v ./rocHPL-MxP/build/HPL-MxP.dat $(SB_MICRO_PATH)/bin/
cp -v ./run_rochplmxp.sh $(SB_MICRO_PATH)/bin/run_rochplmxp
chmod +x $(SB_MICRO_PATH)/bin/rochplmxp $(SB_MICRO_PATH)/bin/run_rochplmxp
endif
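Both rules pass `--arch=$$(paste -sd ',' $(ROCM_PATH)/bin/target.lst)` to `install.sh`, collapsing the newline-separated target list shipped with ROCm into the comma-separated form the installer expects. A minimal illustration with hypothetical file contents:

```shell
# Simulate $(ROCM_PATH)/bin/target.lst: one gfx target per line.
printf 'gfx906\ngfx90a\ngfx942\n' > target.lst

# paste -s serializes all lines into one; -d ',' joins them with commas,
# yielding the value handed to ./install.sh --arch=...
ARCH_LIST=$(paste -sd ',' target.lst)
echo "$ARCH_LIST"
```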
# Build GPCNET from commit c56fd9.
gpcnet: sb_micro_path
bash -c "source ${HPCX_HOME}/hpcx-init.sh && hpcx_load && make CC=mpicc -C GPCNET all && hpcx_unload"
Subproject commit 6f06f5127bde6ec7a7900814c560520e3f7c8ca3
Subproject commit 5b6a169d5428e5b652341ebbad969fd2937762d4
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 6b80b24..30d3c9d 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -83,7 +83,7 @@ foreach(i ${rochpl_device_source})
endforeach()
# HIP flags workaround while target_compile_options does not work
-list(APPEND HIP_HIPCC_FLAGS "-Wno-unused-command-line-argument -Wno-deprecated-declarations -fPIE -fopenmp")
+list(APPEND HIP_HIPCC_FLAGS "-Wno-unused-command-line-argument -Wno-deprecated-declarations -fPIE -fopenmp --gpu-max-threads-per-block=1024")
list(APPEND CMAKE_HOST_FLAGS "-Wno-deprecated-declarations")
if (CMAKE_BUILD_TYPE STREQUAL "Debug")
@@ -95,24 +95,35 @@ else()
endif()
# GPU arch targets
-set(TARGETS "gfx900;gfx906")
-if(HIP_VERSION VERSION_GREATER_EQUAL "3.7")
- set(TARGETS "${TARGETS};gfx908")
-endif()
-if(HIP_VERSION VERSION_GREATER_EQUAL "4.3")
- set(TARGETS "${TARGETS};gfx90a")
-endif()
-if (HIP_VERSION VERSION_GREATER_EQUAL "5.7")
- set(TARGETS "${TARGETS};gfx942")
-endif()
-if (HIP_VERSION VERSION_GREATER_EQUAL "6.5")
- set(TARGETS "${TARGETS};gfx950;gfx1100")
+set(ARCHS "")
+if(DEFINED HPL_BUILD_ARCH AND NOT HPL_BUILD_ARCH STREQUAL "")
+ string(REPLACE "," ";" ARCHS "${HPL_BUILD_ARCH}")
+ list(TRANSFORM ARCHS STRIP)
+ list(REMOVE_DUPLICATES ARCHS)
+ message(STATUS "Using manually specified GPU targets: ${ARCHS}")
+else()
+ message(STATUS "Detecting available architecture")
+ find_program(ROCMINFO_EXECUTABLE rocminfo)
+ if(ROCMINFO_EXECUTABLE)
+ execute_process(
+ COMMAND ${ROCMINFO_EXECUTABLE}
+ OUTPUT_VARIABLE ROCMINFO_OUTPUT
+ ERROR_QUIET
+ OUTPUT_STRIP_TRAILING_WHITESPACE)
+
+ string(REGEX MATCHALL "Name:[ \t]+gfx[0-9a-z]+" ARCH_MATCHES "${ROCMINFO_OUTPUT}")
+ string(REGEX REPLACE "Name:[ \t]+" "" ARCHS "${ARCH_MATCHES}")
+ list(REMOVE_DUPLICATES ARCHS)
+ endif()
endif()
-if (HIP_VERSION VERSION_GREATER_EQUAL "7.0")
- set(TARGETS "${TARGETS};gfx1201")
+
+if(ARCHS STREQUAL "")
+ message(FATAL_ERROR "No GPU architectures detected via rocminfo and no BUILD_ARCH specified. Use ./install.sh --arch=gfxXXX")
endif()
-foreach(target ${TARGETS})
+message(STATUS "Building for GPU architecture: ${ARCHS}")
+
+foreach(target ${ARCHS})
list(APPEND HIP_HIPCC_FLAGS "--offload-arch=${target}")
endforeach()
@@ -176,7 +187,7 @@ if(MPI_GTL)
target_link_libraries(rochpl PRIVATE "${GTL_LIB}")
endif()
-set_target_properties(rochpl PROPERTIES HIP_ARCHITECTURES "${DEFAULT_AMDGPU_TARGETS}")
+set_target_properties(rochpl PROPERTIES HIP_ARCHITECTURES "${ARCHS}")
# Configure a header file to pass the rocHPL version
configure_file("${CMAKE_CURRENT_SOURCE_DIR}/include/hpl_version.hpp.in"
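The arch-detection fallback this patch adds to CMake scrapes `rocminfo` output for `Name: gfx...` agent lines, strips the label, and de-duplicates. The same extraction can be sketched in shell (the `rocminfo.txt` contents below are an assumed stand-in for real agent output, which lists CPU agents too):

```shell
# Stand-in for rocminfo output: CPU agents must be ignored,
# duplicate GPU agents must collapse to one entry.
cat > rocminfo.txt <<'EOF'
  Name:                    AMD EPYC 7763
  Name:                    gfx90a
  Name:                    gfx90a
EOF

# Mirror the patched CMake logic: match "Name: gfx...",
# drop the "Name:" label, and de-duplicate.
ARCHS=$(grep -oE 'Name:[[:space:]]+gfx[0-9a-z]+' rocminfo.txt \
        | sed -E 's/Name:[[:space:]]+//' | sort -u)
echo "$ARCHS"
```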
diff --git a/cmake/Dependencies.cmake b/cmake/Dependencies.cmake
index 6d6be5d..ed4813a 100644
--- a/cmake/Dependencies.cmake
+++ b/cmake/Dependencies.cmake
@@ -101,7 +101,7 @@ if(NOT ROCM_FOUND)
execute_process(COMMAND ${CMAKE_COMMAND} -E tar xzf ${PROJECT_EXTERN_DIR}/rocm-cmake-${rocm_cmake_tag}.zip
WORKING_DIRECTORY ${PROJECT_EXTERN_DIR})
- find_package(ROCmCMakeBuildTools REQUIRED CONFIG PATHS ${PROJECT_EXTERN_DIR}/rocm-cmake-${rocm_cmake_tag})
+ set(CMAKE_MODULE_PATH "${PROJECT_EXTERN_DIR}/rocm-cmake-${rocm_cmake_tag}/share/rocm/cmake;${CMAKE_MODULE_PATH}")
endif()
include(ROCMSetupVersion)
diff --git a/install.sh b/install.sh
index b30a3fb..75900d8 100755
--- a/install.sh
+++ b/install.sh
@@ -2,7 +2,7 @@
# Author: Nico Trost
# Modified by: Noel Chalmers
-#set -x #echo on
+# set -euo pipefail
# #################################################
# helper functions
@@ -17,6 +17,7 @@ function display_help()
echo " [--with-rocm=<dir>] Path to ROCm install (Default: /opt/rocm)"
echo " [--with-rocblas=<dir>] Path to rocBLAS library (Default: /opt/rocm/rocblas)"
echo " [--with-mpi=<dir>] Path to external MPI install (Default: clone+build OpenMPI)"
+ echo " [--arch=<archs>] Specify comma separated architecture list to build (Default: detect from rocminfo)"
echo " [--with-mpi-gtl=<dir>] Path to external MPI-GTL install (Optional: defaults to no gtl support)"
echo " [--verbose-print] Verbose output during HPL setup (Default: true)"
echo " [--progress-report] Print progress report to terminal during HPL run (Default: true)"
@@ -33,10 +34,10 @@ supported_distro( )
fi
case "${ID}" in
- debian|linuxmint|ubuntu|centos|rhel|fedora|sles|tencentos)
+ debian|linuxmint|ubuntu|centos|rhel|fedora|sles|tencentos|kylin|rocky)
true
;;
- *) printf "This script is currently supported on Debian, Linuxmint, Ubuntu, CentOS, RHEL, Fedora and SLES\n"
+ *) printf "This script is currently supported on Debian, Linuxmint, Ubuntu, CentOS, RHEL, Fedora, SLES, TencentOS, Kylin and Rocky\n"
exit 2
;;
esac
@@ -68,11 +69,11 @@ exit_with_error( )
printf "sudo apt install -y ${library_dependencies_ubuntu[*]}\n"
;;
- centos|rhel|tencentos)
+ centos|rhel|tencentos|kylin)
printf "sudo yum -y --nogpgcheck install ${library_dependencies_centos[*]}\n"
;;
- fedora)
+ fedora|rocky)
printf "sudo dnf install -y ${library_dependencies_fedora[*]}\n"
;;
@@ -224,6 +225,7 @@ verbose_print=true
progress_report=true
detailed_timing=true
enable_tracing=false
+arch=
# #################################################
# Parameter parsing
@@ -232,7 +234,7 @@ enable_tracing=false
# check if we have a modern version of getopt that can handle whitespace and long parameters
getopt -T
if [[ $? -eq 4 ]]; then
- GETOPT_PARSE=$(getopt --name "${0}" --longoptions help,debug,prefix:,with-rocm:,with-mpi:,with-mpi-gtl:,with-rocblas:,verbose-print:,progress-report:,detailed-timing:,enable-tracing: --options hg -- "$@")
+ GETOPT_PARSE=$(getopt --name "${0}" --longoptions help,debug,prefix:,with-rocm:,with-mpi:,with-mpi-gtl:,with-rocblas:,verbose-print:,progress-report:,detailed-timing:,enable-tracing:,arch: --options hg -- "$@")
else
echo "Need a new version of getopt"
exit_with_error 1
@@ -263,6 +265,9 @@ while true; do
--with-mpi)
with_mpi=${2}
shift 2 ;;
+ --arch)
+ arch=${2}
+ shift 2 ;;
--with-mpi-gtl)
with_mpi_gtl=${2}
shift 2 ;;
@@ -294,9 +299,6 @@ printf "\033[32mCreating project build directory in: \033[33m${build_dir}\033[0m
# #################################################
# prep
# #################################################
-# ensure a clean build environment
-rm -rf ${build_dir}
-
# Default cmake executable is called cmake
cmake_executable=cmake
@@ -347,11 +349,14 @@ pushd .
if [[ "${enable_tracing}" == on || "${enable_tracing}" == true || "${enable_tracing}" == 1 || "${enable_tracing}" == enabled ]]; then
cmake_common_options="${cmake_common_options} -DHPL_TRACING=ON"
fi
+ if [[ -n "${arch}" ]]; then
+ cmake_common_options="${cmake_common_options} -DHPL_BUILD_ARCH=${arch}"
+ fi
shopt -u nocasematch
# Build library with AMD toolchain because of existence of device kernels
mkdir -p ${build_dir} && cd ${build_dir}
- ${cmake_executable} ${cmake_common_options} ..
+ ${cmake_executable} --fresh ${cmake_common_options} ..
check_exit_code 2
if [[ -e build.ninja ]]; then
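The `--arch` plumbing above follows the script's existing util-linux `getopt` pattern: append the long option (with a trailing `:` for its required argument), then handle it in the `while`/`case` loop. A stripped-down sketch of that pattern, with only two options kept for illustration:

```shell
# Minimal sketch of the getopt pattern install.sh uses; option set
# shortened to --with-rocm and the newly added --arch.
parse_args() {
  local parsed
  parsed=$(getopt --name "$0" --longoptions with-rocm:,arch: --options h -- "$@") || return 1
  eval set -- "$parsed"
  arch=
  while true; do
    case "$1" in
      --with-rocm) with_rocm=$2; shift 2 ;;
      --arch)      arch=$2;      shift 2 ;;
      -h)          shift ;;
      --)          shift; break ;;
    esac
  done
}

parse_args --with-rocm=/opt/rocm --arch=gfx90a,gfx942
echo "arch=$arch"
```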
diff --git a/src/HPL_pdtest.cpp b/src/HPL_pdtest.cpp
index 94a0d3f..3135763 100644
--- a/src/HPL_pdtest.cpp
+++ b/src/HPL_pdtest.cpp
@@ -212,7 +212,7 @@ void HPL_pdtest(HPL_T_test* TEST,
ctime(&current_time_end));
}
#ifdef HPL_PROGRESS_REPORT
- printf("Final Score: %7.4e GFLOPS \n", Gflops);
+ printf("Final Score: %7.9e GFLOPS \n", Gflops);
#endif
}
#ifdef HPL_DETAILED_TIMING
diff --git a/src/pgesv/HPL_pdgesv.cpp b/src/pgesv/HPL_pdgesv.cpp
index d6c99c3..280a9a5 100644
--- a/src/pgesv/HPL_pdgesv.cpp
+++ b/src/pgesv/HPL_pdgesv.cpp
@@ -336,7 +336,7 @@ void HPL_pdgesv(HPL_T_grid* GRID, HPL_T_palg* ALGO, HPL_T_pmat* A) {
printf(" %9.3e |", step_gflops);
#endif
- printf(" %9.3e \n", gflops);
+ printf(" %9.9e \n", gflops);
}
#endif
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 91afcc4..6331291 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -88,7 +88,7 @@ foreach(i ${rochplmxp_device_source})
endforeach()
# HIP flags workaround while target_compile_options does not work
-list(APPEND HIP_HIPCC_FLAGS "-Wno-unused-command-line-argument -fPIE")
+list(APPEND HIP_HIPCC_FLAGS "-Wno-unused-command-line-argument -fPIE --gpu-max-threads-per-block=1024")
list(APPEND CMAKE_HOST_FLAGS "")
if (CMAKE_BUILD_TYPE STREQUAL "Debug")
@@ -100,21 +100,35 @@ else()
endif()
# GPU arch targets
-set(TARGETS "gfx900;gfx906")
-if(HIP_VERSION VERSION_GREATER_EQUAL "3.7")
- set(TARGETS "${TARGETS};gfx908")
-endif()
-if(HIP_VERSION VERSION_GREATER_EQUAL "4.3")
- set(TARGETS "${TARGETS};gfx90a")
-endif()
-if (HIP_VERSION VERSION_GREATER_EQUAL "5.7")
- set(TARGETS "${TARGETS};gfx942")
+set(ARCHS "")
+if(DEFINED HPL_BUILD_ARCH AND NOT HPL_BUILD_ARCH STREQUAL "")
+ string(REPLACE "," ";" ARCHS "${HPL_BUILD_ARCH}")
+ list(TRANSFORM ARCHS STRIP)
+ list(REMOVE_DUPLICATES ARCHS)
+ message(STATUS "Using manually specified GPU targets: ${ARCHS}")
+else()
+ message(STATUS "Detecting available architecture")
+ find_program(ROCMINFO_EXECUTABLE rocminfo)
+ if(ROCMINFO_EXECUTABLE)
+ execute_process(
+ COMMAND ${ROCMINFO_EXECUTABLE}
+ OUTPUT_VARIABLE ROCMINFO_OUTPUT
+ ERROR_QUIET
+ OUTPUT_STRIP_TRAILING_WHITESPACE)
+
+ string(REGEX MATCHALL "Name:[ \t]+gfx[0-9a-z]+" ARCH_MATCHES "${ROCMINFO_OUTPUT}")
+ string(REGEX REPLACE "Name:[ \t]+" "" ARCHS "${ARCH_MATCHES}")
+ list(REMOVE_DUPLICATES ARCHS)
+ endif()
endif()
-if (HIP_VERSION VERSION_GREATER_EQUAL "6.5")
- set(TARGETS "${TARGETS};gfx950")
+
+if(ARCHS STREQUAL "")
+ message(FATAL_ERROR "No GPU architectures detected via rocminfo and no BUILD_ARCH specified. Use ./install.sh --arch=gfxXXX")
endif()
-foreach(target ${TARGETS})
+message(STATUS "Building for GPU architecture: ${ARCHS}")
+
+foreach(target ${ARCHS})
list(APPEND HIP_HIPCC_FLAGS "--offload-arch=${target}")
endforeach()
@@ -173,7 +187,7 @@ set_target_properties(rochplmxp PROPERTIES RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BIN
set_target_properties(rochplmxp PROPERTIES LINKER_LANGUAGE CXX)
-set_target_properties(rochplmxp PROPERTIES HIP_ARCHITECTURES "${DEFAULT_AMDGPU_TARGETS}")
+set_target_properties(rochplmxp PROPERTIES HIP_ARCHITECTURES "${ARCHS}")
# # Configure a header file to pass the rocHPL-MxP version
configure_file("${CMAKE_CURRENT_SOURCE_DIR}/include/hplmxp_version.hpp.in"
diff --git a/cmake/Dependencies.cmake b/cmake/Dependencies.cmake
index 164d06d..041a8e2 100644
--- a/cmake/Dependencies.cmake
+++ b/cmake/Dependencies.cmake
@@ -109,7 +109,7 @@ if(NOT ROCM_FOUND)
execute_process(COMMAND ${CMAKE_COMMAND} -E tar xzf ${PROJECT_EXTERN_DIR}/rocm-cmake-${rocm_cmake_tag}.zip
WORKING_DIRECTORY ${PROJECT_EXTERN_DIR})
- find_package(ROCmCMakeBuildTools REQUIRED CONFIG PATHS ${PROJECT_EXTERN_DIR}/rocm-cmake-${rocm_cmake_tag})
+ set(CMAKE_MODULE_PATH "${PROJECT_EXTERN_DIR}/rocm-cmake-${rocm_cmake_tag}/share/rocm/cmake;${CMAKE_MODULE_PATH}")
endif()
include(ROCMSetupVersion)
diff --git a/install.sh b/install.sh
index de72a20..6f2ef05 100755
--- a/install.sh
+++ b/install.sh
@@ -18,6 +18,7 @@ function display_help()
echo " [--with-rocblas=<dir>] Path to rocBLAS library (Default: /opt/rocm/rocblas)"
echo " [--with-rocsolver=<dir>] Path to rocSOLVER library (Default: /opt/rocm/rocsolver)"
echo " [--with-mpi=<dir>] Path to external MPI install (Default: clone+build OpenMPI)"
+ echo " [--arch=<archs>] Specify comma separated architecture list to build (Default: detect from rocminfo)"
echo " [--verbose-print] Verbose output during HPL setup (Default: true)"
echo " [--enable-tracing] Annotate profiler traces with rocTX markers (Default: false)"
echo " [--progress-report] Print progress report to terminal during HPL run (Default: true)"
@@ -33,10 +34,10 @@ supported_distro( )
fi
case "${ID}" in
- ubuntu|centos|rhel|fedora|sles)
+ ubuntu|centos|rhel|fedora|sles|kylin|rocky)
true
;;
- *) printf "This script is currently supported on Ubuntu, CentOS, RHEL, Fedora and SLES\n"
+ *) printf "This script is currently supported on Ubuntu, CentOS, RHEL, Fedora, SLES, Kylin and Rocky\n"
exit 2
;;
esac
@@ -68,11 +69,11 @@ exit_with_error( )
printf "sudo apt install -y ${library_dependencies_ubuntu[*]}\n"
;;
- centos|rhel)
+ centos|rhel|kylin)
printf "sudo yum -y --nogpgcheck install ${library_dependencies_centos[*]}\n"
;;
- fedora)
+ fedora|rocky)
printf "sudo dnf install -y ${library_dependencies_fedora[*]}\n"
;;
@@ -217,6 +218,7 @@ verbose_print=true
enable_tracing=false
progress_report=true
detailed_timing=true
+arch=
# #################################################
# Parameter parsing
@@ -225,7 +227,7 @@ detailed_timing=true
# check if we have a modern version of getopt that can handle whitespace and long parameters
getopt -T
if [[ $? -eq 4 ]]; then
- GETOPT_PARSE=$(getopt --name "${0}" --longoptions help,debug,prefix:,with-rocm:,with-mpi:,with-rocblas:,with-rocsolver:,verbose-print:,enable-tracing:,progress-report:,detailed-timing: --options hg -- "$@")
+ GETOPT_PARSE=$(getopt --name "${0}" --longoptions help,debug,prefix:,with-rocm:,with-mpi:,with-rocblas:,with-rocsolver:,verbose-print:,enable-tracing:,progress-report:,detailed-timing:,arch: --options hg -- "$@")
else
echo "Need a new version of getopt"
exit_with_error 1
@@ -262,6 +264,9 @@ while true; do
--with-rocsolver)
with_rocsolver=${2}
shift 2 ;;
+ --arch)
+ arch=${2}
+ shift 2 ;;
--verbose-print)
verbose_print=${2}
shift 2 ;;
@@ -335,11 +340,14 @@ pushd .
if [[ "${enable_tracing}" == on || "${enable_tracing}" == true || "${enable_tracing}" == 1 || "${enable_tracing}" == enabled ]]; then
cmake_common_options="${cmake_common_options} -DHPLMXP_TRACING=ON"
fi
+ if [[ -n "${arch}" ]]; then
+ cmake_common_options="${cmake_common_options} -DHPL_BUILD_ARCH=${arch}"
+ fi
shopt -u nocasematch
# Build library with AMD toolchain because of existence of device kernels
mkdir -p ${build_dir} && cd ${build_dir}
- ${cmake_executable} ${cmake_common_options} ..
+ ${cmake_executable} --fresh ${cmake_common_options} ..
check_exit_code 2
make -j$(nproc) install
diff --git a/src/hplmxp_ptest.cpp b/src/hplmxp_ptest.cpp
index 11d0f44..e8b1eee 100644
--- a/src/hplmxp_ptest.cpp
+++ b/src/hplmxp_ptest.cpp
@@ -211,7 +211,7 @@ void HPLMXP_ptest(HPLMXP_T_test& test,
ctime(&current_time_end));
}
#ifdef HPLMXP_PROGRESS_REPORT
- printf("Final Score: %7.4e GFLOPS \n", Gflops);
+ printf("Final Score: %7.9e GFLOPS \n", Gflops);
#endif
}
#ifdef HPLMXP_DETAILED_TIMING
diff --git a/src/pgesv/hplmxp_pgetrf.cpp b/src/pgesv/hplmxp_pgetrf.cpp
index ccbd4c0..0230b44 100644
--- a/src/pgesv/hplmxp_pgetrf.cpp
+++ b/src/pgesv/hplmxp_pgetrf.cpp
@@ -420,7 +420,7 @@ void HPLMXP_pgetrf(HPLMXP_T_grid& grid,
printf(" %9.3e |", step_gflops);
#endif
- printf(" %9.3e \n", gflops);
+ printf(" %9.9e \n", gflops);
}
#endif