Unverified commit 4fa10f4d authored by one, committed by GitHub

Benchmarks: Add gpu-hpl and gpu-hpl-mxp micro benchmarks (#15)

Add gpu-hpl and gpu-hpl-mxp micro benchmarks backed by rocHPL and rocHPL-MxP.

Implemented a shared GPU HPL base that:
- Generates per-workload HPL dat files and parses the corresponding output files.
- Supports common HPL inputs such as process grid, matrix size, block size, broadcast topology, warmup, iterations, and reduce operator.
- Adds rocHPL-specific tuning parameters for gpu-hpl.
- Formats metric keys from input-derived workload attributes.
- Reports `flops`, `time`, and `tests_pass` metrics with warmup-aware aggregation.

Add benchmark registrations, parser tests, sample output fixtures, documentation, and recommended configurations for gpu-hpl and gpu-hpl-mxp.

Update rocHPL and rocHPL-MxP third-party integration with build patches, install targets, and SuperBench run helper scripts.

Also update gpu-hpcg metric naming to use flops instead of gflops, remove standalone domain/verification-style metrics from the documented metric surface, and refresh Hygon HPCG documentation/config references accordingly.
parent 88cd18df
@@ -39,3 +39,9 @@
[submodule "third_party/rocHPCG"]
path = third_party/rocHPCG
url = https://github.com/ROCm/rocHPCG.git
[submodule "third_party/rocHPL"]
path = third_party/rocHPL
url = https://github.com/ROCm/rocHPL.git
[submodule "third_party/rocHPL-MxP"]
path = third_party/rocHPL-MxP
url = https://github.com/ROCm/rocHPL-MxP.git
@@ -189,6 +189,107 @@ Performed by [High-Performance Linpack Benchmark for Distributed-Memory Computer
| cpu-hpl/throughput | bandwidth (GFlops) | Compute bandwidth. |
| cpu-hpl/time | time (s) | Time elapsed during HPL run. |
### `gpu-hpl`
#### Introduction
Measure GPU HPL performance for dense linear algebra workloads.
Performed by [rocHPL](https://github.com/ROCm/rocHPL).
#### Parameters
`gpu-hpl` always generates an HPL input `.dat` file from the command-line parameters.
The generated file name and output file name are derived from the same workload prefix used in metric keys.
| Parameter | Default | Description |
|------------------------|---------|-----------------------------------------------------------------------------|
| `--p` | `1` | Number of rows in the MPI process grid. |
| `--q` | `1` | Number of columns in the MPI process grid. |
| `--local-p` | | Optional number of rows in the node-local MPI process grid. |
| `--local-q` | | Optional number of columns in the node-local MPI process grid. |
| `--n` | `45312` | Global matrix size. |
| `--nb` | `384` | Panel/block size. |
| `--warmup` | `0` | Number of warmup HPL runs to exclude from result aggregation. |
| `--iterations` | `1` | Number of measured HPL runs to include in result aggregation. |
| `--reduce-op`          | `max`   | Reduce operator applied to measured FLOPS values: `mean`, `median`, `max`, `min`. |
| `--pmap` | `1` | Process mapping: `0` for row-major, `1` for column-major. |
| `--bcast` | `0` | Broadcast topology: `0` for 1rg, `1` for 1rM, `2` for 2rg, `3` for 2rM, `4` for Lng, `5` for LnM. |
| `--threshold` | `16.0` | Residual check threshold. |
| `--pfact` | `2` | Panel factorization: `0` for left, `1` for Crout, `2` for right. |
| `--nbmin` | `32` | Recursive stopping criterion. |
| `--ndiv` | `2` | Number of panels in recursion. |
| `--rfact` | `2` | Recursive panel factorization: `0` for left, `1` for Crout, `2` for right. |
| `--depth` | `1` | Lookahead depth. |
| `--swap` | `1` | Swapping algorithm: `0` for binary exchange, `1` for long, `2` for mix. |
| `--swapping-threshold` | `64` | Swapping threshold. |
| `--l1` | `0` | L1 storage form: `0` for transposed, `1` for non-transposed. |
| `--u` | `0` | U storage form: `0` for transposed, `1` for non-transposed. |
| `--equilibration` | `0` | Equilibration: `0` for no, `1` for yes. |
| `--memory-alignment` | `8` | Memory alignment in doubles. |
`--warmup` runs are excluded from result aggregation. `--reduce-op` is applied to the measured FLOPS values.
The reported `_time` metric is reduced in the same performance direction by applying `--reduce-op` to `1 / time`
and then converting the result back to seconds.
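The aggregation semantics above can be sketched as follows. This is a minimal illustration of the documented behavior, not the benchmark's actual implementation, and the helper name `reduce_runs` is hypothetical:

```python
import statistics

def reduce_runs(flops_values, time_values, reduce_op='max'):
    """Aggregate measured HPL runs as documented: the reduce operator is
    applied to FLOPS directly, and to 1/time for the time metric, so both
    metrics are reduced in the same performance direction."""
    ops = {'max': max, 'min': min, 'mean': statistics.mean, 'median': statistics.median}
    op = ops[reduce_op]
    flops = op(flops_values)
    # Reduce reciprocal times, then convert back to seconds.
    time = 1 / op(1 / t for t in time_values)
    return flops, time

# Three measured runs with default --reduce-op max:
# the best run gives 120 GFLOPS, and the fastest run took ~9 s.
print(reduce_runs([100, 120, 110], [10, 9, 9.5]))
```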
#### Metrics
rocHPL reports performance, time, and correctness metrics.
The metric key includes the configured HPL variant, process grid, matrix size, and block size:
`${tv}_P${P}_Q${Q}_N${N}_NB${NB}`.
The `tv` field is based on the rocHPL `T/V` value and includes an extended suffix for `L1`, `U`,
`Equilibration`, and `memory-alignment`. For example, `WC11R2R32_TTN8` uses transposed `L1`,
transposed `U`, no equilibration, and memory alignment `8`.
| Name | Unit | Description |
|-------------------------------------------------------|----------------|--------------------------------------------------|
| `gpu-hpl/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_flops` | FLOPS (GFLOPS) | Throughput for the specified rocHPL run. |
| `gpu-hpl/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_time` | time (s) | Time elapsed during the specified HPL run. |
| `gpu-hpl/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_tests_pass` | | Whether residual checks passed (1: pass, 0: fail). |
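As an illustration, the `tv` key layout described above can be reproduced with a small standalone helper. This sketch mirrors the `format_hpl_extended_tv` logic added in this change; the function name `format_tv` here is ours, not part of the benchmark:

```python
def format_tv(pmap, depth, bcast, rfact, ndiv, pfact, nbmin, l1, u, equilibration, alignment):
    """Build the rocHPL T/V field plus the extended SuperBench suffix."""
    pmap_token = 'R' if pmap == 0 else 'C'      # process mapping
    fact = {0: 'L', 1: 'C', 2: 'R'}             # left / Crout / right
    transpose = lambda v: 'T' if v == 0 else 'N'
    eq = 'N' if equilibration == 0 else 'Y'
    return (f'W{pmap_token}{depth}{bcast}{fact[rfact]}{ndiv}{fact[pfact]}{nbmin}'
            f'_{transpose(l1)}{transpose(u)}{eq}{alignment}')

# The documented example: column-major mapping, depth 1, bcast 1,
# right/right factorization, NBMIN 32, transposed L1/U,
# no equilibration, memory alignment 8.
print(format_tv(1, 1, 1, 2, 2, 2, 32, 0, 0, 0, 8))  # WC11R2R32_TTN8
```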
### `gpu-hpl-mxp`
#### Introduction
Measure GPU HPL-MxP performance for mixed-precision dense linear algebra workloads.
Performed by [rocHPL-MxP](https://github.com/ROCm/rocHPL-MxP).
#### Parameters
`gpu-hpl-mxp` always generates an HPL-MxP input `.dat` file from the command-line parameters.
The generated file name and output file name are derived from the same workload prefix used in metric keys.
| Parameter | Default | Description |
|----------------|---------|-----------------------------------------------------------------------------|
| `--p` | `1` | Number of rows in the MPI process grid. |
| `--q` | `1` | Number of columns in the MPI process grid. |
| `--local-p` | | Optional number of rows in the node-local MPI process grid. |
| `--local-q` | | Optional number of columns in the node-local MPI process grid. |
| `--n` | `61440` | Global matrix size. |
| `--nb` | `2560` | Panel/block size. |
| `--warmup` | `0` | Number of warmup HPL-MxP runs to exclude from result aggregation. |
| `--iterations` | `1` | Number of measured HPL-MxP runs to include in result aggregation. |
| `--reduce-op`  | `max`   | Reduce operator applied to measured FLOPS values: `mean`, `median`, `max`, `min`. |
| `--pmap` | `1` | Process mapping: `0` for row-major, `1` for column-major. |
| `--bcast` | `0` | Broadcast topology: `0` for 1rg, `1` for 1rM, `2` for 2rg, `3` for 2rM, `4` for Lng, `5` for LnM. |
| `--threshold` | `16.0` | Residual check threshold. |
`--warmup` runs are excluded from result aggregation. `--reduce-op` is applied to the measured FLOPS values.
The reported `_time` metric is reduced in the same performance direction by applying `--reduce-op` to `1 / time`
and then converting the result back to seconds.
#### Metrics
rocHPL-MxP reports performance, time, and correctness metrics.
The metric key includes the configured HPL-MxP variant, process grid, matrix size, and block size:
`${tv}_P${P}_Q${Q}_N${N}_NB${NB}`.
The `tv` field is based on the rocHPL-MxP `T/V` value, for example `WC1`.
| Name | Unit | Description |
|-----------------------------------------------------------|----------------|--------------------------------------------------|
| `gpu-hpl-mxp/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_flops` | FLOPS (GFLOPS) | Throughput for the specified rocHPL-MxP run. |
| `gpu-hpl-mxp/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_time` | time (s) | Time elapsed during the specified HPL-MxP run. |
| `gpu-hpl-mxp/${tv}_P${P}_Q${Q}_N${N}_NB${NB}_tests_pass` | | Whether residual checks passed (1: pass, 0: fail). |
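The HPL-MxP `tv` field is much simpler than the rocHPL one; a sketch mirroring the `format_hpl_mxp_tv` helper added in this change:

```python
def format_mxp_tv(pmap, bcast):
    """Build the rocHPL-MxP T/V key: 'W', the process mapping token
    ('R' row-major, 'C' column-major), then the bcast index."""
    return f"W{'R' if pmap == 0 else 'C'}{bcast}"

print(format_mxp_tv(1, 1))  # WC1, the example from the docs above
```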
### `gpu-hpcg`
#### Introduction
@@ -196,6 +297,22 @@ Performed by [High-Performance Linpack Benchmark for Distributed-Memory Computer
Measure GPU HPCG performance for sparse linear algebra and multigrid-style workloads.
Performed by [rocHPCG](https://github.com/ROCm/rocHPCG).
#### Parameters
| Parameter | Default | Description |
|-----------|---------|-----------------------------------------------------------------------------|
| `--npx` | `1` | Number of MPI processes in the x dimension. |
| `--npy` | `1` | Number of MPI processes in the y dimension. |
| `--npz` | `1` | Number of MPI processes in the z dimension. |
| `--nx` | `560` | Local problem size in the x dimension. |
| `--ny` | `280` | Local problem size in the y dimension. |
| `--nz` | `280` | Local problem size in the z dimension. |
| `--rt` | `60` | Benchmark runtime in seconds. |
| `--tol` | `1.0` | Verification control: `0` runs reference verification; non-zero skips it. |
| `--pz` | `0` | Partition boundary in the z process dimension. |
| `--zl` | `--nz` | Local `nz` value for processes with z rank lower than `--pz`. |
| `--zu` | `--nz` | Local `nz` value for processes with z rank greater than or equal to `--pz`. |
#### Metrics
rocHPCG reports performance and time metrics.
@@ -205,9 +322,9 @@ The metric key includes the configured process domain and local problem size:
| Name | Unit | Description |
|--------------------------------------------------------------------------------------------------|------------------|---------------------------------------------------------|
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_gflops` | FLOPS (GFLOPS) | Throughput for the specified rocHPCG operation. |
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_flops` | FLOPS (GFLOPS) | Throughput for the specified rocHPCG operation. |
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_bandwidth` | bandwidth (GB/s) | Bandwidth for the specified rocHPCG operation. |
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_gflops_per_process` | FLOPS (GFLOPS) | Per-process throughput for the specified operation. |
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_flops_per_process` | FLOPS (GFLOPS) | Per-process throughput for the specified operation. |
| `gpu-hpcg/${operation}_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}_bandwidth_per_process` | bandwidth (GB/s) | Per-process bandwidth for the specified operation. |
| `gpu-hpcg/setup_time_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}` | time (s) | Setup phase duration. |
| `gpu-hpcg/optimization_time_p${npx}x${npy}x${npz}_n${nx}x${ny}x${nz}` | time (s) | Optimization phase duration. |
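The gpu-hpcg metric key layout in the table above can be sketched with a small helper; `hpcg_metric_name` is a hypothetical name, only the key format itself comes from the documentation:

```python
def hpcg_metric_name(operation, npx, npy, npz, nx, ny, nz, metric_type):
    """Build a gpu-hpcg metric key from the process domain (npx/npy/npz),
    the local problem size (nx/ny/nz), and the metric type suffix."""
    return f'gpu-hpcg/{operation}_p{npx}x{npy}x{npz}_n{nx}x{ny}x{nz}_{metric_type}'

print(hpcg_metric_name('spmv', 4, 4, 2, 560, 280, 280, 'flops'))
# gpu-hpcg/spmv_p4x4x2_n560x280x280_flops
```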
@@ -16,7 +16,7 @@
r'^(?P<subject>final|ddot|waxpby|spmv|mg|total)_'
r'p(?P<npx>\d+)x(?P<npy>\d+)x(?P<npz>\d+)_'
r'n(?P<nx>\d+)x(?P<ny>\d+)x(?P<nz>\d+)_'
r'(?P<type>gflops|bandwidth|gflops_per_process|bandwidth_per_process)$'
r'(?P<type>flops|bandwidth|flops_per_process|bandwidth_per_process)$'
)
_HPCG_TIME_PATTERN = re.compile(
r'^(?P<subject>setup_time|optimization_time|total_time)_'
@@ -37,9 +37,9 @@
}
_HPCG_PERF_TYPE_ORDER = {
'gflops': 0,
'flops': 0,
'bandwidth': 1,
'gflops_per_process': 2,
'flops_per_process': 2,
'bandwidth_per_process': 3,
}
@@ -7,6 +7,7 @@
from superbench.benchmarks.micro_benchmarks.gemm_flops_performance_base import GemmFlopsBenchmark
from superbench.benchmarks.micro_benchmarks.memory_bw_performance_base import MemBwBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_hpcg_performance_base import GpuHpcgBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_hpl_performance_base import GpuHplBenchmark
from superbench.benchmarks.micro_benchmarks.computation_communication_overlap import ComputationCommunicationOverlap
from superbench.benchmarks.micro_benchmarks.cublas_function import CublasBenchmark
@@ -17,6 +18,8 @@
from superbench.benchmarks.micro_benchmarks.dtk_memory_bw_performance import DtkMemBwBenchmark
from superbench.benchmarks.micro_benchmarks.dtk_gemm_flops_performance import DtkGemmFlopsBenchmark
from superbench.benchmarks.micro_benchmarks.dtk_hpcg_performance import DtkHpcgBenchmark
from superbench.benchmarks.micro_benchmarks.dtk_hpl_performance import DtkHplBenchmark
from superbench.benchmarks.micro_benchmarks.dtk_hpl_mxp_performance import DtkHplMxpBenchmark
from superbench.benchmarks.micro_benchmarks.cuda_gemm_flops_performance import CudaGemmFlopsBenchmark
from superbench.benchmarks.micro_benchmarks.cuda_memory_bw_performance import CudaMemBwBenchmark
from superbench.benchmarks.micro_benchmarks.cuda_nccl_bw_performance import CudaNcclBwBenchmark
@@ -60,6 +63,8 @@
'DiskBenchmark',
'DistInference',
'DtkGemmFlopsBenchmark',
'DtkHplBenchmark',
'DtkHplMxpBenchmark',
'RocmHipBlasLtBenchmark',
'DtkHipBlasLtBenchmark',
'DtkMemBwBenchmark',
@@ -68,6 +73,7 @@
'GpuBurnBenchmark',
'GpuCopyBwBenchmark',
'GpuHpcgBenchmark',
'GpuHplBenchmark',
'GpuStreamBenchmark',
'IBBenchmark',
'IBLoopbackBenchmark',
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
"""Module of the DTK HPL-MxP benchmark."""
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.benchmarks.micro_benchmarks import GpuHplBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_hpl_performance_base import format_hpl_mxp_tv
class DtkHplMxpBenchmark(GpuHplBenchmark):
"""The DTK HPL-MxP benchmark class."""
_default_bin_name = 'run_rochplmxp'
_default_dat_name = 'HPL-MxP.dat'
_default_out_name = 'HPL-MxP.out'
_file_label = 'HPL-MxP'
_default_n = 61440
_default_nb = 2560
def _format_tv(self):
"""Format the expected rocHPL-MxP T/V field from input arguments."""
return format_hpl_mxp_tv(self._args.pmap, self._args.bcast)
def _match_output_n(self):
"""Return whether parsed rocHPL-MxP output N must match the input N."""
return False
def _format_dat_content(self):
"""Format generated rocHPL-MxP input file content."""
return (
'HPLinpack benchmark input file\n'
'Innovative Computing Laboratory, University of Tennessee\n'
f'{self._out_file_name} output file name (if any)\n'
'0 device out (6=stdout,7=stderr,file)\n'
'1 # of problems sizes (N)\n'
f'{self._args.n} Ns\n'
'1 # of NBs\n'
f'{self._args.nb} NBs\n'
f'{self._args.pmap} PMAP process mapping (0=Row-,1=Column-major)\n'
f'{self._args.p} P\n'
f'{self._args.q} Q\n'
f'{self._args.threshold} threshold\n'
'1 # of broadcast\n'
f'{self._args.bcast} BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)\n'
)
BenchmarkRegistry.register_benchmark('gpu-hpl-mxp', DtkHplMxpBenchmark, platform=Platform.DTK)
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
"""Module of the DTK HPL benchmark."""
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.benchmarks.micro_benchmarks import GpuHplBenchmark
from superbench.benchmarks.micro_benchmarks.gpu_hpl_performance_base import format_hpl_extended_tv, format_hpl_tv
class DtkHplBenchmark(GpuHplBenchmark):
"""The DTK HPL benchmark class."""
_default_bin_name = 'run_rochpl'
_default_dat_name = 'HPL.dat'
_default_out_name = 'HPL.out'
_file_label = 'HPL'
def _add_variant_parser_arguments(self):
"""Add rocHPL-specific arguments."""
self._parser.add_argument(
'--pfact',
dest='pfact',
type=int,
default=2,
choices=[0, 1, 2],
required=False,
help='Panel factorization: 0 for left, 1 for Crout, 2 for right.',
)
self._parser.add_argument(
'--nbmin',
dest='nbmin',
type=int,
default=32,
required=False,
help='Recursive stopping criterion.',
)
self._parser.add_argument(
'--ndiv',
dest='ndiv',
type=int,
default=2,
required=False,
help='Number of panels in recursion.',
)
self._parser.add_argument(
'--rfact',
dest='rfact',
type=int,
default=2,
choices=[0, 1, 2],
required=False,
help='Recursive panel factorization: 0 for left, 1 for Crout, 2 for right.',
)
self._parser.add_argument(
'--depth',
dest='depth',
type=int,
default=1,
required=False,
help='Lookahead depth.',
)
self._parser.add_argument(
'--swap',
dest='swap',
type=int,
default=1,
choices=[0, 1, 2],
required=False,
help='Swapping algorithm: 0 for binary exchange, 1 for long, 2 for mix.',
)
self._parser.add_argument(
'--swapping-threshold',
dest='swapping_threshold',
type=int,
default=64,
required=False,
help='Swapping threshold.',
)
self._parser.add_argument(
'--l1',
dest='l1',
type=int,
default=0,
choices=[0, 1],
required=False,
help='L1 storage form: 0 for transposed, 1 for non-transposed.',
)
self._parser.add_argument(
'--u',
dest='u',
type=int,
default=0,
choices=[0, 1],
required=False,
help='U storage form: 0 for transposed, 1 for non-transposed.',
)
self._parser.add_argument(
'--equilibration',
dest='equilibration',
type=int,
default=0,
choices=[0, 1],
required=False,
help='Equilibration: 0 for no, 1 for yes.',
)
self._parser.add_argument(
'--memory-alignment',
dest='memory_alignment',
type=int,
default=8,
required=False,
help='Memory alignment in double.',
)
def _format_tv(self):
"""Format the expected rocHPL T/V field from input arguments."""
return format_hpl_extended_tv(
self._args.pmap,
self._args.depth,
self._args.bcast,
self._args.rfact,
self._args.ndiv,
self._args.pfact,
self._args.nbmin,
self._args.l1,
self._args.u,
self._args.equilibration,
self._args.memory_alignment,
)
def _format_output_tv(self):
"""Format the expected rocHPL T/V field in generated output."""
return format_hpl_tv(
self._args.pmap,
self._args.depth,
self._args.bcast,
self._args.rfact,
self._args.ndiv,
self._args.pfact,
self._args.nbmin,
)
def _format_dat_content(self):
"""Format generated rocHPL input file content."""
return (
'HPLinpack benchmark input file\n'
'Innovative Computing Laboratory, University of Tennessee\n'
f'{self._out_file_name} output file name (if any)\n'
'0 device out (6=stdout,7=stderr,file)\n'
'1 # of problems sizes (N)\n'
f'{self._args.n} Ns\n'
'1 # of NBs\n'
f'{self._args.nb} NBs\n'
f'{self._args.pmap} PMAP process mapping (0=Row-,1=Column-major)\n'
'1 # of process grids (P x Q)\n'
f'{self._args.p} Ps\n'
f'{self._args.q} Qs\n'
f'{self._args.threshold} threshold\n'
'1 # of panel fact\n'
f'{self._args.pfact} PFACTs (0=left, 1=Crout, 2=Right)\n'
'1 # of recursive stopping criterium\n'
f'{self._args.nbmin} NBMINs (>= 1)\n'
'1 # of panels in recursion\n'
f'{self._args.ndiv} NDIVs\n'
'1 # of recursive panel fact.\n'
f'{self._args.rfact} RFACTs (0=left, 1=Crout, 2=Right)\n'
'1 # of broadcast\n'
f'{self._args.bcast} BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)\n'
'1 # of lookahead depth\n'
f'{self._args.depth} DEPTHs (>=0)\n'
f'{self._args.swap} SWAP (0=bin-exch,1=long,2=mix)\n'
f'{self._args.swapping_threshold} swapping threshold\n'
f'{self._args.l1} L1 in (0=transposed,1=no-transposed) form\n'
f'{self._args.u} U in (0=transposed,1=no-transposed) form\n'
f'{self._args.equilibration} Equilibration (0=no,1=yes)\n'
f'{self._args.memory_alignment} memory alignment in double (> 0)\n'
)
BenchmarkRegistry.register_benchmark('gpu-hpl', DtkHplBenchmark, platform=Platform.DTK)
@@ -169,29 +169,29 @@ def _process_raw_result(self, cmd_idx, raw_output):
parsed_results = {}
required_metrics = {
'final_gflops',
'final_flops',
'final_bandwidth',
'final_gflops_per_process',
'final_flops_per_process',
'final_bandwidth_per_process',
'ddot_gflops',
'ddot_flops',
'ddot_bandwidth',
'ddot_gflops_per_process',
'ddot_flops_per_process',
'ddot_bandwidth_per_process',
'waxpby_gflops',
'waxpby_flops',
'waxpby_bandwidth',
'waxpby_gflops_per_process',
'waxpby_flops_per_process',
'waxpby_bandwidth_per_process',
'spmv_gflops',
'spmv_flops',
'spmv_bandwidth',
'spmv_gflops_per_process',
'spmv_flops_per_process',
'spmv_bandwidth_per_process',
'mg_gflops',
'mg_flops',
'mg_bandwidth',
'mg_gflops_per_process',
'mg_flops_per_process',
'mg_bandwidth_per_process',
'total_gflops',
'total_flops',
'total_bandwidth',
'total_gflops_per_process',
'total_flops_per_process',
'total_bandwidth_per_process',
'setup_time',
'optimization_time',
@@ -228,9 +228,9 @@ def _process_raw_result(self, cmd_idx, raw_output):
def _format_metric_name(self, metric):
"""Format a rocHPCG metric with the configured process domain and local problem size."""
metric_suffixes = (
'gflops_per_process',
'flops_per_process',
'bandwidth_per_process',
'gflops',
'flops',
'bandwidth',
)
workload = (
@@ -268,8 +268,8 @@ def _parse_operation_line(self, line, parsed_results):
if len(gflops_values) < 2 or len(bandwidth_values) < 2:
return False
parsed_results[f'{prefix}_gflops'] = gflops_values[0]
parsed_results[f'{prefix}_gflops_per_process'] = gflops_values[1]
parsed_results[f'{prefix}_flops'] = gflops_values[0]
parsed_results[f'{prefix}_flops_per_process'] = gflops_values[1]
parsed_results[f'{prefix}_bandwidth'] = bandwidth_values[0]
parsed_results[f'{prefix}_bandwidth_per_process'] = bandwidth_values[1]
return True
......
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT license.
"""Module of the GPU HPL benchmark base class."""
import os
import re
import statistics
from typing import Optional
from superbench.common.utils import logger
from superbench.benchmarks.micro_benchmarks import MicroBenchmarkWithInvoke
_HPL_RESULT_PATTERN = re.compile(
r'^\s*(?P<tv>W\S+)\s+'
r'(?P<n>\d+)\s+'
r'(?P<nb>\d+)\s+'
r'(?P<p>\d+)\s+'
r'(?P<q>\d+)\s+'
r'(?P<time>[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)\s+'
r'(?P<flops>[+-]?(?:\d+(?:\.\d*)?|\.\d+)(?:[eE][+-]?\d+)?)\s*$'
)
_HPL_RESIDUAL_PATTERN = re.compile(r'\.\.\.\.\.\.\s+(?P<status>PASSED|FAILED)\s*$', re.IGNORECASE)
def _format_pmap(pmap):
"""Format HPL process mapping token used in the T/V field."""
return 'R' if pmap == 0 else 'C'
def _format_fact(fact):
"""Format HPL panel factorization token used in the T/V field."""
fact_tokens = {
0: 'L',
1: 'C',
2: 'R',
}
return fact_tokens[fact]
def _format_transpose(value):
"""Format HPL transposed/no-transposed token used in the extended T/V field."""
return 'T' if value == 0 else 'N'
def _format_equilibration(value):
"""Format HPL equilibration token used in the extended T/V field."""
return 'N' if value == 0 else 'Y'
def format_hpl_mxp_tv(pmap, bcast):
"""Format the rocHPL-MxP T/V field from input parameters."""
return f'W{_format_pmap(pmap)}{bcast}'
def format_hpl_tv(pmap, depth, bcast, rfact, ndiv, pfact, nbmin):
"""Format the rocHPL T/V field from input parameters."""
return (
f'W{_format_pmap(pmap)}'
f'{depth}'
f'{bcast}'
f'{_format_fact(rfact)}'
f'{ndiv}'
f'{_format_fact(pfact)}'
f'{nbmin}'
)
def format_hpl_extended_tv(pmap, depth, bcast, rfact, ndiv, pfact, nbmin, l1, u, equilibration, alignment):
"""Format the rocHPL T/V field plus SuperBench variant suffix from input parameters."""
return (
f'{format_hpl_tv(pmap, depth, bcast, rfact, ndiv, pfact, nbmin)}'
f'_{_format_transpose(l1)}'
f'{_format_transpose(u)}'
f'{_format_equilibration(equilibration)}'
f'{alignment}'
)
class GpuHplBenchmark(MicroBenchmarkWithInvoke):
"""The GPU HPL benchmark base class."""
_default_bin_name: Optional[str] = None
_default_dat_name: Optional[str] = None
_default_out_name: Optional[str] = None
_file_label: Optional[str] = None
_default_n = 45312
_default_nb = 384
def __init__(self, name, parameters=''):
"""Constructor.
Args:
name (str): benchmark name.
parameters (str): benchmark parameters.
"""
super().__init__(name, parameters)
self._bin_name = self._default_bin_name
self._dat_path = None
self._out_path = None
self._tv = None
self._workload = None
self._dat_file_name = None
self._out_file_name = None
def add_parser_arguments(self):
"""Add the specified arguments."""
super().add_parser_arguments()
self._parser.add_argument(
'--p',
dest='p',
type=int,
default=1,
required=False,
help='Specific MPI grid size: the number of rows in MPI grid.',
)
self._parser.add_argument(
'--q',
dest='q',
type=int,
default=1,
required=False,
help='Specific MPI grid size: the number of columns in MPI grid.',
)
self._parser.add_argument(
'--local-p',
dest='local_p',
type=int,
required=False,
help='Specific node-local MPI grid size: the number of rows in node-local MPI grid.',
)
self._parser.add_argument(
'--local-q',
dest='local_q',
type=int,
required=False,
help='Specific node-local MPI grid size: the number of columns in node-local MPI grid.',
)
self._parser.add_argument(
'--n',
dest='n',
type=int,
default=self._default_n,
required=False,
help='Specific matrix size: the number of rows/columns in global matrix.',
)
self._parser.add_argument(
'--nb',
dest='nb',
type=int,
default=self._default_nb,
required=False,
help='Specific panel size: the number of rows/columns in panels.',
)
self._parser.add_argument(
'--warmup',
type=int,
default=0,
required=False,
help='Number of warmup runs to exclude from result aggregation.',
)
self._parser.add_argument(
'--iterations',
type=int,
default=1,
required=False,
help='Number of measurement runs to include in result aggregation.',
)
self._parser.add_argument(
'--reduce-op',
dest='reduce_op',
type=str,
default='max',
choices=['mean', 'median', 'max', 'min'],
required=False,
help='Reduce operator for aggregating measurement runs by FLOPS.',
)
self._parser.add_argument(
'--pmap',
dest='pmap',
type=int,
default=1,
choices=[0, 1],
required=False,
help='Process mapping: 0 for row-major, 1 for column-major.',
)
self._parser.add_argument(
'--bcast',
dest='bcast',
type=int,
default=0,
choices=[0, 1, 2, 3, 4, 5],
required=False,
help='Broadcast topology: 0 for 1rg, 1 for 1rM, 2 for 2rg, 3 for 2rM, 4 for Lng, 5 for LnM.',
)
self._parser.add_argument(
'--threshold',
type=float,
default=16.0,
required=False,
help='Residual check threshold.',
)
self._add_variant_parser_arguments()
def _add_variant_parser_arguments(self):
"""Add benchmark variant-specific arguments."""
pass
def _preprocess(self):
"""Preprocess/preparation operations before benchmarking."""
if not super()._preprocess():
return False
if self._args.warmup < 0:
logger.error('warmup should be non-negative, while {} is set.'.format(self._args.warmup))
return False
if self._args.iterations <= 0:
logger.error('iterations should be positive, while {} is set.'.format(self._args.iterations))
return False
self._tv = self._format_tv()
self._workload = self._format_workload()
file_prefix = self._format_file_prefix()
self._dat_file_name = f'{file_prefix}.dat'
self._out_file_name = f'{file_prefix}.out'
self._dat_path = os.path.join(self._args.bin_dir, self._dat_file_name)
self._out_path = os.path.join(self._args.bin_dir, self._out_file_name)
with open(self._dat_path, 'w') as dat_file:
dat_file.write(self._format_dat_content())
bin_path = os.path.join(self._args.bin_dir, self._bin_name)
command = (
f'{bin_path}'
f' -P {self._args.p}'
f' -Q {self._args.q}'
f' --it {self._args.warmup + self._args.iterations}'
f' -i {self._dat_file_name}'
)
if self._args.local_p is not None:
command += f' -p {self._args.local_p}'
if self._args.local_q is not None:
command += f' -q {self._args.local_q}'
self._commands = [command]
return True
def _process_raw_result(self, cmd_idx, raw_output):
"""Parse HPL stdout and generated output file."""
self._result.add_raw_data('raw_output_' + str(cmd_idx), raw_output, self._args.log_raw_data)
if self._out_path is None or not os.path.exists(self._out_path):
logger.error('HPL output file does not exist - path: {}.'.format(self._out_path))
return False
with open(self._out_path, 'r') as output_file:
output = output_file.read()
self._result.add_raw_data('hpl_output_' + str(cmd_idx), output, self._args.log_raw_data)
rows = self._parse_result_rows(output)
end = self._args.warmup + self._args.iterations
if len(rows) < end:
logger.error(
'Insufficient HPL result rows - benchmark: {}, expected: {}, found: {}.'.format(
self._name, end, len(rows)
)
)
return False
measured_rows = rows[self._args.warmup:end]
flops, time = self._reduce_rows(measured_rows)
tests_pass = 1 if all(row['passed'] for row in measured_rows) else 0
self._result.add_result(f'{self._workload}_flops', flops)
self._result.add_result(f'{self._workload}_time', time)
self._result.add_result(f'{self._workload}_tests_pass', tests_pass)
return True
def _parse_result_rows(self, output):
"""Parse matching HPL result rows from generated output content."""
rows = []
pending_row = None
output_tv = self._format_output_tv()
for line in output.splitlines():
result_match = _HPL_RESULT_PATTERN.match(line)
if result_match:
pending_row = {
'tv': result_match.group('tv'),
'n': int(result_match.group('n')),
'nb': int(result_match.group('nb')),
'p': int(result_match.group('p')),
'q': int(result_match.group('q')),
'time': float(result_match.group('time')),
'flops': float(result_match.group('flops')),
}
if pending_row['time'] <= 0:
logger.error(
'Invalid HPL result time - benchmark: {}, time: {}.'.format(self._name, pending_row['time'])
)
pending_row = None
continue
residual_match = _HPL_RESIDUAL_PATTERN.search(line)
if residual_match and pending_row is not None:
if self._is_expected_result_row(pending_row, output_tv):
pending_row['passed'] = residual_match.group('status').upper() == 'PASSED'
rows.append(pending_row)
pending_row = None
return rows
def _is_expected_result_row(self, row, output_tv):
"""Return whether a parsed output row matches the current benchmark input."""
if row['tv'] != output_tv:
return False
if row['nb'] != self._args.nb or row['p'] != self._args.p or row['q'] != self._args.q:
return False
if self._match_output_n() and row['n'] != self._args.n:
return False
return True
def _reduce_rows(self, rows):
"""Reduce measured rows according to FLOPS-oriented reduce semantics."""
flops = self._reduce_values(row['flops'] for row in rows)
reciprocal_time = self._reduce_values(1 / row['time'] for row in rows)
return flops, 1 / reciprocal_time
def _reduce_values(self, values):
"""Reduce values with the configured performance-oriented reduce operator."""
values = list(values)
reduce_op = self._args.reduce_op
if reduce_op == 'max':
return max(values)
if reduce_op == 'min':
return min(values)
if reduce_op == 'mean':
return statistics.mean(values)
return statistics.median(values)
def _format_tv(self):
"""Format the expected T/V field from benchmark input arguments."""
raise NotImplementedError
def _format_output_tv(self):
"""Format the expected T/V field in generated HPL output."""
return self._format_tv()
def _match_output_n(self):
"""Return whether parsed output N must match the input N."""
return True
def _format_workload(self):
"""Format the metric workload suffix from benchmark input arguments."""
return f'{self._tv}_P{self._args.p}_Q{self._args.q}_N{self._args.n}_NB{self._args.nb}'
def _format_file_prefix(self):
"""Format generated HPL.dat/HPL.out file prefix."""
return f'{self._file_label or self._name}-{self._workload}'
def _format_dat_content(self):
"""Format generated HPL.dat content."""
raise NotImplementedError
@@ -154,6 +154,179 @@ superbench:
maxbytes: 16G
ngpus: 1
operation: alltoall
gpu-hpl:r32:
enable: false
modes:
- name: mpi
proc_num: 8
node_num: 4
bind_to: none
mca:
pml: ucx
btl: ^openib
btl_tcp_if_exclude: lo,docker0
coll_hcoll_enable: 0
env:
ROCM_PATH: /opt/dtk
HSA_FORCE_FINE_GRAIN_PCIE: 1
UCX_RNDV_SCHEME: put_zcopy
UCX_RNDV_FRAG_MEM_TYPE: rocm
UCX_MEMTYPE_CACHE: n
parameters:
p: 8
q: 4
n: 512000
nb: 512
bcast: 5
warmup: 1
iterations: 5
gpu-hpl:r16:
enable: false
modes:
- name: mpi
proc_num: 8
node_num: 2
bind_to: none
mca:
pml: ucx
btl: ^openib
btl_tcp_if_exclude: lo,docker0
coll_hcoll_enable: 0
env:
ROCM_PATH: /opt/dtk
HSA_FORCE_FINE_GRAIN_PCIE: 1
UCX_RNDV_SCHEME: put_zcopy
UCX_RNDV_FRAG_MEM_TYPE: rocm
UCX_MEMTYPE_CACHE: n
parameters:
p: 8
q: 2
n: 360448
nb: 512
bcast: 1
warmup: 1
iterations: 5
gpu-hpl:r8:
enable: false
modes:
- name: mpi
proc_num: 8
node_num: 1
bind_to: none
parameters:
p: 4
q: 2
n: 254976
nb: 512
bcast: 1
warmup: 1
iterations: 5
gpu-hpl:r4:
enable: false
modes:
- name: mpi
proc_num: 4
node_num: 1
bind_to: none
parameters:
p: 4
q: 1
n: 180224
nb: 512
bcast: 1
warmup: 1
iterations: 5
gpu-hpl:r2:
enable: false
modes:
- name: mpi
proc_num: 2
node_num: 1
bind_to: none
parameters:
p: 2
q: 1
n: 128000
nb: 512
bcast: 1
warmup: 1
iterations: 5
gpu-hpl:r1:
enable: false
modes:
- name: mpi
proc_num: 1
node_num: 1
bind_to: none
parameters:
p: 1
q: 1
n: 90624
nb: 512
nbmin: 16
bcast: 1
warmup: 1
iterations: 5
gpu-hpl-mxp:r8:
enable: false
modes:
- name: mpi
proc_num: 8
node_num: 1
bind_to: none
parameters:
p: 4
q: 2
n: 344064
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpl-mxp:r4:
enable: false
modes:
- name: mpi
proc_num: 4
node_num: 1
bind_to: none
parameters:
p: 4
q: 1
n: 245760
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpl-mxp:r2:
enable: false
modes:
- name: mpi
proc_num: 2
node_num: 1
bind_to: none
parameters:
p: 2
q: 1
n: 172032
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpl-mxp:r1:
enable: false
modes:
- name: mpi
proc_num: 1
node_num: 1
bind_to: none
parameters:
p: 1
q: 1
n: 122880
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpcg:r32:
enable: false
modes:
......
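The recommended `n` values in the configurations above scale the HPL matrix with the GPU count so that each GPU's share of the double-precision matrix stays roughly constant. A back-of-the-envelope check (a sketch, assuming 8 bytes per element and ignoring panel workspace):

```python
def hpl_matrix_gib_per_gpu(n: int, gpus: int) -> float:
    """Per-GPU share of the N x N double-precision HPL matrix in GiB."""
    return n * n * 8 / gpus / 2**30

# Single-node gpu-hpl sizes from the recommended configuration above.
for n, gpus in [(90624, 1), (128000, 2), (180224, 4), (254976, 8)]:
    print(f'N={n} on {gpus} GPU(s): {hpl_matrix_gib_per_gpu(n, gpus):.1f} GiB/GPU')
```

All four land near 60–61 GiB per GPU, which is the usual HPL practice of filling most of device memory with the matrix.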
......@@ -160,48 +160,127 @@ superbench:
maxbytes: 16G
ngpus: 1
operation: alltoall
gpu-hpcg:r32:
gpu-hpl:r8:
enable: false
modes:
- name: mpi
proc_num: 8
node_num: 4
host_list: [node01, node02, node03, node04]
node_num: 1
bind_to: none
mca:
pml: ob1
btl: ^openib
btl_tcp_if_include: p14p2
coll_hcoll_enable: 0
parameters:
nx: 560
ny: 280
nz: 280
rt: 10
npx: 4
npy: 4
npz: 2
gpu-hpcg:r16:
p: 4
q: 2
n: 256000
nb: 512
bcast: 1
warmup: 1
iterations: 5
gpu-hpl:r4:
enable: false
modes:
- name: mpi
proc_num: 4
node_num: 1
bind_to: none
parameters:
p: 4
q: 1
n: 180224
nb: 512
bcast: 1
warmup: 1
iterations: 5
gpu-hpl:r2:
enable: false
modes:
- name: mpi
proc_num: 2
node_num: 1
bind_to: none
parameters:
p: 2
q: 1
n: 128000
nb: 512
bcast: 1
warmup: 1
iterations: 5
gpu-hpl:r1:
enable: false
modes:
- name: mpi
proc_num: 1
node_num: 1
bind_to: none
parameters:
p: 1
q: 1
n: 90624
nb: 512
bcast: 1
nbmin: 16
warmup: 1
iterations: 5
gpu-hpl-mxp:r8:
enable: false
modes:
- name: mpi
proc_num: 8
node_num: 2
host_list: [node01, node02]
node_num: 1
bind_to: none
mca:
pml: ob1
btl: ^openib
btl_tcp_if_include: p14p2
coll_hcoll_enable: 0
parameters:
nx: 560
ny: 280
nz: 280
rt: 10
npx: 4
npy: 2
npz: 2
p: 4
q: 2
n: 344064
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpl-mxp:r4:
enable: false
modes:
- name: mpi
proc_num: 4
node_num: 1
bind_to: none
parameters:
p: 4
q: 1
n: 245760
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpl-mxp:r2:
enable: false
modes:
- name: mpi
proc_num: 2
node_num: 1
bind_to: none
parameters:
p: 2
q: 1
n: 172032
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpl-mxp:r1:
enable: false
modes:
- name: mpi
proc_num: 1
node_num: 1
bind_to: none
parameters:
p: 1
q: 1
n: 122880
nb: 4096
bcast: 1
warmup: 1
iterations: 5
gpu-hpcg:r8:
enable: false
modes:
......
......@@ -91,39 +91,39 @@ superbench:
categories: HPCG gpu-hpcg:r1
metrics:
- gpu-hpcg:r1/(setup_time|optimization_time|total_time)_p1x1x1_n560x280x280
- gpu-hpcg:r1/(ddot|waxpby|spmv|mg|total|final)_p1x1x1_n560x280x280_(gflops_per_process|bandwidth_per_process|gflops|bandwidth)
- gpu-hpcg:r1/(ddot|waxpby|spmv|mg|total|final)_p1x1x1_n560x280x280_(flops_per_process|bandwidth_per_process|flops|bandwidth)
gpu_hpcg_r2:
statistics: mean
categories: HPCG gpu-hpcg:r2
metrics:
- gpu-hpcg:r2/(setup_time|optimization_time|total_time)_p2x1x1_n560x280x280
- gpu-hpcg:r2/(ddot|waxpby|spmv|mg|total|final)_p2x1x1_n560x280x280_(gflops_per_process|bandwidth_per_process|gflops|bandwidth)
- gpu-hpcg:r2/(ddot|waxpby|spmv|mg|total|final)_p2x1x1_n560x280x280_(flops_per_process|bandwidth_per_process|flops|bandwidth)
gpu_hpcg_r4:
statistics: mean
categories: HPCG gpu-hpcg:r4
metrics:
- gpu-hpcg:r4/(setup_time|optimization_time|total_time)_p2x2x1_n560x280x280
- gpu-hpcg:r4/(ddot|waxpby|spmv|mg|total|final)_p2x2x1_n560x280x280_(gflops_per_process|bandwidth_per_process|gflops|bandwidth)
- gpu-hpcg:r4/(ddot|waxpby|spmv|mg|total|final)_p2x2x1_n560x280x280_(flops_per_process|bandwidth_per_process|flops|bandwidth)
gpu_hpcg_r8:
statistics: mean
categories: HPCG gpu-hpcg:r8
metrics:
- gpu-hpcg:r8/(setup_time|optimization_time|total_time)_p2x2x2_n560x280x280
- gpu-hpcg:r8/(ddot|waxpby|spmv|mg|total|final)_p2x2x2_n560x280x280_(gflops_per_process|bandwidth_per_process|gflops|bandwidth)
- gpu-hpcg:r8/(ddot|waxpby|spmv|mg|total|final)_p2x2x2_n560x280x280_(flops_per_process|bandwidth_per_process|flops|bandwidth)
gpu_hpcg_r16:
statistics: mean
categories: HPCG gpu-hpcg:r16
metrics:
- gpu-hpcg:r16/(setup_time|optimization_time|total_time)_p4x2x2_n560x280x280
- gpu-hpcg:r16/(ddot|waxpby|spmv|mg|total|final)_p4x2x2_n560x280x280_(gflops_per_process|bandwidth_per_process|gflops|bandwidth)
- gpu-hpcg:r16/(ddot|waxpby|spmv|mg|total|final)_p4x2x2_n560x280x280_(flops_per_process|bandwidth_per_process|flops|bandwidth)
gpu_hpcg_r32:
statistics: mean
categories: HPCG gpu-hpcg:r32
metrics:
- gpu-hpcg:r32/(setup_time|optimization_time|total_time)_p4x4x2_n560x280x280
- gpu-hpcg:r32/(ddot|waxpby|spmv|mg|total|final)_p4x4x2_n560x280x280_(gflops_per_process|bandwidth_per_process|gflops|bandwidth)
- gpu-hpcg:r32/(ddot|waxpby|spmv|mg|total|final)_p4x4x2_n560x280x280_(flops_per_process|bandwidth_per_process|flops|bandwidth)
......@@ -103,29 +103,29 @@ def test_dtk_hpcg_result_parsing_with_wrapper_noise(self):
workload = 'p4x4x2_n560x280x280'
expected_results = {
f'final_{workload}_gflops': 6904.9,
f'final_{workload}_gflops_per_process': 215.8,
f'final_{workload}_flops': 6904.9,
f'final_{workload}_flops_per_process': 215.8,
f'final_{workload}_bandwidth': 52359.0,
f'final_{workload}_bandwidth_per_process': 1636.2,
f'ddot_{workload}_gflops': 5849.4,
f'ddot_{workload}_flops': 5849.4,
f'ddot_{workload}_bandwidth': 46794.9,
f'ddot_{workload}_gflops_per_process': 182.8,
f'ddot_{workload}_flops_per_process': 182.8,
f'ddot_{workload}_bandwidth_per_process': 1462.3,
f'waxpby_{workload}_gflops': 3052.0,
f'waxpby_{workload}_flops': 3052.0,
f'waxpby_{workload}_bandwidth': 36623.8,
f'waxpby_{workload}_gflops_per_process': 95.4,
f'waxpby_{workload}_flops_per_process': 95.4,
f'waxpby_{workload}_bandwidth_per_process': 1144.5,
f'spmv_{workload}_gflops': 5473.9,
f'spmv_{workload}_flops': 5473.9,
f'spmv_{workload}_bandwidth': 34468.8,
f'spmv_{workload}_gflops_per_process': 171.1,
f'spmv_{workload}_flops_per_process': 171.1,
f'spmv_{workload}_bandwidth_per_process': 1077.1,
f'mg_{workload}_gflops': 7716.9,
f'mg_{workload}_flops': 7716.9,
f'mg_{workload}_bandwidth': 59557.1,
f'mg_{workload}_gflops_per_process': 241.2,
f'mg_{workload}_flops_per_process': 241.2,
f'mg_{workload}_bandwidth_per_process': 1861.2,
f'total_{workload}_gflops': 6971.0,
f'total_{workload}_flops': 6971.0,
f'total_{workload}_bandwidth': 52859.9,
f'total_{workload}_gflops_per_process': 217.8,
f'total_{workload}_flops_per_process': 217.8,
f'total_{workload}_bandwidth_per_process': 1651.9,
f'setup_time_{workload}': 0.12,
f'optimization_time_{workload}': 0.25,
......
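The metric changes in the test above are a uniform `gflops` → `flops` key rename; the values and their GFLOPS unit are unchanged, only the key suffix moves. A sketch of the mapping (hypothetical helper name, shown for illustration):

```python
def rename_metric_keys(results: dict) -> dict:
    """Apply the gflops -> flops metric key rename; values keep
    their GFLOPS unit, only the key suffix changes."""
    return {k.replace('_gflops', '_flops'): v for k, v in results.items()}

old = {'final_p4x4x2_n560x280x280_gflops': 6904.9,
       'final_p4x4x2_n560x280x280_gflops_per_process': 215.8}
print(rename_metric_keys(old))
```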
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.
"""Tests for DTK gpu-hpl benchmark family."""
import os
import unittest
from tests.helper.testcase import BenchmarkTestCase
from superbench.benchmarks.micro_benchmarks.dtk_hpl_mxp_performance import DtkHplMxpBenchmark
from superbench.benchmarks.micro_benchmarks.dtk_hpl_performance import DtkHplBenchmark
class DtkHplFamilyBenchmarkTest(BenchmarkTestCase, unittest.TestCase):
"""Tests for DTK gpu-hpl and gpu-hpl-mxp benchmarks."""
@classmethod
def setUpClass(cls):
"""Hook method for setting up class fixture before running tests in the class."""
super().setUpClass()
cls.createMockEnvs(cls)
cls.createMockFiles(cls, ['bin/run_rochpl', 'bin/run_rochplmxp'])
def _parse_args(self, benchmark):
"""Parse benchmark arguments without running preprocess."""
benchmark.add_parser_arguments()
ret, args, _ = benchmark.parse_args()
if ret:
benchmark._args = args
return ret, args
def _write_output_file(self, benchmark, output):
"""Write generated HPL output for result parsing tests."""
with open(benchmark._out_path, 'w') as output_file:
output_file.write(output)
def _load_data_file(self, file_name):
"""Load test data file content."""
data_path = os.path.join(os.path.dirname(__file__), '..', '..', 'data', file_name)
with open(data_path, 'r') as data_file:
return data_file.read()
def test_dtk_hpl_default_metric_workload(self):
"""Test DTK gpu-hpl default workload formatting."""
benchmark = DtkHplBenchmark('gpu-hpl')
ret, args = self._parse_args(benchmark)
self.assertTrue(ret)
self.assertEqual(1, args.p)
self.assertEqual(1, args.q)
self.assertEqual(45312, args.n)
self.assertEqual(384, args.nb)
self.assertEqual(0, args.bcast)
self.assertEqual(0, args.warmup)
self.assertEqual(1, args.iterations)
self.assertEqual('max', args.reduce_op)
benchmark._tv = benchmark._format_tv()
self.assertEqual('WC10R2R32_TTN8', benchmark._tv)
self.assertEqual('WC10R2R32_TTN8_P1_Q1_N45312_NB384', benchmark._format_workload())
def test_dtk_hpl_mxp_default_metric_workload(self):
"""Test DTK gpu-hpl-mxp default workload formatting."""
benchmark = DtkHplMxpBenchmark('gpu-hpl-mxp')
ret, args = self._parse_args(benchmark)
self.assertTrue(ret)
self.assertEqual(1, args.p)
self.assertEqual(1, args.q)
self.assertEqual(61440, args.n)
self.assertEqual(2560, args.nb)
self.assertEqual(0, args.bcast)
benchmark._tv = benchmark._format_tv()
self.assertEqual('WC0', benchmark._tv)
self.assertEqual('WC0_P1_Q1_N61440_NB2560', benchmark._format_workload())
def test_dtk_hpl_sample_metric_workload(self):
"""Test DTK gpu-hpl workload formatting with sample parameters."""
benchmark = DtkHplBenchmark('gpu-hpl', parameters='--p 4 --q 1 --n 8192 --nb 512 --bcast 1')
ret, _ = self._parse_args(benchmark)
self.assertTrue(ret)
benchmark._tv = benchmark._format_tv()
self.assertEqual('WC11R2R32_TTN8', benchmark._tv)
self.assertEqual('WC11R2R32_TTN8_P4_Q1_N8192_NB512', benchmark._format_workload())
def test_dtk_hpl_mxp_sample_metric_workload(self):
"""Test DTK gpu-hpl-mxp workload formatting with sample parameters."""
benchmark = DtkHplMxpBenchmark('gpu-hpl-mxp', parameters='--p 4 --q 1 --n 8192 --nb 4096 --bcast 1')
ret, _ = self._parse_args(benchmark)
self.assertTrue(ret)
benchmark._tv = benchmark._format_tv()
self.assertEqual('WC1', benchmark._tv)
self.assertEqual('WC1_P4_Q1_N8192_NB4096', benchmark._format_workload())
def test_dtk_hpl_only_arguments_are_not_accepted_by_mxp(self):
"""Test rocHPL-only arguments are not accepted by gpu-hpl-mxp."""
hpl_benchmark = DtkHplBenchmark('gpu-hpl', parameters='--pfact 2')
hpl_mxp_benchmark = DtkHplMxpBenchmark('gpu-hpl-mxp', parameters='--pfact 2')
hpl_ret, _ = self._parse_args(hpl_benchmark)
hpl_mxp_ret, _ = self._parse_args(hpl_mxp_benchmark)
self.assertTrue(hpl_ret)
self.assertFalse(hpl_mxp_ret)
def test_dtk_hpl_invalid_sampling_arguments(self):
"""Test invalid HPL sampling arguments are rejected."""
self.assertFalse(DtkHplBenchmark('gpu-hpl', parameters='--warmup -1')._preprocess())
self.assertFalse(DtkHplBenchmark('gpu-hpl', parameters='--iterations 0')._preprocess())
def test_dtk_hpl_preprocess_generates_dat_file(self):
"""Test DTK gpu-hpl dat file and command generation."""
benchmark = DtkHplBenchmark('gpu-hpl')
self.assertTrue(benchmark._preprocess())
dat_file_name = 'HPL-WC10R2R32_TTN8_P1_Q1_N45312_NB384.dat'
out_file_name = 'HPL-WC10R2R32_TTN8_P1_Q1_N45312_NB384.out'
self.assertEqual(os.path.join(self._tmp_dir, 'bin', dat_file_name), benchmark._dat_path)
self.assertEqual(os.path.join(self._tmp_dir, 'bin', out_file_name), benchmark._out_path)
self.assertEqual(1, len(benchmark._commands))
self.assertIn(f'run_rochpl -P 1 -Q 1 --it 1 -i {dat_file_name}', benchmark._commands[0])
with open(benchmark._dat_path, 'r') as dat_file:
dat_content = dat_file.read()
self.assertIn(f'{out_file_name} output file name (if any)', dat_content)
self.assertIn('45312 Ns', dat_content)
self.assertIn('384 NBs', dat_content)
self.assertIn('0 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)', dat_content)
self.assertIn('2 PFACTs (0=left, 1=Crout, 2=Right)', dat_content)
self.assertIn('8 memory alignment in double (> 0)', dat_content)
def test_dtk_hpl_mxp_preprocess_generates_dat_file(self):
"""Test DTK gpu-hpl-mxp dat file and command generation."""
benchmark = DtkHplMxpBenchmark(
'gpu-hpl-mxp', parameters='--p 4 --q 1 --n 8192 --nb 4096 --bcast 1 --warmup 1 --iterations 5'
)
self.assertTrue(benchmark._preprocess())
dat_file_name = 'HPL-MxP-WC1_P4_Q1_N8192_NB4096.dat'
out_file_name = 'HPL-MxP-WC1_P4_Q1_N8192_NB4096.out'
self.assertEqual(os.path.join(self._tmp_dir, 'bin', dat_file_name), benchmark._dat_path)
self.assertEqual(os.path.join(self._tmp_dir, 'bin', out_file_name), benchmark._out_path)
self.assertEqual(1, len(benchmark._commands))
self.assertIn(f'run_rochplmxp -P 4 -Q 1 --it 6 -i {dat_file_name}', benchmark._commands[0])
with open(benchmark._dat_path, 'r') as dat_file:
dat_content = dat_file.read()
self.assertIn(f'{out_file_name} output file name (if any)', dat_content)
self.assertIn('8192 Ns', dat_content)
self.assertIn('4096 NBs', dat_content)
self.assertIn('4 P', dat_content)
self.assertIn('1 Q', dat_content)
self.assertIn('1 BCASTs (0=1rg,1=1rM,2=2rg,3=2rM,4=Lng,5=LnM)', dat_content)
def test_dtk_hpl_result_parsing_with_warmup_and_max_reduce(self):
"""Test DTK gpu-hpl parses generated output and reduces by max FLOPS."""
benchmark = DtkHplBenchmark(
'gpu-hpl', parameters='--p 4 --q 1 --n 8192 --nb 512 --bcast 1 --warmup 1 --iterations 3 --reduce-op max'
)
self.assertTrue(benchmark._preprocess())
self._write_output_file(
benchmark, """
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.71 5.167e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.69 5.338e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.67 5.437e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.67 5.450e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
"""
)
self.assertTrue(benchmark._process_raw_result(0, 'stdout noise'))
workload = 'WC11R2R32_TTN8_P4_Q1_N8192_NB512'
self.assertEqual(545.0, benchmark.result[f'{workload}_flops'][0])
self.assertEqual(0.67, benchmark.result[f'{workload}_time'][0])
self.assertEqual(1, benchmark.result[f'{workload}_tests_pass'][0])
self.assertIn('raw_output_0', benchmark.raw_data)
self.assertIn('hpl_output_0', benchmark.raw_data)
def test_dtk_hpl_mxp_result_parsing_does_not_match_output_n(self):
"""Test DTK gpu-hpl-mxp parses output when output N differs from input N."""
benchmark = DtkHplMxpBenchmark(
'gpu-hpl-mxp',
parameters='--p 4 --q 1 --n 8192 --nb 4096 --bcast 1 --warmup 1 --iterations 2 --reduce-op min'
)
self.assertTrue(benchmark._preprocess())
self._write_output_file(
benchmark, """
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.78 3.742e+03
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.80 3.665e+03
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.78 3.767e+03
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... FAILED
"""
)
self.assertTrue(benchmark._process_raw_result(0, 'stdout noise'))
workload = 'WC1_P4_Q1_N8192_NB4096'
self.assertEqual(3665.0, benchmark.result[f'{workload}_flops'][0])
self.assertEqual(0.80, benchmark.result[f'{workload}_time'][0])
self.assertEqual(0, benchmark.result[f'{workload}_tests_pass'][0])
def test_dtk_hpl_result_parsing_with_median_reduce(self):
"""Test DTK gpu-hpl median reduce uses reciprocal time."""
benchmark = DtkHplBenchmark(
'gpu-hpl', parameters='--p 4 --q 1 --n 8192 --nb 512 --bcast 1 --iterations 4 --reduce-op median'
)
self.assertTrue(benchmark._preprocess())
self._write_output_file(
benchmark, """
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.80 5.000e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.40 6.000e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.20 7.000e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.10 8.000e+02
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
"""
)
self.assertTrue(benchmark._process_raw_result(0, 'stdout noise'))
workload = 'WC11R2R32_TTN8_P4_Q1_N8192_NB512'
self.assertEqual(650.0, benchmark.result[f'{workload}_flops'][0])
self.assertEqual(0.26666666666666666, benchmark.result[f'{workload}_time'][0])
self.assertEqual(1, benchmark.result[f'{workload}_tests_pass'][0])
def test_dtk_hpl_result_parsing_with_sample_output_file(self):
"""Test DTK gpu-hpl parses a full sample output file."""
benchmark = DtkHplBenchmark(
'gpu-hpl', parameters='--p 4 --q 1 --n 8192 --nb 512 --bcast 1 --warmup 1 --iterations 5'
)
self.assertTrue(benchmark._preprocess())
self._write_output_file(benchmark, self._load_data_file('gpu_hpl_sample.out'))
self.assertTrue(benchmark._process_raw_result(0, 'stdout noise'))
workload = 'WC11R2R32_TTN8_P4_Q1_N8192_NB512'
self.assertEqual(545.0, benchmark.result[f'{workload}_flops'][0])
self.assertEqual(0.67, benchmark.result[f'{workload}_time'][0])
self.assertEqual(1, benchmark.result[f'{workload}_tests_pass'][0])
def test_dtk_hpl_mxp_result_parsing_with_sample_output_file(self):
"""Test DTK gpu-hpl-mxp parses a full sample output file."""
benchmark = DtkHplMxpBenchmark(
'gpu-hpl-mxp', parameters='--p 4 --q 1 --n 8192 --nb 4096 --bcast 1 --iterations 6'
)
self.assertTrue(benchmark._preprocess())
self._write_output_file(benchmark, self._load_data_file('gpu_hpl_mxp_sample.out'))
self.assertTrue(benchmark._process_raw_result(0, 'stdout noise'))
workload = 'WC1_P4_Q1_N8192_NB4096'
self.assertEqual(3767.0, benchmark.result[f'{workload}_flops'][0])
self.assertEqual(0.78, benchmark.result[f'{workload}_time'][0])
self.assertEqual(1, benchmark.result[f'{workload}_tests_pass'][0])
def test_dtk_hpl_result_parsing_fails_when_output_file_is_missing(self):
"""Test DTK gpu-hpl parsing fails when generated output file is missing."""
benchmark = DtkHplBenchmark('gpu-hpl')
self.assertTrue(benchmark._preprocess())
self.assertFalse(benchmark._process_raw_result(0, 'stdout noise'))
if __name__ == '__main__':
unittest.main()
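The reduce expectations in the tests above (e.g. median FLOPS 650.0 with time ≈ 0.267 s, and max FLOPS 545.0 paired with the minimum time 0.67 s) are consistent with reducing FLOPS directly while reducing time through its reciprocal rate. A sketch of that aggregation, with semantics inferred from the asserted test values rather than taken from the implementation:

```python
from statistics import median

def reduce_runs(flops, times, op=median):
    """Reduce per-iteration HPL results: apply `op` to the FLOPS
    samples, and derive time from the reduced reciprocal rate so
    that faster runs dominate (max FLOPS pairs with min time)."""
    reduced_flops = op(flops)
    reduced_time = 1 / op(1 / t for t in times)
    return reduced_flops, reduced_time

print(reduce_runs([500.0, 600.0, 700.0, 800.0], [0.80, 0.40, 0.20, 0.10]))
```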
================================================================================
HPLinpack 2.2 -- High-Performance Linpack benchmark -- February 24, 2016
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 8192
NB : 4096
PMAP : Column-major process mapping
P : 4
Q : 1
BCAST : 1ringM
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.78 3.742e+03
HPLMXP_pdgesv() start time Wed Apr 22 00:00:00 2026
HPLMXP_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time D bcast . . : 0.46
Max aggregated wall time L bcast . . : 0.00
Max aggregated wall time U bcast . . : 0.40
Max aggregated wall time update . . : 0.02
Max aggregated wall time Iter Refine : 0.01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.80 3.665e+03
HPLMXP_pdgesv() start time Wed Apr 22 00:00:00 2026
HPLMXP_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time D bcast . . : 0.48
Max aggregated wall time L bcast . . : 0.00
Max aggregated wall time U bcast . . : 0.41
Max aggregated wall time update . . : 0.02
Max aggregated wall time Iter Refine : 0.01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.78 3.767e+03
HPLMXP_pdgesv() start time Wed Apr 22 00:00:00 2026
HPLMXP_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time D bcast . . : 0.46
Max aggregated wall time L bcast . . : 0.00
Max aggregated wall time U bcast . . : 0.39
Max aggregated wall time update . . : 0.02
Max aggregated wall time Iter Refine : 0.01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.79 3.727e+03
HPLMXP_pdgesv() start time Wed Apr 22 00:00:00 2026
HPLMXP_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time D bcast . . : 0.47
Max aggregated wall time L bcast . . : 0.00
Max aggregated wall time U bcast . . : 0.40
Max aggregated wall time update . . : 0.02
Max aggregated wall time Iter Refine : 0.01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.79 3.724e+03
HPLMXP_pdgesv() start time Wed Apr 22 00:00:00 2026
HPLMXP_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time D bcast . . : 0.47
Max aggregated wall time L bcast . . : 0.00
Max aggregated wall time U bcast . . : 0.40
Max aggregated wall time update . . : 0.02
Max aggregated wall time Iter Refine : 0.01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC1 16384 4096 4 1 0.79 3.713e+03
HPLMXP_pdgesv() start time Wed Apr 22 00:00:00 2026
HPLMXP_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
Max aggregated wall time D bcast . . : 0.47
Max aggregated wall time L bcast . . : 0.00
Max aggregated wall time U bcast . . : 0.40
Max aggregated wall time update . . : 0.02
Max aggregated wall time Iter Refine : 0.01
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0891789 ...... PASSED
================================================================================
Finished 6 tests with the following results:
6 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
================================================================================
HPLinpack 2.2 -- High-Performance Linpack benchmark -- February 24, 2016
Written by A. Petitet and R. Clint Whaley, Innovative Computing Laboratory, UTK
Modified by Piotr Luszczek, Innovative Computing Laboratory, UTK
Modified by Julien Langou, University of Colorado Denver
================================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 8192
NB : 512
PMAP : Column-major process mapping
P : 4
Q : 1
PFACT : Right
NBMIN : 32
NDIV : 2
RFACT : Right
BCAST : 1ringM
DEPTH : 1
SWAP : Spread-roll (long)
L1 : transposed form
U : transposed form
EQUIL : no
ALIGN : 8 double precision words
--------------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual check will be computed:
||Ax-b||_oo / ( eps * ( || x ||_oo * || A ||_oo + || b ||_oo ) * N )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.71 5.167e+02
HPL_pdgesv() start time Wed Apr 22 00:00:00 2026
HPL_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
+ Max aggregated wall time rfact . . : 0.32
+ + Max aggregated wall time pfact . : 0.30
+ + Max aggregated wall time mxswp . : 0.19
Max aggregated wall time laswp . . . : 0.37
Max aggregated wall time update . . : 0.00
Max aggregated wall time up tr sv . : 0.00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.69 5.338e+02
HPL_pdgesv() start time Wed Apr 22 00:00:00 2026
HPL_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
+ Max aggregated wall time rfact . . : 0.30
+ + Max aggregated wall time pfact . : 0.30
+ + Max aggregated wall time mxswp . : 0.18
Max aggregated wall time laswp . . . : 0.36
Max aggregated wall time update . . : 0.00
Max aggregated wall time up tr sv . : 0.00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.67 5.437e+02
HPL_pdgesv() start time Wed Apr 22 00:00:00 2026
HPL_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
+ Max aggregated wall time rfact . . : 0.30
+ + Max aggregated wall time pfact . : 0.29
+ + Max aggregated wall time mxswp . : 0.18
Max aggregated wall time laswp . . . : 0.36
Max aggregated wall time update . . : 0.00
Max aggregated wall time up tr sv . : 0.00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.67 5.450e+02
HPL_pdgesv() start time Wed Apr 22 00:00:00 2026
HPL_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
+ Max aggregated wall time rfact . . : 0.30
+ + Max aggregated wall time pfact . : 0.29
+ + Max aggregated wall time mxswp . : 0.18
Max aggregated wall time laswp . . . : 0.36
Max aggregated wall time update . . : 0.00
Max aggregated wall time up tr sv . : 0.00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.68 5.391e+02
HPL_pdgesv() start time Wed Apr 22 00:00:00 2026
HPL_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
+ Max aggregated wall time rfact . . : 0.30
+ + Max aggregated wall time pfact . : 0.29
+ + Max aggregated wall time mxswp . : 0.18
Max aggregated wall time laswp . . . : 0.36
Max aggregated wall time update . . : 0.00
Max aggregated wall time up tr sv . : 0.00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
================================================================================
T/V N NB P Q Time Gflops
--------------------------------------------------------------------------------
WC11R2R32 8192 512 4 1 0.69 5.330e+02
HPL_pdgesv() start time Wed Apr 22 00:00:00 2026
HPL_pdgesv() end time Wed Apr 22 00:00:00 2026
--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV--VVV-
+ Max aggregated wall time rfact . . : 0.31
+ + Max aggregated wall time pfact . : 0.30
+ + Max aggregated wall time mxswp . : 0.19
Max aggregated wall time laswp . . . : 0.36
Max aggregated wall time update . . : 0.00
Max aggregated wall time up tr sv . : 0.00
--------------------------------------------------------------------------------
||Ax-b||_oo/(eps*(||A||_oo*||x||_oo+||b||_oo)*N)= 0.0002689 ...... PASSED
================================================================================
Finished 6 tests with the following results:
6 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
--------------------------------------------------------------------------------
End of Tests.
================================================================================
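The result rows in the sample outputs above follow a fixed `T/V N NB P Q Time Gflops` layout, so each iteration can be recovered with a simple line pattern. A minimal sketch of such a parser (not the benchmark's actual implementation; headers, separators, and residual-check lines are skipped because they never match the numeric column pattern):

```python
import re

# Matches result rows like 'WC11R2R32  8192  512  4  1  0.67  5.450e+02'.
_RESULT_RE = re.compile(
    r'^(?P<tv>\S+)\s+(?P<n>\d+)\s+(?P<nb>\d+)\s+(?P<p>\d+)\s+(?P<q>\d+)'
    r'\s+(?P<time>\d+\.\d+)\s+(?P<gflops>\d+\.\d+[eE][+-]\d+)\s*$'
)

def parse_hpl_results(text):
    """Return (time_seconds, gflops) for each HPL/HPL-MxP result row."""
    results = []
    for line in text.splitlines():
        match = _RESULT_RE.match(line.strip())
        if match:
            results.append((float(match['time']), float(match['gflops'])))
    return results

SAMPLE = """\
 T/V                N    NB     P     Q         Time          Gflops
--------------------------------------------------------------------------------
 WC11R2R32       8192   512     4     1         0.67       5.450e+02
"""
print(parse_hpl_results(SAMPLE))
# → [(0.67, 545.0)]
```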
......@@ -16,7 +16,7 @@ ROCM_VER ?= $(shell hipconfig -R | grep -oP '\d+\.\d+\.\d+' || echo "0.0.0")
NUM_MAKE_JOBS ?= $(shell nproc --ignore=2)
.PHONY: all cuda_with_msccl cuda rocm dtk common cuda_cutlass cuda_bandwidthTest cuda_nccl_tests cuda_perftest cuda_msccl rocm_perftest fio rocm_rccl_tests rocm_rocblas rocm_bandwidthTest gpcnet cuda_gpuburn cpu_stream cpu_hpl directx_amf_encoding_latency directx_amd rocm_hipblaslt rocm_babelstream_hip megatron_lm megatron_deepspeed apex_rocm nvbandwidth rocm_megatron_lm rocm_hpcg
.PHONY: all cuda_with_msccl cuda rocm dtk common cuda_cutlass cuda_bandwidthTest cuda_nccl_tests cuda_perftest cuda_msccl rocm_perftest fio rocm_rccl_tests rocm_rocblas rocm_bandwidthTest gpcnet cuda_gpuburn cpu_stream cpu_hpl directx_amf_encoding_latency directx_amd rocm_hipblaslt rocm_babelstream_hip megatron_lm megatron_deepspeed apex_rocm nvbandwidth rocm_megatron_lm rocm_hpcg rocm_hpl rocm_hpl_mxp
# Build targets.
all: cuda rocm
@@ -24,7 +24,7 @@ all: cuda rocm
cuda_with_msccl: cuda cuda_msccl
cuda: common cuda_cutlass cuda_bandwidthTest cuda_nccl_tests cuda_perftest gpcnet cuda_gpuburn megatron_lm megatron_deepspeed nvbandwidth
rocm: common rocm_perftest rocm_rccl_tests rocm_rocblas rocm_bandwidthTest rocm_hipblaslt megatron_deepspeed apex_rocm rocm_megatron_lm
dtk: common rocm_perftest rocm_rccl_tests rocm_babelstream_hip megatron_deepspeed apex_rocm rocm_megatron_lm rocm_hpcg
dtk: common rocm_perftest rocm_rccl_tests rocm_babelstream_hip megatron_deepspeed apex_rocm rocm_megatron_lm rocm_hpcg rocm_hpl rocm_hpl_mxp
cpu: common cpu_perftest
common: fio cpu_stream
@@ -202,6 +202,37 @@ ifneq (,$(wildcard rocHPCG/install.sh))
chmod +x $(SB_MICRO_PATH)/bin/rochpcg $(SB_MICRO_PATH)/bin/run_rochpcg
endif
# Build rocHPL and stage the binary for SuperBench DTK image.
rocm_hpl: sb_micro_path
ifneq (,$(wildcard rocHPL/install.sh))
cd ./rocHPL && \
git apply ../rochpl_dtk26.patch && \
./install.sh --with-rocm=$(ROCM_PATH) \
--with-rocblas=$(ROCM_PATH)/rocblas \
--with-mpi=$(MPI_HOME) \
--arch=$$(paste -sd ',' $(ROCM_PATH)/bin/target.lst)
cp -v ./rocHPL/build/bin/rochpl $(SB_MICRO_PATH)/bin/
cp -v ./rocHPL/build/HPL.dat $(SB_MICRO_PATH)/bin/
cp -v ./run_rochpl.sh $(SB_MICRO_PATH)/bin/run_rochpl
chmod +x $(SB_MICRO_PATH)/bin/rochpl $(SB_MICRO_PATH)/bin/run_rochpl
endif
# Build rocHPL-MxP and stage the binary and run script for SuperBench DTK image.
rocm_hpl_mxp: sb_micro_path
ifneq (,$(wildcard rocHPL-MxP/install.sh))
cd ./rocHPL-MxP && \
git apply ../rochplmxp_dtk26.patch && \
./install.sh --with-rocm=$(ROCM_PATH) \
--with-rocblas=$(ROCM_PATH)/rocblas \
--with-rocsolver=$(ROCM_PATH)/rocsolver \
--with-mpi=$(MPI_HOME) \
--arch=$$(paste -sd ',' $(ROCM_PATH)/bin/target.lst)
cp -v ./rocHPL-MxP/build/bin/rochplmxp $(SB_MICRO_PATH)/bin/
cp -v ./rocHPL-MxP/build/HPL-MxP.dat $(SB_MICRO_PATH)/bin/
cp -v ./run_rochplmxp.sh $(SB_MICRO_PATH)/bin/run_rochplmxp
chmod +x $(SB_MICRO_PATH)/bin/rochplmxp $(SB_MICRO_PATH)/bin/run_rochplmxp
endif
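Both rules pass `--arch=$$(paste -sd ',' $(ROCM_PATH)/bin/target.lst)` to `install.sh`, collapsing the newline-separated target list shipped with ROCm into the comma-separated form the installer expects. A minimal illustration with hypothetical file contents:

```shell
# Simulate $(ROCM_PATH)/bin/target.lst: one gfx target per line.
printf 'gfx906\ngfx90a\ngfx942\n' > target.lst

# paste -s serializes all lines into one; -d ',' joins them with commas,
# yielding the value handed to ./install.sh --arch=...
ARCH_LIST=$(paste -sd ',' target.lst)
echo "$ARCH_LIST"
```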
# Build GPCNET from commit c56fd9.
gpcnet: sb_micro_path
bash -c "source ${HPCX_HOME}/hpcx-init.sh && hpcx_load && make CC=mpicc -C GPCNET all && hpcx_unload"
Subproject commit 6f06f5127bde6ec7a7900814c560520e3f7c8ca3
Subproject commit 5b6a169d5428e5b652341ebbad969fd2937762d4
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 6b80b24..30d3c9d 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -83,7 +83,7 @@ foreach(i ${rochpl_device_source})
endforeach()
# HIP flags workaround while target_compile_options does not work
-list(APPEND HIP_HIPCC_FLAGS "-Wno-unused-command-line-argument -Wno-deprecated-declarations -fPIE -fopenmp")
+list(APPEND HIP_HIPCC_FLAGS "-Wno-unused-command-line-argument -Wno-deprecated-declarations -fPIE -fopenmp --gpu-max-threads-per-block=1024")
list(APPEND CMAKE_HOST_FLAGS "-Wno-deprecated-declarations")
if (CMAKE_BUILD_TYPE STREQUAL "Debug")
@@ -95,24 +95,35 @@ else()
endif()
# GPU arch targets
-set(TARGETS "gfx900;gfx906")
-if(HIP_VERSION VERSION_GREATER_EQUAL "3.7")
- set(TARGETS "${TARGETS};gfx908")
-endif()
-if(HIP_VERSION VERSION_GREATER_EQUAL "4.3")
- set(TARGETS "${TARGETS};gfx90a")
-endif()
-if (HIP_VERSION VERSION_GREATER_EQUAL "5.7")
- set(TARGETS "${TARGETS};gfx942")
-endif()
-if (HIP_VERSION VERSION_GREATER_EQUAL "6.5")
- set(TARGETS "${TARGETS};gfx950;gfx1100")
+set(ARCHS "")
+if(DEFINED HPL_BUILD_ARCH AND NOT HPL_BUILD_ARCH STREQUAL "")
+ string(REPLACE "," ";" ARCHS "${HPL_BUILD_ARCH}")
+ list(TRANSFORM ARCHS STRIP)
+ list(REMOVE_DUPLICATES ARCHS)
+ message(STATUS "Using manually specified GPU targets: ${ARCHS}")
+else()
+ message(STATUS "Detecting available architecture")
+ find_program(ROCMINFO_EXECUTABLE rocminfo)
+ if(ROCMINFO_EXECUTABLE)
+ execute_process(
+ COMMAND ${ROCMINFO_EXECUTABLE}
+ OUTPUT_VARIABLE ROCMINFO_OUTPUT
+ ERROR_QUIET
+ OUTPUT_STRIP_TRAILING_WHITESPACE)
+
+ string(REGEX MATCHALL "Name:[ \t]+gfx[0-9a-z]+" ARCH_MATCHES "${ROCMINFO_OUTPUT}")
+ string(REGEX REPLACE "Name:[ \t]+" "" ARCHS "${ARCH_MATCHES}")
+ list(REMOVE_DUPLICATES ARCHS)
+ endif()
endif()
-if (HIP_VERSION VERSION_GREATER_EQUAL "7.0")
- set(TARGETS "${TARGETS};gfx1201")
+
+if(ARCHS STREQUAL "")
+ message(FATAL_ERROR "No GPU architectures detected via rocminfo and no BUILD_ARCH specified. Use ./install.sh --arch=gfxXXX")
endif()
-foreach(target ${TARGETS})
+message(STATUS "Building for GPU architecture: ${ARCHS}")
+
+foreach(target ${ARCHS})
list(APPEND HIP_HIPCC_FLAGS "--offload-arch=${target}")
endforeach()
@@ -176,7 +187,7 @@ if(MPI_GTL)
target_link_libraries(rochpl PRIVATE "${GTL_LIB}")
endif()
-set_target_properties(rochpl PROPERTIES HIP_ARCHITECTURES "${DEFAULT_AMDGPU_TARGETS}")
+set_target_properties(rochpl PROPERTIES HIP_ARCHITECTURES "${ARCHS}")
# Configure a header file to pass the rocHPL version
configure_file("${CMAKE_CURRENT_SOURCE_DIR}/include/hpl_version.hpp.in"
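The arch-detection fallback this patch adds to CMake scrapes `rocminfo` output for `Name: gfx...` agent lines, strips the label, and de-duplicates. The same extraction can be sketched in shell (the `rocminfo.txt` contents below are an assumed stand-in for real agent output, which lists CPU agents too):

```shell
# Stand-in for rocminfo output: CPU agents must be ignored,
# duplicate GPU agents must collapse to one entry.
cat > rocminfo.txt <<'EOF'
  Name:                    AMD EPYC 7763
  Name:                    gfx90a
  Name:                    gfx90a
EOF

# Mirror the patched CMake logic: match "Name: gfx...",
# drop the "Name:" label, and de-duplicate.
ARCHS=$(grep -oE 'Name:[[:space:]]+gfx[0-9a-z]+' rocminfo.txt \
        | sed -E 's/Name:[[:space:]]+//' | sort -u)
echo "$ARCHS"
```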
diff --git a/cmake/Dependencies.cmake b/cmake/Dependencies.cmake
index 6d6be5d..ed4813a 100644
--- a/cmake/Dependencies.cmake
+++ b/cmake/Dependencies.cmake
@@ -101,7 +101,7 @@ if(NOT ROCM_FOUND)
execute_process(COMMAND ${CMAKE_COMMAND} -E tar xzf ${PROJECT_EXTERN_DIR}/rocm-cmake-${rocm_cmake_tag}.zip
WORKING_DIRECTORY ${PROJECT_EXTERN_DIR})
- find_package(ROCmCMakeBuildTools REQUIRED CONFIG PATHS ${PROJECT_EXTERN_DIR}/rocm-cmake-${rocm_cmake_tag})
+ set(CMAKE_MODULE_PATH "${PROJECT_EXTERN_DIR}/rocm-cmake-${rocm_cmake_tag}/share/rocm/cmake;${CMAKE_MODULE_PATH}")
endif()
include(ROCMSetupVersion)
diff --git a/install.sh b/install.sh
index b30a3fb..75900d8 100755
--- a/install.sh
+++ b/install.sh
@@ -2,7 +2,7 @@
# Author: Nico Trost
# Modified by: Noel Chalmers
-#set -x #echo on
+# set -euo pipefail
# #################################################
# helper functions
@@ -17,6 +17,7 @@ function display_help()
echo " [--with-rocm=<dir>] Path to ROCm install (Default: /opt/rocm)"
echo " [--with-rocblas=<dir>] Path to rocBLAS library (Default: /opt/rocm/rocblas)"
echo " [--with-mpi=<dir>] Path to external MPI install (Default: clone+build OpenMPI)"
+ echo " [--arch=<archs>] Specify comma separated architecture list to build (Default: detect from rocminfo)"
echo " [--with-mpi-gtl=<dir>] Path to external MPI-GTL install (Optional: defaults to no gtl support)"
echo " [--verbose-print] Verbose output during HPL setup (Default: true)"
echo " [--progress-report] Print progress report to terminal during HPL run (Default: true)"
@@ -33,10 +34,10 @@ supported_distro( )
fi
case "${ID}" in
- debian|linuxmint|ubuntu|centos|rhel|fedora|sles|tencentos)
+ debian|linuxmint|ubuntu|centos|rhel|fedora|sles|tencentos|kylin|rocky)
true
;;
- *) printf "This script is currently supported on Debian, Linuxmint, Ubuntu, CentOS, RHEL, Fedora and SLES\n"
+ *) printf "This script is currently supported on Debian, Linuxmint, Ubuntu, CentOS, RHEL, Fedora, SLES, TencentOS, Kylin and Rocky\n"
exit 2
;;
esac
@@ -68,11 +69,11 @@ exit_with_error( )
printf "sudo apt install -y ${library_dependencies_ubuntu[*]}\n"
;;
- centos|rhel|tencentos)
+ centos|rhel|tencentos|kylin)
printf "sudo yum -y --nogpgcheck install ${library_dependencies_centos[*]}\n"
;;
- fedora)
+ fedora|rocky)
printf "sudo dnf install -y ${library_dependencies_fedora[*]}\n"
;;
@@ -224,6 +225,7 @@ verbose_print=true
progress_report=true
detailed_timing=true
enable_tracing=false
+arch=
# #################################################
# Parameter parsing
@@ -232,7 +234,7 @@ enable_tracing=false
# check if we have a modern version of getopt that can handle whitespace and long parameters
getopt -T
if [[ $? -eq 4 ]]; then
- GETOPT_PARSE=$(getopt --name "${0}" --longoptions help,debug,prefix:,with-rocm:,with-mpi:,with-mpi-gtl:,with-rocblas:,verbose-print:,progress-report:,detailed-timing:,enable-tracing: --options hg -- "$@")
+ GETOPT_PARSE=$(getopt --name "${0}" --longoptions help,debug,prefix:,with-rocm:,with-mpi:,with-mpi-gtl:,with-rocblas:,verbose-print:,progress-report:,detailed-timing:,enable-tracing:,arch: --options hg -- "$@")
else
echo "Need a new version of getopt"
exit_with_error 1
@@ -263,6 +265,9 @@ while true; do
--with-mpi)
with_mpi=${2}
shift 2 ;;
+ --arch)
+ arch=${2}
+ shift 2 ;;
--with-mpi-gtl)
with_mpi_gtl=${2}
shift 2 ;;
@@ -294,9 +299,6 @@ printf "\033[32mCreating project build directory in: \033[33m${build_dir}\033[0m
# #################################################
# prep
# #################################################
-# ensure a clean build environment
-rm -rf ${build_dir}
-
# Default cmake executable is called cmake
cmake_executable=cmake
@@ -347,11 +349,14 @@ pushd .
if [[ "${enable_tracing}" == on || "${enable_tracing}" == true || "${enable_tracing}" == 1 || "${enable_tracing}" == enabled ]]; then
cmake_common_options="${cmake_common_options} -DHPL_TRACING=ON"
fi
+ if [[ -n "${arch}" ]]; then
+ cmake_common_options="${cmake_common_options} -DHPL_BUILD_ARCH=${arch}"
+ fi
shopt -u nocasematch
# Build library with AMD toolchain because of existence of device kernels
mkdir -p ${build_dir} && cd ${build_dir}
- ${cmake_executable} ${cmake_common_options} ..
+ ${cmake_executable} --fresh ${cmake_common_options} ..
check_exit_code 2
if [[ -e build.ninja ]]; then
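The `--arch` plumbing above follows the script's existing util-linux `getopt` pattern: append the long option (with a trailing `:` for its required argument), then handle it in the `while`/`case` loop. A stripped-down sketch of that pattern, with only two options kept for illustration:

```shell
# Minimal sketch of the getopt pattern install.sh uses; option set
# shortened to --with-rocm and the newly added --arch.
parse_args() {
  local parsed
  parsed=$(getopt --name "$0" --longoptions with-rocm:,arch: --options h -- "$@") || return 1
  eval set -- "$parsed"
  arch=
  while true; do
    case "$1" in
      --with-rocm) with_rocm=$2; shift 2 ;;
      --arch)      arch=$2;      shift 2 ;;
      -h)          shift ;;
      --)          shift; break ;;
    esac
  done
}

parse_args --with-rocm=/opt/rocm --arch=gfx90a,gfx942
echo "arch=$arch"
```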
diff --git a/src/HPL_pdtest.cpp b/src/HPL_pdtest.cpp
index 94a0d3f..3135763 100644
--- a/src/HPL_pdtest.cpp
+++ b/src/HPL_pdtest.cpp
@@ -212,7 +212,7 @@ void HPL_pdtest(HPL_T_test* TEST,
ctime(&current_time_end));
}
#ifdef HPL_PROGRESS_REPORT
- printf("Final Score: %7.4e GFLOPS \n", Gflops);
+ printf("Final Score: %7.9e GFLOPS \n", Gflops);
#endif
}
#ifdef HPL_DETAILED_TIMING
diff --git a/src/pgesv/HPL_pdgesv.cpp b/src/pgesv/HPL_pdgesv.cpp
index d6c99c3..280a9a5 100644
--- a/src/pgesv/HPL_pdgesv.cpp
+++ b/src/pgesv/HPL_pdgesv.cpp
@@ -336,7 +336,7 @@ void HPL_pdgesv(HPL_T_grid* GRID, HPL_T_palg* ALGO, HPL_T_pmat* A) {
printf(" %9.3e |", step_gflops);
#endif
- printf(" %9.3e \n", gflops);
+ printf(" %9.9e \n", gflops);
}
#endif
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 91afcc4..6331291 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -88,7 +88,7 @@ foreach(i ${rochplmxp_device_source})
endforeach()
# HIP flags workaround while target_compile_options does not work
-list(APPEND HIP_HIPCC_FLAGS "-Wno-unused-command-line-argument -fPIE")
+list(APPEND HIP_HIPCC_FLAGS "-Wno-unused-command-line-argument -fPIE --gpu-max-threads-per-block=1024")
list(APPEND CMAKE_HOST_FLAGS "")
if (CMAKE_BUILD_TYPE STREQUAL "Debug")
@@ -100,21 +100,35 @@ else()
endif()
# GPU arch targets
-set(TARGETS "gfx900;gfx906")
-if(HIP_VERSION VERSION_GREATER_EQUAL "3.7")
- set(TARGETS "${TARGETS};gfx908")
-endif()
-if(HIP_VERSION VERSION_GREATER_EQUAL "4.3")
- set(TARGETS "${TARGETS};gfx90a")
-endif()
-if (HIP_VERSION VERSION_GREATER_EQUAL "5.7")
- set(TARGETS "${TARGETS};gfx942")
+set(ARCHS "")
+if(DEFINED HPL_BUILD_ARCH AND NOT HPL_BUILD_ARCH STREQUAL "")
+ string(REPLACE "," ";" ARCHS "${HPL_BUILD_ARCH}")
+ list(TRANSFORM ARCHS STRIP)
+ list(REMOVE_DUPLICATES ARCHS)
+ message(STATUS "Using manually specified GPU targets: ${ARCHS}")
+else()
+ message(STATUS "Detecting available architecture")
+ find_program(ROCMINFO_EXECUTABLE rocminfo)
+ if(ROCMINFO_EXECUTABLE)
+ execute_process(
+ COMMAND ${ROCMINFO_EXECUTABLE}
+ OUTPUT_VARIABLE ROCMINFO_OUTPUT
+ ERROR_QUIET
+ OUTPUT_STRIP_TRAILING_WHITESPACE)
+
+ string(REGEX MATCHALL "Name:[ \t]+gfx[0-9a-z]+" ARCH_MATCHES "${ROCMINFO_OUTPUT}")
+ string(REGEX REPLACE "Name:[ \t]+" "" ARCHS "${ARCH_MATCHES}")
+ list(REMOVE_DUPLICATES ARCHS)
+ endif()
endif()
-if (HIP_VERSION VERSION_GREATER_EQUAL "6.5")
- set(TARGETS "${TARGETS};gfx950")
+
+if(ARCHS STREQUAL "")
+ message(FATAL_ERROR "No GPU architectures detected via rocminfo and no BUILD_ARCH specified. Use ./install.sh --arch=gfxXXX")
endif()
-foreach(target ${TARGETS})
+message(STATUS "Building for GPU architecture: ${ARCHS}")
+
+foreach(target ${ARCHS})
list(APPEND HIP_HIPCC_FLAGS "--offload-arch=${target}")
endforeach()
@@ -173,7 +187,7 @@ set_target_properties(rochplmxp PROPERTIES RUNTIME_OUTPUT_DIRECTORY "${CMAKE_BIN
set_target_properties(rochplmxp PROPERTIES LINKER_LANGUAGE CXX)
-set_target_properties(rochplmxp PROPERTIES HIP_ARCHITECTURES "${DEFAULT_AMDGPU_TARGETS}")
+set_target_properties(rochplmxp PROPERTIES HIP_ARCHITECTURES "${ARCHS}")
# # Configure a header file to pass the rocHPL-MxP version
configure_file("${CMAKE_CURRENT_SOURCE_DIR}/include/hplmxp_version.hpp.in"
diff --git a/cmake/Dependencies.cmake b/cmake/Dependencies.cmake
index 164d06d..041a8e2 100644
--- a/cmake/Dependencies.cmake
+++ b/cmake/Dependencies.cmake
@@ -109,7 +109,7 @@ if(NOT ROCM_FOUND)
execute_process(COMMAND ${CMAKE_COMMAND} -E tar xzf ${PROJECT_EXTERN_DIR}/rocm-cmake-${rocm_cmake_tag}.zip
WORKING_DIRECTORY ${PROJECT_EXTERN_DIR})
- find_package(ROCmCMakeBuildTools REQUIRED CONFIG PATHS ${PROJECT_EXTERN_DIR}/rocm-cmake-${rocm_cmake_tag})
+ set(CMAKE_MODULE_PATH "${PROJECT_EXTERN_DIR}/rocm-cmake-${rocm_cmake_tag}/share/rocm/cmake;${CMAKE_MODULE_PATH}")
endif()
include(ROCMSetupVersion)
diff --git a/install.sh b/install.sh
index de72a20..6f2ef05 100755
--- a/install.sh
+++ b/install.sh
@@ -18,6 +18,7 @@ function display_help()
echo " [--with-rocblas=<dir>] Path to rocBLAS library (Default: /opt/rocm/rocblas)"
echo " [--with-rocsolver=<dir>] Path to rocSOLVER library (Default: /opt/rocm/rocsolver)"
echo " [--with-mpi=<dir>] Path to external MPI install (Default: clone+build OpenMPI)"
+ echo " [--arch=<archs>] Specify comma separated architecture list to build (Default: detect from rocminfo)"
echo " [--verbose-print] Verbose output during HPL setup (Default: true)"
echo " [--enable-tracing] Annotate profiler traces with rocTX markers (Default: false)"
echo " [--progress-report] Print progress report to terminal during HPL run (Default: true)"
@@ -33,10 +34,10 @@ supported_distro( )
fi
case "${ID}" in
- ubuntu|centos|rhel|fedora|sles)
+ ubuntu|centos|rhel|fedora|sles|kylin|rocky)
true
;;
- *) printf "This script is currently supported on Ubuntu, CentOS, RHEL, Fedora and SLES\n"
+ *) printf "This script is currently supported on Ubuntu, CentOS, RHEL, Fedora, SLES, Kylin and Rocky\n"
exit 2
;;
esac
@@ -68,11 +69,11 @@ exit_with_error( )
printf "sudo apt install -y ${library_dependencies_ubuntu[*]}\n"
;;
- centos|rhel)
+ centos|rhel|kylin)
printf "sudo yum -y --nogpgcheck install ${library_dependencies_centos[*]}\n"
;;
- fedora)
+ fedora|rocky)
printf "sudo dnf install -y ${library_dependencies_fedora[*]}\n"
;;
@@ -217,6 +218,7 @@ verbose_print=true
enable_tracing=false
progress_report=true
detailed_timing=true
+arch=
# #################################################
# Parameter parsing
@@ -225,7 +227,7 @@ detailed_timing=true
# check if we have a modern version of getopt that can handle whitespace and long parameters
getopt -T
if [[ $? -eq 4 ]]; then
- GETOPT_PARSE=$(getopt --name "${0}" --longoptions help,debug,prefix:,with-rocm:,with-mpi:,with-rocblas:,with-rocsolver:,verbose-print:,enable-tracing:,progress-report:,detailed-timing: --options hg -- "$@")
+ GETOPT_PARSE=$(getopt --name "${0}" --longoptions help,debug,prefix:,with-rocm:,with-mpi:,with-rocblas:,with-rocsolver:,verbose-print:,enable-tracing:,progress-report:,detailed-timing:,arch: --options hg -- "$@")
else
echo "Need a new version of getopt"
exit_with_error 1
@@ -262,6 +264,9 @@ while true; do
--with-rocsolver)
with_rocsolver=${2}
shift 2 ;;
+ --arch)
+ arch=${2}
+ shift 2 ;;
--verbose-print)
verbose_print=${2}
shift 2 ;;
@@ -335,11 +340,14 @@ pushd .
if [[ "${enable_tracing}" == on || "${enable_tracing}" == true || "${enable_tracing}" == 1 || "${enable_tracing}" == enabled ]]; then
cmake_common_options="${cmake_common_options} -DHPLMXP_TRACING=ON"
fi
+ if [[ -n "${arch}" ]]; then
+ cmake_common_options="${cmake_common_options} -DHPL_BUILD_ARCH=${arch}"
+ fi
shopt -u nocasematch
# Build library with AMD toolchain because of existence of device kernels
mkdir -p ${build_dir} && cd ${build_dir}
- ${cmake_executable} ${cmake_common_options} ..
+ ${cmake_executable} --fresh ${cmake_common_options} ..
check_exit_code 2
make -j$(nproc) install
diff --git a/src/hplmxp_ptest.cpp b/src/hplmxp_ptest.cpp
index 11d0f44..e8b1eee 100644
--- a/src/hplmxp_ptest.cpp
+++ b/src/hplmxp_ptest.cpp
@@ -211,7 +211,7 @@ void HPLMXP_ptest(HPLMXP_T_test& test,
ctime(&current_time_end));
}
#ifdef HPLMXP_PROGRESS_REPORT
- printf("Final Score: %7.4e GFLOPS \n", Gflops);
+ printf("Final Score: %7.9e GFLOPS \n", Gflops);
#endif
}
#ifdef HPLMXP_DETAILED_TIMING
diff --git a/src/pgesv/hplmxp_pgetrf.cpp b/src/pgesv/hplmxp_pgetrf.cpp
index ccbd4c0..0230b44 100644
--- a/src/pgesv/hplmxp_pgetrf.cpp
+++ b/src/pgesv/hplmxp_pgetrf.cpp
@@ -420,7 +420,7 @@ void HPLMXP_pgetrf(HPLMXP_T_grid& grid,
printf(" %9.3e |", step_gflops);
#endif
- printf(" %9.3e \n", gflops);
+ printf(" %9.9e \n", gflops);
}
#endif