:description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
:keywords: Using TransferBench, TransferBench Usage, TransferBench How To, API, ROCm, documentation, HIP
.. _using-transferbench:
---------------------
Using TransferBench
---------------------
You can control the SRC and DST memory locations by indicating the memory type followed by the device index. TransferBench supports the following memory types:
* Coarse-grained pinned host
* Unpinned host
* Fine-grained host
* Coarse-grained global device
* Fine-grained global device
* Null (for an empty transfer)
In addition, you can determine the size of the transfer (number of bytes to copy) for the tests.
You can also specify transfer executors. The options are CPU, kernel-based GPU, and SDMA-based GPU (DMA) executors. TransferBench also provides the option to choose the number of Sub-Executors (SE). The number of SEs specifies the number of CPU threads in the case of a CPU executor and the number of compute units (CU) for a GPU executor.
For a DMA executor, the SE argument determines the number of streams to be used.
You can specify the transfers in a configuration file or use preset configurations for transfers.
Specifying transfers in a configuration file
----------------------------------------------
A transfer is defined as a single operation where an executor reads and adds together values from SRC memory locations, followed by writing the sum to the DST memory locations.
This simplifies to a copy operation when using a single SRC or DST.
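The read-and-sum semantics described above can be sketched in a few lines of Python. This is an illustrative model only, not TransferBench code; ``run_transfer`` and the list-based buffers are invented for the example:

```python
# Illustrative model of a transfer: the executor element-wise sums the
# SRC buffers and writes the result to every DST buffer. With a single
# SRC and DST this reduces to a plain copy. Not TransferBench code.

def run_transfer(src_buffers, dst_buffers):
    total = [sum(vals) for vals in zip(*src_buffers)]
    for dst in dst_buffers:
        dst[:] = total

a = [1, 2, 3]
b = [10, 20, 30]
out = [0, 0, 0]
run_transfer([a, b], [out])
print(out)  # [11, 22, 33]
```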
Here's a copy operation from a single SRC to DST:

.. code-block:: bash

   SRC 1 -> Executor -> DST 1
   ...
   SRC X -> Executor -> DST Y
Three executors are supported by TransferBench:
.. code-block:: bash

   Executor:     SubExecutor:
   1. CPU        CPU thread
   2. GPU        GPU threadblock/Compute Unit (CU)
   3. DMA        N/A (can only be used for a single SRC to DST copy)
Each line in the configuration file defines a set of transfers, also known as a test, to run in parallel.
There are two ways to specify a test:

- **Basic**

  The basic specification assumes the same number of SEs used per transfer.
  A positive number of transfers is specified, followed by the number of SEs and triplets describing each transfer:

  .. code-block:: bash

     Transfers SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)

  **Example**:

  .. code-block:: bash

     1 4 (G0->G0->G1)            Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
     1 4 (C1->G2->G0)            Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
     2 4 G0->G0->G1 G1->G1->G0   Copies from GPU0 to GPU1 and from GPU1 to GPU0, each with 4 SEs

- **Advanced**

  A negative number of transfers is specified, followed by quintuplets, each giving the srcMem, Executor, dstMem, number of SEs, and number of bytes for one transfer.
  A non-zero byte count overrides any size provided on the command line.

  **Example**:

  .. code-block:: bash

     -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M)   Copies 1Mb from GPU0 to GPU1 with 4 SEs and 2Mb from GPU1 to GPU0 with 2 SEs
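As a sanity check on the basic format, here is a small Python sketch that parses a basic test line into its triplets. It is illustrative only; ``parse_basic_line`` is an invented helper, not part of TransferBench:

```python
# Minimal, illustrative parser for a "basic" TransferBench test line.
# Not part of TransferBench; it only mirrors the format described above.

def parse_basic_line(line: str):
    # Brackets and arrows are only for human clarity, so strip them first.
    cleaned = line.replace("(", " ").replace(")", " ").replace("->", " ")
    tokens = cleaned.split()
    num_transfers = int(tokens[0])
    if num_transfers <= 0:
        raise ValueError("basic lines start with a positive transfer count")
    num_ses = int(tokens[1])
    triplets = tokens[2:]
    if len(triplets) != 3 * num_transfers:
        raise ValueError("expected one (src, executor, dst) triplet per transfer")
    return [
        {"src": triplets[i], "executor": triplets[i + 1],
         "dst": triplets[i + 2], "sub_executors": num_ses}
        for i in range(0, len(triplets), 3)
    ]

print(parse_basic_line("2 4 G0->G0->G1 G1->G1->G0"))
```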
Here is the list of arguments used to specify transfers in the config file:

.. _config_file_arguments_table:

.. list-table::
   :header-rows: 1

   * - Argument
     - Description
   * - Transfers
     - Number of transfers to be run in parallel
   * - SE
     - Number of SEs to use (CPU threads or GPU threadblocks)
   * - srcMemL
     - | Source memory locations (where the data is read).
       | Memory locations are specified by one or more device characters, each followed by a device index (0-indexed):
       | - C: Pinned host memory (on NUMA node, indexed from 0 to [NUMA nodes - 1])
       | - U: Unpinned host memory (on NUMA node, indexed from 0 to [NUMA nodes - 1])
       | - B: Fine-grain host memory (on NUMA node, indexed from 0 to [NUMA nodes - 1])
       | - G: Global device memory (on GPU device, indexed from 0 to [GPUs - 1])
       | - F: Fine-grain device memory (on GPU device, indexed from 0 to [GPUs - 1])
       | - N: Null memory (index ignored)
   * - Executor
     - | Executor is specified by a character indicating type, followed by the device index (0-indexed):
       | - C: CPU-executed (indexed from 0 to [NUMA nodes - 1])
       | - G: GPU-executed (indexed from 0 to [GPUs - 1])
       | - D: DMA-executed (indexed from 0 to [GPUs - 1])
   * - dstMemL
     - Destination memory locations (where the data is written), specified the same way as srcMemL
   * - bytesL
     - | Number of bytes to copy (use the command-line specified size when 0).
       | Must be a multiple of four and can be suffixed with 'K', 'M', or 'G'.

Round brackets and arrows "->" can be included for human clarity, but are ignored.
Lines starting with # are ignored, while lines starting with ## are echoed to the output.
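The bytesL rules above can be expressed as a short Python helper. This sketch is illustrative rather than TransferBench code, and assumes the usual binary meanings of the suffixes (K = 2**10, M = 2**20, G = 2**30):

```python
# Illustrative validator for a bytesL value, assuming binary suffixes
# (K = 2**10, M = 2**20, G = 2**30). Not part of TransferBench itself.

SUFFIXES = {"K": 2**10, "M": 2**20, "G": 2**30}

def parse_bytes(spec: str) -> int:
    spec = spec.strip().upper()
    multiplier = 1
    if spec and spec[-1] in SUFFIXES:
        multiplier = SUFFIXES[spec[-1]]
        spec = spec[:-1]
    value = int(spec) * multiplier
    if value % 4 != 0:
        raise ValueError("byte counts must be a multiple of four")
    return value  # 0 means "use the command-line specified size"

print(parse_bytes("1M"))  # 1048576
```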
**Transfer examples:**
Single GPU-executed transfer between GPUs 0 and 1 using 4 CUs::

   1 4 (G0->G0->G1)
Single DMA-executed transfer between GPUs 0 and 1::

   1 1 (G0->D0->G1)
Copying 1Mb from GPU0 to GPU1 with 4 CUs, and 2Mb from GPU1 to GPU0 with 8 CUs::

   -2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)
Broadcast from GPU 0 to GPU 0 and GPU 1::

   1 16 (G0->G0->G0G1)
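A destination such as ``G0G1`` packs several memory locations into one token. A small parser for such tokens (again, an invented sketch rather than TransferBench code) might look like:

```python
import re

# Illustrative parser for a memory-location token such as "G0G1" or "C1",
# following the character/index pairs described in the arguments table.
# Not part of TransferBench.

MEM_TYPES = set("CUBGFN")

def parse_mem_locations(token: str):
    pairs = re.findall(r"([A-Z])(\d+)", token)
    locations = [(mem, int(idx)) for mem, idx in pairs]
    if not locations or any(mem not in MEM_TYPES for mem, _ in locations):
        raise ValueError(f"unknown memory type in {token!r}")
    return locations

print(parse_mem_locations("G0G1"))  # [('G', 0), ('G', 1)]
```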
.. note::

   Running TransferBench with no arguments displays usage instructions and detected topology information.
Using preset configurations
------------------------------
Here is the list of preset configurations that can be used instead of configuration files:

.. list-table::
   :header-rows: 1

   * - Configuration
     - Description
   * - ``a2a``
     - All-to-all benchmark test
   * - ``cmdline``
     - Allows transfers to run from the command line instead of a configuration file
   * - ``healthcheck``
     - Simple health check (supported on AMD Instinct MI300 series only)
   * - ``p2p``
     - Peer-to-peer benchmark test
   * - ``pcopy``
     - Benchmark parallel copies from a single GPU to other GPUs
   * - ``rsweep``
     - Random sweep across possible sets of transfers
   * - ``rwrite``
     - Benchmark parallel remote writes from a single GPU to other GPUs
   * - ``scaling``
     - GPU Sub-Executor scaling tests
   * - ``schmoo``
     - Local and remote read, write, and copy operations between two GPUs
   * - ``sweep``
     - Sweep across possible sets of transfers
Performance tuning
---------------------
When the same GPU executor is used in multiple simultaneous transfers on separate streams (by setting ``USE_SINGLE_STREAM=0``), the transfers might be serialized due to the limited number of hardware queues available.
To improve performance, adjust the maximum number of hardware queues using ``GPU_MAX_HW_QUEUES``.
:description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
:keywords: TransferBench, API, ROCm, documentation, HIP
****************************
TransferBench documentation
****************************
TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs). A transfer is a single operation where an executor reads and adds values from source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
This simplifies to a copy operation when using a single SRC or DST.
The code is open and hosted at `<https://github.com/ROCm/TransferBench>`_.
:description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
:keywords: Build TransferBench, Install TransferBench, API, ROCm, HIP
.. _install-transferbench:
---------------------------
Installing TransferBench
---------------------------
This topic describes how to build TransferBench.

Prerequisites
---------------

* Install the ROCm stack on the system to obtain the :doc:`HIP runtime <hip:index>`
* Install ``libnuma`` on the system
* `Enable AMD IOMMU <https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#iommu-configuration-systems-with-256-cpu-threads>`_ and set it to passthrough for AMD Instinct cards
Building TransferBench
--------------------------

To build TransferBench using the Makefile, use:
.. code-block:: bash

   make
To build TransferBench using CMake, use:
.. code-block:: bash

   mkdir build
   cd build
   CXX=/opt/rocm/bin/hipcc cmake ..
   make

.. note::

   If ROCm is installed in a folder other than ``/opt/rocm/``, set ``ROCM_PATH`` appropriately.

Building documentation
-----------------------

To build the documentation locally, use the following commands (this assumes the standard ROCm Sphinx setup in the ``docs`` folder):

.. code-block:: bash

   cd docs
   pip3 install -r sphinx/requirements.txt
   python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html