:keywords: TransferBench, API, ROCm, documentation, HIP
Using TransferBench
---------------------
Users control the SRC and DST memory locations by specifying a memory type followed by a device index. TransferBench supports the following memory types:

* coarse-grained pinned host memory
* unpinned host memory
* fine-grained host memory
* coarse-grained global device memory
* fine-grained global device memory
* null memory (for an empty transfer)

In addition, users can set the size of the transfer (number of bytes to copy) for their tests.
Users can also specify the executor of the transfer. The options are CPU, kernel-based GPU, and SDMA-based GPU (DMA) executors. TransferBench also lets users choose the number of sub-executors. For a CPU executor, this argument specifies the number of CPU threads; for a GPU executor, it specifies the number of compute units (CUs). If DMA is specified as the executor, the sub-executor argument determines the number of streams to be used.
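The choices described above can be summarized in a small data model. The following Python sketch is only an illustration: the class names, field names, and enum members are hypothetical and are not part of TransferBench. It mirrors the memory types, executors, and sub-executor meanings listed in the text.

.. code-block:: python

   from dataclasses import dataclass
   from enum import Enum


   class MemType(Enum):
       # Memory types listed above (member names are illustrative only).
       HOST_PINNED_COARSE = "coarse-grained pinned host memory"
       HOST_UNPINNED = "unpinned host memory"
       HOST_FINE = "fine-grained host memory"
       DEVICE_COARSE = "coarse-grained global device memory"
       DEVICE_FINE = "fine-grained global device memory"
       NULL = "null memory"


   class Executor(Enum):
       CPU = "CPU"
       GPU = "kernel-based GPU"
       DMA = "SDMA-based GPU (DMA)"


   # What the sub-executor count means for each executor, as described above.
   SUB_EXECUTOR_UNIT = {
       Executor.CPU: "CPU threads",
       Executor.GPU: "compute units (CUs)",
       Executor.DMA: "streams",
   }


   @dataclass(frozen=True)
   class MemLocation:
       mem_type: MemType
       device_index: int


   @dataclass(frozen=True)
   class TransferSpec:
       src: MemLocation
       dst: MemLocation
       executor: Executor
       num_sub_executors: int
       num_bytes: int

       def describe(self) -> str:
           """Return a one-line, human-readable summary of this transfer."""
           unit = SUB_EXECUTOR_UNIT[self.executor]
           return (
               f"{self.num_bytes} bytes: "
               f"{self.src.mem_type.value} {self.src.device_index} -> "
               f"{self.dst.mem_type.value} {self.dst.device_index} "
               f"via {self.executor.value} executor "
               f"using {self.num_sub_executors} {unit}"
           )

For example, a ``TransferSpec`` with a GPU executor and ``num_sub_executors=4`` is described as using "4 compute units (CUs)", while the same count with a CPU executor is described as "4 CPU threads".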
Refer to the following example to use TransferBench.
--------------------
ConfigFile format
--------------------
A Transfer is defined as a single operation where an executor reads and adds together values from source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
This reduces to a simple copy operation when there is a single SRC and a single DST.

.. code-block:: text

   SRC 0                DST 0
   SRC 1 -> Executor -> DST 1
   SRC X                DST Y
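The read, add, and write behavior of a Transfer can be sketched in a few lines of Python. This is a conceptual model only, assuming equally sized SRC buffers held as plain Python lists; the function name ``run_transfer`` is hypothetical and does not reflect how TransferBench is implemented.

.. code-block:: python

   from typing import List, Sequence


   def run_transfer(srcs: Sequence[Sequence[float]], num_dsts: int) -> List[List[float]]:
       """Read every SRC buffer, add them element-wise, and write the sum to each DST."""
       if not srcs:
           raise ValueError("this sketch needs at least one SRC buffer")
       length = len(srcs[0])
       if any(len(src) != length for src in srcs):
           raise ValueError("all SRC buffers must have the same length")
       # Element-wise sum across all SRC buffers.
       summed = [sum(values) for values in zip(*srcs)]
       # Every DST receives its own copy of the summed data.
       return [list(summed) for _ in range(num_dsts)]

With two sources, ``run_transfer([[1, 2, 3], [10, 20, 30]], 2)`` returns ``[[11, 22, 33], [11, 22, 33]]``. With a single source and a single destination, ``run_transfer([[4, 5, 6]], 1)`` returns ``[[4, 5, 6]]``, which is the plain copy case.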
Three Executors are supported by TransferBench::

   Executor:    SubExecutor:
   1) CPU       CPU thread
   2) GPU       GPU threadblock/Compute Unit (CU)
   3) DMA       N/A (may only be used for copies, that is, a single SRC/DST)
Each line in the configuration file defines a set of Transfers (a Test) to run in parallel.

There are two ways to specify a test:

1) Basic

   The basic specification assumes that the same number of SubExecutors (SEs) is used for every Transfer.
   A positive number of Transfers is specified, followed by that number of triplets describing each Transfer::

      Transfers SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)

2) Advanced

   A negative number of Transfers is specified, followed by quintuplets describing each Transfer.
   A non-zero number of bytes specified for a Transfer will override any other provided value.
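To make the two forms concrete, the following Python sketch parses a single Test line into a list of Transfers. It is not TransferBench's own parser. The function and type names are hypothetical, memory and executor tokens are treated as opaque strings, and the field order inside an advanced quintuplet (``src->executor->dst``, then SubExecutor count, then bytes) is an assumption made for illustration.

.. code-block:: python

   import re
   from typing import List, NamedTuple


   class Transfer(NamedTuple):
       src: str
       executor: str
       dst: str
       num_sub_executors: int
       num_bytes: int  # 0 means "no per-Transfer size given" in this sketch


   def parse_test_line(line: str) -> List[Transfer]:
       """Parse one Test line written in the basic or advanced form."""
       groups = re.findall(r"\(([^()]*)\)", line)
       head = line.split("(", 1)[0].split()
       if not head:
           raise ValueError("missing Transfer count")
       count = int(head[0])
       if count == 0 or len(groups) != abs(count):
           raise ValueError("Transfer count does not match the number of groups")

       transfers = []
       if count > 0:
           # Basic form: one shared SubExecutor count, then triplets.
           if len(head) != 2:
               raise ValueError("basic form expects: Transfers SEs (triplet) ...")
           shared_ses = int(head[1])
           for group in groups:
               src, executor, dst = (part.strip() for part in group.split("->"))
               transfers.append(Transfer(src, executor, dst, shared_ses, 0))
       else:
           # Advanced form: quintuplets (field order is assumed, see above).
           for group in groups:
               route, ses, num_bytes = group.split()
               src, executor, dst = route.split("->")
               transfers.append(Transfer(src, executor, dst, int(ses), int(num_bytes)))
       return transfers

Under these assumptions, the basic line ``2 4 (srcMem1->Executor1->dstMem1) (srcMem2->Executor2->dstMem2)`` yields two Transfers that each use 4 SubExecutors, while the advanced line ``-1 (srcMem1->Executor1->dstMem1 2 1024)`` yields one Transfer with 2 SubExecutors and a size of 1024 bytes.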
TransferBench is a utility capable of benchmarking simultaneous transfers between user-specified devices (CPUs/GPUs). A transfer is a single operation where an executor reads and adds values from source (SRC) memory locations, then writes the sum to destination (DST) memory locations. This simplifies to a simple copy operation when dealing with a single SRC/DST.
For more information, see `GitHub <https://github.com/ROCm/TransferBench>`_.
For more examples, see :ref:`Examples`.