Unverified Commit 340244a6 authored by srawat, committed by GitHub

Doc update (#123)

parent 203808ed
# TransferBench

TransferBench is a utility for benchmarking simultaneous copies between user-specified
CPU and GPU devices.

Documentation for TransferBench is available at
[https://rocm.docs.amd.com/projects/TransferBench/en/latest/index.html](https://rocm.docs.amd.com/projects/TransferBench/en/latest/index.html).
## Requirements
* You must have a ROCm stack installed on your system (HIP runtime)
* You must have `libnuma` installed on your system
* AMD IOMMU must be enabled and set to passthrough for AMD Instinct cards
## Documentation
To build documentation locally, use the following code:
```shell
cd docs
pip3 install -r .sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
```
## Building TransferBench
You can build TransferBench using Makefile or CMake.
* Makefile:
```shell
make
```
* CMake:
```shell
mkdir build
cd build
CXX=/opt/rocm/bin/hipcc cmake ..
make
```
If ROCm is not installed in `/opt/rocm/`, you must set `ROCM_PATH` to the correct location.
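For example, with ROCm installed under a custom prefix, a CMake build might look like the following sketch (the install path shown is an assumption):
```shell
# Assumption: ROCm lives in /opt/rocm-6.1.0 instead of /opt/rocm
export ROCM_PATH=/opt/rocm-6.1.0
mkdir build && cd build
CXX=$ROCM_PATH/bin/hipcc cmake ..
make
```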
## NVIDIA platform support
You can build TransferBench to run on NVIDIA platforms via HIP or native NVCC.
Use the following code to build with HIP for NVIDIA (note that you must have a HIP-compatible CUDA
version installed, e.g., CUDA 11.5):
```shell
CUDA_PATH=<path_to_CUDA> HIP_PLATFORM=nvidia make
```
Use the following code to build with native NVCC (builds `TransferBenchCuda`):
```shell
make
```
## Things to note
* Running TransferBench with no arguments displays usage instructions and detected topology
information
* You can use several preset configurations instead of a configuration file (see the example after this list):
* `a2a` : All-to-all benchmark test
  * `cmdline` : Takes Transfers to run from the command line instead of from a file
* `healthcheck` : Simple health check (supported on MI300 series only)
* `p2p` : Peer-to-peer benchmark test
* `pcopy` : Benchmark parallel copies from a single GPU to other GPUs
* `rsweep` : Random sweep across possible sets of transfers
* `rwrite` : Benchmarks parallel remote writes from a single GPU to other GPUs
* `scaling`: GPU subexecutor scaling tests
* `schmoo` : Local/Remote read/write/copy between two GPUs
* `sweep` : Sweep across possible sets of transfers
* When using the same GPU executor in multiple simultaneous transfers on separate streams (USE_SINGLE_STREAM=0),
performance may be serialized due to the maximum number of hardware queues available
* The number of maximum hardware queues can be adjusted via `GPU_MAX_HW_QUEUES`
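For instance, a preset can be invoked directly in place of a configuration file. This is a sketch: it assumes the built binary is named `TransferBench`, takes the preset name as its first argument, and accepts an optional bytes-per-Transfer value as its second argument.
```shell
# Peer-to-peer benchmark with the default transfer size
./TransferBench p2p

# All-to-all benchmark with 64M per Transfer
# (the size argument is assumed to accept the same K/M/G suffixes as the config file)
./TransferBench a2a 64M
```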
--------------------
ConfigFile format
--------------------
A Transfer is defined as a single operation where an executor reads and adds together
values from Source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
This simplifies to a copy operation when dealing with a single SRC/DST.
SRC 0 DST 0
SRC 1 -> Executor -> DST 1
SRC X DST Y
Three Executors are supported by TransferBench::
Executor:     SubExecutor:
1) CPU        CPU thread
2) GPU        GPU threadblock/Compute Unit (CU)
3) DMA        N/A (May only be used for copies with a single SRC/DST)
Each single line in the configuration file defines a set of Transfers (a Test) to run in parallel
There are two ways to specify a test:
1) Basic
The basic specification assumes the same number of SubExecutors (SE) used per Transfer
A positive number of Transfers is specified followed by that number of triplets describing each Transfer
Transfers SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)
2) Advanced
A negative number of Transfers is specified, followed by quintuplets describing each Transfer
A non-zero number of bytes specified will override any provided value
-Transfers (srcMem1->Executor1->dstMem1 SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL SEsL BytesL)
Argument Details::
Transfers: Number of Transfers to be run in parallel
SEs : Number of SubExecutors to use (CPU threads / GPU threadblocks)
srcMemL : Source memory locations (Where the data is to be read from)
Executor : Executor is specified by a character indicating type, followed by device index (0-indexed)
- C: CPU-executed (Indexed from 0 to NUMA nodes - 1)
- G: GPU-executed (Indexed from 0 to GPUs - 1)
- D: DMA-executor (Indexed from 0 to GPUs - 1)
dstMemL : Destination memory locations (Where the data is to be written to)
bytesL : Number of bytes to copy (0 means use command-line specified size)
Must be a multiple of 4 and may be suffixed with ('K','M', or 'G')
Memory locations are specified by one or more (device character / device index) pairs
Character indicating memory type followed by device index (0-indexed)
Supported memory locations are:
- C: Pinned host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- U: Unpinned host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- B: Fine-grain host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- G: Global device memory (on GPU device indexed from 0 to [GPUs - 1])
- F: Fine-grain device memory (on GPU device indexed from 0 to [GPUs - 1])
- N: Null memory (index ignored)
Examples::
1 4 (G0->G0->G1) Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
1 4 (C1->G2->G0) Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
2 4 G0->G0->G1 G1->G1->G0 Copies from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs
-2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M) Copies 1MB from GPU0 to GPU1 with 4 SEs, and 2MB from GPU1 to GPU0 with 2 SEs
Round brackets and arrows '->' may be included for human clarity, but will be ignored and are unnecessary
Lines starting with # will be ignored. Lines starting with ## will be echoed to output
Single GPU-executed Transfer between GPUs 0 and 1 using 4 CUs::
1 4 (G0->G0->G1)
Single DMA executed Transfer between GPUs 0 and 1::
1 1 (G0->D0->G1)
Copy 1MB from GPU0 to GPU1 with 4 CUs, and 2MB from GPU1 to GPU0 with 8 CUs::
-2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)
"Memset" by GPU 0 to GPU 0 memory::
1 32 (N0->G0->G0)
"Read-only" by CPU 0::
1 4 (C0->C0->N0)
Broadcast from GPU 0 to GPU 0 and GPU 1::
1 16 (G0->G0->G0G1)
.. _Examples:
--------------------
Examples
--------------------
.. toctree::
   :maxdepth: 3
   :caption: Contents:

   configfile_format
.. meta::
   :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
   :keywords: Using TransferBench, TransferBench Usage, TransferBench How To, API, ROCm, documentation, HIP

.. _using-transferbench:

---------------------
Using TransferBench
---------------------
You can control the SRC and DST memory locations by indicating the memory type followed by the device index. TransferBench supports the following memory types:

* Coarse-grained pinned host
* Unpinned host
* Fine-grained host
* Coarse-grained global device
* Fine-grained global device
* Null (for an empty transfer)

In addition, you can determine the size of the transfer (number of bytes to copy) for the tests.

You can also specify transfer executors. The options are CPU, kernel-based GPU, and SDMA-based GPU (DMA) executors. TransferBench also provides the option to choose the number of Sub-Executors (SE). The number of SEs specifies the number of CPU threads in the case of a CPU executor and the number of compute units (CU) for a GPU executor.
For a DMA executor, the SE argument determines the number of streams to be used.

You can specify the transfers in a configuration file or use preset configurations for transfers.
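As a quick illustration of how the memory types, executors, and SE counts combine, the sketch below uses the configuration syntax described in the next section; the memory-type and executor characters are the ones documented there.

.. code-block:: bash

   1 8 (C0->G0->F1)   8 CUs on GPU 0 read pinned host memory on NUMA node 0 and write fine-grained memory on GPU 1
   1 4 (G0->C0->C1)   4 CPU threads on NUMA node 0 copy from GPU 0 memory to pinned host memory on NUMA node 1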
Specifying transfers in a configuration file
----------------------------------------------
A transfer is defined as a single operation where an executor reads and adds together values from SRC memory locations, followed by writing the sum to the DST memory locations.
This simplifies to a copy operation when using a single SRC or DST.

Here's a copy operation from a single SRC to DST:

.. code-block:: bash

   SRC 0                    DST 0
   SRC 1  ->  Executor  ->  DST 1
   SRC X                    DST Y
Three executors are supported by TransferBench:

.. code-block:: bash

   Executor:      SubExecutor:
   1. CPU         CPU thread
   2. GPU         GPU threadblock/Compute Unit (CU)
   3. DMA         N/A (Can only be used for a single SRC to DST copy)

Each line in the configuration file defines a set of transfers, also known as a test, to run in parallel.
There are two ways to specify a test:

- **Basic**

  The basic specification assumes the same number of SEs used per transfer.
  A positive number of transfers is specified, followed by the number of SEs and triplets describing each transfer:

  .. code-block:: bash

     Transfers SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)

  The arguments used to specify transfers in the config file are described in the :ref:`arguments table <config_file_arguments_table>`.

  **Example**:

  .. code-block:: bash

     1 4 (G0->G0->G1)            Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
     1 4 (C1->G2->G0)            Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
     2 4 G0->G0->G1 G1->G1->G0   Copies from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs
- **Advanced**

  In the advanced specification, a negative number of transfers is specified, followed by quintuplets describing each transfer.
  Specifying a non-zero number of bytes overrides any provided value.

  .. code-block:: bash

     -Transfers (srcMem1->Executor1->dstMem1 SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL SEsL BytesL)

  The arguments used to specify transfers in the config file are described in the :ref:`arguments table <config_file_arguments_table>`.

  **Example**:

  .. code-block:: bash

     -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M)   Copies 1MB from GPU0 to GPU1 with 4 SEs and 2MB from GPU1 to GPU0 with 2 SEs
Here is the list of arguments used to specify transfers in the config file:

.. _config_file_arguments_table:

.. list-table::
   :header-rows: 1

   * - Argument
     - Description
   * - Transfers
     - Number of transfers to be run in parallel
   * - SE
     - Number of SEs to use (CPU threads or GPU threadblocks)
   * - srcMemL
     - Source memory locations (where the data is read)
   * - Executor
     - | Executor is specified by a character indicating type, followed by the device index (0-indexed):
       | - C: CPU-executed (indexed from 0 to NUMA nodes - 1)
       | - G: GPU-executed (indexed from 0 to GPUs - 1)
       | - D: DMA-executor (indexed from 0 to GPUs - 1)
   * - dstMemL
     - Destination memory locations (where the data is written)
   * - bytesL
     - | Number of bytes to copy (use command-line specified size when 0).
       | Must be a multiple of four and can be suffixed with 'K', 'M', or 'G'.

Memory locations are specified by one or more device character and device index pairs. The character indicates the memory type and is followed by the device index (0-indexed):

- C: Pinned host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- U: Unpinned host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- B: Fine-grain host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- G: Global device memory (on GPU device, indexed from 0 to [GPUs - 1])
- F: Fine-grain device memory (on GPU device, indexed from 0 to [GPUs - 1])
- N: Null memory (index ignored)

Round brackets and arrows "->" can be included for human clarity, but will be ignored.
Lines starting with # are ignored while lines starting with ## are echoed to the output.
**Transfer examples:**

Single GPU-executed transfer between GPU 0 and 1 using 4 CUs::

   1 4 (G0->G0->G1)

Single DMA-executed transfer between GPU 0 and 1::

   1 1 (G0->D0->G1)

Copying 1MB from GPU 0 to GPU 1 with 4 CUs, and 2MB from GPU 1 to GPU 0 with 8 CUs::

   -2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)
Broadcast from GPU 0 to GPU 0 and GPU 1::

   1 16 (G0->G0->G0G1)
.. note::

   Running TransferBench with no arguments displays usage instructions and detected topology information.
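Transfer lines like the ones above are placed in a configuration file that is passed to TransferBench. The following sketch assumes the executable is named ``TransferBench``, takes the configuration file as its first argument, and accepts an optional bytes-per-transfer value as its second argument; the file name is a placeholder.

.. code-block:: bash

   # Placeholder configuration file with one GPU-executed transfer
   echo "1 4 (G0->G0->G1)" > transfers.cfg

   # Run with the default transfer size
   ./TransferBench transfers.cfg

   # Override the bytes per transfer (size assumed to accept K/M/G suffixes, as in the config file)
   ./TransferBench transfers.cfg 64M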
Using preset configurations
------------------------------
Here is the list of preset configurations that can be used instead of configuration files:
.. list-table::
   :header-rows: 1

   * - Configuration
     - Description
   * - ``a2a``
     - All-to-all benchmark test
   * - ``cmdline``
     - Allows transfers to run from the command line instead of a configuration file
   * - ``healthcheck``
     - Simple health check (supported on AMD Instinct MI300 series only)
   * - ``p2p``
     - Peer-to-peer benchmark test
   * - ``pcopy``
     - Benchmark parallel copies from a single GPU to other GPUs
   * - ``rsweep``
     - Random sweep across possible sets of transfers
   * - ``rwrite``
     - Benchmark parallel remote writes from a single GPU to other GPUs
   * - ``scaling``
     - GPU subexecutor scaling tests
   * - ``schmoo``
     - Local or remote read, write, and copy operations between two GPUs
   * - ``sweep``
     - Sweep across possible sets of transfers
Performance tuning
---------------------
When you use the same GPU executor in multiple simultaneous transfers on separate streams by setting ``USE_SINGLE_STREAM=0``, the performance might be serialized due to the maximum number of hardware queues available.
To improve the performance, adjust the number of maximum hardware queues using ``GPU_MAX_HW_QUEUES``.
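For example, a run that uses separate streams and a larger hardware queue limit might be launched as follows; this is a sketch in which the environment variable names come from this section and the configuration file name is a placeholder.

.. code-block:: bash

   USE_SINGLE_STREAM=0 GPU_MAX_HW_QUEUES=8 ./TransferBench transfers.cfg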
.. meta::
:description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
:keywords: TransferBench, API, ROCm, documentation, HIP
****************************
TransferBench documentation
****************************

TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs). A transfer is a single operation where an executor reads and adds values from source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
This simplifies to a copy operation when dealing with a single SRC or DST.

The code is open and hosted at `<https://github.com/ROCm/TransferBench>`_.
.. grid:: 2
   :gutter: 3

   .. grid-item-card:: Install

      * :ref:`install-transferbench`

   .. grid-item-card:: API reference

      * :ref:`transferbench-api`

   .. grid-item-card:: How to

      * :ref:`using-transferbench`
To contribute to the documentation, refer to
`Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.

You can find licensing information on the
`Licensing <https://rocm.docs.amd.com/en/latest/about/license.html>`_ page.
.. meta::
   :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
   :keywords: Build TransferBench, Install TransferBench, API, ROCm, HIP

.. _install-transferbench:

---------------------------
Installing TransferBench
---------------------------

This topic describes how to build TransferBench.

Prerequisites
---------------

* Install ROCm stack on the system to obtain :doc:`HIP runtime <hip:index>`
* Install ``libnuma`` on the system (see the example after this list)
* `Enable AMD IOMMU <https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#iommu-configuration-systems-with-256-cpu-threads>`_ and set to passthrough for AMD Instinct cards
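For instance, on a Debian or Ubuntu based system, ``libnuma`` can typically be installed through the package manager; this is a sketch, and the package name is an assumption about the target distribution.

.. code-block:: bash

   # Debian/Ubuntu package name assumed to be libnuma-dev
   sudo apt-get install libnuma-dev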
Building TransferBench
------------------------

To build TransferBench using Makefile, use:

.. code-block:: bash

   make

To build TransferBench using CMake, use:

.. code-block:: bash

   mkdir build
   cd build
   CXX=/opt/rocm/bin/hipcc cmake ..
   make

.. note::

   If ROCm is installed in a folder other than ``/opt/rocm/``, set ``ROCM_PATH`` appropriately.
Building documentation
-----------------------

To build documentation locally, use:

.. code-block:: bash

   cd docs
   pip3 install -r .sphinx/requirements.txt
   python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
NVIDIA platform support
--------------------------

You can build TransferBench to run on NVIDIA platforms using the native NVIDIA CUDA Compiler Driver (NVCC).

To build with native NVCC, use:

.. code-block:: bash

   make

TransferBench looks for NVCC in ``/usr/local/cuda`` by default. To modify the location of NVCC, use the environment variable ``CUDA_PATH``:

.. code-block:: bash

   CUDA_PATH=/usr/local/cuda make
.. meta::
   :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
   :keywords: TransferBench API, TransferBench library, documentation, HIP

.. _transferbench-api:

--------------------------
TransferBench API library
--------------------------

.. doxygenindex::
subtrees:
  - caption: Install
    entries:
    - file: install/install.rst
  - caption: API reference
    entries:
    - file: reference/api.rst
  - caption: How to
    entries:
    - file: how to/use-transferbench.rst
  - caption: About
    entries: