.. meta::
  :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
  :keywords: TransferBench usage, TransferBench how to, TransferBench user guide, TransferBench user manual

.. _using-transferbench:

---------------------
Using TransferBench
---------------------

You can control the SRC and DST memory locations by indicating the memory type followed by the device index. TransferBench supports the following memory types:

* Coarse-grained pinned host
* Unpinned host
* Fine-grained host
* Coarse-grained global device
* Fine-grained global device
* Null (for an empty transfer)

In addition, you can specify the size of the transfer (the number of bytes to copy) for the tests.

You can also specify transfer executors. The options are CPU, kernel-based GPU, and SDMA-based GPU (DMA) executors. TransferBench also provides the option to choose the number of Sub-Executors (SE). The number of SEs specifies the number of CPU threads in the case of a CPU executor and the number of compute units (CU) for a GPU executor.
For a DMA executor, the SE argument determines the number of streams to be used.

You can specify the transfers in a configuration file or use preset configurations for transfers.

Specifying transfers in a configuration file
----------------------------------------------

A transfer is a single operation in which an executor reads and adds together the values from the SRC memory locations, then writes the sum to the DST memory locations.
With a single SRC and DST, this simplifies to a copy operation.
The general form of a transfer is:

.. code-block:: bash

   SRC 0                DST 0
   SRC 1 -> Executor -> DST 1
   SRC X                DST Y

Three executors are supported by TransferBench:

.. code-block:: bash

  Executor:        SubExecutor:
  1. CPU           CPU thread
  2. GPU           GPU threadblock/Compute Unit (CU)
  3. DMA           N/A (Can only be used for a single SRC to DST copy)

Each line in the configuration file defines a set of transfers, also known as a test, to run in parallel.

There are two ways to specify a test:

- **Basic**

  The basic specification uses the same number of SEs for every transfer.
  A positive number of transfers is specified, followed by the number of SEs and then one triplet per transfer:

  .. code-block:: bash

    Transfers SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)

  The arguments used to specify transfers in the config file are described in the :ref:`arguments table <config_file_arguments_table>`.

  **Example**:

  .. code-block:: bash

   1 4 (G0->G0->G1)                   Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
   1 4 (C1->G2->G0)                   Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
   2 4 G0->G0->G1 G1->G1->G0          Copies from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs

- **Advanced**

  In the advanced specification, a negative number of transfers is specified, followed by one quintuplet per transfer.
  A non-zero number of bytes in a quintuplet overrides the transfer size specified on the command line.

  .. code-block:: bash

    Transfers (srcMem1->Executor1->dstMem1 SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL SEsL BytesL)

  The arguments used to specify transfers in the config file are described in the :ref:`arguments table <config_file_arguments_table>`.

  **Example**:

  .. code-block:: bash

   -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M) Copies 1MB from GPU0 to GPU1 with 4 SEs and 2MB from GPU1 to GPU0 with 2 SEs
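
The size field of a quintuplet (``1M`` and ``2M`` in the example above) follows the rules given for ``bytesL`` in the arguments table. The following Python sketch shows one way to turn such a token into a byte count. It is an illustration only, not TransferBench's own code: the function name ``parse_size`` is hypothetical, and treating ``K``, ``M``, and ``G`` as powers of 1024 is an assumption.

.. code-block:: python

   def parse_size(token: str) -> int:
       """Convert a size token such as "4096", "64K", or "1M" to bytes."""
       # Assumption: suffixes are binary multiples (K = 1024 bytes).
       multipliers = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}
       token = token.strip()
       if not token:
           raise ValueError("empty size token")

       suffix = token[-1].upper()
       if suffix in multipliers:
           value = int(token[:-1]) * multipliers[suffix]
       else:
           value = int(token)

       # The size must be a multiple of four. Zero is allowed and means
       # "use the size specified on the command line".
       if value < 0 or value % 4 != 0:
           raise ValueError(f"size must be a non-negative multiple of 4: {token!r}")
       return value

A token of ``0`` is returned unchanged so that the caller can fall back to the command-line size.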

Here is the list of arguments used to specify transfers in the config file:

.. _config_file_arguments_table:

.. list-table::
   :header-rows: 1

   * - Argument
     - Description

   * - Transfers
     - Number of transfers to be run in parallel

   * - SEs
     - Number of SEs to use (CPU threads or GPU threadblocks)

   * - srcMemL
     - Source memory locations (where the data is read)

   * - Executor
     - | Executor is specified by a character indicating type, followed by the device index (0-indexed):
       | - C: CPU-executed  (indexed from 0 to NUMA nodes - 1)
       | - G: GPU-executed  (indexed from 0 to GPUs - 1)
       | - D: DMA-executor  (indexed from 0 to GPUs - 1)

   * - dstMemL
     - Destination memory locations (where the data is written)

   * - bytesL
     - | Number of bytes to copy (when 0, the size specified on the command line is used).
       | Must be a multiple of four and can be suffixed with 'K', 'M', or 'G'.

Memory locations (``srcMemL`` and ``dstMemL``) are specified by one or more pairs of a memory type character and a device index, for example ``G0`` or ``G0G1``.
The character indicates the memory type and is followed by the device index (0-indexed).
Here are the characters and their respective memory locations:

* ``C``: Pinned host memory (on NUMA node, indexed from 0 to [NUMA nodes - 1])
* ``U``: Unpinned host memory (on NUMA node, indexed from 0 to [NUMA nodes - 1])
* ``B``: Fine-grained host memory (on NUMA node, indexed from 0 to [NUMA nodes - 1])
* ``G``: Global device memory (on GPU device, indexed from 0 to [GPUs - 1])
* ``F``: Fine-grained device memory (on GPU device, indexed from 0 to [GPUs - 1])
* ``N``: Null memory (index ignored)

Round brackets and arrows "->" can be included for human clarity, but will be ignored.
Lines starting with # are ignored while lines starting with ## are echoed to the output.
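
The basic and advanced formats, together with the rules about brackets, arrows, and comment lines, can be sketched as a small Python parser. This is a minimal illustration of the file format as described on this page, not the parser that TransferBench uses. The names ``Transfer`` and ``parse_test_line`` are hypothetical, and ignoring any tokens that follow the last transfer is a choice made for this sketch.

.. code-block:: python

   from dataclasses import dataclass
   from typing import List, Optional


   @dataclass
   class Transfer:
       src: str       # source memory locations, for example "G0"
       executor: str  # executor, for example "G0", "C1", or "D0"
       dst: str       # destination memory locations, for example "G0G1"
       num_ses: int   # number of sub-executors
       size: str      # raw size token; "0" means use the command-line size


   def parse_test_line(line: str) -> Optional[List[Transfer]]:
       """Parse one configuration line; return None for blank or '#' lines."""
       text = line.strip()
       if not text or text.startswith("#"):
           # This sketch also skips "##" lines instead of echoing them.
           return None

       # Round brackets and arrows are only there for readability.
       for decoration in ("(", ")", "->"):
           text = text.replace(decoration, " ")
       tokens = text.split()

       count = int(tokens[0])
       if count == 0:
           raise ValueError("the number of transfers must be non-zero")
       if count > 0:
           # Basic: one shared SE count, then one triplet per transfer.
           shared_ses, width, fields = int(tokens[1]), 3, tokens[2:]
       else:
           # Advanced: one quintuplet per transfer.
           shared_ses, width, fields = None, 5, tokens[1:]

       needed = abs(count) * width
       if len(fields) < needed:
           raise ValueError(f"expected {needed} fields, found {len(fields)}")

       transfers = []
       for start in range(0, needed, width):
           group = fields[start:start + width]
           if shared_ses is not None:
               src, executor, dst = group
               transfers.append(Transfer(src, executor, dst, shared_ses, "0"))
           else:
               src, executor, dst, ses, size = group
               transfers.append(Transfer(src, executor, dst, int(ses), size))
       return transfers

For example, ``parse_test_line("-2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M)")`` yields two ``Transfer`` objects, the first with 4 SEs and a size token of ``1M``.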

**Transfer examples:**

Single GPU-executed transfer between GPU 0 and 1 using 4 CUs::

   1 4 (G0->G0->G1)

Single DMA-executed transfer between GPU 0 and 1::

   1 1 (G0->D0->G1)

Copying 1MB from GPU 0 to GPU 1 with 4 CUs, and 2MB from GPU 1 to GPU 0 with 8 CUs::

   -2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)

"Memset" by GPU 0 to GPU 0 memory::

   1 32 (N0->G0->G0)

"Read-only" by CPU 0::

   1 4 (C0->C0->N0)

Broadcast from GPU 0 to GPU 0 and GPU 1::

   1 16 (G0->G0->G0G1)
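
The examples above differ only in their SRC and DST memory locations. The following Python sketch decodes a memory location string such as ``G0G1`` into (memory type, device index) pairs and labels a transfer accordingly. It is an illustration under stated assumptions: the names ``parse_memory_locations`` and ``classify`` are hypothetical, and the labels are taken from the example captions and from the description of a transfer as a sum over SRC locations.

.. code-block:: python

   import re
   from typing import List, Tuple

   # Memory type characters listed on this page.
   MEMORY_TYPES = {
       "C": "pinned host memory",
       "U": "unpinned host memory",
       "B": "fine-grained host memory",
       "G": "global device memory",
       "F": "fine-grained device memory",
       "N": "null memory",
   }


   def parse_memory_locations(spec: str) -> List[Tuple[str, int]]:
       """Split a string such as "G0G1" into [("G", 0), ("G", 1)]."""
       pairs = re.findall(r"([CUBGFN])(\d+)", spec)
       # Reject input that contains anything other than valid pairs.
       if not pairs or "".join(char + index for char, index in pairs) != spec:
           raise ValueError(f"invalid memory location string: {spec!r}")
       return [(char, int(index)) for char, index in pairs]


   def classify(src: str, dst: str) -> str:
       """Label a transfer by its non-null SRC and DST locations."""
       sources = [p for p in parse_memory_locations(src) if p[0] != "N"]
       dests = [p for p in parse_memory_locations(dst) if p[0] != "N"]
       if not sources and not dests:
           return "empty"
       if not sources:
           return "memset"      # nothing is read, only written
       if not dests:
           return "read-only"   # nothing is written, only read
       if len(sources) > 1:
           return "reduction"   # values from several SRCs are summed
       if len(dests) > 1:
           return "broadcast"   # one SRC written to several DSTs
       return "copy"

With these definitions, ``classify("N0", "G0")`` returns ``"memset"`` and ``classify("G0", "G0G1")`` returns ``"broadcast"``, matching the captions of the examples.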

.. note::

   Running TransferBench with no arguments displays usage instructions and detected topology information.

Using preset configurations
------------------------------

Here is the list of preset configurations that can be used instead of configuration files:

.. list-table::
   :header-rows: 1

   * - Configuration
     - Description

   * - ``a2a``
     - All-to-all benchmark test

   * - ``cmdline``
     - Allows transfers to run from the command line instead of a configuration file

   * - ``healthcheck``
     - Simple health check (supported on AMD Instinct MI300 series only)

   * - ``p2p``
     - Peer-to-peer benchmark test

   * - ``pcopy``
     - Benchmark parallel copies from a single GPU to other GPUs

   * - ``rsweep``
     - Random sweep across possible sets of transfers

   * - ``rwrite``
     - Benchmark parallel remote writes from a single GPU to other GPUs

   * - ``scaling``
     - GPU subexecutor scaling tests

   * - ``schmoo``
     - Local or remote read, write, or copy operations between two GPUs

   * - ``sweep``
     - Sweep across possible sets of transfers
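
As a rough sketch, a script that launches TransferBench could validate a preset name against the table above before building the command line. The helper names below are hypothetical. The invocation form (the binary, followed by a preset name or configuration file, followed by an optional size) and the ``./TransferBench`` path are assumptions, so check the usage instructions printed by TransferBench before relying on them.

.. code-block:: python

   import shlex
   from typing import List, Optional

   # Preset names from the table above.
   PRESETS = (
       "a2a", "cmdline", "healthcheck", "p2p", "pcopy",
       "rsweep", "rwrite", "scaling", "schmoo", "sweep",
   )


   def is_preset(config: str) -> bool:
       """Return True if the argument names a preset, not a file."""
       return config in PRESETS


   def build_command(config: str, size: Optional[str] = None,
                     binary: str = "./TransferBench") -> List[str]:
       """Build an argument list for a preset or a configuration file."""
       # Assumed form: <binary> <preset or config file> [size]
       command = [binary, config]
       if size is not None:
           command.append(size)
       return command


   # Print the command instead of running it.
   print(shlex.join(build_command("p2p")))

The resulting list can be passed to ``subprocess.run`` once the invocation form has been confirmed for the installed version.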

Performance tuning
---------------------

When you use the same GPU executor in multiple simultaneous transfers on separate streams (by setting ``USE_SINGLE_STREAM=0``), the transfers might be serialized because of the limit on the number of available hardware queues.
To improve performance, adjust the maximum number of hardware queues using ``GPU_MAX_HW_QUEUES``.
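
One way to apply these settings from a launcher script is to prepare the environment before starting TransferBench. The following Python sketch only builds the environment. The function name ``tuned_environment`` is hypothetical, and the queue count of 8 in the usage note is an illustrative value, not a recommendation.

.. code-block:: python

   import os
   from typing import Dict, Mapping, Optional


   def tuned_environment(max_hw_queues: int,
                         base: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
       """Return a copy of the environment with the two variables set."""
       if max_hw_queues < 1:
           raise ValueError("max_hw_queues must be at least 1")
       env = dict(os.environ if base is None else base)
       # Run transfers that share a GPU executor on separate streams.
       env["USE_SINGLE_STREAM"] = "0"
       # Adjust the maximum number of hardware queues.
       env["GPU_MAX_HW_QUEUES"] = str(max_hw_queues)
       return env

The returned dictionary can be passed as the ``env`` argument of ``subprocess.run``, for example ``env=tuned_environment(8)``.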