:description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
:keywords: Using TransferBench, TransferBench Usage, TransferBench How To, API, ROCm, documentation, HIP
.. _using-transferbench:
---------------------
Using TransferBench
---------------------
You can control the SRC and DST memory locations by indicating the memory type followed by the device index. TransferBench supports the following memory types:
* Coarse-grained pinned host
* Unpinned host
* Fine-grained host
* Coarse-grained global device
* Fine-grained global device
* Null (for an empty transfer)
In addition, you can determine the size of the transfer (number of bytes to copy) for the tests.
You can also specify transfer executors. The options are CPU, kernel-based GPU, and SDMA-based GPU (DMA) executors. TransferBench also provides the option to choose the number of Sub-Executors (SE). The number of SEs specifies the number of CPU threads in the case of a CPU executor and the number of compute units (CU) for a GPU executor.
For a DMA executor, the SE argument determines the number of streams to be used.
You can specify the transfers in a configuration file or use preset configurations for transfers.
Specifying transfers in a configuration file
----------------------------------------------
A transfer is defined as a single operation where an executor reads and adds together values from SRC memory locations, followed by writing the sum to the DST memory locations.
This simplifies to a copy operation when using a single SRC or DST.
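The read-and-sum semantics described above can be sketched in a few lines of Python. This is an illustrative model only, not TransferBench code; ``run_transfer`` and the list-based buffers are invented for the example:

```python
# Illustrative model of a transfer: the executor element-wise sums the
# SRC buffers and writes the result to every DST buffer. With a single
# SRC and DST this reduces to a plain copy. Not TransferBench code.

def run_transfer(src_buffers, dst_buffers):
    total = [sum(vals) for vals in zip(*src_buffers)]
    for dst in dst_buffers:
        dst[:] = total

a = [1, 2, 3]
b = [10, 20, 30]
out = [0, 0, 0]
run_transfer([a, b], [out])
print(out)  # [11, 22, 33]
```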
Here's a copy operation from a single SRC to DST:

.. code-block:: bash

   SRC 1 -> Executor -> DST 1
   ...
   SRC X -> Executor -> DST Y
Three executors are supported by TransferBench:
.. code-block:: bash

   Executor:     SubExecutor:
   1. CPU        CPU thread
   2. GPU        GPU threadblock/Compute Unit (CU)
   3. DMA        N/A (can only be used for a single SRC to DST copy)
Each line in the configuration file defines a set of transfers, also known as a test, to run in parallel.
There are two ways to specify a test:

- **Basic**

  The basic specification assumes the same number of SEs used per transfer.
  A positive number of transfers is specified, followed by the number of SEs and triplets describing each transfer:

  .. code-block:: bash

     Transfers SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)

  **Example**:

  .. code-block:: bash

     1 4 (G0->G0->G1)            Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
     1 4 (C1->G2->G0)            Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
     2 4 G0->G0->G1 G1->G1->G0   Copies from GPU0 to GPU1 and from GPU1 to GPU0, each with 4 SEs

- **Advanced**

  A negative number of transfers is specified, followed by quintuplets, each giving the srcMem, Executor, dstMem, number of SEs, and number of bytes for one transfer.
  A non-zero byte count overrides any size provided on the command line.

  **Example**:

  .. code-block:: bash

     -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M)   Copies 1Mb from GPU0 to GPU1 with 4 SEs and 2Mb from GPU1 to GPU0 with 2 SEs
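As a sanity check on the basic format, here is a small Python sketch that parses a basic test line into its triplets. It is illustrative only; ``parse_basic_line`` is an invented helper, not part of TransferBench:

```python
# Minimal, illustrative parser for a "basic" TransferBench test line.
# Not part of TransferBench; it only mirrors the format described above.

def parse_basic_line(line: str):
    # Brackets and arrows are only for human clarity, so strip them first.
    cleaned = line.replace("(", " ").replace(")", " ").replace("->", " ")
    tokens = cleaned.split()
    num_transfers = int(tokens[0])
    if num_transfers <= 0:
        raise ValueError("basic lines start with a positive transfer count")
    num_ses = int(tokens[1])
    triplets = tokens[2:]
    if len(triplets) != 3 * num_transfers:
        raise ValueError("expected one (src, executor, dst) triplet per transfer")
    return [
        {"src": triplets[i], "executor": triplets[i + 1],
         "dst": triplets[i + 2], "sub_executors": num_ses}
        for i in range(0, len(triplets), 3)
    ]

print(parse_basic_line("2 4 G0->G0->G1 G1->G1->G0"))
```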
Here is the list of arguments used to specify transfers in the config file:

.. _config_file_arguments_table:

.. list-table::
   :header-rows: 1

   * - Argument
     - Description
   * - Transfers
     - Number of transfers to be run in parallel
   * - SE
     - Number of SEs to use (CPU threads or GPU threadblocks)
   * - srcMemL
     - | Source memory locations (where the data is read).
       | Memory locations are specified by one or more device characters, each followed by a device index (0-indexed):
       | - C: Pinned host memory (on NUMA node, indexed from 0 to [NUMA nodes - 1])
       | - U: Unpinned host memory (on NUMA node, indexed from 0 to [NUMA nodes - 1])
       | - B: Fine-grain host memory (on NUMA node, indexed from 0 to [NUMA nodes - 1])
       | - G: Global device memory (on GPU device, indexed from 0 to [GPUs - 1])
       | - F: Fine-grain device memory (on GPU device, indexed from 0 to [GPUs - 1])
       | - N: Null memory (index ignored)
   * - Executor
     - | Executor is specified by a character indicating type, followed by the device index (0-indexed):
       | - C: CPU-executed (indexed from 0 to [NUMA nodes - 1])
       | - G: GPU-executed (indexed from 0 to [GPUs - 1])
       | - D: DMA-executed (indexed from 0 to [GPUs - 1])
   * - dstMemL
     - Destination memory locations (where the data is written), specified the same way as srcMemL
   * - bytesL
     - | Number of bytes to copy (use the command-line specified size when 0).
       | Must be a multiple of four and can be suffixed with 'K', 'M', or 'G'.

Round brackets and arrows "->" can be included for human clarity, but are ignored.
Lines starting with # are ignored, while lines starting with ## are echoed to the output.
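The bytesL rules above can be expressed as a short Python helper. This sketch is illustrative rather than TransferBench code, and assumes the usual binary meanings of the suffixes (K = 2**10, M = 2**20, G = 2**30):

```python
# Illustrative validator for a bytesL value, assuming binary suffixes
# (K = 2**10, M = 2**20, G = 2**30). Not part of TransferBench itself.

SUFFIXES = {"K": 2**10, "M": 2**20, "G": 2**30}

def parse_bytes(spec: str) -> int:
    spec = spec.strip().upper()
    multiplier = 1
    if spec and spec[-1] in SUFFIXES:
        multiplier = SUFFIXES[spec[-1]]
        spec = spec[:-1]
    value = int(spec) * multiplier
    if value % 4 != 0:
        raise ValueError("byte counts must be a multiple of four")
    return value  # 0 means "use the command-line specified size"

print(parse_bytes("1M"))  # 1048576
```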
**Transfer examples:**
Single GPU-executed transfer between GPUs 0 and 1 using 4 CUs::

   1 4 (G0->G0->G1)
Single DMA-executed transfer between GPUs 0 and 1::

   1 1 (G0->D0->G1)
Copying 1Mb from GPU0 to GPU1 with 4 CUs, and 2Mb from GPU1 to GPU0 with 8 CUs::

   -2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)
Broadcast from GPU 0 to GPU 0 and GPU 1::

   1 16 (G0->G0->G0G1)
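A destination such as ``G0G1`` packs several memory locations into one token. A small parser for such tokens (again, an invented sketch rather than TransferBench code) might look like:

```python
import re

# Illustrative parser for a memory-location token such as "G0G1" or "C1",
# following the character/index pairs described in the arguments table.
# Not part of TransferBench.

MEM_TYPES = set("CUBGFN")

def parse_mem_locations(token: str):
    pairs = re.findall(r"([A-Z])(\d+)", token)
    locations = [(mem, int(idx)) for mem, idx in pairs]
    if not locations or any(mem not in MEM_TYPES for mem, _ in locations):
        raise ValueError(f"unknown memory type in {token!r}")
    return locations

print(parse_mem_locations("G0G1"))  # [('G', 0), ('G', 1)]
```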
.. note::

   Running TransferBench with no arguments displays usage instructions and detected topology information.
Using preset configurations
------------------------------
Here is the list of preset configurations that can be used instead of configuration files:

.. list-table::
   :header-rows: 1

   * - Configuration
     - Description
   * - ``a2a``
     - All-to-all benchmark test
   * - ``cmdline``
     - Allows transfers to run from the command line instead of a configuration file
   * - ``healthcheck``
     - Simple health check (supported on AMD Instinct MI300 series only)
   * - ``p2p``
     - Peer-to-peer benchmark test
   * - ``pcopy``
     - Benchmark parallel copies from a single GPU to other GPUs
   * - ``rsweep``
     - Random sweep across possible sets of transfers
   * - ``rwrite``
     - Benchmark parallel remote writes from a single GPU to other GPUs
   * - ``scaling``
     - GPU Sub-Executor scaling tests
   * - ``schmoo``
     - Local and remote read, write, and copy operations between two GPUs
   * - ``sweep``
     - Sweep across possible sets of transfers
Performance tuning
---------------------
When the same GPU executor is used in multiple simultaneous transfers on separate streams (by setting ``USE_SINGLE_STREAM=0``), the transfers might be serialized due to the limited number of hardware queues available.
To improve performance, adjust the maximum number of hardware queues using ``GPU_MAX_HW_QUEUES``.
:description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
:keywords: TransferBench, API, ROCm, documentation, HIP
****************************
TransferBench documentation
****************************
TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs). A transfer is a single operation where an executor reads and adds values from source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
This simplifies to a copy operation when using a single SRC or DST.
The code is open and hosted at `<https://github.com/ROCm/TransferBench>`_.
:description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
:keywords: Build TransferBench, Install TransferBench, API, ROCm, HIP
.. _install-transferbench:
---------------------------
Installing TransferBench
---------------------------
This topic describes how to build TransferBench.

Prerequisites
---------------

* Install the ROCm stack on the system to obtain the :doc:`HIP runtime <hip:index>`
* Install ``libnuma`` on the system
* `Enable AMD IOMMU <https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#iommu-configuration-systems-with-256-cpu-threads>`_ and set it to passthrough for AMD Instinct cards
Building TransferBench
--------------------------

To build TransferBench using the Makefile, use:
.. code-block:: bash

   make
To build TransferBench using CMake, use:
.. code-block:: bash

   mkdir build
   cd build
   CXX=/opt/rocm/bin/hipcc cmake ..
   make

.. note::

   If ROCm is installed in a folder other than ``/opt/rocm/``, set ``ROCM_PATH`` appropriately.

Building documentation
-----------------------

To build the documentation locally, use the following commands (this assumes the standard ROCm Sphinx setup in the ``docs`` folder):

.. code-block:: bash

   cd docs
   pip3 install -r sphinx/requirements.txt
   python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html