Unverified Commit 340244a6 authored by srawat, committed by GitHub

Doc update (#123)

parent 203808ed
# TransferBench

TransferBench is a utility for benchmarking simultaneous copies between user-specified
CPU and GPU devices.

Documentation for TransferBench is available at
[https://rocm.docs.amd.com/projects/TransferBench/en/latest/index.html](https://rocm.docs.amd.com/projects/TransferBench/en/latest/index.html).
## Requirements
* You must have a ROCm stack installed on your system (HIP runtime)
* You must have `libnuma` installed on your system
* AMD IOMMU must be enabled and set to passthrough for AMD Instinct cards
## Documentation
To build documentation locally, use the following code:
```shell
cd docs
pip3 install -r .sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
```
## Building TransferBench
You can build TransferBench using Makefile or CMake.
* Makefile:
```shell
make
```
* CMake:
```shell
mkdir build
cd build
CXX=/opt/rocm/bin/hipcc cmake ..
make
```
If ROCm is not installed in `/opt/rocm/`, you must set `ROCM_PATH` to the correct location.
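For example, with ROCm installed under a custom prefix, a CMake build might look like the following sketch (the install path shown is an assumption):
```shell
# Assumption: ROCm lives in /opt/rocm-6.1.0 instead of /opt/rocm
export ROCM_PATH=/opt/rocm-6.1.0
mkdir build && cd build
CXX=$ROCM_PATH/bin/hipcc cmake ..
make
```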
## NVIDIA platform support
You can build TransferBench to run on NVIDIA platforms via HIP or native NVCC.
Use the following code to build with HIP for NVIDIA (note that you must have a HIP-compatible CUDA
version installed, e.g., CUDA 11.5):
```shell
CUDA_PATH=<path_to_CUDA> HIP_PLATFORM=nvidia make
```
Use the following code to build with native NVCC (builds `TransferBenchCuda`):
```shell
make
```
## Things to note
* Running TransferBench with no arguments displays usage instructions and detected topology
information
* You can use several preset configurations instead of a configuration file (see the example after this list):
* `a2a` : All-to-all benchmark test
  * `cmdline` : Takes Transfers to run from the command line instead of from a file
* `healthcheck` : Simple health check (supported on MI300 series only)
* `p2p` : Peer-to-peer benchmark test
* `pcopy` : Benchmark parallel copies from a single GPU to other GPUs
* `rsweep` : Random sweep across possible sets of transfers
* `rwrite` : Benchmarks parallel remote writes from a single GPU to other GPUs
* `scaling`: GPU subexecutor scaling tests
* `schmoo` : Local/Remote read/write/copy between two GPUs
* `sweep` : Sweep across possible sets of transfers
* When using the same GPU executor in multiple simultaneous transfers on separate streams (USE_SINGLE_STREAM=0),
performance may be serialized due to the maximum number of hardware queues available
* The number of maximum hardware queues can be adjusted via `GPU_MAX_HW_QUEUES`
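For instance, a preset can be invoked directly in place of a configuration file. This is a sketch: it assumes the built binary is named `TransferBench`, takes the preset name as its first argument, and accepts an optional bytes-per-Transfer value as its second argument.
```shell
# Peer-to-peer benchmark with the default transfer size
./TransferBench p2p

# All-to-all benchmark with 64M per Transfer
# (the size argument is assumed to accept the same K/M/G suffixes as the config file)
./TransferBench a2a 64M
```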
--------------------
ConfigFile format
--------------------
A Transfer is defined as a single operation where an executor reads and adds together
values from Source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
This simplifies to a copy operation when dealing with a single SRC/DST.
SRC 0 DST 0
SRC 1 -> Executor -> DST 1
SRC X DST Y
Three Executors are supported by TransferBench::
Executor:     SubExecutor:
1) CPU        CPU thread
2) GPU        GPU threadblock/Compute Unit (CU)
3) DMA        N/A (May only be used for copies with a single SRC/DST)
Each single line in the configuration file defines a set of Transfers (a Test) to run in parallel
There are two ways to specify a test:
1) Basic
The basic specification assumes the same number of SubExecutors (SE) used per Transfer
A positive number of Transfers is specified followed by that number of triplets describing each Transfer
Transfers SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)
2) Advanced
A negative number of Transfers is specified, followed by quintuplets describing each Transfer
A non-zero number of bytes specified will override any provided value
-Transfers (srcMem1->Executor1->dstMem1 SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL SEsL BytesL)
Argument Details::
Transfers: Number of Transfers to be run in parallel
SEs : Number of SubExecutors to use (CPU threads / GPU threadblocks)
srcMemL : Source memory locations (Where the data is to be read from)
Executor : Executor is specified by a character indicating type, followed by device index (0-indexed)
- C: CPU-executed (Indexed from 0 to NUMA nodes - 1)
- G: GPU-executed (Indexed from 0 to GPUs - 1)
- D: DMA-executor (Indexed from 0 to GPUs - 1)
dstMemL : Destination memory locations (Where the data is to be written to)
bytesL : Number of bytes to copy (0 means use command-line specified size)
Must be a multiple of 4 and may be suffixed with ('K','M', or 'G')
Memory locations are specified by one or more (device character / device index) pairs
Character indicating memory type followed by device index (0-indexed)
Supported memory locations are:
- C: Pinned host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- U: Unpinned host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- B: Fine-grain host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- G: Global device memory (on GPU device indexed from 0 to [GPUs - 1])
- F: Fine-grain device memory (on GPU device indexed from 0 to [GPUs - 1])
- N: Null memory (index ignored)
Examples::
1 4 (G0->G0->G1) Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
1 4 (C1->G2->G0) Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
2 4 G0->G0->G1 G1->G1->G0 Copies from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs
-2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M) Copies 1MB from GPU0 to GPU1 with 4 SEs, and 2MB from GPU1 to GPU0 with 2 SEs
Round brackets and arrows '->' may be included for human clarity, but will be ignored and are unnecessary
Lines starting with # will be ignored. Lines starting with ## will be echoed to output
Single GPU-executed Transfer between GPUs 0 and 1 using 4 CUs::
1 4 (G0->G0->G1)
Single DMA executed Transfer between GPUs 0 and 1::
1 1 (G0->D0->G1)
Copy 1MB from GPU0 to GPU1 with 4 CUs, and 2MB from GPU1 to GPU0 with 8 CUs::
-2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)
"Memset" by GPU 0 to GPU 0 memory::
1 32 (N0->G0->G0)
"Read-only" by CPU 0::
1 4 (C0->C0->N0)
Broadcast from GPU 0 to GPU 0 and GPU 1::
1 16 (G0->G0->G0G1)
.. _Examples:
--------------------
Examples
--------------------
.. toctree::
   :maxdepth: 3
   :caption: Contents:

   configfile_format
.. meta::
   :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
   :keywords: Using TransferBench, TransferBench Usage, TransferBench How To, API, ROCm, documentation, HIP

.. _using-transferbench:

---------------------
Using TransferBench
---------------------
You can control the SRC and DST memory locations by indicating the memory type followed by the device index. TransferBench supports the following memory types:

* Coarse-grained pinned host
* Unpinned host
* Fine-grained host
* Coarse-grained global device
* Fine-grained global device
* Null (for an empty transfer)

In addition, you can determine the size of the transfer (number of bytes to copy) for the tests.

You can also specify transfer executors. The options are CPU, kernel-based GPU, and SDMA-based GPU (DMA) executors. TransferBench also provides the option to choose the number of Sub-Executors (SE). The number of SEs specifies the number of CPU threads in the case of a CPU executor and the number of compute units (CU) for a GPU executor.
For a DMA executor, the SE argument determines the number of streams to be used.

You can specify the transfers in a configuration file or use preset configurations for transfers.
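As a quick illustration of how the memory types, executors, and SE counts combine, the sketch below uses the configuration syntax described in the next section; the memory-type and executor characters are the ones documented there.

.. code-block:: bash

   1 8 (C0->G0->F1)   8 CUs on GPU 0 read pinned host memory on NUMA node 0 and write fine-grained memory on GPU 1
   1 4 (G0->C0->C1)   4 CPU threads on NUMA node 0 copy from GPU 0 memory to pinned host memory on NUMA node 1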
Specifying transfers in a configuration file
----------------------------------------------
A transfer is defined as a single operation where an executor reads and adds together values from SRC memory locations, followed by writing the sum to the DST memory locations.
This simplifies to a copy operation when using a single SRC or DST.

Here's a copy operation from a single SRC to DST:

.. code-block:: bash

   SRC 0                    DST 0
   SRC 1  ->  Executor  ->  DST 1
   SRC X                    DST Y
Three executors are supported by TransferBench:

.. code-block:: bash

   Executor:      SubExecutor:
   1. CPU         CPU thread
   2. GPU         GPU threadblock/Compute Unit (CU)
   3. DMA         N/A (Can only be used for a single SRC to DST copy)

Each line in the configuration file defines a set of transfers, also known as a test, to run in parallel.
There are two ways to specify a test:

- **Basic**

  The basic specification assumes the same number of SEs used per transfer.
  A positive number of transfers is specified, followed by the number of SEs and triplets describing each transfer:

  .. code-block:: bash

     Transfers SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)

  The arguments used to specify transfers in the config file are described in the :ref:`arguments table <config_file_arguments_table>`.

  **Example**:

  .. code-block:: bash

     1 4 (G0->G0->G1)            Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
     1 4 (C1->G2->G0)            Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
     2 4 G0->G0->G1 G1->G1->G0   Copies from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs
- **Advanced**

  In the advanced specification, a negative number of transfers is specified, followed by quintuplets describing each transfer.
  Specifying a non-zero number of bytes overrides any provided value.

  .. code-block:: bash

     -Transfers (srcMem1->Executor1->dstMem1 SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL SEsL BytesL)

  The arguments used to specify transfers in the config file are described in the :ref:`arguments table <config_file_arguments_table>`.

  **Example**:

  .. code-block:: bash

     -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M)   Copies 1MB from GPU0 to GPU1 with 4 SEs and 2MB from GPU1 to GPU0 with 2 SEs
Here is the list of arguments used to specify transfers in the config file:

.. _config_file_arguments_table:

.. list-table::
   :header-rows: 1

   * - Argument
     - Description
   * - Transfers
     - Number of transfers to be run in parallel
   * - SE
     - Number of SEs to use (CPU threads or GPU threadblocks)
   * - srcMemL
     - Source memory locations (where the data is read)
   * - Executor
     - | Executor is specified by a character indicating type, followed by the device index (0-indexed):
       | - C: CPU-executed (indexed from 0 to NUMA nodes - 1)
       | - G: GPU-executed (indexed from 0 to GPUs - 1)
       | - D: DMA-executor (indexed from 0 to GPUs - 1)
   * - dstMemL
     - Destination memory locations (where the data is written)
   * - bytesL
     - | Number of bytes to copy (use command-line specified size when 0).
       | Must be a multiple of four and can be suffixed with 'K', 'M', or 'G'.

Memory locations are specified by one or more device character and device index pairs. The character indicates the memory type and is followed by the device index (0-indexed):

- C: Pinned host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- U: Unpinned host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- B: Fine-grain host memory (on NUMA node, indexed from 0 to [NUMA nodes-1])
- G: Global device memory (on GPU device, indexed from 0 to [GPUs - 1])
- F: Fine-grain device memory (on GPU device, indexed from 0 to [GPUs - 1])
- N: Null memory (index ignored)

Round brackets and arrows "->" can be included for human clarity, but will be ignored.
Lines starting with # are ignored while lines starting with ## are echoed to the output.
**Transfer examples:**

Single GPU-executed transfer between GPU 0 and 1 using 4 CUs::

   1 4 (G0->G0->G1)

Single DMA-executed transfer between GPU 0 and 1::

   1 1 (G0->D0->G1)

Copying 1MB from GPU 0 to GPU 1 with 4 CUs, and 2MB from GPU 1 to GPU 0 with 8 CUs::

   -2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)
Broadcast from GPU 0 to GPU 0 and GPU 1::

   1 16 (G0->G0->G0G1)
.. note::

   Running TransferBench with no arguments displays usage instructions and detected topology information.
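Transfer lines like the ones above are placed in a configuration file that is passed to TransferBench. The following sketch assumes the executable is named ``TransferBench``, takes the configuration file as its first argument, and accepts an optional bytes-per-transfer value as its second argument; the file name is a placeholder.

.. code-block:: bash

   # Placeholder configuration file with one GPU-executed transfer
   echo "1 4 (G0->G0->G1)" > transfers.cfg

   # Run with the default transfer size
   ./TransferBench transfers.cfg

   # Override the bytes per transfer (size assumed to accept K/M/G suffixes, as in the config file)
   ./TransferBench transfers.cfg 64M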
Using preset configurations
------------------------------
Here is the list of preset configurations that can be used instead of configuration files:
.. list-table::
   :header-rows: 1

   * - Configuration
     - Description
   * - ``a2a``
     - All-to-all benchmark test
   * - ``cmdline``
     - Allows transfers to run from the command line instead of a configuration file
   * - ``healthcheck``
     - Simple health check (supported on AMD Instinct MI300 series only)
   * - ``p2p``
     - Peer-to-peer benchmark test
   * - ``pcopy``
     - Benchmark parallel copies from a single GPU to other GPUs
   * - ``rsweep``
     - Random sweep across possible sets of transfers
   * - ``rwrite``
     - Benchmark parallel remote writes from a single GPU to other GPUs
   * - ``scaling``
     - GPU subexecutor scaling tests
   * - ``schmoo``
     - Local or remote read, write, and copy operations between two GPUs
   * - ``sweep``
     - Sweep across possible sets of transfers
Performance tuning
---------------------
When you use the same GPU executor in multiple simultaneous transfers on separate streams by setting ``USE_SINGLE_STREAM=0``, the performance might be serialized due to the maximum number of hardware queues available.
To improve the performance, adjust the number of maximum hardware queues using ``GPU_MAX_HW_QUEUES``.
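For example, a run that uses separate streams and a larger hardware queue limit might be launched as follows; this is a sketch in which the environment variable names come from this section and the configuration file name is a placeholder.

.. code-block:: bash

   USE_SINGLE_STREAM=0 GPU_MAX_HW_QUEUES=8 ./TransferBench transfers.cfg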
.. meta::
:description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
:keywords: TransferBench, API, ROCm, documentation, HIP
****************************
TransferBench documentation
****************************

TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs). A transfer is a single operation where an executor reads and adds values from source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
This simplifies to a copy operation when dealing with a single SRC or DST.

The code is open and hosted at `<https://github.com/ROCm/TransferBench>`_.
.. grid:: 2
   :gutter: 3

   .. grid-item-card:: Install

      * :ref:`install-transferbench`

   .. grid-item-card:: API reference

      * :ref:`transferbench-api`

   .. grid-item-card:: How to

      * :ref:`using-transferbench`
To contribute to the documentation, refer to
`Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.

You can find licensing information on the
`Licensing <https://rocm.docs.amd.com/en/latest/about/license.html>`_ page.
.. meta::
   :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
   :keywords: Build TransferBench, Install TransferBench, API, ROCm, HIP

.. _install-transferbench:

---------------------------
Installing TransferBench
---------------------------

This topic describes how to build TransferBench.

Prerequisites
---------------

* Install ROCm stack on the system to obtain :doc:`HIP runtime <hip:index>`
* Install ``libnuma`` on the system (see the example after this list)
* `Enable AMD IOMMU <https://rocm.docs.amd.com/en/latest/how-to/system-optimization/mi300x.html#iommu-configuration-systems-with-256-cpu-threads>`_ and set to passthrough for AMD Instinct cards
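For instance, on a Debian or Ubuntu based system, ``libnuma`` can typically be installed through the package manager; this is a sketch, and the package name is an assumption about the target distribution.

.. code-block:: bash

   # Debian/Ubuntu package name assumed to be libnuma-dev
   sudo apt-get install libnuma-dev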
Building TransferBench
------------------------

To build TransferBench using Makefile, use:

.. code-block:: bash

   make

To build TransferBench using CMake, use:

.. code-block:: bash

   mkdir build
   cd build
   CXX=/opt/rocm/bin/hipcc cmake ..
   make

.. note::

   If ROCm is installed in a folder other than ``/opt/rocm/``, set ``ROCM_PATH`` appropriately.
Building documentation
-----------------------

To build documentation locally, use:

.. code-block:: bash

   cd docs
   pip3 install -r .sphinx/requirements.txt
   python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
NVIDIA platform support
--------------------------

You can build TransferBench to run on NVIDIA platforms using the native NVIDIA CUDA Compiler Driver (NVCC).

To build with native NVCC, use:

.. code-block:: bash

   make

TransferBench looks for NVCC in ``/usr/local/cuda`` by default. To modify the location of NVCC, use the environment variable ``CUDA_PATH``:

.. code-block:: bash

   CUDA_PATH=/usr/local/cuda make
.. meta::
   :description: TransferBench is a utility to benchmark simultaneous transfers between user-specified devices (CPUs or GPUs)
   :keywords: TransferBench API, TransferBench library, documentation, HIP

.. _transferbench-api:

--------------------------
TransferBench API library
--------------------------

.. doxygenindex::
subtrees:
  - caption: Install
    entries:
    - file: install/install.rst
  - caption: API reference
    entries:
    - file: reference/api.rst
  - caption: How to
    entries:
    - file: how to/use-transferbench.rst
  - caption: About
    entries: