Transfer benchdocs reorg (#103)

9b7f0b75 · Roopa Malavally · GitHub · 3ee75292 · 9b7f0b75 · 9b7f0b75
Unverified Commit 9b7f0b75 authored May 22, 2024 by Roopa Malavally Committed by GitHub May 22, 2024
6 changed files
--- a/docs/examples/configfile_format.rst
+++ b/docs/examples/configfile_format.rst
 --------------------
-ConfigFile Format
+ConfigFile format
 --------------------
-A Transfer is defined as a single operation where an Executor reads and adds together
+A Transfer is defined as a single operation where an executor reads and adds together
 values from Source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
-This simplifies to a simple copy operation when dealing with single SRC/DST.::
+This simplifies to a simple copy operation when dealing with single SRC/DST.
   SRC 0                DST 0
   SRC 1 -> Executor -> DST 1
@@ -19,7 +19,7 @@ Three Executors are supported by TransferBench::
 Each single line in the configuration file defines a set of Transfers (a Test) to run in parallel
-There are two ways to specify a Test:
+There are two ways to specify a test:
 1) Basic

--- a/docs/how to/use-transferbench.rst
+++ b/docs/how to/use-transferbench.rst
+.. meta::
+  :description: TransferBench documentation 
+  :keywords: TransferBench, API, ROCm, documentation, HIP
+Using TransferBench
+---------------------
+Users have control over the SRC and DST memory locations by indicating memory type followed by the device index. TransferBench supports the following:
+* coarse-grained pinned host memory
+* unpinned host memory
+* fine-grained host memory
+* coarse-grained global device memory
+* fine-grained global device memory
+* null memory (for an empty transfer).
+In addition, users can determine the size of the transfer (number of bytes to copy) for their tests.
+Users can also specify executors of the transfer. The options are CPU, kernel-based GPU, and SDMA-based GPU (DMA) executors. TransferBench also provides the option to choose the number of sub-executors. In case of a CPU executor this argument specifies the number of CPU threads, while for a GPU executor it defines the number of compute units (CU). If DMA is specified as the executor, the sub-executor argument determines the number of streams to be used.
+Refer to the following example to use TransferBench.
+--------------------
+ConfigFile format
+--------------------
+A Transfer is defined as a single operation where an executor reads and adds together values from Source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
+This simplifies to a simple copy operation when dealing with single SRC/DST.
+.. code-block:: bash
+   SRC 0                DST 0
+   SRC 1 -> Executor -> DST 1
+   SRC X                DST Y
+Three Executors are supported by TransferBench::
+   Executor:        SubExecutor:
+   1) CPU           CPU thread
+   2) GPU           GPU threadblock/Compute Unit (CU)
+   3) DMA           N/A.                                 (May only be used for copies (single SRC/DST)
+Each single line in the configuration file defines a set of Transfers (a Test) to run in parallel
+There are two ways to specify a test:
+1) Basic
+   The basic specification assumes the same number of SubExecutors (SE) used per Transfer
+   A positive number of Transfers is specified followed by that number of triplets describing each Transfer
+   Transfers SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)
+2) Advanced
+   A negative number of Transfers is specified, followed by quintuplets describing each Transfer
+   A non-zero number of bytes specified will override any provided value
+   -Transfers (srcMem1->Executor1->dstMem1 SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL SEsL BytesL)
+Argument Details:::
+   Transfers:   Number of Transfers to be run in parallel
+   SEs      :   Number of SubExectors to use (CPU threads/ GPU threadblocks)
+   srcMemL   :   Source memory locations (Where the data is to be read from)
+   Executor  :   Executor is specified by a character indicating type, followed by device index (0-indexed)
+                  - C: CPU-executed  (Indexed from 0 to NUMA nodes - 1)
+                  - G: GPU-executed  (Indexed from 0 to GPUs - 1)
+                  - D: DMA-executor  (Indexed from 0 to GPUs - 1)
+   dstMemL   :   Destination memory locations (Where the data is to be written to)
+   bytesL    :   Number of bytes to copy (0 means use command-line specified size)
+                  Must be a multiple of 4 and may be suffixed with ('K','M', or 'G')
+                  Memory locations are specified by one or more (device character / device index) pairs
+                  Character indicating memory type followed by device index (0-indexed)
+                  Supported memory locations are:
+                  - C:    Pinned host memory       (on NUMA node, indexed from 0 to [NUMA nodes-1])
+                  - U:    Unpinned host memory     (on NUMA node, indexed from 0 to [NUMA nodes-1])
+                  - B:    Fine-grain host memory   (on NUMA node, indexed from 0 to [NUMA nodes-1])
+                  - G:    Global device memory     (on GPU device indexed from 0 to [GPUs - 1])
+                  - F:    Fine-grain device memory (on GPU device indexed from 0 to [GPUs - 1])
+                  - N:    Null memory              (index ignored)
+Examples:::
+   1 4 (G0->G0->G1)                   Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
+   1 4 (C1->G2->G0)                   Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
+   2 4 G0->G0->G1 G1->G1->G0          Copes from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs
+   -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M) Copies 1Mb from GPU0 to GPU1 with 4 SEs, and 2Mb from GPU1 to GPU0 with 2 SEs
+Round brackets and arrows' ->' may be included for human clarity, but will be ignored and are unnecessary
+Lines starting with will be ignored. Lines starting with will be echoed to output
+Single GPU-executed Transfer between GPUs 0 and 1 using 4 CUs::
+   1 4 (G0->G0->G1)
+Single DMA executed Transfer between GPUs 0 and 1::
+   1 1 (G0->D0->G1)
+Copy 1Mb from GPU0 to GPU1 with 4 CUs, and 2Mb from GPU1 to GPU0 with 8 CUs::
+   -2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)
+"Memset" by GPU 0 to GPU 0 memory::
+   1 32 (N0->G0->G0)
+"Read-only" by CPU 0::
+   1 4 (C0->C0->N0)
+Broadcast from GPU 0 to GPU 0 and GPU 1::
+   1 16 (G0->G0->G0G1)
--- a/docs/index.rst
+++ b/docs/index.rst
-*******************************************
+****************************
-Welcome to TransferBench's documentation!
+TransferBench documentation
-*******************************************
+****************************
-TransferBench is a simple utility capable of benchmarking simultaneous transfers between user-specified devices (CPUs/GPUs).
+TransferBench is a utility capable of benchmarking simultaneous transfers between user-specified devices (CPUs/GPUs). A transfer is a single operation where an executor reads and adds values from source (SRC) memory locations, then writes the sum to destination (DST) memory locations. This simplifies to a simple copy operation when dealing with a single SRC/DST.
-A Transfer is defined as a single operation where an executor reads and adds together values from source (SRC) memory locations, then writes the sum to destination (DST) memory locations. This simplifies to a simple copy operation when dealing with single SRC/DST.
+For more information, see `GitHub. <https://github.com/ROCm/TransferBench>`_
+.. grid:: 2
+  :gutter: 3
+  .. grid-item-card:: Install
+       * :doc:`TransferBench installation <./install/install>`
+  .. grid-item-card:: API reference
+    * :doc:`API library <reference/api>`
+  .. grid-item-card:: How to
+    * :doc:`Use TransferBench <how to/use-transferbench>`
+To contribute to the documentation, refer to
+`Contributing to ROCm <https://rocm.docs.amd.com/en/latest/contribute/contributing.html>`_.
+You can find licensing information on the
+`Licensing <https://rocm.docs.amd.com/en/latest/about/license.html>`_ page.
-The user has control over the SRC and DST memory locations by indicating memory type followed by the device index. TransferBench supports coarse-grained pinned host memory, unpinned host memory, fine-grained host memory, coarse-grained global device memory, fine-grained global device memory, and null memory (for an empty transfer). In addition, the user can determine the size of the transfer (number of bytes to copy) for their tests.
-The executor of the transfer can also be specified by the user. The options are CPU, kernel-based GPU, and SDMA-based GPU (DMA) executors. TransferBench also provides the option to choose the number of sub-executors. In case of a CPU executor this argument specifies the number of CPU threads, while for a GPU executor it defines the number of compute units (CU). If DMA is specified as the executor, the sub-executor argument determines the number of streams to be used.
-For more examples, please refer to :ref:`Examples`
--- a/docs/install/install.rst
+++ b/docs/install/install.rst
+.. meta::
+  :description: TransferBench documentation 
+  :keywords: TransferBench, API, ROCm, HIP
+---------------------------
+TransferBench installation
+---------------------------
+The following software is required to install TransferBench:
+* ROCm stack installed on the system (HIP runtime)
+* `libnuma` installed on the system
+--------------------------
+Building TransferBench
+--------------------------
+To build TransferBench using Makefile, use the following instruction:
+.. code-block:: bash
+            $ make
+To build TransferBench using CMake, use the following commands:
+.. code-block:: bash
+                $ mkdir build
+                $ cd build
+                $ CXX=/opt/rocm/bin/hipcc cmake ..
+                $ make
+.. Note:: 
+If ROCm is installed in a folder other than `/opt/rocm/`, set `ROCM_PATH` appropriately.
+--------------------------
+NVIDIA platform support
+--------------------------
+TransferBench may also be built to run on NVIDIA platforms via HIP, but requires a HIP-compatible CUDA version installed. For example, CUDA 11.5.
+To build on NVIDIA platforms, use the following instruction:
+.. code-block:: bash
+             CUDA_PATH=<path_to_CUDA> HIP_PLATFORM=nvidia make`
--- a/docs/reference/api.rst
+++ b/docs/reference/api.rst
+-----
+API
+-----
+.. doxygenindex::
--- a/docs/sphinx/_toc.yml.in
+++ b/docs/sphinx/_toc.yml.in
-# Anywhere {branch} is used, the branch name will be substituted.
-# These comments will also be removed.
 defaults:
  numbered: False
-  maxdepth: 6
 root: index
 subtrees:
-  - entries:
+- caption: Install
-    - file: instructions
+  entries:
-    - file: examples/index
+  - file: install/install.rst
-    - file: api
+    title: TransferBench installation 
-  - caption: About
-    entries:
+- caption: API reference
-      - file: license
+  entries:
+  - file: reference/api.rst
+    title: API library
+- caption: How to
+  entries:
+  - file: how to/use-transferbench.rst
+    title: Use TransferBench
+- caption: About
+  entries:
+  - file: license.md