Unverified commit c02da24b authored by Sam Wu, committed by GitHub

Add Read the Docs configuration files for building documentation (#15)

* add read the docs configs

* add examples to docs

* format example docs

* formatting for configfile format doc page

* generate doxygen docs
parent b5439548
# documentation artifacts
build/
_build/
_images/
_static/
_templates/
_toc.yml
docBin/
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details
version: 2
sphinx:
  configuration: docs/conf.py

formats: [htmlzip]

python:
  version: "3.8"
  install:
    - requirements: docs/.sphinx/requirements.txt
@@ -7,6 +7,18 @@ TransferBench is a simple utility capable of benchmarking simultaneous copies be
1. ROCm stack installed on the system (HIP runtime)
2. libnuma installed on system
## Documentation
Run the steps below to build documentation locally.
```shell
cd docs
pip3 install -r .sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
```
## Building
To build TransferBench using Makefile:
```shell
$ make
```
# Anywhere {branch} is used, the branch name will be substituted.
# These comments will also be removed.
defaults:
  numbered: False
  maxdepth: 6
root: index
subtrees:
- entries:
  - file: instructions
  - file: examples/index
  - file: api
git+https://github.com/RadeonOpenCompute/rocm-docs-core.git
#
# This file is autogenerated by pip-compile with Python 3.10
# by the following command:
#
# pip-compile requirements.in
#
accessible-pygments==0.0.3
# via pydata-sphinx-theme
alabaster==0.7.13
# via sphinx
asttokens==2.2.1
# via stack-data
attrs==22.2.0
# via
# jsonschema
# jupyter-cache
babel==2.12.1
# via
# pydata-sphinx-theme
# sphinx
backcall==0.2.0
# via ipython
beautifulsoup4==4.12.0
# via pydata-sphinx-theme
breathe==4.34.0
# via rocm-docs-core
certifi==2022.12.7
# via requests
cffi==1.15.1
# via pynacl
charset-normalizer==3.1.0
# via requests
click==8.1.3
# via
# jupyter-cache
# sphinx-external-toc
comm==0.1.2
# via ipykernel
debugpy==1.6.6
# via ipykernel
decorator==5.1.1
# via ipython
deprecated==1.2.13
# via pygithub
docutils==0.16
# via
# breathe
# myst-parser
# pydata-sphinx-theme
# rocm-docs-core
# sphinx
executing==1.2.0
# via stack-data
fastjsonschema==2.16.3
# via nbformat
gitdb==4.0.10
# via gitpython
gitpython==3.1.31
# via rocm-docs-core
greenlet==2.0.2
# via sqlalchemy
idna==3.4
# via requests
imagesize==1.4.1
# via sphinx
importlib-metadata==6.1.0
# via
# jupyter-cache
# myst-nb
importlib-resources==5.10.4
# via rocm-docs-core
ipykernel==6.22.0
# via myst-nb
ipython==8.11.0
# via
# ipykernel
# myst-nb
jedi==0.18.2
# via ipython
jinja2==3.1.2
# via
# myst-parser
# sphinx
jsonschema==4.17.3
# via nbformat
jupyter-cache==0.5.0
# via myst-nb
jupyter-client==8.1.0
# via
# ipykernel
# nbclient
jupyter-core==5.3.0
# via
# ipykernel
# jupyter-client
# nbformat
linkify-it-py==1.0.3
# via myst-parser
markdown-it-py==2.2.0
# via
# mdit-py-plugins
# myst-parser
markupsafe==2.1.2
# via jinja2
matplotlib-inline==0.1.6
# via
# ipykernel
# ipython
mdit-py-plugins==0.3.5
# via myst-parser
mdurl==0.1.2
# via markdown-it-py
myst-nb==0.17.1
# via rocm-docs-core
myst-parser[linkify]==0.18.1
# via
# myst-nb
# rocm-docs-core
nbclient==0.5.13
# via
# jupyter-cache
# myst-nb
nbformat==5.8.0
# via
# jupyter-cache
# myst-nb
# nbclient
nest-asyncio==1.5.6
# via
# ipykernel
# nbclient
packaging==23.0
# via
# ipykernel
# pydata-sphinx-theme
# sphinx
parso==0.8.3
# via jedi
pexpect==4.8.0
# via ipython
pickleshare==0.7.5
# via ipython
platformdirs==3.1.1
# via jupyter-core
prompt-toolkit==3.0.38
# via ipython
psutil==5.9.4
# via ipykernel
ptyprocess==0.7.0
# via pexpect
pure-eval==0.2.2
# via stack-data
pycparser==2.21
# via cffi
pydata-sphinx-theme==0.13.1
# via sphinx-book-theme
pygithub==1.57
# via rocm-docs-core
pygments==2.14.0
# via
# accessible-pygments
# ipython
# pydata-sphinx-theme
# sphinx
pyjwt==2.6.0
# via pygithub
pynacl==1.5.0
# via pygithub
pyrsistent==0.19.3
# via jsonschema
python-dateutil==2.8.2
# via jupyter-client
pyyaml==6.0
# via
# jupyter-cache
# myst-nb
# myst-parser
# sphinx-external-toc
pyzmq==25.0.2
# via
# ipykernel
# jupyter-client
requests==2.28.2
# via
# pygithub
# sphinx
rocm-docs-core @ git+https://github.com/RadeonOpenCompute/rocm-docs-core.git
# via -r requirements.in
six==1.16.0
# via
# asttokens
# python-dateutil
smmap==5.0.0
# via gitdb
snowballstemmer==2.2.0
# via sphinx
soupsieve==2.4
# via beautifulsoup4
sphinx==4.3.1
# via
# breathe
# myst-nb
# myst-parser
# pydata-sphinx-theme
# rocm-docs-core
# sphinx-book-theme
# sphinx-copybutton
# sphinx-design
# sphinx-external-toc
# sphinx-notfound-page
sphinx-book-theme==1.0.0rc2
# via rocm-docs-core
sphinx-copybutton==0.5.1
# via rocm-docs-core
sphinx-design==0.3.0
# via rocm-docs-core
sphinx-external-toc==0.3.1
# via rocm-docs-core
sphinx-notfound-page==0.8.3
# via rocm-docs-core
sphinxcontrib-applehelp==1.0.4
# via sphinx
sphinxcontrib-devhelp==1.0.2
# via sphinx
sphinxcontrib-htmlhelp==2.0.1
# via sphinx
sphinxcontrib-jsmath==1.0.1
# via sphinx
sphinxcontrib-qthelp==1.0.3
# via sphinx
sphinxcontrib-serializinghtml==1.1.5
# via sphinx
sqlalchemy==1.4.47
# via jupyter-cache
stack-data==0.6.2
# via ipython
tabulate==0.9.0
# via jupyter-cache
tornado==6.2
# via
# ipykernel
# jupyter-client
traitlets==5.9.0
# via
# comm
# ipykernel
# ipython
# jupyter-client
# jupyter-core
# matplotlib-inline
# nbclient
# nbformat
typing-extensions==4.5.0
# via
# myst-nb
# myst-parser
uc-micro-py==1.0.1
# via linkify-it-py
urllib3==1.26.15
# via requests
wcwidth==0.2.6
# via prompt-toolkit
wrapt==1.15.0
# via deprecated
zipp==3.15.0
# via importlib-metadata
# The following packages are considered to be unsafe in a requirements file:
# setuptools
-----
API
-----
.. doxygenindex::
# Configuration file for the Sphinx documentation builder.
#
# This file only contains a selection of the most common options. For a full
# list see the documentation:
# https://www.sphinx-doc.org/en/master/usage/configuration.html
from rocm_docs import ROCmDocs
docs_core = ROCmDocs("TransferBench Documentation")
docs_core.setup()
docs_core.run_doxygen()
for sphinx_var in ROCmDocs.SPHINX_VARS:
    globals()[sphinx_var] = getattr(docs_core, sphinx_var)
--------------------
ConfigFile Format
--------------------
A Transfer is defined as a single operation where an Executor reads and adds together
values from Source (SRC) memory locations, then writes the sum to Destination (DST) memory locations.
This reduces to a simple copy operation when there is a single SRC and a single DST::

  SRC 0                DST 0
  SRC 1 -> Executor -> DST 1
  SRC X                DST Y
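For instance (anticipating the configuration syntax described below), a hypothetical Transfer in which
GPU 0 uses 4 CUs to read from GPU 0 and GPU 1 memory and write the sum to GPU 2 (assuming a system with
at least three GPUs) could be written as::

  1 4 (G0G1->G0->G2)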
Three Executors are supported by TransferBench::

  Executor:   SubExecutor:
  1) CPU      CPU thread
  2) GPU      GPU threadblock/Compute Unit (CU)
  3) DMA      N/A (may only be used for copies, i.e. a single SRC/DST)
Each line in the configuration file defines a set of Transfers (a Test) to run in parallel.

There are two ways to specify a Test:

1) Basic

   The basic specification uses the same number of SubExecutors (SE) for every Transfer.
   A positive number of Transfers is specified, followed by that number of triplets describing each Transfer::

     #Transfers #SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)

2) Advanced

   A negative number of Transfers is specified, followed by quintuplets describing each Transfer.
   A non-zero byte count overrides the command-line specified size for that Transfer
   (see the example following this list)::

     -#Transfers (srcMem1->Executor1->dstMem1 #SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL #SEsL BytesL)
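For illustration, the following two hypothetical Tests are equivalent: the first uses the basic form
(two Transfers, each using 4 SEs), while the second spells out the same Transfers in the advanced form,
where a byte count of 0 means the command-line specified size is used::

  2 4 (G0->G0->G1) (G1->G1->G0)
  -2 (G0->G0->G1 4 0) (G1->G1->G0 4 0)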
Argument Details::

  #Transfers : Number of Transfers to be run in parallel
  #SEs       : Number of SubExecutors to use (CPU threads / GPU threadblocks)
  srcMemL    : Source memory locations (where the data is to be read from)
  Executor   : Specified by a character indicating its type, followed by a device index (0-indexed)
               - C: CPU-executed (indexed from 0 to [# NUMA nodes - 1])
               - G: GPU-executed (indexed from 0 to [# GPUs - 1])
               - D: DMA executor (indexed from 0 to [# GPUs - 1])
  dstMemL    : Destination memory locations (where the data is to be written to)
  bytesL     : Number of bytes to copy (0 means use the command-line specified size)
               Must be a multiple of 4 and may be suffixed with 'K', 'M', or 'G'
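As a worked breakdown of a basic Test line (this particular Test also appears in the examples below),
the following runs one Transfer with 4 SubExecutors: ``C1`` is the source (pinned host memory on NUMA
node 1), ``G2`` is the Executor (GPU 2, using the 4 SEs as CUs), and ``G0`` is the destination (global
memory on GPU 0)::

  1 4 (C1->G2->G0)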
Memory locations are specified by one or more (memory character, device index) pairs: a character
indicating the memory type, followed by a 0-indexed device index.

Supported memory locations are:

- C: Pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes - 1])
- U: Unpinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes - 1])
- B: Fine-grain host memory (on NUMA node, indexed from 0 to [# NUMA nodes - 1])
- G: Global device memory (on GPU device, indexed from 0 to [# GPUs - 1])
- F: Fine-grain device memory (on GPU device, indexed from 0 to [# GPUs - 1])
- N: Null memory (index ignored)
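For instance, a hypothetical Transfer in which GPU 1 uses 8 CUs to copy from pinned host memory on NUMA
node 0 into fine-grain device memory on GPU 1 could be written as::

  1 8 (C0->G1->F1)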
Examples::

  1 4 (G0->G0->G1)                      Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
  1 4 (C1->G2->G0)                      Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
  2 4 G0->G0->G1 G1->G1->G0             Copies from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs
  -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M)    Copies 1MB from GPU0 to GPU1 with 4 SEs, and 2MB from GPU1 to GPU0 with 2 SEs

Round brackets and arrows ('->') may be included for readability, but they are ignored and unnecessary.
Lines starting with ``#`` will be ignored. Lines starting with ``##`` will be echoed to the output.
Single GPU-executed Transfer between GPUs 0 and 1 using 4 CUs::

  1 4 (G0->G0->G1)

Single DMA-executed Transfer between GPUs 0 and 1::

  1 1 (G0->D0->G1)

Copy 1MB from GPU0 to GPU1 with 4 CUs, and 2MB from GPU1 to GPU0 with 8 CUs::

  -2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)

"Memset" by GPU 0 to GPU 0 memory::

  1 32 (N0->G0->G0)

"Read-only" by CPU 0::

  1 4 (C0->C0->N0)

Broadcast from GPU 0 to GPU 0 and GPU 1::

  1 16 (G0->G0->G0G1)
.. _Examples:
--------------------
Examples
--------------------
.. toctree::
   :maxdepth: 3
   :caption: Contents:

   configfile_format
@@ -8,4 +8,4 @@ The user has control over the SRC and DST memory locations by indicating memory
The executor of the transfer can also be specified by the user. The options are CPU, kernel-based GPU, and SDMA-based GPU (DMA) executors. TransferBench also provides the option to choose the number of sub-executors. In the case of a CPU executor, this argument specifies the number of CPU threads, while for a GPU executor it defines the number of compute units (CUs). If DMA is specified as the executor, the sub-executor argument determines the number of streams to be used.
For more examples, please refer to the example.cfg file in the examples folder.
For more examples, please refer to :ref:`Examples`.
-------------
Requirements
-------------
1. ROCm stack installed on the system (HIP runtime)
2. libnuma installed on system
-------------
Building
-------------
To build TransferBench using Makefile:
::

  $ make
To build TransferBench using cmake:
::

  $ mkdir build
  $ cd build
  $ CXX=/opt/rocm/bin/hipcc cmake ..
  $ make
If ROCm is installed in a folder other than ``/opt/rocm/``, set ``ROCM_PATH`` appropriately.
--------------------------
NVIDIA platform support
--------------------------
TransferBench may also be built to run on NVIDIA platforms via HIP, but this requires a HIP-compatible CUDA version to be installed (e.g. CUDA 11.5).
To build:
::

  CUDA_PATH=<path_to_CUDA> HIP_PLATFORM=nvidia make
# ConfigFile Format:
# ==================
# A Transfer is defined as a single operation where an Executor reads and adds together
# values from Source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
# This simplifies to a simple copy operation when dealing with single SRC/DST.
#
#       SRC 0                DST 0
#       SRC 1 -> Executor -> DST 1
#       SRC X                DST Y
# Three Executors are supported by TransferBench
#   Executor:   SubExecutor:
#   1) CPU      CPU thread
#   2) GPU      GPU threadblock/Compute Unit (CU)
#   3) DMA      N/A (may only be used for copies, i.e. a single SRC/DST)
# Each line in the configuration file defines a set of Transfers (a Test) to run in parallel
# There are two ways to specify a Test:
# 1) Basic
# The basic specification assumes the same number of SubExecutors (SE) used per Transfer
# A positive number of Transfers is specified followed by that number of triplets describing each Transfer
# #Transfers #SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)
# 2) Advanced
# A negative number of Transfers is specified, followed by quintuplets describing each Transfer
#    A non-zero byte count overrides the command-line specified size for that Transfer
# -#Transfers (srcMem1->Executor1->dstMem1 #SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL #SEsL BytesL)
# Argument Details:
# #Transfers: Number of Transfers to be run in parallel
# #SEs      : Number of SubExecutors to use (CPU threads / GPU threadblocks)
# srcMemL : Source memory locations (Where the data is to be read from)
# Executor : Executor is specified by a character indicating type, followed by device index (0-indexed)
# - C: CPU-executed (Indexed from 0 to # NUMA nodes - 1)
# - G: GPU-executed (Indexed from 0 to # GPUs - 1)
# - D: DMA-executor (Indexed from 0 to # GPUs - 1)
# dstMemL : Destination memory locations (Where the data is to be written to)
# bytesL : Number of bytes to copy (0 means use command-line specified size)
# Must be a multiple of 4 and may be suffixed with ('K','M', or 'G')
#
# Memory locations are specified by one or more (device character / device index) pairs
# Character indicating memory type followed by device index (0-indexed)
# Supported memory locations are:
# - C: Pinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - U: Unpinned host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - B: Fine-grain host memory (on NUMA node, indexed from 0 to [# NUMA nodes-1])
# - G: Global device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - F: Fine-grain device memory (on GPU device indexed from 0 to [# GPUs - 1])
# - N: Null memory (index ignored)
# Examples:
#   1 4 (G0->G0->G1)                      Uses 4 CUs on GPU0 to copy from GPU0 to GPU1
#   1 4 (C1->G2->G0)                      Uses 4 CUs on GPU2 to copy from CPU1 to GPU0
#   2 4 G0->G0->G1 G1->G1->G0             Copies from GPU0 to GPU1, and GPU1 to GPU0, each with 4 SEs
#   -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M)    Copies 1MB from GPU0 to GPU1 with 4 SEs, and 2MB from GPU1 to GPU0 with 2 SEs
# Round brackets and arrows ('->') may be included for readability, but they are ignored and unnecessary
# Lines starting with # will be ignored. Lines starting with ## will be echoed to output
## Single GPU-executed Transfer between GPUs 0 and 1 using 4 CUs
1 4 (G0->G0->G1)
## Single DMA executed Transfer between GPUs 0 and 1
1 1 (G0->D0->G1)
## Copy 1MB from GPU0 to GPU1 with 4 CUs, and 2MB from GPU1 to GPU0 with 8 CUs
-2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)
## "Memset" by GPU 0 to GPU 0 memory
1 32 (N0->G0->G0)
## "Read-only" by CPU 0
1 4 (C0->C0->N0)
## Broadcast from GPU 0 to GPU 0 and GPU 1
1 16 (G0->G0->G0G1)