Unverified commit dda6ebe5, authored Nov 08, 2023 by Lisa, committed by GitHub on Nov 08, 2023

Merge pull request #62 from LisaDelaney/readme-updates

changelog updates

Parents: 004710fb 97b5e7fc
Showing 2 changed files with 404 additions and 250 deletions:

* CHANGELOG.md (+354 -213)
* README.md (+50 -37)
CHANGELOG.md (view file @ dda6ebe5)
# Changelog for TransferBench

Documentation for TransferBench is available at
[https://rocm.docs.amd.com/projects/TransferBench](https://rocm.docs.amd.com/projects/TransferBench).
## v1.34

### Additions

* Set `GPU_KERNEL=3` as the default for gfx942
## v1.33

### Additions

* Added the `ALWAYS_VALIDATE` environment variable to allow for validation after every iteration, instead of only once at the end of all iterations (see the sketch below)
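As a hedged sketch of the new flag (the `p2p` preset and the 64M transfer size are borrowed from examples elsewhere in this document, not from this entry):

```shell
# Illustrative only: validate results after every iteration rather than once at the end
ALWAYS_VALIDATE=1 ./TransferBench p2p 64M
```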
## v1.32

### Changes

* Increased the line limit from 2048 to 32768
## v1.31

### Changes

* `SHOW_ITERATIONS` now shows XCC:CU instead of just the CU ID
* `SHOW_ITERATIONS` is also printed when `USE_SINGLE_STREAM=1`
## v1.30

### Additions

* `BLOCK_SIZE` has been added to control the threadblock size (must be a multiple of 64, up to 512)
* `BLOCK_ORDER` has been added to control how work is ordered for GFX executors running `USE_SINGLE_STREAM=1` (see the sketch after this list)
  * 0 - Threadblocks for transfers are ordered sequentially (default)
  * 1 - Threadblocks for transfers are interleaved
  * 2 - Threadblocks for transfers are ordered randomly
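A minimal sketch combining the two new variables; per the notes above, `BLOCK_ORDER` applies to GFX executors running under `USE_SINGLE_STREAM=1`, and the config file name is a stand-in:

```shell
# Illustrative only: 256-thread blocks (a multiple of 64), with interleaved ordering (1)
USE_SINGLE_STREAM=1 BLOCK_SIZE=256 BLOCK_ORDER=1 ./TransferBench example.cfg 64M
```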
## v1.29

### Additions

* The `a2a` preset config now responds to `USE_REMOTE_READ`

### Fixes

* Fixed a race condition during wall-clock initialization that caused "inf" during single-stream runs
* Fixed CU numbering output after CU masking

### Changes

* The default number of warmups has been reverted to 3
* The default unroll factor for gfx940/941 has been set to 6
## v1.28

### Additions

* Added `A2A_DIRECT`, which only runs all-to-all on directly connected GPUs (now on by default)
* Added average statistics for P2P and A2A benchmarks
* Added `USE_FINE_GRAIN` for the P2P benchmark (see the sketch at the end of this section)
  * With older devices, P2P performance with default coarse-grain device memory stops timing as soon as a request is sent to the data fabric, and not actually when it arrives remotely. This can artificially inflate bandwidth numbers, especially when sending small amounts of data.

### Changes

* Modified P2P output to help distinguish between CPU and GPU devices

### Fixes

* Fixed Makefile target to prevent unnecessary re-compilation
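Given the coarse-grain timing caveat above, a hedged way to exercise the new option with the `p2p` preset might be:

```shell
# Illustrative only: use fine-grain memory so timing reflects remote arrival
USE_FINE_GRAIN=1 ./TransferBench p2p 64M
```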
## v1.27

### Additions

* Added cmdline preset to allow specification of simple tests on the command line (e.g., `./TransferBench cmdline 64M "1 4 G0->G0->G1"`)
* Added the `HIDE_ENV` environment variable, which stops environment variable values from printing
* Added the `CU_MASK` environment variable, which allows you to select the CUs to run on (see the combined sketch at the end of this section)
  * `CU_MASK` is specified in CU indices (0 to #CUs-1), where '-' can be used to denote ranges of values (e.g., `CU_MASK=3-8,16` requests that transfers be run only on CUs 3,4,5,6,7,8,16)
  * Note that this is somewhat experimental and may not work on all hardware
* `SHOW_ITERATIONS` now shows CU usage for that iteration (experimental)

### Changes

* Added extra comments on commonly missing includes with details on how to install them

### Fixes

* CUDA compilation works again (the `wall_clock64` CUDA alias was not defined)
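Combining the entries above into a single hedged invocation (the Transfer string is taken from the cmdline example; the CU list is arbitrary):

```shell
# Illustrative only: run the cmdline test on CUs 3-8 and 16,
# suppressing the environment-variable printout
CU_MASK=3-8,16 HIDE_ENV=1 ./TransferBench cmdline 64M "1 4 G0->G0->G1"
```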
## v1.26

### Additions

* Setting `SHOW_ITERATIONS=1` provides additional information about per-iteration timing for file and P2P configs (see the sketch at the end of this section)
  * For file configs, iterations are sorted from min to max bandwidth and displayed with standard deviation
  * For P2P, min/max/standard deviation is shown for each direction

### Changes

* P2P benchmark formatting now reports bidirectional bandwidth in each direction (as well as the sum) for clarity
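A hedged example of requesting the per-iteration breakdown for the P2P config:

```shell
# Illustrative only: show min/max/standard deviation for each direction
SHOW_ITERATIONS=1 ./TransferBench p2p 64M
```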
## v1.25

### Fixes

* Fixed a bug in the P2P bidirectional benchmark that used the incorrect number of subExecutors for CPU <-> GPU tests
## v1.24

### Additions

* New all-to-all GPU benchmark accessed by the preset `a2a`
* Added gfx941 wall clock frequency
## v1.23

### Additions

* New GPU subexec scaling benchmark accessed by the preset `scaling`
  * Tests GPU-GFX copy performance based on the number of CUs used
## v1.22

### Changes

* Switched the kernel timing function to `wall_clock64`
## v1.21

### Fixes

* Fixed a bug with `SAMPLING_FACTOR`
## v1.20

### Fixes

* `VALIDATE_DIRECT` can now be used with `USE_PREP_KERNEL`
* Switched to the local GPU for validating GPU memory
## v1.19

### Additions

* `VALIDATE_DIRECT` now also applies to source memory array checking
* Added a null memory pointer check prior to deallocation
## v1.18

### Additions

* Added the ability to validate GPU destination memory directly without going through the CPU staging buffer (`VALIDATE_DIRECT`; see the sketch at the end of this section)
  * Note that this only works on AMD devices with large-bar access enabled, and may slow things down considerably

### Changes

* Refactored how environment variables are displayed
* Mismatch checking stops after the first detected error within an array instead of listing all mismatched elements
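A minimal sketch, keeping the large-bar caveat above in mind (the config file name is a stand-in):

```shell
# Illustrative only: validate GPU destination memory directly (AMD with large-bar access only)
VALIDATE_DIRECT=1 ./TransferBench example.cfg 64M
```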
## v1.17

### Additions

* Allowed switch to GFX kernel for source array initialization (`USE_PREP_KERNEL`)
  * Note that `USE_PREP_KERNEL` can't be used with `FILL_PATTERN`
* Added the ability to compile with nvcc only (`TransferBenchCuda`)

### Changes

* The default pattern was set to [Element i = ((i * 517) modulo 383 + 31) * (srcBufferIdx + 1)]

### Fixes

* Re-added the `example.cfg` file
## v1.16

### Additions

* Additional src array validation during preparation
* Added a new environment variable (`CONTINUE_ON_ERROR`) to resume tests after a mismatch detection
* Initialized GPU memory to 0 during allocation
## v1.15

### Fixes

* Fixed a bug that prevented single transfers greater than 8 GB

### Changes

* Removed "check for latest ROCm" warning when allocating too much memory
* The source memory value is now also printed when a mismatch is detected
## v1.14

### Additions

* Added documentation
* Added pthread linking in src/Makefile and CMakeLists.txt
* Added printing of the hex value of the floats for output and reference
## v1.13

### Additions

* Added support for cmake

### Changes

* Converted to the Pitchfork layout standard
## v1.12

### Additions

* Added support for TransferBench on NVIDIA platforms (via `HIP_PLATFORM=nvidia`)
  * Note that CPU executors on NVIDIA platforms cannot access GPU memory (no large-bar access)
## v1.11

### Additions

* Added multi-input/multi-output (MIMO) support: transfers can now reduce (element-wise summation) multiple input memory arrays and write the sums to multiple outputs
* Added GPU-DMA executor 'D', which uses `hipMemcpy` for SDMA copies (see the sketch at the end of this section)
  * Previously, this was done using `USE_HIP_CALL`, but now the GPU-GFX kernel can run in parallel with GPU-DMA, instead of applying to all GPU executors globally
  * The GPU-DMA executor can only be used for single-input/single-output transfers
  * The GPU-DMA executor can only be associated with one SubExecutor
* Added new "Null" memory type 'N', which represents empty memory. This allows for read-only or write-only transfers
* Added new `GPU_KERNEL` environment variable that allows switching between various GPU-GFX reduction kernels

### Optimizations

* Improved GPU-GFX kernel performance based on hardware architecture when running with fewer CUs

### Changes

* Updated the `example.cfg` file to cover the new features
* Updated output to support MIMO
* Changed CU and CPU thread naming to SubExecutors for consistency
* Sweep preset: the default sweep preset executors now include DMA
* P2P benchmarks:
  * Removed `p2p_rr`, `g2g`, and `g2g_rr` (the benchmark now only works via `p2p`)
  * Setting `NUM_CPU_DEVICES=0` can be used to benchmark only GPU devices (like `g2g`)
  * The new `USE_REMOTE_READ` environment variable replaces the `_rr` presets
  * The new `USE_GPU_DMA=1` environment variable replaces `USE_HIP_CALL=1` for benchmarking with the GPU-DMA executor
  * The number of GPU SubExecutors for the benchmark can be specified via `NUM_GPU_SE` (defaults to all CUs for GPU-GFX, 1 for GPU-DMA)
  * The number of CPU SubExecutors for the benchmark can be specified via `NUM_CPU_SE`
* The pseudo-random input pattern has been slightly adjusted to have different patterns for each input array within the same transfer

### Removals

* `USE_HIP_CALL`: use GPU-DMA executor 'D' or set `USE_GPU_DMA=1` for P2P benchmark presets
  * Currently, a warning will be issued if `USE_HIP_CALL` is set to 1 and the program will stop
* `NUM_CPU_PER_TRANSFER`: the number of CPU SubExecutors will be whatever is specified for the transfer
* `USE_MEMSET`: this can now be done via a transfer using the null memory type
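To make the new 'D' executor letter concrete, here is a hedged sketch reusing the cmdline Transfer syntax shown under v1.27 above (this changelog is newest-first); the digits follow that example's "1 4 src->exe->dst" pattern, so treat this as an illustration rather than the authoritative grammar:

```shell
# Illustrative only: a single-input/single-output copy driven by the GPU-DMA
# executor 'D', which per the notes above supports exactly one SubExecutor
./TransferBench cmdline 64M "1 1 G0->D0->G1"
```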
## v1.10

### Fixes

* Fixed incorrect bandwidth calculation when using single-stream mode and per-transfer data sizes
## v1.09

### Additions

* Printing of src/dst memory addresses during interactive mode

### Changes

* Switched to `numa_set_preferred` instead of `set_mempolicy`
## v1.08

### Changes

* Fixed handling of non-configured NUMA nodes
* Topology detection now shows actual NUMA node indices
* Fixed issue with `NUM_GPU_DEVICES`
## v1.07

### Fixes

* Fixed a bug with allocations involving non-default CPU memory types
## v1.06

### Additions

* Unpinned CPU memory type ('U'), which may require `HSA_XNACK=1` in order to be accessed via GPU executors
* Added sweep configuration logging to `lastSweep.cfg`
* Ability to specify the number of CUs to use for sweep-based presets

### Changes

* Modified the advanced configuration file format to accept bytes per transfer

### Fixes

* Fixed random sweep repeatability
* Fixed bug with CPU NUMA node memory allocation
## v1.05

### Additions

* Topology output now includes NUMA node information
* Support for NUMA nodes with no CPU cores (e.g., CXL memory)

### Removals

* The `SWEEP_SRC_IS_EXE` environment variable was removed
## v1.04

### Additions

* There are new environment variables for sweep-based presets:
  * `SWEEP_XGMI_MIN`: The minimum number of XGMI hops for transfers
  * `SWEEP_XGMI_MAX`: The maximum number of XGMI hops for transfers
  * `SWEEP_SEED`: The random seed being used
  * `SWEEP_RAND_BYTES`: Use a random amount of bytes (up to the pre-specified N) for each transfer

### Changes

* CSV output for sweep now includes an environment variables section followed by the output
* CSV output no longer lists environment variable parameters in columns
* The default number of warmup iterations changed from 3 to 1
* Split the CSV output of link type into `ExeToSrcLinkType` and `ExeToDstLinkType`
## v1.03

### Additions

* There are new stress-test benchmark preset modes: `sweep` and `randomsweep`
  * `sweep` iterates over all possible sets of transfers to test
  * `randomsweep` iterates over random sets of transfers
* New sweep-only environment variables can modify `sweep` (see the sketch after this list):
  * `SWEEP_SRC`: String containing only "B", "C", "F", or "G" that defines possible source memory types
  * `SWEEP_EXE`: String containing only "C" or "G" that defines possible executors
  * `SWEEP_DST`: String containing only "B", "C", "F", or "G" that defines possible destination memory types
  * `SWEEP_SRC_IS_EXE`: Restricts the executor to be the same as the source, if non-zero
  * `SWEEP_MIN`: Minimum number of parallel transfers to test
  * `SWEEP_MAX`: Maximum number of parallel transfers to test
  * `SWEEP_COUNT`: Maximum number of tests to run
  * `SWEEP_TIME_LIMIT`: Maximum number of seconds to run tests
* New environment variables to restrict the number of available devices to test on (primarily for sweep runs):
  * `NUM_CPU_DEVICES`: Number of CPU devices
  * `NUM_GPU_DEVICES`: Number of GPU devices

### Fixes

* Fixed timing display for CPU executors when using single-stream mode
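A hedged sweep invocation combining several of these variables (all values are arbitrary):

```shell
# Illustrative only: sweep GPU-only transfers, 1 to 4 in parallel,
# capped at 100 tests or 60 seconds, whichever comes first
SWEEP_SRC=G SWEEP_EXE=G SWEEP_DST=G SWEEP_MIN=1 SWEEP_MAX=4 \
SWEEP_COUNT=100 SWEEP_TIME_LIMIT=60 ./TransferBench sweep 64M
```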
## v1.02

### Additions

* Setting `NUM_ITERATIONS` to a negative number indicates a run of -`NUM_ITERATIONS` seconds per test

### Changes

* Copies are now referred to as 'transfers' instead of 'links'
* Reordered how environment variables are displayed (alphabetically now)

### Removals

* Combined timing is now always on for kernel-based GPU copies; the `COMBINED_TIMING` environment variable has been removed
* Single sync is no longer supported, to facilitate variable iterations; the `USE_SINGLE_SYNC` environment variable has been removed
## v1.01

### Additions

* Added the `USE_SINGLE_STREAM` feature
  * All Links that run on the same GPU device are run with a single kernel launch on a single stream
  * This doesn't work with `USE_HIP_CALL`, and it forces `USE_SINGLE_SYNC` to collect timings
* Added the ability to request coherent or fine-grained host memory ('B')

### Changes

* Separated the TransferBench repository from the RCCL repository
* Peer-to-peer benchmark mode now works with `OUTPUT_TO_CSV`
* Topology display now works with `OUTPUT_TO_CSV`
* Moved the documentation about the config file into `example.cfg`

### Removals

* Removed config file generation
* Removed the 'show pointer address' (`SHOW_ADDR`) environment variable
README.md (view file @ dda6ebe5)
# TransferBench

TransferBench is a simple utility capable of benchmarking simultaneous copies between user-specified CPU and GPU devices.

Documentation for TransferBench is available at
[https://rocm.docs.amd.com/projects/TransferBench/en/latest/index.html](https://rocm.docs.amd.com/projects/TransferBench/en/latest/index.html).
## Requirements

* You must have a ROCm stack installed on your system (HIP runtime)
* You must have `libnuma` installed on your system
## Documentation

To build the documentation locally, use the following code:

```shell
cd docs
pip3 install -r .sphinx/requirements.txt
python3 -m sphinx -T -E -b html -d _build/doctrees -D language=en . _build/html
```
## Building TransferBench

You can build TransferBench using Makefile or CMake.

* Makefile:

  ```shell
  make
  ```

* CMake:

  ```shell
  mkdir build
  cd build
  CXX=/opt/rocm/bin/hipcc cmake ..
  make
  ```
If ROCm is not installed in `/opt/rocm/`, you must set `ROCM_PATH` to the correct location.
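For example, with a hypothetical versioned install prefix (the path below is an assumption, not a documented default):

```shell
# Illustrative only: point the build at a non-default ROCm location
ROCM_PATH=/opt/rocm-6.0.0 make
```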
## NVIDIA platform support

You can build TransferBench to run on NVIDIA platforms via HIP or native NVCC.

Use the following code to build with HIP for NVIDIA (note that you must have a HIP-compatible CUDA version installed, e.g., CUDA 11.5):

```shell
CUDA_PATH=<path_to_CUDA> HIP_PLATFORM=nvidia make
```

Use the following code to build with native NVCC (builds `TransferBenchCuda`):

```shell
make
```
## Things to note

* Running TransferBench with no arguments displays usage instructions and detected topology information
* You can use several preset configurations instead of a configuration file:
  * `p2p`: Peer-to-peer benchmark test
  * `sweep`: Sweep across possible sets of transfers
  * `rsweep`: Random sweep across possible sets of transfers
* When using the same GPU executor in multiple simultaneous transfers, performance may be serialized due to the maximum number of hardware queues available
  * The number of maximum hardware queues can be adjusted via `GPU_MAX_HW_QUEUES`
  * Alternatively, running in single-stream mode (`USE_SINGLE_STREAM=1`) may avoid this issue by launching all transfers on a single stream, rather than on individual streams (see the examples after this list)
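As a hedged illustration of the two workarounds above (the `p2p` preset and 64M size are reused from earlier examples; the queue count is arbitrary):

```shell
# Illustrative only: raise the hardware-queue limit for one run...
GPU_MAX_HW_QUEUES=8 ./TransferBench p2p 64M

# ...or launch all transfers on a single stream instead
USE_SINGLE_STREAM=1 ./TransferBench p2p 64M
```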