# Changelog for TransferBench

## v1.16
### Added
- Additional src array validation during preparation
- Added new environment variable CONTINUE_ON_ERROR to resume tests after a mismatch is detected
- Initializing GPU memory to 0 during allocation

## v1.15
### Fixed
- Fixed a bug that prevented single Transfers > 8GB
### Changed
- Removed "check for latest ROCm" warning when allocating too much memory
- The source memory value is now also printed when a mismatch is detected

## v1.14
### Added
- Added documentation
- Added pthread linking in src/Makefile and CMakeLists.txt
- Added printing off the hex value of the floats for output and reference
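
As an illustration of the hex printing entry above, the following is a minimal sketch of how a float's bit pattern can be shown next to its decimal value. `FloatToHex` and `FormatMismatch` are hypothetical names chosen for this example and the message layout is an assumption; neither is taken from the TransferBench sources.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <string>

// Hypothetical helper (not from TransferBench): returns the raw IEEE-754
// bit pattern of a 32-bit float as an 8-digit hexadecimal string.
std::string FloatToHex(float value)
{
  static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
  std::uint32_t bits;
  std::memcpy(&bits, &value, sizeof(bits));  // well-defined byte reinterpretation
  char buffer[16];
  std::snprintf(buffer, sizeof(buffer), "0x%08X", static_cast<unsigned int>(bits));
  return std::string(buffer);
}

// Hypothetical report line showing decimal and hex values for the output
// and the reference. The layout is an assumption made for this sketch.
std::string FormatMismatch(std::size_t index, float output, float reference)
{
  char line[128];
  int written = std::snprintf(line, sizeof(line),
                              "index %zu: output %g (%s), reference %g (%s)",
                              index, output, FloatToHex(output).c_str(),
                              reference, FloatToHex(reference).c_str());
  assert(written > 0 && static_cast<std::size_t>(written) < sizeof(line));
  return std::string(line);
}
```

Showing the bit pattern helps when two values print identically in decimal but differ in their low-order mantissa bits.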

## v1.13
### Added
- Added support for cmake

### Changed
- Converted to the Pitchfork layout standard

## v1.12
### Added
- Added support for TransferBench on NVIDIA platforms (via HIP_PLATFORM=nvidia)
  - CPU executors on NVIDIA platform cannot access GPU memory (no large-bar access)

## v1.11
### Added
- New multi-input / multi-output support (MIMO).  Transfers now can reduce (element-wise summation) multiple input memory arrays
  and write the sums to multiple outputs
- New GPU-DMA executor 'D' (uses hipMemcpy for SDMA copies).  Previously this was done using USE_HIP_CALL, but now this allows
  GPU-GFX kernel to run in parallel with GPU-DMA instead of applying to all GPU executors globally.
  - GPU-DMA executor can only be used for single-input/single-output Transfers
  - GPU-DMA executor can only be associated with one SubExecutor
- Added new "Null" memory type 'N', which represents empty memory. This allows for read-only or write-only Transfers
- Added new GPU_KERNEL environment variable that allows for switching between various GPU-GFX reduction kernels
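
The MIMO behavior described above (element-wise summation of several inputs, written to several outputs) can be sketched as a CPU reference loop. This is only an illustration under stated assumptions: `ReduceTransfer` is a hypothetical name, the code is not the GPU-GFX kernel, and writing zeros when there are no inputs is a choice made for this sketch rather than documented TransferBench behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for a multi-input / multi-output Transfer.
// Every output element receives the element-wise sum of all inputs.
// - No inputs (write-only): this sketch writes 0.0f (an assumption).
// - No outputs (read-only): inputs are read and the sums are discarded.
void ReduceTransfer(const std::vector<const float*>& srcs,
                    const std::vector<float*>& dsts,
                    std::size_t numElements)
{
  for (const float* src : srcs) assert(src != nullptr);
  for (float* dst : dsts) assert(dst != nullptr);

  for (std::size_t i = 0; i < numElements; ++i) {
    float sum = 0.0f;
    for (const float* src : srcs) sum += src[i];  // reduce across inputs
    for (float* dst : dsts) dst[i] = sum;         // broadcast to outputs
  }
}
```

A loop of this kind can also serve as a correctness check: run it on the host and compare each destination buffer against the result.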

### Optimized
- Slightly improved GPU-GFX kernel performance based on hardware architecture when running with fewer CUs

### Changed
- Updated the example.cfg file to cover the new features
- Updated output to support MIMO
- Changed CUs/CPUs threads naming to SubExecutors for consistency
- Sweep Preset:
  - Default sweep preset executors now include DMA
- P2P Benchmarks:
  - Now only works via "p2p".  Removed "p2p_rr", "g2g" and "g2g_rr".
    - Setting NUM_CPU_DEVICES=0 can be used to only benchmark GPU devices (like "g2g")
    - New environment variable USE_REMOTE_READ replaces "_rr" presets
  - New environment variable USE_GPU_DMA=1 replaces USE_HIP_CALL=1 for benchmarking with GPU-DMA Executor
  - Number of GPU SubExecutors for benchmark can be specified via NUM_GPU_SE
    - Defaults to all CUs for GPU-GFX, 1 for GPU-DMA
  - Number of CPU SubExecutors for benchmark can be specified via NUM_CPU_SE
- Pseudo-random input pattern has been slightly adjusted so that each input array within the same Transfer has a different pattern

### Removed
- USE_HIP_CALL has been removed.  Use GPU-DMA executor 'D' or set USE_GPU_DMA=1 for P2P benchmark presets
  - Currently, a warning is issued and the program terminates if USE_HIP_CALL is set to 1
- Removed NUM_CPU_PER_TRANSFER - The number of CPU SubExecutors will be whatever is specified for the Transfer
- Removed USE_MEMSET environment variable.  This can now be done via a Transfer using the null memory type

## v1.10
### Fixed
- Fix incorrect bandwidth calculation when using single stream mode and per-Transfer data sizes

## v1.09
### Added
- Printing off src/dst memory addresses during interactive mode
### Changed
- Switching to numa_set_preferred instead of set_mempolicy

## v1.08
### Changed
- Fixing handling of non-configured NUMA nodes
- Topology detection now shows actual NUMA node indices
- Fix for issue with NUM_GPU_DEVICES

## v1.07
### Changed
- Fix bug with allocations involving non-default CPU memory types

## v1.06
### Added
- Added unpinned CPU memory type ('U').  May require HSA_XNACK=1 in order to access via GPU executors
- Adding logging of sweep configuration to lastSweep.cfg
- Adding ability to specify number of CUs to use for sweep-based presets
### Changed
- Fixing random sweep repeatability
- Fixing bug with CPU NUMA node memory allocation
- Modified advanced configuration file format to accept bytes per Transfer

## v1.05
### Added
- Topology output now includes NUMA node information
- Support for NUMA nodes with no CPU cores (e.g. CXL memory)
### Removed
- SWEEP_SRC_IS_EXE environment variable

## v1.04
### Added
- New environment variables for sweep based presets
  - SWEEP_XGMI_MIN   - Min number of XGMI hops for Transfers
  - SWEEP_XGMI_MAX   - Max number of XGMI hops for Transfers
  - SWEEP_SEED       - Random seed being used
  - SWEEP_RAND_BYTES - Use random amount of bytes (up to pre-specified N) for each Transfer
### Changed
- CSV output for sweep includes env vars section followed by output
- CSV output no longer lists env var parameters in columns
- Default number of warmup iterations changed from 3 to 1
- Splitting CSV output of link type to ExeToSrcLinkType and ExeToDstLinkType

## v1.03
### Added
- New preset modes stress-test benchmarks "sweep" and "randomsweep"
  - sweep iterates over all possible sets of Transfers to test
  - randomsweep iterates over random sets of Transfers
  - New sweep-only environment variables can modify sweep
    - SWEEP_SRC - String containing only "B","C","F", or "G", defining possible source memory types
    - SWEEP_EXE - String containing only "C", or "G", defining possible executors
    - SWEEP_DST - String containing only "B","C","F", or "G", defining possible destination memory types
    - SWEEP_SRC_IS_EXE - Restrict executor to be the same as the source if non-zero
    - SWEEP_MIN - Minimum number of parallel transfers to test
    - SWEEP_MAX - Maximum number of parallel transfers to test
    - SWEEP_COUNT - Maximum number of tests to run
    - SWEEP_TIME_LIMIT - Maximum number of seconds to run tests for
- New environment variables to restrict the number of available devices to test on (primarily for sweep runs)
  - NUM_CPU_DEVICES - Number of CPU devices
  - NUM_GPU_DEVICES - Number of GPU devices
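
The character restrictions on SWEEP_SRC, SWEEP_EXE and SWEEP_DST listed above could be checked with a helper like the one below. `IsValidSweepString` is a hypothetical name, and rejecting an empty string is an assumption of this sketch; how TransferBench itself validates these variables is not stated here.

```cpp
#include <cassert>
#include <string>

// Hypothetical validator: returns true when `value` is non-empty and every
// character of it appears in `allowed` (e.g. "BCFG" for SWEEP_SRC and
// SWEEP_DST, "CG" for SWEEP_EXE, per the v1.03 entries above).
bool IsValidSweepString(const std::string& value, const std::string& allowed)
{
  assert(!allowed.empty());
  if (value.empty()) return false;  // assumption: an empty set is rejected
  return value.find_first_not_of(allowed) == std::string::npos;
}
```

For example, `IsValidSweepString("CG", "BCFG")` accepts a source set limited to the "C" and "G" memory types, while `IsValidSweepString("B", "CG")` rejects "B" as an executor.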
### Changed
- Fixed timing display for CPU-executors when using single stream mode

## v1.02
### Added
- Setting NUM_ITERATIONS to a negative number runs each Test for -NUM_ITERATIONS seconds
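
The sign convention for NUM_ITERATIONS can be sketched as follows: a positive value is an iteration count, and a negative value is a time budget in seconds. `IterationPlan`, `MakeIterationPlan` and `RunTest` are hypothetical names for this example, and treating zero as "no iterations" is an assumption; this is not the TransferBench implementation.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical description of how long a Test should run.
struct IterationPlan {
  bool timed;    // true: run until `seconds` have elapsed
  int  count;    // used when timed == false
  int  seconds;  // used when timed == true
};

// Interprets a NUM_ITERATIONS-style value: negative means seconds.
IterationPlan MakeIterationPlan(int numIterations)
{
  if (numIterations < 0) return {true, 0, -numIterations};
  return {false, numIterations, 0};
}

// Runs `iteration` according to the plan and returns how many times it ran.
int RunTest(const IterationPlan& plan, const std::function<void()>& iteration)
{
  assert(iteration && "iteration callback must be callable");
  using Clock = std::chrono::steady_clock;
  const auto deadline = Clock::now() + std::chrono::seconds(plan.seconds);
  int completed = 0;
  while (plan.timed ? (Clock::now() < deadline) : (completed < plan.count)) {
    iteration();
    ++completed;
  }
  return completed;
}
```

In the timed case the number of completed iterations varies from run to run, so any per-iteration average has to divide by the count returned rather than by a fixed number.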
### Changed
- Copies are now referred to as Transfers instead of Links
- Re-ordering how env vars are displayed (alphabetically now)
### Removed
- Combined timing is now always on for kernel-based GPU copies. COMBINED_TIMING env var has been removed
- Use single sync is no longer supported, to facilitate variable iterations. USE_SINGLE_SYNC env var has been removed

## v1.01
### Added
- Adding USE_SINGLE_STREAM feature
  - All Links that execute on the same GPU device are executed with a single kernel launch on a single stream
  - Does not work with USE_HIP_CALL and forces USE_SINGLE_SYNC to collect timings
- Adding ability to request coherent / fine-grained host memory ('B')
### Changed
- Separating TransferBench from RCCL repo
- Peer-to-peer benchmark mode now works with OUTPUT_TO_CSV
- Topology display now works with OUTPUT_TO_CSV
- Moving documentation about config file into example.cfg
### Removed
- Removed config file generation
- Removed show pointer address environment variable (SHOW_ADDR)