# Changelog for TransferBench

## v1.27
### Added
- Adding cmdline preset to allow specifying simple tests on the command line
  - E.g.: ./TransferBench cmdline 64M "1 4 G0->G0->G1"
- Adding environment variable HIDE_ENV, which skips printing of environment variable values
- Adding environment variable CU_MASK, which allows selection of which CUs to execute on
  - CU_MASK is specified in CU indices (0 to #CUs-1), and '-' can be used to denote ranges of values
  - E.g.: CU_MASK=3-8,16 would request that Transfers be executed only on CUs 3,4,5,6,7,8,16
  - NOTE: This is somewhat experimental and may not work on all hardware
- SHOW_ITERATIONS now shows CU usage for that iteration (experimental)
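
The CU_MASK range syntax described above can be illustrated with a minimal sketch. This is not TransferBench's own parser: the function name `parse_cu_mask` is hypothetical, and the handling of empty fields and reversed ranges is an assumption made for the example.

```python
def parse_cu_mask(mask: str) -> list[int]:
    """Expand a CU_MASK-style string such as "3-8,16" into sorted CU indices.

    Hypothetical helper for illustration only; not part of TransferBench.
    """
    indices = set()
    for part in mask.split(","):
        part = part.strip()
        if not part:
            continue  # assumption: tolerate empty fields such as a trailing comma
        if "-" in part:
            # "first-last" denotes an inclusive range of CU indices
            first, last = (int(x) for x in part.split("-", 1))
            if first > last:
                raise ValueError(f"invalid range: {part}")
            indices.update(range(first, last + 1))
        else:
            indices.add(int(part))
    return sorted(indices)


print(parse_cu_mask("3-8,16"))  # [3, 4, 5, 6, 7, 8, 16]
```

The result matches the CU list given in the CU_MASK=3-8,16 example above.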
### Modified
- Adding extra comments on commonly missing includes with details on how to install them
### Fixed
- CUDA compilation should work again (wall_clock64 CUDA alias was not defined)

## v1.26
### Added
- Setting SHOW_ITERATIONS=1 provides additional information about per-iteration timing for file and p2p configs
  - For file configs, iterations are sorted from min to max bandwidth and displayed with standard deviation
  - For p2p, min/max/standard deviation is shown for each direction.
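
As a rough illustration of the per-iteration summary described above, the sketch below sorts iteration bandwidths from min to max and computes a standard deviation. The function name `summarize_iterations` is hypothetical, and the use of the population standard deviation is an assumption; the changelog does not say which variant TransferBench reports.

```python
import statistics


def summarize_iterations(bandwidths):
    """Rank per-iteration bandwidths and compute min/max/standard deviation.

    Hypothetical helper for illustration; `bandwidths` holds one value per
    iteration (for example in GB/s), in execution order.
    """
    # Pair each bandwidth with its 1-based iteration number, then sort by bandwidth
    ranked = sorted(enumerate(bandwidths, start=1), key=lambda item: item[1])
    return {
        "ranked": ranked,
        "min": ranked[0][1],
        "max": ranked[-1][1],
        # assumption: population standard deviation
        "stddev": statistics.pstdev(bandwidths),
    }


summary = summarize_iterations([10.0, 12.0, 11.0])
print(summary["ranked"])  # [(1, 10.0), (3, 11.0), (2, 12.0)]
```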

### Changed
- P2P benchmark formatting changed.  Now reports bidirectional bandwidth in each direction (as well as sum) for clarity

## v1.25
### Fixed
- Fixed bug in P2P bidirectional benchmark using incorrect number of subExecutors for CPU<->GPU tests

## v1.24
### Added
- New All-To-All GPU benchmark accessed by preset "a2a"
- Adding gfx941 wall clock frequency

## v1.23
### Added
- New GPU subexec scaling benchmark accessed by preset "scaling"
  - Tests GPU-GFX copy performance based on # of CUs used

## v1.22
### Modified
- Switching kernel timing function to wall_clock64

## v1.21
### Fixed
- Fixed bug with SAMPLING_FACTOR

## v1.20
### Fixed
- VALIDATE_DIRECT can now be used with USE_PREP_KERNEL
- Switch to local GPU for validating GPU memory

## v1.19
### Added
- VALIDATE_DIRECT now also applies to source memory array checking
- Adding null memory pointer check prior to deallocation

## v1.18
### Added
- Adding ability to validate GPU destination memory directly without going through CPU staging buffer (VALIDATE_DIRECT)
  - NOTE: This will only work on AMD devices with large-bar access enabled and may slow things down considerably
### Changed
- Refactored how environment variables are displayed
- Mismatch checking now stops after the first detected error within an array instead of listing all mismatched elements

## v1.17
### Added
- Allow switch to GFX kernel for source array initialization (USE_PREP_KERNEL)
  - USE_PREP_KERNEL cannot be used with FILL_PATTERN
- Adding ability to compile with nvcc only (TransferBenchCuda)
### Changed
- Default pattern set to [Element i = ((i * 517) modulo 383 + 31) * (srcBufferIdx + 1)]
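
The default pattern formula above can be written out as a short sketch. The function name `default_pattern_element` is hypothetical, and the sketch assumes that the modulo is applied before the addition of 31 (as C operator precedence would give); it only reproduces the arithmetic and does not model how TransferBench stores the values.

```python
def default_pattern_element(i: int, src_buffer_idx: int) -> int:
    """Value of element i for the given source buffer index.

    Hypothetical helper; assumes modulo binds tighter than addition.
    """
    return ((i * 517) % 383 + 31) * (src_buffer_idx + 1)


# First four elements for source buffers 0 and 1
print([default_pattern_element(i, 0) for i in range(4)])  # [31, 165, 299, 50]
print([default_pattern_element(i, 1) for i in range(4)])  # [62, 330, 598, 100]
```

Because of the (srcBufferIdx + 1) factor, each source buffer receives a scaled copy of the same base sequence.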
### Fixed
- Re-adding example.cfg file

## v1.16
### Added
- Additional src array validation during preparation
- Adding new env var CONTINUE_ON_ERROR to resume tests after mismatch detection
- Initializing GPU memory to 0 during allocation

## v1.15
### Fixed
- Fixed a bug that prevented single Transfers > 8GB
### Changed
- Removed "check for latest ROCm" warning when allocating too much memory
- Printing off source memory value as well when a mismatch is detected

## v1.14
### Added
- Added documentation
- Added pthread linking in src/Makefile and CMakeLists.txt
- Added printing off the hex value of the floats for output and reference

## v1.13
### Added
- Added support for cmake

### Changed
- Converted to the Pitchfork layout standard

## v1.12
### Added
- Added support for TransferBench on NVIDIA platforms (via HIP_PLATFORM=nvidia)
  - CPU executors on NVIDIA platform cannot access GPU memory (no large-bar access)

## v1.11
### Added
- New multi-input / multi-output support (MIMO).  Transfers now can reduce (element-wise summation) multiple input memory arrays
  and write the sums to multiple outputs
- New GPU-DMA executor 'D' (uses hipMemcpy for SDMA copies).  Previously this was done using USE_HIP_CALL, but now this allows
  GPU-GFX kernel to run in parallel with GPU-DMA instead of applying to all GPU executors globally.
  - GPU-DMA executor can only be used for single-input/single-output Transfers
  - GPU-DMA executor can only be associated with one SubExecutor
- Added new "Null" memory type 'N', which represents empty memory. This allows for read-only or write-only Transfers
- Added new GPU_KERNEL environment variable that allows for switching between various GPU-GFX reduction kernels
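
The MIMO reduction described above (element-wise summation of several inputs, with the sums written to several outputs) can be sketched as follows. This is a plain CPU illustration under assumed names (`mimo_transfer` is hypothetical); it does not reflect the GPU-GFX kernels and does not cover the Null memory type.

```python
def mimo_transfer(inputs, num_outputs):
    """Sum the input arrays element-wise and copy the result to each output.

    Hypothetical helper for illustration; `inputs` is a non-empty list of
    equal-length sequences of numbers.
    """
    if not inputs:
        raise ValueError("this sketch requires at least one input array")
    length = len(inputs[0])
    if any(len(arr) != length for arr in inputs):
        raise ValueError("all input arrays must have the same length")
    # Element-wise reduction across all inputs
    sums = [sum(values) for values in zip(*inputs)]
    # Every output receives its own copy of the sums
    return [list(sums) for _ in range(num_outputs)]


outputs = mimo_transfer([[1, 2, 3], [10, 20, 30]], num_outputs=2)
print(outputs)  # [[11, 22, 33], [11, 22, 33]]
```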

### Optimized
- Slightly improved GPU-GFX kernel performance based on hardware architecture when running with fewer CUs

### Changed
- Updated the example.cfg file to cover the new features
- Updated output to support MIMO
- Changed CUs/CPUs threads naming to SubExecutors for consistency
- Sweep Preset:
  - Default sweep preset executors now includes DMA
- P2P Benchmarks:
  - Now only works via "p2p".  Removed "p2p_rr", "g2g" and "g2g_rr".
    - Setting NUM_CPU_DEVICES=0 can be used to only benchmark GPU devices (like "g2g")
    - New environment variable USE_REMOTE_READ replaces "_rr" presets
  - New environment variable USE_GPU_DMA=1 replaces USE_HIP_CALL=1 for benchmarking with GPU-DMA Executor
  - Number of GPU SubExecutors for benchmark can be specified via NUM_GPU_SE
    - Defaults to all CUs for GPU-GFX, 1 for GPU-DMA
  - Number of CPU SubExecutors for benchmark can be specified via NUM_CPU_SE
- Pseudo-random input pattern has been slightly adjusted to have different patterns for each input array within the same Transfer

### Removed
- USE_HIP_CALL has been removed.  Use GPU-DMA executor 'D' or set USE_GPU_DMA=1 for P2P benchmark presets
  - Currently, a warning is issued and the program terminates if USE_HIP_CALL is set to 1
- Removed NUM_CPU_PER_TRANSFER - The number of CPU SubExecutors will be whatever is specified for the Transfer
- Removed USE_MEMSET environment variable.  This can now be done via a Transfer using the null memory type

## v1.10
### Fixed
- Fix incorrect bandwidth calculation when using single stream mode and per-Transfer data sizes

## v1.09
### Added
- Printing off src/dst memory addresses during interactive mode
### Changed
- Switching to numa_set_preferred instead of set_mempolicy

## v1.08
### Changed
- Fixing handling of non-configured NUMA nodes
- Topology detection now shows actual NUMA node indices
- Fix for issue with NUM_GPU_DEVICES

## v1.07
### Changed
- Fix bug with allocations involving non-default CPU memory types

## v1.06
### Added
- Added unpinned CPU memory type ('U').  May require HSA_XNACK=1 in order to access via GPU executors
- Adding logging of sweep configuration to lastSweep.cfg
- Adding ability to specify number of CUs to use for sweep-based presets
### Changed
- Fixing random sweep repeatability
- Fixing bug with CPU NUMA node memory allocation
- Modified advanced configuration file format to accept bytes per Transfer

## v1.05
### Added
- Topology output now includes NUMA node information
- Support for NUMA nodes with no CPU cores (e.g. CXL memory)
### Removed
- SWEEP_SRC_IS_EXE environment variable

## v1.04
### Added
- New environment variables for sweep based presets
  - SWEEP_XGMI_MIN   - Min number of XGMI hops for Transfers
  - SWEEP_XGMI_MAX   - Max number of XGMI hops for Transfers
  - SWEEP_SEED       - Random seed being used
  - SWEEP_RAND_BYTES - Use random amount of bytes (up to pre-specified N) for each Transfer
### Changed
- CSV output for sweep includes env vars section followed by output
- CSV output no longer lists env var parameters in columns
- Default number of warmup iterations changed from 3 to 1
- Splitting CSV output of link type to ExeToSrcLinkType and ExeToDstLinkType

## v1.03
### Added
- New stress-test benchmark preset modes "sweep" and "randomsweep"
  - sweep iterates over all possible sets of Transfers to test
  - randomsweep iterates over random sets of Transfers
  - New sweep-only environment variables can modify sweep
    - SWEEP_SRC - String containing only "B","C","F", or "G", defining possible source memory types
    - SWEEP_EXE - String containing only "C", or "G", defining possible executors
    - SWEEP_DST - String containing only "B","C","F", or "G", defining possible destination memory types
    - SWEEP_SRC_IS_EXE - Restrict executor to be the same as the source if non-zero
    - SWEEP_MIN - Minimum number of parallel transfers to test
    - SWEEP_MAX - Maximum number of parallel transfers to test
    - SWEEP_COUNT - Maximum number of tests to run
    - SWEEP_TIME_LIMIT - Maximum number of seconds to run tests for
- New environment variables to restrict the number of available devices to test on (primarily for sweep runs)
  - NUM_CPU_DEVICES - Number of CPU devices
  - NUM_GPU_DEVICES - Number of GPU devices
### Changed
- Fixed timing display for CPU-executors when using single stream mode

## v1.02
### Added
- Setting NUM_ITERATIONS to a negative number runs each Test for -NUM_ITERATIONS seconds
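
The NUM_ITERATIONS semantics above (a positive value is an iteration count, a negative value is a time budget in seconds) can be sketched as a simple loop. The names `run_test`, `run_once`, and `clock` are hypothetical, and rejecting zero is an assumption; the changelog does not describe what TransferBench does with NUM_ITERATIONS=0.

```python
import time


def run_test(num_iterations, run_once, clock=time.perf_counter):
    """Call `run_once` repeatedly and return the number of iterations executed.

    num_iterations > 0: run exactly that many iterations.
    num_iterations < 0: run until at least -num_iterations seconds have elapsed.
    Hypothetical helper for illustration only.
    """
    if num_iterations == 0:
        raise ValueError("num_iterations must be non-zero in this sketch")
    completed = 0
    start = clock()
    while True:
        run_once()
        completed += 1
        if num_iterations > 0:
            if completed >= num_iterations:
                break
        elif clock() - start >= -num_iterations:
            # Time budget used up; stop after the current iteration
            break
    return completed


print(run_test(3, lambda: None))  # 3
```

In timed mode the iteration count depends on how long each iteration takes, so it varies from run to run.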
### Changed
- Copies are now referred to as Transfers instead of Links
- Re-ordering how env vars are displayed (alphabetically now)
### Removed
- Combined timing is now always on for kernel-based GPU copies. COMBINED_TIMING env var has been removed
- Use single sync is no longer supported, to facilitate variable iterations. USE_SINGLE_SYNC env var has been removed

## v1.01
### Added
- Adding USE_SINGLE_STREAM feature
  - All Links that execute on the same GPU device are executed with a single kernel launch on a single stream
  - Does not work with USE_HIP_CALL and forces USE_SINGLE_SYNC to collect timings
- Adding ability to request coherent / fine-grained host memory ('B')
### Changed
- Separating TransferBench from RCCL repo
- Peer-to-peer benchmark mode now works with OUTPUT_TO_CSV
- Topology display now works with OUTPUT_TO_CSV
- Moving documentation about config file into example.cfg
### Removed
- Removed config file generation
- Removed show pointer address environment variable (SHOW_ADDR)