# Changelog for TransferBench

## v1.11

### Added
- New multi-input / multi-output support (MIMO). Transfers can now reduce (element-wise summation) multiple input memory arrays and write the sums to multiple outputs
- New GPU-DMA executor 'D' (uses hipMemcpy for SDMA copies). Previously this was done using USE_HIP_CALL, which applied to all GPU executors globally; the new executor allows GPU-GFX kernels to run in parallel with GPU-DMA
  - GPU-DMA executor can only be used for single-input/single-output Transfers
  - GPU-DMA executor can only be associated with one SubExecutor
- Added new "Null" memory type 'N', which represents empty memory. This allows for read-only or write-only Transfers
- Added new GPU_KERNEL environment variable that allows for switching between various GPU-GFX reduction kernels

### Optimized
- Slightly improved GPU-GFX kernel performance based on hardware architecture when running with fewer CUs

### Changed
- Updated the example.cfg file to cover the new features
- Updated output to support MIMO
- Changed CUs/CPUs threads naming to SubExecutors for consistency
- Sweep Preset:
  - Default sweep preset executors now include DMA
- P2P Benchmarks:
  - Now only works via "p2p". Removed "p2p_rr", "g2g" and "g2g_rr".
    - Setting NUM_CPU_DEVICES=0 can be used to only benchmark GPU devices (like "g2g")
    - New environment variable USE_REMOTE_READ replaces "_rr" presets
  - New environment variable USE_GPU_DMA=1 replaces USE_HIP_CALL=1 for benchmarking with GPU-DMA Executor
  - Number of GPU SubExecutors for benchmark can be specified via NUM_GPU_SE
    - Defaults to all CUs for GPU-GFX, 1 for GPU-DMA
  - Number of CPU SubExecutors for benchmark can be specified via NUM_CPU_SE
- Pseudo-random input pattern has been slightly adjusted to have different patterns for each input array within the same Transfer

### Removed
- USE_HIP_CALL has been removed.
  Use GPU-DMA executor 'D' or set USE_GPU_DMA=1 for P2P benchmark presets
  - Currently, a warning is issued if USE_HIP_CALL is set to 1, and the program terminates
- Removed NUM_CPU_PER_TRANSFER. The number of CPU SubExecutors will be whatever is specified for the Transfer
- Removed USE_MEMSET environment variable. This can now be done via a Transfer using the null memory type

## v1.10

### Fixed
- Fix incorrect bandwidth calculation when using single stream mode and per-Transfer data sizes

## v1.09

### Added
- Printing off src/dst memory addresses during interactive mode

### Changed
- Switching to numa_set_preferred instead of set_mempolicy

## v1.08

### Changed
- Fixing handling of non-configured NUMA nodes
- Topology detection now shows actual NUMA node indices
- Fix for issue with NUM_GPU_DEVICES

## v1.07

### Changed
- Fix bug with allocations involving non-default CPU memory types

## v1.06

### Added
- Added unpinned CPU memory type ('U'). May require HSA_XNACK=1 in order to access via GPU executors
- Adding logging of sweep configuration to lastSweep.cfg
- Adding ability to specify number of CUs to use for sweep-based presets

### Changed
- Fixing random sweep repeatability
- Fixing bug with CPU NUMA node memory allocation
- Modified advanced configuration file format to accept bytes per Transfer

## v1.05

### Added
- Topology output now includes NUMA node information
- Support for NUMA nodes with no CPU cores (e.g.
  CXL memory)

### Removed
- SWEEP_SRC_IS_EXE environment variable

## v1.04

### Added
- New environment variables for sweep-based presets
  - SWEEP_XGMI_MIN - Min number of XGMI hops for Transfers
  - SWEEP_XGMI_MAX - Max number of XGMI hops for Transfers
  - SWEEP_SEED - Random seed being used
  - SWEEP_RAND_BYTES - Use random amount of bytes (up to pre-specified N) for each Transfer

### Changed
- CSV output for sweep includes env vars section followed by output
- CSV output no longer lists env var parameters in columns
- Default number of warmup iterations changed from 3 to 1
- Splitting CSV output of link type to ExeToSrcLinkType and ExeToDstLinkType

## v1.03

### Added
- New preset modes for stress-test benchmarks: "sweep" and "randomsweep"
  - sweep iterates over all possible sets of Transfers to test
  - randomsweep iterates over random sets of Transfers
  - New sweep-only environment variables can modify sweep
    - SWEEP_SRC - String containing only "B","C","F", or "G", defining possible source memory types
    - SWEEP_EXE - String containing only "C", or "G", defining possible executors
    - SWEEP_DST - String containing only "B","C","F", or "G", defining possible destination memory types
    - SWEEP_SRC_IS_EXE - Restrict executor to be the same as the source if non-zero
    - SWEEP_MIN - Minimum number of parallel transfers to test
    - SWEEP_MAX - Maximum number of parallel transfers to test
    - SWEEP_COUNT - Maximum number of tests to run
    - SWEEP_TIME_LIMIT - Maximum number of seconds to run tests for
- New environment variables to restrict the number of available devices to test on (primarily for sweep runs)
  - NUM_CPU_DEVICES - Number of CPU devices
  - NUM_GPU_DEVICES - Number of GPU devices

### Changed
- Fixed timing display for CPU-executors when using single stream mode

## v1.02

### Added
- Setting NUM_ITERATIONS to a negative number indicates to run for -NUM_ITERATIONS seconds per Test

### Changed
- Copies are now referred to as Transfers instead of Links
- Re-ordering how env vars are displayed
  (alphabetically now)

### Removed
- Combined timing is now always on for kernel-based GPU copies. COMBINED_TIMING env var has been removed
- Use single sync is no longer supported, to facilitate variable iterations. USE_SINGLE_SYNC env var has been removed

## v1.01

### Added
- Adding USE_SINGLE_STREAM feature
  - All Links that execute on the same GPU device are executed with a single kernel launch on a single stream
  - Does not work with USE_HIP_CALL and forces USE_SINGLE_SYNC to collect timings
- Adding ability to request coherent / fine-grained host memory ('B')

### Changed
- Separating TransferBench from RCCL repo
- Peer-to-peer benchmark mode now works with OUTPUT_TO_CSV
- Topology display now works with OUTPUT_TO_CSV
- Moving documentation about config file into example.cfg

### Removed
- Removed config file generation
- Removed show pointer address environment variable (SHOW_ADDR)