# ConfigFile Format:
# ==================
# A Link is defined as a uni-directional transfer from src memory location to dst memory location
# executed by either CPU or GPU
# Each single line in the configuration file defines a set of Links to run in parallel

# There are two ways to specify the configuration file:

# 1) Basic
#    The basic specification assumes the same number of threadblocks/CUs used per GPU-executed Link
#    A positive number of Links is specified, followed by that number of triplets describing each Link
#
#    #Links #CUs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)

# 2) Advanced
#    The advanced specification allows a different number of threadblocks/CUs per GPU-executed Link
#    A negative number of Links is specified, followed by that number of quadruples describing each Link
#
#    -#Links (srcMem1->Executor1->dstMem1 #CUs1) ... (srcMemL->ExecutorL->dstMemL #CUsL)

# Argument Details:
#   #Links  : Number of Links to be run in parallel
#   #CUs    : Number of threadblocks/CUs to use for a GPU-executed Link
#   srcMemL : Source memory location (where the data is to be read from). Ignored in memset mode
#   Executor: Executors are specified by a character indicating executor type, followed by device index (0-indexed)
#             - C: CPU-executed (indexed from 0 to 1)
#             - G: GPU-executed (indexed from 0 to 3)
#   dstMemL : Destination memory location (where the data is to be written to)
#
#   Memory locations are specified by a character indicating memory type,
#   followed by device index (0-indexed)
#   Supported memory locations are:
#   - C: Pinned host memory       (on NUMA node, indexed from 0 to [# NUMA nodes - 1])
#   - B: Fine-grain host memory   (on NUMA node, indexed from 0 to [# NUMA nodes - 1])
#   - G: Global device memory     (on GPU device, indexed from 0 to [# GPUs - 1])
#   - F: Fine-grain device memory (on GPU device, indexed from 0 to [# GPUs - 1])

# Examples:
# 1 4 (G0->G0->G1)              Single link using 4 CUs on GPU0 to copy from GPU0 to GPU1
# 1 4 (C1->G2->G0)              Single link using 4 CUs on GPU2 to copy from CPU1 to GPU0
# 2 4 G0->G0->G1 G1->G1->G0     Runs 2 Links in parallel: GPU0 to GPU1, and GPU1 to GPU0, each with 4 CUs
# -2 (G0 G0 G1 4) (G1 G1 G0 2)  Runs 2 Links in parallel: GPU0 to GPU1 with 4 CUs, and GPU1 to GPU0 with 2 CUs

# Round brackets and arrows '->' may be included for human readability, but are ignored

# Lines starting with # will be ignored. Lines starting with ## will be echoed to output

# Single GPU-executed link between GPUs 0 and 1 using 4 CUs
1 4 (G0->G0->G1)
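
# Additional illustrative example (commented out; assumes the system exposes fine-grain host memory
# on NUMA node 0 and has at least 2 GPUs): a single Link using 8 CUs on GPU1 to copy from
# fine-grain host memory on NUMA node 0 to fine-grain device memory on GPU1
# 1 8 (B0->G1->F1)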