-------------------- ConfigFile Format --------------------

A Transfer is defined as a single operation where an Executor reads and adds together values
from source (SRC) memory locations, then writes the sum to destination (DST) memory locations.
This simplifies to a simple copy operation when dealing with a single SRC/DST::

  SRC 0                 DST 0
  SRC 1 -> Executor ->  DST 1
  SRC X                 DST Y

Three Executors are supported by TransferBench::

  Executor:   SubExecutor:
  1) CPU      CPU thread
  2) GPU      GPU threadblock / Compute Unit (CU)
  3) DMA      N/A (may only be used for copies, i.e. a single SRC/DST)

Each line in the configuration file defines a set of Transfers (a Test) to run in parallel.

There are two ways to specify a Test:

1) Basic
   The basic specification assumes the same number of SubExecutors (SEs) is used per Transfer.
   A positive number of Transfers is specified, followed by that number of triplets describing
   each Transfer::

     Transfers SEs (srcMem1->Executor1->dstMem1) ... (srcMemL->ExecutorL->dstMemL)

2) Advanced
   A negative number of Transfers is specified, followed by quintuplets describing each Transfer.
   A non-zero number of bytes overrides any command-line-provided value::

     -Transfers (srcMem1->Executor1->dstMem1 SEs1 Bytes1) ... (srcMemL->ExecutorL->dstMemL SEsL BytesL)

Argument details::

  Transfers : Number of Transfers to be run in parallel
  SEs       : Number of SubExecutors to use (CPU threads / GPU threadblocks)
  srcMemL   : Source memory locations (where the data is read from)
  Executor  : Specified by a character indicating the Executor type, followed by the device index (0-indexed)
              - C: CPU-executed (indexed from 0 to [NUMA nodes - 1])
              - G: GPU-executed (indexed from 0 to [GPUs - 1])
              - D: DMA-executed (indexed from 0 to [GPUs - 1])
  dstMemL   : Destination memory locations (where the data is written to)
  bytesL    : Number of bytes to copy (0 means use the command-line-specified size)
              Must be a multiple of 4 and may be suffixed with 'K', 'M', or 'G'

Memory locations are specified by one or more (memory type character / device index) pairs:
a character indicating the memory type, followed by the device index (0-indexed).

Supported memory locations are::

  - C: Pinned host memory        (on NUMA node, indexed from 0 to [NUMA nodes - 1])
  - U: Unpinned host memory      (on NUMA node, indexed from 0 to [NUMA nodes - 1])
  - B: Fine-grain host memory    (on NUMA node, indexed from 0 to [NUMA nodes - 1])
  - G: Global device memory      (on GPU device, indexed from 0 to [GPUs - 1])
  - F: Fine-grain device memory  (on GPU device, indexed from 0 to [GPUs - 1])
  - N: Null memory               (index ignored)

Examples::

  1 4 (G0->G0->G1)                     Uses 4 CUs on GPU 0 to copy from GPU 0 to GPU 1
  1 4 (C1->G2->G0)                     Uses 4 CUs on GPU 2 to copy from CPU 1 to GPU 0
  2 4 G0->G0->G1 G1->G1->G0            Copies from GPU 0 to GPU 1, and from GPU 1 to GPU 0, each with 4 SEs
  -2 (G0 G0 G1 4 1M) (G1 G1 G0 2 2M)   Copies 1MB from GPU 0 to GPU 1 with 4 SEs, and 2MB from GPU 1 to GPU 0 with 2 SEs

Round brackets and arrows ('->') may be included for human readability, but are ignored and unnecessary.
Lines starting with # are ignored. Lines starting with ## are echoed to the output.

Single GPU-executed Transfer between GPUs 0 and 1 using 4 CUs::

  1 4 (G0->G0->G1)

Single DMA-executed Transfer between GPUs 0 and 1::

  1 1 (G0->D0->G1)

Copy 1MB from GPU 0 to GPU 1 with 4 CUs, and 2MB from GPU 1 to GPU 0 with 8 CUs::

  -2 (G0->G0->G1 4 1M) (G1->G1->G0 8 2M)

"Memset" by GPU 0 to GPU 0 memory::

  1 32 (N0->G0->G0)

"Read-only" by CPU 0::

  1 4 (C0->C0->N0)

Broadcast from GPU 0 to GPU 0 and GPU 1::

  1 16 (G0->G0->G0G1)
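
Putting these pieces together, below is a minimal sketch of a complete configuration file.
It only combines constructs described above; the echoed label text is arbitrary, and the
device indices assume a system with at least two GPUs and one NUMA node::

  ## Sample TransferBench config
  # Basic Test: two GPU-executed copies with 4 SEs each; size comes from the command line
  2 4 (G0->G0->G1) (G1->G1->G0)

  # Advanced Test: per-Transfer byte counts override the command-line size;
  # the second Transfer is a DMA copy, so it uses a single SRC and a single DST
  -2 (G0->G0->G1 4 1M) (C0->D0->G1 1 2M)

  # Reduction: GPU 0 sums the values read from CPU 0 and GPU 1, then writes the result to GPU 0
  1 8 (C0G1->G0->G0)

Each non-comment line above defines an independent Test whose Transfers run in parallel.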