There are three core concepts: Message, Descriptor, and Allocation.
Message is the core struct for communication. It contains three major fields: metadata (a string), payloads (CPU memory buffers), and tensors (CPU/GPU memory buffers, each carrying its device as an attribute).
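As a rough sketch, assembling a message could look like the following. This assumes the upstream `tensorpipe::Message` layout (a `metadata` string plus `payloads` and `tensors` vectors); exact field names may differ across versions, and `makeMessage` is a hypothetical helper:
```cpp
#include <string>
#include <tensorpipe/tensorpipe.h>

// Sketch of assembling a Message; assumes the upstream layout with
// metadata / payloads / tensors fields. The caller must keep the
// backing strings alive until the write callback fires.
tensorpipe::Message makeMessage(std::string& payloadBuf, std::string& tensorBuf) {
  tensorpipe::Message message;
  message.metadata = "user-defined metadata";

  // A payload is a plain CPU buffer plus its length.
  tensorpipe::Message::Payload payload;
  payload.data = payloadBuf.data();
  payload.length = payloadBuf.size();
  message.payloads.push_back(payload);

  // A tensor carries a device-aware buffer; here a CPU one. A CUDA
  // tensor would use tensorpipe::CudaBuffer instead.
  tensorpipe::Message::Tensor tensor;
  tensor.buffer = tensorpipe::CpuBuffer{tensorBuf.data()};
  tensor.length = tensorBuf.size();
  message.tensors.push_back(tensor);

  return message;
}
```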
Descriptor and Allocation are used on the read side. A typical read operation looks as follows:
```cpp
pipe->readDescriptor(
    [pipe](const Error& error, Descriptor descriptor) {
      // The Descriptor carries the message's metadata, the size of each
      // payload, and the device information of each tensor: everything
      // about the message except the actual buffer contents.
      // Allocate memory accordingly and record the allocated buffers in
      // an Allocation object.
      Allocation allocation;
      // Then call pipe->read to receive the actual buffers into the
      // allocation.
      pipe->read(allocation, [](const Error& error) {});
    });
```
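The allocation step elided above could be fleshed out as below. This is a sketch under the assumption that `Descriptor` announces one `length` per payload and that `Allocation` mirrors it with raw `data` pointers, as in the upstream headers; the `allocateFor` helper and the `buffers` container are hypothetical, and the buffers must outlive the read callback:
```cpp
#include <cstdint>
#include <memory>
#include <vector>
#include <tensorpipe/tensorpipe.h>

// Hypothetical helper: build an Allocation whose payload buffers match
// the sizes announced in the Descriptor. `buffers` owns the memory and
// must stay alive until the read callback has fired.
tensorpipe::Allocation allocateFor(
    const tensorpipe::Descriptor& descriptor,
    std::vector<std::unique_ptr<uint8_t[]>>& buffers) {
  tensorpipe::Allocation allocation;
  for (const auto& payloadDesc : descriptor.payloads) {
    buffers.push_back(std::make_unique<uint8_t[]>(payloadDesc.length));
    tensorpipe::Allocation::Payload payload;
    payload.data = buffers.back().get();
    allocation.payloads.push_back(payload);
  }
  // Tensors are handled the same way, except the memory must be
  // allocated on the device reported by the descriptor (e.g. via
  // cudaMalloc for a CUDA tensor).
  return allocation;
}
```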
Sending a message is much simpler:
```cpp
// Resource cleanup should be handled in the callback
pipe->write(message, callback_fn);
```
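For example, one way to tie buffer lifetime to the callback is to move the owner into the lambda's capture. A sketch, where `BufferHolder` and `makeMessageFrom` are hypothetical stand-ins for whatever owns the message's backing memory:
```cpp
// Sketch: keep the buffers backing `message` alive until the write
// completes by holding their owner in the callback capture.
// `BufferHolder` and `makeMessageFrom` are hypothetical.
auto holder = std::make_shared<BufferHolder>();
tensorpipe::Message message = makeMessageFrom(*holder);

pipe->write(std::move(message), [holder](const tensorpipe::Error& error) {
  // `holder`, and thus the buffers, is released only here, once the
  // pipe no longer needs to read from them.
  if (error) {
    // handle or log the error
  }
});
```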
## Register the underlying communication channel
There are two concepts: transport and channel.
A transport is the basic communication component (e.g., a socket) and only supports CPU buffers.
A channel is a higher abstraction over transports: it can support GPU buffers, or use multiple transports at once to accelerate communication.
TensorPipe will try to set up channels based on priority.
TensorPipe supports several channels, such as CUDA IPC (CUDA communication on the same machine), CMA (shared memory on the same machine), CUDA GDR (InfiniBand with CUDA GPUDirect for GPU buffers), and CUDA Basic (a socket plus a separate thread that copies buffers to/from CUDA memory). Registering backends with priorities is shown in the sketch below.
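A registration sketch, assuming the upstream `Context::registerTransport` / `registerChannel` API, where both endpoints register backends under matching string labels and the highest-priority backend supported by both sides is selected:
```cpp
#include <memory>
#include <tensorpipe/tensorpipe.h>

// Sketch: register transports and channels with priorities. The
// priority values and name labels here are illustrative; the names
// just have to match between the two endpoints.
std::shared_ptr<tensorpipe::Context> makeContext() {
  auto context = std::make_shared<tensorpipe::Context>();

  // Transports: TCP (libuv) everywhere, shared memory on one machine.
  context->registerTransport(0, "uv", tensorpipe::transport::uv::create());
  context->registerTransport(100, "shm", tensorpipe::transport::shm::create());

  // Channels: "basic" works everywhere, "cma" only on the same machine.
  context->registerChannel(0, "basic", tensorpipe::channel::basic::create());
  context->registerChannel(100, "cma", tensorpipe::channel::cma::create());

  return context;
}
```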
Quoting from the TensorPipe documentation:
> Backends come in two flavors:
>
> Transports are the connections used by the pipes to transfer control messages, and the (smallish) core payloads. They are meant to be lightweight and low-latency. The most basic transport is a simple TCP one, which should work in all scenarios. A more optimized one, for example, is based on a ring buffer allocated in shared memory, which two processes on the same machine can use to communicate by performing just a memory copy, without passing through the kernel.
>
> Channels are where the heavy lifting takes place, as they take care of copying the (larger) tensor data. High bandwidths are a requirement. Examples include multiplexing chunks of data across multiple TCP sockets and processes, so to saturate the NIC's bandwidth. Or using a CUDA memcpy call to transfer memory from one GPU to another using NVLink.