// CUDA memcpys issued on a stream are asynchronous, which makes this a bit tricky.
// We can't simply drain the queue as fast as requests arrive, or the stream would become badly backlogged.
// From the moment a transfer is enqueued on the stream until that transfer completes, we need to hold a strong reference to the block being transferred.
// Otherwise, the block may be evicted and overwritten before the transfer is complete.
// To do this, we use a queue to track the blocks currently being offloaded. Once an offload completes (as indicated by a CudaEvent), the reference to its block is dropped.
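//
// A minimal sketch of the bookkeeping described above, not the actual implementation.
// Every name here (`FakeEvent`, `Block`, `PendingOffload`, `OffloadTracker`, `max_in_flight`)
// is hypothetical. `FakeEvent` stands in for a real CUDA event so the sketch runs without a GPU,
// and the in-flight cap is an assumed way of keeping the stream from backing up.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

// Hypothetical stand-in for a CUDA event. A real event would be recorded on the
// stream right after the memcpy is enqueued and then queried for completion.
#[derive(Default)]
struct FakeEvent {
    done: AtomicBool,
}

impl FakeEvent {
    fn is_complete(&self) -> bool {
        self.done.load(Ordering::Acquire)
    }

    // Simulates the GPU reaching the event in the stream.
    fn complete(&self) {
        self.done.store(true, Ordering::Release);
    }
}

// Hypothetical block type; only the id matters for this sketch.
struct Block {
    id: usize,
}

// One in-flight transfer: the strong reference keeps the block alive
// until the paired event reports completion.
struct PendingOffload {
    block: Arc<Block>,
    event: Arc<FakeEvent>,
}

struct OffloadTracker {
    pending: VecDeque<PendingOffload>,
    // Assumed cap on in-flight transfers, so the stream does not get backlogged.
    max_in_flight: usize,
}

impl OffloadTracker {
    fn new(max_in_flight: usize) -> Self {
        Self {
            pending: VecDeque::new(),
            max_in_flight,
        }
    }

    // Whether another transfer may be enqueued without exceeding the cap.
    fn has_capacity(&self) -> bool {
        self.pending.len() < self.max_in_flight
    }

    // Call right after enqueueing the transfer and recording its event.
    fn track(&mut self, block: Arc<Block>, event: Arc<FakeEvent>) {
        self.pending.push_back(PendingOffload { block, event });
    }

    // Drops the references held for completed transfers and returns their block ids.
    // Work on a single stream completes in order, so we stop at the first
    // transfer that is still pending.
    fn reap(&mut self) -> Vec<usize> {
        let mut released = Vec::new();
        while self.pending.front().map_or(false, |p| p.event.is_complete()) {
            if let Some(p) = self.pending.pop_front() {
                released.push(p.block.id);
                // `p` goes out of scope here, dropping the strong reference.
            }
        }
        released
    }
}
```

// In this sketch, the producer would check `has_capacity()` before enqueueing a transfer,
// call `track()` immediately afterwards, and call `reap()` periodically to release finished blocks.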