    Fix data errors and kernel hangs caused by inflight RDMA channel head updates · b65b22ed
    Guangguan authored
    
    
    During dispatch/combine, neither the sender nor the receiver
    waits for the RDMA channel head updates to complete, so head
    update WQEs may still be inflight after the kernel has
    finished. If these inflight WQEs arrive after the RDMA channel
    head buffer has been cleaned for the next round of
    dispatch/combine, the RDMA channel head buffer is rewritten to
    a non-zero value. Because of the wrong RDMA channel head, the
    RDMA sender can reuse the data buffer before the RDMA receivers
    have consumed it, causing data errors and kernel hangs.
    For performance reasons, to overlap the RTT of the inflight
    WQEs, fix this issue by waiting for all previous inflight WQEs
    to complete before cleaning the RDMA buffers in the next round
    of dispatch/combine.
    Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com>
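
The ordering problem described in the commit message can be illustrated with a small host-side model. This is only a sketch under stated assumptions: it is plain C++ rather than the CUDA code in internode.cu, and every name in it (`Wqe`, `Channel`, `post_head_update`, `wait_all_inflight`, `clean`, `head_seen_by_next_round`) is hypothetical and chosen for illustration. It models a head update that is still inflight when the previous round ends, and compares cleaning the head buffer with and without first waiting for that update to complete.

```cpp
#include <cstdint>
#include <deque>

// Hypothetical host-side model; none of these names come from internode.cu.
struct Wqe {
    uint64_t head_value;  // value the remote write will store into the head buffer
};

struct Channel {
    uint64_t head = 0;         // models one slot of the RDMA channel head buffer
    std::deque<Wqe> inflight;  // head updates that were posted but not yet delivered

    // Post a head update; it stays inflight until it is delivered.
    void post_head_update(uint64_t value) { inflight.push_back({value}); }

    // Deliver every pending update. Models waiting for inflight WQEs to complete.
    void wait_all_inflight() {
        while (!inflight.empty()) {
            head = inflight.front().head_value;
            inflight.pop_front();
        }
    }

    // Models the buffer cleaning done at the start of the next round.
    void clean() { head = 0; }
};

// Returns the head value the next round observes after its cleaning step.
uint64_t head_seen_by_next_round(bool wait_before_clean) {
    Channel ch;
    ch.post_head_update(8);  // previous round exits with this update still inflight
    if (wait_before_clean) {
        ch.wait_all_inflight();  // the ordering described in the commit message
    }
    ch.clean();
    ch.wait_all_inflight();  // a late update, if any, lands after the cleaning
    return ch.head;
}
```

In this model, calling `head_seen_by_next_round(false)` leaves a stale non-zero head after cleaning, while `head_seen_by_next_round(true)` leaves the head at zero. Deferring the wait to the start of the next round, instead of blocking at the end of the previous kernel, is what allows the round trip of the inflight updates to overlap with other work.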
internode.cu 99.2 KB