    Add NUM_SRCS/NUM_DSTS template parameters to GpuReduceKernel (#209) · 44140eeb
    Weile authored
      - Instantiates optimized kernels for common Transfer types:
        - Copy (1 src → 1 dst): Optimized single-source data copy
        - Read-only (1 src → 0 dst): Optimized memory read validation
        - Write-only (0 src → 1 dst): Optimized memory write/initialization
      - Because the source and destination counts are known at compile time, the compiler eliminates the dead loops in these specialized cases, improving performance by up to 7% for all-to-all workloads on MI3xx machines
      - Update CHANGELOG
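
The specialization described above can be illustrated with a minimal host-side C++ sketch. It is not the actual GpuReduceKernel code: `ReduceSketch`, `Dispatch`, the `-1` "runtime count" sentinel, and the returned checksum are all hypothetical names and conventions chosen for illustration. The idea it shows is that when the counts are template arguments, the source and destination loops have constant trip counts, so a loop with zero iterations can be dropped by the compiler.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the kernel body (assumption, not TransferBench code).
// A template argument of -1 means "count only known at run time".
template <int NUM_SRCS = -1, int NUM_DSTS = -1>
float ReduceSketch(const std::vector<const float*>& srcs,
                   const std::vector<float*>& dsts,
                   std::size_t numElements)
{
  // Use the compile-time count when provided, else fall back to the runtime size.
  const int numSrcs = (NUM_SRCS >= 0) ? NUM_SRCS : static_cast<int>(srcs.size());
  const int numDsts = (NUM_DSTS >= 0) ? NUM_DSTS : static_cast<int>(dsts.size());
  assert(numSrcs <= static_cast<int>(srcs.size()));
  assert(numDsts <= static_cast<int>(dsts.size()));

  float checksum = 0.0f;  // returned so the read-only case has an observable result
  for (std::size_t i = 0; i < numElements; ++i) {
    float val = 0.0f;
    // Constant trip count when NUM_SRCS >= 0; no iterations at all when it is 0.
    for (int s = 0; s < numSrcs; ++s) val += srcs[s][i];
    // Constant trip count when NUM_DSTS >= 0; no iterations at all when it is 0.
    for (int d = 0; d < numDsts; ++d) dsts[d][i] = val;
    checksum += val;
  }
  return checksum;
}

// Hypothetical dispatcher: pick a specialized instantiation for the common
// shapes and fall back to the generic runtime-count version otherwise.
float Dispatch(const std::vector<const float*>& srcs,
               const std::vector<float*>& dsts,
               std::size_t numElements)
{
  if (srcs.size() == 1 && dsts.size() == 1) return ReduceSketch<1, 1>(srcs, dsts, numElements);  // copy
  if (srcs.size() == 1 && dsts.empty())     return ReduceSketch<1, 0>(srcs, dsts, numElements);  // read-only
  if (srcs.empty() && dsts.size() == 1)     return ReduceSketch<0, 1>(srcs, dsts, numElements);  // write-only
  return ReduceSketch<>(srcs, dsts, numElements);                                                // generic
}
```

In this sketch the write-only case stores zeros because no source contributes a value; a real kernel may write a different fill pattern. The runtime dispatch costs one branch per launch, while the per-element loops inside each specialization carry no runtime count checks.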
TransferBench.hpp 164 KB