Add NUM_SRCS/NUM_DSTS template parameters to GpuReduceKernel (#209)
- Instantiates optimized kernels for common Transfer types:
  - Copy (1 src → 1 dst): optimized single-source data copy
  - Read-only (1 src → 0 dst): optimized memory read validation
  - Write-only (0 src → 1 dst): optimized memory write/initialization
- With the source and destination counts known at compile time, the compiler can eliminate the dead loops in these specialized cases, improving performance by up to 7% for all-to-all workloads on MI3xx machines
- Update CHANGELOG
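
For illustration only, here is a minimal host-side C++ sketch of the idea behind the template parameters. It is not the actual GpuReduceKernel: the function name `ReduceSketch`, its signature, the returned checksum, and the convention that `-1` means "use the runtime count" are all assumptions made for this example. When a count is a compile-time constant, the matching inner loop has a fixed trip count, so an optimizing compiler can unroll it or, for a count of 0, drop it entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side analogue of a templated reduce kernel.
// NUM_SRCS / NUM_DSTS >= 0 fixes the count at compile time;
// -1 (an assumed convention for this sketch) falls back to the runtime count.
template <int NUM_SRCS, int NUM_DSTS>
float ReduceSketch(const std::vector<const float*>& srcs,
                   const std::vector<float*>& dsts,
                   std::size_t n)
{
  const int numSrcs = (NUM_SRCS >= 0) ? NUM_SRCS : static_cast<int>(srcs.size());
  const int numDsts = (NUM_DSTS >= 0) ? NUM_DSTS : static_cast<int>(dsts.size());
  assert(static_cast<int>(srcs.size()) >= numSrcs);
  assert(static_cast<int>(dsts.size()) >= numDsts);

  float checksum = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    float val = 0.0f;
    // With NUM_SRCS == 0 this loop has a constant trip count of zero (dead code).
    for (int s = 0; s < numSrcs; ++s) val += srcs[s][i];
    // With NUM_DSTS == 0 this loop has a constant trip count of zero (dead code).
    for (int d = 0; d < numDsts; ++d) dsts[d][i] = val;
    checksum += val;  // Keeps the read-only case observable in this sketch.
  }
  return checksum;
}
```

In this sketch, `ReduceSketch<1, 1>` corresponds to the copy case, `ReduceSketch<1, 0>` to read-only, `ReduceSketch<0, 1>` to write-only (here it simply writes zeros), and `ReduceSketch<-1, -1>` to the generic path that keeps both loops bounded by runtime values.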