    Optimize sorting kernels and tune block sizes · 7279c539
    Anton Gorenko authored
    * Compile kernels with a max block size of 256 threads:
      since ROCm 4.2, hipcc by default compiles kernels for block sizes
      of up to 1024 threads unless __launch_bounds__ is specified. This
      significantly increases register pressure, especially in heavy
      kernels (double precision, for example), and forces register
      spilling;
    * Optimize computeRange by using multiple blocks for reduction;
    * Use blocks of 1024 threads for computeBucketPositions: it is executed
      as a single work group, so a larger block size is faster;
    * Sort up to lenghtNextPow2 instead of blockDim.x (faster for short
      buckets);
    * Optimize sortShortList2;
    * Optimize sortBuckets with bit instructions;
    * Decrease bucket size for non-uniform sorting: otherwise too many
      buckets may be too large to sort in shared memory;
    * Add more sizes in tests.
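
The "sort up to lenghtNextPow2 instead of blockDim.x" item can be illustrated with a host-side sketch. This is not the kernel code from this commit: `nextPow2` and `bitonicSortShort` are hypothetical names, the bit-smearing round-up is just one common way to use bit operations for this, and the inner loop over `i` stands in for the threads of one block. The point is only that the bitonic network runs over the next power of two above the list length, not over a fixed block size.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical helper: smallest power of two >= n, computed by smearing the
// highest set bit of (n - 1) into all lower bits and adding one.
inline std::uint32_t nextPow2(std::uint32_t n) {
    assert(n <= 0x80000000u);  // larger values would overflow 32 bits
    if (n <= 1) return 1;
    std::uint32_t v = n - 1;
    v |= v >> 1;
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;
    return v + 1;
}

// Bitonic sort of a short list. The network size is nextPow2(length), so a
// list of 5 elements runs an 8-wide network instead of, say, a 256-wide one.
inline void bitonicSortShort(std::vector<float>& data) {
    const std::size_t length = data.size();
    if (length < 2) return;
    const std::size_t padded = nextPow2(static_cast<std::uint32_t>(length));
    // Pad with +infinity so the padding ends up at the tail after sorting.
    data.resize(padded, std::numeric_limits<float>::infinity());
    for (std::size_t k = 2; k <= padded; k *= 2) {
        for (std::size_t j = k / 2; j > 0; j /= 2) {
            // On a GPU each value of i would be handled by one thread,
            // with a barrier between successive values of j.
            for (std::size_t i = 0; i < padded; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    if ((data[i] > data[partner]) == ascending) {
                        std::swap(data[i], data[partner]);
                    }
                }
            }
        }
    }
    data.resize(length);  // drop the padding
}
```

For a bucket of 5 elements this performs 3 merge stages over 8 slots, whereas a network sized to a 256-thread block would perform 8 stages over 256 slots, which is the kind of saving the "faster for short buckets" remark points to.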
TestHipSort.cpp 5.51 KB