Adjusted kernel config for better performance. Removed divergence in the Welford warp reduction.
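
For reviewers unfamiliar with the pattern, here is a minimal sketch of what a divergence-free Welford warp reduction can look like. It is an illustration, not the code in this change: `WelfordState`, `welford_merge`, and `welford_warp_reduce` are hypothetical names, and the sketch assumes all 32 lanes of the warp are active. It uses an XOR butterfly so every lane runs the same instructions, and it clamps the divisor instead of branching on an empty count.

```cuda
#include <cuda_runtime.h>

// Hypothetical per-lane running statistics (not taken from this change).
struct WelfordState {
    float count;  // number of samples seen
    float mean;   // running mean
    float m2;     // sum of squared deviations from the mean
};

// Merge two partial Welford states (parallel variance combination).
// Branch-free: when both counts are zero the divisor is clamped to 1 and
// the correction terms are multiplied by zero, so no `if (n == 0)` is needed.
__device__ __forceinline__ WelfordState welford_merge(WelfordState a, WelfordState b) {
    float n = a.count + b.count;
    float inv_n = 1.0f / fmaxf(n, 1.0f);
    float delta = b.mean - a.mean;

    WelfordState out;
    out.count = n;
    out.mean = a.mean + delta * b.count * inv_n;
    out.m2 = a.m2 + b.m2 + delta * delta * a.count * b.count * inv_n;
    return out;
}

// Butterfly reduction across a full warp. Every lane executes the same
// shuffles and the same merge, so there is no lane-dependent branch.
// Assumes all 32 lanes participate (full mask).
__device__ __forceinline__ WelfordState welford_warp_reduce(WelfordState s) {
    constexpr unsigned kFullMask = 0xffffffffu;
#pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        WelfordState other;
        other.count = __shfl_xor_sync(kFullMask, s.count, offset);
        other.mean  = __shfl_xor_sync(kFullMask, s.mean, offset);
        other.m2    = __shfl_xor_sync(kFullMask, s.m2, offset);
        s = welford_merge(s, other);
    }
    return s;
}
```

After the loop every lane holds the combined state for the warp (identical up to floating-point rounding, since the merge order differs per lane), so the variance can be read as `m2 / count` from any lane without a broadcast step.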