Launch params (512, 1, 1) are larger than launch bounds (256) for kernel _ZN12_GLOBAL__N_113prepareInput2IfNS_9ReduceNilEEEvPT_lT0_iiml please add __launch_bounds__ to kernel define or use --gpu-max-threads-per-block recompile program !
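The message compares the total threads per block requested at launch (512 x 1 x 1 = 512) with the bound the kernel was compiled for (256). The mangled name appears to demangle to `(anonymous namespace)::prepareInput2<float, (anonymous namespace)::ReduceNil>`. Below is a minimal host-side sketch of that comparison in plain C++. The names `Dim3`, `threads_per_block`, and `fits_launch_bounds` are hypothetical helpers for illustration and are not part of any GPU runtime API. The kernel-side fix the message suggests is shown only as a comment, since it needs a GPU compiler.

```cpp
#include <cassert>

// Hypothetical stand-in for the runtime's dim3 type (illustration only).
struct Dim3 {
    unsigned x, y, z;
};

// Total number of threads in one block: the product of the three dimensions.
constexpr unsigned threads_per_block(Dim3 block) {
    return block.x * block.y * block.z;
}

// True when a block of this shape fits within the kernel's compiled bound.
// This mirrors the comparison the message reports; it is an assumption about
// the check, not the runtime's actual implementation.
constexpr bool fits_launch_bounds(Dim3 block, unsigned max_threads) {
    return threads_per_block(block) <= max_threads;
}

// Values taken from the message: launch params (512, 1, 1), bound 256.
static_assert(!fits_launch_bounds(Dim3{512, 1, 1}, 256),
              "512 threads exceed a bound of 256");
static_assert(fits_launch_bounds(Dim3{256, 1, 1}, 256),
              "256 threads fit a bound of 256");

// Kernel-side fix suggested by the message (requires a GPU compiler, so it
// is left as a comment; my_kernel is a placeholder name):
//
//   __global__ void __launch_bounds__(512) my_kernel(float* data) { ... }
```

Either raising the bound to at least 512 (via `__launch_bounds__` on the kernel, or the `--gpu-max-threads-per-block` compiler option the message names) or launching with at most 256 threads per block would make the comparison pass.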