Use ford/for instead of static_ford/static_for in threadwise copy; register spilling is greatly reduced on AMD, for reasons not yet understood
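
For readers unfamiliar with the two loop styles, here is a minimal 1-D sketch of the difference. It is not the repository's code: `static_for_sketch`, `copy_unrolled`, and `copy_looped` are hypothetical names made up for illustration, and the real `ford`/`static_ford` helpers iterate over multi-dimensional index spaces. The compile-time variant instantiates the loop body once per index (forcing full unrolling), while the runtime variant leaves the unrolling decision to the compiler.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a compile-time loop helper: the body f is
// invoked once per index with a distinct std::integral_constant type,
// so every iteration is a separate instantiation (fully unrolled).
template <typename F, std::size_t... Is>
void static_for_impl(F& f, std::index_sequence<Is...>)
{
    // Braced-init-list expansion guarantees left-to-right evaluation.
    int expand[] = {0, (f(std::integral_constant<std::size_t, Is>{}), 0)...};
    (void)expand;
}

template <std::size_t N, typename F>
void static_for_sketch(F f)
{
    static_for_impl(f, std::make_index_sequence<N>{});
}

// Copy using the compile-time loop: each index is a compile-time constant.
template <std::size_t N>
void copy_unrolled(const std::array<float, N>& src, std::array<float, N>& dst)
{
    static_for_sketch<N>([&](auto i) { dst[i] = src[i]; });
}

// Copy using an ordinary runtime loop: the compiler is free to keep it
// rolled, partially unroll it, or fully unroll it.
template <std::size_t N>
void copy_looped(const std::array<float, N>& src, std::array<float, N>& dst)
{
    for (std::size_t i = 0; i < N; ++i)
    {
        dst[i] = src[i];
    }
}
```

Both functions produce the same result; they differ only in how much code the compiler is forced to emit. Whether that difference explains the reduced spilling seen here is an assumption, not something this commit verifies.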