// output merged global tensor descriptor, for calculating origin of thread tensor
// in global memory
// JD: Even though we changed the layout of the output for wrw, we don't need to
// change the following unfold to a merge, because the unfolded dimension is
// already contiguous.