Use a tuple (instead of a vector) to hold the C thread matrix data, which fixes a register over-allocation issue
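
The change itself is not shown here, so the following is only a minimal host-side C++17 sketch of the general idea, not the code from this PR. It assumes the per-thread C accumulators are stored as separate tuple elements and are always accessed with compile-time indices, so each element is an independent scalar the compiler can place freely. All names (`kTileSize`, `CThreadTuple`, `accumulate`, `run_example`) are hypothetical, and whether this is the exact mechanism behind the register over-allocation is an assumption.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>

// Hypothetical per-thread C tile size (not taken from the PR).
constexpr std::size_t kTileSize = 4;

// Build a std::tuple<float, ..., float> with one zeroed element per index.
template <std::size_t... Is>
auto make_zero_tuple(std::index_sequence<Is...>) {
    return std::make_tuple(((void)Is, 0.0f)...);
}

using CThreadTuple =
    decltype(make_zero_tuple(std::make_index_sequence<kTileSize>{}));

// c[i] += a[i] * b for every i, with each tuple index resolved at compile time.
template <std::size_t... Is>
void accumulate(CThreadTuple& c, const std::array<float, kTileSize>& a,
                float b, std::index_sequence<Is...>) {
    ((std::get<Is>(c) += a[Is] * b), ...);
}

// Copy the tuple elements into an array so the result is easy to inspect.
template <std::size_t... Is>
std::array<float, kTileSize> to_array(const CThreadTuple& c,
                                      std::index_sequence<Is...>) {
    return {std::get<Is>(c)...};
}

// Runs two accumulation steps on a small tile and returns the accumulators.
std::array<float, kTileSize> run_example() {
    constexpr auto idx = std::make_index_sequence<kTileSize>{};
    CThreadTuple c = make_zero_tuple(idx);
    assert(std::tuple_size<CThreadTuple>::value == kTileSize);

    const std::array<float, kTileSize> a{1.0f, 2.0f, 3.0f, 4.0f};
    accumulate(c, a, 0.5f, idx);
    accumulate(c, a, 0.5f, idx);
    return to_array(c, idx);
}
```

Because every access goes through `std::get<I>` with a constant `I`, there is no runtime indexing into the accumulator storage; on a GPU target the same pattern would be written against the project's own tuple type rather than `std::tuple`.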