* adding in-thread shuffle
* update softmax example
* refactor grid gemm
* refactor gemm: layouts
* bug fix
* clean
* clean
* make it simple
* batched gemm+softmax+gemm
* removing program server
* specify launch bound per kernel instance