* fix bugs in softmax kernel and update unit test for softmax
* remove redundant mask for softmax grad
* test both CUDA and Triton kernels in unit test