fix mask softmax and support inference for arbitrary-length sequences
add inject_openfold
fix minor bug in gather
refactor kernel implementation