Merge sequence parallelism's LayerNorm all-reduce into the distributed optimizer.