@@ -12,6 +12,35 @@ specific language governing permissions and limitations under the License.
# Debugging
## Debugging Multi-GPU Network Issues
When training or running inference with `DistributedDataParallel` on multiple GPUs, if you run into problems with communication between processes and/or nodes, you can use the following script to diagnose network issues.
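A diagnostic of this kind can be sketched roughly as below. This is a minimal sketch, not necessarily the script this page refers to. It assumes PyTorch with the NCCL backend, one GPU per process, and a launcher such as `torchrun` that exports `RANK`, `WORLD_SIZE`, `LOCAL_RANK`, `MASTER_ADDR` and `MASTER_PORT`. The function names `read_launch_env` and `check_all_reduce` are hypothetical.

```python
import os


def read_launch_env(env=os.environ):
    """Return (rank, world_size, local_rank) as exported by the launcher.

    The defaults cover a plain single-process run where nothing is set.
    """
    return (
        int(env.get("RANK", 0)),
        int(env.get("WORLD_SIZE", 1)),
        int(env.get("LOCAL_RANK", 0)),
    )


def check_all_reduce():
    """Sum a tensor of ones across all ranks and verify the result."""
    # Imported lazily so read_launch_env stays usable without PyTorch.
    import torch
    import torch.distributed as dist

    rank, world_size, local_rank = read_launch_env()
    torch.cuda.set_device(local_rank)
    # With no init_method given, the env:// rendezvous is used, which reads
    # MASTER_ADDR and MASTER_PORT from the environment.
    dist.init_process_group(backend="nccl")
    try:
        t = torch.ones(1, device=f"cuda:{local_rank}")
        dist.all_reduce(t, op=dist.ReduceOp.SUM)
        # Each rank contributed 1, so every rank should now hold world_size.
        assert t.item() == world_size, f"rank {rank}: got {t.item()}"
        print(f"rank {rank}/{world_size}: all_reduce OK")
    finally:
        dist.destroy_process_group()


# Only run the GPU check under a launcher, which sets RANK.
if __name__ == "__main__" and "RANK" in os.environ:
    check_all_reduce()
```

Setting the environment variable `NCCL_DEBUG=INFO` when launching such a check makes NCCL log its initialization and transport choices, which is where most network misconfigurations show up.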
This will dump a lot of NCCL-related debug information, which you can search online if it reports any problems. If you're not sure how to interpret the output, you can share the log file in an Issue.
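Because the output can be long, it may help to pull out only the warning lines before searching or sharing. The helper below is a small sketch that assumes the log was captured to a text file and that NCCL marks its warnings with the text `NCCL WARN`. The function name `nccl_warnings` and the script name in the usage comment are hypothetical.

```python
import sys


def nccl_warnings(lines):
    """Return the lines that contain an NCCL warning marker."""
    return [line.rstrip("\n") for line in lines if "NCCL WARN" in line]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python filter_nccl_log.py <log file>
    with open(sys.argv[1], errors="replace") as f:
        for line in nccl_warnings(f):
            print(line)
```

An empty result does not prove the network is healthy, since hangs and timeouts may leave no warning at all, so the full log is still the thing to attach to an Issue.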