README.md 3.73 KB
Newer Older
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
# RCCL REPLAYER
Collective log replayer tool for RCCL.

## Table of Contents

1. [Introduction](#introduction)
2. [Features](#features)
3. [How It Works](#how-it-works)
4. [Installation](#installation)
5. [Usage](#usage)

## Introduction

Replayer is a dubugging tool designed to analyze and replay collective logs obtained from RCCL (ROCm Communication Collectives Library) runs. It can be a useful tool when trying to recreate problem situations (without as much setup), or as a user-directed utility to run collectives (by crafting their own 'logfile').

## Features

- Parses and validates collective logs from RCCL runs.
- Detects missing/faulty group calls and provides report.
- Replays collective calls based on the recorded data.
- Skips faulty group calls during replay.
- Supports various MPI ranks and GPU configurations.
- Supports multi-node environment. 

*Note: RCCL Replayer executes collective calls with dummy data.*

## How It Works

Replayer operates in the following steps:

1. **Collective Log Collection:** During your RCCL runs, the collective logs are generated when NCCL_DEBUG=INFO and NCCL_DEBUG_SUBSYS=COLL enabled, capturing important information like hostname, deviceIdx, collective call type, number of elements used, data type, operation type, task number, and global rank number about collective communication patterns.

2. **Data Aggregation:** Replayer collects and pareses the collective logs. organizing them based on opCount (collective count in the group call), and global rank information.

3. **Group Call Validation:** After acquiring data from the collective logs and generating group calls, the replayer validates the results using two different methods. For Non-Send/Recv collectives, it checks if each MPI rank has the required number of collective tasks. For Send/Recv collectives, it verifies if they all have a matching pair.

4. **Replaying RCCL:** Based on the aggregated and validated data, Replayer will replay the collective logs to reproduce the RCCL runs from your application.

5. **Reporting and Skipping.** Replayer outputs the detected faulty group calls and skips them during replay. It provides a report showing which group calls were skipped and why and, at the end, summarizes how many group calls were replayed and how many were skipped.

## Installation

To build the replayer, follow these steps:
1. Navigate to the rccl_replayer directory.
2. Make sure 'MPI_DIR' is set to the path where your MPI installation is located.

```bash
    cd rccl/tools/rccl_replayer
    MPI_DIR=/path/to/mpi make
```

Depending on the MPI library used and your installation path, you may need to set the MPI_DIR path accordingly.


## Usage

After successfully building the replayer, you can run it using the following command:

```bash
    mpirun -np <numProcesses> ./rcclReplayer </path/to/logfile> <numGpusPerMpiRank>
```

Replace <numProcesses> with the number of MPI processes you want to run during the replay, </path/to/logfile> with the path to the collective log file generated during your RCCL runs, and <numGpusPerMpiRank> with the number of GPUs per MPI rank used in your application.

Depending on the MPI library you use, you may need to modify the mpirun command accordingly.

### Multi-Node Environment:

If multiple nodes were used for your application, you can also replay the collective logs using multiple nodes. See the following command:

```bash
     mpirun --hostfile <path/to/hostfile.txt> -np <numProcesses> ./rcclReplayer </path/to/logfile> <numGpusPerMpiRank>
```

### SLURM:

For systems using SLURM, you can use the following command to replay the collective logs:

```bash
    srun -N <numNodes> -n <numProcesses> ./rcclReplayer </path/to/logfile> <numGpusPerMpiRank>
```

Replace <numNodes> with the number of nodes used in your application.