# hytop - monitoring tools

## Quick start

```bash
uv pip install -e .
hytop gpu --help
```

## Prerequisites

- Python >= 3.10
- Python packages: `rich`, `typer`
- Passwordless SSH for remote monitoring

## `hytop gpu`

A lightweight command that polls `hy-smi` live and shows rolling averages across multiple hosts in a terminal UI. With `--wait-idle` it blocks until the monitored GPUs are idle, so it can also gate GPU jobs in scripts.
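
To illustrate what a time-based rolling average means here, the following is a minimal sketch and not hytop's actual implementation. The `RollingWindow` class and its methods are hypothetical names; it assumes each sample carries a timestamp in seconds and that samples older than the window are discarded.

```python
from collections import deque
from typing import Optional


class RollingWindow:
    """Hypothetical helper: average of the samples seen in the last `window` seconds."""

    def __init__(self, window: float) -> None:
        self.window = window
        self._samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp: float, value: float) -> None:
        """Record a sample and drop those that have fallen out of the window."""
        self._samples.append((timestamp, value))
        cutoff = timestamp - self.window
        while self._samples and self._samples[0][0] < cutoff:
            self._samples.popleft()

    def average(self) -> Optional[float]:
        """Return the mean of the retained samples, or None if there are none."""
        if not self._samples:
            return None
        return sum(value for _, value in self._samples) / len(self._samples)


# Example: a 5-second window fed with utilization readings.
window = RollingWindow(5.0)
window.add(0.0, 10.0)
window.add(2.0, 20.0)
window.add(6.0, 30.0)  # the sample at t=0.0 is now older than 5 s and is dropped
print(window.average())  # 25.0
```

A larger `--window` smooths out short spikes at the cost of reacting more slowly to changes.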

### Usage

Simple examples:

```bash
# Local node, all GPUs, 5-second rolling window
hytop gpu -n 1 --window 5

# Two nodes, monitor only GPU 0 and 1
hytop gpu -H node01,node02 --devices 0,1 -n 1

# Block until all monitored GPUs are available, then exit with code 0
hytop gpu --devices 0,1 --wait-idle

# Wait at most 300s for availability (exit 124 on timeout)
hytop gpu --devices 0,1 --wait-idle --timeout 300
```

Queue jobs in shared environments:

```bash
if hytop gpu -H node01,node02 --wait-idle --timeout 300; then
  echo "GPUs available, starting workload..."
  # YOUR COMMAND HERE (e.g., python train.py)
else
  echo "Error: GPUs not available in time, aborting pipeline."
  exit 1
fi
```

### Exit Codes

Designed to be script-friendly:

* `0`: Availability condition met (GPUs are idle).
* `124`: Timeout reached before the availability condition was met.
* `130`: Interrupted by the user (Ctrl+C).
* `2`: Argument or input error.
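
When calling `hytop gpu` from Python rather than a shell, the codes above can be mapped to outcomes as in the sketch below. `classify_exit` and `wait_for_gpus` are hypothetical helper names, not part of hytop; the sketch assumes `hytop` is on `PATH` and uses only the flags shown in this README.

```python
import subprocess

# Exit codes as documented in this README.
EXIT_MEANINGS = {
    0: "idle",
    2: "usage-error",
    124: "timeout",
    130: "interrupted",
}


def classify_exit(code: int) -> str:
    """Translate a hytop exit code into a short label ("unknown" if undocumented)."""
    return EXIT_MEANINGS.get(code, "unknown")


def wait_for_gpus(hosts: str, timeout: int) -> str:
    """Hypothetical wrapper: block until GPUs on `hosts` are idle or `timeout` expires."""
    cmd = ["hytop", "gpu", "-H", hosts, "--wait-idle", "--timeout", str(timeout)]
    result = subprocess.run(cmd)
    return classify_exit(result.returncode)
```

A caller might then start the workload only when `wait_for_gpus("node01,node02", 300)` returns `"idle"`, and treat `"timeout"` differently from `"usage-error"`.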

## Development

### Version bump

The package version is read from `__version__` in `src/hytop/__init__.py`.

```bash
# patch: 0.1.0 -> 0.1.1
python scripts/bump_version.py patch

# minor: 0.1.1 -> 0.2.0
python scripts/bump_version.py minor

# major: 0.2.0 -> 1.0.0
python scripts/bump_version.py major

# set an explicit version
python scripts/bump_version.py set 1.2.3
```
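
The arithmetic behind the `patch`, `minor`, and `major` bumps can be sketched as follows. This is an illustration of the numbering shown in the comments above, not the contents of `scripts/bump_version.py`; the `bump` function is a hypothetical name and assumes a plain `MAJOR.MINOR.PATCH` version string.

```python
def bump(version: str, part: str) -> str:
    """Return `version` with the given part incremented and lower parts reset to 0."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("0.1.0", "patch"))  # 0.1.1
print(bump("0.1.1", "minor"))  # 0.2.0
print(bump("0.2.0", "major"))  # 1.0.0
```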