# hytop - monitoring tools

## Quick start

```bash
uv pip install -e .
hytop gpu --help
```

## Prerequisites

- Python >= 3.10
- Python packages: `rich`, `typer`
- Passwordless SSH for remote monitoring

## `hytop gpu`

A lightweight tool that polls `hy-smi` live across one or more hosts and shows current readings and rolling averages in a terminal UI. With `--wait-idle` it also works as a blocking gate for GPU jobs: the command returns once the monitored GPUs are idle.
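
To illustrate what a rolling average over a time window means here, the sketch below keeps timestamped samples and averages only those inside the window. It is not hytop's implementation: the `RollingAverage` class and its methods are hypothetical names chosen for this example.

```python
from collections import deque


class RollingAverage:
    """Mean of the samples seen in the last `window` seconds (illustrative only)."""

    def __init__(self, window: float) -> None:
        self.window = window
        self._samples = deque()  # (timestamp, value) pairs, oldest first

    def add(self, timestamp: float, value: float) -> None:
        """Record a sample and drop everything older than the window."""
        self._samples.append((timestamp, value))
        cutoff = timestamp - self.window
        while self._samples and self._samples[0][0] <= cutoff:
            self._samples.popleft()

    def mean(self):
        """Return the windowed mean, or None if no samples are available."""
        if not self._samples:
            return None
        return sum(value for _, value in self._samples) / len(self._samples)


avg = RollingAverage(window=5.0)
avg.add(0.0, 10.0)
avg.add(1.0, 20.0)
avg.add(5.5, 30.0)  # the sample at t=0.0 is now outside the 5-second window
print(avg.mean())
```

Keying the window on timestamps rather than on a fixed sample count keeps the average meaningful even when polls arrive at uneven intervals, which is plausible when several remote hosts are involved.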

### Usage

Simple examples:

```bash
# Local node, all GPUs, 5-second rolling window
hytop gpu -n 1 --window 5

# Two nodes, monitor only GPU 0 and 1
hytop gpu -H node01,node02 --devices 0,1 -n 1

# Exit with code 0 when all monitored GPUs are available
hytop gpu --devices 0,1 --wait-idle

# Wait at most 300s for availability (exit 124 on timeout)
hytop gpu --devices 0,1 --wait-idle --timeout 300

# Fine-grained columns (output order follows show-flag order)
hytop gpu --showtemp --showpower
hytop gpu --showpower --showtemp
```

Queue jobs in shared environments:

```bash
if hytop gpu -H node01,node02 --wait-idle --timeout 300; then
  echo "GPUs available, starting workload..."
  # YOUR COMMAND HERE (e.g., python train.py)
else
  echo "Error: GPUs not available in time, aborting pipeline."
  exit 1
fi
```

### Exit Codes

Designed to be script-friendly:

* `0`: Availability condition met (GPUs are idle).
* `124`: Timeout reached before the availability condition was met.
* `130`: Interrupted by the user (Ctrl+C).
* `2`: Argument or input error.
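
For pipelines driven from Python rather than a shell, the same gate can be sketched with `subprocess`. The exit codes and the `-H`, `--wait-idle`, and `--timeout` flags come from this README; the helper names (`classify_exit`, `build_wait_command`, `wait_for_gpus`) and the short labels are hypothetical and exist only for this example.

```python
import subprocess

# Exit codes as documented in this README; the labels are made up for the example.
EXIT_MEANINGS = {
    0: "idle",
    2: "usage-error",
    124: "timeout",
    130: "interrupted",
}


def classify_exit(code: int) -> str:
    """Map a hytop exit code to a short label; unknown codes are reported as such."""
    return EXIT_MEANINGS.get(code, "unknown")


def build_wait_command(hosts, timeout):
    """Build the `hytop gpu --wait-idle` command line for the given hosts."""
    cmd = ["hytop", "gpu"]
    if hosts:
        cmd += ["-H", ",".join(hosts)]
    cmd += ["--wait-idle", "--timeout", str(timeout)]
    return cmd


def wait_for_gpus(hosts, timeout) -> str:
    """Block until hytop exits, then return the label for its exit code."""
    result = subprocess.run(build_wait_command(hosts, timeout), check=False)
    return classify_exit(result.returncode)
```

A caller might start its workload only when `wait_for_gpus(["node01", "node02"], 300)` returns `"idle"` and treat every other label as a reason to abort.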

### Fine-grained metric flags

`hytop gpu` builds its display from `hy-smi --json` output and supports a subset of the `hy-smi` `--show*` flags:

- `--showtemp`: GPU core temperature (`Temp`)
- `--showpower`: average package power (`AvgPwr`, plus `AvgPwr@window`)
- `--showhcuclocks`: sclk frequency (`sclk`)
- `--showmemuse`: VRAM usage (`VRAM%`)
- `--showuse`: GPU utilization (`GPU%`, plus `GPU%@window`)

If no `--show*` flags are specified, hytop defaults to:
`--showtemp --showpower --showhcuclocks --showmemuse --showuse`.
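
The flag-order and default rules can be sketched as a small lookup. This is an assumption about how such a mapping might be written, not hytop's actual code: `FLAG_COLUMNS`, `DEFAULT_FLAGS`, and `resolve_columns` are hypothetical names, and the `@window` labels are the placeholders used in the list above.

```python
# Flag -> column labels, copied from the list in this README.
FLAG_COLUMNS = {
    "--showtemp": ["Temp"],
    "--showpower": ["AvgPwr", "AvgPwr@window"],
    "--showhcuclocks": ["sclk"],
    "--showmemuse": ["VRAM%"],
    "--showuse": ["GPU%", "GPU%@window"],
}

# Used when the command line contains no --show* flag.
DEFAULT_FLAGS = list(FLAG_COLUMNS)


def resolve_columns(argv):
    """Return column labels in the order their --show* flags appear in argv."""
    flags = []
    for arg in argv:
        # Keep the first occurrence of each known flag; ignore everything else.
        if arg in FLAG_COLUMNS and arg not in flags:
            flags.append(arg)
    if not flags:
        flags = DEFAULT_FLAGS
    return [column for flag in flags for column in FLAG_COLUMNS[flag]]


print(resolve_columns(["--showpower", "--showtemp"]))
print(resolve_columns([]))
```

Because Python dictionaries preserve insertion order, `DEFAULT_FLAGS` follows the same order as the documented default.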

## Development

### Version bump

Version is sourced from `src/hytop/__init__.py` (`__version__`).

```bash
# patch: 0.1.0 -> 0.1.1
python scripts/bump_version.py patch

# minor: 0.1.1 -> 0.2.0
python scripts/bump_version.py minor

# major: 0.2.0 -> 1.0.0
python scripts/bump_version.py major

# set an explicit version
python scripts/bump_version.py set 1.2.3
```
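
The bump rules shown in the comments above follow the usual semantic-versioning pattern, which can be sketched as follows. This is not the content of `scripts/bump_version.py`; the `bump` function is a hypothetical stand-in that only reproduces the arithmetic and assumes a plain `MAJOR.MINOR.PATCH` string.

```python
def bump(version: str, part: str) -> str:
    """Return `version` with the given part incremented and lower parts reset."""
    major, minor, patch = (int(piece) for piece in version.split("."))
    if part == "major":
        return f"{major + 1}.0.0"
    if part == "minor":
        return f"{major}.{minor + 1}.0"
    if part == "patch":
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown part: {part!r}")


print(bump("0.1.0", "patch"))  # 0.1.1
print(bump("0.1.1", "minor"))  # 0.2.0
print(bump("0.2.0", "major"))  # 1.0.0
```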