# hytop - monitoring tools

## Quick start

```bash
uv pip install -e .
hytop --help
hytop gpu --help
```

## Prerequisites

- Python >= 3.10
- Python packages: `rich`, `typer`
- Passwordless SSH for remote nodes (when using `-H`)

## `hytop`

```bash
# Show the version number
hytop --version

# Specify a timeout for the subcommand
hytop --timeout 300 [COMMAND]

# 0.5-second interval and 5-second rolling window for the subcommand
hytop -n 0.5 --window 5 [COMMAND]

# Specify a list of nodes for the subcommand
hytop -H node01,node02 [COMMAND]
```

## `hytop gpu`

A lightweight script that polls `hy-smi` live and shows rolling averages across multiple hosts. It renders a terminal UI and can block until GPUs are idle, which makes it usable as a gate for GPU jobs.

### Usage

Simple examples:

```bash
# Local node, all GPUs
hytop gpu

# Two nodes, 0.5-second interval
hytop -H node01,node02 -n 0.5 gpu

# Exit with code 0 when all monitored GPUs are available
hytop gpu --devices 0,1 --wait-idle

# Wait for GPUs to be idle for 30 seconds before exiting
hytop gpu --devices 0,1 --wait-idle --wait-idle-seconds 30

# Wait at most 300s for availability (exit 124 on timeout)
hytop gpu --devices 0,1 --wait-idle --timeout 300

# Fine-grained columns (output order follows show-flag order)
hytop gpu --showtemp --showpower
hytop gpu --showpower --showtemp
```

Queue jobs in shared environments:

```bash
if hytop -H node01,node02 gpu --timeout 300 --wait-idle; then
  echo "GPUs available, starting workload..."
  # YOUR COMMAND HERE (e.g., python train.py)
else
  echo "Error: GPUs not available in time, aborting pipeline." >&2
  exit 1
fi
```

### Exit Codes

Designed to be script-friendly:

* `0`: Availability condition met (GPUs are idle).
* `124`: Timeout reached before the availability condition was met.
* `130`: Interrupted by the user (Ctrl+C).
* `2`: Argument or input error.
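
A wrapper script can branch on these codes. The following is a minimal sketch, assuming only the four codes listed above; the `classify_exit` helper and its action labels (`run`, `retry`, `stop`, `fix-args`, `unknown`) are hypothetical and not part of hytop.

```bash
# Map a hytop exit status to a short action label.
# classify_exit and its labels are illustrative assumptions, not hytop features.
classify_exit() {
  case "$1" in
    0)   echo "run" ;;       # availability condition met
    124) echo "retry" ;;     # timed out before GPUs became idle
    130) echo "stop" ;;      # interrupted by the user (Ctrl+C)
    2)   echo "fix-args" ;;  # argument or input error
    *)   echo "unknown" ;;   # any status not documented above
  esac
}

# Example usage (needs hytop on PATH, so it is left commented out):
# hytop gpu --devices 0,1 --wait-idle --timeout 300
# action=$(classify_exit "$?")
```

Treating any undocumented status as `unknown` keeps the script from silently starting a workload after an unexpected failure.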

### Fine-grained metric flags

`hytop gpu` uses formatted `hy-smi --json` output and supports a subset of `hy-smi` `--show*` flags:

- `--showtemp`: GPU core temperature (`Temp`)
- `--showpower`: average package power (`AvgPwr`, plus `AvgPwr@window`)
- `--showsclk`: sclk frequency (`sclk`)
- `--showmemuse`: VRAM usage (`VRAM%`)
- `--showuse`: GPU utilization (`GPU%`, plus `GPU%@window`)

If no `--show*` flags are specified, hytop defaults to:
`--showtemp --showpower --showsclk --showmemuse --showuse`.
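
Because column order follows flag order, a script that wants a specific layout has to emit the flags in that order. The sketch below builds the flag string from a comma-separated metric list; `build_show_flags` is a hypothetical helper, and it accepts only the five metric names listed above.

```bash
# Turn "power,temp" into "--showpower --showtemp", preserving the given order.
# build_show_flags is an illustrative helper, not something hytop provides.
build_show_flags() {
  flags=""
  old_ifs=$IFS
  IFS=','                      # split the argument on commas
  for metric in $1; do
    case "$metric" in
      temp|power|sclk|memuse|use)
        flags="${flags:+$flags }--show${metric}" ;;
      *)
        IFS=$old_ifs
        echo "unknown metric: ${metric}" >&2
        return 2 ;;
    esac
  done
  IFS=$old_ifs                 # restore normal word splitting
  printf '%s\n' "$flags"
}

# Example usage (needs hytop on PATH, so it is left commented out):
# hytop gpu $(build_show_flags "power,temp")
```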

## Development

### Version bump

Version is sourced from `src/hytop/__init__.py` (`__version__`).

```bash
# patch: 0.1.0 -> 0.1.1
python scripts/bump_version.py patch

# minor: 0.1.1 -> 0.2.0
python scripts/bump_version.py minor

# major: 0.2.0 -> 1.0.0
python scripts/bump_version.py major

# set an explicit version
python scripts/bump_version.py set 1.2.3
```
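
The bump rules in the comments above follow the usual `MAJOR.MINOR.PATCH` convention: bumping a part increments it and resets the parts to its right to zero. As a rough illustration of that arithmetic only (this is not how `scripts/bump_version.py` is implemented, and `bump_semver` is a hypothetical name):

```bash
# bump_semver VERSION PART prints the bumped version.
# Assumes VERSION is plain MAJOR.MINOR.PATCH with no pre-release suffix.
bump_semver() {
  version=$1
  part=$2
  major=${version%%.*}   # text before the first dot
  rest=${version#*.}     # text after the first dot
  minor=${rest%%.*}
  patch=${rest#*.}
  case "$part" in
    major) major=$((major + 1)); minor=0; patch=0 ;;
    minor) minor=$((minor + 1)); patch=0 ;;
    patch) patch=$((patch + 1)) ;;
    *) echo "unknown part: ${part}" >&2; return 2 ;;
  esac
  printf '%s.%s.%s\n' "$major" "$minor" "$patch"
}

bump_semver 0.1.0 patch   # prints 0.1.1
bump_semver 0.1.1 minor   # prints 0.2.0
bump_semver 0.2.0 major   # prints 1.0.0
```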