# hytop - monitoring tools

## Quick start

### Install from PyPI

Using `pipx` (recommended):

```bash
pipx install hytop
hytop --help
```

Using `uv`:

```bash
uv tool install hytop
hytop --help
```

### Install from source

uv:

```bash
uv run hytop --help
```

pip:

```bash
pip install .
hytop --help
```

pipx:

```bash
pipx install .
hytop --help
```

## Prerequisites

- Python >= 3.10
- Python packages: `rich`, `typer`
- Passwordless SSH for remote hosts

## `hytop`

```bash
# Show the version number
hytop --version

# Specify a timeout for the subcommand
hytop --timeout 300 [COMMAND]

# 0.5-second interval and 5-second rolling window for the subcommand
hytop -n 0.5 --window 5 [COMMAND]

# Specify a list of nodes for the subcommand
hytop -H node01,node02 [COMMAND]
```

## `hytop gpu`

A lightweight tool that polls `hy-smi` live on one or more hosts and shows current values alongside rolling averages in a terminal UI. With `--wait-idle` it blocks until the monitored GPUs are idle, so it can gate GPU jobs in scripts.

### Usage

Simple examples:

```bash
# Local node, all GPUs
hytop gpu

# Two nodes, 0.5-second interval
hytop -H node01,node02 -n 0.5 gpu

# Exit with code 0 when all monitored GPUs are available
hytop gpu --devices 0,1 --wait-idle

# Wait for GPUs to be idle for 30 seconds before exiting
hytop gpu --devices 0,1 --wait-idle --wait-idle-seconds 30

# Wait at most 300s for availability (exit 124 on timeout)
hytop gpu --devices 0,1 --wait-idle --timeout 300

# Fine-grained columns (output order follows show-flag order)
hytop gpu --showtemp --showpower
hytop gpu --showpower --showtemp
```

Queue jobs in shared environments:

```bash
if hytop -H node01,node02 gpu --timeout 300 --wait-idle; then
  echo "GPUs available, starting workload..."
  # YOUR COMMAND HERE (e.g., python train.py)
else
  echo "Error: GPUs not available in time, aborting pipeline."
fi
```

### Exit codes

Designed to be script-friendly:

* `0`: Availability condition met (GPUs are idle).
* `124`: Timeout reached before the availability condition was met.
* `130`: Interrupted by the user (Ctrl+C).
* `2`: Argument or input error.
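
As a rough sketch of how a script might branch on these codes, the wrapper below runs a command only when the wait succeeds. The function name `run_when_idle` and the messages are illustrative and not part of `hytop`; only the exit codes and the `--wait-idle` and `--timeout` flags come from this README.

```bash
# Hypothetical helper: wait for idle GPUs, then run the given command.
# Only the exit codes (0, 2, 124, 130) are taken from the list above.
run_when_idle() {
  local rc=0
  hytop gpu --wait-idle --timeout 300 || rc=$?
  case "$rc" in
    0)   "$@" ;;   # GPUs are idle: run the workload
    124) echo "timed out waiting for idle GPUs" >&2; return 124 ;;
    130) echo "interrupted by the user" >&2; return 130 ;;
    2)   echo "invalid hytop arguments" >&2; return 2 ;;
    *)   echo "unexpected exit code: $rc" >&2; return "$rc" ;;
  esac
}

# Usage: run_when_idle python train.py
```

Passing the original code through lets an outer pipeline tell a timeout apart from a usage error.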

### Fine-grained metric flags

`hytop gpu` uses formatted `hy-smi --json` output and supports a subset of `hy-smi` `--show*` flags:

- `--showtemp`: GPU core temperature (`Temp`)
- `--showpower`: average package power (`AvgPwr`, plus `AvgPwr@window`)
- `--showsclk`: sclk frequency (`sclk`)
- `--showmemuse`: VRAM usage (`VRAM%`)
- `--showuse`: GPU utilization (`GPU%`, plus `GPU%@window`)

If no `--show*` flags are specified, hytop defaults to:
`--showtemp --showpower --showsclk --showmemuse --showuse`.

### SSH transport tuning

`hytop` collects remote data with a lightweight SSH pull model. The core layer enables SSH connection reuse by default with the following options, which apply to every subcommand that collects data over SSH:

- `ControlMaster=auto`
- `ControlPersist=30s`
- `ControlPath=~/.ssh/hytop-%C`
- `ServerAliveInterval=5`
- `ServerAliveCountMax=1`
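
To try these options by hand, they can be passed to a plain `ssh` command line. The sketch below is only an approximation: the helper name `ssh_reuse` is hypothetical, and the command `hytop` actually runs internally may differ.

```bash
# Hypothetical helper: run a remote command with the options listed above.
# This mirrors the listed settings only; it is not hytop's own invocation.
ssh_reuse() {
  local host=$1
  shift
  ssh \
    -o ControlMaster=auto \
    -o ControlPersist=30s \
    -o 'ControlPath=~/.ssh/hytop-%C' \
    -o ServerAliveInterval=5 \
    -o ServerAliveCountMax=1 \
    "$host" "$@"
}

# Usage: ssh_reuse node01 hostname
```

With `ControlMaster=auto`, the first call opens a master connection and later calls to the same host reuse it. `ControlPersist=30s` keeps that master open for 30 seconds after the last client disconnects.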

## `hytop net`

Lightweight pull-based network monitor for Ethernet and InfiniBand across one or more hosts.

### Usage

```bash
# Local host, auto-discover eth+ib interfaces
hytop net

# Two hosts, 0.5-second interval
hytop -H node01,node02 -n 0.5 net

# IB-only monitoring
hytop net --kind ib

# Include only selected interfaces
hytop net --ifaces eth0,mlx5_0/p1

# Stop after 60 seconds (returns 124 on timeout)
hytop --timeout 60 net
```

## Development

Clone the repo and run `make setup` to create the virtual environment, install all dependencies (including dev), and configure pre-commit hooks:

```bash
make setup
```

Common development commands:

```bash
make format     # Auto-fix and format code (ruff)
make lint       # Check code style and errors without modifying files
make test       # Run all unit tests (pytest)
make bump part=patch  # Bump version (patch/minor/major or X.Y.Z)
make clean      # Remove build caches and the virtual environment
```

### Version bump

The version is managed automatically via `bump-my-version`. Running the bump command will:

1. Update `__version__` in `src/hytop/__init__.py`
2. Update `current_version` in `pyproject.toml`
3. Create a commit (e.g., `[hytop] Bump version: 0.1.1 → 0.1.2`)
4. Create a tag (e.g., `hytop-0.1.2`)

```bash
make bump part=patch          # 0.1.1 -> 0.1.2
make bump part=minor          # 0.1.2 -> 0.2.0
make bump part=major          # 0.2.0 -> 1.0.0
make bump part=1.2.3          # set an explicit version
```
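
The arithmetic behind each `part` value can be sketched as follows. This is an illustration of the version numbers shown above, not how `bump-my-version` is implemented, and the function name `next_version` is hypothetical.

```bash
# Illustrative only: compute the version that a given bump part produces.
next_version() {
  local current=$1 part=$2
  local major=${current%%.*}   # text before the first dot
  local rest=${current#*.}     # text after the first dot
  local minor=${rest%%.*}
  local patch=${rest#*.}
  case "$part" in
    patch) patch=$((patch + 1)) ;;
    minor) minor=$((minor + 1)); patch=0 ;;
    major) major=$((major + 1)); minor=0; patch=0 ;;
    *)     echo "$part"; return 0 ;;   # an explicit X.Y.Z is used as-is
  esac
  echo "$major.$minor.$patch"
}

next_version 0.1.1 patch   # prints 0.1.2
next_version 0.1.2 minor   # prints 0.2.0
next_version 0.2.0 major   # prints 1.0.0
```

Bumping a part resets every lower part to zero, which is why a minor bump from `0.1.2` yields `0.2.0` rather than `0.2.2`.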

### Publish

Releases are published to PyPI automatically by GitHub Actions when a version tag is pushed.

```bash
# 1. Bump version (auto-commits and auto-tags)
make bump part=patch

# 2. Push commits and tags to trigger GitHub Actions release
git push --follow-tags
```

To test building distributions locally:

```bash
make build
```