# hytop - monitoring tools

## Quick start

Use `uv` to run hytop:

```bash
uv run hytop --help
```

Or activate the virtual environment and run hytop directly:

```bash
source .venv/bin/activate        # Linux/macOS
# source .venv/Scripts/activate  # Windows (Git Bash)
hytop --help
```

## Prerequisites

- Python >= 3.10
- Python packages: `rich`, `typer`
- Passwordless SSH to remote nodes (when using `-H`)
- `hy-smi` available on each monitored node (for `hytop gpu`)

## `hytop`

```bash
# Show the version number
hytop --version

# Specify a timeout for the subcommand
hytop --timeout 300 [COMMAND]

# 0.5-second interval and 5-second rolling window for the subcommand
hytop -n 0.5 --window 5 [COMMAND]

# Specify a list of nodes for the subcommand
hytop -H node01,node02 [COMMAND]
```
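
`-H` takes a comma-separated list of nodes. If the node names are kept in a file with one host per line, a small helper can build that list. This is a sketch: the `hosts_from_file` helper and the file format (one host per line, `#` comments allowed) are assumptions for illustration; hytop itself does not read host files.

```bash
# Join the hosts in a file into a comma-separated list for `hytop -H`.
# Hypothetical helper; assumes one host per line and '#' comments.
hosts_from_file() {
  # Strip comments and surrounding whitespace, drop blank lines, join with commas.
  sed -e 's/#.*//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//' "$1" \
    | grep -v '^$' \
    | paste -sd, -
}

# Usage (hosts.txt is a hypothetical file):
#   hytop -H "$(hosts_from_file hosts.txt)" gpu
```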

## `hytop gpu`

A lightweight tool that polls `hy-smi` live on one or more hosts and shows rolling averages in a terminal UI. It can also act as a blocking gate for GPU jobs: with `--wait-idle`, it exits once the selected GPUs are idle.
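
With a 0.5-second interval and a 5-second window (`-n 0.5 --window 5`), each rolling average covers roughly 5 / 0.5 = 10 samples. The hypothetical `rolling_mean` helper below sketches a fixed-size rolling mean over a stream of readings with `awk`. It assumes a window of a fixed number of samples, which may differ from how hytop actually windows its data.

```bash
# After each reading on stdin (one number per line), print the mean of the
# last $1 readings, using a ring buffer. Illustration only, not hytop code.
rolling_mean() {
  awk -v k="$1" '
    {
      slot = NR % k
      if (NR > k) sum -= buf[slot]   # evict the reading that left the window
      buf[slot] = $1
      sum += $1
      n = (NR < k) ? NR : k          # window is partial until k readings arrive
      printf "%.2f\n", sum / n
    }'
}

# Usage: 10 samples is about a 5-second window at a 0.5-second interval.
#   printf '%s\n' 40 42 38 41 | rolling_mean 10
```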

### Usage

Simple examples:

```bash
# Local node, all GPUs
hytop gpu

# Two nodes, 0.5-second interval
hytop -H node01,node02 -n 0.5 gpu

# Exit with code 0 when all monitored GPUs are available
hytop gpu --devices 0,1 --wait-idle

# Wait for GPUs to be idle for 30 seconds before exiting
hytop gpu --devices 0,1 --wait-idle --wait-idle-seconds 30

# Wait at most 300s for availability (exit 124 on timeout)
hytop gpu --devices 0,1 --wait-idle --timeout 300

# Fine-grained columns (output order follows show-flag order)
hytop gpu --showtemp --showpower
hytop gpu --showpower --showtemp
```

In shared environments, use `--wait-idle` to hold a job until the GPUs are free:

```bash
if hytop -H node01,node02 gpu --timeout 300 --wait-idle; then
  echo "GPUs available, starting workload..."
  # YOUR COMMAND HERE (e.g., python train.py)
else
  echo "Error: GPUs not available in time, aborting pipeline."
fi
```

### Exit Codes

Exit codes are designed for use in scripts:

- `0`: Availability condition met (GPUs are idle).
- `2`: Argument or input error.
- `124`: Timeout reached before the availability condition was met.
- `130`: Interrupted by the user (Ctrl+C).
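
Because a timeout exits with `124` while interrupts and argument errors use other codes, a script can retry the idle wait on timeout and stop on anything else. A minimal sketch; the `wait_for_gpus` helper and its retry-count argument are hypothetical:

```bash
# Re-run a hytop idle wait up to $1 times, retrying only on timeout (124).
# Remaining arguments are passed to hytop unchanged. Hypothetical helper.
wait_for_gpus() {
  local attempts="$1" i=1 status
  shift
  while [ "$i" -le "$attempts" ]; do
    hytop "$@"
    status=$?
    case "$status" in
      0)   return 0 ;;                                   # GPUs are idle
      124) echo "attempt $i/$attempts timed out" >&2 ;;  # wait again
      *)   return "$status" ;;                           # 2, 130, or other error
    esac
    i=$((i + 1))
  done
  return 124                                             # every attempt timed out
}

# Usage:
#   if wait_for_gpus 3 -H node01,node02 gpu --timeout 300 --wait-idle; then
#     python train.py
#   fi
```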

### Fine-grained metric flags

`hytop gpu` reads the JSON output of `hy-smi --json` and supports a subset of the `hy-smi` `--show*` flags:

- `--showtemp`: GPU core temperature (`Temp`)
- `--showpower`: average package power (`AvgPwr`, plus `AvgPwr@window`)
- `--showsclk`: GPU core clock frequency (`sclk`)
- `--showmemuse`: VRAM usage (`VRAM%`)
- `--showuse`: GPU utilization (`GPU%`, plus `GPU%@window`)

If no `--show*` flags are given, `hytop gpu` defaults to `--showtemp --showpower --showsclk --showmemuse --showuse`.

## Development

Clone the repo and run `make setup` to create the virtual environment, install all dependencies (including dev), and configure pre-commit hooks:

```bash
make setup
```

Common development commands:

```bash
make format            # Auto-fix and format code (ruff)
make lint              # Check code style and errors without modifying files
make test              # Run all unit tests (pytest)
make bump part=patch   # Bump the version (patch, minor, major, or "set X.Y.Z")
make clean             # Remove build caches and the virtual environment
```

### Version bump

The version is read from `__version__` in `src/hytop/__init__.py`.

```bash
make bump part=patch          # 0.1.0 -> 0.1.1
make bump part=minor          # 0.1.1 -> 0.2.0
make bump part=major          # 0.2.0 -> 1.0.0
make bump part="set 1.2.3"   # set an explicit version
```
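
The bump examples above follow the usual semantic-versioning rule: bumping a part increments it and resets every lower part to zero. The hypothetical `next_version` helper below reproduces that arithmetic for illustration; it is not how `make bump` is implemented and does not handle the `set` form.

```bash
# Compute the next X.Y.Z version for a bump part (major, minor, or patch).
# Illustrative only; not the Makefile's implementation.
next_version() {
  local version="$1" part="$2"
  local major="${version%%.*}"   # text before the first dot
  local rest="${version#*.}"     # text after the first dot
  local minor="${rest%%.*}"
  local patch="${rest#*.}"
  case "$part" in
    major) echo "$((major + 1)).0.0" ;;
    minor) echo "$major.$((minor + 1)).0" ;;
    patch) echo "$major.$minor.$((patch + 1))" ;;
    *)     echo "unknown part: $part" >&2; return 2 ;;
  esac
}

# Usage:
#   next_version 0.1.1 minor   # prints 0.2.0
```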