Unverified commit 6dc85fbc, authored by Keiven C, committed by GitHub

feat: GPU-parallel test runner with VRAM-aware scheduling (#7560)


Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
parent 4ea21079
......@@ -20,9 +20,17 @@ MODEL="${MODEL:-Qwen/Qwen3-VL-8B-Instruct}"
NAMESPACE="${NAMESPACE:-dynamo}"
HTTP_PORT="${HTTP_PORT:-8000}"
BLOCK_SIZE="${BLOCK_SIZE:-16}" # Must match vLLM backend KV block size
GPU_MEMORY_UTILIZATION="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-${GPU_MEMORY_UTILIZATION:-0.85}}"
GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.85}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
# KV cache override for parallel-safe GPU memory control
KV_BYTES="${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"
if [[ -n "$KV_BYTES" ]]; then
GPU_MEM_ARGS="--kv-cache-memory-bytes $KV_BYTES --gpu-memory-utilization 0.01"
else
GPU_MEM_ARGS="--gpu-memory-utilization ${GPU_MEMORY_UTILIZATION}"
fi
NATS_SERVER="${NATS_SERVER:-nats://127.0.0.1:4222}"
ETCD_ENDPOINTS="${ETCD_ENDPOINTS:-http://127.0.0.1:2379}"
......@@ -121,7 +129,7 @@ env "${COMMON_ENV[@]}" \
--enable-multimodal \
--block-size "${BLOCK_SIZE}" \
--enforce-eager \
--gpu-memory-utilization "${GPU_MEMORY_UTILIZATION}" \
$GPU_MEM_ARGS \
--max-model-len "${MAX_MODEL_LEN}" \
--served-model-name "${MODEL}__internal" \
${VLLM_EXTRA_ARGS} &
......
# GPU Memory Parameters by Engine
# GPU Memory Control
How vLLM, sglang, and TensorRT-LLM interpret memory-related parameters, and how
to estimate total GPU VRAM usage for each.
How vLLM, SGLang, and TensorRT-LLM allocate GPU memory, and how we override
it for deterministic parallel test execution.
---
## Quick Reference
| Parameter | vLLM | sglang | TensorRT-LLM |
|---|---|---|---|
| Memory fraction | `--gpu-memory-utilization` | `--mem-fraction-static` | `free_gpu_memory_fraction` (YAML/override) |
| Fraction base | Total VRAM | Total VRAM | Free VRAM (after model load) |
| Default fraction | 0.90 | 0.90 | 0.90 |
| Max sequence length | `--max-model-len` | `--context-length` | `max_seq_len` (YAML/override) |
| KV cache size override | `--kv-cache-memory-bytes` | N/A | `max_gpu_total_bytes` (broken in 1.3.0rc5) |
---
## 1. vLLM
### How `--gpu-memory-utilization` works
This is a fraction of **total** GPU VRAM. The engine budgets everything within
this limit:
```
budget = total_vram * gpu_memory_utilization
KV cache = budget - model_weights - peak_activations - framework_overhead
```
At startup, vLLM profiles actual model weight and activation memory, then
pre-allocates the remaining budget as KV cache blocks. The KV pool size is fixed
for the lifetime of the engine.
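As a rough worked example of the budget arithmetic above, the sketch below computes the KV cache left over for a given fraction. The helper name `vllm_kv_budget_gib` and the sample weight, activation, and overhead figures are illustrative assumptions, not values reported by vLLM.

```bash
# Hypothetical helper (not part of gpu_utils.sh): estimate the KV cache that
# remains after weights, activations, and overhead are taken from the budget.
# Args: total_vram_gib fraction weights_gib activations_gib overhead_gib
vllm_kv_budget_gib() {
    awk -v t="$1" -v f="$2" -v w="$3" -v a="$4" -v o="$5" 'BEGIN {
        kv = t * f - w - a - o
        if (kv < 0) kv = 0   # the model does not fit inside the budget
        printf "%.1f\n", kv
    }'
}

# Assumed inputs: 48 GiB GPU at 0.90, 14 GiB weights, 1 GiB activations,
# 2 GiB framework overhead.
vllm_kv_budget_gib 48 0.90 14 1 2   # prints 26.2
```

A result of 0.0 would correspond to the startup failure case, where the budget cannot even hold the weights and overhead.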
### How `--max-model-len` works
Sets the maximum total sequence length (input + output tokens). Longer sequences
require more KV cache per request. If the requested `max-model-len` needs more
KV cache than the budget allows, vLLM errors at startup:
```
ValueError: ... X GiB KV cache is needed, which is larger than the available
KV cache memory (Y GiB). ...
```
Reducing `--max-model-len` is the most effective way to reduce VRAM when the
model fits but the KV cache doesn't.
### How `--kv-cache-memory-bytes` works
When set, this overrides the automatic KV cache sizing from
`gpu-memory-utilization`. The engine allocates exactly this many bytes for KV
cache regardless of the fraction. This means `gpu-memory-utilization` still
controls the *overall* VRAM budget (and thus whether the model fits), but the
KV cache portion is pinned to the explicit byte value.
Consequence for profiling: if a script uses `--kv-cache-memory-bytes`,
changing `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` (which maps to
`--gpu-memory-utilization`) won't change the KV cache size, only the leftover
headroom for activations and overhead.
### Estimating total GPU usage
```
total_vram ≈ model_weights + kv_cache + activations + overhead
model_weights ≈ num_params * bytes_per_param
(e.g. 7B * 2 bytes for BF16 ≈ 14 GiB)
## Why absolute caps, not fractions
kv_cache_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
(the factor of 2 is for K and V tensors)
Memory fractions (`--gpu-memory-utilization`, `--mem-fraction-static`) are
unreliable for parallel / CI workloads:
kv_cache_total = kv_cache_per_token * max_model_len * max_concurrent_seqs
- **Non-deterministic** — same fraction produces different KV cache sizes
depending on what else is on the GPU at init time.
- **Profiling race** — concurrent engines each see "nearly all memory free",
allocate based on that, and OOM.
- **Not portable** — a fraction tuned for 48 GiB is wrong on 24 or 80 GiB.
- **Different semantics** — vLLM/SGLang use fraction of *total* VRAM;
TensorRT-LLM uses fraction of *free* VRAM after model load.
overhead ≈ engine-dependent (auto-computed by estimate_worker_vram):
vllm: 1.2 + 1.0 * sqrt(params_b) GiB (0.6B≈2.0, 8B≈4.0)
sglang: 1.5 + 1.0 * sqrt(params_b) GiB (0.6B≈2.3, 8B≈4.3)
trtllm: 2.0 + 1.2 * sqrt(params_b) GiB (0.6B≈2.9, 8B≈5.4)
```
Instead, we use **absolute KV cache caps**:
Rule of thumb: set `gpu-memory-utilization` so that
`total_vram * fraction >= model_weights + 2 GiB`. The rest becomes KV cache.
| Engine | Deterministic override | Env var |
|--------|----------------------|---------|
| vLLM | `--kv-cache-memory-bytes N` | `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` |
| SGLang | `--max-total-tokens N` | `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` |
| TensorRT-LLM | *(future TODO)* | — |
---
## 2. sglang
### How `--mem-fraction-static` works
Like vLLM, this is a fraction of **total** GPU VRAM:
```
budget = total_vram * mem_fraction_static
KV cache pool = budget - model_weights
```
The budget covers model weights and the KV cache pool. Activations and CUDA
graph buffers are allocated *outside* this budget from the remaining VRAM.
This is slightly different from vLLM (which includes activations in the budget).
sglang recommends keeping 5-8 GiB free for activations and overhead. If you
see OOM errors, decrease `--mem-fraction-static` by 0.01-0.05 increments.
### How `--context-length` and `--max-running-requests` work
Unlike vLLM (where `--max-model-len` directly affects KV cache sizing), sglang's
`--context-length` and `--max-running-requests` do **not** affect KV cache
allocation. The KV cache pool is sized entirely from `--mem-fraction-static`:
```
kv_cache_pool = total_vram * mem_fraction_static - model_weights
```
Profiling confirmed this: changing `--context-length` from 512 to 40960 produced
identical `max_total_num_tokens` values (269,136 on a 48 GiB GPU at fraction 0.95).
These flags only affect **request scheduling**:
- `--context-length` caps the per-request token usage from the KV pool
- `--max-running-requests` limits concurrent request slots (allocated from
memory outside the `--mem-fraction-static` budget)
Setting `--max-running-requests` too high at high fractions can cause OOM because
the request slot pool competes for the small amount of memory left after KV cache
allocation.
### Estimating total GPU usage
```
total_vram ≈ model_weights + kv_cache_pool + activations_and_overhead
kv_cache_pool = total_vram * mem_fraction_static - model_weights
## Quick Reference
activations_and_overhead ≈ 1-2 GiB for small models (0.6B-4B)
~3-5 GiB for larger models (7B+)
(CUDA context, graphs, request pools — allocated outside mem_fraction_static)
```
| | vLLM | SGLang | TensorRT-LLM |
|---|---|---|---|
| Fraction flag | `--gpu-memory-utilization` | `--mem-fraction-static` | `free_gpu_memory_fraction` |
| Fraction base | Total VRAM | Total VRAM | Free VRAM (post-load) |
| Default | 0.90 | 0.90 | 0.90 |
| Max seq len | `--max-model-len` | `--context-length` | `max_seq_len` |
| KV cache override | `--kv-cache-memory-bytes` | `--max-total-tokens` | *(broken in 1.3.0rc5)* |
---
## 3. TensorRT-LLM
### How `free_gpu_memory_fraction` works
This is a fraction of **free** VRAM (not total). The engine:
1. Loads model weights and builds the TRT engine (fixed cost).
2. Queries remaining free GPU memory.
3. Allocates `free_memory * free_gpu_memory_fraction` for the KV cache pool.
```
kv_cache = free_vram_after_model_load * free_gpu_memory_fraction
```
This means the same fraction yields different absolute KV cache sizes depending
on how much VRAM the model consumed. A 5 GiB model on a 48 GiB GPU leaves
~43 GiB free; fraction=0.24 gives ~10 GiB KV cache. A 30 GiB model leaves
~18 GiB free; fraction=0.24 gives only ~4 GiB.
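The two cases above can be reproduced with a small sketch. The helper `trtllm_kv_gib` is hypothetical, and it approximates "free VRAM after model load" as total minus model size, ignoring CUDA context and engine buffers.

```bash
# Hypothetical helper: KV cache TensorRT-LLM would claim from free VRAM.
# Simplifying assumption: free VRAM after load = total - model size.
# Args: total_vram_gib model_gib free_gpu_memory_fraction
trtllm_kv_gib() {
    awk -v t="$1" -v m="$2" -v f="$3" 'BEGIN { printf "%.1f\n", (t - m) * f }'
}

trtllm_kv_gib 48 5 0.24    # 5 GiB model, ~43 GiB free: prints 10.3
trtllm_kv_gib 48 30 0.24   # 30 GiB model, ~18 GiB free: prints 4.3
```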
Set via YAML config, CLI, or env var:
```bash
--override-engine-args '{"kv_cache_config":{"free_gpu_memory_fraction": 0.24}}'
DYN_TRTLLM_OVERRIDE_ENGINE_ARGS='{"kv_cache_config":{"free_gpu_memory_fraction": 0.24}}'
```
### How `max_seq_len` works
Maximum total sequence length. Defaults to the model's native context.
Sequences exceeding this limit are rejected at runtime.
**VRAM impact: none (PyTorch backend).** Reducing max_seq_len from 40960 to
2048 had zero effect on total VRAM or KV cache size in testing (Qwen3-0.6B,
trtllm 1.3.0rc5). The PyTorch backend does not pre-allocate internal buffers
proportional to max_seq_len; KV cache size is determined solely by
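`free_gpu_memory_fraction`. This differs from vLLM, where `--max-model-len`
determines how much KV cache is required at startup. For sglang, profiling
found that `--context-length` does not change the KV pool size either.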
Override via:
```bash
--override-engine-args '{"max_seq_len": 4096}'
```
## Per-Engine Notes
### Override gotcha: sub-dict replacement
### vLLM
Overriding any field inside `kv_cache_config` **replaces the entire sub-dict**.
If your YAML has `enable_block_reuse: true` and you override only
`free_gpu_memory_fraction`, you lose `enable_block_reuse`. Always re-include
all fields you need:
`--gpu-memory-utilization` sets a budget as fraction of total VRAM.
KV cache = budget - weights - activations - overhead. Pool is fixed at startup.
```json
{"kv_cache_config": {"free_gpu_memory_fraction": 0.15, "enable_block_reuse": true}}
```
`--kv-cache-memory-bytes` overrides automatic sizing and **skips memory
profiling** ([PR #21489]). The KV cache is pinned to the exact byte value —
no profiling race, no CUDAGraph estimation errors, safe for concurrent
instances ([#10643]). When set, `--gpu-memory-utilization` only affects
headroom for activations, not KV cache size.
### How `max_num_tokens` works
`--max-model-len` caps sequence length. Reducing it is the fastest way to
cut VRAM when the model fits but KV cache doesn't.
Maximum batched input tokens per iteration. Primarily a throughput knob.
[PR #21489]: https://github.com/vllm-project/vllm/pull/21489
[#10643]: https://github.com/vllm-project/vllm/issues/10643
**VRAM impact: none.** Reducing from 8192 → 256 had no measurable effect on
total VRAM (41,643 vs 41,465 MiB — within noise; the slight *increase* is
because smaller activation footprint lets the fraction claim marginally more
KV cache).
### SGLang
### `max_gpu_total_bytes` (broken)
`--mem-fraction-static` sets a budget as fraction of total VRAM.
KV cache pool = budget - weights. Activations and CUDA graph buffers are
*outside* this budget (unlike vLLM).
Intended as an absolute byte cap for KV cache. As of trtllm 1.3.0rc5, this
field is **ignored**. Setting 5 GiB cap with `free_gpu_memory_fraction=0.95`
still allocated ~42 GiB of KV cache. Setting `free_gpu_memory_fraction=0.0`
with only `max_gpu_total_bytes` causes `"Impossible to fit any sequence in
kvCache"`. Do not rely on this field.
`--max-total-tokens` caps the KV token pool directly, regardless of fraction.
When set, the token cap is the binding constraint.
### Override precedence
`--context-length` and `--max-running-requests` affect request scheduling
only — they do **not** change KV cache allocation.
```
--override-engine-args JSON > --extra-engine-args YAML > CLI flags
```
### TensorRT-LLM
The `DYN_TRTLLM_OVERRIDE_ENGINE_ARGS` env var is equivalent to
`--override-engine-args` and avoids shell quoting issues with scripts whose
arg parsers consume unknown flags before passing `"$@"`.
`free_gpu_memory_fraction` is a fraction of **free** VRAM after model load.
Set via YAML or `--override-engine-args '{"kv_cache_config":{"free_gpu_memory_fraction": 0.24}}'`.
### Estimating total GPU usage
```
total_vram ≈ model_weights + engine_overhead + kv_cache
model_weights ≈ num_params * bytes_per_param / tensor_parallel_size
engine_overhead ≈ 2.0 + 1.2 * sqrt(params_b) GiB (CUDA context + TRT buffers + activations)
kv_cache = free_vram_after_model_load * free_gpu_memory_fraction
```
Engine overhead is auto-computed by `estimate_worker_vram` when called with the
`trtllm` engine name. Examples: 0.6B → 2.9 GiB, 8B → 5.4 GiB, 30B → 8.6 GiB.
### Empirical validation (Qwen3-0.6B, RTX 6000 Ada 48 GiB, trtllm 1.3.0rc5)
Controlled test: single worker via agg.sh, one override at a time.
| # | Override | Total VRAM | KV Cache | Tokens |
|---|---------|-----------|----------|--------|
| 1 | Baseline (YAML frac=0.85) | 41,465 MiB | 38.04 GiB | 356,160 |
| 2 | `free_gpu_memory_fraction=0.15` | 9,383 MiB | 6.71 GiB | 62,848 |
| 3 | `max_num_tokens=256` | 41,643 MiB | 38.26 GiB | 358,208 |
| 4 | `max_seq_len=4096` | 41,469 MiB | 38.05 GiB | 356,192 |
| 5 | `max_seq_len=2048` | 41,469 MiB | 38.05 GiB | 356,192 |
| 6 | seq=4096 + frac=0.15 | 9,383 MiB | 6.71 GiB | 62,848 |
| 7 | tokens=256 + seq=4096 + frac=0.15 | 9,377 MiB | 6.75 GiB | 63,200 |
**Conclusion:** `free_gpu_memory_fraction` is the **sole effective knob** for
trtllm VRAM control. Neither `max_seq_len` nor `max_num_tokens` reduce memory.
Combined overrides (test 7) produce no additional benefit over fraction alone
(test 2).
Deterministic KV cache control via `build_gpu_mem_args` is a future TODO.
---
## Why vLLM/sglang fractions are NOT interchangeable with TensorRT-LLM
Consider wanting 10 GiB of KV cache on a 48 GiB GPU with a 5 GiB model:
| Engine | Fraction meaning | Calculation | Result |
|---|---|---|---|
| vLLM | 10/48 = 0.21 of total | `48 * 0.21 = 10 GiB` budget (minus model = 5 GiB KV) | Wrong — need higher fraction |
| sglang | Same as vLLM | Same math | Same problem |
| TensorRT-LLM | 10/43 = 0.23 of free | `43 * 0.23 = 10 GiB` KV cache | Correct |
For vLLM/sglang, you actually need `(model + kv) / total = (5 + 10) / 48 = 0.31`
to get 10 GiB of KV cache with a 5 GiB model.
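A minimal sketch of the two conversions, using the same simplified math as the table (weights plus KV cache only, no activations or overhead). Both function names are hypothetical and are not the helpers defined in `gpu_utils.sh`.

```bash
# Hypothetical helpers contrasting the two fraction semantics.
# Args for both: total_gib model_gib kv_gib

# Fraction of TOTAL VRAM so that weights + KV cache fit (vLLM / sglang).
total_fraction_for_kv() {
    awk -v t="$1" -v m="$2" -v k="$3" 'BEGIN { printf "%.4f\n", (m + k) / t }'
}

# Fraction of FREE VRAM after model load (TensorRT-LLM).
free_fraction_for_kv() {
    awk -v t="$1" -v m="$2" -v k="$3" 'BEGIN { printf "%.4f\n", k / (t - m) }'
}

total_fraction_for_kv 48 5 10   # prints 0.3125
free_fraction_for_kv 48 5 10    # prints 0.2326
```

The same 10 GiB target therefore maps to roughly 0.31 under total-VRAM semantics and 0.23 under free-VRAM semantics.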
## `build_gpu_mem_args` and Env Vars
The helper functions in `gpu_utils.sh` handle these differences:
- `gpu_gb_to_total_fraction`: for vLLM/sglang (fraction of total VRAM)
- `gpu_gb_to_free_fraction`: for TensorRT-LLM (fraction of free VRAM)
- `gpu_worker_fraction <engine> <total_gib> <kv_gib>`: converts estimated GiB
into the engine-appropriate fraction (total for vllm/sglang, free for trtllm).
Launch scripts use `build_gpu_mem_args` which calls these internally:
Launch scripts source `gpu_utils.sh` and call `build_gpu_mem_args` to pick
up env-var overrides during profiling and parallel execution:
```bash
GPU_MEM_FRACTION=$(build_gpu_mem_args trtllm --model "$MODEL" --max-model-len "$SEQ_LEN" --max-num-seqs "$CONCURRENCY")
```
---
## KV Cache Memory Per Token
source "$SCRIPT_DIR/../../../common/gpu_utils.sh"
The formula for KV cache memory per token is the same across all engines:
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
python -m dynamo.vllm --model "$MODEL" $GPU_MEM_ARGS &
```
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
GPU_MEM_ARGS=$(build_gpu_mem_args sglang)
python -m dynamo.sglang --model-path "$MODEL" $GPU_MEM_ARGS &
```
| Model | Layers | KV Heads | Head Dim | Dtype | Per Token |
|---|---|---|---|---|---|
| Qwen3-0.6B | 28 | 8 | 128 | BF16 | 112 KiB |
| Llama-3.1-8B | 32 | 8 | 128 | BF16 | 128 KiB |
| Llama-3.1-70B | 80 | 8 | 128 | BF16 | 320 KiB |
| Qwen2.5-VL-7B | 28 | 4 | 128 | BF16 | 56 KiB |
When the env var is set, `build_gpu_mem_args` returns the corresponding flag.
Otherwise it returns empty and the engine uses its default allocation.
To estimate KV cache for a given context length:
```
kv_cache_gib = kv_bytes_per_token * max_model_len * max_concurrent_seqs / (1024^3)
```
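The per-token and total formulas can be checked against the table with a short sketch. The helper names are hypothetical, and the default of 2 bytes per element assumes a BF16 KV cache.

```bash
# Hypothetical helpers mirroring the formulas above.
# Args: layers kv_heads head_dim [bytes_per_element, default 2 for BF16]
kv_bytes_per_token() {
    echo $(( 2 * $1 * $2 * $3 * ${4:-2} ))   # leading 2 = K and V tensors
}

# Args: bytes_per_token max_model_len max_concurrent_seqs
kv_cache_gib() {
    awk -v b="$1" -v l="$2" -v s="$3" \
        'BEGIN { printf "%.3f\n", b * l * s / (1024 * 1024 * 1024) }'
}

per_token=$(kv_bytes_per_token 28 8 128)   # Qwen3-0.6B row: 114688 bytes = 112 KiB
echo "$per_token"
kv_cache_gib "$per_token" 4096 2           # prints 0.875
```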
---
| Env var | Engine | CLI flag produced |
|---------|--------|-------------------|
| `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` | vLLM | `--kv-cache-memory-bytes N --gpu-memory-utilization 0.01` |
| `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` | SGLang | `--max-total-tokens N` |
## `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE`
For multi-worker single-GPU scripts, pass `--workers-per-gpu N` to divide
the allocation: `build_gpu_mem_args vllm --workers-per-gpu 2`.
Environment variable used by Dynamo's VRAM profiler to binary-search the minimum
memory fraction a script needs.
**Profiler** (`profile_pytest.py`): binary-searches the KV cap to find the
minimum passing value, applies a 2x safety factor, outputs pytest markers
(`@pytest.mark.requested_vllm_kv_cache_bytes(N)` or
`@pytest.mark.requested_sglang_kv_tokens(N)`).
- Maps to `--gpu-memory-utilization` in vLLM and `--mem-fraction-static` in sglang.
- For TensorRT-LLM, maps to `kv_cache_config.free_gpu_memory_fraction` via
`--override-engine-args`.
- Launch scripts use `build_gpu_mem_args` to compute the default fraction;
the override bypasses the estimator and splits the raw value between workers.
- Scripts that use `--kv-cache-memory-bytes` (vLLM) bypass the fraction-based KV
cache sizing, making the profiler's fraction override ineffective for KV cache.
Those scripts should warn when `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` is set.
**Scheduler** (`pytest_parallel_gpu.py`): reads the markers at runtime and
sets the env var per-test. See `tests/README.md` for details.
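The search-then-pad idea can be sketched as follows. This is not the code in `profile_pytest.py`: `passes_with_cap` is a hypothetical stand-in for "run the test with this KV cap and report success", stubbed here with an assumed threshold of 300, and the cap units are arbitrary.

```bash
# Hypothetical stub: in a real profiler this would launch the test with the
# given KV cap and return 0 on success. Assumed threshold: 300.
passes_with_cap() {
    (( $1 >= 300 ))
}

# Smallest cap in [lo, hi] for which passes_with_cap succeeds.
# Assumes the predicate is monotonic (larger caps never fail) and hi passes.
min_passing_cap() {
    local lo=$1 hi=$2 mid
    while (( lo < hi )); do
        mid=$(( (lo + hi) / 2 ))
        if passes_with_cap "$mid"; then
            hi=$mid              # mid works, try smaller
        else
            lo=$(( mid + 1 ))    # mid fails, need more
        fi
    done
    echo "$lo"
}

min_cap=$(min_passing_cap 1 4096)
echo "marker value: $(( min_cap * 2 ))"   # 2x safety factor; prints 600
```

Doubling the minimum trades some packing density for tolerance to run-to-run variation in memory use.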
......@@ -2,470 +2,62 @@
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Shared GPU utility functions for launch scripts.
# Shared GPU utility functions for launch scripts (source, don't execute).
#
# CLI:
# ./gpu_utils.sh <engine> --model <name> [options...] Print GPU fraction
# ./gpu_utils.sh --self-test Run self-test suite
#
# Source:
# Usage:
# source "$(dirname "$(readlink -f "$0")")/../common/gpu_utils.sh"
# # or with SCRIPT_DIR already set:
# source "$SCRIPT_DIR/../common/gpu_utils.sh"
#
# Functions (all return via stdout — no hidden globals):
# build_gpu_mem_args <engine> <model> ... Prints fraction (or empty)
# get_model_params <model> Prints "pb wb layers kvh hd"
# estimate_worker_vram <model> ... Prints "w_gib kv_gib oh_gib total_gib"
# gpu_worker_fraction <engine> <total> <kv> Prints engine-appropriate fraction
# gpu_peak_to_engine_fraction <engine> <peak> Prints fraction (subtracts engine overhead)
# gpu_gb_to_total_fraction <gib> Prints fraction of TOTAL VRAM (vLLM/sglang)
# gpu_gb_to_free_fraction <gib> Prints fraction of FREE VRAM (TensorRT-LLM)
# build_gpu_mem_args <engine> [options...]
#
# Prints the computed memory fraction to stdout (empty line if none).
# Callers capture with: GPU_MEM_FRACTION=$(build_gpu_mem_args ...)
# Functions (all return via stdout):
# build_gpu_mem_args <engine> [--workers-per-gpu N]
# Returns engine-specific CLI args for GPU memory control based on
# environment variable overrides. Empty if no overrides.
#
# Priority:
# 1. _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE (profiler binary search)
# 2. Engine flag passed to this function (user already chose a value)
# 3. estimate_worker_vram + gpu_worker_fraction (model architecture)
# 4. Empty (let engine use its own default)
#
# Options (each flag accepts engine-specific aliases):
# --model NAME Model name (required).
# aliases: --model-path (sglang, trtllm)
# --max-model-len N Max tokens per sequence (default: 4096).
# aliases: --context-length (sglang)
# --max-seq-len (trtllm)
# --max-num-seqs N Concurrent sequences to budget for (default: 2).
# aliases: --max-running-requests (sglang)
# --max-batch-size (trtllm)
# --gpu-memory-utilization F User override (vllm flag name). Skipped when empty.
# --mem-fraction-static F User override (sglang flag name).
# --workers-per-gpu N Divide the fraction by N (for shared-GPU disagg).
# vLLM: _PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES → --kv-cache-memory-bytes N --gpu-memory-utilization 0.01
# SGLang: _PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS → --max-total-tokens N
#
# Usage:
# # Simple single-worker (agg.sh)
# GPU_MEM_FRACTION=$(build_gpu_mem_args vllm \
# --model "$MODEL" --max-model-len "$MAX_MODEL_LEN" --max-num-seqs "$MAX_CONCURRENT_SEQS")
# python -m dynamo.vllm --model "$MODEL" \
# ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
#
# # Two workers sharing one GPU (disagg_same_gpu.sh)
# GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL" --workers-per-gpu 2)
# python -m dynamo.vllm ... --gpu-memory-utilization "${GPU_MEM_FRACTION}" &
# GPU_MEM_ARGS=$(build_gpu_mem_args sglang)
# python -m dynamo.sglang --model-path "$MODEL" $GPU_MEM_ARGS &
#
# # sglang
# GPU_MEM_FRACTION=$(build_gpu_mem_args sglang --model "$MODEL" --workers-per-gpu 2)
# python -m dynamo.sglang ... --mem-fraction-static "${GPU_MEM_FRACTION}" &
#
# # trtllm (fraction goes into JSON, not CLI)
# GPU_MEM_FRACTION=$(build_gpu_mem_args trtllm --model "$MODEL" --workers-per-gpu 2)
# OVERRIDE_ARGS=(--override-engine-args "{\"kv_cache_config\":{\"free_gpu_memory_fraction\":${GPU_MEM_FRACTION}}}")
# GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
# python -m dynamo.vllm --model "$MODEL" $GPU_MEM_ARGS &
build_gpu_mem_args() {
local engine="${1:?usage: build_gpu_mem_args <engine> --model <name> [options...]}"
local engine="${1:?usage: build_gpu_mem_args <engine> [--workers-per-gpu N]}"
shift
local model=""
local max_model_len="4096"
local max_seqs="2"
local workers_per_gpu=1
local user_frac=""
while [[ $# -gt 0 ]]; do
case "$1" in
--model|--model-path)
model="$2"; shift 2 ;;
--max-model-len|--context-length|--max-seq-len)
max_model_len="$2"; shift 2 ;;
--max-num-seqs|--max-running-requests|--max-batch-size)
max_seqs="$2"; shift 2 ;;
--gpu-memory-utilization|--mem-fraction-static)
user_frac="$2"; shift 2 ;;
--workers-per-gpu) workers_per_gpu="$2"; shift 2 ;;
--workers-per-gpu) workers_per_gpu="$2"; shift 2 ;;
*) echo "build_gpu_mem_args: unknown option '$1'" >&2; return 1 ;;
esac
done
if [[ -z "$model" ]]; then
echo "build_gpu_mem_args: --model is required" >&2
return 1
fi
local frac=""
local from_estimator=false
local est_w="" est_kv="" est_oh="" est_total=""
if [[ -n "${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}" ]]; then
frac="$_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"
elif [[ -n "$user_frac" ]]; then
frac="$user_frac"
elif read -r est_w est_kv est_oh est_total <<< "$(estimate_worker_vram "$model" "$max_model_len" "$max_seqs" "$engine" 2>/dev/null)" && [[ -n "$est_total" ]]; then
frac=$(gpu_worker_fraction "$engine" "$est_total" "$est_kv")
from_estimator=true
fi
# --workers-per-gpu divides profiler/user/estimator results only
if [[ -n "$frac" && "$workers_per_gpu" -gt 1 ]]; then
frac=$(awk -v f="$frac" -v n="$workers_per_gpu" 'BEGIN { printf "%.2f", f / n }')
fi
echo "$frac"
}
# get_model_params <model_name>
#
# Prints "params_b weight_bytes layers kv_heads head_dim" to stdout.
# Returns 1 (prints nothing) if the model is unknown.
#
# Fields:
# params_b Total parameters in billions (all experts for MoE)
# weight_bytes Bytes per weight element (2=BF16/FP16, 1=FP8)
# layers Number of transformer layers
# kv_heads Number of key-value heads (GQA groups)
# head_dim Dimension per attention head
#
# KV cache is assumed BF16 (2 bytes per element) regardless of weight dtype,
# since FP8 KV cache (--kv-cache-dtype fp8) is opt-in and not the default.
#
# To add a model:
# 1. Find config.json at https://huggingface.co/<model>/raw/main/config.json
# For VL/multimodal models, architecture params are under text_config.
# 2. Map fields:
# layers ← num_hidden_layers
# kv_heads ← num_key_value_heads
# head_dim ← head_dim (or hidden_size / num_attention_heads)
# 3. params_b: total parameter count in billions. Derive from:
# - safetensors file size: size_bytes / weight_bytes / 1e9
# (single file: ls -l model.safetensors; sharded: metadata.total_size
# in model.safetensors.index.json)
# - or the model card / paper
# For MoE: params_b is the TOTAL count (all experts loaded into VRAM).
# 4. weight_bytes: 2 for BF16/FP16, 1 for FP8/INT8.
#
# Usage:
# read -r pb wb layers kvh hd <<< "$(get_model_params "Qwen/Qwen3-0.6B")"
# echo "$layers layers, $kvh KV heads"
get_model_params() {
local model="${1:?usage: get_model_params <model_name>}"
local pb wb layers kvh hd
case "$model" in
# https://huggingface.co/Qwen/Qwen3-0.6B/raw/main/config.json
Qwen/Qwen3-0.6B)
pb=0.6; wb=2; layers=28; kvh=8; hd=128 ;;
# https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct/raw/main/config.json (text_config)
# params_b from model.safetensors.index.json metadata.total_size / 2 / 1e9
Qwen/Qwen2-VL-2B-Instruct)
pb=2.2; wb=2; layers=28; kvh=2; hd=128 ;;
# https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct/raw/main/config.json (text_config)
Qwen/Qwen2.5-VL-7B-Instruct)
pb=8.3; wb=2; layers=28; kvh=4; hd=128 ;;
# https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct/raw/main/config.json (text_config)
# params_b from model.safetensors size / 2 / 1e9
Qwen/Qwen3-VL-2B-Instruct)
pb=2.1; wb=2; layers=28; kvh=8; hd=128 ;;
# https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct/raw/main/config.json (text_config)
Qwen/Qwen3-VL-8B-Instruct)
pb=9.2; wb=2; layers=36; kvh=8; hd=128 ;;
# https://huggingface.co/Qwen/Qwen3-30B-A3B/raw/main/config.json
Qwen/Qwen3-30B-A3B|\
Qwen/Qwen3-30B-A3B-Instruct)
pb=30.5; wb=2; layers=48; kvh=4; hd=128 ;;
# Same architecture as Qwen3-30B-A3B but FP8 quantized (1 byte per weight)
Qwen/Qwen3-VL-30B-A3B-Instruct-FP8)
pb=30.5; wb=1; layers=48; kvh=4; hd=128 ;;
# https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/raw/main/config.json
meta-llama/Meta-Llama-3.1-8B-Instruct)
pb=8.0; wb=2; layers=32; kvh=8; hd=128 ;;
# https://huggingface.co/deepseek-ai/deepseek-llm-7b-base/raw/main/config.json
# MHA (not GQA): num_key_value_heads == num_attention_heads == 32
deepseek-ai/deepseek-llm-7b-base)
pb=6.9; wb=2; layers=30; kvh=32; hd=128 ;;
# https://huggingface.co/Qwen/Qwen3-Embedding-4B/raw/main/config.json
# params_b from model.safetensors.index.json metadata.total_size / 2 / 1e9
# head_dim = hidden_size(2560) / num_attention_heads(32) = 80
Qwen/Qwen3-Embedding-4B)
pb=4.0; wb=2; layers=36; kvh=8; hd=80 ;;
# https://huggingface.co/llava-hf/llava-1.5-7b-hf/raw/main/config.json (text_config)
# MHA: num_key_value_heads == num_attention_heads == 32
llava-hf/llava-1.5-7b-hf)
pb=7.1; wb=2; layers=32; kvh=32; hd=128 ;;
*)
echo "get_model_params: unknown model '$model'" >&2
echo "Add it to get_model_params() in gpu_utils.sh" >&2
return 1 ;;
esac
echo "$pb $wb $layers $kvh $hd"
}
# estimate_worker_vram <model> [max_model_len] [max_concurrent_seqs] [engine_or_overhead]
#
# Prints "weights_gib kv_gib overhead_gib total_gib" to stdout.
# Returns 1 (prints nothing) if the model is unknown to get_model_params.
#
# Formula:
# weights = params_b * 1e9 * weight_bytes
# kv = 2 * layers * kv_heads * head_dim * 2(BF16) * seq_len * seqs
# total = weights + kv + overhead
#
# Arguments:
# model HuggingFace model name (required)
# max_model_len Max tokens per sequence (default: 4096)
# max_concurrent_seqs Concurrent sequences to budget for (default: 2)
# engine_or_overhead Engine name OR explicit GiB value (default: 2.0)
#
# If the 4th argument is an engine name (vllm, sglang, trtllm), overhead is
# auto-computed from model parameters:
# overhead = base + scale * sqrt(params_b)
#
# Per-engine constants (calibrated from measurements on RTX 6000 Ada 48 GiB):
# vllm: base=1.2, scale=1.0 → 0.6B≈2.0, 8B≈4.0, 30B≈6.7
# sglang: base=1.5, scale=1.0 → 0.6B≈2.3, 8B≈4.3, 30B≈7.0
# trtllm: base=2.0, scale=1.2 → 0.6B≈2.9, 8B≈5.4, 30B≈8.6
#
# sglang overhead was re-calibrated via profile_pytest.py bisection on
# RTX 6000 Ada 48 GiB. Observed CUDA overhead (outside --mem-fraction-static):
# Qwen3-0.6B: ~1.8 GiB. Previous coefficients (2.5, 1.5) over-estimated by ~2x.
#
# If the 4th argument is a number, it's used directly (backward compatible).
# If omitted, defaults to 2.0 (backward compatible).
#
# See examples/common/gpu_utils.md for the full derivation.
#
# Usage:
# read -r w kv oh total <<< "$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 vllm)"
# echo "$total GiB (w=$w kv=$kv oh=$oh)"
estimate_worker_vram() {
local model="${1:?usage: estimate_worker_vram <model> [seq_len] [seqs] [engine_or_overhead]}"
local seqlen="${2:-4096}"
local seqs="${3:-2}"
local engine_or_overhead="${4:-2.0}"
local mp_out
mp_out=$(get_model_params "$model") || return 1
local pb wb layers kvh hd
read -r pb wb layers kvh hd <<< "$mp_out"
local overhead
case "$engine_or_overhead" in
vllm) overhead=$(awk -v p="$pb" 'BEGIN { printf "%.1f", 1.2 + 1.0 * sqrt(p) }') ;;
sglang) overhead=$(awk -v p="$pb" 'BEGIN { printf "%.1f", 1.5 + 1.0 * sqrt(p) }') ;;
trtllm) overhead=$(awk -v p="$pb" 'BEGIN { printf "%.1f", 2.0 + 1.2 * sqrt(p) }') ;;
*) overhead="$engine_or_overhead" ;;
esac
awk -v pb="$pb" -v wbytes="$wb" \
-v layers="$layers" -v heads="$kvh" -v dim="$hd" \
-v seqlen="$seqlen" -v seqs="$seqs" -v overhead="$overhead" \
'BEGIN {
gib = 1024 * 1024 * 1024
w = pb * 1e9 * wbytes / gib
kv = 2 * layers * heads * dim * 2 * seqlen * seqs / gib
printf "%.1f %.1f %.1f %.1f", w, kv, overhead, w + kv + overhead
}'
}
# gpu_worker_fraction <engine> <total_gib> <kv_gib> [gpu_index]
#
# Convert estimated GiB into the engine-appropriate GPU memory fraction.
#
# Engine semantics (see examples/common/gpu_utils.md):
# vllm/sglang — fraction of TOTAL VRAM (uses total_gib).
# trtllm — fraction of FREE VRAM after model load (uses kv_gib).
#
# Usage:
# gpu_worker_fraction vllm 4.0 0.9 # fraction of total
# gpu_worker_fraction trtllm 4.0 0.9 # fraction of free
# gpu_worker_fraction trtllm 4.0 0.9 1 # query GPU index 1
gpu_worker_fraction() {
local engine="${1:?usage: gpu_worker_fraction <engine> <total_gib> <kv_gib> [gpu_index]}"
local total_gib="${2:?usage: gpu_worker_fraction <engine> <total_gib> <kv_gib>}"
local kv_gib="${3:?usage: gpu_worker_fraction <engine> <total_gib> <kv_gib>}"
local gpu_idx="${4:-0}"
case "$engine" in
vllm|sglang)
gpu_gb_to_total_fraction "$total_gib" "$gpu_idx" ;;
trtllm)
gpu_gb_to_free_fraction "$kv_gib" "$gpu_idx" ;;
*)
echo "gpu_worker_fraction: unknown engine '$engine'" >&2
echo "Supported: vllm, sglang, trtllm" >&2
return 1 ;;
esac
}
# gpu_peak_to_engine_fraction <engine> <peak_gib> [gpu_index]
#
# Convert a measured/profiled GPU peak (total VRAM including CUDA context,
# activations, etc.) into the engine-specific memory fraction flag.
#
# Each engine's fraction controls only a SUBSET of GPU memory (e.g. vLLM's
# --gpu-memory-utilization covers weights + KV cache but not CUDA context).
# This function subtracts the engine-specific overhead so the fraction
# targets the right internal budget, keeping the real peak stable across
# re-profiles.
#
# Overhead constants (GiB outside the engine's budget):
# vllm 2.0 CUDA ctx ~0.6 + activations/sampler ~0.5 + PyTorch alloc ~0.5
# sglang 2.0 (assumed same as vllm; refine when profiled)
# trtllm 0.0 free-fraction is measured after model load, no subtraction needed
#
# Usage:
# gpu_peak_to_engine_fraction vllm 8.6 # on 48 GiB → 0.14
# gpu_peak_to_engine_fraction vllm 20.9 # on 48 GiB → 0.40
# gpu_peak_to_engine_fraction vllm 8.6 1 # query GPU index 1
gpu_peak_to_engine_fraction() {
local engine=${1:?usage: gpu_peak_to_engine_fraction <engine> <peak_gib> [gpu_index]}
local peak_gib=${2:?usage: gpu_peak_to_engine_fraction <engine> <peak_gib> [gpu_index]}
local gpu_idx=${3:-0}
local overhead
case "$engine" in
vllm|sglang) overhead=2.0 ;;
trtllm) overhead=0.0 ;;
*)
echo "gpu_peak_to_engine_fraction: unknown engine '$engine'" >&2
echo "Supported: vllm, sglang, trtllm" >&2
return 1 ;;
esac
local budget
budget=$(awk -v g="$peak_gib" -v oh="$overhead" \
'BEGIN { b = g - oh; if (b < 1) b = 1; printf "%.1f", b }')
case "$engine" in
vllm|sglang) gpu_gb_to_total_fraction "$budget" "$gpu_idx" ;;
trtllm) gpu_gb_to_free_fraction "$budget" "$gpu_idx" ;;
esac
}
# gpu_gb_to_total_fraction <gib> [gpu_index]
#
# For vLLM / sglang: --gpu-memory-utilization is a fraction of TOTAL GPU memory.
# The engine budgets model weights + KV cache + activations within that limit.
#
# Prints the fraction of total GPU VRAM that <gib> GiB represents.
# Useful for converting portable absolute memory requirements to
# engine-specific fraction parameters (--gpu-memory-utilization, etc).
#
# Examples:
# gpu_gb_to_total_fraction 4 # on 48 GiB GPU → 0.09
# gpu_gb_to_total_fraction 16 # on 48 GiB GPU → 0.34
# gpu_gb_to_total_fraction 4 1 # query GPU index 1 instead of 0
#
# The result is ceil-rounded to 2 decimal places with a minimum of 0.05
# and a maximum of 0.95.
gpu_gb_to_total_fraction() {
local gib=${1:?usage: gpu_gb_to_total_fraction <gib> [gpu_index]}
local gpu_idx=${2:-0}
local total_mib
total_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits -i "$gpu_idx" 2>/dev/null)
if [[ -z "$total_mib" || "$total_mib" -eq 0 ]]; then
echo "gpu_gb_to_total_fraction: failed to query GPU $gpu_idx total memory" >&2
return 1
# --- SGLang: token-based KV cache cap ---
if [[ "$engine" == "sglang" && -n "${_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS:-}" ]]; then
echo "--max-total-tokens ${_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS}"
return 0
fi
local total_gib
total_gib=$(awk -v t="$total_mib" 'BEGIN { printf "%.1f", t / 1024 }')
if awk -v gib="$gib" -v total="$total_mib" 'BEGIN { exit (gib * 1024 > total) ? 0 : 1 }'; then
echo "" >&2
echo "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!" >&2
echo "WARNING: Requested ${gib} GiB but GPU $gpu_idx only has ${total_gib} GiB total." >&2
echo "The model likely won't fit. Consider a GPU with more VRAM" >&2
echo "or reduce the model size (quantization, smaller model, etc)." >&2
echo "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!" >&2
echo "" >&2
# --- vLLM: byte-based KV cache cap ---
# --gpu-memory-utilization 0.01 prevents vLLM's startup check from rejecting
# the launch when co-resident tests use >10% of VRAM (vLLM checks free memory
# against the fraction *before* applying the byte cap).
if [[ "$engine" == "vllm" && -n "${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}" ]]; then
local kv_bytes="$_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES"
if [[ "$workers_per_gpu" -gt 1 ]]; then
kv_bytes=$(awk -v b="$kv_bytes" -v n="$workers_per_gpu" 'BEGIN { printf "%d", b / n }')
fi
echo "--kv-cache-memory-bytes $kv_bytes --gpu-memory-utilization 0.01"
return 0
fi
# fraction = gib * 1024 / total_mib, ceil to 2 decimals, clamp [0.05, 0.95]
awk -v gib="$gib" -v total="$total_mib" 'BEGIN {
frac = (gib * 1024) / total
# ceil to 2 decimal places
frac = int(frac * 100 + 0.99) / 100
if (frac < 0.05) frac = 0.05
if (frac > 0.95) frac = 0.95
printf "%.2f\n", frac
}'
# No override — engine uses its default allocation
echo ""
}
# gpu_gb_to_free_fraction <gib> [gpu_index]
#
# For TensorRT-LLM: --free-gpu-memory-fraction (CLI) and
# kv_cache_config.free_gpu_memory_fraction (YAML) are fractions of FREE
# memory AFTER model weights are loaded — NOT fractions of total VRAM.
# The engine loads model weights first, queries remaining free memory,
# then allocates fraction * free_after_model for the KV cache.
#
# Why gpu_gb_to_total_fraction won't work for TensorRT-LLM:
# gpu_gb_to_total_fraction(10) on a 48 GiB GPU → 0.21 (fraction of total).
# Passing 0.21 as free_gpu_memory_fraction after a 5 GiB model loads
# would allocate 0.21 * 43 GiB ≈ 9 GiB — close but not exact.
# For larger models the error grows: a 30 GiB model leaves 18 GiB free,
# so 0.21 * 18 ≈ 3.8 GiB — far less than the 10 GiB intended.
#
# This function queries CURRENT free memory from nvidia-smi and computes
# gib / free_mib. The result is a best-effort estimate: TensorRT-LLM will
# see less free memory than we measure here (model weights haven't loaded
# yet), so the actual KV cache allocation will be smaller than <gib>.
# For rough sizing this is fine; for precise control use the YAML config
# with a known model size.
#
# For disagg_same_gpu (two workers sharing one GPU), launch workers
# sequentially: start the first, wait for it to finish loading (poll
# nvidia-smi or logs), then query free memory again and compute the
# fraction for the second worker. This gives predictable per-worker
# KV cache sizes on any GPU.
#
# Override at launch via CLI or env var:
# --override-engine-args '{"kv_cache_config":{"free_gpu_memory_fraction": 0.15}}'
# DYN_TRTLLM_OVERRIDE_ENGINE_ARGS='{"kv_cache_config":{"free_gpu_memory_fraction": 0.15}}'
#
# GOTCHA: overriding any field inside kv_cache_config REPLACES the entire
# sub-dict from the YAML. You must re-include all fields you care about
# (e.g. enable_block_reuse, dtype) or they'll be lost.
#
# Examples:
# gpu_gb_to_free_fraction 10 # on 48 GiB GPU with 46 GiB free → 0.22
# gpu_gb_to_free_fraction 10 1 # query GPU index 1 instead of 0
#
# The result is ceil-rounded to 2 decimal places, clamped [0.01, 0.95].
# The floor is 0.01 (not 0.05 like gpu_gb_to_total_fraction) because this
# fraction only controls KV cache, so small values are valid.
gpu_gb_to_free_fraction() {
local gib=${1:?usage: gpu_gb_to_free_fraction <gib> [gpu_index]}
local gpu_idx=${2:-0}
local free_mib
free_mib=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits -i "$gpu_idx" 2>/dev/null)
if [[ -z "$free_mib" || "$free_mib" -eq 0 ]]; then
echo "gpu_gb_to_free_fraction: failed to query GPU $gpu_idx free memory" >&2
return 1
fi
local free_gib
free_gib=$(awk -v f="$free_mib" 'BEGIN { printf "%.1f", f / 1024 }')
if awk -v gib="$gib" -v free="$free_mib" 'BEGIN { exit (gib * 1024 > free) ? 0 : 1 }'; then
echo "" >&2
echo "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!" >&2
echo "WARNING: Requested ${gib} GiB KV cache but GPU $gpu_idx only has ${free_gib} GiB free." >&2
echo "After model loading, even less will be available." >&2
echo "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!" >&2
echo "" >&2
fi
# fraction = gib * 1024 / free_mib, ceil to 2 decimals, clamp [0.01, 0.95]
awk -v gib="$gib" -v free="$free_mib" 'BEGIN {
frac = (gib * 1024) / free
frac = int(frac * 100 + 0.99) / 100
if (frac < 0.01) frac = 0.01
if (frac > 0.95) frac = 0.95
printf "%.2f\n", frac
}'
}
# ---------------------------------------------------------------------------
# Self-test: bash gpu_utils.sh --self-test
......@@ -483,125 +75,51 @@ _gpu_utils_self_test() {
fi
}
echo "=== get_model_params ==="
local result
local out
out=$(get_model_params "Qwen/Qwen3-0.6B")
_assert "known model returns 5 fields" "0.6 2 28 8 128" "$out"
out=$(get_model_params "nope/unknown" 2>/dev/null)
_assert "unknown model returns empty" "" "$out"
get_model_params "nope/unknown" >/dev/null 2>&1
_assert "unknown model exits 1" "1" "$?"
echo "=== vLLM: kv bytes override ==="
result=$(_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES=942054000 \
build_gpu_mem_args vllm)
_assert "kv bytes" "--kv-cache-memory-bytes 942054000 --gpu-memory-utilization 0.01" "$result"
echo ""
echo "=== estimate_worker_vram ==="
out=$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 vllm)
_assert "returns 4 space-separated fields" "4" "$(echo "$out" | wc -w | tr -d ' ')"
local w kv oh total
read -r w kv oh total <<< "$out"
_assert "weights > 0" "yes" "$(awk -v v="$w" 'BEGIN { print (v > 0) ? "yes" : "no" }')"
_assert "total > weights" "yes" "$(awk -v t="$total" -v w="$w" 'BEGIN { print (t > w) ? "yes" : "no" }')"
out=$(estimate_worker_vram "nope/unknown" 2>/dev/null)
_assert "unknown model returns empty" "" "$out"
local out_vllm out_sglang
out_vllm=$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 vllm)
out_sglang=$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 sglang)
_assert "sglang overhead > vllm overhead" "yes" \
"$(awk -v v="$out_vllm" -v s="$out_sglang" 'BEGIN {
split(v, a); split(s, b); print (b[3]+0 > a[3]+0) ? "yes" : "no"
}')"
echo "=== vLLM: kv bytes with --workers-per-gpu 2 ==="
result=$(_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES=942054000 \
build_gpu_mem_args vllm --workers-per-gpu 2)
_assert "kv bytes / 2" "--kv-cache-memory-bytes 471027000 --gpu-memory-utilization 0.01" "$result"
echo ""
echo "=== build_gpu_mem_args: estimator path (known model) ==="
local frac
frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2)
_assert "FRACTION non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
echo "=== vLLM: no override = empty ==="
result=$(build_gpu_mem_args vllm)
_assert "empty (engine default)" "" "$result"
echo ""
echo "=== build_gpu_mem_args: unknown model, no default ==="
frac=$(build_gpu_mem_args vllm --model "nope/unknown")
_assert "FRACTION empty" "" "$frac"
echo "=== vLLM: sglang token env ignored ==="
result=$(_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS=23824 \
build_gpu_mem_args vllm)
_assert "vllm ignores token cap" "" "$result"
echo ""
echo "=== build_gpu_mem_args: profiler wins over all ==="
frac=$(_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.55 \
build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --gpu-memory-utilization 0.70)
_assert "FRACTION = profiler (beats user flag)" "0.55" "$frac"
echo "=== sglang: token cap env ==="
result=$(_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS=1024 \
build_gpu_mem_args sglang)
_assert "token cap" "--max-total-tokens 1024" "$result"
echo ""
echo "=== build_gpu_mem_args: user flag wins over estimator ==="
frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --gpu-memory-utilization 0.70)
_assert "FRACTION = user flag" "0.70" "$frac"
echo "=== sglang: no override = empty ==="
result=$(build_gpu_mem_args sglang)
_assert "empty (engine default)" "" "$result"
echo ""
echo "=== build_gpu_mem_args: empty user flag falls through ==="
frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2 --gpu-memory-utilization "")
_assert "FRACTION = estimator" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
echo "=== sglang: vllm kv bytes env ignored ==="
result=$(_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES=942054000 \
build_gpu_mem_args sglang)
_assert "sglang ignores kv bytes" "" "$result"
echo ""
echo "=== build_gpu_mem_args: --workers-per-gpu divides estimator ==="
local undivided
undivided=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2)
frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2 --workers-per-gpu 2)
local expected_half
expected_half=$(awk -v f="$undivided" 'BEGIN { printf "%.2f", f / 2 }')
_assert "FRACTION halved" "$expected_half" "$frac"
echo ""
echo "=== build_gpu_mem_args: --workers-per-gpu divides profiler ==="
frac=$(_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.80 \
build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --workers-per-gpu 2)
_assert "FRACTION = 0.80/2 = 0.40" "0.40" "$frac"
echo ""
echo "=== build_gpu_mem_args: sglang engine (sglang flag names) ==="
frac=$(build_gpu_mem_args sglang --model-path "Qwen/Qwen3-0.6B" --context-length 4096 --max-running-requests 2)
_assert "sglang FRACTION non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
echo ""
echo "=== build_gpu_mem_args: trtllm engine (trtllm flag names) ==="
frac=$(build_gpu_mem_args trtllm --model-path "Qwen/Qwen3-0.6B" --max-seq-len 4096 --max-batch-size 2)
_assert "trtllm FRACTION non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
echo ""
echo "=== build_gpu_mem_args: --mem-fraction-static user flag (sglang) ==="
frac=$(build_gpu_mem_args sglang --model-path "Qwen/Qwen3-0.6B" --mem-fraction-static 0.60)
_assert "FRACTION = user flag" "0.60" "$frac"
echo ""
echo "=== build_gpu_mem_args: missing --model ==="
build_gpu_mem_args vllm 2>/dev/null
_assert "missing --model exits 1" "1" "$?"
echo ""
echo "=== gpu_worker_fraction: explicit args ==="
local frac
frac=$(gpu_worker_fraction vllm 4.0 0.9)
_assert "vllm returns non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
frac=$(gpu_worker_fraction trtllm 4.0 0.9)
_assert "trtllm returns non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
gpu_worker_fraction badengine 4.0 0.9 >/dev/null 2>&1
_assert "bad engine exits 1" "1" "$?"
echo "=== missing engine ==="
(build_gpu_mem_args 2>/dev/null)
_assert "missing engine exits non-zero" "1" "$?"
echo ""
echo "=========================================="
......@@ -610,46 +128,8 @@ _gpu_utils_self_test() {
[[ "$fail" -eq 0 ]]
}
# CLI mode: only when executed directly (not sourced by another script)
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
if [[ "${1:-}" == "--self-test" ]]; then
_gpu_utils_self_test
exit $?
fi
if [[ $# -gt 0 ]]; then
build_gpu_mem_args "$@"
exit $?
fi
cat <<'HELP'
gpu_utils.sh — GPU memory fraction estimator
Usage:
./gpu_utils.sh <engine> --model <name> [options...]
./gpu_utils.sh --self-test
Engines: vllm, sglang, trtllm
Examples:
./gpu_utils.sh vllm --model Qwen/Qwen3-0.6B
./gpu_utils.sh vllm --model Qwen/Qwen3-0.6B --max-model-len 4096 --max-num-seqs 2
./gpu_utils.sh vllm --model Qwen/Qwen3-0.6B --workers-per-gpu 2
./gpu_utils.sh sglang --model Qwen/Qwen3-0.6B --context-length 8192
./gpu_utils.sh trtllm --model meta-llama/Meta-Llama-3.1-8B-Instruct --max-seq-len 4096
Options:
--model NAME Model name (required)
aliases: --model-path
--max-model-len N Max sequence length (default: 4096)
aliases: --context-length, --max-seq-len
--max-num-seqs N Concurrent sequences (default: 2)
aliases: --max-running-requests, --max-batch-size
--gpu-memory-utilization F Override fraction (vllm flag)
aliases: --mem-fraction-static
--workers-per-gpu N Divide fraction by N (shared-GPU disagg)
--self-test Run built-in test suite
Output: prints the fraction to stdout (empty if model is unknown).
HELP
exit 0
# Self-test: run `bash gpu_utils.sh --self-test` (only when executed directly, not when sourced)
if [[ "${BASH_SOURCE[0]}" == "$0" && "${1:-}" == "--self-test" ]]; then
_gpu_utils_self_test
exit $?
fi
......@@ -137,9 +137,9 @@ print_launch_banner() {
echo "Frontend: http://localhost:$_port"
local _seq_len="${MAX_MODEL_LEN:-${CONTEXT_LENGTH:-${MAX_SEQ_LEN:-}}}"
local _frac="${GPU_MEM_FRACTION:-}"
local _mem_args="${GPU_MEM_ARGS:-}"
[[ -n "$_seq_len" ]] && echo "Max seq len: $_seq_len"
[[ -n "$_frac" ]] && echo "GPU frac: $_frac"
[[ -n "$_mem_args" ]] && echo "GPU mem: $_mem_args"
for _line in "$@"; do
echo "$_line"
......
......@@ -93,10 +93,10 @@ python -m dynamo.frontend --http-port 8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill $GPU_MEM_ARGS &
# Wait for all background processes to complete
wait
......@@ -93,11 +93,11 @@ python -m dynamo.frontend --http-port 8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg $GPU_MEM_ARGS &
DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg $GPU_MEM_ARGS &
# Wait for all background processes to complete
wait
......@@ -19,10 +19,10 @@ python -m dynamo.frontend --http-port=8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE &
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill $GPU_MEM_ARGS &
# Wait for all background processes to complete
wait
......@@ -20,11 +20,11 @@ python -m dynamo.frontend --http-port=8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE &
DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg $GPU_MEM_ARGS &
DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg $GPU_MEM_ARGS &
# Wait for all background processes to complete
wait
......@@ -234,7 +234,10 @@ markers = [
"gpu_8: marks tests to run on 8GPUs",
"xpu_1: marks tests to run on XPU",
"xpu_2: marks tests to run on 2XPUs",
"max_vram_gib(N): peak VRAM in GiB (with 10% safety). Filter with --max-vram-gib=N",
# These 3 (profiled_vram_gib and requested_*) are used for parallel pytest executions:
"profiled_vram_gib(N): actual peak VRAM observed by nvidia-smi during profiling. Used for --max-vram-gib filtering and scheduler budget tracking",
"requested_vllm_kv_cache_bytes(N): exact KV cache bytes for vLLM (skips memory profiling). Sets _PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES. Most deterministic method for parallel execution",
"requested_sglang_kv_tokens(N): max KV cache tokens for SGLang parallel execution. Sets _PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS to cap --max-total-tokens and prevent over-allocation",
"e2e: marks tests as end-to-end tests",
"integration: marks tests as integration tests",
"unit: marks tests as unit tests",
......
......@@ -114,43 +114,96 @@ Markers are required for all tests. They are used for test selection in CI and l
| Lifecycle [required] | pre_merge, post_merge, nightly, weekly, release | When the test should run |
| Test Type [required] | unit, integration, e2e, benchmark, performance, stress, multimodal | Nature of the test |
| Hardware [required] | gpu_0, gpu_1, gpu_2, gpu_4, gpu_8, h100 | Number/type of GPUs required |
| VRAM Requirement | max_vram_gib(N) | Peak VRAM in GiB (with 10% safety). The pytest invocation can use `--max-vram-gib=N` to select only tests that fit on the available GPU. Does not prevent running on smaller GPUs (that will OOM). Use `profile_pytest.py` to measure. |
| VRAM (profiled) | profiled_vram_gib(N) | Actual peak VRAM observed by nvidia-smi during profiling (includes CUDA overhead). Used for `--max-vram-gib=N` filtering and GPU-parallel scheduler budget tracking. |
| vLLM KV cache bytes | requested_vllm_kv_cache_bytes(N) | (vLLM only) Exact KV cache bytes. Sets `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` → `--kv-cache-memory-bytes`. Deterministic, parallel-safe. |
| SGLang KV tokens | requested_sglang_kv_tokens(N) | (SGLang only) Max KV cache tokens. Sets `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` → `--max-total-tokens`. Deterministic, parallel-safe. |
| Component/Framework | vllm, trtllm, sglang, kvbm, kvbm_concurrency, planner, router | Backend or component specificity |
| Infrastructure | k8s, deploy, fault_tolerance | Infrastructure/environment needs |
| Execution | parallel | Test can run in parallel with pytest-xdist. Must use dynamic port allocation (`alloc_ports`) and not share resources (e.g. filesystem) |
| Other | slow, skip, xfail, custom_build, model, aiconfigurator | Special handling |
### Example
### Example (vLLM)
```python
@pytest.mark.pre_merge
@pytest.mark.integration
@pytest.mark.gpu_1
@pytest.mark.max_vram_gib(21) # peak 18.5 GiB GPU RAM used (+10% safety: 20.4 GiB)
@pytest.mark.profiled_vram_gib(20.5) # actual nvidia-smi peak
@pytest.mark.requested_vllm_kv_cache_bytes(942_054_000) # KV cache cap (2x safety over min=471_027_000)
@pytest.mark.vllm
def test_kv_cache_behavior():
...
```
### Filtering by VRAM
### Example (SGLang with token cap)
```python
@pytest.mark.pre_merge
@pytest.mark.e2e
@pytest.mark.gpu_1
@pytest.mark.profiled_vram_gib(3.7) # actual nvidia-smi peak at recommended token count
@pytest.mark.requested_sglang_kv_tokens(96) # KV cache cap (2x safety over min=48)
@pytest.mark.timeout(265)
@pytest.mark.sglang
def test_sglang_aggregated():
...
```
The `max_vram_gib(N)` marker records how much GPU memory a test needs. The pytest invocation can use `--max-vram-gib=N` as a **selector** to run only tests that fit on the available GPU. Tests that exceed the budget are skipped at collection time (before any test starts). Tests without a `max_vram_gib` marker always run (no constraint assumed).
### VRAM Markers and Filtering
This is for the following use cases:
- **MIG partitioned GPUs:** when running tests in parallel on MIG slices (e.g., 2x 40 GiB partitions on an 80 GiB GPU), each slice has limited VRAM.
- **Smaller CI GPUs:** some CI jobs use L4 GPUs with only 24 GiB of VRAM.
Markers differ by engine:
Nothing prevents you from running without this flag — but if a test needs more VRAM than is physically available, it will OOM at runtime (e.g., vLLM raises `ValueError: No available memory for the cache blocks`).
**vLLM** uses byte-based KV cache control:
- **`profiled_vram_gib(N)`**: actual peak from nvidia-smi. Used for `--max-vram-gib` filtering and scheduler budget.
- **`requested_vllm_kv_cache_bytes(N)`**: exact KV cache bytes. Sets `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` → `--kv-cache-memory-bytes`. Deterministic and parallel-safe.
```bash
# Preview which gpu_1 vllm tests fit on a 16 GiB MIG partition (no tests are executed)
python3 -m pytest --max-vram-gib=16 --dry-run -m "gpu_1 and vllm" tests/serve/test_vllm.py
**SGLang** uses token-based control:
- **`profiled_vram_gib(N)`**: actual peak from nvidia-smi at the recommended token count. Used for `--max-vram-gib` filtering and scheduler budget.
- **`requested_sglang_kv_tokens(N)`**: max KV cache tokens. Sets `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` → `--max-total-tokens`. SGLang's default `--mem-fraction-static` is never overridden; the token cap is the sole allocation control. Deterministic and parallel-safe (see `examples/common/gpu_utils.md`).
`--max-vram-gib=N` deselects tests whose `profiled_vram_gib` exceeds N. Tests without a VRAM marker are also deselected (unknown VRAM = unsafe for parallel). To add a test to the pool, profile it with `tests/utils/profile_pytest.py` (see [GPU VRAM Profiler](#gpu-vram-profiler-profile_pytestpy)).
### GPU-Parallel Execution
GPU tests run concurrently via a custom VRAM-aware scheduler (`tests/utils/pytest_parallel_gpu.py`). This is separate from `pytest-xdist` because:
1. **VRAM budget**: xdist has no GPU memory awareness — two 20 GiB tests on a 48 GiB GPU will OOM.
2. **Profiling race**: engines snapshot free memory during init, so concurrent startups skew each other's measurements. The scheduler staggers launches (VRAM stability check) and retries transient failures.
3. **Engine-specific allocation**: each test gets a constrained allocation so it uses only its budgeted share. xdist has no mechanism for this.
- **vLLM**: `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES = N` → `--kv-cache-memory-bytes` (from `requested_vllm_kv_cache_bytes` marker). Byte-based cap is deterministic and doesn't depend on current free memory, making it inherently parallel-safe.
- **SGLang**: `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS = N` → `--max-total-tokens` (from `requested_sglang_kv_tokens` marker). Token-based cap is deterministic and doesn't depend on current free memory, making it inherently parallel-safe.
# Same, but for 24 GiB L4 CI GPUs
```bash
# Dry-run: preview which tests fit and the GPU plan
python3 -m pytest --max-vram-gib=24 --dry-run -m "gpu_1 and vllm" tests/serve/test_vllm.py
# GPU tests that have no max_vram_gib marker yet — need profiling
# TODO: profile these tests and add max_vram_gib markers
python3 -m pytest --dry-run -m "(gpu_1 or gpu_2 or gpu_4 or gpu_8) and not max_vram_gib" tests/serve/test_vllm.py
# Run pre-merge vllm tests in parallel
python3 -m pytest --max-vram-gib=6 -n auto -m "gpu_1 and vllm and not nightly and not post_merge" tests/serve/test_vllm.py
# Run all (pre+post merge) with live output
python3 -m pytest --max-vram-gib=48 -n auto -sv -m "gpu_1 and vllm and not nightly" tests/serve/test_vllm.py tests/frontend/test_vllm.py
# SGLang tests
python3 -m pytest --max-vram-gib=48 -n auto -m "gpu_1 and sglang" tests/serve/test_sglang.py
# Tests that still need profiling
python3 -m pytest --dry-run -m "(gpu_1 or gpu_2) and not profiled_vram_gib" tests/serve/
```
Example output (6 SGLang tests, RTX 6000 Ada 48 GiB):
```
GPU parallel: 6 tests, 7 concurrent slots, GPU0 (48 GiB, 43 GiB multi-proc budget)
[w0] tests/serve/test_sglang.py::...completions_only-2] profiled= 14.9 GiB req_kv_tokens= 1024 timeout=420s
[w1] tests/serve/test_sglang.py::...multimodal_agg_qwen-2] profiled= 20.2 GiB req_kv_tokens= 512 timeout=280s
[w2] tests/serve/test_sglang.py::...aggregated-2] profiled= 6.0 GiB req_kv_tokens= 1024 timeout=240s
...
[w0] tests/serve/...completions_only-2] (GPU0, profiled 14.9 GiB, req_kv_tokens= 1024) RUNNING
[w1] tests/serve/...multimodal_agg_qwen-2] (GPU0, profiled 20.2 GiB, req_kv_tokens= 512) RUNNING
[elapsed 10s] GPU0: 0.6/48 GiB [w0(10s), w1(5s)] [queued: w2, w3, w4, w5]
[w1] tests/serve/...multimodal_agg_qwen-2] PASSED [31s]
[w0] tests/serve/...completions_only-2] PASSED [76s]
...
=============== 6 passed in 111.00s (1:51) (vs 228s seq, 2.1x) ===============
```
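The `--max-vram-gib=N` selection rule from "VRAM Markers and Filtering" can be illustrated with a small sketch. This is not the plugin's actual code: `select_by_vram` and the short test names are hypothetical, and the only behavior assumed is what the text states (tests whose `profiled_vram_gib` exceeds N are deselected, and so are tests with no VRAM marker).

```python
from typing import Dict, List, Optional, Tuple


def select_by_vram(
    tests: Dict[str, Optional[float]], max_vram_gib: float
) -> Tuple[List[str], List[str]]:
    """Split tests into (selected, deselected) for a given VRAM cap.

    `tests` maps a test name to its profiled_vram_gib value, or None when
    the test carries no VRAM marker (hypothetical representation).
    """
    selected: List[str] = []
    deselected: List[str] = []
    for name, profiled in tests.items():
        # Unknown VRAM is treated as unsafe for parallel runs, so it is deselected.
        if profiled is not None and profiled <= max_vram_gib:
            selected.append(name)
        else:
            deselected.append(name)
    return selected, deselected


# Profiled values taken from the example output above; "unprofiled" is made up.
tests = {
    "completions_only": 14.9,
    "multimodal_agg_qwen": 20.2,
    "aggregated": 6.0,
    "unprofiled": None,
}
selected, deselected = select_by_vram(tests, max_vram_gib=16)
print(selected)    # ['completions_only', 'aggregated']
print(deselected)  # ['multimodal_agg_qwen', 'unprofiled']
```

With a 16 GiB cap, the 20.2 GiB test and the unmarked test drop out before anything is launched.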
### Lifecycle Marker Note
......@@ -294,13 +347,20 @@ pytest -m "pre_merge and parallel and not (vllm or sglang or trtllm) and gpu_0"
pytest -m "pre_merge and not parallel and not (vllm or sglang or trtllm) and gpu_0" -v --tb=short
```
> **Parallel vs sequential:** CPU-only tests (`gpu_0`) marked `parallel` run with `pytest-xdist` (`-n auto` or `-n <workers>`, `--dist=loadscope`). Tests not marked `parallel`, and all GPU tests (`gpu_1`, `gpu_2`, etc.), run sequentially (no `-n` flag). See [`.github/actions/pytest/action.yml`](../.github/actions/pytest/action.yml).
> **Parallel vs sequential:** CPU-only tests (`gpu_0`) marked `parallel` run with `pytest-xdist` (`-n auto` or `-n <workers>`, `--dist=loadscope`). GPU tests (`gpu_1`, `gpu_2`, etc.) run sequentially by default, but can run in parallel with `--max-vram-gib=N -n auto` (uses a custom VRAM-aware scheduler, not xdist). See [`.github/actions/pytest/action.yml`](../.github/actions/pytest/action.yml).
**Full E2E suite** -- launches engines for every test configuration; slowest, requires GPU and a framework container (typically <30min depending on framework and model):
```bash
# Sequential (default)
pytest -m "vllm and e2e and gpu_1" -v --tb=short
pytest -m "sglang and e2e and gpu_1" -v --tb=short
pytest -m "trtllm and e2e and gpu_1" -v --tb=short
# GPU-parallel (VRAM-aware scheduling, ~2x faster on 48 GiB GPU)
# Only tests with profiled_vram_gib markers are selected; -n auto calculates
# concurrent slots from GPU VRAM / smallest test. See "GPU-Parallel Execution" above.
python3 -m pytest --max-vram-gib=48 -n auto -m "gpu_1 and sglang" tests/serve/test_sglang.py -v
python3 -m pytest --max-vram-gib=48 -n auto -m "gpu_1 and vllm" tests/serve/test_vllm.py -v
```
**Post-merge equivalent** -- CI runs `(pre_merge or post_merge)` after merge, which adds slower tests on top of the pre_merge set. **Running the full post-merge suite locally can take several hours per framework** (model downloads, GPU inference, multi-GPU coordination). For day-to-day development, before you submit to CI, use the `pre_merge` commands above for quicker feedback. See [`.github/workflows/post-merge-ci.yml`](../.github/workflows/post-merge-ci.yml) for exact markers:
......@@ -444,66 +504,83 @@ When writing or reviewing GPU tests, use `tests/utils/profile_pytest.py` to meas
### How it works
The profiler sets the `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` environment variable (a fraction from 0.0 to 1.0 of total GPU RAM) and runs the test at each probe point. It bisects between "passes" and "OOM/fails" to find the boundary. After the search, it samples `nvidia-smi` to report peak VRAM, phase analysis, and marker recommendations.
The profiler automatically detects the engine type and uses the appropriate binary search:
- **vLLM**: bisects `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` (bytes) → `--kv-cache-memory-bytes`. Finds the minimum KV cache bytes where the test passes, applies a 2x safety factor. Outputs `profiled_vram_gib` and `requested_vllm_kv_cache_bytes` markers.
- **SGLang**: bisects `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` (token count) → `--max-total-tokens`. Finds the minimum KV cache tokens where the test passes, applies a 2x safety factor, then runs a final probe at the safe token count to measure the actual VRAM. Outputs `profiled_vram_gib` and `requested_sglang_kv_tokens` markers.
**Requirement:** The test under profile **must** honor the `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` env var. For standalone tests that allocate CUDA memory directly, check `os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")` and cap your allocation accordingly — see `tests/utils/test_mock_gpu_alloc.py` for an example.
**Requirement (vLLM):** The launch script must honor `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES`. This is handled by `build_gpu_mem_args` in `gpu_utils.sh` (returns `--kv-cache-memory-bytes N`).
**Requirement (SGLang):** The launch script must honor `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS`. This is handled by `build_gpu_mem_args` in `gpu_utils.sh` (returns `--max-total-tokens N`).
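The search described above can be sketched as a plain bisection. This is a minimal illustration, not the code in `profile_pytest.py`: `find_min_passing`, `recommend`, and `fake_probe` are hypothetical names, the real profiler launches the test at each probe and may stop at a coarser granularity, and the sketch assumes the outcome is monotonic in the cap (fails below some threshold, passes at or above it).

```python
from typing import Callable


def find_min_passing(lo: int, hi: int, passes: Callable[[int], bool]) -> int:
    """Return the smallest value in [lo, hi] for which passes(value) is True.

    Assumes passes() is monotonic and that passes(hi) is True.
    """
    if not passes(hi):
        raise ValueError("test fails even at the upper bound")
    while lo < hi:
        mid = (lo + hi) // 2
        if passes(mid):
            hi = mid        # mid works, so the minimum is at or below mid
        else:
            lo = mid + 1    # mid fails, so the minimum is above mid
    return lo


def recommend(min_value: int, safety_factor: int = 2) -> int:
    """Apply the safety factor to the measured minimum."""
    return min_value * safety_factor


# Stand-in for "run the test with this KV cache cap"; a real probe would
# launch the test with the override env var set and report pass/fail.
def fake_probe(kv_bytes: int) -> bool:
    return kv_bytes >= 471_027_000


minimum = find_min_passing(1, 4_000_000_000, fake_probe)
print(minimum, recommend(minimum))  # 471027000 942054000
```

The numbers mirror the vLLM marker example earlier in this document, where the recommended cap of 942_054_000 bytes is twice the measured minimum of 471_027_000.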
### Engine-specific mapping
`_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` is a generic env var (float 0.0-1.0) that launch scripts translate to the engine-specific CLI flag:
Launch scripts call `build_gpu_mem_args` (from `examples/common/gpu_utils.sh`) which checks env var overrides and returns the appropriate CLI flags:
```bash
GPU_MEM_ARGS=$(build_gpu_mem_args sglang)
python -m dynamo.sglang --model-path "$MODEL" $GPU_MEM_ARGS &
```
Env vars control engine allocation during profiling and parallel test execution:
| Engine | CLI flag | Launch script support |
|---------|----------------------------------|-----------------------|
| vLLM | `--gpu-memory-utilization` | Implemented in `agg.sh`, `disagg.sh`, etc. via `build_gpu_mem_args` |
| SGLang | `--mem-fraction-static` | Implemented in `agg.sh`, `agg_embed.sh`, `disagg.sh`, `agg_router.sh`, `disagg_same_gpu.sh` via `build_gpu_mem_args`. Multimodal scripts (`multimodal_epd.sh`, `multimodal_disagg.sh`) split the override proportionally between workers. |
| TRT-LLM | `--free-gpu-memory-fraction` | Not yet implemented (has its own `DYN_TRTLLM_FREE_GPU_MEMORY_FRACTION`, TODO: unify) |
**`_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES`** (integer) — vLLM only:
**Note on sglang:** Unlike vLLM (where `--max-model-len` affects KV cache sizing), sglang's `--mem-fraction-static` is the sole knob for KV cache allocation. `--context-length` and `--max-running-requests` only affect request scheduling, not memory allocation. See `examples/common/gpu_utils.md` for details.
| Engine | Returned CLI flag | Notes |
|---------|----------------------------------|-------|
| vLLM | `--kv-cache-memory-bytes N` | Exact byte cap on KV cache; deterministic and parallel-safe |
If the profiler detects constant VRAM across all probes (meaning the env var is ignored), it prints a warning and skips marker recommendations.
**`_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS`** (integer) — SGLang only:
| Engine | Returned CLI flag | Notes |
|---------|----------------------------------|-------|
| SGLang | `--max-total-tokens N` | Token-based KV cache cap |
Both use absolute caps (bytes and tokens) — deterministic and independent of current free memory, which is critical for parallel test execution. See `examples/common/gpu_utils.md`.
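
A small arithmetic sketch of why absolute caps matter for parallel runs. The free-memory readings are hypothetical, and `budget_from_free_fraction` is an illustrative helper, not an engine API: it models any scheme that sizes its allocation as a fraction of whatever is free at launch.

```python
GIB = 1024**3


def budget_from_free_fraction(free_bytes: int, fraction: float) -> int:
    """Budget a process would get by claiming a fraction of currently free VRAM."""
    return int(free_bytes * fraction)


# Hypothetical free-memory readings seen by two tests that start at different times.
free_seen_by_first = 47 * GIB
free_seen_by_second = 30 * GIB  # the first test has already allocated memory

relative = [
    budget_from_free_fraction(free, 0.5)
    for free in (free_seen_by_first, free_seen_by_second)
]
# An absolute byte cap is the same no matter when the test starts.
absolute = [942_054_000, 942_054_000]

print("relative budgets differ:", relative[0] != relative[1])
print("absolute budgets equal:", absolute[0] == absolute[1])
```

With a relative budget the result depends on launch order; with an absolute cap the scheduler can add up requirements ahead of time.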
### Usage
```bash
# Default mode: binary search for minimum VRAM (recommended)
# -xvs is optional: stop on first failure, verbose, show output
# vLLM: binary search for minimum KV cache bytes
python tests/utils/profile_pytest.py tests/serve/test_vllm.py::test_serve_deployment[aggregated] -xvs
# Profile on a specific GPU (default: 0)
python tests/utils/profile_pytest.py --gpu 1 tests/serve/test_vllm.py::test_serve_deployment[aggregated] -xvs
# SGLang: binary search for minimum KV cache tokens (automatic)
python tests/utils/profile_pytest.py tests/serve/test_sglang.py::test_sglang_deployment[aggregated-2] -xvs
# Single-pass profiling (no binary search; measure one run with the engine's default memory settings)
python tests/utils/profile_pytest.py --no-find-min-vram tests/serve/test_vllm.py::test_serve_deployment[aggregated]
```
### Example output
### Example output (vLLM)
```bash
========================================================================
FIND MINIMUM VRAM (binary search)
FIND MINIMUM KV CACHE BYTES (vLLM, deterministic) (binary search)
========================================================================
GPU total : 48.0 GiB
GPU free : 48.0 GiB (in use: 0.0 GiB)
GPU free : 47.4 GiB (in use: 0.6 GiB)
Test : tests/serve/test_vllm.py::test_serve_deployment[aggregated] -x
Range : 5% - 95% (tolerance 5%)
Max iter: 6 (1 validation + 5 bisections)
[probe 1/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.95 (45.6 GiB) [validation run]
[PASS] peak 18.5 GiB, wall 41s, iter took 49s
[probe 1] Validation run: kv_cache=23296 MiB (50% of free)
[PASS] peak 2.9 GiB, wall 42s, iter took 49s
...
[probe 5/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.33 (15.9 GiB)
[FAIL] OOM or error at 33% (15.9 GiB), iter took 30s
[probe 6/15] kv_cache=449 MiB (471,027,000 bytes)
[PASS] peak 2.9 GiB, wall 41s, iter took 49s
[probe 6/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.36 (17.2 GiB) [~0 left, ETA ~0s]
[PASS] peak 18.5 GiB, wall 41s, iter took 49s
[probe 7/15] kv_cache=224 MiB (235,513,856 bytes)
[FAIL] OOM, iter took 30s
========================================================================
MINIMUM VRAM RESULT
========================================================================
Lowest passing utilization : 36%
Minimum VRAM needed : ~17.2 GiB (peak observed: 18.5 GiB, +10% safety: 20.4 GiB)
Minimum KV cache : 449 MiB (471,027,000 bytes)
Safe KV cache : 898 MiB (942,054,000 bytes) (2x safety)
Peak VRAM : 2.9 GiB
# test_serve_deployment[aggregated]: @pytest.mark.max_vram_gib(21)
# Fits on: L4 (24 GiB), V100-32GB (32 GiB), A6000/A40 (48 GiB), A100/H100 (80 GiB)
# Will OOM on: edge/embedded (4 GiB), RTX 3060/4060 (8 GiB), T4 (16 GiB)
Recommended markers:
@pytest.mark.profiled_vram_gib(2.9)
@pytest.mark.requested_vllm_kv_cache_bytes(942_054_000), # KV cache cap (2x safety over min=471_027_000)
========================================================================
========================================================================
......@@ -511,14 +588,41 @@ Recommended markers to add to your pytest. You can copy-paste this:
========================================================================
# Measured using: tests/utils/profile_pytest.py tests/serve/test_vllm.py::test_serve_deployment[aggregated]
@pytest.mark.e2e # wall time 41.2s, loads a real model
@pytest.mark.gpu_1 # 1 GPU(s) used, peak 18.5 GiB
@pytest.mark.max_vram_gib(21) # peak 18.5 GiB GPU RAM used (+10% safety: 20.4 GiB)
@pytest.mark.gpu_1 # 1 GPU(s) used, peak 2.9 GiB
@pytest.mark.profiled_vram_gib(2.9) # actual nvidia-smi peak
@pytest.mark.requested_vllm_kv_cache_bytes(942_054_000) # KV cache cap (2x safety over min=471_027_000)
@pytest.mark.timeout(124) # 3x observed 41.2s
WARNING: Wall time 41.2s is too slow for pre_merge (> 20s). Consider post_merge or nightly instead.
WARNING: Will OOM on edge/embedded (4 GiB).
WARNING: Will OOM on RTX 3060/4060 (8 GiB).
WARNING: Will OOM on T4 (16 GiB).
========================================================================
```
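
The marker values in the output above follow from simple arithmetic on the measured numbers. The sketch below reproduces them. The helper names are hypothetical, and rounding the timeout up with `ceil` is an assumption that happens to match the values shown (41.2 s gives 124).

```python
import math

MIB = 1024 * 1024


def safe_kv_bytes(min_bytes: int, safety: int = 2) -> int:
    """KV cache cap to request: the bisected minimum times the safety factor."""
    return min_bytes * safety


def timeout_seconds(wall_s: float, factor: int = 3) -> int:
    """Test timeout: a multiple of the observed wall time, rounded up."""
    return math.ceil(wall_s * factor)


min_bytes = 471_027_000  # lowest passing cap from the bisection
safe = safe_kv_bytes(min_bytes)

print(f"minimum: {min_bytes // MIB} MiB, safe: {safe // MIB} MiB ({safe:_} bytes)")
print(f"timeout: {timeout_seconds(41.2)} s")
```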
### Example output (SGLang — token-based bisection)
```bash
========================================================================
FIND MINIMUM KV TOKENS (SGLang) (binary search)
========================================================================
GPU total : 48.0 GiB
GPU free : 47.4 GiB (in use: 0.6 GiB)
Test : tests/serve/test_sglang.py::test_sglang_deployment[aggregated-2] -xvs
[probe 1] Validation run (no token cap)
[PASS] peak 43.0 GiB, wall 36s, max_total_tokens=366688, iter took 44s
...
[probe 14/15] tokens=48 [~1 left, ETA ~45s]
[PASS] tokens=48, peak 3.7 GiB, wall 26s, iter took 34s
[final probe] Measuring VRAM at safe_tokens=96
[PASS] tokens=96, peak 3.7 GiB, wall 27s
========================================================================
MINIMUM KV TOKENS RESULT
========================================================================
Minimum tokens : 48 (raw bisection result)
Recommended : 96 (2x safety)
Peak VRAM : 3.7 GiB (at 96 tokens)
@pytest.mark.profiled_vram_gib(3.7)
@pytest.mark.requested_sglang_kv_tokens(96), # KV cache cap (2x safety over min=48)
========================================================================
```
......@@ -526,7 +630,7 @@ Recommended markers to add to your pytest. You can copy-paste this:
1. **Copy the `@pytest.mark.*` lines** into your test function or `pytestmark` list.
2. **VRAM marker** — `max_vram_gib(N)` records the peak GPU memory the test needs (with 10% safety margin). This marker does **not** skip tests on its own — if a test runs on a GPU that is too small, it will OOM and fail hard. Use `--max-vram-gib=N` to select only tests that fit on the available GPU (see [Filtering by VRAM](#filtering-by-vram) for examples). The WARNING lines in the profiler output tell you which GPU tiers would be too small (e.g., "Will OOM on T4 (16 GiB)").
2. **VRAM markers** — `profiled_vram_gib(N)` records the actual nvidia-smi peak (for filtering/scheduling), `requested_vllm_kv_cache_bytes(N)` or `requested_sglang_kv_tokens(N)` controls the engine's KV cache allocation for deterministic parallel execution. Use `--max-vram-gib=N` to deselect tests whose profiled VRAM exceeds N (see [Filtering by VRAM](#filtering-by-vram)). The WARNING lines in the profiler output tell you which GPU tiers would be too small (e.g., "Will OOM on T4 (16 GiB)").
3. **Lifecycle markers** — the profiler recommends `pre_merge` only for tests under 20 seconds. For slower tests, it warns you to consider `post_merge` or `nightly` but does not choose for you — use your judgment based on how critical the test is for catching regressions early.
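
As a rough model of how `--max-vram-gib=N` uses the `profiled_vram_gib` marker, the sketch below partitions tests into kept and deselected sets. It mirrors the rule in this commit's `conftest.py` (tests without a marker are deselected too, since unknown VRAM is treated as unsafe), but `split_by_vram` and the test names are hypothetical.

```python
from typing import Optional


def split_by_vram(
    tests: dict[str, Optional[float]], limit_gib: float
) -> tuple[list[str], list[str]]:
    """Partition tests into (keep, deselected) for a given VRAM limit.

    `tests` maps a test name to its profiled_vram_gib value, or None when
    the test carries no marker.
    """
    keep: list[str] = []
    deselected: list[str] = []
    for name, vram in tests.items():
        if vram is not None and vram <= limit_gib:
            keep.append(name)
        else:
            deselected.append(name)
    return keep, deselected


tests = {
    "vllm_aggregated": 2.9,
    "sglang_aggregated": 3.7,
    "reasoning_effort": 20.4,
    "unprofiled_test": None,  # no marker, so it is deselected
}
keep, deselected = split_by_vram(tests, limit_gib=16)
print("run:", keep)
print("deselected:", deselected)
```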
......@@ -538,6 +642,7 @@ Recommended markers to add to your pytest. You can copy-paste this:
| Flag | Description |
|------|-------------|
| `--kv-bytes` | No-op (kept for backward compat). vLLM always bisects on `--kv-cache-memory-bytes` |
| `--no-find-min-vram` | Skip binary search; run a single profiling pass instead |
| `--interval N` | GPU sampling interval in seconds (default: 1.0) |
| `--baseline-seconds N` | Seconds to sample before launching pytest (default: 3.0) |
......
......@@ -25,6 +25,11 @@ from tests.utils.test_output import resolve_test_output_path
_logger = logging.getLogger(__name__)
# Typed stash keys for GPU-parallel config (avoids setting unknown attrs on Config)
_gpu_parallel_gpus_key: pytest.StashKey[list[dict]] = pytest.StashKey()
_gpu_indices_key: pytest.StashKey[list[int] | None] = pytest.StashKey()
_gpu_slots_key: pytest.StashKey[int | None] = pytest.StashKey()
def pytest_addoption(parser: pytest.Parser) -> None:
"""Add shared command-line options for all tests.
......@@ -59,7 +64,18 @@ def pytest_addoption(parser: pytest.Parser) -> None:
"--max-vram-gib",
type=float,
default=None,
help="Skip tests whose @pytest.mark.max_vram_gib(N) exceeds this value (GiB).",
help="Only run tests whose @pytest.mark.profiled_vram_gib(N) is at most this value (GiB). "
"Without -n: runs tests sequentially. "
"With -n N: runs N tests concurrently as subprocesses with VRAM-aware scheduling. "
"With -n auto: calculates max concurrent slots from GPU VRAM / max_vram_gib.",
)
parser.addoption(
"--gpus",
"--gpu",
type=str,
default="all",
help="Comma-separated GPU indices or 'all' (default: all). "
"Controls which GPUs the parallel test runner distributes tests across.",
)
parser.addoption(
"--dry-run",
......@@ -79,6 +95,130 @@ logging.basicConfig(
)
# ---------------------------------------------------------------------------
# GPU-serial and GPU-parallel: VRAM-aware test scheduling
#
# Activated only when both --max-vram-gib and -n (a number or auto) are passed, e.g.:
# pytest --max-vram-gib=48 -n auto -m "gpu_1 and sglang" tests/serve/
# ---------------------------------------------------------------------------
def pytest_configure(config: pytest.Config) -> None:
"""Detect GPUs for --max-vram-gib planning and parallel execution."""
vram_limit = config.getoption("max_vram_gib", default=None)
if vram_limit is None:
return
# Delayed import: vram_utils requires pynvml, and a module-level import would make
# conftest fail to load on CPU-only CI runners (e.g. ARM deploy tests) that lack nvidia-ml-py.
from tests.utils.pytest_parallel_gpu import _parse_gpu_indices
from tests.utils.vram_utils import auto_worker_count, detect_gpus
gpus = detect_gpus()
if gpus:
config.stash[_gpu_parallel_gpus_key] = gpus
# Parse --gpus into a list of indices (or None for all)
gpus_raw = config.getoption("gpus", default="all")
if gpus_raw and gpus_raw.strip().lower() != "all":
config.stash[_gpu_indices_key] = _parse_gpu_indices(gpus_raw, gpus)
selected_gpus = [
g for g in gpus if g["index"] in config.stash[_gpu_indices_key]
]
else:
config.stash[_gpu_indices_key] = None # all GPUs
selected_gpus = gpus
# If -n is set with --max-vram-gib, save the slot count and disable xdist
# so our subprocess orchestrator handles parallelism instead.
# xdist's pytest_configure(trylast=True) checks _is_distribution_mode()
# which reads dist/tx (not numprocesses), so we must also clear dist.
numproc = config.getoption("numprocesses", default=None)
if numproc is not None and numproc != 0:
if isinstance(numproc, str) or numproc == -1:
config.stash[_gpu_slots_key] = (
auto_worker_count(selected_gpus, vram_limit) if selected_gpus else 1
)
else:
config.stash[_gpu_slots_key] = int(numproc)
config.option.numprocesses = 0
config.option.dist = "no"
@pytest.hookimpl(tryfirst=True)
def pytest_runtestloop(session: pytest.Session) -> bool | None:
"""Intercept the test loop for GPU-parallel execution.
When --max-vram-gib and -n are both present, run tests as independent
subprocesses via the GPU orchestrator instead of the normal pytest loop.
Must run before the default pytest loop (tryfirst) so we can return True
to prevent the default sequential execution.
"""
config = session.config
num_slots = config.stash.get(_gpu_slots_key, None)
vram_limit = config.getoption("max_vram_gib", default=None)
if num_slots is None or vram_limit is None:
return None # serial execution: let normal pytest handle it
# Delay parallel-execution imports; see the pynvml note in pytest_configure for the reason.
from tests.utils.pytest_parallel_gpu import run_parallel
from tests.utils.vram_utils import load_test_meta
# Collect test IDs from the already-filtered session items
test_ids = [item.nodeid for item in session.items]
if not test_ids:
return True
meta = load_test_meta()
is_stream = config.getoption("capture", default="fd") == "no"
gpu_indices = config.stash.get(_gpu_indices_key, None)
# Forward original CLI args to child pytest subprocesses so they
# inherit options like -s, -v, --tb, --durations, --image, etc.
extra_args: list[str] = []
if is_stream:
extra_args.append("-s")
verbose = config.getoption("verbose", default=0)
if verbose >= 2:
extra_args.append("-vv")
elif verbose >= 1:
extra_args.append("-v")
tb_style = config.getoption("tbstyle", default="short")
if tb_style and tb_style != "short":
extra_args.append(f"--tb={tb_style}")
durations = config.getoption("durations", default=None)
if durations is not None:
extra_args.append(f"--durations={durations}")
durations_min = config.getoption("durations_min", default=None)
if durations_min is not None:
extra_args.append(f"--durations-min={durations_min}")
for opt_name, cli_flag in [
("image", "--image"),
("namespace", "--namespace"),
("framework", "--framework"),
("profile", "--profile"),
]:
val = config.getoption(opt_name, default=None)
if val is not None:
extra_args.extend([cli_flag, str(val)])
if config.getoption("skip_service_restart", default=None):
extra_args.append("--skip-service-restart")
rc = run_parallel(
test_ids=test_ids,
meta=meta,
max_vram_gib=vram_limit,
num_slots=num_slots,
gpu_indices=gpu_indices,
extra_pytest_args=extra_args or None,
stream=is_stream,
)
if rc != 0:
session.testsfailed = 1
return True # we handled the test loop
@pytest.fixture()
def set_ucx_tls_no_mm():
"""Set UCX env defaults for all tests."""
......@@ -205,8 +345,10 @@ def _enable_offline_with_mistral_patch():
except (ImportError, AttributeError):
return # transformers version without _patch_mistral_regex — nothing to do
# Write a sitecustomize.py so subprocesses also get the patch
patch_dir = os.path.join(tempfile.gettempdir(), "dynamo_test_hf_patch")
# Write a sitecustomize.py so subprocesses also get the patch.
# Use a per-worker dir under xdist to avoid write races.
worker_id = os.environ.get("PYTEST_XDIST_WORKER", "main")
patch_dir = os.path.join(tempfile.gettempdir(), f"dynamo_test_hf_patch_{worker_id}")
os.makedirs(patch_dir, exist_ok=True)
with open(os.path.join(patch_dir, "sitecustomize.py"), "w") as f:
f.write(
......@@ -239,26 +381,33 @@ def _enable_offline_with_mistral_patch():
def _disable_offline_with_mistral_patch():
"""Undo _enable_offline_with_mistral_patch."""
os.environ.pop("HF_HUB_OFFLINE", None)
patch_dir = os.path.join(tempfile.gettempdir(), "dynamo_test_hf_patch")
worker_id = os.environ.get("PYTEST_XDIST_WORKER", "main")
patch_dir = os.path.join(tempfile.gettempdir(), f"dynamo_test_hf_patch_{worker_id}")
pythonpath = os.environ.get("PYTHONPATH", "")
os.environ["PYTHONPATH"] = pythonpath.replace(f"{patch_dir}:", "").replace(
patch_dir, ""
)
_download_lock_path = os.path.join(tempfile.gettempdir(), "pytest_model_download.lock")
@pytest.fixture(scope="session")
def predownload_models(pytestconfig):
"""Fixture wrapper around download_models for models used in collected tests"""
# Get models from pytest config if available, otherwise fall back to TEST_MODELS
"""Fixture wrapper around download_models for models used in collected tests.
Uses a file lock so that under xdist, only one worker downloads at a time
and the rest reuse the HuggingFace cache.
"""
models = getattr(pytestconfig, "models_to_download", None)
if models:
logging.info(
f"Downloading {len(models)} models needed for collected tests\nModels: {models}"
)
download_models(model_list=list(models))
else:
# Fallback to original behavior if extraction failed
download_models()
with FileLock(_download_lock_path):
if models:
logging.info(
f"Downloading {len(models)} models needed for collected tests\nModels: {models}"
)
download_models(model_list=list(models))
else:
download_models()
_enable_offline_with_mistral_patch()
yield
......@@ -267,21 +416,20 @@ def predownload_models(pytestconfig):
@pytest.fixture(scope="session")
def predownload_tokenizers(pytestconfig):
"""Fixture wrapper around download_models for tokenizers used in collected tests"""
# Get models from pytest config if available, otherwise fall back to TEST_MODELS
"""Fixture wrapper around download_models for tokenizers used in collected tests.
Uses a file lock so that under xdist, only one worker downloads at a time.
"""
models = getattr(pytestconfig, "models_to_download", None)
if models:
logging.info(
f"Downloading tokenizers for {len(models)} models needed for collected tests\nModels: {models}"
)
download_models(model_list=list(models), ignore_weights=True)
else:
# Fallback to original behavior if extraction failed
download_models(ignore_weights=True)
with FileLock(_download_lock_path):
if models:
logging.info(
f"Downloading tokenizers for {len(models)} models needed for collected tests\nModels: {models}"
)
download_models(model_list=list(models), ignore_weights=True)
else:
download_models(ignore_weights=True)
# Skip redundant HuggingFace API calls in worker subprocesses since
# tokenizers are already cached. This avoids flaky timeouts from slow
# HF API responses (the RepoInfo fetch still happens even for cached models).
_enable_offline_with_mistral_patch()
yield
_disable_offline_with_mistral_patch()
......@@ -337,26 +485,41 @@ def pytest_collection_modifyitems(config, items):
if _item_has_marker(item, marker_name):
item.add_marker(skip)
# Skip tests that exceed --max-vram-gib
# Deselect tests based on --max-vram-gib:
# - Tests whose profiled VRAM exceeds the limit are removed
# - Tests WITHOUT a VRAM marker are also removed (unknown VRAM = unsafe)
# Using deselect (not skip) so they never reach the xdist scheduler.
vram_limit = config.getoption("--max-vram-gib", default=None)
if vram_limit is not None:
skip_vram = pytest.mark.skip(
reason=f"requires more than {vram_limit} GiB VRAM (--max-vram-gib={vram_limit})"
)
keep = []
deselected = []
for item in items:
vram_mark = item.get_closest_marker("max_vram_gib")
if vram_mark and vram_mark.args and vram_mark.args[0] > vram_limit:
item.add_marker(skip_vram)
vram_mark = item.get_closest_marker("profiled_vram_gib")
if vram_mark and vram_mark.args and vram_mark.args[0] <= vram_limit:
keep.append(item)
else:
deselected.append(item)
if deselected:
config.hook.pytest_deselected(items=deselected)
items[:] = keep
# Write test metadata for the GPU orchestrator to read.
if vram_limit is not None:
# Delayed: see vram_utils pynvml note in pytest_configure
from tests.utils.vram_utils import print_gpu_plan, write_test_meta
write_test_meta(items)
# --dry-run: print run/skip breakdown and exit without executing tests
# --dry-run: print run/skip breakdown and exit without executing tests.
# At this point, items only contains tests that passed --max-vram-gib
# filtering (deselected items were already removed above).
if config.getoption("--dry-run", default=False):
would_run = []
would_skip = []
unmarked = []
for item in items:
vram_mark = item.get_closest_marker("max_vram_gib")
vram_mark = item.get_closest_marker("profiled_vram_gib")
vram_val = vram_mark.args[0] if vram_mark and vram_mark.args else None
name = item.nodeid.split("::", 1)[1] if "::" in item.nodeid else item.nodeid
name = item.nodeid
skip_reasons = []
for marker in item.iter_markers("skip"):
......@@ -365,39 +528,28 @@ def pytest_collection_modifyitems(config, items):
reason = marker.args[0]
skip_reasons.append(reason or "no reason given")
vram_skipped = (
vram_limit is not None
and vram_val is not None
and vram_val > vram_limit
)
if vram_skipped:
skip_reasons.insert(0, f"{vram_val} GiB > {vram_limit} GiB VRAM limit")
if skip_reasons:
would_skip.append((name, vram_val, skip_reasons))
elif vram_val is not None:
would_run.append((name, vram_val))
else:
unmarked.append(name)
would_run.append((name, vram_val))
print(f"\n{'=' * 60}")
print(
f"--max-vram-gib={vram_limit or 'not set'} | {len(items)} tests selected"
)
print(f"--max-vram-gib={vram_limit or 'not set'} | {len(items)} tests")
print(f"{'=' * 60}")
if would_run:
print(f"\nWould RUN ({len(would_run)}):")
for name, gib in would_run:
print(f" {name} ({gib} GiB)")
gib_str = f" ({gib} GiB)" if gib is not None else ""
print(f" {name}{gib_str}")
if would_skip:
print(f"\nWould SKIP ({len(would_skip)}):")
for name, vram_val, reasons in would_skip:
vram_str = f" ({vram_val} GiB)" if vram_val is not None else ""
print(f" {name}{vram_str} -- {'; '.join(reasons)}")
if unmarked:
print(f"\nNo VRAM marker — always run ({len(unmarked)}):")
for name in unmarked:
print(f" {name}")
gpus = config.stash.get(_gpu_parallel_gpus_key, None)
if gpus and vram_limit is not None:
print_gpu_plan(gpus, vram_limit, would_run)
print()
items.clear()
return
......
......@@ -99,9 +99,16 @@ class VllmWorkerProcess(ManagedProcess):
"32768",
]
gpu_util = os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")
if gpu_util:
command.extend(["--gpu-memory-utilization", gpu_util])
kv_bytes = os.environ.get("_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES")
if kv_bytes:
command.extend(
[
"--kv-cache-memory-bytes",
kv_bytes,
"--gpu-memory-utilization",
"0.01",
]
)
env = os.environ.copy()
env["DYN_LOG"] = "debug"
......@@ -229,7 +236,8 @@ def _validate_chat_response(response: requests.Response) -> Dict[str, Any]:
# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning_effort
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.profiled_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety), carried over from max_vram_gib
# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
@pytest.mark.timeout(300) # 3x observed ~70s wall time, rounded up
@pytest.mark.post_merge
def test_reasoning_effort(
......@@ -297,7 +305,8 @@ def test_reasoning_effort(
# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.profiled_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety), carried over from max_vram_gib
# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
@pytest.mark.timeout(113) # 3x observed 37.4s wall time
@pytest.mark.post_merge
def test_tool_calling(
......@@ -341,7 +350,8 @@ def test_tool_calling(
# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling_second_round
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.profiled_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety), carried over from max_vram_gib
# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
@pytest.mark.timeout(115) # 3x observed 38.1s wall time
@pytest.mark.nightly
def test_tool_calling_second_round(
......@@ -407,7 +417,8 @@ def test_tool_calling_second_round(
# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.profiled_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety), carried over from max_vram_gib
# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
@pytest.mark.timeout(131) # 3x observed 43.4s wall time
@pytest.mark.nightly
def test_reasoning(request, start_services: ServicePorts, predownload_models) -> None:
......
......@@ -18,6 +18,7 @@ from tests.conftest import ServicePorts
from tests.utils.client import send_request
from tests.utils.constants import DefaultPort
from tests.utils.engine_process import EngineConfig, EngineProcess
from tests.utils.port_utils import allocate_port, deallocate_port
DEFAULT_TIMEOUT = 10
......@@ -93,6 +94,7 @@ def run_serve_deployment(
# Ensure EngineProcess health checks hit the correct frontend port.
config = dataclasses.replace(config, frontend_port=dynamic_frontend_port)
else:
# Backward compat: infer from config/extra_env if no explicit ports are passed.
dynamic_frontend_port = int(config.frontend_port)
......@@ -108,76 +110,86 @@ def run_serve_deployment(
int(merged_env.get("DYN_SYSTEM_PORT2") or DefaultPort.SYSTEM2.value),
]
with EngineProcess.from_script(
config, request, extra_env=merged_env
) as server_process:
for _payload in config.request_payloads:
logger.info("TESTING: Payload: %s", _payload.__class__.__name__)
# Make a per-iteration copy so tests can safely override ports/fields
# without mutating shared config instances across parametrized cases.
payload = deepcopy(_payload)
# inject model
if hasattr(payload, "with_model"):
payload = payload.with_model(config.model)
# Default behavior: requests go to the frontend port, except metrics which target
# worker system ports (mapped from DefaultPort -> per-test ports).
if getattr(payload, "endpoint", "") == "/metrics":
if payload.port == DefaultPort.SYSTEM1.value:
if len(dynamic_system_ports) < 1:
raise RuntimeError(
"Payload targets SYSTEM_PORT1 but no system ports were provided "
f"(payload={payload.__class__.__name__})"
)
payload.port = dynamic_system_ports[0]
elif payload.port == DefaultPort.SYSTEM2.value:
if len(dynamic_system_ports) < 2:
raise RuntimeError(
"Payload targets SYSTEM_PORT2 but only 1 system port was provided "
f"(payload={payload.__class__.__name__})"
)
payload.port = dynamic_system_ports[1]
else:
payload.port = dynamic_frontend_port
# Optional extra system ports for specialized payloads (e.g. LoRA control-plane APIs).
# BasePayload always defines `system_ports` (usually empty); map defaults
# (SYSTEM_PORT1/2) to per-test system ports when present.
if payload.system_ports:
mapped_system_ports: list[int] = []
for p in payload.system_ports:
if p == DefaultPort.SYSTEM1.value:
# Disagg scripts need a unique bootstrap port so parallel runs don't collide.
disagg_bootstrap_port: int | None = None
if config.script_name and "disagg" in config.script_name:
disagg_bootstrap_port = allocate_port(12000)
merged_env["DYN_DISAGG_BOOTSTRAP_PORT"] = str(disagg_bootstrap_port)
try:
with EngineProcess.from_script(
config, request, extra_env=merged_env
) as server_process:
for _payload in config.request_payloads:
logger.info("TESTING: Payload: %s", _payload.__class__.__name__)
# Make a per-iteration copy so tests can safely override ports/fields
# without mutating shared config instances across parametrized cases.
payload = deepcopy(_payload)
# inject model
if hasattr(payload, "with_model"):
payload = payload.with_model(config.model)
# Default behavior: requests go to the frontend port, except metrics which target
# worker system ports (mapped from DefaultPort -> per-test ports).
if getattr(payload, "endpoint", "") == "/metrics":
if payload.port == DefaultPort.SYSTEM1.value:
if len(dynamic_system_ports) < 1:
raise RuntimeError(
"Payload.system_ports includes SYSTEM_PORT1 but no system ports were provided "
"Payload targets SYSTEM_PORT1 but no system ports were provided "
f"(payload={payload.__class__.__name__})"
)
mapped_system_ports.append(dynamic_system_ports[0])
elif p == DefaultPort.SYSTEM2.value:
payload.port = dynamic_system_ports[0]
elif payload.port == DefaultPort.SYSTEM2.value:
if len(dynamic_system_ports) < 2:
raise RuntimeError(
"Payload.system_ports includes SYSTEM_PORT2 but only 1 system port was provided "
"Payload targets SYSTEM_PORT2 but only 1 system port was provided "
f"(payload={payload.__class__.__name__})"
)
mapped_system_ports.append(dynamic_system_ports[1])
else:
mapped_system_ports.append(p)
payload.system_ports = mapped_system_ports
for _ in range(payload.repeat_count):
response = send_request(
url=payload.url(),
payload=payload.body,
timeout=payload.timeout,
method=payload.method,
stream=payload.http_stream,
)
server_process.check_response(payload, response)
# Call final_validation if the payload has one (e.g., CachedTokensChatPayload)
if hasattr(payload, "final_validation"):
payload.final_validation()
payload.port = dynamic_system_ports[1]
else:
payload.port = dynamic_frontend_port
# Optional extra system ports for specialized payloads (e.g. LoRA control-plane APIs).
# BasePayload always defines `system_ports` (usually empty); map defaults
# (SYSTEM_PORT1/2) to per-test system ports when present.
if payload.system_ports:
mapped_system_ports: list[int] = []
for p in payload.system_ports:
if p == DefaultPort.SYSTEM1.value:
if len(dynamic_system_ports) < 1:
raise RuntimeError(
"Payload.system_ports includes SYSTEM_PORT1 but no system ports were provided "
f"(payload={payload.__class__.__name__})"
)
mapped_system_ports.append(dynamic_system_ports[0])
elif p == DefaultPort.SYSTEM2.value:
if len(dynamic_system_ports) < 2:
raise RuntimeError(
"Payload.system_ports includes SYSTEM_PORT2 but only 1 system port was provided "
f"(payload={payload.__class__.__name__})"
)
mapped_system_ports.append(dynamic_system_ports[1])
else:
mapped_system_ports.append(p)
payload.system_ports = mapped_system_ports
for _ in range(payload.repeat_count):
response = send_request(
url=payload.url(),
payload=payload.body,
timeout=payload.timeout,
method=payload.method,
stream=payload.http_stream,
)
server_process.check_response(payload, response)
# Call final_validation if the payload has one (e.g., CachedTokensChatPayload)
if hasattr(payload, "final_validation"):
payload.final_validation()
finally:
if disagg_bootstrap_port is not None:
deallocate_port(disagg_bootstrap_port)
def params_with_model_mark(configs: Mapping[str, EngineConfig]):
......
......@@ -12,7 +12,11 @@ trap 'echo "Cleaning up..."; kill 0' EXIT
MODEL="${MODEL:-Qwen/Qwen3-0.6B}"
GPU_MEM_FRACTION="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}"
KV_BYTES="${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"
GPU_MEM_ARGS=""
if [[ -n "$KV_BYTES" ]]; then
GPU_MEM_ARGS="--kv-cache-memory-bytes $KV_BYTES --gpu-memory-utilization 0.01"
fi
echo "Starting Dynamo frontend..."
python3 -m dynamo.frontend &
......@@ -25,7 +29,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
--node-rank 0 \
--master-addr 127.0.0.1 \
--enforce-eager \
${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
$GPU_MEM_ARGS &
echo "Starting dynamo.vllm headless worker (TP=2, nnodes=2, node-rank=1, GPU 1)..."
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
......@@ -35,7 +39,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
--node-rank 1 \
--master-addr 127.0.0.1 \
--enforce-eager \
${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} \
$GPU_MEM_ARGS \
--headless &
wait
......@@ -45,9 +45,9 @@ sglang_dir = os.environ.get("SGLANG_DIR") or os.path.join(
# SGLang test configurations
# NOTE: pytest.mark.gpu_1 tests take ~167s (2m 47s) total to run sequentially (with models pre-cached)
# TODO: Now that these tests use dynamic ports and each config has a max_vram_gib marker,
# TODO: Now that these tests use dynamic ports and each config has a profiled_vram_gib marker,
# optimize the runtime by bin-packing multiple engine deployments in parallel on the same GPU.
# A future collector/launcher can sum max_vram_gib values to decide how many tests fit
# A future collector/launcher can sum profiled_vram_gib values to decide how many tests fit
# concurrently without exceeding available VRAM.
sglang_configs = {
"aggregated": SGLangConfig(
......@@ -58,8 +58,13 @@ sglang_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(6.1), # observed peak 5.6 GiB (+10% safety)
pytest.mark.timeout(240), # profiled 34.4s on A6000
pytest.mark.profiled_vram_gib(
3.7
), # actual peak at recommended token count
pytest.mark.requested_sglang_kv_tokens(
96
), # KV cache cap (2x safety over min=48)
pytest.mark.timeout(195), # profiled 33s on RTX 6000 Ada
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-0.6B",
......@@ -160,7 +165,8 @@ sglang_configs = {
script_name="template_verifier.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.timeout(240), # profiled 11.7s on A6000 (no GPU model load)
pytest.mark.profiled_vram_gib(0.0), # no GPU model load
pytest.mark.timeout(120), # profiled 12s on RTX 6000 Ada
pytest.mark.pre_merge,
pytest.mark.nightly,
],
......@@ -175,8 +181,8 @@ sglang_configs = {
),
# NOTE: Pack all workers on 1 GPU for lower CI resource requirements.
# NOTE: multimodal_epd.sh uses explicit --mem-fraction-static via DYN_ENCODE_GPU_MEM
# / DYN_WORKER_GPU_MEM env vars, so _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE has no effect.
# Regardless of fraction overrides, the workers combined consistently use ~23.6 GiB.
# / DYN_WORKER_GPU_MEM env vars. The profiler override distributes proportionally
# but workers combined consistently use ~23.6 GiB regardless of fraction overrides.
"multimodal_e_pd_qwen": SGLangConfig(
# E/P/D architecture: Encode, Prefill, Decode workers all on GPU 0
name="multimodal_e_pd_qwen",
......@@ -184,16 +190,15 @@ sglang_configs = {
script_name="multimodal_epd.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(13.3), # observed peak 12.1 GiB (+10% safety)
pytest.mark.timeout(360), # profiled 31.0s on A6000
# No profiled_vram_gib: uses hard-coded --mem-fraction-static via
# DYN_ENCODE_GPU_MEM / DYN_WORKER_GPU_MEM, so VRAM scales with GPU size.
pytest.mark.timeout(210), # profiled 35s on RTX 6000 Ada
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-VL-2B-Instruct",
script_args=["--model", "Qwen/Qwen3-VL-2B-Instruct", "--single-gpu"],
timeout=360,
env={
"DYN_ENCODE_WORKER_GPU": "0",
"DYN_WORKER_GPU": "0",
"DYN_ENCODE_GPU_MEM": "0.1",
"DYN_WORKER_GPU_MEM": "0.4",
},
......@@ -226,8 +231,11 @@ sglang_configs = {
script_name="multimodal_disagg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(17.7), # observed peak 16.1 GiB (+10% safety)
pytest.mark.timeout(360), # profiled 36.0s on A6000
pytest.mark.profiled_vram_gib(16.1), # actual profiled peak
pytest.mark.requested_sglang_kv_tokens(
1024
), # KV cache cap (2x safety over min=512)
pytest.mark.timeout(222), # profiled 37s on RTX 6000 Ada
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-VL-2B-Instruct",
......@@ -261,8 +269,13 @@ sglang_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(21.0), # observed peak 19.1 GiB (+10% safety)
pytest.mark.timeout(300), # profiled 41.3s on A6000
pytest.mark.profiled_vram_gib(
19.1
), # actual peak at recommended token count
pytest.mark.requested_sglang_kv_tokens(
768
), # KV cache cap (2x safety over min=384)
pytest.mark.timeout(182), # profiled 30s on RTX 6000 Ada
pytest.mark.pre_merge,
pytest.mark.nightly,
],
......@@ -300,8 +313,13 @@ sglang_configs = {
script_name="agg_embed.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(12.1), # observed peak 11.0 GiB (+10% safety)
pytest.mark.timeout(270), # profiled 25.5s on A6000
pytest.mark.profiled_vram_gib(
9.8
), # actual peak at recommended token count
pytest.mark.requested_sglang_kv_tokens(
128
), # KV cache cap (2x safety over min=64)
pytest.mark.timeout(147), # profiled 24s on RTX 6000 Ada
pytest.mark.pre_merge,
pytest.mark.nightly,
],
......@@ -338,8 +356,13 @@ sglang_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(16.2), # observed peak 14.8 GiB (+10% safety)
pytest.mark.timeout(420), # profiled 73s on A6000
pytest.mark.profiled_vram_gib(
14.7
), # actual peak at recommended token count
pytest.mark.requested_sglang_kv_tokens(
64
), # KV cache cap (2x safety over min=32)
pytest.mark.timeout(341), # profiled 57s on RTX 6000 Ada
pytest.mark.post_merge,
],
model="deepseek-ai/deepseek-llm-7b-base",
......@@ -362,7 +385,7 @@ sglang_configs = {
pytest.mark.post_merge,
pytest.mark.timeout(240),
pytest.mark.skip(reason="DYN-2261"),
# TODO: profile to get max_vram (currently skipped)
# TODO: profile once DYN-2261 is fixed (uses agg.sh, profiler works)
],
model="Qwen/Qwen3-0.6B",
env={"DYN_ENABLE_ANTHROPIC_API": "1"},
......
......@@ -54,9 +54,9 @@ vllm_dir = os.environ.get("VLLM_DIR") or os.path.join(
# vLLM test configurations
# NOTE: pytest.mark.gpu_1 tests take ~5.5 minutes total to run sequentially (with models pre-cached)
# TODO: Now that these tests use dynamic ports and each config has a max_vram_gib marker,
# TODO: Now that these tests use dynamic ports and each config has VRAM markers,
# optimize the runtime by bin-packing multiple engine deployments in parallel on the same GPU.
# A future collector/launcher can sum max_vram_gib values to decide how many tests fit
# A future collector/launcher can sum profiled_vram_gib values to decide how many tests fit
# concurrently without exceeding available VRAM.
vllm_configs = {
"aggregated": VLLMConfig(
......@@ -65,8 +65,13 @@ vllm_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.6), # observed peak 7.8 GiB (+10% safety)
pytest.mark.timeout(300), # ~7x observed 42.2s; old value before profiling
pytest.mark.profiled_vram_gib(3.8), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_119_388_000
), # KV cache cap (2x safety over min=559_693_824)
pytest.mark.timeout(
360
), # ~8.5x observed 42.2s; bumped for GPU-parallel headroom
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-0.6B",
......@@ -93,7 +98,10 @@ vllm_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.6), # observed peak 7.8 GiB (+10% safety)
pytest.mark.profiled_vram_gib(3.8), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_119_388_000
), # KV cache cap (2x safety over min=559_693_824)
pytest.mark.timeout(120), # ~5x observed 24.3s; CI machines are slower
pytest.mark.post_merge,
],
......@@ -122,7 +130,10 @@ vllm_configs = {
marks=[
pytest.mark.lmcache,
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.4 GiB (+10% safety)
pytest.mark.profiled_vram_gib(3.8), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_119_388_000
), # KV cache cap (2x safety over min=559_693_824)
pytest.mark.timeout(360), # ~7x observed 49.0s; old value before profiling
pytest.mark.pre_merge,
pytest.mark.skipif(
......@@ -145,7 +156,10 @@ vllm_configs = {
marks=[
pytest.mark.lmcache,
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.4 GiB (+10% safety)
pytest.mark.profiled_vram_gib(3.8), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_119_388_000
), # KV cache cap (2x safety over min=559_693_824)
pytest.mark.timeout(360), # ~7x observed 49.3s; old value before profiling
pytest.mark.pre_merge,
pytest.mark.skipif(
......@@ -170,8 +184,13 @@ vllm_configs = {
script_name="agg_request_planes.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.3 GiB (+10% safety)
pytest.mark.timeout(300), # ~7x observed 43.0s; old value before profiling
pytest.mark.profiled_vram_gib(3.8), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_119_388_000
), # KV cache cap (2x safety over min=559_693_824)
pytest.mark.timeout(
360
), # ~8x observed 43.0s; bumped for GPU-parallel headroom
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-0.6B",
......@@ -187,8 +206,13 @@ vllm_configs = {
script_name="agg_request_planes.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.3 GiB (+10% safety)
pytest.mark.timeout(300), # ~7x observed 42.3s; old value before profiling
pytest.mark.profiled_vram_gib(3.8), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_119_388_000
), # KV cache cap (2x safety over min=559_693_824)
pytest.mark.timeout(
360
), # ~8.5x observed 42.3s; bumped for GPU-parallel headroom
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-0.6B",
......@@ -299,13 +323,17 @@ vllm_configs = {
],
),
# NOTE: Pack all workers on 1 GPU for lower CI resource requirements
# NOTE: disagg_multimodal_e_pd.sh uses explicit --gpu-memory-utilization via
# DYN_ENCODE_GPU_MEM / DYN_PD_GPU_MEM env vars in single-GPU mode.
# PD worker honors build_gpu_mem_args for parallel execution.
"multimodal_e_pd_qwen": VLLMConfig(
name="multimodal_e_pd_qwen",
directory=vllm_dir,
script_name="disagg_multimodal_e_pd.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(24.6), # observed peak 22.3 GiB (+10% safety)
# No profiled_vram_gib / requested_vllm_kv_cache_bytes: single-GPU mode
# uses hardcoded fractions (encode=0.1, PD=0.7) that scale with GPU size.
pytest.mark.timeout(340), # ~5x observed 68.4s; 2B model loads slower on CI
pytest.mark.pre_merge,
],
......@@ -339,7 +367,10 @@ vllm_configs = {
# post_merge because needs real NIXL not stub
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(10.2), # observed peak 9.3 GiB (+10% safety)
pytest.mark.profiled_vram_gib(9.6), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_710_490_000
), # KV cache cap (2x safety over min=855_244_800)
pytest.mark.timeout(220), # ~5x observed 43.7s; 2B model loads slower on CI
pytest.mark.post_merge,
],
......@@ -373,21 +404,25 @@ vllm_configs = {
# NOTE: disagg_multimodal_epd.sh uses --kv-cache-memory-bytes=512MB for P/D
# workers. Per vLLM CacheConfig, kv_cache_memory_bytes (when not-None) ignores
# gpu_memory_utilization (ref: https://docs.vllm.ai/en/stable/api/vllm/config/cache/),
# so _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE has no effect. Regardless of GPU_MEM
# so KV cache overrides have no effect. Regardless of GPU_MEM
# fractions (0.1/0.4/0.4), the 3 workers combined consistently use ~17.6 GiB
# total on this GPU.
# NOTE: disagg_multimodal_epd.sh uses explicit --gpu-memory-utilization via
# DYN_ENCODE_GPU_MEM / DYN_PREFILL_GPU_MEM / DYN_DECODE_GPU_MEM env vars.
# P/D workers honor build_gpu_mem_args for parallel execution.
"multimodal_disagg_qwen": VLLMConfig(
name="multimodal_disagg_qwen",
directory=vllm_dir,
script_name="disagg_multimodal_epd.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(19.4), # observed peak 17.6 GiB (+10% safety)
# No profiled_vram_gib / requested_vllm_kv_cache_bytes: single-GPU mode
# uses hardcoded fractions via DYN_*_GPU_MEM that scale with GPU size.
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-VL-2B-Instruct",
script_args=["--model", "Qwen/Qwen3-VL-2B-Instruct", "--single-gpu"],
timeout=360,
timeout=300,
env={
"DYN_ENCODE_WORKER_GPU": "0",
"DYN_PREFILL_WORKER_GPU": "0",
......@@ -421,7 +456,10 @@ vllm_configs = {
script_name="agg_multimodal.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(21.6), # observed peak 19.6 GiB (+10% safety)
pytest.mark.profiled_vram_gib(19.9), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
922_354_000
), # KV cache cap (2x safety over min=461_176_832)
pytest.mark.timeout(
360
), # ~7x observed 50.0s; 7B model loads ~48s on CI (A10G/L4)
......@@ -455,7 +493,10 @@ vllm_configs = {
script_name="agg_multimodal.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(18.9), # observed peak 17.1 GiB (+10% safety)
pytest.mark.profiled_vram_gib(14.9), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
922_354_000
), # KV cache cap (2x safety over min=461_176_832)
pytest.mark.timeout(
300
), # ~7x observed 42.7s; 7B model loads ~48s on CI (A10G/L4)
......@@ -703,7 +744,10 @@ vllm_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(21.9), # observed peak 19.9 GiB (+10% safety)
pytest.mark.profiled_vram_gib(18.3), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
4_074_898_000
), # KV cache cap (2x safety over min=2_037_448_704)
pytest.mark.timeout(
420
), # 7B model loads ~48s on CI (A10G/L4) vs ~15s locally
......@@ -742,7 +786,10 @@ vllm_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.6), # observed peak 7.8 GiB (+10% safety)
pytest.mark.profiled_vram_gib(3.8), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_119_388_000
), # KV cache cap (2x safety over min=559_693_824)
pytest.mark.timeout(110), # ~5x observed 22.3s; CI machines are slower
pytest.mark.pre_merge,
],
......
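The TODO at the top of `vllm_configs` describes a collector that sums `profiled_vram_gib` values to decide which tests can share a GPU. A minimal sketch of that idea, assuming a first-fit-decreasing heuristic, is shown below. The function `pack_tests` and the test IDs are hypothetical, and the commit's actual scheduler may work differently. The VRAM figures are taken from the markers above.

```python
def pack_tests(vram_by_test, gpu_free_gib):
    """Group tests into batches whose summed VRAM fits on one GPU.

    First-fit decreasing: place the largest tests first, each into the
    first batch that still has room. Hypothetical helper, not the real runner.
    """
    batches = []  # each entry: [remaining_gib, [test names]]
    ordered = sorted(vram_by_test.items(), key=lambda kv: kv[1], reverse=True)
    for name, gib in ordered:
        if gib > gpu_free_gib:
            raise ValueError(f"{name} needs {gib} GiB, GPU has {gpu_free_gib} GiB")
        for batch in batches:
            if batch[0] >= gib:
                batch[0] -= gib
                batch[1].append(name)
                break
        else:
            # No existing batch has room, so start a new one.
            batches.append([gpu_free_gib - gib, [name]])
    return [names for _, names in batches]


# Illustrative IDs; values are profiled_vram_gib markers from the configs above.
profiled = {
    "agg_small": 3.8,
    "multimodal_2b": 9.6,
    "multimodal_a": 19.9,
    "multimodal_b": 14.9,
    "agg_7b": 18.3,
}
print(pack_tests(profiled, gpu_free_gib=48))
```

Tests without a `profiled_vram_gib` marker (the hard-coded-fraction configs) cannot be packed this way and would need to run alone.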
......@@ -14,17 +14,18 @@ in-process instrumentation. Using NVML directly (the same C library that
``nvidia-smi`` wraps) avoids the overhead of forking a subprocess each sample
and allows high-frequency sampling.
In **binary-search mode** (the default), the profiler sets the env var
``_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE`` to a value between 0.05 and 0.95 and
re-runs the test at each midpoint. If the test passes, the fraction is lowered;
if it OOMs, the fraction is raised — standard bisection to find the minimum
VRAM the test needs. The peak ``memory.used`` from the last passing run
(plus a 10 % safety margin) becomes the ``@pytest.mark.max_vram_gib`` recommendation.
**IMPORTANT**: The test under profile **MUST** honor ``_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE``
— either directly (see ``test_mock_gpu_alloc.py``) or via launch scripts that
pass it as ``--gpu-memory-utilization`` to vLLM (e.g. ``agg.sh``). If the test
ignores this variable, every probe will pass at the same peak and the profiler
In **binary-search mode** (the default), the profiler bisects the KV cache
allocation — ``_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES`` for vLLM (bytes) or
``_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS`` for SGLang (tokens).
If the test passes, the allocation is lowered; if it OOMs, it is raised —
standard bisection to find the minimum the test needs. That minimum is
multiplied by ``_KV_SAFETY_FACTOR`` to give the KV cache marker, and the peak
``memory.used`` measured at the padded allocation becomes the
``@pytest.mark.profiled_vram_gib`` recommendation.
**IMPORTANT**: The test under profile **MUST** read the appropriate KV cache
override — either directly (see ``test_mock_gpu_alloc.py``) or via launch
scripts that call ``build_gpu_mem_args`` (e.g. ``agg.sh``). If the test
ignores the override, every probe will pass at the same peak and the profiler
will warn that the binary search is unreliable.
Usage::
......@@ -51,6 +52,7 @@ import json
import logging
import math
import os
import re
import shutil
import subprocess
import sys
......@@ -68,6 +70,11 @@ logger = logging.getLogger(__name__)
# tier has headroom for variance across runs.
_VRAM_SAFETY_FACTOR = 1.1
# Safety margin for KV cache recommendations (both SGLang tokens and vLLM bytes).
# The minimum passing value is multiplied by this factor to provide headroom for
# prompt length variation, scheduling jitter, and multi-turn conversations.
_KV_SAFETY_FACTOR = 2.0
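As a worked example of how this factor produces the marker values, the sketch below applies the 2x factor and then rounds up: to the next multiple of 1000 bytes for vLLM, and to the next multiple of the page size (16) for SGLang. The helper names are hypothetical. The rounding mirrors what the profiler does when it prints its result.

```python
_KV_SAFETY_FACTOR = 2.0


def safe_vllm_kv_bytes(min_kv_bytes):
    """Pad the minimum passing byte count, then round up to a multiple of 1000."""
    safe = int(min_kv_bytes * _KV_SAFETY_FACTOR)
    return ((safe + 999) // 1000) * 1000


def safe_sglang_kv_tokens(min_tokens, page_size=16):
    """Pad the minimum passing token count, then round up to a whole page."""
    safe = int(min_tokens * _KV_SAFETY_FACTOR)
    return ((safe + page_size - 1) // page_size) * page_size


print(safe_vllm_kv_bytes(559_693_824))  # 1119388000
print(safe_sglang_kv_tokens(48))        # 96
```

These reproduce the `requested_vllm_kv_cache_bytes(1_119_388_000)` and `requested_sglang_kv_tokens(96)` markers in the test configs.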
# Phase detection: a memory jump exceeding this threshold (MiB) between
# consecutive samples marks a phase boundary.
_PHASE_JUMP_MIB = 200
......@@ -77,6 +84,11 @@ _PHASE_JUMP_MIB = 200
_PLATEAU_TOLERANCE_MIB = 50
_PLATEAU_MIN_SAMPLES = 3
# Early-stop threshold for binary search: if the last 3 probes have peak
# VRAM within this range, the bisection is in the noise floor (model weights
# dominate) and further probes won't yield meaningful data.
_EARLY_STOP_RANGE_MIB = 768 # 0.75 GiB
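The early-stop rule is easiest to see inside a stripped-down bisection loop. The sketch below is a simplified stand-in for the profiler's search, not the real function: `bisect_min_passing` and `fake_probe` are hypothetical, and the probe is assumed to return a `(passed, peak_mib)` pair. It assumes `hi` has already passed a validation run.

```python
def bisect_min_passing(probe, lo, hi, tolerance, validation_peak_mib,
                       early_stop_range_mib=768):
    """Return the smallest probed value that passed.

    `probe(value)` returns (passed, peak_mib). The peak list starts with the
    validation run's peak, so early stop can trigger after three bisection probes.
    """
    last_pass = hi
    peaks = [validation_peak_mib]
    while (hi - lo) > tolerance:
        mid = (lo + hi) // 2
        passed, peak_mib = probe(mid)
        peaks.append(peak_mib)
        if passed:
            last_pass = mid
            hi = mid  # passing: search lower
        else:
            lo = mid  # failing: search higher
        # Early stop: the last three peaks barely differ, so further
        # bisection is inside the measurement noise.
        if len(peaks) >= 4:
            recent = peaks[-3:]
            if max(recent) - min(recent) < early_stop_range_mib:
                break
    return last_pass


def fake_probe(kv_mib):
    # Pretend the test needs at least 300 MiB of KV cache and that
    # peak VRAM tracks the requested cap closely.
    return kv_mib >= 300, 4000 + kv_mib * 100


print(bisect_min_passing(fake_probe, lo=64, hi=1024, tolerance=16,
                         validation_peak_mib=4000 + 1024 * 100))  # 304
```

When every probe reports the same peak (for example because the test ignores the override), the loop exits after the third bisection probe instead of running the full search.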
def _extract_model_from_markers(pytest_args: list[str]) -> str | None:
"""Extract the model name from @pytest.mark.model(...) via pytest-json-report.
......@@ -446,6 +458,9 @@ def _recommend_markers(
wall_secs: float,
model_name: str | None = None,
num_runs: int = 1,
requested_sglang_kv_tokens: int | None = None,
requested_vllm_kv_cache_bytes: int | None = None,
min_kv_value: int | None = None,
) -> tuple[list[MarkerRecommendation], list[str]]:
"""Generate marker recommendations from profiling data.
......@@ -523,17 +538,37 @@ def _recommend_markers(
)
)
# -- Hardware: VRAM requirement --
# -- Hardware: VRAM requirements (two markers) --
if used_vram > _PLATEAU_TOLERANCE_MIB:
max_peak_gib = round(max_peak_mib / 1024, 1)
padded_peak_mib = int(max_peak_mib * _VRAM_SAFETY_FACTOR)
padded_peak_gib = round(padded_peak_mib / 1024, 1)
# profiled_vram_gib: actual nvidia-smi peak (for scheduling/filtering)
recs.append(
MarkerRecommendation(
f"max_vram_gib({padded_peak_gib})",
f"peak {_format_mib(max_peak_mib)} GPU RAM used "
f"(+10% safety: {_format_mib(padded_peak_mib)})",
f"profiled_vram_gib({max_peak_gib})",
f"actual nvidia-smi peak {_format_mib(max_peak_mib)}",
)
)
if requested_sglang_kv_tokens is not None:
min_label = f" over min={min_kv_value}" if min_kv_value is not None else ""
recs.append(
MarkerRecommendation(
f"requested_sglang_kv_tokens({requested_sglang_kv_tokens})",
f"KV cache cap ({_KV_SAFETY_FACTOR:.0f}x safety{min_label})",
)
)
if requested_vllm_kv_cache_bytes is not None:
min_label = (
f" over min={min_kv_value:_}" if min_kv_value is not None else ""
)
recs.append(
MarkerRecommendation(
f"requested_vllm_kv_cache_bytes({requested_vllm_kv_cache_bytes:_})",
f"KV cache cap ({_KV_SAFETY_FACTOR:.0f}x safety{min_label})",
)
)
# Warn about GPU cards that would OOM
for card_gib, card_name in _GPU_REFERENCE_CARDS:
......@@ -541,7 +576,7 @@ def _recommend_markers(
warnings.append(f"Will OOM on {card_name} ({card_gib} GiB).")
# -- Timeout --
timeout_val = int(math.ceil(wall_secs * 3.0))
timeout_val = int(math.ceil(wall_secs * 6.0))
timeout_val = max(timeout_val, 10)
recs.append(
MarkerRecommendation(
......@@ -598,6 +633,46 @@ def _print_recommendations(
print()
_SGLANG_NODEID_MARKERS = ["test_sglang", "sglang"]
def _is_sglang_test(pytest_args: list[str]) -> bool:
"""Check if any pytest arg looks like an SGLang test node ID."""
return any(
marker in arg for arg in pytest_args for marker in _SGLANG_NODEID_MARKERS
)
_OOM_PATTERNS = [
"OutOfMemoryError",
"CUDA out of memory",
"CUDA error: out of memory",
"not enough memory",
"Cannot allocate",
"oom-kill",
]
def _looks_like_oom(stdout: str) -> bool:
"""Check if captured output contains OOM-like errors."""
stdout_lower = stdout.lower()
return any(pat.lower() in stdout_lower for pat in _OOM_PATTERNS)
_SGLANG_MAX_TOKENS_RE = re.compile(r"max_total_tokens=(\d+)")
def _extract_requested_sglang_kv_tokens(stdout: str) -> int | None:
"""Extract max_total_tokens from SGLang engine output.
SGLang logs: "Got total KV blocks from scheduler: N (max_total_tokens=M, page_size=P)"
"""
match = _SGLANG_MAX_TOKENS_RE.search(stdout)
if match:
return int(match.group(1))
return None
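A quick usage sketch for the extraction above, written as a self-contained copy so it can be run on its own. The log line follows the format quoted in the docstring. The numbers in it are made up for illustration, and `extract_max_total_tokens` is a hypothetical stand-in name.

```python
import re

_SGLANG_MAX_TOKENS_RE = re.compile(r"max_total_tokens=(\d+)")


def extract_max_total_tokens(stdout):
    """Return the first max_total_tokens value found in the output, else None."""
    match = _SGLANG_MAX_TOKENS_RE.search(stdout)
    return int(match.group(1)) if match else None


# Illustrative log line; the values are invented for this example.
log = "Got total KV blocks from scheduler: 12 (max_total_tokens=192, page_size=16)"
print(extract_max_total_tokens(log))            # 192
print(extract_max_total_tokens("no such key"))  # None
```

Because `search` returns the first match, output that logs the value more than once yields the earliest occurrence.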
_DEFAULT_PROBE_TIMEOUT = 300 # 5 minutes max per profile run
......@@ -610,13 +685,13 @@ def _run_once(
quiet: bool = False,
run_label: str | None = None,
timeout: float = _DEFAULT_PROBE_TIMEOUT,
) -> tuple[int, float, list[GpuReport], list[GpuSample]]:
) -> tuple[int, float, list[GpuReport], list[GpuSample], str]:
"""Run pytest once with GPU sampling.
When *run_label* is set, each line of pytest stdout/stderr is prefixed
with ``[run_label]`` so multi-run output is easy to follow.
Returns (exit_code, wall_secs, reports, raw_samples).
Returns (exit_code, wall_secs, reports, raw_samples, captured_stdout).
"""
sampler = _Sampler(interval=interval)
sampler.start()
......@@ -639,6 +714,7 @@ def _run_once(
capture = run_label is not None
t_start = time.monotonic()
timed_out = False
captured_stdout = ""
try:
result = subprocess.run(
pytest_cmd,
......@@ -648,6 +724,8 @@ def _run_once(
timeout=timeout,
)
rc = result.returncode
if capture:
captured_stdout = result.stdout or ""
except subprocess.TimeoutExpired:
timed_out = True
rc = 1
......@@ -658,9 +736,9 @@ def _run_once(
)
if not timed_out and capture:
prefix = f"[{run_label}] "
for line in result.stdout.splitlines():
for line in captured_stdout.splitlines():
print(f"{prefix}{line}")
for line in result.stderr.splitlines():
for line in (result.stderr or "").splitlines():
print(f"{prefix}{line}", file=sys.stderr)
sys.stdout.flush()
wall_secs = time.monotonic() - t_start
......@@ -672,7 +750,7 @@ def _run_once(
sampler.stop()
reports = _build_reports(sampler.samples, baseline_end, test_end)
return rc, wall_secs, reports, sampler.samples
return rc, wall_secs, reports, sampler.samples, captured_stdout
def _find_min_vram(
......@@ -682,23 +760,46 @@ def _find_min_vram(
teardown_seconds: float = 2.0,
recommend: bool = True,
csv_path: str | None = None,
kv_bytes_mode: bool = False,
gpu_index: int = 0,
) -> int:
"""Binary search _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE to find the minimum VRAM a test needs.
"""Binary search to find the minimum VRAM a test needs.
Sets _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE env var (honored by agg.sh and similar scripts),
runs the test at each profile point, and bisects until the boundary is found.
Two modes, one pattern:
KV bisection (deterministic, no profiling race):
vLLM: bisects _PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES (bytes)
SGLang: bisects _PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS (tokens)
Both use the same _KV_SAFETY_FACTOR (2x) and the same bisect loop.
The only differences are env var name, units, display, and bounds.
"""
is_sglang = _is_sglang_test(pytest_args)
gpu_info = _query_gpu_stats()
if not gpu_info:
raise RuntimeError("NVML returned no GPU data")
used_mib = gpu_info[0][1]
total_mib = gpu_info[0][2]
if gpu_index >= len(gpu_info):
raise RuntimeError(
f"GPU {gpu_index} not found (available: 0..{len(gpu_info) - 1})"
)
used_mib = gpu_info[gpu_index][1]
total_mib = gpu_info[gpu_index][2]
free_mib = total_mib - used_mib
total_gib = total_mib / 1024
# Base env: pin subprocess to the selected GPU
_gpu_env = {"CUDA_VISIBLE_DEVICES": str(gpu_index)}
model_name = _extract_model_from_markers(pytest_args)
print("\n--- FIND MINIMUM VRAM (binary search) ---")
if not is_sglang:
kv_bytes_mode = True
if kv_bytes_mode:
mode_label = "KV CACHE BYTES (vLLM, deterministic)"
else:
mode_label = "KV TOKENS (SGLang)"
print(f"\n--- FIND MINIMUM {mode_label} (binary search) ---")
print(f" GPU total : {total_gib:.1f} GiB")
print(
f" GPU free : {free_mib / 1024:.1f} GiB "
......@@ -708,7 +809,6 @@ def _find_min_vram(
if model_name:
print(f" Model : {model_name}")
# Warn if something is already consuming significant GPU memory
hogged_pct = used_mib / total_mib * 100
if hogged_pct > 10:
print(f"\n {'!' * 72}")
......@@ -716,91 +816,169 @@ def _find_min_vram(
f" WARNING: {used_mib / 1024:.1f} GiB ({hogged_pct:.0f}%) of GPU memory "
f"is already in use!"
)
print(" Another process is hogging the GPU. Results will be inaccurate")
print(
" because _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE is a fraction of TOTAL memory,"
)
print(" not FREE memory. Kill other GPU processes first.")
print(" Another process is hogging the GPU. Free memory is reduced,")
print(" which limits KV cache headroom. Kill other GPU processes first.")
print(f" {'!' * 72}")
print()
lo = 0.05
hi = 0.95
tolerance = 0.05
max_iterations = math.ceil(math.log2((hi - lo) / tolerance))
last_pass_util: float | None = None
last_pass_peak_mib: int = 0
elapsed_times: list[float] = []
all_peak_mibs: list[int] = []
pass_wall_times: list[float] = []
print(f" Range : {lo:.0%} - {hi:.0%} (tolerance {tolerance:.0%})")
print(
f" Max iter: {max_iterations + 1} (1 validation + {max_iterations} bisections)"
)
print()
# -- Validation run --
validation_env: dict[str, str] = dict(_gpu_env)
if kv_bytes_mode:
# Start at 50% of free GPU. If it passes, that's the upper bound and we
# search downward. If it fails (model weights too large), halve again
# until we find a passing point, then search downward from there.
max_kv_bytes = int(max(free_mib // 2, 1024) * 1024 * 1024)
validation_env["_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES"] = str(max_kv_bytes)
validation_desc = f"kv_cache={max_kv_bytes // (1024**2)} MiB (50% of free)"
else:
validation_desc = "no token cap, default fraction"
# First, verify the test passes at hi (0.95)
print(
f" [profile 1/{max_iterations + 1}] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE={hi:.2f} "
f"(allowed max GPU {hi * total_gib:.1f} GiB) [validation run]"
)
print(f" [probe 1] Validation run ({validation_desc})")
sys.stdout.flush()
t_iter_start = time.monotonic()
label = f"profile 1/{max_iterations + 1}"
rc, wall, reports, raw_samples = _run_once(
rc, wall, reports, raw_samples, stdout = _run_once(
pytest_args,
interval=interval,
baseline_seconds=baseline_seconds,
teardown_seconds=teardown_seconds,
extra_env={"_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE": f"{hi:.2f}"},
extra_env=validation_env or None,
quiet=True,
run_label=label,
run_label="probe 1",
)
iter_elapsed = time.monotonic() - t_iter_start
elapsed_times.append(iter_elapsed)
# kv-bytes mode: if validation fails, check whether it's OOM (over-allocated)
# or a genuine test failure (unrelated to KV cache). Only retry with less KV
# if the output looks like OOM; otherwise the test is broken and retrying won't help.
if rc != 0 and kv_bytes_mode:
if _looks_like_oom(stdout):
for attempt in range(4):
max_kv_bytes //= 2
if max_kv_bytes < 64 * 1024 * 1024:
break
validation_env["_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES"] = str(
max_kv_bytes
)
print(
f" [OOM] Reducing KV cache to {max_kv_bytes // (1024**2)} MiB "
f"(retry {attempt + 1}/4)"
)
sys.stdout.flush()
t_iter_start = time.monotonic()
rc, wall, reports, raw_samples, stdout = _run_once(
pytest_args,
interval=interval,
baseline_seconds=baseline_seconds,
teardown_seconds=teardown_seconds,
extra_env=validation_env,
quiet=True,
run_label=f"probe 1 (retry {attempt + 1})",
)
iter_elapsed = time.monotonic() - t_iter_start
if rc == 0:
break
else:
print(
" [FAIL] Test failed but NOT from OOM — the test appears genuinely broken."
)
print(
" Hint: check the test output above for the root cause "
"(EngineDeadError, timeout, assertion, etc.)."
)
if rc != 0:
print(
f" [FAIL] allowed GPU = {hi * total_gib:.1f} GiB ({hi:.0%}), "
f"test fails even at max utilization. Cannot determine minimum."
reason = (
"OOM at all KV sizes"
if _looks_like_oom(stdout)
else "test broken (not OOM)"
)
print(f" [FAIL] Cannot determine minimum KV cache: {reason}.")
return rc
peak_mib = max((r.peak_mib for r in reports), default=0)
all_peak_mibs.append(peak_mib)
last_pass_util = hi
last_pass_peak_mib = peak_mib
if kv_bytes_mode:
# Search range: 64 MiB up to the validated max_kv_bytes, in bytes.
# Lower bound at 64 MiB to skip probes that always fail (no model
# can serve even 1 request with < 64 MiB KV cache).
lo: float | int = 64 * 1024 * 1024 # 64 MiB minimum
hi: float | int = max_kv_bytes
tolerance: float | int = 16 * 1024 * 1024 # 16 MiB tolerance
print(
f" [PASS] peak {_format_mib(peak_mib)}, wall {wall:.0f}s, "
f"iter took {iter_elapsed:.0f}s"
)
else:
max_tokens = _extract_requested_sglang_kv_tokens(stdout)
if max_tokens is None:
print(
" [ERROR] Could not extract max_total_tokens from SGLang output.\n"
" The launch script must log 'max_total_tokens=N' (SGLang does this by default)."
)
return 4
page_size = 16
lo = page_size
hi = max_tokens
tolerance = page_size * 2
print(
f" [PASS] peak {_format_mib(peak_mib)}, wall {wall:.0f}s, "
f"max_total_tokens={max_tokens}, iter took {iter_elapsed:.0f}s"
)
baseline_time = iter_elapsed
probe_timeout = max(baseline_time * 2, 60)
print(f" Profile timeout: {probe_timeout:.0f}s (2x first probe)")
max_iterations = (
max(1, math.ceil(math.log2((hi - lo) / tolerance))) if hi > lo else 0
)
last_pass_value: float | int = hi
last_pass_peak_mib: int = peak_mib
last_pass_reports = reports
last_pass_samples = raw_samples
pass_wall_times.append(wall)
elapsed_times: list[float] = [iter_elapsed]
pass_wall_times: list[float] = [wall]
all_peak_mibs: list[int] = [peak_mib]
if kv_bytes_mode:
print(
f"\n Range : {int(lo) // (1024**2)} - {int(hi) // (1024**2)} MiB (tolerance {int(tolerance) // (1024**2)} MiB)"
)
else:
print(f"\n Range : {lo} - {hi} tokens (tolerance {tolerance} tokens)")
print(
f" [PASS] allowed GPU = {hi * total_gib:.1f} GiB ({hi:.0%}), "
f"peak GPU used = {_format_mib(peak_mib)}, wall {wall:.0f}s, "
f"iter took {iter_elapsed:.0f}s"
f" Max iter: {max_iterations + 1} (1 validation + {max_iterations} bisections)"
)
print()
# Use 2x the first profile's time as the timeout for subsequent profiles.
# If a profile takes longer than this, it's likely stuck in teardown.
baseline_time = iter_elapsed
probe_timeout = max(baseline_time * 2, 60)
print(f" Profile timeout: {probe_timeout:.0f}s (2x first profile)")
# -- Binary search loop --
iteration = 0
while (hi - lo) > tolerance:
iteration += 1
probe_num = iteration + 1
mid = (lo + hi) / 2
remaining = max_iterations + 1 - probe_num
avg_iter = sum(elapsed_times) / len(elapsed_times)
eta_s = remaining * avg_iter
label = f"profile {probe_num}/{max_iterations + 1}"
print(
f"\n [{label}] "
f"_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE={mid:.2f} "
f"(allowed max GPU {mid * total_gib:.1f} GiB) "
f"[~{remaining} iters left, profiling ETA ~{eta_s:.0f}s]"
)
if kv_bytes_mode:
mid_int = (int(lo) + int(hi)) // 2
mid_int = max(mid_int, 1024 * 1024) # minimum 1 MiB
probe_env = {
**_gpu_env,
"_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES": str(mid_int),
}
probe_desc = f"kv_cache={mid_int // (1024**2)} MiB ({mid_int:,} bytes)"
else:
mid_int = ((int(lo) + int(hi)) // 2 // page_size) * page_size
mid_int = max(mid_int, page_size)
probe_env = {
**_gpu_env,
"_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS": str(mid_int),
}
probe_desc = f"tokens={mid_int}"
label = f"probe {probe_num}/{max_iterations + 1}"
print(f" [{label}] {probe_desc} [~{remaining} left, ETA ~{eta_s:.0f}s]")
sys.stdout.flush()
stop_progress = threading.Event()
......@@ -829,12 +1007,12 @@ def _find_min_vram(
)
progress_thread.start()
rc, wall, reports, raw_samples = _run_once(
rc, wall, reports, raw_samples, stdout = _run_once(
pytest_args,
interval=interval,
baseline_seconds=baseline_seconds,
teardown_seconds=teardown_seconds,
extra_env={"_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE": f"{mid:.2f}"},
extra_env=probe_env,
quiet=True,
run_label=label,
timeout=probe_timeout,
......@@ -853,77 +1031,173 @@ def _find_min_vram(
peak_mib = max((r.peak_mib for r in reports), default=0)
all_peak_mibs.append(peak_mib)
mid_value = mid_int
if rc == 0:
last_pass_util = mid
last_pass_value = mid_value
last_pass_peak_mib = peak_mib
last_pass_reports = reports
last_pass_samples = raw_samples
pass_wall_times.append(wall)
hi = mid
hi = mid_value
print(
f" [PASS] allowed GPU = {mid * total_gib:.1f} GiB ({mid:.0%}), "
f"peak GPU used = {_format_mib(peak_mib)}, wall {wall:.0f}s, "
f"iter took {iter_elapsed:.0f}s"
f" [PASS] {probe_desc}, peak {_format_mib(peak_mib)}, "
f"wall {wall:.0f}s, iter took {iter_elapsed:.0f}s"
)
else:
lo = mid
lo = mid_value
print(f" [FAIL] {probe_desc}, iter took {iter_elapsed:.0f}s")
# Early termination: if last 3 probes have peak VRAM within
# _EARLY_STOP_RANGE_MIB, further bisection is in the noise floor.
if len(all_peak_mibs) >= 4:
recent = all_peak_mibs[-3:]
peak_range = max(recent) - min(recent)
if peak_range < _EARLY_STOP_RANGE_MIB:
print(
f" [EARLY STOP] Peak VRAM stable at ~{_format_mib(recent[-1])} "
f"for last 3 probes (range {peak_range} MiB < "
f"{_EARLY_STOP_RANGE_MIB} MiB threshold) "
f"-- stopping bisection early"
)
break
# -- Results --
test_name = next(
(a for a in pytest_args if "::" in a or a.endswith(".py")),
" ".join(pytest_args),
)
test_short = test_name.rsplit("::", 1)[-1] if "::" in test_name else test_name
peak_gib = round(last_pass_peak_mib / 1024, 1)
print(f"\n{'=' * 72}")
if kv_bytes_mode:
min_kv_bytes = int(last_pass_value)
safe_kv_bytes = int(min_kv_bytes * _KV_SAFETY_FACTOR)
# Round up to nearest 1000 for clean marker values
safe_kv_bytes = ((safe_kv_bytes + 999) // 1000) * 1000
safe_kv_mib = safe_kv_bytes // (1024 * 1024)
min_kv_mib = min_kv_bytes // (1024 * 1024)
# Final validation probe at safe_kv_bytes to get accurate profiled_vram_gib.
# The bisection's last pass was at min_kv_bytes; the recommended marker uses
# safe_kv_bytes which allocates more KV cache and thus more VRAM.
print(f" [final probe] Measuring VRAM at safe_kv_bytes={safe_kv_mib} MiB")
sys.stdout.flush()
rc_final, wall_final, reports_final, samples_final, stdout_final = _run_once(
pytest_args,
interval=interval,
baseline_seconds=baseline_seconds,
teardown_seconds=teardown_seconds,
extra_env={
**_gpu_env,
"_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES": str(safe_kv_bytes),
},
quiet=True,
run_label="final",
timeout=probe_timeout,
)
if rc_final == 0:
last_pass_peak_mib = max((r.peak_mib for r in reports_final), default=0)
last_pass_reports = reports_final
last_pass_samples = samples_final
pass_wall_times.append(wall_final)
peak_gib = round(last_pass_peak_mib / 1024, 1)
print(
f" [FAIL] allowed GPU = {mid * total_gib:.1f} GiB ({mid:.0%}), "
f"OOM or error, iter took {iter_elapsed:.0f}s"
f" [PASS] kv_cache={safe_kv_mib} MiB, "
f"peak {_format_mib(last_pass_peak_mib)}, wall {wall_final:.0f}s"
)
# Detect if _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE is being ignored: all peaks are nearly
# identical despite wildly different utilization caps.
if len(all_peak_mibs) >= 3:
peak_range = max(all_peak_mibs) - min(all_peak_mibs)
if peak_range < _PLATEAU_TOLERANCE_MIB:
print(f"\n {'!' * 72}")
else:
print(
f" WARNING: Peak VRAM was ~{_format_mib(all_peak_mibs[0])} across ALL "
f"{len(all_peak_mibs)} probes (range: {peak_range} MiB)."
f" [FAIL] kv_cache={safe_kv_mib} MiB failed unexpectedly, "
f"using VRAM from min_kv_bytes={min_kv_mib} MiB instead"
)
print(f"\n{'=' * 72}")
print("MINIMUM KV CACHE RESULT")
print(f"{'=' * 72}")
print(f" Minimum KV cache : {min_kv_mib} MiB ({min_kv_bytes:,} bytes)")
print(
f" Safe KV cache : {safe_kv_mib} MiB ({safe_kv_bytes:,} bytes) ({_KV_SAFETY_FACTOR:.0f}x safety)"
)
print(
f" Peak VRAM : {_format_mib(last_pass_peak_mib)} (at {safe_kv_mib} MiB)"
)
print()
print(" Recommended markers:")
print(f" @pytest.mark.profiled_vram_gib({peak_gib})")
print(
f" @pytest.mark.requested_vllm_kv_cache_bytes({safe_kv_bytes:_}), # KV cache cap ({_KV_SAFETY_FACTOR:.0f}x safety over min={min_kv_bytes:_})"
)
print(f"{'=' * 72}")
else:
min_tokens = int(last_pass_value)
safe_tokens = int(min_tokens * _KV_SAFETY_FACTOR)
page_size = 16
safe_tokens = ((safe_tokens + page_size - 1) // page_size) * page_size
# Final validation probe at safe_tokens to get accurate profiled_vram_gib.
# The bisection's last pass was at min_tokens; the recommended marker uses
# safe_tokens which allocates more KV cache and thus more VRAM.
print(f" [final probe] Measuring VRAM at safe_tokens={safe_tokens}")
sys.stdout.flush()
rc_final, wall_final, reports_final, samples_final, stdout_final = _run_once(
pytest_args,
interval=interval,
baseline_seconds=baseline_seconds,
teardown_seconds=teardown_seconds,
extra_env={
**_gpu_env,
"_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS": str(safe_tokens),
},
quiet=True,
run_label="final",
timeout=probe_timeout,
)
if rc_final == 0:
last_pass_peak_mib = max((r.peak_mib for r in reports_final), default=0)
last_pass_reports = reports_final
last_pass_samples = samples_final
pass_wall_times.append(wall_final)
peak_gib = round(last_pass_peak_mib / 1024, 1)
print(
" This strongly suggests the test IGNORES the _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"
f" [PASS] tokens={safe_tokens}, peak {_format_mib(last_pass_peak_mib)}, "
f"wall {wall_final:.0f}s"
)
print(" env var. Binary search results are UNRELIABLE — no marker")
print(" recommendation will be provided.")
print(" ")
else:
print(
" FIX: The test (or its launch script) must read _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"
f" [FAIL] tokens={safe_tokens} failed unexpectedly, "
f"using VRAM from min_tokens={min_tokens} instead"
)
print(" and pass --gpu-memory-utilization to vLLM / the engine.")
print(" See tests/README.md 'GPU VRAM Profiler' for details.")
print(f" {'!' * 72}")
return 4
# Results
assert last_pass_util is not None
min_vram_gib = last_pass_util * total_gib
padded_peak_mib = int(last_pass_peak_mib * _VRAM_SAFETY_FACTOR)
padded_peak_gib = round(padded_peak_mib / 1024, 1)
# Extract a short test name from pytest args for the summary
test_name = next(
(a for a in pytest_args if "::" in a or a.endswith(".py")),
" ".join(pytest_args),
)
test_short = test_name.rsplit("::", 1)[-1] if "::" in test_name else test_name
print("\n--- RESULT ---")
print(f" Lowest passing utilization : {last_pass_util:.0%}")
print(
f" Minimum VRAM needed : ~{min_vram_gib:.1f} GiB "
f"(peak observed: {_format_mib(last_pass_peak_mib)}, "
f"+10% safety: {_format_mib(padded_peak_mib)})"
)
print(f" {test_short}: @pytest.mark.max_vram_gib({padded_peak_gib})")
print(f"\n{'=' * 72}")
print("MINIMUM KV TOKENS RESULT")
print(f"{'=' * 72}")
print(f" Minimum tokens : {min_tokens} (raw bisection result)")
print(f" Recommended : {safe_tokens} ({_KV_SAFETY_FACTOR:.0f}x safety)")
print(
f" Peak VRAM : {_format_mib(last_pass_peak_mib)} (at {safe_tokens} tokens)"
)
print(f" {test_short}: @pytest.mark.profiled_vram_gib({peak_gib})")
print(
f" {test_short}: @pytest.mark.requested_sglang_kv_tokens({safe_tokens}), # KV cache cap ({_KV_SAFETY_FACTOR:.0f}x safety over min={min_tokens})"
)
print(f"{'=' * 72}")
# Full marker recommendations using average wall time across all passing runs
# Marker recommendations
requested_sglang_kv_tokens = safe_tokens if is_sglang else None
requested_vllm_kv_cache_bytes = safe_kv_bytes if kv_bytes_mode else None
min_kv_value = int(last_pass_value)
if recommend:
avg_pass_wall = sum(pass_wall_times) / len(pass_wall_times)
recs, warnings = _recommend_markers(
last_pass_reports, avg_pass_wall, model_name, num_runs=len(pass_wall_times)
last_pass_reports,
avg_pass_wall,
model_name,
num_runs=len(pass_wall_times),
requested_sglang_kv_tokens=requested_sglang_kv_tokens,
requested_vllm_kv_cache_bytes=requested_vllm_kv_cache_bytes,
min_kv_value=min_kv_value,
)
_print_recommendations(recs, warnings, pytest_args=pytest_args)
......@@ -980,6 +1254,22 @@ def main(argv: list[str] | None = None) -> int:
help="Disable the default binary-search mode that finds minimum VRAM. "
"When set, runs a single profiling pass instead.",
)
parser.add_argument(
"--kv-bytes",
action="store_true",
default=False,
help="(No-op, kept for backward compat.) vLLM always uses KV byte "
"bisection via _PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES. "
"Outputs @pytest.mark.requested_vllm_kv_cache_bytes(N).",
)
parser.add_argument(
"--gpu",
"--gpus",
type=int,
default=0,
help="GPU index to profile on (default: 0). "
"Sets CUDA_VISIBLE_DEVICES for the subprocess.",
)
raw = argv if argv is not None else sys.argv[1:]
......@@ -1002,19 +1292,26 @@ def main(argv: list[str] | None = None) -> int:
if looks_like_test_path and not os.path.exists(test_path):
parser.error(f"Test path does not exist: {test_path}")
gpu_idx = args.gpu
gpu_info = _query_gpu_stats()
if not gpu_info:
raise RuntimeError("NVML returned no GPU data")
if gpu_idx >= len(gpu_info):
raise RuntimeError(
f"GPU {gpu_idx} not found (available: 0..{len(gpu_info) - 1})"
)
used_mib = gpu_info[0][1]
total_mib = gpu_info[0][2]
used_mib = gpu_info[gpu_idx][1]
total_mib = gpu_info[gpu_idx][2]
hogged_pct = used_mib / total_mib * 100
if hogged_pct > 10:
print(
f"\nWARNING: {used_mib / 1024:.1f} GiB ({hogged_pct:.0f}%) of GPU memory "
f"is already in use! Results may be inaccurate.\n"
f"\nWARNING: GPU {gpu_idx}: {used_mib / 1024:.1f} GiB ({hogged_pct:.0f}%) "
f"of GPU memory is already in use! Results may be inaccurate.\n"
)
gpu_env = {"CUDA_VISIBLE_DEVICES": str(gpu_idx)}
if not args.no_find_min_vram:
return _find_min_vram(
pytest_args,
......@@ -1023,21 +1320,34 @@ def main(argv: list[str] | None = None) -> int:
teardown_seconds=args.teardown_seconds,
recommend=not args.no_recommend,
csv_path=args.csv,
kv_bytes_mode=args.kv_bytes,
gpu_index=gpu_idx,
)
model_name = _extract_model_from_markers(pytest_args)
is_sglang = _is_sglang_test(pytest_args)
rc, wall_secs, reports, samples = _run_once(
rc, wall_secs, reports, samples, stdout = _run_once(
pytest_args,
interval=args.interval,
baseline_seconds=args.baseline_seconds,
teardown_seconds=args.teardown_seconds,
extra_env=gpu_env,
run_label="profile" if is_sglang else None,
)
_print_report(reports, rc, wall_secs, model_name=model_name)
if not args.no_recommend and reports:
recs, warnings = _recommend_markers(reports, wall_secs, model_name=model_name)
requested_sglang_kv_tokens = None
if is_sglang:
requested_sglang_kv_tokens = _extract_requested_sglang_kv_tokens(stdout)
recs, warnings = _recommend_markers(
reports,
wall_secs,
model_name=model_name,
requested_sglang_kv_tokens=requested_sglang_kv_tokens,
)
_print_recommendations(recs, warnings, pytest_args=pytest_args)
if args.csv:
......
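The rounding applied to the recommended KV values in the profiler above can be looked at in isolation. The following is a minimal sketch, not code from this PR: the helper names are hypothetical, and the safety factor of 2.0 is an assumed value, since `_KV_SAFETY_FACTOR` is defined outside this hunk.

```python
# Hypothetical helpers illustrating the rounding; names and factor are assumptions.
ASSUMED_SAFETY_FACTOR = 2.0  # stand-in for _KV_SAFETY_FACTOR, whose value is not shown here


def round_up(value: int, multiple: int) -> int:
    """Round value up to the next multiple of `multiple`."""
    return ((value + multiple - 1) // multiple) * multiple


def safe_kv_bytes(min_kv_bytes: int, factor: float = ASSUMED_SAFETY_FACTOR) -> int:
    """Scale the bisected minimum and round up to a multiple of 1000 bytes."""
    return round_up(int(min_kv_bytes * factor), 1000)


def safe_kv_tokens(
    min_tokens: int, factor: float = ASSUMED_SAFETY_FACTOR, page_size: int = 16
) -> int:
    """Scale the bisected minimum and round up to a whole number of pages."""
    return round_up(int(min_tokens * factor), page_size)
```

Rounding up rather than to the nearest value keeps the recommended marker at or above the scaled minimum, so the safety margin is never reduced by the rounding step.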
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""GPU-parallel test runner (used by conftest.py, not invoked directly).
Runs pytest tests as independent subprocesses with VRAM-aware scheduling.
Each test gets CUDA_VISIBLE_DEVICES and KV cache overrides
(_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES / _PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS)
so the engine allocates only its declared VRAM budget.
Usage (always via pytest):
pytest --max-vram-gib=6 -n auto -m "gpu_1 and vllm" tests/serve/
pytest --max-vram-gib=6 -n 4 -sv -m "gpu_1 and vllm" tests/serve/
Flags:
--max-vram-gib=N Only run tests with profiled_vram_gib <= N
-n N / -n auto Run N tests concurrently (auto = GPU budget / smallest test)
-s Stream subprocess output live with [wN] prefixes
-v / -vv Passed through to subprocesses for verbose test names
A 5-second per-GPU stagger between vLLM launches avoids the vLLM profiling race
(bug #10643). Tests that fail due to the profiling race are retried up to 3 times.
"""
from __future__ import annotations
import argparse
import os
import subprocess
import sys
import tempfile
import threading
import time
from dataclasses import dataclass, field
from pathlib import Path
import pynvml
_repo_root = str(Path(__file__).resolve().parents[2])
if _repo_root not in sys.path:
sys.path.insert(0, _repo_root)
from tests.utils.vram_utils import ( # noqa: E402
VRAM_MULTI_PROC_MARGIN,
auto_worker_count,
detect_gpus,
load_test_meta,
)
@dataclass
class _TestEntry:
"""A test scheduled for GPU-parallel execution."""
id: str
name: str
profiled_gib: float
timeout: float
requested_vllm_kv_cache_bytes: int | None = None
requested_sglang_kv_tokens: int | None = None
skip_reason: str | None = None
w_id: int = 0
assigned_gpu: int | None = None
retries: int = 0
@dataclass
class _CompletedTest:
"""Result record for a finished test subprocess."""
test: _TestEntry
duration: float
passed: bool
skipped: bool = False
skip_reason: str | None = None
fail_reason: str | None = None
@dataclass
class _TentativeGpu:
"""Scratch copy of GPU budget/free state used during scheduling."""
budget: float
free: float
count: int
@dataclass
class _GpuState:
"""Per-GPU bookkeeping for VRAM budget tracking."""
index: int
total_gib: float
budget_multi: float
budget_used: float = 0.0
running_count: int = 0
@dataclass
class _RunningTest:
"""State for a test subprocess currently executing on a GPU."""
proc: subprocess.Popen[str]
test: _TestEntry
start_time: float
captured: list[str] = field(default_factory=list)
reader_thread: threading.Thread | None = None
def _print(msg: str = "") -> None:
"""Print to stderr so pytest doesn't capture it."""
print(msg, file=sys.stderr, flush=True)
def _fmt_req(test: _TestEntry) -> str:
"""Format the resource request value for display."""
if test.requested_sglang_kv_tokens is not None:
return f"req_kv_tokens={int(test.requested_sglang_kv_tokens)}"
if test.requested_vllm_kv_cache_bytes is not None:
gib = int(test.requested_vllm_kv_cache_bytes) / (1024**3)
return f"req_kv={gib:.2f} GiB"
return "req_kv=None"
_JUNIT_DIR = os.path.join(tempfile.gettempdir(), "gpu_parallel_junit")
_JUNIT_COMBINED = os.path.join(_JUNIT_DIR, "combined.xml")
def _parse_junit_skipped(junit_path: str) -> str | None:
"""Check JUnit XML for a skipped test. Returns skip reason or None."""
import xml.etree.ElementTree as ET
try:
tree = ET.parse(junit_path)
except (ET.ParseError, FileNotFoundError):
return None
root = tree.getroot()
suite = root if root.tag == "testsuite" else root.find("testsuite")
if suite is None:
return None
for tc in suite.findall("testcase"):
skip_el = tc.find("skipped")
if skip_el is not None:
return skip_el.get("message", "skipped")
return None
def _aggregate_junit_xml(junit_dir: str) -> str | None:
"""Merge per-test JUnit XML files into one combined testsuite."""
import xml.etree.ElementTree as ET
xmls = sorted(Path(junit_dir).glob("*.xml"))
xmls = [x for x in xmls if x.name != "combined.xml"]
if not xmls:
return None
total_tests = total_errors = total_failures = 0
total_time = 0.0
testcases = []
for xml_path in xmls:
try:
tree = ET.parse(xml_path)
except ET.ParseError:
continue
root = tree.getroot()
suite = root if root.tag == "testsuite" else root.find("testsuite")
if suite is None:
continue
total_tests += int(suite.get("tests", 0))
total_errors += int(suite.get("errors", 0))
total_failures += int(suite.get("failures", 0))
total_time += float(suite.get("time", 0))
testcases.extend(suite.findall("testcase"))
combined = ET.Element(
"testsuite",
{
"name": "gpu-parallel",
"tests": str(total_tests),
"errors": str(total_errors),
"failures": str(total_failures),
"time": f"{total_time:.3f}",
},
)
for tc in testcases:
combined.append(tc)
out = _JUNIT_COMBINED
ET.ElementTree(combined).write(out, encoding="unicode", xml_declaration=True)
return out
def _collect_tests(pytest_args: list[str], max_vram_gib: float) -> list[str]:
"""Run pytest --collect-only to get test IDs, filtered by --max-vram-gib."""
_strip_flags = {"-v", "-vv", "-vvv", "--verbose", "-s", "--capture=no"}
collect_args = [a for a in pytest_args if a not in _strip_flags]
cmd = [
sys.executable,
"-m",
"pytest",
f"--max-vram-gib={max_vram_gib}",
"--collect-only",
"-q",
*collect_args,
]
result = subprocess.run(cmd, capture_output=True, text=True)
test_ids = []
for line in result.stdout.strip().split("\n"):
line = line.strip()
if "::" in line and not line.startswith(" "):
test_ids.append(line)
return test_ids
def _get_gpu_used_gib(gpu_index: int = 0) -> float:
"""Query actual GPU memory used via pynvml."""
try:
pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
pynvml.nvmlShutdown()
return mem.used / (1024**3)
except pynvml.NVMLError:
return 0.0
_RETRYABLE_INIT_MARKERS = [
"Error in memory profiling", # vLLM profiling race assertion
"Free memory on device", # not enough free VRAM at startup
"Engine core initialization failed", # engine init crash
"exited with code 0 while waiting for health check", # engine started but died during init
"exited with code -15 while waiting for health check", # SIGTERM during init
"exited with code -9 while waiting for health check", # SIGKILL (OOM killer) during init
]
_MAX_RETRIES = 3
def _capture_output(pipe, captured: list[str], prefix: str | None = None) -> None:
"""Read all lines from a pipe into `captured`. Runs in a thread.
If prefix is set, also prints each line live (-s mode).
"""
for line in iter(pipe.readline, ""):
line = line.rstrip("\n")
if line:
captured.append(line)
if prefix is not None:
_print(f"{prefix} {line}")
pipe.close()
def _parse_gpu_indices(raw: str, available: list[dict]) -> list[int]:
"""Parse --gpus value into a list of GPU indices.
Accepts 'all' or comma-separated indices (e.g. '0,1').
"""
avail_indices = [g["index"] for g in available]
if raw.strip().lower() == "all":
return avail_indices
indices = []
for part in raw.split(","):
part = part.strip()
if not part:
continue
idx = int(part)
if idx not in avail_indices:
raise ValueError(f"GPU {idx} not found (available: {avail_indices})")
indices.append(idx)
return indices or avail_indices
def run_parallel(
test_ids: list[str],
meta: dict[str, dict],
max_vram_gib: float,
num_slots: int,
gpu_indices: list[int] | None = None,
extra_pytest_args: list[str] | None = None,
stream: bool = False,
) -> int:
"""Run tests in parallel with VRAM-aware scheduling across multiple GPUs.
Flags (mimic pytest semantics):
-s Stream subprocess output live with [wN] prefixes.
-v/-vv Passed through to subprocesses for verbose test names / diffs.
No effect on the orchestrator's output.
Without -s, output is buffered and printed after each test completes.
Returns exit code: 0 if all pass, 1 if any fail.
"""
gpus = detect_gpus()
if not gpus:
_print("ERROR: No GPUs detected")
return 1
if gpu_indices is None:
gpu_indices = [g["index"] for g in gpus]
gpu_by_idx = {g["index"]: g for g in gpus}
gpu_states: dict[int, _GpuState] = {}
for gi in gpu_indices:
if gi not in gpu_by_idx:
_print(
f"ERROR: GPU{gi} not found "
f"(available: {[g['index'] for g in gpus]})"
)
return 1
total = gpu_by_idx[gi]["total_mib"] / 1024.0
gpu_states[gi] = _GpuState(
index=gi,
total_gib=total,
budget_multi=total * (1.0 - VRAM_MULTI_PROC_MARGIN),
)
tests: list[_TestEntry] = []
for tid in test_ids:
m = meta.get(tid, {})
tests.append(
_TestEntry(
id=tid,
name=tid,
profiled_gib=m.get("profiled_vram_gib", max_vram_gib),
requested_vllm_kv_cache_bytes=m.get("requested_vllm_kv_cache_bytes"),
timeout=m.get("timeout", 600),
requested_sglang_kv_tokens=m.get("requested_sglang_kv_tokens"),
skip_reason=m.get("skip_reason"),
)
)
# Separate skip-marked tests — they won't actually run, so don't
# validate KV markers or consume GPU budget.
skipped_tests = [t for t in tests if t.skip_reason is not None]
tests = [t for t in tests if t.skip_reason is None]
# Sort by timeout descending (longest first to minimize tail latency)
tests.sort(key=lambda t: t.timeout, reverse=True)
# Reject tests without a KV marker — without explicit memory control
# they'd each grab the engine's default (e.g. vLLM 90%) and OOM when
# run concurrently. Tests with profiled_gib=0 are exempt (mock/CPU-only).
no_kv = [
t
for t in tests
if t.requested_vllm_kv_cache_bytes is None
and t.requested_sglang_kv_tokens is None
and t.profiled_gib > 0
]
if no_kv:
_print(
f"\nERROR: {len(no_kv)} test(s) lack a requested_vllm_kv_cache_bytes "
f"or requested_sglang_kv_tokens marker and cannot run in parallel:"
)
for t in no_kv:
_print(f" {t.name}")
_print(
"\nAdd the appropriate marker via profile_pytest.py --kv-bytes, "
"then rerun."
)
return 1
# Identify tests in metadata that exceed the VRAM budget
test_id_set = set(test_ids)
over_budget = []
for nodeid, m in meta.items():
if nodeid not in test_id_set:
profiled = m.get("profiled_vram_gib")
if profiled is not None and profiled > max_vram_gib:
over_budget.append((nodeid, profiled))
# Assign permanent worker IDs (w0, w1, ...) to all tests including skipped
all_tests = tests + skipped_tests
for idx, test in enumerate(all_tests):
test.w_id = idx
os.makedirs(_JUNIT_DIR, exist_ok=True)
# --- Plan header ---
n_run = len(tests)
n_skip = len(skipped_tests)
count_str = f"{n_run} tests"
if n_skip:
count_str += f", {n_skip} skipped"
if len(gpu_states) == 1:
gi = next(iter(gpu_states))
gs = gpu_states[gi]
_print(
f"\nGPU parallel: {count_str}, {num_slots} concurrent slots, "
f"GPU{gi} ({gs.total_gib:.0f} GiB, "
f"{gs.budget_multi:.0f} GiB multi-proc budget)"
)
else:
gpu_list = ",".join(str(gi) for gi in sorted(gpu_states))
sizes = {int(gs.total_gib) for gs in gpu_states.values()}
budgets = {int(gs.budget_multi) for gs in gpu_states.values()}
if len(sizes) == 1 and len(budgets) == 1:
size_str = (
f"{next(iter(sizes))} GiB each, "
f"{next(iter(budgets))} GiB multi-proc budget"
)
else:
size_str = ", ".join(
f"GPU{gi}: {gs.total_gib:.0f}/{gs.budget_multi:.0f} GiB"
for gi, gs in sorted(gpu_states.items())
)
_print(
f"\nGPU parallel: {count_str}, {num_slots} concurrent slots, "
f"GPUs {gpu_list} ({size_str})"
)
_print()
for test in tests:
_print(
f"[w{test.w_id}] {test.name} "
f"profiled={test.profiled_gib:.1f} GiB, "
f"{_fmt_req(test)}, "
f"timeout={int(test.timeout)}s"
)
if over_budget:
_print()
_print(
f"Over budget ({len(over_budget)} -- profiled > max_vram_gib {max_vram_gib:.0f} GiB):"
)
for name, profiled in sorted(over_budget, key=lambda x: x[1], reverse=True):
_print(f" {name} (profiled={profiled:.1f} GiB)")
_print()
# --- Report skip-marked tests immediately (like xdist SKIPPED) ---
completed: list[_CompletedTest] = []
for test in skipped_tests:
_print(f"[w{test.w_id}] {test.name} SKIPPED" f" - {test.skip_reason}")
completed.append(
_CompletedTest(
test=test,
duration=0,
passed=False,
skipped=True,
skip_reason=test.skip_reason,
)
)
# --- Scheduling state ---
t0 = time.monotonic()
pending = list(tests)
running: dict[int, _RunningTest] = {}
next_status = t0 + 10
# vLLM needs a stagger because --gpu-memory-utilization triggers a memory
# profiling step that snapshots free memory — concurrent launches corrupt
# each other's snapshots (bug #10643). SGLang uses --max-total-tokens
# which is deterministic, so no stagger is needed.
_VLLM_LAUNCH_STAGGER_S = 5.0
last_vllm_launch: dict[int, float] = {} # gpu_index -> monotonic timestamp
def _build_status(now: float) -> str:
"""Build multi-GPU status string for periodic output."""
elapsed = int(now - t0)
gpu_parts = []
for gi in sorted(gpu_states):
gs = gpu_states[gi]
actual = _get_gpu_used_gib(gi)
workers = sorted(
w for w, run_info in running.items() if run_info.test.assigned_gpu == gi
)
wstr = ", ".join(
f"w{w}({int(now - running[w].start_time)}s)" for w in workers
)
part = f"GPU{gi}: {actual:.1f}/{gs.total_gib:.0f} GiB"
if wstr:
part += f" [{wstr}]"
gpu_parts.append(part)
return f"[elapsed {elapsed}s] {', '.join(gpu_parts)}"
def _launch_test(test: _TestEntry, env_base: dict) -> _RunningTest:
"""Build env, spawn subprocess, start output streamer thread."""
env = env_base.copy()
env["CUDA_VISIBLE_DEVICES"] = str(test.assigned_gpu)
if test.requested_sglang_kv_tokens is not None:
env["_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS"] = str(
int(test.requested_sglang_kv_tokens)
)
elif test.requested_vllm_kv_cache_bytes is not None:
env["_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES"] = str(
int(test.requested_vllm_kv_cache_bytes)
)
safe_name = test.name.replace("/", "_").replace("::", "__")
junit_path = os.path.join(_JUNIT_DIR, f"{safe_name}.xml")
has_tb = extra_pytest_args and any(
a.startswith("--tb") for a in extra_pytest_args
)
cmd = [
sys.executable,
"-m",
"pytest",
test.id,
"-x",
*([] if has_tb else ["--tb=short"]),
f"--timeout={int(test.timeout)}",
f"--junitxml={junit_path}",
]
if extra_pytest_args:
cmd.extend(extra_pytest_args)
proc = subprocess.Popen(
cmd,
env=env,
stdout=subprocess.PIPE,
stderr=subprocess.STDOUT,
text=True,
)
run_info = _RunningTest(proc=proc, test=test, start_time=time.monotonic())
w_id = test.w_id
stream_prefix = f"[w{w_id}]" if stream else None
t = threading.Thread(
target=_capture_output,
args=(proc.stdout, run_info.captured, stream_prefix),
daemon=True,
)
t.start()
run_info.reader_thread = t
return run_info
env_base = os.environ.copy()
while pending or running:
now = time.monotonic()
# Check for completed subprocesses
for w_id in list(running.keys()):
run_info = running[w_id]
rc = run_info.proc.poll()
if rc is not None:
if run_info.reader_thread is not None:
run_info.reader_thread.join(timeout=5)
duration = now - run_info.start_time
passed = rc == 0
test = run_info.test
gi = test.assigned_gpu
# Detect retryable init errors (profiling race, OOM at startup)
if not passed and test.retries < _MAX_RETRIES:
matched_marker = None
for line in run_info.captured:
for marker in _RETRYABLE_INIT_MARKERS:
if marker in line:
matched_marker = marker
break
if matched_marker:
break
if matched_marker:
test.retries += 1
_print(
f"[w{w_id}] retrying ({test.retries}/{_MAX_RETRIES})"
f" — {matched_marker}"
)
if gi is not None:
gpu_states[gi].budget_used -= test.profiled_gib
gpu_states[gi].running_count -= 1
del running[w_id]
test.assigned_gpu = None
pending.insert(0, test)
continue
# Detect runtime skips via JUnit XML (subprocess exit 0
# covers both "all passed" and "all skipped").
skipped = False
skip_reason: str | None = None
if passed:
safe_name = test.name.replace("/", "_").replace("::", "__")
junit_path = os.path.join(_JUNIT_DIR, f"{safe_name}.xml")
skip_reason = _parse_junit_skipped(junit_path)
if skip_reason is not None:
passed = False
skipped = True
# Dump buffered output on failure only (matches pytest behavior).
# With -s, output was already streamed live.
fail_reason = ""
if not passed and not skipped:
if not stream:
prefix = f"[w{w_id}]"
for line in run_info.captured:
_print(f"{prefix} {line}")
for line in reversed(run_info.captured):
stripped = line.strip()
if stripped and not stripped.startswith("="):
fail_reason = stripped
break
if skipped:
status = "SKIPPED"
elif passed:
status = "PASSED"
else:
status = "FAILED"
if skipped:
_print(f"[w{w_id}] {test.name} SKIPPED" f" - {skip_reason}")
else:
_print(f"[w{w_id}] {test.name} {status} [{duration:.0f}s]")
if gi is not None:
gpu_states[gi].budget_used -= test.profiled_gib
gpu_states[gi].running_count -= 1
completed.append(
_CompletedTest(
test=test,
duration=duration,
passed=passed,
skipped=skipped,
skip_reason=skip_reason,
fail_reason=fail_reason,
)
)
del running[w_id]
# Print status immediately after completion
parts = [_build_status(now)]
if pending:
queued_str = ", ".join(f"w{t.w_id}" for t in pending)
parts.append(f"[queued: {queued_str}]")
_print(" ".join(parts))
next_status = now + 10
# --- Launch pending tests ---
# For each pending test, find the GPU with most available budget.
# Gate on BOTH budget tracking AND actual GPU free memory.
# vLLM stagger is per-GPU only — tests on different GPUs launch
# simultaneously.
if pending and len(running) < num_slots:
actual_free = {
gi: gs.total_gib - _get_gpu_used_gib(gi)
for gi, gs in gpu_states.items()
}
tentative = {
gi: _TentativeGpu(
budget=gs.budget_used,
free=actual_free[gi],
count=gs.running_count,
)
for gi, gs in gpu_states.items()
}
to_launch: list[tuple[int, int]] = [] # (pending_idx, gpu_idx)
n_total = len(running)
for i, test in enumerate(pending):
if n_total + len(to_launch) >= num_slots:
break
best_gi: int | None = None
best_avail = -1.0
for gi, gs in gpu_states.items():
ts = tentative[gi]
will_be_multi = ts.count >= 1
cap = gs.budget_multi if will_be_multi else gs.total_gib
avail = cap - ts.budget
if avail < test.profiled_gib:
continue
if ts.free < test.profiled_gib:
continue
if avail > best_avail:
best_gi = gi
best_avail = avail
if best_gi is not None:
to_launch.append((i, best_gi))
tentative[best_gi].budget += test.profiled_gib
tentative[best_gi].free -= test.profiled_gib
tentative[best_gi].count += 1
# Pop from pending in reverse to preserve indices, then reverse
# back so longest-timeout tests launch first.
batch: list[_TestEntry] = []
for pending_idx, assigned_gpu in reversed(to_launch):
entry = pending.pop(pending_idx)
entry.assigned_gpu = assigned_gpu
batch.append(entry)
batch.reverse()
for entry in batch:
w_id = entry.w_id
gi = entry.assigned_gpu
assert gi is not None
is_vllm = (
entry.requested_sglang_kv_tokens is None and entry.profiled_gib > 0
)
# Per-GPU vLLM stagger — only between vLLM tests on the
# same GPU. Tests on different GPUs launch simultaneously.
if is_vllm:
last_t = last_vllm_launch.get(gi, 0)
wait = _VLLM_LAUNCH_STAGGER_S - (time.monotonic() - last_t)
if wait > 0:
time.sleep(wait)
gpu_states[gi].budget_used += entry.profiled_gib
gpu_states[gi].running_count += 1
run_info = _launch_test(entry, env_base)
running[w_id] = run_info
if is_vllm:
last_vllm_launch[gi] = time.monotonic()
retry_str = f" (retry {entry.retries})" if entry.retries else ""
_print(
f"[w{w_id}] {entry.name} "
f"(GPU{gi}, profiled={entry.profiled_gib:.1f} GiB, "
f"{_fmt_req(entry)}) RUNNING{retry_str}"
)
now = time.monotonic()
if now >= next_status and (running or pending):
parts = [_build_status(now)]
if pending:
queued_str = ", ".join(f"w{t.w_id}" for t in pending)
parts.append(f"[queued: {queued_str}]")
_print(" ".join(parts))
next_status = now + 10
# Periodic status (print even when waiting for VRAM to free up)
if now >= next_status and (running or pending):
parts = [_build_status(now)]
if pending:
queued_str = ", ".join(f"w{t.w_id}" for t in pending)
if not running:
next_needed = pending[0].profiled_gib
parts.append(f"[waiting for {next_needed:.1f} GiB free]")
parts.append(f"[queued: {queued_str}]")
_print(" ".join(parts))
next_status = now + 10
if running or pending:
time.sleep(1.0)
# Summary
wall_time = time.monotonic() - t0
sequential_time = sum(c.duration for c in completed if not c.skipped)
n_passed = sum(1 for c in completed if c.passed)
n_skipped = sum(1 for c in completed if c.skipped)
n_failed = sum(1 for c in completed if not c.passed and not c.skipped)
completed.sort(key=lambda c: c.test.w_id)
_print()
_print(f"{'=' * 27} short test summary info {'=' * 27}")
for c in completed:
test = c.test
w_id = test.w_id
if c.skipped:
reason = c.skip_reason or "skipped"
_print(f"SKIPPED [w{w_id}] {test.name} - {reason}")
elif c.passed:
duration = int(c.duration)
timeout = int(test.timeout)
retries = test.retries
retry_str = f" ({retries} retries)" if retries else ""
_print(
f"PASSED [w{w_id}] {test.name} " f"[{duration}s/{timeout}s]{retry_str}"
)
else:
duration = int(c.duration)
timeout = int(test.timeout)
retries = test.retries
retry_str = f" ({retries} retries)" if retries else ""
fail_str = f" - {c.fail_reason}" if c.fail_reason else ""
_print(
f"FAILED [w{w_id}] {test.name} "
f"[{duration}s/{timeout}s]{retry_str}{fail_str}"
)
n_summary_parts = []
if n_failed:
n_summary_parts.append(f"{n_failed} failed")
n_summary_parts.append(f"{n_passed} passed")
if n_skipped:
n_summary_parts.append(f"{n_skipped} skipped")
wall_int = int(wall_time)
h, remainder = divmod(wall_int, 3600)
m, s = divmod(remainder, 60)
time_str = f"{wall_time:.2f}s"
if h:
time_str += f" ({h}:{m:02d}:{s:02d})"
elif m:
time_str += f" ({m:01d}:{s:02d})"
summary = ", ".join(n_summary_parts) + f" in {time_str}"
if n_passed > 1 and sequential_time > 0:
speedup = sequential_time / wall_time
summary += f" (vs {sequential_time:.0f}s seq, {speedup:.1f}x)"
pad = max(0, (78 - len(summary) - 2) // 2)
_print(f"{'=' * pad} {summary} {'=' * pad}")
combined = _aggregate_junit_xml(_JUNIT_DIR)
if combined:
_print(f"JUnit XML: {combined}")
return 0 if n_failed == 0 else 1
# ---------------------------------------------------------------------------
# Standalone CLI
# ---------------------------------------------------------------------------
def main() -> int:
parser = argparse.ArgumentParser(
description="Run GPU tests in parallel with VRAM-aware scheduling.",
usage="%(prog)s --max-vram-gib=N [-n SLOTS] [--gpu=0,1] [pytest-args...]",
)
parser.add_argument(
"--max-vram-gib",
type=float,
required=True,
help="Only run tests with profiled_vram_gib <= N.",
)
parser.add_argument(
"-n",
type=str,
default="auto",
help="Number of concurrent slots. 'auto' = gpu_usable / max_vram_gib.",
)
parser.add_argument(
"--gpu",
"--gpus",
type=str,
default="all",
help="Comma-separated GPU indices or 'all' (default: all).",
)
raw = sys.argv[1:]
if "--" in raw:
split = raw.index("--")
args = parser.parse_args(raw[:split])
pytest_args = raw[split + 1 :]
else:
args, pytest_args = parser.parse_known_args(raw)
if not pytest_args:
parser.error("No pytest arguments provided")
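# Match "-s", "--capture=no", or a short-flag cluster such as "-sv"; a bare
# substring test would also match paths like "tests/multi-stage".
is_stream = any(
a in ("-s", "--capture=no")
or (a.startswith("-") and not a.startswith("--") and "s" in a[1:])
for a in pytest_args
)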
gpus = detect_gpus()
if not gpus:
_print("ERROR: No GPUs detected")
return 1
# argparse derives dest from the first long option ("--gpu"), so the value is args.gpu.
gpu_indices = _parse_gpu_indices(args.gpu, gpus)
_print(f"Collecting tests with --max-vram-gib={args.max_vram_gib}...")
test_ids = _collect_tests(pytest_args, args.max_vram_gib)
if not test_ids:
_print("No tests collected.")
return 0
meta = load_test_meta()
if args.n == "auto":
profiled_gibs = [
meta.get(tid, {}).get("profiled_vram_gib", args.max_vram_gib)
for tid in test_ids
]
selected_gpus = [g for g in gpus if g["index"] in gpu_indices]
num_slots = auto_worker_count(selected_gpus, args.max_vram_gib, profiled_gibs)
else:
num_slots = int(args.n)
return run_parallel(
test_ids=test_ids,
meta=meta,
max_vram_gib=args.max_vram_gib,
num_slots=num_slots,
gpu_indices=gpu_indices,
stream=is_stream,
)
if __name__ == "__main__":
sys.exit(main())
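The GPU selection step inside `run_parallel` can be sketched on its own. This is a simplified illustration under stated assumptions, not the runner's code: `GpuSnapshot` and `pick_gpu` are hypothetical names, and the free-memory figure is passed in as plain data instead of being queried from NVML.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class GpuSnapshot:
    """Hypothetical stand-in for the runner's per-GPU scheduling state."""

    total_gib: float
    budget_multi: float  # cap that applies once the GPU already hosts a test
    budget_used: float  # sum of profiled GiB for tests already placed
    free_gib: float  # actual free memory, supplied by the caller
    running_count: int


def pick_gpu(gpus: dict[int, GpuSnapshot], need_gib: float) -> int | None:
    """Return the GPU with the most remaining budget that can fit need_gib."""
    best_gi: int | None = None
    best_avail = -1.0
    for gi, g in gpus.items():
        # A GPU that already runs a test is held to the reduced multi-process cap.
        cap = g.budget_multi if g.running_count >= 1 else g.total_gib
        avail = cap - g.budget_used
        # Gate on both the tracked budget and the measured free memory.
        if avail < need_gib or g.free_gib < need_gib:
            continue
        if avail > best_avail:
            best_gi, best_avail = gi, avail
    return best_gi
```

Checking measured free memory in addition to the tracked budget guards against cases where memory is held by something the scheduler did not launch, or where a finished test has not yet released its allocation.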
......@@ -32,27 +32,27 @@ ALLOC_MIB = 4096 # 4 GiB
 @pytest.mark.gpu_1
 @pytest.mark.timeout(30)
 def test_mock_4gb_gpu_alloc():
-    """Allocate 4 GiB of GPU VRAM, hold 2s, release. Honors _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE."""
+    """Allocate 4 GiB of GPU VRAM, hold 2s, release. Honors _PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES."""
     if not torch.cuda.is_available():
         pytest.skip("CUDA not available")
     device = 0
     total_mib = torch.cuda.get_device_properties(device).total_memory / (1024 * 1024)
-    gpu_util = os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")
-    if gpu_util is not None:
-        cap_mib = total_mib * float(gpu_util)
+    kv_bytes_str = os.environ.get("_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES")
+    if kv_bytes_str is not None:
+        cap_mib = int(kv_bytes_str) / (1024 * 1024)
         logger.info(
-            "_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=%.2f -> cap %.0f MiB (%.1f GiB) of %.0f MiB total",
-            float(gpu_util),
+            "_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES=%s -> cap %.0f MiB (%.1f GiB) of %.0f MiB total",
+            kv_bytes_str,
             cap_mib,
             cap_mib / 1024,
             total_mib,
         )
         if ALLOC_MIB > cap_mib:
             raise RuntimeError(
-                f"Requested {ALLOC_MIB} MiB exceeds _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE "
-                f"cap of {cap_mib:.0f} MiB ({gpu_util})"
+                f"Requested {ALLOC_MIB} MiB exceeds KV cache cap "
+                f"of {cap_mib:.0f} MiB ({kv_bytes_str} bytes)"
             )
     num_elements = (ALLOC_MIB * 1024 * 1024) // 4
......
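To make the unit conversion in this hunk concrete, the sketch below reproduces the cap check outside pytest and without a GPU. `kv_cap_mib` and `fits_under_cap` are hypothetical names, and the 6 GiB and 2 GiB budgets are illustrative values, not profiled numbers.

```python
from __future__ import annotations

ALLOC_MIB = 4096  # 4 GiB, as in the test module
ENV_VAR = "_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES"


def kv_cap_mib(env: dict[str, str]) -> float | None:
    """Return the cap in MiB, or None when the override is unset."""
    raw = env.get(ENV_VAR)
    return None if raw is None else int(raw) / (1024 * 1024)


def fits_under_cap(alloc_mib: int, env: dict[str, str]) -> bool:
    """True when no cap is set or the allocation is within the cap."""
    cap = kv_cap_mib(env)
    return cap is None or alloc_mib <= cap


# Assumed budgets for illustration: the override is a byte count, not a fraction.
six_gib = {ENV_VAR: str(6 * 1024**3)}
two_gib = {ENV_VAR: str(2 * 1024**3)}

# Number of 4-byte elements (for example float32) needed to occupy ALLOC_MIB.
num_elements = (ALLOC_MIB * 1024 * 1024) // 4
```

Because the override is an absolute byte count, the cap no longer depends on the total VRAM of whichever GPU the test lands on, which is what the removed fraction-based variable did.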
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""GPU VRAM utilities for parallel test execution.

Functions:
    detect_gpus()                            Enumerate GPUs via pynvml
    auto_worker_count(gpus, limit)           Calculate slot count for -n auto
    write_test_meta(items)                   Serialize profiled/requested VRAM + timeout
    load_test_meta()                         Read the serialized test metadata
    print_gpu_plan(gpus, limit, would_run)   Dry-run GPU plan summary

Usage:
    # Sequential (filter only)
    pytest --max-vram-gib=10 -m "gpu_1 and vllm" tests/serve/

    # Parallel (VRAM-aware scheduling)
    pytest --max-vram-gib=10 -n auto -m "gpu_1 and vllm" tests/serve/
"""

from __future__ import annotations

import json
import logging
import os
import tempfile

import pynvml

_logger = logging.getLogger(__name__)

# When 2+ tests run concurrently, reserve 15% of GPU VRAM for CUDA context
# overhead across processes. A single test gets the full GPU (0% margin).
VRAM_MULTI_PROC_MARGIN = 0.15

_TEST_META_FILENAME = "pytest_gpu_parallel_test_meta.json"


def detect_gpus() -> list[dict]:
    """Return list of dicts with 'index', 'name', 'total_mib' per GPU.

    Uses pynvml (already a dependency via profile_pytest.py).
    Returns an empty list if NVML cannot be initialized (for example, no
    NVIDIA driver) or if no GPUs are present. pynvml itself is imported at
    module level, so it must be installed.
    """
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return []
    try:
        count = pynvml.nvmlDeviceGetCount()
        gpus = []
        for i in range(count):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            name = pynvml.nvmlDeviceGetName(handle)
            if isinstance(name, bytes):  # older pynvml releases return bytes
                name = name.decode()
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            gpus.append(
                {
                    "index": i,
                    "name": name,
                    "total_mib": mem.total // (1024 * 1024),
                }
            )
        return gpus
    finally:
        pynvml.nvmlShutdown()


def auto_worker_count(
    gpus: list[dict],
    vram_limit: float,
    test_profiled_gibs: list[float] | None = None,
) -> int:
    """Calculate the total slot count (summed over all GPUs) for -n auto.

    Uses the smallest profiled test size (if provided) to maximize parallelism.
    Falls back to vram_limit when no test sizes are available. The per-GPU
    budget is the smallest GPU's VRAM reduced by VRAM_MULTI_PROC_MARGIN.
    """
    if not gpus or vram_limit <= 0:
        return len(gpus) or 1
    min_gpu_gib = min(g["total_mib"] for g in gpus) / 1024.0
    budget_gib = min_gpu_gib * (1.0 - VRAM_MULTI_PROC_MARGIN)
    divisor = vram_limit
    if test_profiled_gibs:
        nonzero = [g for g in test_profiled_gibs if g > 0]
        if nonzero:
            divisor = min(nonzero)
    workers_per_gpu = max(1, int(budget_gib / divisor)) if divisor > 0 else 1
    return len(gpus) * workers_per_gpu


def write_test_meta(items, dest_dir: str | None = None) -> None:
    """Serialize profiled_vram_gib, timeout, and KV cache markers to JSON.

    Called from pytest_collection_modifyitems so the GPU orchestrator can
    read test metadata without re-collecting.
    """
    test_meta: dict[str, dict] = {}
    for item in items:
        meta: dict = {}
        profiled_mark = item.get_closest_marker("profiled_vram_gib")
        if profiled_mark and profiled_mark.args:
            meta["profiled_vram_gib"] = profiled_mark.args[0]
        kv_bytes_mark = item.get_closest_marker("requested_vllm_kv_cache_bytes")
        if kv_bytes_mark and kv_bytes_mark.args:
            meta["requested_vllm_kv_cache_bytes"] = kv_bytes_mark.args[0]
        timeout_mark = item.get_closest_marker("timeout")
        if timeout_mark and timeout_mark.args:
            meta["timeout"] = timeout_mark.args[0]
        kv_tokens_mark = item.get_closest_marker("requested_sglang_kv_tokens")
        if kv_tokens_mark and kv_tokens_mark.args:
            meta["requested_sglang_kv_tokens"] = kv_tokens_mark.args[0]
        skip_mark = item.get_closest_marker("skip")
        if skip_mark:
            reason = skip_mark.kwargs.get("reason", "")
            if not reason and skip_mark.args:
                reason = skip_mark.args[0]
            meta["skip_reason"] = reason or "skipped"
        if meta:
            test_meta[item.nodeid] = meta
    if test_meta:
        path = os.path.join(dest_dir or tempfile.gettempdir(), _TEST_META_FILENAME)
        with open(path, "w") as f:
            json.dump(test_meta, f)


def load_test_meta() -> dict[str, dict]:
    """Load the nodeid -> {profiled_vram_gib, timeout, ...} map."""
    path = os.path.join(tempfile.gettempdir(), _TEST_META_FILENAME)
    try:
        with open(path) as f:
            return json.load(f)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}


def print_gpu_plan(
    gpus: list[dict], vram_limit: float, would_run: list[tuple[str, float]]
) -> None:
    """Print the GPU-parallel plan section for --dry-run output."""
    min_gpu_gib = min(g["total_mib"] for g in gpus) / 1024.0
    budget_gib = min_gpu_gib * (1.0 - VRAM_MULTI_PROC_MARGIN)
    profiled_gibs = [gib for _, gib in would_run if gib is not None and gib > 0]
    min_test_gib = min(profiled_gibs) if profiled_gibs else vram_limit
    auto_slots = max(1, int(budget_gib / min_test_gib)) if min_test_gib > 0 else 1
    print(f"\n{'=' * 60}")
    print("GPU-Parallel Plan")
    print(f"{'=' * 60}")
    for gpu in gpus:
        gib = gpu["total_mib"] / 1024
        print(f"  GPU {gpu['index']}: {gpu['name']} ({gib:.1f} GiB)")
    print(f"\n  Usable VRAM: {budget_gib:.0f} GiB")
    print("\n  Run options:")
    print("    (no -n)  : sequential, 1 test at a time")
    print(
        f"    -n auto  : up to {auto_slots} slots per GPU "
        f"({budget_gib:.0f} / {min_test_gib:.0f} GiB smallest test)"
    )
    print(f"    -n N     : N concurrent slots across {len(gpus)} GPU(s)")
    print("\n  Usage:")
    print(
        f"    pytest --max-vram-gib={vram_limit:.0f} -n {auto_slots} "
        f'-m "gpu_1 and vllm" tests/serve/'
    )
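As a worked example of the slot arithmetic shared by `auto_worker_count` and `print_gpu_plan`, the sketch below re-implements the formula as a standalone function. The GPU size (80 GiB), the GPU count, and the test sizes are assumed values chosen for illustration, and `slots_for` is a hypothetical name.

```python
VRAM_MULTI_PROC_MARGIN = 0.15  # same constant as the module above


def slots_for(gpu_total_mib: list[int], smallest_test_gib: float) -> tuple[int, int]:
    """Return (slots_per_gpu, total_slots) for the given GPU sizes in MiB."""
    # The smallest GPU sets the budget so every GPU can host the same slot count.
    min_gpu_gib = min(gpu_total_mib) / 1024.0
    budget_gib = min_gpu_gib * (1.0 - VRAM_MULTI_PROC_MARGIN)
    # Truncate toward zero, but never schedule fewer than one slot per GPU.
    per_gpu = max(1, int(budget_gib / smallest_test_gib))
    return per_gpu, per_gpu * len(gpu_total_mib)


# Two assumed 80 GiB GPUs: budget is roughly 80 * 0.85 = 68 GiB per GPU,
# and 68 / 10 truncates to 6 slots per GPU.
per_gpu, total = slots_for([81920, 81920], smallest_test_gib=10.0)

# A test larger than the budget still gets one slot per GPU.
big_per_gpu, big_total = slots_for([81920], smallest_test_gib=100.0)
```

Note that the divisor is the smallest profiled test, so the slot count is an upper bound on concurrency. Larger tests in the same run presumably rely on the scheduler to keep combined usage within the budget.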