Commit 6dc85fbc authored by Keiven C, committed by GitHub

feat: GPU-parallel test runner with VRAM-aware scheduling (#7560)


Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
parent 4ea21079
......@@ -20,9 +20,17 @@ MODEL="${MODEL:-Qwen/Qwen3-VL-8B-Instruct}"
NAMESPACE="${NAMESPACE:-dynamo}"
HTTP_PORT="${HTTP_PORT:-8000}"
BLOCK_SIZE="${BLOCK_SIZE:-16}" # Must match vLLM backend KV block size
GPU_MEMORY_UTILIZATION="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-${GPU_MEMORY_UTILIZATION:-0.85}}"
GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.85}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
# KV cache override for parallel-safe GPU memory control
KV_BYTES="${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"
if [[ -n "$KV_BYTES" ]]; then
GPU_MEM_ARGS="--kv-cache-memory-bytes $KV_BYTES --gpu-memory-utilization 0.01"
else
GPU_MEM_ARGS="--gpu-memory-utilization ${GPU_MEMORY_UTILIZATION}"
fi
NATS_SERVER="${NATS_SERVER:-nats://127.0.0.1:4222}"
ETCD_ENDPOINTS="${ETCD_ENDPOINTS:-http://127.0.0.1:2379}"
......@@ -121,7 +129,7 @@ env "${COMMON_ENV[@]}" \
--enable-multimodal \
--block-size "${BLOCK_SIZE}" \
--enforce-eager \
--gpu-memory-utilization "${GPU_MEMORY_UTILIZATION}" \
$GPU_MEM_ARGS \
--max-model-len "${MAX_MODEL_LEN}" \
--served-model-name "${MODEL}__internal" \
${VLLM_EXTRA_ARGS} &
......
# GPU Memory Parameters by Engine
# GPU Memory Control
How vLLM, sglang, and TensorRT-LLM interpret memory-related parameters, and how
to estimate total GPU VRAM usage for each.
How vLLM, SGLang, and TensorRT-LLM allocate GPU memory, and how we override
it for deterministic parallel test execution.
---
## Quick Reference
| Parameter | vLLM | sglang | TensorRT-LLM |
|---|---|---|---|
| Memory fraction | `--gpu-memory-utilization` | `--mem-fraction-static` | `free_gpu_memory_fraction` (YAML/override) |
| Fraction base | Total VRAM | Total VRAM | Free VRAM (after model load) |
| Default fraction | 0.90 | 0.90 | 0.90 |
| Max sequence length | `--max-model-len` | `--context-length` | `max_seq_len` (YAML/override) |
| KV cache size override | `--kv-cache-memory-bytes` | N/A | `max_gpu_total_bytes` (broken in 1.3.0rc5) |
---
## 1. vLLM
### How `--gpu-memory-utilization` works
This is a fraction of **total** GPU VRAM. The engine budgets everything within
this limit:
```
budget = total_vram * gpu_memory_utilization
KV cache = budget - model_weights - peak_activations - framework_overhead
```
At startup, vLLM profiles actual model weight and activation memory, then
pre-allocates the remaining budget as KV cache blocks. The KV pool size is fixed
for the lifetime of the engine.
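The budget arithmetic above can be sketched in Python. This is a minimal illustration of the formula, not vLLM's actual profiling code; the function name and the example sizes (16 GiB of weights, 2 GiB of activations, 1 GiB of overhead) are assumptions chosen for the arithmetic.

```python
GIB = 1024**3


def vllm_kv_cache_bytes(
    total_vram: int,
    gpu_memory_utilization: float,
    model_weights: int,
    peak_activations: int,
    framework_overhead: int,
) -> int:
    """Bytes left for KV cache once everything else is charged to the budget."""
    budget = total_vram * gpu_memory_utilization
    kv_cache = budget - model_weights - peak_activations - framework_overhead
    if kv_cache <= 0:
        raise ValueError("budget too small: nothing left for KV cache")
    return int(kv_cache)


# Illustrative numbers only (hypothetical model on a 48 GiB GPU).
kv = vllm_kv_cache_bytes(48 * GIB, 0.90, 16 * GIB, 2 * GIB, 1 * GIB)
print(f"KV cache: {kv / GIB:.1f} GiB")
```

Because every term is subtracted from a single budget, anything that grows the weights or activations shrinks the KV pool by the same amount.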
### How `--max-model-len` works
Sets the maximum total sequence length (input + output tokens). Longer sequences
require more KV cache per request. If the requested `max-model-len` needs more
KV cache than the budget allows, vLLM errors at startup:
```
ValueError: ... X GiB KV cache is needed, which is larger than the available
KV cache memory (Y GiB). ...
```
Reducing `--max-model-len` is the most effective way to reduce VRAM when the
model fits but the KV cache doesn't.
### How `--kv-cache-memory-bytes` works
When set, this overrides the automatic KV cache sizing from
`gpu-memory-utilization`. The engine allocates exactly this many bytes for KV
cache regardless of the fraction. This means `gpu-memory-utilization` still
controls the *overall* VRAM budget (and thus whether the model fits), but the
KV cache portion is pinned to the explicit byte value.
Consequence for profiling: if a script uses `--kv-cache-memory-bytes`,
changing `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` (which maps to
`--gpu-memory-utilization`) won't change the KV cache size, only the leftover
headroom for activations and overhead.
### Estimating total GPU usage
```
total_vram ≈ model_weights + kv_cache + activations + overhead
model_weights ≈ num_params * bytes_per_param
(e.g. 7B * 2 bytes for BF16 = 14 GB ≈ 13 GiB)
## Why absolute caps, not fractions
kv_cache_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
(the factor of 2 is for K and V tensors)
Memory fractions (`--gpu-memory-utilization`, `--mem-fraction-static`) are
unreliable for parallel / CI workloads:
kv_cache_total = kv_cache_per_token * max_model_len * max_concurrent_seqs
- **Non-deterministic** — same fraction produces different KV cache sizes
depending on what else is on the GPU at init time.
- **Profiling race** — concurrent engines each see "nearly all memory free",
allocate based on that, and OOM.
- **Not portable** — a fraction tuned for 48 GiB is wrong on 24 or 80 GiB.
- **Different semantics** — vLLM/SGLang use fraction of *total* VRAM;
TensorRT-LLM uses fraction of *free* VRAM after model load.
overhead ≈ engine-dependent (auto-computed by estimate_worker_vram):
vllm: 1.2 + 1.0 * sqrt(params_b) GiB (0.6B≈2.0, 8B≈4.0)
sglang: 1.5 + 1.0 * sqrt(params_b) GiB (0.6B≈2.3, 8B≈4.3)
trtllm: 2.0 + 1.2 * sqrt(params_b) GiB (0.6B≈2.9, 8B≈5.4)
```
Instead, we use **absolute KV cache caps**:
Rule of thumb: set `gpu-memory-utilization` so that
`total_vram * fraction >= model_weights + 2 GiB`. The rest becomes KV cache.
| Engine | Deterministic override | Env var |
|--------|----------------------|---------|
| vLLM | `--kv-cache-memory-bytes N` | `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` |
| SGLang | `--max-total-tokens N` | `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` |
| TensorRT-LLM | *(future TODO)* | — |
---
## 2. sglang
### How `--mem-fraction-static` works
Like vLLM, this is a fraction of **total** GPU VRAM:
```
budget = total_vram * mem_fraction_static
KV cache pool = budget - model_weights
```
The budget covers model weights and the KV cache pool. Activations and CUDA
graph buffers are allocated *outside* this budget from the remaining VRAM.
This is slightly different from vLLM (which includes activations in the budget).
sglang recommends keeping 5-8 GiB free for activations and overhead. If you
see OOM errors, decrease `--mem-fraction-static` by 0.01-0.05 increments.
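Because activations and CUDA graph buffers come from the memory outside the budget, the headroom left for them follows directly from the fraction. A small Python sketch of that arithmetic, using hypothetical function names and an assumed 6 GiB of model weights:

```python
def sglang_kv_pool_gib(total_gib: float, mem_fraction_static: float, weights_gib: float) -> float:
    """KV cache pool: the static budget minus model weights."""
    return total_gib * mem_fraction_static - weights_gib


def sglang_headroom_gib(total_gib: float, mem_fraction_static: float) -> float:
    """Memory outside the static budget, left for activations and CUDA graphs."""
    return total_gib * (1.0 - mem_fraction_static)


# Assumed example: 48 GiB GPU, 6 GiB of weights (illustrative values).
for frac in (0.95, 0.90, 0.85):
    pool = sglang_kv_pool_gib(48, frac, 6)
    headroom = sglang_headroom_gib(48, frac)
    print(f"fraction={frac:.2f}  kv_pool={pool:.1f} GiB  headroom={headroom:.1f} GiB")
```

On a 48 GiB GPU, a fraction of 0.95 leaves about 2.4 GiB outside the budget, below the 5-8 GiB recommended above, while 0.85 leaves about 7.2 GiB.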
### How `--context-length` and `--max-running-requests` work
Unlike vLLM (where `--max-model-len` directly affects KV cache sizing), sglang's
`--context-length` and `--max-running-requests` do **not** affect KV cache
allocation. The KV cache pool is sized entirely from `--mem-fraction-static`:
```
kv_cache_pool = total_vram * mem_fraction_static - model_weights
```
Profiling confirmed this: changing `--context-length` from 512 to 40960 produced
identical `max_total_num_tokens` values (269,136 on a 48 GiB GPU at fraction 0.95).
These flags only affect **request scheduling**:
- `--context-length` caps the per-request token usage from the KV pool
- `--max-running-requests` limits concurrent request slots (allocated from
memory outside the `--mem-fraction-static` budget)
Setting `--max-running-requests` too high at high fractions can cause OOM because
the request slot pool competes for the small amount of memory left after KV cache
allocation.
### Estimating total GPU usage
```
total_vram ≈ model_weights + kv_cache_pool + activations_and_overhead
kv_cache_pool = total_vram * mem_fraction_static - model_weights
## Quick Reference
activations_and_overhead ≈ 1-2 GiB for small models (0.6B-4B)
~3-5 GiB for larger models (7B+)
(CUDA context, graphs, request pools — allocated outside mem_fraction_static)
```
| | vLLM | SGLang | TensorRT-LLM |
|---|---|---|---|
| Fraction flag | `--gpu-memory-utilization` | `--mem-fraction-static` | `free_gpu_memory_fraction` |
| Fraction base | Total VRAM | Total VRAM | Free VRAM (post-load) |
| Default | 0.90 | 0.90 | 0.90 |
| Max seq len | `--max-model-len` | `--context-length` | `max_seq_len` |
| KV cache override | `--kv-cache-memory-bytes` | `--max-total-tokens` | *(broken in 1.3.0rc5)* |
---
## 3. TensorRT-LLM
### How `free_gpu_memory_fraction` works
This is a fraction of **free** VRAM (not total). The engine:
1. Loads model weights and builds the TRT engine (fixed cost).
2. Queries remaining free GPU memory.
3. Allocates `free_memory * free_gpu_memory_fraction` for the KV cache pool.
```
kv_cache = free_vram_after_model_load * free_gpu_memory_fraction
```
This means the same fraction yields different absolute KV cache sizes depending
on how much VRAM the model consumed. A 5 GiB model on a 48 GiB GPU leaves
~43 GiB free; fraction=0.24 gives ~10 GiB KV cache. A 30 GiB model leaves
~18 GiB free; fraction=0.24 gives only ~4 GiB.
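The two cases above can be reproduced with a few lines of Python. This is a sketch of the arithmetic only; the function name is hypothetical, and like the prose example it ignores engine overhead when computing free memory.

```python
def trtllm_kv_cache_gib(total_gib: float, model_gib: float, free_fraction: float) -> float:
    """KV cache size when the fraction applies to VRAM left after model load."""
    free_after_load = total_gib - model_gib
    return free_after_load * free_fraction


# Same fraction, different model sizes, on a 48 GiB GPU.
small_model = trtllm_kv_cache_gib(48, 5, 0.24)
large_model = trtllm_kv_cache_gib(48, 30, 0.24)
print(f"5 GiB model:  {small_model:.2f} GiB KV cache")
print(f"30 GiB model: {large_model:.2f} GiB KV cache")
```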
Set via YAML config, CLI, or env var:
```bash
--override-engine-args '{"kv_cache_config":{"free_gpu_memory_fraction": 0.24}}'
DYN_TRTLLM_OVERRIDE_ENGINE_ARGS='{"kv_cache_config":{"free_gpu_memory_fraction": 0.24}}'
```
### How `max_seq_len` works
Maximum total sequence length. Defaults to the model's native context.
Sequences exceeding this limit are rejected at runtime.
**VRAM impact: none (PyTorch backend).** Reducing max_seq_len from 40960 to
2048 had zero effect on total VRAM or KV cache size in testing (Qwen3-0.6B,
trtllm 1.3.0rc5). The PyTorch backend does not pre-allocate internal buffers
proportional to max_seq_len; KV cache size is determined solely by
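`free_gpu_memory_fraction`. This differs from vLLM, where reducing
`--max-model-len` lowers the KV cache needed at startup; SGLang's
`--context-length`, like `max_seq_len` here, leaves KV cache allocation unchanged.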
Override via:
```bash
--override-engine-args '{"max_seq_len": 4096}'
```
## Per-Engine Notes
### Override gotcha: sub-dict replacement
### vLLM
Overriding any field inside `kv_cache_config` **replaces the entire sub-dict**.
If your YAML has `enable_block_reuse: true` and you override only
`free_gpu_memory_fraction`, you lose `enable_block_reuse`. Always re-include
all fields you need:
`--gpu-memory-utilization` sets a budget as fraction of total VRAM.
KV cache = budget - weights - activations - overhead. Pool is fixed at startup.
```json
{"kv_cache_config": {"free_gpu_memory_fraction": 0.15, "enable_block_reuse": true}}
```
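The behavior described here matches a top-level dictionary update rather than a recursive merge. The Python sketch below contrasts the two; both function names are hypothetical helpers written for illustration, not Dynamo or TensorRT-LLM APIs, and the 0.85 base fraction is an assumed YAML value.

```python
import copy


def shallow_override(base: dict, override: dict) -> dict:
    """Top-level update: an overridden sub-dict replaces the original wholesale."""
    merged = copy.deepcopy(base)
    merged.update(override)
    return merged


def deep_merge(base: dict, override: dict) -> dict:
    """Recursive merge, shown only for contrast: untouched nested keys survive."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


yaml_config = {
    "kv_cache_config": {"free_gpu_memory_fraction": 0.85, "enable_block_reuse": True}
}
override = {"kv_cache_config": {"free_gpu_memory_fraction": 0.15}}

replaced = shallow_override(yaml_config, override)
merged = deep_merge(yaml_config, override)
print("shallow:", replaced["kv_cache_config"])
print("deep:   ", merged["kv_cache_config"])
```

Under the shallow semantics, `enable_block_reuse` disappears unless the override repeats it, which is why the JSON above re-includes the field.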
`--kv-cache-memory-bytes` overrides automatic sizing and **skips memory
profiling** ([PR #21489]). The KV cache is pinned to the exact byte value —
no profiling race, no CUDAGraph estimation errors, safe for concurrent
instances ([#10643]). When set, `--gpu-memory-utilization` only affects
headroom for activations, not KV cache size.
### How `max_num_tokens` works
`--max-model-len` caps sequence length. Reducing it is the fastest way to
cut VRAM when the model fits but KV cache doesn't.
Maximum batched input tokens per iteration. Primarily a throughput knob.
[PR #21489]: https://github.com/vllm-project/vllm/pull/21489
[#10643]: https://github.com/vllm-project/vllm/issues/10643
**VRAM impact: none.** Reducing from 8192 → 256 had no measurable effect on
total VRAM (41,643 vs 41,465 MiB — within noise; the slight *increase* is
because smaller activation footprint lets the fraction claim marginally more
KV cache).
### SGLang
### `max_gpu_total_bytes` (broken)
`--mem-fraction-static` sets a budget as fraction of total VRAM.
KV cache pool = budget - weights. Activations and CUDA graph buffers are
*outside* this budget (unlike vLLM).
Intended as an absolute byte cap for KV cache. As of trtllm 1.3.0rc5, this
field is **ignored**. Setting a 5 GiB cap with `free_gpu_memory_fraction=0.95`
still allocated ~42 GiB of KV cache. Setting `free_gpu_memory_fraction=0.0`
with only `max_gpu_total_bytes` causes `"Impossible to fit any sequence in
kvCache"`. Do not rely on this field.
`--max-total-tokens` caps the KV token pool directly, regardless of fraction.
When set, the token cap is the binding constraint.
### Override precedence
`--context-length` and `--max-running-requests` affect request scheduling
only — they do **not** change KV cache allocation.
```
--override-engine-args JSON > --extra-engine-args YAML > CLI flags
```
### TensorRT-LLM
The `DYN_TRTLLM_OVERRIDE_ENGINE_ARGS` env var is equivalent to
`--override-engine-args` and avoids shell quoting issues with scripts whose
arg parsers consume unknown flags before passing `"$@"`.
`free_gpu_memory_fraction` is a fraction of **free** VRAM after model load.
Set via YAML or `--override-engine-args '{"kv_cache_config":{"free_gpu_memory_fraction": 0.24}}'`.
### Estimating total GPU usage
```
total_vram ≈ model_weights + engine_overhead + kv_cache
model_weights ≈ num_params * bytes_per_param / tensor_parallel_size
engine_overhead ≈ 2.0 + 1.2 * sqrt(params_b) GiB (CUDA context + TRT buffers + activations)
kv_cache = free_vram_after_model_load * free_gpu_memory_fraction
```
Engine overhead is auto-computed by `estimate_worker_vram` when called with the
`trtllm` engine name. Examples: 0.6B → 2.9 GiB, 8B → 5.4 GiB, 30B → 8.6 GiB.
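The overhead figures quoted for each engine follow from the square-root formulas in this document. A short Python sketch, assuming only those published coefficients; `engine_overhead_gib` is a hypothetical name, not the signature of `estimate_worker_vram`:

```python
import math

# (base GiB, scale) per engine, taken from the overhead formulas in this document.
OVERHEAD_COEFFS = {
    "vllm": (1.2, 1.0),
    "sglang": (1.5, 1.0),
    "trtllm": (2.0, 1.2),
}


def engine_overhead_gib(engine: str, params_b: float) -> float:
    """Estimated overhead in GiB for a model with params_b billion parameters."""
    base, scale = OVERHEAD_COEFFS[engine]
    return base + scale * math.sqrt(params_b)


for params_b in (0.6, 8, 30):
    print(f"trtllm {params_b}B: {engine_overhead_gib('trtllm', params_b):.2f} GiB")
```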
### Empirical validation (Qwen3-0.6B, RTX 6000 Ada 48 GiB, trtllm 1.3.0rc5)
Controlled test: single worker via agg.sh, one override at a time.
| # | Override | Total VRAM | KV Cache | Tokens |
|---|---------|-----------|----------|--------|
| 1 | Baseline (YAML frac=0.85) | 41,465 MiB | 38.04 GiB | 356,160 |
| 2 | `free_gpu_memory_fraction=0.15` | 9,383 MiB | 6.71 GiB | 62,848 |
| 3 | `max_num_tokens=256` | 41,643 MiB | 38.26 GiB | 358,208 |
| 4 | `max_seq_len=4096` | 41,469 MiB | 38.05 GiB | 356,192 |
| 5 | `max_seq_len=2048` | 41,469 MiB | 38.05 GiB | 356,192 |
| 6 | seq=4096 + frac=0.15 | 9,383 MiB | 6.71 GiB | 62,848 |
| 7 | tokens=256 + seq=4096 + frac=0.15 | 9,377 MiB | 6.75 GiB | 63,200 |
**Conclusion:** `free_gpu_memory_fraction` is the **sole effective knob** for
trtllm VRAM control. Neither `max_seq_len` nor `max_num_tokens` reduces memory.
Combined overrides (test 7) produce no additional benefit over fraction alone
(test 2).
Deterministic KV cache control via `build_gpu_mem_args` is a future TODO.
---
## Why vLLM/sglang fractions are NOT interchangeable with TensorRT-LLM
Consider wanting 10 GiB of KV cache on a 48 GiB GPU with a 5 GiB model:
| Engine | Fraction meaning | Calculation | Result |
|---|---|---|---|
| vLLM | 10/48 = 0.21 of total | `48 * 0.21 = 10 GiB` budget (minus model = 5 GiB KV) | Wrong — need higher fraction |
| sglang | Same as vLLM | Same math | Same problem |
| TensorRT-LLM | 10/43 = 0.23 of free | `43 * 0.23 = 10 GiB` KV cache | Correct |
For vLLM/sglang, you actually need `(model + kv) / total = (5 + 10) / 48 = 0.31`
to get 10 GiB of KV cache with a 5 GiB model.
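The two conversions can be written side by side. This Python sketch mirrors the arithmetic in the table; the function names are hypothetical and are not the shell helpers in `gpu_utils.sh`.

```python
def total_fraction(model_gib: float, kv_gib: float, total_gib: float) -> float:
    """vLLM/SGLang: the fraction of total VRAM must cover weights plus KV cache."""
    return (model_gib + kv_gib) / total_gib


def free_fraction(model_gib: float, kv_gib: float, total_gib: float) -> float:
    """TensorRT-LLM: the fraction applies only to VRAM left after model load."""
    return kv_gib / (total_gib - model_gib)


# Target: 10 GiB of KV cache, 5 GiB model, 48 GiB GPU.
vllm_frac = total_fraction(5, 10, 48)
trtllm_frac = free_fraction(5, 10, 48)
print(f"vLLM/SGLang fraction: {vllm_frac:.4f}")
print(f"TensorRT-LLM fraction: {trtllm_frac:.4f}")
```

The same 10 GiB target therefore needs roughly 0.31 on vLLM or SGLang and roughly 0.23 on TensorRT-LLM.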
## `build_gpu_mem_args` and Env Vars
The helper functions in `gpu_utils.sh` handle these differences:
- `gpu_gb_to_total_fraction`: for vLLM/sglang (fraction of total VRAM)
- `gpu_gb_to_free_fraction`: for TensorRT-LLM (fraction of free VRAM)
- `gpu_worker_fraction <engine> <total_gib> <kv_gib>`: converts estimated GiB
into the engine-appropriate fraction (total for vllm/sglang, free for trtllm).
Launch scripts use `build_gpu_mem_args` which calls these internally:
Launch scripts source `gpu_utils.sh` and call `build_gpu_mem_args` to pick
up env-var overrides during profiling and parallel execution:
```bash
GPU_MEM_FRACTION=$(build_gpu_mem_args trtllm --model "$MODEL" --max-model-len "$SEQ_LEN" --max-num-seqs "$CONCURRENCY")
```
---
## KV Cache Memory Per Token
source "$SCRIPT_DIR/../../../common/gpu_utils.sh"
The formula for KV cache memory per token is the same across all engines:
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
python -m dynamo.vllm --model "$MODEL" $GPU_MEM_ARGS &
```
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
GPU_MEM_ARGS=$(build_gpu_mem_args sglang)
python -m dynamo.sglang --model-path "$MODEL" $GPU_MEM_ARGS &
```
| Model | Layers | KV Heads | Head Dim | Dtype | Per Token |
|---|---|---|---|---|---|
| Qwen3-0.6B | 28 | 8 | 128 | BF16 | 112 KiB |
| Llama-3.1-8B | 32 | 8 | 128 | BF16 | 128 KiB |
| Llama-3.1-70B | 80 | 8 | 128 | BF16 | 320 KiB |
| Qwen2.5-VL-7B | 28 | 4 | 128 | BF16 | 56 KiB |
When the env var is set, `build_gpu_mem_args` returns the corresponding flag.
Otherwise it returns empty and the engine uses its default allocation.
To estimate KV cache for a given context length:
```
kv_cache_gib = kv_bytes_per_token * max_model_len * max_concurrent_seqs / (1024^3)
```
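As a check on the table, the per-token formula can be evaluated directly. A minimal Python sketch, assuming 2 bytes per element for BF16; the function names are illustrative only.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_element: int = 2) -> int:
    """KV cache bytes per token; the leading 2 covers the K and V tensors."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_element


def kv_cache_gib(per_token_bytes: int, max_model_len: int, max_concurrent_seqs: int) -> float:
    """Total KV cache in GiB for a context length and concurrency."""
    return per_token_bytes * max_model_len * max_concurrent_seqs / 1024**3


# (layers, KV heads, head dim) as listed in the table above.
models = {
    "Qwen3-0.6B": (28, 8, 128),
    "Llama-3.1-8B": (32, 8, 128),
    "Llama-3.1-70B": (80, 8, 128),
    "Qwen2.5-VL-7B": (28, 4, 128),
}
for name, dims in models.items():
    print(f"{name}: {kv_bytes_per_token(*dims) // 1024} KiB per token")
```

For example, at 128 KiB per token a single 8192-token sequence occupies exactly 1 GiB of KV cache.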
---
| Env var | Engine | CLI flag produced |
|---------|--------|-------------------|
| `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` | vLLM | `--kv-cache-memory-bytes N --gpu-memory-utilization 0.01` |
| `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` | SGLang | `--max-total-tokens N` |
## `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE`
For multi-worker single-GPU scripts, pass `--workers-per-gpu N` to divide
the allocation: `build_gpu_mem_args vllm --workers-per-gpu 2`.
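The env-var-to-flag mapping in the table can be modeled in a few lines of Python. This is a sketch of the documented behavior, not the shell implementation in `gpu_utils.sh`; the function name is hypothetical, and the integer division used for `--workers-per-gpu` is an assumption about how the allocation is split.

```python
import os
from typing import Mapping, Optional


def build_gpu_mem_args_sketch(
    engine: str,
    workers_per_gpu: int = 1,
    env: Optional[Mapping[str, str]] = None,
) -> list[str]:
    """Return engine flags for the override env var, or [] when it is unset."""
    env = os.environ if env is None else env
    if workers_per_gpu < 1:
        raise ValueError("workers_per_gpu must be >= 1")
    if engine == "vllm":
        raw = env.get("_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES", "")
        if raw:
            per_worker = int(raw) // workers_per_gpu  # assumption: even split
            return ["--kv-cache-memory-bytes", str(per_worker),
                    "--gpu-memory-utilization", "0.01"]
    elif engine == "sglang":
        raw = env.get("_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS", "")
        if raw:
            per_worker = int(raw) // workers_per_gpu  # assumption: even split
            return ["--max-total-tokens", str(per_worker)]
    return []  # unset or unsupported engine: the engine keeps its default allocation


example_env = {"_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES": str(2 * 1024**3)}
print(build_gpu_mem_args_sketch("vllm", env=example_env))
print(build_gpu_mem_args_sketch("vllm", workers_per_gpu=2, env=example_env))
```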
Environment variable used by Dynamo's VRAM profiler to binary-search the minimum
memory fraction a script needs.
**Profiler** (`profile_pytest.py`): binary-searches the KV cap to find the
minimum passing value, applies a 2x safety factor, outputs pytest markers
(`@pytest.mark.requested_vllm_kv_cache_bytes(N)` or
`@pytest.mark.requested_sglang_kv_tokens(N)`).
- Maps to `--gpu-memory-utilization` in vLLM and `--mem-fraction-static` in sglang.
- For TensorRT-LLM, maps to `kv_cache_config.free_gpu_memory_fraction` via
`--override-engine-args`.
- Launch scripts use `build_gpu_mem_args` to compute the default fraction;
the override bypasses the estimator and splits the raw value between workers.
- Scripts that use `--kv-cache-memory-bytes` (vLLM) bypass the fraction-based KV
cache sizing, making the profiler's fraction override ineffective for KV cache.
Those scripts should warn when `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` is set.
**Scheduler** (`pytest_parallel_gpu.py`): reads the markers at runtime and
sets the env var per-test. See `tests/README.md` for details.
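The profiler's search can be pictured as a standard binary search over the cap, followed by the safety factor. The Python sketch below is an illustration under stated assumptions, not the code in `profile_pytest.py`: it assumes a test that passes at some cap also passes at every larger cap, and `passes` is a hypothetical callback standing in for one test run.

```python
from typing import Callable


def find_min_passing(lo: int, hi: int, passes: Callable[[int], bool]) -> int:
    """Smallest cap in [lo, hi] for which passes(cap) is True (monotonic assumed)."""
    if not passes(hi):
        raise ValueError("test fails even at the upper bound")
    while lo < hi:
        mid = (lo + hi) // 2
        if passes(mid):
            hi = mid        # mid works, so the minimum is at or below mid
        else:
            lo = mid + 1    # mid fails, so the minimum is above mid
    return lo


def recommended_cap(min_passing: int, safety_factor: int = 2) -> int:
    """Apply the safety factor to the minimum passing cap."""
    return min_passing * safety_factor


# Stand-in for a real test run: pretend the test needs at least 1337 units.
minimum = find_min_passing(1, 10_000, lambda cap: cap >= 1337)
print(minimum, recommended_cap(minimum))
```

Each probe corresponds to a full test launch, so the logarithmic number of probes matters more here than in an in-memory search.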
......@@ -137,9 +137,9 @@ print_launch_banner() {
echo "Frontend: http://localhost:$_port"
local _seq_len="${MAX_MODEL_LEN:-${CONTEXT_LENGTH:-${MAX_SEQ_LEN:-}}}"
local _frac="${GPU_MEM_FRACTION:-}"
local _mem_args="${GPU_MEM_ARGS:-}"
[[ -n "$_seq_len" ]] && echo "Max seq len: $_seq_len"
[[ -n "$_frac" ]] && echo "GPU frac: $_frac"
[[ -n "$_mem_args" ]] && echo "GPU mem: $_mem_args"
for _line in "$@"; do
echo "$_line"
......
......@@ -93,10 +93,10 @@ python -m dynamo.frontend --http-port 8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill $GPU_MEM_ARGS &
# Wait for all background processes to complete
wait
......@@ -93,11 +93,11 @@ python -m dynamo.frontend --http-port 8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg $GPU_MEM_ARGS &
DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg $GPU_MEM_ARGS &
# Wait for all background processes to complete
wait
......@@ -19,10 +19,10 @@ python -m dynamo.frontend --http-port=8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE &
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill $GPU_MEM_ARGS &
# Wait for all background processes to complete
wait
......@@ -20,11 +20,11 @@ python -m dynamo.frontend --http-port=8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE &
DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg $GPU_MEM_ARGS &
DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg $GPU_MEM_ARGS &
# Wait for all background processes to complete
wait
......@@ -234,7 +234,10 @@ markers = [
"gpu_8: marks tests to run on 8GPUs",
"xpu_1: marks tests to run on XPU",
"xpu_2: marks tests to run on 2XPUs",
"max_vram_gib(N): peak VRAM in GiB (with 10% safety). Filter with --max-vram-gib=N",
# These 3 (profiled_vram_gib and requested_*) are used for parallel pytest executions:
"profiled_vram_gib(N): actual peak VRAM observed by nvidia-smi during profiling. Used for --max-vram-gib filtering and scheduler budget tracking",
"requested_vllm_kv_cache_bytes(N): exact KV cache bytes for vLLM (skips memory profiling). Sets _PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES. Most deterministic method for parallel execution",
"requested_sglang_kv_tokens(N): max KV cache tokens for SGLang parallel execution. Sets _PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS to cap --max-total-tokens and prevent over-allocation",
"e2e: marks tests as end-to-end tests",
"integration: marks tests as integration tests",
"unit: marks tests as unit tests",
......
......@@ -25,6 +25,11 @@ from tests.utils.test_output import resolve_test_output_path
_logger = logging.getLogger(__name__)
# Typed stash keys for GPU-parallel config (avoids setting unknown attrs on Config)
_gpu_parallel_gpus_key: pytest.StashKey[list[dict]] = pytest.StashKey()
_gpu_indices_key: pytest.StashKey[list[int] | None] = pytest.StashKey()
_gpu_slots_key: pytest.StashKey[int | None] = pytest.StashKey()
def pytest_addoption(parser: pytest.Parser) -> None:
"""Add shared command-line options for all tests.
......@@ -59,7 +64,18 @@ def pytest_addoption(parser: pytest.Parser) -> None:
"--max-vram-gib",
type=float,
default=None,
help="Skip tests whose @pytest.mark.max_vram_gib(N) exceeds this value (GiB).",
help="Only run tests with @pytest.mark.profiled_vram_gib(N) that fit in N GiB. "
"Without -n: runs tests sequentially. "
"With -n N: runs N tests concurrently as subprocesses with VRAM-aware scheduling. "
"With -n auto: calculates max concurrent slots from GPU VRAM / max_vram_gib.",
)
parser.addoption(
"--gpus",
"--gpu",
type=str,
default="all",
help="Comma-separated GPU indices or 'all' (default: all). "
"Controls which GPUs the parallel test runner distributes tests across.",
)
parser.addoption(
"--dry-run",
......@@ -79,6 +95,130 @@ logging.basicConfig(
)
# ---------------------------------------------------------------------------
# GPU-serial and GPU-parallel: VRAM-aware test scheduling
#
# Parallel mode is activated only when both --max-vram-gib and -n (N or auto) are passed, e.g.:
# pytest --max-vram-gib=48 -n auto -m "gpu_1 and sglang" tests/serve/
# ---------------------------------------------------------------------------
def pytest_configure(config: pytest.Config) -> None:
"""Detect GPUs for --max-vram-gib planning and parallel execution."""
vram_limit = config.getoption("max_vram_gib", default=None)
if vram_limit is None:
return
# Delayed: vram_utils requires pynvml, otherwise conftest fails to load
# on CPU-only CI runners (e.g. ARM deploy tests) that lack nvidia-ml-py.
from tests.utils.pytest_parallel_gpu import _parse_gpu_indices
from tests.utils.vram_utils import auto_worker_count, detect_gpus
gpus = detect_gpus()
if gpus:
config.stash[_gpu_parallel_gpus_key] = gpus
# Parse --gpus into a list of indices (or None for all)
gpus_raw = config.getoption("gpus", default="all")
if gpus_raw and gpus_raw.strip().lower() != "all":
config.stash[_gpu_indices_key] = _parse_gpu_indices(gpus_raw, gpus)
selected_gpus = [
g for g in gpus if g["index"] in config.stash[_gpu_indices_key]
]
else:
config.stash[_gpu_indices_key] = None # all GPUs
selected_gpus = gpus
# If -n is set with --max-vram-gib, save the slot count and disable xdist
# so our subprocess orchestrator handles parallelism instead.
# xdist's pytest_configure(trylast=True) checks _is_distribution_mode()
# which reads dist/tx (not numprocesses), so we must also clear dist.
numproc = config.getoption("numprocesses", default=None)
if numproc is not None and numproc != 0:
if isinstance(numproc, str) or numproc == -1:
config.stash[_gpu_slots_key] = (
auto_worker_count(selected_gpus, vram_limit) if selected_gpus else 1
)
else:
config.stash[_gpu_slots_key] = int(numproc)
config.option.numprocesses = 0
config.option.dist = "no"
@pytest.hookimpl(tryfirst=True)
def pytest_runtestloop(session: pytest.Session) -> bool | None:
"""Intercept the test loop for GPU-parallel execution.
When --max-vram-gib and -n are both present, run tests as independent
subprocesses via the GPU orchestrator instead of the normal pytest loop.
Must run before the default pytest loop (tryfirst) so we can return True
to prevent the default sequential execution.
"""
config = session.config
num_slots = config.stash.get(_gpu_slots_key, None)
vram_limit = config.getoption("max_vram_gib", default=None)
if num_slots is None or vram_limit is None:
return None # serial execution: let normal pytest handle it
# Imports for parallel execution must be delayed; see the vram_utils pynvml note in pytest_configure.
from tests.utils.pytest_parallel_gpu import run_parallel
from tests.utils.vram_utils import load_test_meta
# Collect test IDs from the already-filtered session items
test_ids = [item.nodeid for item in session.items]
if not test_ids:
return True
meta = load_test_meta()
is_stream = config.getoption("capture", default="fd") == "no"
gpu_indices = config.stash.get(_gpu_indices_key, None)
# Forward original CLI args to child pytest subprocesses so they
# inherit options like -s, -v, --tb, --durations, --image, etc.
extra_args: list[str] = []
if is_stream:
extra_args.append("-s")
verbose = config.getoption("verbose", default=0)
if verbose >= 2:
extra_args.append("-vv")
elif verbose >= 1:
extra_args.append("-v")
tb_style = config.getoption("tbstyle", default="short")
if tb_style and tb_style != "short":
extra_args.append(f"--tb={tb_style}")
durations = config.getoption("durations", default=None)
if durations is not None:
extra_args.append(f"--durations={durations}")
durations_min = config.getoption("durations_min", default=None)
if durations_min is not None:
extra_args.append(f"--durations-min={durations_min}")
for opt_name, cli_flag in [
("image", "--image"),
("namespace", "--namespace"),
("framework", "--framework"),
("profile", "--profile"),
]:
val = config.getoption(opt_name, default=None)
if val is not None:
extra_args.extend([cli_flag, str(val)])
if config.getoption("skip_service_restart", default=None):
extra_args.append("--skip-service-restart")
rc = run_parallel(
test_ids=test_ids,
meta=meta,
max_vram_gib=vram_limit,
num_slots=num_slots,
gpu_indices=gpu_indices,
extra_pytest_args=extra_args or None,
stream=is_stream,
)
if rc != 0:
session.testsfailed = 1
return True # we handled the test loop
@pytest.fixture()
def set_ucx_tls_no_mm():
"""Set UCX env defaults for all tests."""
......@@ -205,8 +345,10 @@ def _enable_offline_with_mistral_patch():
except (ImportError, AttributeError):
return # transformers version without _patch_mistral_regex — nothing to do
# Write a sitecustomize.py so subprocesses also get the patch
patch_dir = os.path.join(tempfile.gettempdir(), "dynamo_test_hf_patch")
# Write a sitecustomize.py so subprocesses also get the patch.
# Use a per-worker dir under xdist to avoid write races.
worker_id = os.environ.get("PYTEST_XDIST_WORKER", "main")
patch_dir = os.path.join(tempfile.gettempdir(), f"dynamo_test_hf_patch_{worker_id}")
os.makedirs(patch_dir, exist_ok=True)
with open(os.path.join(patch_dir, "sitecustomize.py"), "w") as f:
f.write(
......@@ -239,26 +381,33 @@ def _enable_offline_with_mistral_patch():
def _disable_offline_with_mistral_patch():
"""Undo _enable_offline_with_mistral_patch."""
os.environ.pop("HF_HUB_OFFLINE", None)
patch_dir = os.path.join(tempfile.gettempdir(), "dynamo_test_hf_patch")
worker_id = os.environ.get("PYTEST_XDIST_WORKER", "main")
patch_dir = os.path.join(tempfile.gettempdir(), f"dynamo_test_hf_patch_{worker_id}")
pythonpath = os.environ.get("PYTHONPATH", "")
os.environ["PYTHONPATH"] = pythonpath.replace(f"{patch_dir}:", "").replace(
patch_dir, ""
)
_download_lock_path = os.path.join(tempfile.gettempdir(), "pytest_model_download.lock")
@pytest.fixture(scope="session")
def predownload_models(pytestconfig):
"""Fixture wrapper around download_models for models used in collected tests"""
# Get models from pytest config if available, otherwise fall back to TEST_MODELS
"""Fixture wrapper around download_models for models used in collected tests.
Uses a file lock so that under xdist, only one worker downloads at a time
and the rest reuse the HuggingFace cache.
"""
models = getattr(pytestconfig, "models_to_download", None)
if models:
logging.info(
f"Downloading {len(models)} models needed for collected tests\nModels: {models}"
)
download_models(model_list=list(models))
else:
# Fallback to original behavior if extraction failed
download_models()
with FileLock(_download_lock_path):
if models:
logging.info(
f"Downloading {len(models)} models needed for collected tests\nModels: {models}"
)
download_models(model_list=list(models))
else:
download_models()
_enable_offline_with_mistral_patch()
yield
......@@ -267,21 +416,20 @@ def predownload_models(pytestconfig):
@pytest.fixture(scope="session")
def predownload_tokenizers(pytestconfig):
"""Fixture wrapper around download_models for tokenizers used in collected tests"""
# Get models from pytest config if available, otherwise fall back to TEST_MODELS
"""Fixture wrapper around download_models for tokenizers used in collected tests.
Uses a file lock so that under xdist, only one worker downloads at a time.
"""
models = getattr(pytestconfig, "models_to_download", None)
if models:
logging.info(
f"Downloading tokenizers for {len(models)} models needed for collected tests\nModels: {models}"
)
download_models(model_list=list(models), ignore_weights=True)
else:
# Fallback to original behavior if extraction failed
download_models(ignore_weights=True)
with FileLock(_download_lock_path):
if models:
logging.info(
f"Downloading tokenizers for {len(models)} models needed for collected tests\nModels: {models}"
)
download_models(model_list=list(models), ignore_weights=True)
else:
download_models(ignore_weights=True)
# Skip redundant HuggingFace API calls in worker subprocesses since
# tokenizers are already cached. This avoids flaky timeouts from slow
# HF API responses (the RepoInfo fetch still happens even for cached models).
_enable_offline_with_mistral_patch()
yield
_disable_offline_with_mistral_patch()
......@@ -337,26 +485,41 @@ def pytest_collection_modifyitems(config, items):
if _item_has_marker(item, marker_name):
item.add_marker(skip)
# Skip tests that exceed --max-vram-gib
# Deselect tests based on --max-vram-gib:
# - Tests whose profiled VRAM exceeds the limit are removed
# - Tests WITHOUT a VRAM marker are also removed (unknown VRAM = unsafe)
# Using deselect (not skip) so they never reach the xdist scheduler.
vram_limit = config.getoption("--max-vram-gib", default=None)
if vram_limit is not None:
skip_vram = pytest.mark.skip(
reason=f"requires more than {vram_limit} GiB VRAM (--max-vram-gib={vram_limit})"
)
keep = []
deselected = []
for item in items:
vram_mark = item.get_closest_marker("max_vram_gib")
if vram_mark and vram_mark.args and vram_mark.args[0] > vram_limit:
item.add_marker(skip_vram)
vram_mark = item.get_closest_marker("profiled_vram_gib")
if vram_mark and vram_mark.args and vram_mark.args[0] <= vram_limit:
keep.append(item)
else:
deselected.append(item)
if deselected:
config.hook.pytest_deselected(items=deselected)
items[:] = keep
# Write test metadata for the GPU orchestrator to read.
if vram_limit is not None:
# Delayed: see vram_utils pynvml note in pytest_configure
from tests.utils.vram_utils import print_gpu_plan, write_test_meta
write_test_meta(items)
# --dry-run: print run/skip breakdown and exit without executing tests
# --dry-run: print run/skip breakdown and exit without executing tests.
# At this point, items only contains tests that passed --max-vram-gib
# filtering (deselected items were already removed above).
if config.getoption("--dry-run", default=False):
would_run = []
would_skip = []
unmarked = []
for item in items:
vram_mark = item.get_closest_marker("max_vram_gib")
vram_mark = item.get_closest_marker("profiled_vram_gib")
vram_val = vram_mark.args[0] if vram_mark and vram_mark.args else None
name = item.nodeid.split("::", 1)[1] if "::" in item.nodeid else item.nodeid
name = item.nodeid
skip_reasons = []
for marker in item.iter_markers("skip"):
......@@ -365,39 +528,28 @@ def pytest_collection_modifyitems(config, items):
reason = marker.args[0]
skip_reasons.append(reason or "no reason given")
vram_skipped = (
vram_limit is not None
and vram_val is not None
and vram_val > vram_limit
)
if vram_skipped:
skip_reasons.insert(0, f"{vram_val} GiB > {vram_limit} GiB VRAM limit")
if skip_reasons:
would_skip.append((name, vram_val, skip_reasons))
elif vram_val is not None:
would_run.append((name, vram_val))
else:
unmarked.append(name)
would_run.append((name, vram_val))
print(f"\n{'=' * 60}")
print(
f"--max-vram-gib={vram_limit or 'not set'} | {len(items)} tests selected"
)
print(f"--max-vram-gib={vram_limit or 'not set'} | {len(items)} tests")
print(f"{'=' * 60}")
if would_run:
print(f"\nWould RUN ({len(would_run)}):")
for name, gib in would_run:
print(f" {name} ({gib} GiB)")
gib_str = f" ({gib} GiB)" if gib is not None else ""
print(f" {name}{gib_str}")
if would_skip:
print(f"\nWould SKIP ({len(would_skip)}):")
for name, vram_val, reasons in would_skip:
vram_str = f" ({vram_val} GiB)" if vram_val is not None else ""
print(f" {name}{vram_str} -- {'; '.join(reasons)}")
if unmarked:
print(f"\nNo VRAM marker — always run ({len(unmarked)}):")
for name in unmarked:
print(f" {name}")
gpus = config.stash.get(_gpu_parallel_gpus_key, None)
if gpus and vram_limit is not None:
print_gpu_plan(gpus, vram_limit, would_run)
print()
items.clear()
return
......
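The `--max-vram-gib` filtering in the hunk above can be sketched as a standalone function. This is a minimal illustration, not the conftest code itself: `FakeItem` and `partition_by_vram` are hypothetical names, and the marker lookup is replaced by a plain attribute. It shows the rule stated in the comments: a test is kept only if it has a profiled VRAM value at or below the limit, and unmarked tests are deselected.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FakeItem:
    """Hypothetical stand-in for a collected pytest item."""

    nodeid: str
    profiled_vram_gib: Optional[float] = None  # None models a missing marker


def partition_by_vram(
    items: List[FakeItem], vram_limit: float
) -> Tuple[List[FakeItem], List[FakeItem]]:
    """Split items into (keep, deselected) using the rule from the diff."""
    keep: List[FakeItem] = []
    deselected: List[FakeItem] = []
    for item in items:
        gib = item.profiled_vram_gib
        # Keep only tests with a known VRAM figure that fits the limit.
        if gib is not None and gib <= vram_limit:
            keep.append(item)
        else:
            # Over the limit, or unknown VRAM (treated as unsafe).
            deselected.append(item)
    return keep, deselected


items = [
    FakeItem("tests/a.py::small", 3.8),
    FakeItem("tests/b.py::large", 19.9),
    FakeItem("tests/c.py::unmarked"),
]
keep, deselected = partition_by_vram(items, vram_limit=10)
```

In the real hook the deselected list is additionally reported through `config.hook.pytest_deselected`, which this sketch leaves out.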
......@@ -99,9 +99,16 @@ class VllmWorkerProcess(ManagedProcess):
"32768",
]
gpu_util = os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")
if gpu_util:
command.extend(["--gpu-memory-utilization", gpu_util])
kv_bytes = os.environ.get("_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES")
if kv_bytes:
command.extend(
[
"--kv-cache-memory-bytes",
kv_bytes,
"--gpu-memory-utilization",
"0.01",
]
)
env = os.environ.copy()
env["DYN_LOG"] = "debug"
......@@ -229,7 +236,8 @@ def _validate_chat_response(response: requests.Response) -> Dict[str, Any]:
# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning_effort
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.profiled_vram_gib(20.4) # carried over from max_vram_gib: observed peak 18.5 GiB (+10% safety)
# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
@pytest.mark.timeout(300) # 3x observed ~70s wall time, rounded up
@pytest.mark.post_merge
def test_reasoning_effort(
......@@ -297,7 +305,8 @@ def test_reasoning_effort(
# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.profiled_vram_gib(20.4) # carried over from max_vram_gib: observed peak 18.5 GiB (+10% safety)
# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
@pytest.mark.timeout(113) # 3x observed 37.4s wall time
@pytest.mark.post_merge
def test_tool_calling(
......@@ -341,7 +350,8 @@ def test_tool_calling(
# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling_second_round
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.profiled_vram_gib(20.4) # carried over from max_vram_gib: observed peak 18.5 GiB (+10% safety)
# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
@pytest.mark.timeout(115) # 3x observed 38.1s wall time
@pytest.mark.nightly
def test_tool_calling_second_round(
......@@ -407,7 +417,8 @@ def test_tool_calling_second_round(
# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.profiled_vram_gib(20.4) # carried over from max_vram_gib: observed peak 18.5 GiB (+10% safety)
# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
@pytest.mark.timeout(131) # 3x observed 43.4s wall time
@pytest.mark.nightly
def test_reasoning(request, start_services: ServicePorts, predownload_models) -> None:
......
......@@ -18,6 +18,7 @@ from tests.conftest import ServicePorts
from tests.utils.client import send_request
from tests.utils.constants import DefaultPort
from tests.utils.engine_process import EngineConfig, EngineProcess
from tests.utils.port_utils import allocate_port, deallocate_port
DEFAULT_TIMEOUT = 10
......@@ -93,6 +94,7 @@ def run_serve_deployment(
# Ensure EngineProcess health checks hit the correct frontend port.
config = dataclasses.replace(config, frontend_port=dynamic_frontend_port)
else:
# Backward compat: infer from config/extra_env if no explicit ports are passed.
dynamic_frontend_port = int(config.frontend_port)
......@@ -108,76 +110,86 @@ def run_serve_deployment(
int(merged_env.get("DYN_SYSTEM_PORT2") or DefaultPort.SYSTEM2.value),
]
with EngineProcess.from_script(
config, request, extra_env=merged_env
) as server_process:
for _payload in config.request_payloads:
logger.info("TESTING: Payload: %s", _payload.__class__.__name__)
# Make a per-iteration copy so tests can safely override ports/fields
# without mutating shared config instances across parametrized cases.
payload = deepcopy(_payload)
# inject model
if hasattr(payload, "with_model"):
payload = payload.with_model(config.model)
# Default behavior: requests go to the frontend port, except metrics which target
# worker system ports (mapped from DefaultPort -> per-test ports).
if getattr(payload, "endpoint", "") == "/metrics":
if payload.port == DefaultPort.SYSTEM1.value:
if len(dynamic_system_ports) < 1:
raise RuntimeError(
"Payload targets SYSTEM_PORT1 but no system ports were provided "
f"(payload={payload.__class__.__name__})"
)
payload.port = dynamic_system_ports[0]
elif payload.port == DefaultPort.SYSTEM2.value:
if len(dynamic_system_ports) < 2:
raise RuntimeError(
"Payload targets SYSTEM_PORT2 but only 1 system port was provided "
f"(payload={payload.__class__.__name__})"
)
payload.port = dynamic_system_ports[1]
else:
payload.port = dynamic_frontend_port
# Optional extra system ports for specialized payloads (e.g. LoRA control-plane APIs).
# BasePayload always defines `system_ports` (usually empty); map defaults
# (SYSTEM_PORT1/2) to per-test system ports when present.
if payload.system_ports:
mapped_system_ports: list[int] = []
for p in payload.system_ports:
if p == DefaultPort.SYSTEM1.value:
# Disagg scripts need a unique bootstrap port so parallel runs don't collide.
disagg_bootstrap_port: int | None = None
if config.script_name and "disagg" in config.script_name:
disagg_bootstrap_port = allocate_port(12000)
merged_env["DYN_DISAGG_BOOTSTRAP_PORT"] = str(disagg_bootstrap_port)
try:
with EngineProcess.from_script(
config, request, extra_env=merged_env
) as server_process:
for _payload in config.request_payloads:
logger.info("TESTING: Payload: %s", _payload.__class__.__name__)
# Make a per-iteration copy so tests can safely override ports/fields
# without mutating shared config instances across parametrized cases.
payload = deepcopy(_payload)
# inject model
if hasattr(payload, "with_model"):
payload = payload.with_model(config.model)
# Default behavior: requests go to the frontend port, except metrics which target
# worker system ports (mapped from DefaultPort -> per-test ports).
if getattr(payload, "endpoint", "") == "/metrics":
if payload.port == DefaultPort.SYSTEM1.value:
if len(dynamic_system_ports) < 1:
raise RuntimeError(
"Payload.system_ports includes SYSTEM_PORT1 but no system ports were provided "
"Payload targets SYSTEM_PORT1 but no system ports were provided "
f"(payload={payload.__class__.__name__})"
)
mapped_system_ports.append(dynamic_system_ports[0])
elif p == DefaultPort.SYSTEM2.value:
payload.port = dynamic_system_ports[0]
elif payload.port == DefaultPort.SYSTEM2.value:
if len(dynamic_system_ports) < 2:
raise RuntimeError(
"Payload.system_ports includes SYSTEM_PORT2 but only 1 system port was provided "
"Payload targets SYSTEM_PORT2 but only 1 system port was provided "
f"(payload={payload.__class__.__name__})"
)
mapped_system_ports.append(dynamic_system_ports[1])
else:
mapped_system_ports.append(p)
payload.system_ports = mapped_system_ports
for _ in range(payload.repeat_count):
response = send_request(
url=payload.url(),
payload=payload.body,
timeout=payload.timeout,
method=payload.method,
stream=payload.http_stream,
)
server_process.check_response(payload, response)
# Call final_validation if the payload has one (e.g., CachedTokensChatPayload)
if hasattr(payload, "final_validation"):
payload.final_validation()
payload.port = dynamic_system_ports[1]
else:
payload.port = dynamic_frontend_port
# Optional extra system ports for specialized payloads (e.g. LoRA control-plane APIs).
# BasePayload always defines `system_ports` (usually empty); map defaults
# (SYSTEM_PORT1/2) to per-test system ports when present.
if payload.system_ports:
mapped_system_ports: list[int] = []
for p in payload.system_ports:
if p == DefaultPort.SYSTEM1.value:
if len(dynamic_system_ports) < 1:
raise RuntimeError(
"Payload.system_ports includes SYSTEM_PORT1 but no system ports were provided "
f"(payload={payload.__class__.__name__})"
)
mapped_system_ports.append(dynamic_system_ports[0])
elif p == DefaultPort.SYSTEM2.value:
if len(dynamic_system_ports) < 2:
raise RuntimeError(
"Payload.system_ports includes SYSTEM_PORT2 but only 1 system port was provided "
f"(payload={payload.__class__.__name__})"
)
mapped_system_ports.append(dynamic_system_ports[1])
else:
mapped_system_ports.append(p)
payload.system_ports = mapped_system_ports
for _ in range(payload.repeat_count):
response = send_request(
url=payload.url(),
payload=payload.body,
timeout=payload.timeout,
method=payload.method,
stream=payload.http_stream,
)
server_process.check_response(payload, response)
# Call final_validation if the payload has one (e.g., CachedTokensChatPayload)
if hasattr(payload, "final_validation"):
payload.final_validation()
finally:
if disagg_bootstrap_port is not None:
deallocate_port(disagg_bootstrap_port)
def params_with_model_mark(configs: Mapping[str, EngineConfig]):
......
......@@ -12,7 +12,11 @@ trap 'echo "Cleaning up..."; kill 0' EXIT
MODEL="${MODEL:-Qwen/Qwen3-0.6B}"
GPU_MEM_FRACTION="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}"
KV_BYTES="${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"
GPU_MEM_ARGS=""
if [[ -n "$KV_BYTES" ]]; then
GPU_MEM_ARGS="--kv-cache-memory-bytes $KV_BYTES --gpu-memory-utilization 0.01"
fi
echo "Starting Dynamo frontend..."
python3 -m dynamo.frontend &
......@@ -25,7 +29,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
--node-rank 0 \
--master-addr 127.0.0.1 \
--enforce-eager \
${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
$GPU_MEM_ARGS &
echo "Starting dynamo.vllm headless worker (TP=2, nnodes=2, node-rank=1, GPU 1)..."
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
......@@ -35,7 +39,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
--node-rank 1 \
--master-addr 127.0.0.1 \
--enforce-eager \
${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} \
$GPU_MEM_ARGS \
--headless &
wait
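The override pattern used by the launch scripts and by `VllmWorkerProcess` can be summarized in a few lines of Python. This is a sketch under assumptions: `sketch_gpu_mem_args` is a hypothetical helper (not the repository's `build_gpu_mem_args`), while the environment variable name and the two CLI flags are taken from the diff. When the KV-cache byte override is set, the KV cache size is pinned explicitly and `--gpu-memory-utilization` is dropped to `0.01`; otherwise an optional default fraction is used.

```python
import os
from typing import List, Mapping, Optional

ENV_VAR = "_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES"


def sketch_gpu_mem_args(
    env: Optional[Mapping[str, str]] = None,
    default_fraction: Optional[float] = None,
) -> List[str]:
    """Return vLLM memory flags, preferring the KV-cache byte override."""
    env = os.environ if env is None else env
    kv_bytes = env.get(ENV_VAR, "")
    if kv_bytes:
        # Explicit KV cache size; the fraction is set to a token value,
        # mirroring the scripts in the diff.
        return [
            "--kv-cache-memory-bytes",
            kv_bytes,
            "--gpu-memory-utilization",
            "0.01",
        ]
    if default_fraction is not None:
        return ["--gpu-memory-utilization", str(default_fraction)]
    # No override and no default: let the engine use its own default.
    return []


with_override = sketch_gpu_mem_args({ENV_VAR: "1119388000"})
with_default = sketch_gpu_mem_args({}, default_fraction=0.85)
```

Returning a list rather than a single string keeps each flag and value as a separate argv entry, which avoids the word-splitting that the unquoted `$GPU_MEM_ARGS` expansion relies on in the shell scripts.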
......@@ -45,9 +45,9 @@ sglang_dir = os.environ.get("SGLANG_DIR") or os.path.join(
# SGLang test configurations
# NOTE: pytest.mark.gpu_1 tests take ~167s (2m 47s) total to run sequentially (with models pre-cached)
# TODO: Now that these tests use dynamic ports and each config has a max_vram_gib marker,
# TODO: Now that these tests use dynamic ports and each config has a profiled_vram_gib marker,
# optimize the runtime by bin-packing multiple engine deployments in parallel on the same GPU.
# A future collector/launcher can sum max_vram_gib values to decide how many tests fit
# A future collector/launcher can sum profiled_vram_gib values to decide how many tests fit
# concurrently without exceeding available VRAM.
sglang_configs = {
"aggregated": SGLangConfig(
......@@ -58,8 +58,13 @@ sglang_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(6.1), # observed peak 5.6 GiB (+10% safety)
pytest.mark.timeout(240), # profiled 34.4s on A6000
pytest.mark.profiled_vram_gib(
3.7
), # actual peak at recommended token count
pytest.mark.requested_sglang_kv_tokens(
96
), # KV cache cap (2x safety over min=48)
pytest.mark.timeout(195), # profiled 33s on RTX 6000 Ada
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-0.6B",
......@@ -160,7 +165,8 @@ sglang_configs = {
script_name="template_verifier.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.timeout(240), # profiled 11.7s on A6000 (no GPU model load)
pytest.mark.profiled_vram_gib(0.0), # no GPU model load
pytest.mark.timeout(120), # profiled 12s on RTX 6000 Ada
pytest.mark.pre_merge,
pytest.mark.nightly,
],
......@@ -175,8 +181,8 @@ sglang_configs = {
),
# NOTE: Pack all workers on 1 GPU for lower CI resource requirements.
# NOTE: multimodal_epd.sh uses explicit --mem-fraction-static via DYN_ENCODE_GPU_MEM
# / DYN_WORKER_GPU_MEM env vars, so _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE has no effect.
# Regardless of fraction overrides, the workers combined consistently use ~23.6 GiB.
# / DYN_WORKER_GPU_MEM env vars. The profiler override distributes proportionally
# but workers combined consistently use ~23.6 GiB regardless of fraction overrides.
"multimodal_e_pd_qwen": SGLangConfig(
# E/P/D architecture: Encode, Prefill, Decode workers all on GPU 0
name="multimodal_e_pd_qwen",
......@@ -184,16 +190,15 @@ sglang_configs = {
script_name="multimodal_epd.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(13.3), # observed peak 12.1 GiB (+10% safety)
pytest.mark.timeout(360), # profiled 31.0s on A6000
# No profiled_vram_gib: uses hard-coded --mem-fraction-static via
# DYN_ENCODE_GPU_MEM / DYN_WORKER_GPU_MEM, so VRAM scales with GPU size.
pytest.mark.timeout(210), # profiled 35s on RTX 6000 Ada
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-VL-2B-Instruct",
script_args=["--model", "Qwen/Qwen3-VL-2B-Instruct", "--single-gpu"],
timeout=360,
env={
"DYN_ENCODE_WORKER_GPU": "0",
"DYN_WORKER_GPU": "0",
"DYN_ENCODE_GPU_MEM": "0.1",
"DYN_WORKER_GPU_MEM": "0.4",
},
......@@ -226,8 +231,11 @@ sglang_configs = {
script_name="multimodal_disagg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(17.7), # observed peak 16.1 GiB (+10% safety)
pytest.mark.timeout(360), # profiled 36.0s on A6000
pytest.mark.profiled_vram_gib(16.1), # actual profiled peak
pytest.mark.requested_sglang_kv_tokens(
1024
), # KV cache cap (2x safety over min=512)
pytest.mark.timeout(222), # profiled 37s on RTX 6000 Ada
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-VL-2B-Instruct",
......@@ -261,8 +269,13 @@ sglang_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(21.0), # observed peak 19.1 GiB (+10% safety)
pytest.mark.timeout(300), # profiled 41.3s on A6000
pytest.mark.profiled_vram_gib(
19.1
), # actual peak at recommended token count
pytest.mark.requested_sglang_kv_tokens(
768
), # KV cache cap (2x safety over min=384)
pytest.mark.timeout(182), # profiled 30s on RTX 6000 Ada
pytest.mark.pre_merge,
pytest.mark.nightly,
],
......@@ -300,8 +313,13 @@ sglang_configs = {
script_name="agg_embed.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(12.1), # observed peak 11.0 GiB (+10% safety)
pytest.mark.timeout(270), # profiled 25.5s on A6000
pytest.mark.profiled_vram_gib(
9.8
), # actual peak at recommended token count
pytest.mark.requested_sglang_kv_tokens(
128
), # KV cache cap (2x safety over min=64)
pytest.mark.timeout(147), # profiled 24s on RTX 6000 Ada
pytest.mark.pre_merge,
pytest.mark.nightly,
],
......@@ -338,8 +356,13 @@ sglang_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(16.2), # observed peak 14.8 GiB (+10% safety)
pytest.mark.timeout(420), # profiled 73s on A6000
pytest.mark.profiled_vram_gib(
14.7
), # actual peak at recommended token count
pytest.mark.requested_sglang_kv_tokens(
64
), # KV cache cap (2x safety over min=32)
pytest.mark.timeout(341), # profiled 57s on RTX 6000 Ada
pytest.mark.post_merge,
],
model="deepseek-ai/deepseek-llm-7b-base",
......@@ -362,7 +385,7 @@ sglang_configs = {
pytest.mark.post_merge,
pytest.mark.timeout(240),
pytest.mark.skip(reason="DYN-2261"),
# TODO: profile to get max_vram (currently skipped)
# TODO: profile once DYN-2261 is fixed (uses agg.sh, profiler works)
],
model="Qwen/Qwen3-0.6B",
env={"DYN_ENABLE_ANTHROPIC_API": "1"},
......
......@@ -54,9 +54,9 @@ vllm_dir = os.environ.get("VLLM_DIR") or os.path.join(
# vLLM test configurations
# NOTE: pytest.mark.gpu_1 tests take ~5.5 minutes total to run sequentially (with models pre-cached)
# TODO: Now that these tests use dynamic ports and each config has a max_vram_gib marker,
# TODO: Now that these tests use dynamic ports and each config has VRAM markers,
# optimize the runtime by bin-packing multiple engine deployments in parallel on the same GPU.
# A future collector/launcher can sum max_vram_gib values to decide how many tests fit
# A future collector/launcher can sum profiled_vram_gib values to decide how many tests fit
# concurrently without exceeding available VRAM.
vllm_configs = {
"aggregated": VLLMConfig(
......@@ -65,8 +65,13 @@ vllm_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.6), # observed peak 7.8 GiB (+10% safety)
pytest.mark.timeout(300), # ~7x observed 42.2s; old value before profiling
pytest.mark.profiled_vram_gib(3.8), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_119_388_000
), # KV cache cap (2x safety over min=559_693_824)
pytest.mark.timeout(
360
), # ~8.5x observed 42.2s; bumped for GPU-parallel headroom
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-0.6B",
......@@ -93,7 +98,10 @@ vllm_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.6), # observed peak 7.8 GiB (+10% safety)
pytest.mark.profiled_vram_gib(3.8), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_119_388_000
), # KV cache cap (2x safety over min=559_693_824)
pytest.mark.timeout(120), # ~5x observed 24.3s; CI machines are slower
pytest.mark.post_merge,
],
......@@ -122,7 +130,10 @@ vllm_configs = {
marks=[
pytest.mark.lmcache,
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.4 GiB (+10% safety)
pytest.mark.profiled_vram_gib(3.8), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_119_388_000
), # KV cache cap (2x safety over min=559_693_824)
pytest.mark.timeout(360), # ~7x observed 49.0s; old value before profiling
pytest.mark.pre_merge,
pytest.mark.skipif(
......@@ -145,7 +156,10 @@ vllm_configs = {
marks=[
pytest.mark.lmcache,
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.4 GiB (+10% safety)
pytest.mark.profiled_vram_gib(3.8), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_119_388_000
), # KV cache cap (2x safety over min=559_693_824)
pytest.mark.timeout(360), # ~7x observed 49.3s; old value before profiling
pytest.mark.pre_merge,
pytest.mark.skipif(
......@@ -170,8 +184,13 @@ vllm_configs = {
script_name="agg_request_planes.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.3 GiB (+10% safety)
pytest.mark.timeout(300), # ~7x observed 43.0s; old value before profiling
pytest.mark.profiled_vram_gib(3.8), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_119_388_000
), # KV cache cap (2x safety over min=559_693_824)
pytest.mark.timeout(
360
), # ~8x observed 43.0s; bumped for GPU-parallel headroom
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-0.6B",
......@@ -187,8 +206,13 @@ vllm_configs = {
script_name="agg_request_planes.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.3 GiB (+10% safety)
pytest.mark.timeout(300), # ~7x observed 42.3s; old value before profiling
pytest.mark.profiled_vram_gib(3.8), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_119_388_000
), # KV cache cap (2x safety over min=559_693_824)
pytest.mark.timeout(
360
), # ~8.5x observed 42.3s; bumped for GPU-parallel headroom
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-0.6B",
......@@ -299,13 +323,17 @@ vllm_configs = {
],
),
# NOTE: Pack all workers on 1 GPU for lower CI resource requirements
# NOTE: disagg_multimodal_e_pd.sh uses explicit --gpu-memory-utilization via
# DYN_ENCODE_GPU_MEM / DYN_PD_GPU_MEM env vars in single-GPU mode.
# PD worker honors build_gpu_mem_args for parallel execution.
"multimodal_e_pd_qwen": VLLMConfig(
name="multimodal_e_pd_qwen",
directory=vllm_dir,
script_name="disagg_multimodal_e_pd.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(24.6), # observed peak 22.3 GiB (+10% safety)
# No profiled_vram_gib / requested_vllm_kv_cache_bytes: single-GPU mode
# uses hardcoded fractions (encode=0.1, PD=0.7) that scale with GPU size.
pytest.mark.timeout(340), # ~5x observed 68.4s; 2B model loads slower on CI
pytest.mark.pre_merge,
],
......@@ -339,7 +367,10 @@ vllm_configs = {
# post_merge because needs real NIXL not stub
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(10.2), # observed peak 9.3 GiB (+10% safety)
pytest.mark.profiled_vram_gib(9.6), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_710_490_000
), # KV cache cap (2x safety over min=855_244_800)
pytest.mark.timeout(220), # ~5x observed 43.7s; 2B model loads slower on CI
pytest.mark.post_merge,
],
......@@ -373,21 +404,25 @@ vllm_configs = {
# NOTE: disagg_multimodal_epd.sh uses --kv-cache-memory-bytes=512MB for P/D
# workers. Per vLLM CacheConfig, kv_cache_memory_bytes (when not-None) ignores
# gpu_memory_utilization (ref: https://docs.vllm.ai/en/stable/api/vllm/config/cache/),
# so _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE has no effect. Regardless of GPU_MEM
# so KV cache overrides have no effect. Regardless of GPU_MEM
# fractions (0.1/0.4/0.4), the 3 workers combined consistently use ~17.6 GiB
# total on this GPU.
# NOTE: disagg_multimodal_epd.sh uses explicit --gpu-memory-utilization via
# DYN_ENCODE_GPU_MEM / DYN_PREFILL_GPU_MEM / DYN_DECODE_GPU_MEM env vars.
# P/D workers honor build_gpu_mem_args for parallel execution.
"multimodal_disagg_qwen": VLLMConfig(
name="multimodal_disagg_qwen",
directory=vllm_dir,
script_name="disagg_multimodal_epd.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(19.4), # observed peak 17.6 GiB (+10% safety)
# No profiled_vram_gib / requested_vllm_kv_cache_bytes: single-GPU mode
# uses hardcoded fractions via DYN_*_GPU_MEM that scale with GPU size.
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-VL-2B-Instruct",
script_args=["--model", "Qwen/Qwen3-VL-2B-Instruct", "--single-gpu"],
timeout=360,
timeout=300,
env={
"DYN_ENCODE_WORKER_GPU": "0",
"DYN_PREFILL_WORKER_GPU": "0",
......@@ -421,7 +456,10 @@ vllm_configs = {
script_name="agg_multimodal.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(21.6), # observed peak 19.6 GiB (+10% safety)
pytest.mark.profiled_vram_gib(19.9), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
922_354_000
), # KV cache cap (2x safety over min=461_176_832)
pytest.mark.timeout(
360
), # ~7x observed 50.0s; 7B model loads ~48s on CI (A10G/L4)
......@@ -455,7 +493,10 @@ vllm_configs = {
script_name="agg_multimodal.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(18.9), # observed peak 17.1 GiB (+10% safety)
pytest.mark.profiled_vram_gib(14.9), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
922_354_000
), # KV cache cap (2x safety over min=461_176_832)
pytest.mark.timeout(
300
), # ~7x observed 42.7s; 7B model loads ~48s on CI (A10G/L4)
......@@ -703,7 +744,10 @@ vllm_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(21.9), # observed peak 19.9 GiB (+10% safety)
pytest.mark.profiled_vram_gib(18.3), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
4_074_898_000
), # KV cache cap (2x safety over min=2_037_448_704)
pytest.mark.timeout(
420
), # 7B model loads ~48s on CI (A10G/L4) vs ~15s locally
......@@ -742,7 +786,10 @@ vllm_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.6), # observed peak 7.8 GiB (+10% safety)
pytest.mark.profiled_vram_gib(3.8), # actual profiled peak with kv-bytes
pytest.mark.requested_vllm_kv_cache_bytes(
1_119_388_000
), # KV cache cap (2x safety over min=559_693_824)
pytest.mark.timeout(110), # ~5x observed 22.3s; CI machines are slower
pytest.mark.pre_merge,
],
......
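The `requested_vllm_kv_cache_bytes` values above are annotated as "2x safety over min". The figures are consistent with doubling the measured minimum and rounding up to the next multiple of 1000 bytes. The rounding step is inferred from the numbers in this diff, not stated by the source, and `kv_cap_bytes` is a hypothetical helper name.

```python
def kv_cap_bytes(min_bytes: int, safety: int = 2, granularity: int = 1000) -> int:
    """Scale the measured minimum and round up to a multiple of `granularity`.

    Assumption: the round-up-to-1000 rule is inferred from the marker values
    in the diff; the profiler may compute its recommendation differently.
    """
    scaled = min_bytes * safety
    # Integer ceiling division, then scale back up.
    return -(-scaled // granularity) * granularity


# Minimums and caps as they appear in the vLLM config markers.
observed = {
    559_693_824: 1_119_388_000,
    855_244_800: 1_710_490_000,
    461_176_832: 922_354_000,
    2_037_448_704: 4_074_898_000,
}
computed = {min_b: kv_cap_bytes(min_b) for min_b in observed}
```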
This diff is collapsed.
This diff is collapsed.
......@@ -32,27 +32,27 @@ ALLOC_MIB = 4096 # 4 GiB
@pytest.mark.gpu_1
@pytest.mark.timeout(30)
def test_mock_4gb_gpu_alloc():
"""Allocate 4 GiB of GPU VRAM, hold 2s, release. Honors _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE."""
"""Allocate 4 GiB of GPU VRAM, hold 2s, release. Honors _PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES."""
if not torch.cuda.is_available():
pytest.skip("CUDA not available")
device = 0
total_mib = torch.cuda.get_device_properties(device).total_memory / (1024 * 1024)
gpu_util = os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")
if gpu_util is not None:
cap_mib = total_mib * float(gpu_util)
kv_bytes_str = os.environ.get("_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES")
if kv_bytes_str is not None:
cap_mib = int(kv_bytes_str) / (1024 * 1024)
logger.info(
"_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=%.2f -> cap %.0f MiB (%.1f GiB) of %.0f MiB total",
float(gpu_util),
"_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES=%s -> cap %.0f MiB (%.1f GiB) of %.0f MiB total",
kv_bytes_str,
cap_mib,
cap_mib / 1024,
total_mib,
)
if ALLOC_MIB > cap_mib:
raise RuntimeError(
f"Requested {ALLOC_MIB} MiB exceeds _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE "
f"cap of {cap_mib:.0f} MiB ({gpu_util})"
f"Requested {ALLOC_MIB} MiB exceeds KV cache cap "
f"of {cap_mib:.0f} MiB ({kv_bytes_str} bytes)"
)
num_elements = (ALLOC_MIB * 1024 * 1024) // 4
......
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""GPU VRAM utilities for parallel test execution.
Functions:
detect_gpus() Enumerate GPUs via pynvml
auto_worker_count(gpus, limit) Calculate slot count for -n auto
write_test_meta(items) Serialize profiled/requested vram + timeout
load_test_meta() Read the serialized test metadata
print_gpu_plan(gpus, limit, would_run) Dry-run GPU plan summary
Usage:
# Sequential (filter only)
pytest --max-vram-gib=10 -m "gpu_1 and vllm" tests/serve/
# Parallel (VRAM-aware scheduling)
pytest --max-vram-gib=10 -n auto -m "gpu_1 and vllm" tests/serve/
"""
from __future__ import annotations
import json
import logging
import os
import tempfile
import pynvml
_logger = logging.getLogger(__name__)
# When 2+ tests run concurrently, reserve 15% of GPU VRAM for CUDA context
# overhead across processes. A single test gets the full GPU (0% margin).
VRAM_MULTI_PROC_MARGIN = 0.15
_TEST_META_FILENAME = "pytest_gpu_parallel_test_meta.json"
def detect_gpus() -> list[dict]:
"""Return list of dicts with 'index', 'name', 'total_mib' per GPU.
Uses pynvml (already a dependency via profile_pytest.py).
Returns an empty list if no GPUs are found or NVML fails to initialize.
"""
try:
pynvml.nvmlInit()
except pynvml.NVMLError:
return []
try:
count = pynvml.nvmlDeviceGetCount()
gpus = []
for i in range(count):
handle = pynvml.nvmlDeviceGetHandleByIndex(i)
name = pynvml.nvmlDeviceGetName(handle)
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
gpus.append(
{
"index": i,
"name": name,
"total_mib": mem.total // (1024 * 1024),
}
)
return gpus
finally:
pynvml.nvmlShutdown()
def auto_worker_count(
gpus: list[dict],
vram_limit: float,
test_profiled_gibs: list[float] | None = None,
) -> int:
"""Calculate slot count for -n auto.
Uses the smallest profiled test size (if provided) to maximize parallelism.
Falls back to vram_limit when no test sizes are available.
"""
if not gpus or vram_limit <= 0:
return len(gpus) or 1
min_gpu_gib = min(g["total_mib"] for g in gpus) / 1024.0
budget_gib = min_gpu_gib * (1.0 - VRAM_MULTI_PROC_MARGIN)
divisor = vram_limit
if test_profiled_gibs:
nonzero = [g for g in test_profiled_gibs if g > 0]
if nonzero:
divisor = min(nonzero)
workers_per_gpu = max(1, int(budget_gib / divisor)) if divisor > 0 else 1
return len(gpus) * workers_per_gpu
def write_test_meta(items, dest_dir: str | None = None) -> None:
"""Serialize profiled_vram_gib, timeout, and KV cache markers to JSON.
Called from pytest_collection_modifyitems so the GPU orchestrator can
read test metadata without re-collecting.
"""
test_meta: dict[str, dict] = {}
for item in items:
meta: dict = {}
profiled_mark = item.get_closest_marker("profiled_vram_gib")
if profiled_mark and profiled_mark.args:
meta["profiled_vram_gib"] = profiled_mark.args[0]
kv_bytes_mark = item.get_closest_marker("requested_vllm_kv_cache_bytes")
if kv_bytes_mark and kv_bytes_mark.args:
meta["requested_vllm_kv_cache_bytes"] = kv_bytes_mark.args[0]
timeout_mark = item.get_closest_marker("timeout")
if timeout_mark and timeout_mark.args:
meta["timeout"] = timeout_mark.args[0]
kv_tokens_mark = item.get_closest_marker("requested_sglang_kv_tokens")
if kv_tokens_mark and kv_tokens_mark.args:
meta["requested_sglang_kv_tokens"] = kv_tokens_mark.args[0]
skip_mark = item.get_closest_marker("skip")
if skip_mark:
reason = skip_mark.kwargs.get("reason", "")
if not reason and skip_mark.args:
reason = skip_mark.args[0]
meta["skip_reason"] = reason or "skipped"
if meta:
test_meta[item.nodeid] = meta
if test_meta:
path = os.path.join(dest_dir or tempfile.gettempdir(), _TEST_META_FILENAME)
with open(path, "w") as f:
json.dump(test_meta, f)
def load_test_meta() -> dict[str, dict]:
"""Load the nodeid -> {profiled_vram_gib, timeout, ...} map."""
path = os.path.join(tempfile.gettempdir(), _TEST_META_FILENAME)
try:
with open(path) as f:
return json.load(f)
except (FileNotFoundError, json.JSONDecodeError):
return {}
def print_gpu_plan(
gpus: list[dict], vram_limit: float, would_run: list[tuple[str, float | None]]
) -> None:
"""Print the GPU-parallel plan section for --dry-run output."""
min_gpu_gib = min(g["total_mib"] for g in gpus) / 1024.0
budget_gib = min_gpu_gib * (1.0 - VRAM_MULTI_PROC_MARGIN)
profiled_gibs = [gib for _, gib in would_run if gib is not None and gib > 0]
min_test_gib = min(profiled_gibs) if profiled_gibs else vram_limit
auto_slots = max(1, int(budget_gib / min_test_gib)) if min_test_gib > 0 else 1
print(f"\n{'=' * 60}")
print("GPU-Parallel Plan")
print(f"{'=' * 60}")
for gpu in gpus:
gib = gpu["total_mib"] / 1024
print(f" GPU {gpu['index']}: {gpu['name']} ({gib:.1f} GiB)")
print(f"\n Usable VRAM: {budget_gib:.0f} GiB")
print("\n Run options:")
print(" (no -n) : sequential, 1 test at a time")
print(
f" -n auto : up to {auto_slots} slots per GPU "
f"({budget_gib:.0f} / {min_test_gib:.0f} GiB smallest test)"
)
print(f" -n N : N concurrent slots across {len(gpus)} GPU(s)")
print("\n Usage:")
print(
f" pytest --max-vram-gib={vram_limit:.0f} -n {auto_slots} "
f'-m "gpu_1 and vllm" tests/serve/'
)
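The slot arithmetic shared by `auto_worker_count` and `print_gpu_plan` can be checked with a small worked example. This sketch re-implements only the per-GPU calculation. `slots_per_gpu` is a hypothetical name, and the 48 GiB GPU size is an illustrative assumption rather than a measured value.

```python
from typing import Sequence

VRAM_MULTI_PROC_MARGIN = 0.15  # same constant as in vram_utils


def slots_per_gpu(
    total_mib: int,
    vram_limit_gib: float,
    profiled_gibs: Sequence[float] = (),
) -> int:
    """Concurrent test slots that fit on one GPU after reserving the margin."""
    budget_gib = (total_mib / 1024.0) * (1.0 - VRAM_MULTI_PROC_MARGIN)
    # Prefer the smallest nonzero profiled size; fall back to the limit.
    nonzero = [g for g in profiled_gibs if g > 0]
    divisor = min(nonzero) if nonzero else vram_limit_gib
    if divisor <= 0:
        return 1
    return max(1, int(budget_gib / divisor))


# Assumed 48 GiB GPU: budget = 48 * 0.85 = 40.8 GiB.
with_profiles = slots_per_gpu(49152, 10, [3.8, 19.9])  # 40.8 / 3.8 -> 10
limit_only = slots_per_gpu(49152, 10)                  # 40.8 / 10  -> 4
```

Because the divisor is the smallest profiled test, the result is an upper bound on parallelism: larger tests still need the scheduler to check that their own VRAM fits before they are started.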