Unverified commit 0b20745e, authored by Keiven C, committed by GitHub

feat: GPU VRAM profiler via memory fraction injection + profiled test markers (part 2 - vLLM only) (#6719)
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
parent d047851e
...@@ -28,10 +28,12 @@ if [[ "$CAPACITY_GB" != "0" ]]; then ...@@ -28,10 +28,12 @@ if [[ "$CAPACITY_GB" != "0" ]]; then
}") }")
fi fi
GPU_MEM_UTIL="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-.9}"
CUDA_VISIBLE_DEVICES=2 \ CUDA_VISIBLE_DEVICES=2 \
vllm serve "$MODEL" \ vllm serve "$MODEL" \
--enable-log-requests \ --enable-log-requests \
--max-model-len 16384 \ --max-model-len 16384 \
--gpu-memory-utilization .9 \ --gpu-memory-utilization "$GPU_MEM_UTIL" \
"${EC_ARGS[@]}" \ "${EC_ARGS[@]}" \
"${EXTRA_ARGS[@]}" "${EXTRA_ARGS[@]}"
...@@ -20,7 +20,7 @@ MODEL="${MODEL:-Qwen/Qwen3-VL-8B-Instruct}" ...@@ -20,7 +20,7 @@ MODEL="${MODEL:-Qwen/Qwen3-VL-8B-Instruct}"
NAMESPACE="${NAMESPACE:-dynamo}" NAMESPACE="${NAMESPACE:-dynamo}"
HTTP_PORT="${HTTP_PORT:-8000}" HTTP_PORT="${HTTP_PORT:-8000}"
BLOCK_SIZE="${BLOCK_SIZE:-16}" # Must match vLLM backend KV block size BLOCK_SIZE="${BLOCK_SIZE:-16}" # Must match vLLM backend KV block size
GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.85}" GPU_MEMORY_UTILIZATION="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-${GPU_MEMORY_UTILIZATION:-0.85}}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}" MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
NATS_SERVER="${NATS_SERVER:-nats://127.0.0.1:4222}" NATS_SERVER="${NATS_SERVER:-nats://127.0.0.1:4222}"
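The nested default on `GPU_MEMORY_UTILIZATION` resolves in a fixed order: the profiler override first, then a caller-supplied `GPU_MEMORY_UTILIZATION`, then `0.85`. Because `:-` is used, an empty value is treated the same as an unset one. A minimal sketch of that precedence follows; `resolve_frac` is a hypothetical helper for illustration and is not part of the commit.

```bash
#!/usr/bin/env bash
# Hypothetical helper: mirrors the nested ${A:-${B:-default}} expansion.
resolve_frac() {
    echo "${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-${GPU_MEMORY_UTILIZATION:-0.85}}"
}

unset _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE GPU_MEMORY_UTILIZATION

# Neither variable set: falls through to the hard-coded default.
resolve_frac

# Only the user-facing variable set: it is used as-is.
GPU_MEMORY_UTILIZATION=0.70 resolve_frac

# Both set: the profiler override takes priority.
_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.55 GPU_MEMORY_UTILIZATION=0.70 resolve_frac
```

Run in order, the three calls print `0.85`, `0.70`, and `0.55`.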
...@@ -57,7 +57,7 @@ controls the *overall* VRAM budget (and thus whether the model fits), but the ...@@ -57,7 +57,7 @@ controls the *overall* VRAM budget (and thus whether the model fits), but the
KV cache portion is pinned to the explicit byte value. KV cache portion is pinned to the explicit byte value.
Consequence for profiling: if a script uses `--kv-cache-memory-bytes`, Consequence for profiling: if a script uses `--kv-cache-memory-bytes`,
changing `DYN_GPU_MEMORY_FRACTION_OVERRIDE` (which maps to changing `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` (which maps to
`--gpu-memory-utilization`) won't change the KV cache size, only the leftover `--gpu-memory-utilization`) won't change the KV cache size, only the leftover
headroom for activations and overhead. headroom for activations and overhead.
...@@ -256,14 +256,13 @@ to get 10 GiB of KV cache with a 5 GiB model. ...@@ -256,14 +256,13 @@ to get 10 GiB of KV cache with a 5 GiB model.
The helper functions in `gpu_utils.sh` handle these differences: The helper functions in `gpu_utils.sh` handle these differences:
- `gpu_gb_to_total_fraction`: for vLLM/sglang (fraction of total VRAM) - `gpu_gb_to_total_fraction`: for vLLM/sglang (fraction of total VRAM)
- `gpu_gb_to_free_fraction`: for TensorRT-LLM (fraction of free VRAM) - `gpu_gb_to_free_fraction`: for TensorRT-LLM (fraction of free VRAM)
- `gpu_worker_fraction <engine>`: unified wrapper — reads `_EW_*` vars from - `gpu_worker_fraction <engine> <total_gib> <kv_gib>`: converts estimated GiB
`estimate_worker_vram` and calls the right function for the engine. into the engine-appropriate fraction (total for vllm/sglang, free for trtllm).
Launch scripts use `gpu_worker_fraction` so they all follow the same pattern: Launch scripts use `build_gpu_mem_args` which calls these internally:
```bash ```bash
estimate_worker_vram "$MODEL" "$SEQ_LEN" "$CONCURRENCY" trtllm GPU_MEM_FRACTION=$(build_gpu_mem_args trtllm --model "$MODEL" --max-model-len "$SEQ_LEN" --max-num-seqs "$CONCURRENCY")
GPU_MEM_FRACTION=$(gpu_worker_fraction trtllm)
``` ```
--- ---
...@@ -291,7 +290,7 @@ kv_cache_gib = kv_bytes_per_token * max_model_len * max_concurrent_seqs / (1024^ ...@@ -291,7 +290,7 @@ kv_cache_gib = kv_bytes_per_token * max_model_len * max_concurrent_seqs / (1024^
--- ---
## `DYN_GPU_MEMORY_FRACTION_OVERRIDE` ## `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE`
Environment variable used by Dynamo's VRAM profiler to binary-search the minimum Environment variable used by Dynamo's VRAM profiler to binary-search the minimum
memory fraction a script needs. memory fraction a script needs.
...@@ -299,8 +298,8 @@ memory fraction a script needs. ...@@ -299,8 +298,8 @@ memory fraction a script needs.
- Maps to `--gpu-memory-utilization` in vLLM and `--mem-fraction-static` in sglang. - Maps to `--gpu-memory-utilization` in vLLM and `--mem-fraction-static` in sglang.
- For TensorRT-LLM, maps to `kv_cache_config.free_gpu_memory_fraction` via - For TensorRT-LLM, maps to `kv_cache_config.free_gpu_memory_fraction` via
`--override-engine-args`. `--override-engine-args`.
- Launch scripts use `gpu_worker_fraction <engine>` to compute the default - Launch scripts use `build_gpu_mem_args` to compute the default fraction;
fraction; the override bypasses this and splits the raw value between workers. the override bypasses the estimator and splits the raw value between workers.
- Scripts that use `--kv-cache-memory-bytes` (vLLM) bypass the fraction-based KV - Scripts that use `--kv-cache-memory-bytes` (vLLM) bypass the fraction-based KV
cache sizing, making the profiler's fraction override ineffective for KV cache. cache sizing, making the profiler's fraction override ineffective for KV cache.
Those scripts should warn when `DYN_GPU_MEMORY_FRACTION_OVERRIDE` is set. Those scripts should warn when `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` is set.
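One way a launch script could emit the warning described in the last bullet is sketched below. `warn_if_frac_override_ignored` is a hypothetical name and is not defined anywhere in this commit; the sketch only assumes that the script can pass its engine arguments to the helper.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): warn when the profiler override is
# set but the launch arguments pin the KV cache with --kv-cache-memory-bytes.
warn_if_frac_override_ignored() {
    # Nothing to warn about when the profiler override is unset or empty.
    [[ -n "${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}" ]] || return 0
    local arg
    for arg in "$@"; do
        case "$arg" in
            --kv-cache-memory-bytes|--kv-cache-memory-bytes=*)
                echo "WARNING: _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE is set, but --kv-cache-memory-bytes pins the KV cache size; the override only changes the remaining headroom." >&2
                return 0 ;;
        esac
    done
}

# Example call with illustrative arguments; prints the warning to stderr.
_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.5 \
    warn_if_frac_override_ignored --model some/model --kv-cache-memory-bytes 1000000000
```

The helper always returns 0, so it is safe to call from scripts that run under `set -e`.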
#!/bin/bash #!/bin/bash
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
# #
# Shared GPU utility functions for launch scripts. # Shared GPU utility functions for launch scripts.
# #
# Usage: # CLI:
# ./gpu_utils.sh <engine> --model <name> [options...] Print GPU fraction
# ./gpu_utils.sh --self-test Run self-test suite
#
# Source:
# source "$(dirname "$(readlink -f "$0")")/../common/gpu_utils.sh" # source "$(dirname "$(readlink -f "$0")")/../common/gpu_utils.sh"
# # or with SCRIPT_DIR already set: # # or with SCRIPT_DIR already set:
# source "$SCRIPT_DIR/../common/gpu_utils.sh" # source "$SCRIPT_DIR/../common/gpu_utils.sh"
# #
# Functions: # Functions (all return via stdout — no hidden globals):
# get_model_params <model> Set _MP_* vars for a known model's architecture # build_gpu_mem_args <engine> --model <name> ... Prints fraction (or empty)
# estimate_worker_vram <model> ... Set _EW_* vars with per-worker VRAM estimate # get_model_params <model> Prints "pb wb layers kvh hd"
# gpu_worker_fraction <engine> Convert _EW_* estimate → engine-appropriate fraction # estimate_worker_vram <model> ... Prints "w_gib kv_gib oh_gib total_gib"
# gpu_gb_to_total_fraction <gib> Convert absolute GiB → fraction of TOTAL VRAM (vLLM/sglang) # gpu_worker_fraction <engine> <total> <kv> Prints engine-appropriate fraction
# gpu_gb_to_free_fraction <gib> Convert absolute GiB → fraction of FREE VRAM (TensorRT-LLM) # gpu_peak_to_engine_fraction <engine> <peak> Prints fraction (subtracts engine overhead)
# gpu_gb_to_total_fraction <gib> Prints fraction of TOTAL VRAM (vLLM/sglang)
# gpu_gb_to_free_fraction <gib> Prints fraction of FREE VRAM (TensorRT-LLM)
# build_gpu_mem_args <engine> [options...]
#
# Prints the computed memory fraction to stdout (empty line if none).
# Callers capture with: GPU_MEM_FRACTION=$(build_gpu_mem_args ...)
#
# Priority:
# 1. _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE (profiler binary search)
# 2. Engine flag passed to this function (user already chose a value)
# 3. estimate_worker_vram + gpu_worker_fraction (model architecture)
# 4. Empty (let engine use its own default)
#
# Options (each flag accepts engine-specific aliases):
# --model NAME Model name (required).
# aliases: --model-path (sglang, trtllm)
# --max-model-len N Max tokens per sequence (default: 4096).
# aliases: --context-length (sglang)
# --max-seq-len (trtllm)
# --max-num-seqs N Concurrent sequences to budget for (default: 2).
# aliases: --max-running-requests (sglang)
# --max-batch-size (trtllm)
# --gpu-memory-utilization F User override (vllm flag name). Skipped when empty.
# --mem-fraction-static F User override (sglang flag name).
# --workers-per-gpu N Divide the fraction by N (for shared-GPU disagg).
#
# Usage:
# # Simple single-worker (agg.sh)
# GPU_MEM_FRACTION=$(build_gpu_mem_args vllm \
# --model "$MODEL" --max-model-len "$MAX_MODEL_LEN" --max-num-seqs "$MAX_CONCURRENT_SEQS")
# python -m dynamo.vllm --model "$MODEL" \
# ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
#
# # Two workers sharing one GPU (disagg_same_gpu.sh)
# GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL" --workers-per-gpu 2)
# python -m dynamo.vllm ... --gpu-memory-utilization "${GPU_MEM_FRACTION}" &
#
# # sglang
# GPU_MEM_FRACTION=$(build_gpu_mem_args sglang --model "$MODEL" --workers-per-gpu 2)
# python -m dynamo.sglang ... --mem-fraction-static "${GPU_MEM_FRACTION}" &
#
# # trtllm (fraction goes into JSON, not CLI)
# GPU_MEM_FRACTION=$(build_gpu_mem_args trtllm --model "$MODEL" --workers-per-gpu 2)
# OVERRIDE_ARGS=(--override-engine-args "{\"kv_cache_config\":{\"free_gpu_memory_fraction\":${GPU_MEM_FRACTION}}}")
build_gpu_mem_args() {
local engine="${1:?usage: build_gpu_mem_args <engine> --model <name> [options...]}"
shift
local model=""
local max_model_len="4096"
local max_seqs="2"
local workers_per_gpu=1
local user_frac=""
while [[ $# -gt 0 ]]; do
case "$1" in
--model|--model-path)
model="$2"; shift 2 ;;
--max-model-len|--context-length|--max-seq-len)
max_model_len="$2"; shift 2 ;;
--max-num-seqs|--max-running-requests|--max-batch-size)
max_seqs="$2"; shift 2 ;;
--gpu-memory-utilization|--mem-fraction-static)
user_frac="$2"; shift 2 ;;
--workers-per-gpu) workers_per_gpu="$2"; shift 2 ;;
*) echo "build_gpu_mem_args: unknown option '$1'" >&2; return 1 ;;
esac
done
if [[ -z "$model" ]]; then
echo "build_gpu_mem_args: --model is required" >&2
return 1
fi
local frac=""
local from_estimator=false
local est_w="" est_kv="" est_oh="" est_total=""
if [[ -n "${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}" ]]; then
frac="$_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"
elif [[ -n "$user_frac" ]]; then
frac="$user_frac"
elif read -r est_w est_kv est_oh est_total <<< "$(estimate_worker_vram "$model" "$max_model_len" "$max_seqs" "$engine" 2>/dev/null)" && [[ -n "$est_total" ]]; then
frac=$(gpu_worker_fraction "$engine" "$est_total" "$est_kv")
from_estimator=true
fi
# --workers-per-gpu divides whichever value was selected above (profiler, user flag, or estimator)
if [[ -n "$frac" && "$workers_per_gpu" -gt 1 ]]; then
frac=$(awk -v f="$frac" -v n="$workers_per_gpu" 'BEGIN { printf "%.2f", f / n }')
fi
echo "$frac"
}
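The priority order and the `--workers-per-gpu` division can be isolated in a few lines. The sketch below is a simplified stand-in, not the function above: `pick_fraction` is a hypothetical name, and the three candidates are passed as positional arguments instead of being read from the environment, the flags, and the estimator.

```bash
#!/usr/bin/env bash
# Hypothetical, simplified stand-in for the selection logic: the first
# non-empty candidate wins, then the result is divided by workers-per-gpu.
pick_fraction() {
    local profiler="$1" user="$2" estimate="$3" workers="${4:-1}"
    local frac="${profiler:-${user:-$estimate}}"
    if [[ -n "$frac" && "$workers" -gt 1 ]]; then
        frac=$(awk -v f="$frac" -v n="$workers" 'BEGIN { printf "%.2f", f / n }')
    fi
    echo "$frac"
}

pick_fraction 0.55 0.70 0.30      # profiler override wins
pick_fraction ""   0.70 0.30      # user flag wins
pick_fraction ""   ""   0.30      # estimator result
pick_fraction 0.80 ""   ""   2    # 0.80 split across two workers
pick_fraction ""   ""   ""        # nothing known: empty line
```

The calls print `0.55`, `0.70`, `0.30`, `0.40`, and an empty line, matching the four priority levels documented for `build_gpu_mem_args`.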
# get_model_params <model_name> # get_model_params <model_name>
# #
# Sets _MP_* variables for a known model's architecture: # Prints "params_b weight_bytes layers kv_heads head_dim" to stdout.
# _MP_PARAMS_B Total parameters in billions (all experts for MoE) # Returns 1 (prints nothing) if the model is unknown.
# _MP_WEIGHT_BYTES Bytes per weight element (2=BF16/FP16, 1=FP8) #
# _MP_LAYERS Number of transformer layers # Fields:
# _MP_KV_HEADS Number of key-value heads (GQA groups) # params_b Total parameters in billions (all experts for MoE)
# _MP_HEAD_DIM Dimension per attention head # weight_bytes Bytes per weight element (2=BF16/FP16, 1=FP8)
# layers Number of transformer layers
# kv_heads Number of key-value heads (GQA groups)
# head_dim Dimension per attention head
# #
# KV cache is assumed BF16 (2 bytes per element) regardless of weight dtype, # KV cache is assumed BF16 (2 bytes per element) regardless of weight dtype,
# since FP8 KV cache (--kv-cache-dtype fp8) is opt-in and not the default. # since FP8 KV cache (--kv-cache-dtype fp8) is opt-in and not the default.
# #
# To add a model: look up config.json on HuggingFace for num_hidden_layers, # To add a model:
# num_key_value_heads, and head_dim. For VL/multimodal models, use the # 1. Find config.json at https://huggingface.co/<model>/raw/main/config.json
# text_config section. For MoE, _MP_PARAMS_B is the TOTAL param count # For VL/multimodal models, architecture params are under text_config.
# (all experts are loaded into VRAM). # 2. Map fields:
# layers ← num_hidden_layers
# kv_heads ← num_key_value_heads
# head_dim ← head_dim (or hidden_size / num_attention_heads)
# 3. params_b: total parameter count in billions. Derive from:
# - safetensors file size: size_bytes / weight_bytes / 1e9
# (single file: ls -l model.safetensors; sharded: metadata.total_size
# in model.safetensors.index.json)
# - or the model card / paper
# For MoE: params_b is the TOTAL count (all experts loaded into VRAM).
# 4. weight_bytes: 2 for BF16/FP16, 1 for FP8/INT8.
# #
# Usage: # Usage:
# get_model_params "Qwen/Qwen3-0.6B" # read -r pb wb layers kvh hd <<< "$(get_model_params "Qwen/Qwen3-0.6B")"
# echo "$_MP_LAYERS layers, $_MP_KV_HEADS KV heads" # echo "$layers layers, $kvh KV heads"
get_model_params() { get_model_params() {
local model="${1:?usage: get_model_params <model_name>}" local model="${1:?usage: get_model_params <model_name>}"
local pb wb layers kvh hd
case "$model" in case "$model" in
# https://huggingface.co/Qwen/Qwen3-0.6B/raw/main/config.json
Qwen/Qwen3-0.6B) Qwen/Qwen3-0.6B)
_MP_PARAMS_B=0.6; _MP_WEIGHT_BYTES=2 pb=0.6; wb=2; layers=28; kvh=8; hd=128 ;;
_MP_LAYERS=28; _MP_KV_HEADS=8; _MP_HEAD_DIM=128 ;; # https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct/raw/main/config.json (text_config)
# params_b from model.safetensors.index.json metadata.total_size / 2 / 1e9
Qwen/Qwen2-VL-2B-Instruct)
pb=2.2; wb=2; layers=28; kvh=2; hd=128 ;;
# https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct/raw/main/config.json (text_config)
Qwen/Qwen2.5-VL-7B-Instruct) Qwen/Qwen2.5-VL-7B-Instruct)
_MP_PARAMS_B=8.3; _MP_WEIGHT_BYTES=2 pb=8.3; wb=2; layers=28; kvh=4; hd=128 ;;
_MP_LAYERS=28; _MP_KV_HEADS=4; _MP_HEAD_DIM=128 ;; # https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct/raw/main/config.json (text_config)
# params_b from model.safetensors size / 2 / 1e9
Qwen/Qwen3-VL-2B-Instruct)
pb=2.1; wb=2; layers=28; kvh=8; hd=128 ;;
# https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct/raw/main/config.json (text_config)
Qwen/Qwen3-VL-8B-Instruct) Qwen/Qwen3-VL-8B-Instruct)
_MP_PARAMS_B=9.2; _MP_WEIGHT_BYTES=2 pb=9.2; wb=2; layers=36; kvh=8; hd=128 ;;
_MP_LAYERS=36; _MP_KV_HEADS=8; _MP_HEAD_DIM=128 ;; # https://huggingface.co/Qwen/Qwen3-30B-A3B/raw/main/config.json
Qwen/Qwen3-30B-A3B|\ Qwen/Qwen3-30B-A3B|\
Qwen/Qwen3-30B-A3B-Instruct) Qwen/Qwen3-30B-A3B-Instruct)
_MP_PARAMS_B=30.5; _MP_WEIGHT_BYTES=2 pb=30.5; wb=2; layers=48; kvh=4; hd=128 ;;
_MP_LAYERS=48; _MP_KV_HEADS=4; _MP_HEAD_DIM=128 ;; # Same architecture as Qwen3-30B-A3B but FP8 quantized (1 byte per weight)
Qwen/Qwen3-VL-30B-A3B-Instruct-FP8) Qwen/Qwen3-VL-30B-A3B-Instruct-FP8)
_MP_PARAMS_B=30.5; _MP_WEIGHT_BYTES=1 pb=30.5; wb=1; layers=48; kvh=4; hd=128 ;;
_MP_LAYERS=48; _MP_KV_HEADS=4; _MP_HEAD_DIM=128 ;; # https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/raw/main/config.json
meta-llama/Meta-Llama-3.1-8B-Instruct) meta-llama/Meta-Llama-3.1-8B-Instruct)
_MP_PARAMS_B=8.0; _MP_WEIGHT_BYTES=2 pb=8.0; wb=2; layers=32; kvh=8; hd=128 ;;
_MP_LAYERS=32; _MP_KV_HEADS=8; _MP_HEAD_DIM=128 ;; # https://huggingface.co/deepseek-ai/deepseek-llm-7b-base/raw/main/config.json
# MHA (not GQA): num_key_value_heads == num_attention_heads == 32
deepseek-ai/deepseek-llm-7b-base)
pb=6.9; wb=2; layers=30; kvh=32; hd=128 ;;
# https://huggingface.co/llava-hf/llava-1.5-7b-hf/raw/main/config.json (text_config)
# MHA: num_key_value_heads == num_attention_heads == 32
llava-hf/llava-1.5-7b-hf) llava-hf/llava-1.5-7b-hf)
_MP_PARAMS_B=7.1; _MP_WEIGHT_BYTES=2 pb=7.1; wb=2; layers=32; kvh=32; hd=128 ;;
_MP_LAYERS=32; _MP_KV_HEADS=32; _MP_HEAD_DIM=128 ;;
*) *)
echo "get_model_params: unknown model '$model'" >&2 echo "get_model_params: unknown model '$model'" >&2
echo "Add it to get_model_params() in gpu_utils.sh" >&2 echo "Add it to get_model_params() in gpu_utils.sh" >&2
return 1 ;; return 1 ;;
esac esac
echo "$pb $wb $layers $kvh $hd"
} }
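Step 3 of the "To add a model" notes derives `params_b` from the checkpoint size. A minimal sketch of that arithmetic follows; `params_b_from_size` is a hypothetical helper, and the byte count in the example is an illustrative number, not the size of any real checkpoint.

```bash
#!/usr/bin/env bash
# Hypothetical helper: derive params_b (billions of parameters) from a
# checkpoint size in bytes and the bytes per weight element.
params_b_from_size() {
    local size_bytes="${1:?usage: params_b_from_size <size_bytes> <weight_bytes>}"
    local weight_bytes="${2:?usage: params_b_from_size <size_bytes> <weight_bytes>}"
    awk -v s="$size_bytes" -v wb="$weight_bytes" \
        'BEGIN { printf "%.1f\n", s / wb / 1e9 }'
}

# Illustrative input: 4.4e9 bytes of BF16 weights (2 bytes each).
params_b_from_size 4400000000 2
```

With these inputs the helper prints `2.2`, the value that would go into the `pb=` field of a new `case` entry.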
# estimate_worker_vram <model> [max_model_len] [max_concurrent_seqs] [engine_or_overhead] # estimate_worker_vram <model> [max_model_len] [max_concurrent_seqs] [engine_or_overhead]
# #
# Calls get_model_params, then sets: # Prints "weights_gib kv_gib overhead_gib total_gib" to stdout.
# _EW_WEIGHTS_GIB Estimated model weight memory # Returns 1 (prints nothing) if the model is unknown to get_model_params.
# _EW_KV_GIB Estimated KV cache memory
# _EW_OVERHEAD_GIB Overhead used (auto-computed or explicit)
# _EW_TOTAL_GIB Estimated total per-worker VRAM (weights + kv + overhead)
# #
# Formula: # Formula:
# weights = params_b * 1e9 * weight_bytes # weights = params_b * 1e9 * weight_bytes
...@@ -102,68 +225,60 @@ get_model_params() { ...@@ -102,68 +225,60 @@ get_model_params() {
# See examples/common/gpu_utils.md for the full derivation. # See examples/common/gpu_utils.md for the full derivation.
# #
# Usage: # Usage:
# estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 vllm # auto overhead # read -r w kv oh total <<< "$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 vllm)"
# estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 trtllm # auto overhead # echo "$total GiB (w=$w kv=$kv oh=$oh)"
# estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 3.5 # explicit 3.5 GiB
# estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 # default 2.0 GiB
# echo "$_EW_TOTAL_GIB GiB (w=$_EW_WEIGHTS_GIB kv=$_EW_KV_GIB oh=$_EW_OVERHEAD_GIB)"
estimate_worker_vram() { estimate_worker_vram() {
local model="${1:?usage: estimate_worker_vram <model> [seq_len] [seqs] [engine_or_overhead]}" local model="${1:?usage: estimate_worker_vram <model> [seq_len] [seqs] [engine_or_overhead]}"
local seqlen="${2:-4096}" local seqlen="${2:-4096}"
local seqs="${3:-2}" local seqs="${3:-2}"
local engine_or_overhead="${4:-2.0}" local engine_or_overhead="${4:-2.0}"
get_model_params "$model" || return 1 local mp_out
mp_out=$(get_model_params "$model") || return 1
local pb wb layers kvh hd
read -r pb wb layers kvh hd <<< "$mp_out"
local overhead local overhead
case "$engine_or_overhead" in case "$engine_or_overhead" in
vllm) overhead=$(awk -v p="$_MP_PARAMS_B" 'BEGIN { printf "%.1f", 1.2 + 1.0 * sqrt(p) }') ;; vllm) overhead=$(awk -v p="$pb" 'BEGIN { printf "%.1f", 1.2 + 1.0 * sqrt(p) }') ;;
sglang) overhead=$(awk -v p="$_MP_PARAMS_B" 'BEGIN { printf "%.1f", 2.5 + 1.5 * sqrt(p) }') ;; sglang) overhead=$(awk -v p="$pb" 'BEGIN { printf "%.1f", 2.5 + 1.5 * sqrt(p) }') ;;
trtllm) overhead=$(awk -v p="$_MP_PARAMS_B" 'BEGIN { printf "%.1f", 2.0 + 1.2 * sqrt(p) }') ;; trtllm) overhead=$(awk -v p="$pb" 'BEGIN { printf "%.1f", 2.0 + 1.2 * sqrt(p) }') ;;
*) overhead="$engine_or_overhead" ;; *) overhead="$engine_or_overhead" ;;
esac esac
_EW_OVERHEAD_GIB="$overhead" awk -v pb="$pb" -v wbytes="$wb" \
read -r _EW_WEIGHTS_GIB _EW_KV_GIB _EW_TOTAL_GIB <<< "$(awk \ -v layers="$layers" -v heads="$kvh" -v dim="$hd" \
-v pb="$_MP_PARAMS_B" -v wbytes="$_MP_WEIGHT_BYTES" \
-v layers="$_MP_LAYERS" -v heads="$_MP_KV_HEADS" -v dim="$_MP_HEAD_DIM" \
-v seqlen="$seqlen" -v seqs="$seqs" -v overhead="$overhead" \ -v seqlen="$seqlen" -v seqs="$seqs" -v overhead="$overhead" \
'BEGIN { 'BEGIN {
gib = 1024 * 1024 * 1024 gib = 1024 * 1024 * 1024
w = pb * 1e9 * wbytes / gib w = pb * 1e9 * wbytes / gib
kv = 2 * layers * heads * dim * 2 * seqlen * seqs / gib kv = 2 * layers * heads * dim * 2 * seqlen * seqs / gib
printf "%.1f %.1f %.1f", w, kv, w + kv + overhead printf "%.1f %.1f %.1f %.1f", w, kv, overhead, w + kv + overhead
}')" }'
} }
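To make the formula concrete, the sketch below hard-codes the `Qwen/Qwen3-0.6B` row from `get_model_params` (`pb=0.6 wb=2 layers=28 kvh=8 hd=128`) with `seqlen=4096`, `seqs=2`, and the vllm overhead rule. `estimate_demo` is a hypothetical name; it repeats the arithmetic without calling the real functions.

```bash
#!/usr/bin/env bash
# Worked example with the Qwen/Qwen3-0.6B parameters and the vllm overhead
# rule (1.2 + 1.0 * sqrt(pb)). Values are hard-coded for illustration.
estimate_demo() {
    local overhead
    overhead=$(awk -v p=0.6 'BEGIN { printf "%.1f", 1.2 + 1.0 * sqrt(p) }')
    awk -v pb=0.6 -v wbytes=2 -v layers=28 -v heads=8 -v dim=128 \
        -v seqlen=4096 -v seqs=2 -v overhead="$overhead" \
        'BEGIN {
            gib = 1024 * 1024 * 1024
            w  = pb * 1e9 * wbytes / gib                             # weights
            kv = 2 * layers * heads * dim * 2 * seqlen * seqs / gib  # K and V, 2 bytes each
            printf "%.1f %.1f %.1f %.1f\n", w, kv, overhead, w + kv + overhead
        }'
}

estimate_demo
```

This prints `1.1 0.9 2.0 4.0`: about 1.1 GiB of weights, 0.875 GiB of KV cache (rounded to 0.9), 2.0 GiB of overhead, and 4.0 GiB in total. These are the same `4.0` and `0.9` values used in the `gpu_worker_fraction` usage examples.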
# gpu_worker_fraction <engine> [gpu_index] # gpu_worker_fraction <engine> <total_gib> <kv_gib> [gpu_index]
# #
# Unified fraction calculator for all engines. Reads the _EW_* variables # Convert estimated GiB into the engine-appropriate GPU memory fraction.
# set by estimate_worker_vram and returns the engine-appropriate fraction.
# #
# Engine semantics (see examples/common/gpu_utils.md): # Engine semantics (see examples/common/gpu_utils.md):
# vllm/sglang — fraction of TOTAL VRAM. The engine budgets weights + KV + # vllm/sglang — fraction of TOTAL VRAM (uses total_gib).
# activations inside this limit. We pass _EW_TOTAL_GIB. # trtllm — fraction of FREE VRAM after model load (uses kv_gib).
# trtllm — fraction of FREE VRAM (after model load). The engine uses
# this only for KV cache. We pass _EW_KV_GIB.
#
# This lets every launch script use the same pattern:
# estimate_worker_vram "$MODEL" "$SEQ_LEN" "$CONCURRENCY" "$OVERHEAD_GIB"
# GPU_MEM_FRACTION=$(gpu_worker_fraction "<engine>")
# #
# Usage: # Usage:
# gpu_worker_fraction vllm # uses _EW_TOTAL_GIB, fraction of total # gpu_worker_fraction vllm 4.0 0.9 # fraction of total
# gpu_worker_fraction sglang # same as vllm # gpu_worker_fraction trtllm 4.0 0.9 # fraction of free
# gpu_worker_fraction trtllm # uses _EW_KV_GIB, fraction of free # gpu_worker_fraction trtllm 4.0 0.9 1 # query GPU index 1
# gpu_worker_fraction trtllm 1 # query GPU index 1
gpu_worker_fraction() { gpu_worker_fraction() {
local engine="${1:?usage: gpu_worker_fraction <engine> [gpu_index]}" local engine="${1:?usage: gpu_worker_fraction <engine> <total_gib> <kv_gib> [gpu_index]}"
local gpu_idx="${2:-0}" local total_gib="${2:?usage: gpu_worker_fraction <engine> <total_gib> <kv_gib>}"
local kv_gib="${3:?usage: gpu_worker_fraction <engine> <total_gib> <kv_gib>}"
local gpu_idx="${4:-0}"
case "$engine" in case "$engine" in
vllm|sglang) vllm|sglang)
gpu_gb_to_total_fraction "$_EW_TOTAL_GIB" "$gpu_idx" ;; gpu_gb_to_total_fraction "$total_gib" "$gpu_idx" ;;
trtllm) trtllm)
gpu_gb_to_free_fraction "$_EW_KV_GIB" "$gpu_idx" ;; gpu_gb_to_free_fraction "$kv_gib" "$gpu_idx" ;;
*) *)
echo "gpu_worker_fraction: unknown engine '$engine'" >&2 echo "gpu_worker_fraction: unknown engine '$engine'" >&2
echo "Supported: vllm, sglang, trtllm" >&2 echo "Supported: vllm, sglang, trtllm" >&2
...@@ -171,6 +286,51 @@ gpu_worker_fraction() { ...@@ -171,6 +286,51 @@ gpu_worker_fraction() {
esac esac
} }
# gpu_peak_to_engine_fraction <engine> <peak_gib> [gpu_index]
#
# Convert a measured/profiled GPU peak (total VRAM including CUDA context,
# activations, etc.) into the engine-specific memory fraction flag.
#
# Each engine's fraction controls only a SUBSET of GPU memory (e.g. vLLM's
# --gpu-memory-utilization covers weights + KV cache but not CUDA context).
# This function subtracts the engine-specific overhead so the fraction
# targets the right internal budget, keeping the real peak stable across
# re-profiles.
#
# Overhead constants (GiB outside the engine's budget):
# vllm 2.0 CUDA ctx ~0.6 + activations/sampler ~0.5 + PyTorch alloc ~0.5
# sglang 2.0 (assumed same as vllm; refine when profiled)
# trtllm 0.0 free-fraction is measured after model load, no subtraction needed
#
# Usage:
# gpu_peak_to_engine_fraction vllm 8.6 # on 48 GiB → 0.14
# gpu_peak_to_engine_fraction vllm 20.9 # on 48 GiB → 0.40
# gpu_peak_to_engine_fraction vllm 8.6 1 # query GPU index 1
gpu_peak_to_engine_fraction() {
local engine="${1:?usage: gpu_peak_to_engine_fraction <engine> <peak_gib> [gpu_index]}"
local peak_gib="${2:?usage: gpu_peak_to_engine_fraction <engine> <peak_gib> [gpu_index]}"
local gpu_idx="${3:-0}"
local overhead
case "$engine" in
vllm|sglang) overhead=2.0 ;;
trtllm) overhead=0.0 ;;
*)
echo "gpu_peak_to_engine_fraction: unknown engine '$engine'" >&2
echo "Supported: vllm, sglang, trtllm" >&2
return 1 ;;
esac
local budget
budget=$(awk -v g="$peak_gib" -v oh="$overhead" \
'BEGIN { b = g - oh; if (b < 1) b = 1; printf "%.1f", b }')
case "$engine" in
vllm|sglang) gpu_gb_to_total_fraction "$budget" "$gpu_idx" ;;
trtllm) gpu_gb_to_free_fraction "$budget" "$gpu_idx" ;;
esac
}
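The subtraction and the 1 GiB floor can be tried without a GPU. In the sketch below the GPU size is passed in as an argument instead of being queried, and the last step is a plain division rounded to two decimals. Both are assumptions: the body of `gpu_gb_to_total_fraction` is not shown in this diff and may round differently. `peak_to_total_fraction_demo` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: subtract the out-of-budget overhead from a measured
# peak, clamp to at least 1 GiB, then divide by an assumed total GPU size.
peak_to_total_fraction_demo() {
    local peak_gib="$1" overhead_gib="$2" total_gib="$3"
    awk -v g="$peak_gib" -v oh="$overhead_gib" -v t="$total_gib" 'BEGIN {
        b = g - oh
        if (b < 1) b = 1          # never budget less than 1 GiB
        printf "%.2f\n", b / t
    }'
}

peak_to_total_fraction_demo 8.6 2.0 48   # 6.6 GiB budget on a 48 GiB GPU
peak_to_total_fraction_demo 2.5 2.0 48   # 0.5 GiB is clamped up to 1 GiB
```

The first call prints `0.14`, in line with the first usage comment above; the second prints `0.02` because the budget is clamped to 1 GiB before the division.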
# gpu_gb_to_total_fraction <gib> [gpu_index] # gpu_gb_to_total_fraction <gib> [gpu_index]
# #
# For vLLM / sglang: --gpu-memory-utilization is a fraction of TOTAL GPU memory. # For vLLM / sglang: --gpu-memory-utilization is a fraction of TOTAL GPU memory.
...@@ -298,3 +458,189 @@ gpu_gb_to_free_fraction() { ...@@ -298,3 +458,189 @@ gpu_gb_to_free_fraction() {
}' }'
} }
# ---------------------------------------------------------------------------
# Self-test: bash gpu_utils.sh --self-test
# ---------------------------------------------------------------------------
_gpu_utils_self_test() {
local pass=0 fail=0
_assert() {
local label="$1" expected="$2" actual="$3"
if [[ "$expected" == "$actual" ]]; then
((pass++))
echo " PASS $label"
else
((fail++))
echo " FAIL $label (expected='$expected' actual='$actual')"
fi
}
echo "=== get_model_params ==="
local out
out=$(get_model_params "Qwen/Qwen3-0.6B")
_assert "known model returns 5 fields" "0.6 2 28 8 128" "$out"
out=$(get_model_params "nope/unknown" 2>/dev/null)
_assert "unknown model returns empty" "" "$out"
get_model_params "nope/unknown" >/dev/null 2>&1
_assert "unknown model exits 1" "1" "$?"
echo ""
echo "=== estimate_worker_vram ==="
out=$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 vllm)
_assert "returns 4 space-separated fields" "4" "$(echo "$out" | wc -w | tr -d ' ')"
local w kv oh total
read -r w kv oh total <<< "$out"
_assert "weights > 0" "yes" "$(awk -v v="$w" 'BEGIN { print (v > 0) ? "yes" : "no" }')"
_assert "total > weights" "yes" "$(awk -v t="$total" -v w="$w" 'BEGIN { print (t > w) ? "yes" : "no" }')"
out=$(estimate_worker_vram "nope/unknown" 2>/dev/null)
_assert "unknown model returns empty" "" "$out"
local out_vllm out_sglang
out_vllm=$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 vllm)
out_sglang=$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 sglang)
_assert "sglang overhead > vllm overhead" "yes" \
"$(awk -v v="$out_vllm" -v s="$out_sglang" 'BEGIN {
split(v, a); split(s, b); print (b[3]+0 > a[3]+0) ? "yes" : "no"
}')"
echo ""
echo "=== build_gpu_mem_args: estimator path (known model) ==="
local frac
frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2)
_assert "FRACTION non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
echo ""
echo "=== build_gpu_mem_args: unknown model, no default ==="
frac=$(build_gpu_mem_args vllm --model "nope/unknown")
_assert "FRACTION empty" "" "$frac"
echo ""
echo "=== build_gpu_mem_args: profiler wins over all ==="
frac=$(_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.55 \
build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --gpu-memory-utilization 0.70)
_assert "FRACTION = profiler (beats user flag)" "0.55" "$frac"
echo ""
echo "=== build_gpu_mem_args: user flag wins over estimator ==="
frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --gpu-memory-utilization 0.70)
_assert "FRACTION = user flag" "0.70" "$frac"
echo ""
echo "=== build_gpu_mem_args: empty user flag falls through ==="
frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2 --gpu-memory-utilization "")
_assert "FRACTION = estimator" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
echo ""
echo "=== build_gpu_mem_args: --workers-per-gpu divides estimator ==="
local undivided
undivided=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2)
frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2 --workers-per-gpu 2)
local expected_half
expected_half=$(awk -v f="$undivided" 'BEGIN { printf "%.2f", f / 2 }')
_assert "FRACTION halved" "$expected_half" "$frac"
echo ""
echo "=== build_gpu_mem_args: --workers-per-gpu divides profiler ==="
frac=$(_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.80 \
build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --workers-per-gpu 2)
_assert "FRACTION = 0.80/2 = 0.40" "0.40" "$frac"
echo ""
echo "=== build_gpu_mem_args: sglang engine (sglang flag names) ==="
frac=$(build_gpu_mem_args sglang --model-path "Qwen/Qwen3-0.6B" --context-length 4096 --max-running-requests 2)
_assert "sglang FRACTION non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
echo ""
echo "=== build_gpu_mem_args: trtllm engine (trtllm flag names) ==="
frac=$(build_gpu_mem_args trtllm --model-path "Qwen/Qwen3-0.6B" --max-seq-len 4096 --max-batch-size 2)
_assert "trtllm FRACTION non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
echo ""
echo "=== build_gpu_mem_args: --mem-fraction-static user flag (sglang) ==="
frac=$(build_gpu_mem_args sglang --model-path "Qwen/Qwen3-0.6B" --mem-fraction-static 0.60)
_assert "FRACTION = user flag" "0.60" "$frac"
echo ""
echo "=== build_gpu_mem_args: missing --model ==="
build_gpu_mem_args vllm 2>/dev/null
_assert "missing --model exits 1" "1" "$?"
echo ""
echo "=== gpu_worker_fraction: explicit args ==="
local frac
frac=$(gpu_worker_fraction vllm 4.0 0.9)
_assert "vllm returns non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
frac=$(gpu_worker_fraction trtllm 4.0 0.9)
_assert "trtllm returns non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
gpu_worker_fraction badengine 4.0 0.9 >/dev/null 2>&1
_assert "bad engine exits 1" "1" "$?"
echo ""
echo "=========================================="
echo "Results: $pass passed, $fail failed"
echo "=========================================="
[[ "$fail" -eq 0 ]]
}
# CLI mode: only when executed directly (not sourced by another script)
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
if [[ "${1:-}" == "--self-test" ]]; then
_gpu_utils_self_test
exit $?
fi
if [[ $# -gt 0 ]]; then
build_gpu_mem_args "$@"
exit $?
fi
cat <<'HELP'
gpu_utils.sh — GPU memory fraction estimator
Usage:
./gpu_utils.sh <engine> --model <name> [options...]
./gpu_utils.sh --self-test
Engines: vllm, sglang, trtllm
Examples:
./gpu_utils.sh vllm --model Qwen/Qwen3-0.6B
./gpu_utils.sh vllm --model Qwen/Qwen3-0.6B --max-model-len 4096 --max-num-seqs 2
./gpu_utils.sh vllm --model Qwen/Qwen3-0.6B --workers-per-gpu 2
./gpu_utils.sh sglang --model Qwen/Qwen3-0.6B --context-length 8192
./gpu_utils.sh trtllm --model meta-llama/Meta-Llama-3.1-8B-Instruct --max-seq-len 4096
Options:
--model NAME Model name (required)
aliases: --model-path
--max-model-len N Max sequence length (default: 4096)
aliases: --context-length, --max-seq-len
--max-num-seqs N Concurrent sequences (default: 2)
aliases: --max-running-requests, --max-batch-size
--gpu-memory-utilization F Override fraction (vllm flag)
aliases: --mem-fraction-static
--workers-per-gpu N Divide fraction by N (shared-GPU disagg)
--self-test Run built-in test suite
Output: prints the fraction to stdout (empty if model is unknown).
HELP
exit 0
fi
...@@ -135,6 +135,12 @@ print_launch_banner() { ...@@ -135,6 +135,12 @@ print_launch_banner() {
echo "==========================================" echo "=========================================="
echo "Model: $_model" echo "Model: $_model"
echo "Frontend: http://localhost:$_port" echo "Frontend: http://localhost:$_port"
local _seq_len="${MAX_MODEL_LEN:-${CONTEXT_LENGTH:-${MAX_SEQ_LEN:-}}}"
local _frac="${GPU_MEM_FRACTION:-}"
[[ -n "$_seq_len" ]] && echo "Max seq len: $_seq_len"
[[ -n "$_frac" ]] && echo "GPU frac: $_frac"
for _line in "$@"; do for _line in "$@"; do
echo "$_line" echo "$_line"
done done
...@@ -4,6 +4,9 @@ ...@@ -4,6 +4,9 @@
set -e set -e
trap 'echo Cleaning up...; kill 0' EXIT trap 'echo Cleaning up...; kill 0' EXIT
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source "$SCRIPT_DIR/../../common/gpu_utils.sh"
# Default values # Default values
MODEL_NAME="Qwen/Qwen2-Audio-7B-Instruct" MODEL_NAME="Qwen/Qwen2-Audio-7B-Instruct"
PROMPT_TEMPLATE="" PROMPT_TEMPLATE=""
...@@ -90,8 +93,10 @@ python -m dynamo.frontend --http-port 8000 & ...@@ -90,8 +93,10 @@ python -m dynamo.frontend --http-port 8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" & python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers # run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME & CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill & VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
# Wait for all background processes to complete # Wait for all background processes to complete
wait wait
...@@ -4,6 +4,9 @@ ...@@ -4,6 +4,9 @@
set -e set -e
trap 'echo Cleaning up...; kill 0' EXIT trap 'echo Cleaning up...; kill 0' EXIT
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source "$SCRIPT_DIR/../../common/gpu_utils.sh"
# Default values # Default values
MODEL_NAME="Qwen/Qwen2-Audio-7B-Instruct" MODEL_NAME="Qwen/Qwen2-Audio-7B-Instruct"
PROMPT_TEMPLATE="" PROMPT_TEMPLATE=""
...@@ -90,9 +93,11 @@ python -m dynamo.frontend --http-port 8000 & ...@@ -90,9 +93,11 @@ python -m dynamo.frontend --http-port 8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" & python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers # run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME & CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg & DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg & DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
# Wait for all background processes to complete # Wait for all background processes to complete
wait wait
...@@ -4,6 +4,9 @@ ...@@ -4,6 +4,9 @@
set -e set -e
trap 'echo Cleaning up...; kill 0' EXIT trap 'echo Cleaning up...; kill 0' EXIT
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source "$SCRIPT_DIR/../../common/gpu_utils.sh"
# Default values # Default values
MODEL_NAME="llava-hf/LLaVA-NeXT-Video-7B-hf" MODEL_NAME="llava-hf/LLaVA-NeXT-Video-7B-hf"
PROMPT_TEMPLATE="USER: <video>\n<prompt> ASSISTANT:" PROMPT_TEMPLATE="USER: <video>\n<prompt> ASSISTANT:"
...@@ -16,8 +19,10 @@ python -m dynamo.frontend --http-port=8000 & ...@@ -16,8 +19,10 @@ python -m dynamo.frontend --http-port=8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" & python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers # run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE & CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE &
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill & VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
# Wait for all background processes to complete # Wait for all background processes to complete
wait wait
...@@ -4,6 +4,9 @@ ...@@ -4,6 +4,9 @@
set -e set -e
trap 'echo Cleaning up...; kill 0' EXIT trap 'echo Cleaning up...; kill 0' EXIT
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source "$SCRIPT_DIR/../../common/gpu_utils.sh"
# Default values # Default values
MODEL_NAME="llava-hf/LLaVA-NeXT-Video-7B-hf" MODEL_NAME="llava-hf/LLaVA-NeXT-Video-7B-hf"
PROMPT_TEMPLATE="USER: <video>\n<prompt> ASSISTANT:" PROMPT_TEMPLATE="USER: <video>\n<prompt> ASSISTANT:"
...@@ -17,9 +20,11 @@ python -m dynamo.frontend --http-port=8000 & ...@@ -17,9 +20,11 @@ python -m dynamo.frontend --http-port=8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" & python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers # run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE & CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE &
DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg & DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg & DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
# Wait for all background processes to complete # Wait for all background processes to complete
wait wait
...@@ -233,6 +233,7 @@ markers = [ ...@@ -233,6 +233,7 @@ markers = [
"gpu_2: marks tests to run on 2GPUs", "gpu_2: marks tests to run on 2GPUs",
"gpu_4: marks tests to run on 4GPUs", "gpu_4: marks tests to run on 4GPUs",
"gpu_8: marks tests to run on 8GPUs", "gpu_8: marks tests to run on 8GPUs",
"max_vram_gib(N): peak VRAM in GiB (with 10% safety). Filter with --max-vram-gib=N",
"e2e: marks tests as end-to-end tests", "e2e: marks tests as end-to-end tests",
"integration: marks tests as integration tests", "integration: marks tests as integration tests",
"unit: marks tests as unit tests", "unit: marks tests as unit tests",
......
...@@ -116,6 +116,7 @@ Markers are required for all tests. They are used for test selection in CI and l
| Lifecycle [required] | pre_merge, post_merge, nightly, weekly, release | When the test should run |
| Test Type [required] | unit, integration, e2e, benchmark, performance, stress, multimodal | Nature of the test |
| Hardware [required] | gpu_0, gpu_1, gpu_2, gpu_4, gpu_8, h100 | Number/type of GPUs required |
| VRAM Requirement | max_vram_gib(N) | Peak VRAM in GiB (with 10% safety margin). Pass `--max-vram-gib=N` to pytest to select only tests that fit on the available GPU. The marker alone does not stop a test from running on a smaller GPU, where it will OOM. Use `profile_pytest.py` to measure. |
| Component/Framework | vllm, trtllm, sglang, kvbm, kvbm_concurrency, planner, router | Backend or component specificity |
| Infrastructure | k8s, deploy, fault_tolerance | Infrastructure/environment needs |
| Execution | parallel | Test can run in parallel with pytest-xdist. Must use dynamic port allocation (`alloc_ports`) and not share resources (e.g. filesystem) |
...@@ -126,11 +127,30 @@ Markers are required for all tests. They are used for test selection in CI and l
@pytest.mark.pre_merge
@pytest.mark.integration
@pytest.mark.gpu_1
@pytest.mark.max_vram_gib(21) # peak 18.5 GiB GPU RAM used (+10% safety: 20.4 GiB)
@pytest.mark.vllm
def test_kv_cache_behavior():
    ...
```
### Filtering by VRAM
The `max_vram_gib(N)` marker records how much GPU memory a test needs. The pytest invocation can use `--max-vram-gib=N` as a **selector** to run only tests that fit on the available GPU. Tests that exceed the budget are skipped at collection time (before any test starts). Tests without a `max_vram_gib` marker always run (no constraint assumed).
Nothing prevents you from running without this flag — but if a test needs more VRAM than is physically available, it will OOM at runtime (e.g., vLLM raises `ValueError: No available memory for the cache blocks`).
```bash
# Run only tests that fit on a 48 GiB GPU — tests needing >48 GiB are skipped
python3 -m pytest --max-vram-gib=48 tests/
# GPU tests that have no max_vram_gib marker yet — need profiling
# TODO: profile these tests and add max_vram_gib markers
python3 -m pytest -m "(gpu_1 or gpu_2 or gpu_4 or gpu_8) and not max_vram_gib" tests/
# No filter — run everything regardless of VRAM (tests that exceed available memory will OOM)
python3 -m pytest tests/
```
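
The selection rule described above can be sketched as a small predicate. This is an illustration only: `should_skip` is a hypothetical helper name and is not a function in the repository's `conftest.py`; it simply restates the rule that a test is skipped only when its marker value is strictly greater than the limit.

```python
from typing import Optional


def should_skip(marker_gib: Optional[float], limit_gib: Optional[float]) -> bool:
    """Return True when a test would be skipped under --max-vram-gib.

    marker_gib: the N from @pytest.mark.max_vram_gib(N), or None if unmarked.
    limit_gib:  the value passed to --max-vram-gib, or None if the flag is absent.
    """
    if limit_gib is None:
        # No flag given: nothing is filtered.
        return False
    if marker_gib is None:
        # Unmarked tests always run (no constraint assumed).
        return False
    # Skipped only when the recorded requirement is strictly over budget.
    return marker_gib > limit_gib
```

With a 48 GiB limit, a test marked `max_vram_gib(21)` runs, a test marked `max_vram_gib(60)` is skipped, and an unmarked test runs.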
### Lifecycle Marker Note
Use the marker for the earliest pipeline stage where the test must run (e.g., `@pytest.mark.pre_merge`). This ensures the test is included in that stage and all subsequent ones (e.g., nightly, release), as CI pipelines select tests marked for earlier stages.
...@@ -416,6 +436,113 @@ GPU and model-loading overhead means Dynamo E2E tests are inherently slower than
---
## GPU VRAM Profiler (`profile_pytest.py`)
When writing or reviewing GPU tests, use `tests/utils/profile_pytest.py` to measure how much VRAM a test actually needs. The script runs the test repeatedly with different GPU memory caps and uses binary search to find the minimum VRAM required. It then prints recommended pytest markers you can copy into your test.
### How it works
The profiler sets the `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` environment variable (a fraction from 0.0 to 1.0 of total GPU RAM) and runs the test at each probe point. It bisects between "passes" and "OOM/fails" to find the boundary. After the search, it samples `nvidia-smi` to report peak VRAM, phase analysis, and marker recommendations.
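
The bisection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in `profile_pytest.py`: the function name `find_min_fraction`, the `passes` callback (standing in for "run the test with this fraction and report whether it passed"), and the default bounds are all hypothetical.

```python
from typing import Callable, Optional


def find_min_fraction(
    passes: Callable[[float], bool],
    lo: float = 0.05,
    hi: float = 0.95,
    tolerance: float = 0.05,
    max_bisections: int = 5,
) -> Optional[float]:
    """Bisect for the lowest memory fraction at which `passes` returns True.

    Returns None if the test fails even at the upper bound.
    """
    # Validation run: confirm the test passes at the largest fraction.
    if not passes(hi):
        return None
    best = hi  # lowest fraction known to pass
    for _ in range(max_bisections):
        # Stop once the pass/fail boundary is bracketed tightly enough.
        if best - lo <= tolerance:
            break
        mid = (lo + best) / 2
        if passes(mid):
            best = mid  # boundary is at or below mid
        else:
            lo = mid    # boundary is above mid
    return best
```

Each call to `passes` corresponds to one full test run, so the iteration cap matters more than the tolerance for total profiling time.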
**Requirement:** The test under profile **must** honor the `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` env var. For standalone tests that allocate CUDA memory directly, check `os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")` and cap your allocation accordingly — see `tests/utils/test_mock_gpu_alloc.py` for an example.
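
For a standalone test, honoring the override could look like the sketch below. It is not a copy of `tests/utils/test_mock_gpu_alloc.py`; the helper name `capped_alloc_bytes` and its parameters are assumptions made for illustration, and the caller is expected to supply the total GPU memory from whatever CUDA library the test already uses.

```python
import os
from typing import Mapping, Optional

ENV_VAR = "_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"


def capped_alloc_bytes(
    wanted_bytes: int,
    total_gpu_bytes: int,
    env: Optional[Mapping[str, str]] = None,
) -> int:
    """Return how many bytes to allocate, honoring the profiler override.

    If the env var is unset or empty, the requested size is returned unchanged.
    Otherwise the request is capped at (fraction * total GPU memory).
    """
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR)
    if not raw:
        return wanted_bytes
    frac = float(raw)
    if not 0.0 < frac <= 1.0:
        raise ValueError(f"{ENV_VAR} must be in (0.0, 1.0], got {raw!r}")
    return min(wanted_bytes, int(total_gpu_bytes * frac))
```

A test that needs more than the cap should then fail in the same way it would on a smaller GPU, which is what lets the profiler find the pass/fail boundary.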
### Engine-specific mapping
`_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` is a generic env var (float 0.0-1.0) that launch scripts translate to the engine-specific CLI flag:
| Engine | CLI flag | Launch script support |
|---------|----------------------------------|-----------------------|
| vLLM | `--gpu-memory-utilization` | Implemented in `agg.sh`, `disagg.sh`, etc. |
| SGLang | `--mem-fraction-static` | Not yet implemented (TODO) |
| TRT-LLM | `--free-gpu-memory-fraction` | Not yet implemented (has its own `DYN_TRTLLM_FREE_GPU_MEMORY_FRACTION`, TODO: unify) |
Scripts that already hard-code their own memory fraction (e.g. `agg_multimodal.sh` with 0.85) have a TODO to honor `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` in the future. If the profiler detects constant VRAM across all probes (meaning the env var is ignored), it prints a warning and skips marker recommendations.
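
The translation in the table above can be sketched in Python. This is illustrative only: `mem_fraction_args` and the engine keys are hypothetical names, and per the table only the vLLM path is actually wired up in this change; the SGLang and TRT-LLM entries show what the mapping would look like once implemented.

```python
import os
from typing import List, Mapping, Optional

ENV_VAR = "_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"

# Engine name -> CLI flag, taken from the table above.
ENGINE_FLAGS = {
    "vllm": "--gpu-memory-utilization",
    "sglang": "--mem-fraction-static",       # not yet implemented in launch scripts
    "trtllm": "--free-gpu-memory-fraction",  # not yet implemented in launch scripts
}


def mem_fraction_args(engine: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return extra CLI args for the engine, or [] when no override is set."""
    env = os.environ if env is None else env
    frac = env.get(ENV_VAR)
    if not frac:
        # No override: leave the engine's own default untouched.
        return []
    return [ENGINE_FLAGS[engine], frac]
```

Returning an empty list when the variable is unset has the same effect as the shell idiom `${VAR:+--flag "$VAR"}` used in the launch scripts: the flag is omitted entirely rather than passed with an empty value.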
### Usage
```bash
# Default mode: binary search for minimum VRAM (recommended)
# -xvs is optional: stop on first failure, verbose, show output
python tests/utils/profile_pytest.py tests/serve/test_vllm.py::test_serve_deployment[aggregated] -xvs
# Single-pass profiling (no binary search; measure a single run at the default memory settings)
python tests/utils/profile_pytest.py --no-find-min-vram tests/serve/test_vllm.py::test_serve_deployment[aggregated]
```
### Example output
```text
========================================================================
FIND MINIMUM VRAM (binary search)
========================================================================
GPU total : 48.0 GiB
GPU free : 48.0 GiB (in use: 0.0 GiB)
Test : tests/serve/test_vllm.py::test_serve_deployment[aggregated] -x
Range : 5% - 95% (tolerance 5%)
Max iter: 6 (1 validation + 5 bisections)
[probe 1/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.95 (45.6 GiB) [validation run]
[PASS] peak 18.5 GiB, wall 41s, iter took 49s
...
[probe 5/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.33 (15.9 GiB)
[FAIL] OOM or error at 33% (15.9 GiB), iter took 30s
[probe 6/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.36 (17.2 GiB) [~0 left, ETA ~0s]
[PASS] peak 18.5 GiB, wall 41s, iter took 49s
========================================================================
MINIMUM VRAM RESULT
========================================================================
Lowest passing utilization : 36%
Minimum VRAM needed : ~17.2 GiB (peak observed: 18.5 GiB, +10% safety: 20.4 GiB)
# test_serve_deployment[aggregated]: @pytest.mark.max_vram_gib(21)
# Fits on: L4 (24 GiB), V100-32GB (32 GiB), A6000/A40 (48 GiB), A100/H100 (80 GiB)
# Will OOM on: edge/embedded (4 GiB), RTX 3060/4060 (8 GiB), T4 (16 GiB)
========================================================================
========================================================================
Recommended markers to add to your pytest. You can copy-paste this:
========================================================================
# Measured using: tests/utils/profile_pytest.py tests/serve/test_vllm.py::test_serve_deployment[aggregated]
@pytest.mark.e2e # wall time 41.2s, loads a real model
@pytest.mark.gpu_1 # 1 GPU(s) used, peak 18.5 GiB
@pytest.mark.max_vram_gib(21) # peak 18.5 GiB GPU RAM used (+10% safety: 20.4 GiB)
@pytest.mark.timeout(124) # 3x observed 41.2s
WARNING: Wall time 41.2s is too slow for pre_merge (> 20s). Consider post_merge or nightly instead.
WARNING: Will OOM on edge/embedded (4 GiB).
WARNING: Will OOM on RTX 3060/4060 (8 GiB).
WARNING: Will OOM on T4 (16 GiB).
========================================================================
```
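
The "Fits on" and "Will OOM on" lines in the output above can be reproduced with a simple comparison against a tier table. The sketch below uses the tier names and sizes shown in the example output; the function name `classify_tiers` and the `>=` comparison against the marker value are assumptions, not the profiler's confirmed logic.

```python
from typing import List, Tuple

# GPU tiers and capacities as they appear in the example output.
GPU_TIERS_GIB: List[Tuple[str, int]] = [
    ("edge/embedded", 4),
    ("RTX 3060/4060", 8),
    ("T4", 16),
    ("L4", 24),
    ("V100-32GB", 32),
    ("A6000/A40", 48),
    ("A100/H100", 80),
]


def classify_tiers(required_gib: float) -> Tuple[List[str], List[str]]:
    """Split the tier table into (fits, will_oom) for a given requirement."""
    fits = [name for name, gib in GPU_TIERS_GIB if gib >= required_gib]
    will_oom = [name for name, gib in GPU_TIERS_GIB if gib < required_gib]
    return fits, will_oom
```

For a requirement of 21 GiB this yields the same split as the example: L4 and larger fit, while T4 and smaller would OOM.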
### How to use the recommendations
1. **Copy the `@pytest.mark.*` lines** into your test function or `pytestmark` list.
2. **VRAM marker** — `max_vram_gib(N)` records the peak GPU memory the test needs (with 10% safety margin). This marker does **not** skip tests on its own — if a test runs on a GPU that is too small, it will OOM and fail hard. Use `--max-vram-gib=N` to select only tests that fit on the available GPU (see [Filtering by VRAM](#filtering-by-vram) for examples). The WARNING lines in the profiler output tell you which GPU tiers would be too small (e.g., "Will OOM on T4 (16 GiB)").
3. **Lifecycle markers** — the profiler recommends `pre_merge` only for tests under 20 seconds. For slower tests, it warns you to consider `post_merge` or `nightly` but does not choose for you — use your judgment based on how critical the test is for catching regressions early.
4. **Timeout** — the recommended value is 3x the observed wall time. Adjust upward if your test has high variance (e.g., first-run model download, flaky network).
5. **Test type** (`unit`, `integration`, `e2e`) — inferred from wall time and whether a real model was loaded. Override if you know better (e.g., a fast test that uses a mock engine is `integration`, not `e2e`).
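
The arithmetic behind the VRAM and timeout recommendations can be checked by hand. The sketch below assumes a 10% safety margin and a 3x timeout factor as stated above, and assumes rounding up to a whole number, which is inferred from the example output (`max_vram_gib(21)`, `timeout(124)`). Other markers in this change use one decimal place (e.g. `20.4`), so the profiler's actual rounding rule may differ. Both function names are hypothetical.

```python
import math


def recommended_max_vram_gib(peak_gib: float, safety: float = 0.10) -> int:
    """Peak VRAM plus a safety margin, rounded up to whole GiB."""
    return math.ceil(peak_gib * (1.0 + safety))


def recommended_timeout_s(wall_s: float, factor: float = 3.0) -> int:
    """Observed wall time times a factor, rounded up to whole seconds."""
    return math.ceil(wall_s * factor)
```

For the example run, a peak of 18.5 GiB gives 20.35 GiB with margin and a marker value of 21; a wall time of 41.2 s gives 123.6 s and a timeout of 124.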
### Options
| Flag | Description |
|------|-------------|
| `--no-find-min-vram` | Skip binary search; run a single profiling pass instead |
| `--interval N` | GPU sampling interval in seconds (default: 1.0) |
| `--baseline-seconds N` | Seconds to sample before launching pytest (default: 3.0) |
| `--teardown-seconds N` | Seconds to sample after pytest exits (default: 5.0) |
| `--csv FILE` | Write raw nvidia-smi samples to a CSV file |
| `--no-recommend` | Suppress marker recommendations |
---
## References
- [pytest documentation](https://docs.pytest.org/en/stable/)
- [Bazel Test Encyclopedia — test sizes and timeouts](https://docs.bazel.build/versions/2.0.0/test-encyclopedia.html)
......
...@@ -42,6 +42,7 @@ def pytest_configure(config): ...@@ -42,6 +42,7 @@ def pytest_configure(config):
"gpu_2: marks tests to run on 2GPUs", "gpu_2: marks tests to run on 2GPUs",
"gpu_4: marks tests to run on 4GPUs", "gpu_4: marks tests to run on 4GPUs",
"gpu_8: marks tests to run on 8GPUs", "gpu_8: marks tests to run on 8GPUs",
"max_vram_gib(N): peak VRAM in GiB (with 10% safety). Filter with --max-vram-gib=N",
"e2e: marks tests as end-to-end tests", "e2e: marks tests as end-to-end tests",
"integration: marks tests as integration tests", "integration: marks tests as integration tests",
"unit: marks tests as unit tests", "unit: marks tests as unit tests",
...@@ -101,6 +102,12 @@ def pytest_addoption(parser: pytest.Parser) -> None: ...@@ -101,6 +102,12 @@ def pytest_addoption(parser: pytest.Parser) -> None:
help="Skip restarting NATS and etcd services before deployment. " help="Skip restarting NATS and etcd services before deployment. "
"Default: deploy tests skip (for speed), fault-tolerance tests restart (for clean state).", "Default: deploy tests skip (for speed), fault-tolerance tests restart (for clean state).",
) )
parser.addoption(
"--max-vram-gib",
type=float,
default=None,
help="Skip tests whose @pytest.mark.max_vram_gib(N) exceeds this value (GiB).",
)
LOG_FORMAT = "[TEST] %(asctime)s %(levelname)s %(name)s: %(message)s" LOG_FORMAT = "[TEST] %(asctime)s %(levelname)s %(name)s: %(message)s"
...@@ -293,6 +300,17 @@ def pytest_collection_modifyitems(config, items): ...@@ -293,6 +300,17 @@ def pytest_collection_modifyitems(config, items):
if _item_has_marker(item, marker_name): if _item_has_marker(item, marker_name):
item.add_marker(skip) item.add_marker(skip)
# Skip tests that exceed --max-vram-gib
vram_limit = config.getoption("--max-vram-gib", default=None)
if vram_limit is not None:
skip_vram = pytest.mark.skip(
reason=f"requires more than {vram_limit} GiB VRAM (--max-vram-gib={vram_limit})"
)
for item in items:
vram_mark = item.get_closest_marker("max_vram_gib")
if vram_mark and vram_mark.args and vram_mark.args[0] > vram_limit:
item.add_marker(skip_vram)
# Collect models via explicit pytest mark from final filtered items only # Collect models via explicit pytest mark from final filtered items only
models_to_download = set() models_to_download = set()
for item in items: for item in items:
...@@ -836,11 +854,17 @@ def dynamo_dynamic_ports(num_system_ports) -> Generator[ServicePorts, None, None ...@@ -836,11 +854,17 @@ def dynamo_dynamic_ports(num_system_ports) -> Generator[ServicePorts, None, None
- frontend_port: OpenAI-compatible HTTP/gRPC ingress (dynamo.frontend) - frontend_port: OpenAI-compatible HTTP/gRPC ingress (dynamo.frontend)
- system_ports: List of worker metrics/system ports (configurable count via num_system_ports) - system_ports: List of worker metrics/system ports (configurable count via num_system_ports)
- kv_event_port: ZMQ port for vLLM KV event publishing (avoids collisions under xdist)
""" """
frontend_port = allocate_port(DefaultPort.FRONTEND.value) frontend_port = allocate_port(DefaultPort.FRONTEND.value)
system_port_list = allocate_ports(num_system_ports, DefaultPort.SYSTEM1.value) system_port_list = allocate_ports(num_system_ports, DefaultPort.SYSTEM1.value)
all_ports = [frontend_port, *system_port_list] kv_event_port = allocate_port(DefaultPort.SYSTEM1.value)
all_ports = [frontend_port, *system_port_list, kv_event_port]
try: try:
yield ServicePorts(frontend_port=frontend_port, system_ports=system_port_list) yield ServicePorts(
frontend_port=frontend_port,
system_ports=system_port_list,
kv_event_port=kv_event_port,
)
finally: finally:
deallocate_ports(all_ports) deallocate_ports(all_ports)
...@@ -89,6 +89,8 @@ class VllmWorkerProcess(ManagedProcess): ...@@ -89,6 +89,8 @@ class VllmWorkerProcess(ManagedProcess):
"dynamo.vllm", "dynamo.vllm",
"--model", "--model",
TEST_MODEL, TEST_MODEL,
"--max-model-len",
"32768", # 32768 uses ~1.5 GiB (original default 131072 used ~6 GiB KV cache)
"--dyn-tool-call-parser", "--dyn-tool-call-parser",
"harmony", "harmony",
"--dyn-reasoning-parser", "--dyn-reasoning-parser",
...@@ -97,6 +99,10 @@ class VllmWorkerProcess(ManagedProcess): ...@@ -97,6 +99,10 @@ class VllmWorkerProcess(ManagedProcess):
"32768", "32768",
] ]
gpu_util = os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")
if gpu_util:
command.extend(["--gpu-memory-utilization", gpu_util])
env = os.environ.copy() env = os.environ.copy()
env["DYN_LOG"] = "debug" env["DYN_LOG"] = "debug"
env["DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS"] = '["generate"]' env["DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS"] = '["generate"]'
...@@ -222,7 +228,9 @@ def _validate_chat_response(response: requests.Response) -> Dict[str, Any]: ...@@ -222,7 +228,9 @@ def _validate_chat_response(response: requests.Response) -> Dict[str, Any]:
return response_json return response_json
@pytest.mark.timeout(300) # ~3x measured total (~70s/test), rounded up # Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning_effort
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.timeout(300) # 3x observed ~70s wall time, rounded up
@pytest.mark.post_merge @pytest.mark.post_merge
def test_reasoning_effort( def test_reasoning_effort(
request, start_services: ServicePorts, predownload_models request, start_services: ServicePorts, predownload_models
...@@ -288,7 +296,9 @@ def test_reasoning_effort( ...@@ -288,7 +296,9 @@ def test_reasoning_effort(
) )
@pytest.mark.timeout(180) # ~3x measured total (~50s/test), rounded up # Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.timeout(113) # 3x observed 37.4s wall time
@pytest.mark.post_merge @pytest.mark.post_merge
def test_tool_calling( def test_tool_calling(
request, start_services: ServicePorts, predownload_models request, start_services: ServicePorts, predownload_models
...@@ -330,7 +340,9 @@ def test_tool_calling( ...@@ -330,7 +340,9 @@ def test_tool_calling(
), "Expected get_current_weather tool to be called" ), "Expected get_current_weather tool to be called"
@pytest.mark.timeout(180) # ~3x measured total (~50s/test), rounded up # Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling_second_round
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.timeout(115) # 3x observed 38.1s wall time
@pytest.mark.nightly @pytest.mark.nightly
def test_tool_calling_second_round( def test_tool_calling_second_round(
request, start_services: ServicePorts, predownload_models request, start_services: ServicePorts, predownload_models
...@@ -394,7 +406,9 @@ def test_tool_calling_second_round( ...@@ -394,7 +406,9 @@ def test_tool_calling_second_round(
), "Expected response to include temperature information from tool call result (20°C)" ), "Expected response to include temperature information from tool call result (20°C)"
@pytest.mark.timeout(180) # ~3x measured total (~57s/test), rounded up # Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.timeout(131) # 3x observed 43.4s wall time
@pytest.mark.nightly @pytest.mark.nightly
def test_reasoning(request, start_services: ServicePorts, predownload_models) -> None: def test_reasoning(request, start_services: ServicePorts, predownload_models) -> None:
"""Test reasoning functionality with a mathematical problem.""" """Test reasoning functionality with a mathematical problem."""
......
...@@ -6,6 +6,7 @@ ...@@ -6,6 +6,7 @@
import dataclasses import dataclasses
import logging import logging
import os import os
import time
from collections.abc import Mapping from collections.abc import Mapping
from copy import deepcopy from copy import deepcopy
from typing import Any, Dict, Optional from typing import Any, Dict, Optional
...@@ -51,6 +52,16 @@ def run_serve_deployment( ...@@ -51,6 +52,16 @@ def run_serve_deployment(
if extra_env: if extra_env:
merged_env.update(extra_env) merged_env.update(extra_env)
# Stagger engine startup under xdist to avoid vLLM profiling race
# (vLLM bug #10643: concurrent profilers miscount each other's memory).
worker_id = os.environ.get("PYTEST_XDIST_WORKER", "")
if worker_id.startswith("gw"):
worker_num = int(worker_id.removeprefix("gw"))
if worker_num > 0:
stagger_s = worker_num * 15
logger.info("Staggering startup by %ds (xdist %s)", stagger_s, worker_id)
time.sleep(stagger_s)
if ports is not None: if ports is not None:
dynamic_frontend_port = int(ports.frontend_port) dynamic_frontend_port = int(ports.frontend_port)
dynamic_system_ports = [int(p) for p in ports.system_ports] dynamic_system_ports = [int(p) for p in ports.system_ports]
...@@ -76,6 +87,10 @@ def run_serve_deployment( ...@@ -76,6 +87,10 @@ def run_serve_deployment(
for idx, port in enumerate(dynamic_system_ports, start=1): for idx, port in enumerate(dynamic_system_ports, start=1):
merged_env[f"DYN_SYSTEM_PORT{idx}"] = str(port) merged_env[f"DYN_SYSTEM_PORT{idx}"] = str(port)
# Unique ZMQ port for vLLM KV event publishing (avoids xdist collisions).
if ports.kv_event_port:
merged_env["DYN_VLLM_KV_EVENT_PORT"] = str(ports.kv_event_port)
# Ensure EngineProcess health checks hit the correct frontend port. # Ensure EngineProcess health checks hit the correct frontend port.
config = dataclasses.replace(config, frontend_port=dynamic_frontend_port) config = dataclasses.replace(config, frontend_port=dynamic_frontend_port)
else: else:
......
...@@ -9,9 +9,10 @@ from pytest_httpserver import HTTPServer ...@@ -9,9 +9,10 @@ from pytest_httpserver import HTTPServer
from dynamo.common.utils.paths import WORKSPACE_DIR from dynamo.common.utils.paths import WORKSPACE_DIR
from tests.serve.lora_utils import MinioLoraConfig, MinioService from tests.serve.lora_utils import MinioLoraConfig, MinioService
from tests.utils.port_utils import allocate_port, deallocate_port
# Shared constants for multimodal testing # Shared constants for multimodal testing
IMAGE_SERVER_PORT = 8765 IMAGE_SERVER_PORT = allocate_port(8765)
MULTIMODAL_IMG_PATH = os.path.join( MULTIMODAL_IMG_PATH = os.path.join(
WORKSPACE_DIR, "lib/llm/tests/data/media/llm-optimize-deploy-graphic.png" WORKSPACE_DIR, "lib/llm/tests/data/media/llm-optimize-deploy-graphic.png"
) )
...@@ -42,7 +43,8 @@ def get_multimodal_test_image_bytes() -> bytes: ...@@ -42,7 +43,8 @@ def get_multimodal_test_image_bytes() -> bytes:
@pytest.fixture(scope="session") @pytest.fixture(scope="session")
def httpserver_listen_address(): def httpserver_listen_address():
return ("127.0.0.1", IMAGE_SERVER_PORT) yield ("127.0.0.1", IMAGE_SERVER_PORT)
deallocate_port(IMAGE_SERVER_PORT)
@pytest.fixture(scope="function") @pytest.fixture(scope="function")
...@@ -60,7 +62,7 @@ def image_server(httpserver: HTTPServer): ...@@ -60,7 +62,7 @@ def image_server(httpserver: HTTPServer):
Usage: Usage:
def test_multimodal(image_server): def test_multimodal(image_server):
url = "http://localhost:8765/llm-graphic.png" # Use MULTIMODAL_IMG_URL from this module
# ... use url in your test payload # ... use url in your test payload
""" """
image_data = get_multimodal_test_image_bytes() image_data = get_multimodal_test_image_bytes()
......
...@@ -12,6 +12,8 @@ trap 'echo "Cleaning up..."; kill 0' EXIT ...@@ -12,6 +12,8 @@ trap 'echo "Cleaning up..."; kill 0' EXIT
MODEL="${MODEL:-Qwen/Qwen3-0.6B}" MODEL="${MODEL:-Qwen/Qwen3-0.6B}"
GPU_MEM_FRACTION="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}"
echo "Starting Dynamo frontend..." echo "Starting Dynamo frontend..."
python3 -m dynamo.frontend & python3 -m dynamo.frontend &
...@@ -22,7 +24,8 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \ ...@@ -22,7 +24,8 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
--nnodes 2 \ --nnodes 2 \
--node-rank 0 \ --node-rank 0 \
--master-addr 127.0.0.1 \ --master-addr 127.0.0.1 \
--enforce-eager & --enforce-eager \
${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
echo "Starting dynamo.vllm headless worker (TP=2, nnodes=2, node-rank=1, GPU 1)..." echo "Starting dynamo.vllm headless worker (TP=2, nnodes=2, node-rank=1, GPU 1)..."
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
...@@ -32,6 +35,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \ ...@@ -32,6 +35,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
--node-rank 1 \ --node-rank 1 \
--master-addr 127.0.0.1 \ --master-addr 127.0.0.1 \
--enforce-eager \ --enforce-eager \
${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} \
--headless & --headless &
wait wait
...@@ -54,10 +54,10 @@ vllm_dir = os.environ.get("VLLM_DIR") or os.path.join( ...@@ -54,10 +54,10 @@ vllm_dir = os.environ.get("VLLM_DIR") or os.path.join(
# vLLM test configurations # vLLM test configurations
# NOTE: pytest.mark.gpu_1 tests take ~5.5 minutes total to run sequentially (with models pre-cached) # NOTE: pytest.mark.gpu_1 tests take ~5.5 minutes total to run sequentially (with models pre-cached)
# TODO: Now that these tests use dynamic ports, optimize the runtime by bin-packing and running # TODO: Now that these tests use dynamic ports and each config has a max_vram_gib marker,
# multiple engine deployments in parallel (while keeping GPU contention under control). This may # optimize the runtime by bin-packing multiple engine deployments in parallel on the same GPU.
# require annotating each config with approximate GPU RAM usage so a future collector/launcher can # A future collector/launcher can sum max_vram_gib values to decide how many tests fit
# bin-pack safely. # concurrently without exceeding available VRAM.
vllm_configs = { vllm_configs = {
"aggregated": VLLMConfig( "aggregated": VLLMConfig(
name="aggregated", name="aggregated",
...@@ -65,8 +65,9 @@ vllm_configs = { ...@@ -65,8 +65,9 @@ vllm_configs = {
script_name="agg.sh", script_name="agg.sh",
marks=[ marks=[
pytest.mark.gpu_1, pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.6), # observed peak 7.8 GiB (+10% safety)
pytest.mark.timeout(127), # 3x observed 42.2s wall time
pytest.mark.pre_merge, pytest.mark.pre_merge,
pytest.mark.timeout(300), # 3x measured time (43s) + download time (150s)
], ],
model="Qwen/Qwen3-0.6B", model="Qwen/Qwen3-0.6B",
request_payloads=[ request_payloads=[
...@@ -90,7 +91,12 @@ vllm_configs = { ...@@ -90,7 +91,12 @@ vllm_configs = {
name="aggregated_logprobs", name="aggregated_logprobs",
directory=vllm_dir, directory=vllm_dir,
script_name="agg.sh", script_name="agg.sh",
marks=[pytest.mark.gpu_1, pytest.mark.post_merge], marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.6), # observed peak 7.8 GiB (+10% safety)
pytest.mark.timeout(73), # 3x observed 24.3s wall time
pytest.mark.post_merge,
],
model="Qwen/Qwen3-0.6B", model="Qwen/Qwen3-0.6B",
request_payloads=[ request_payloads=[
chat_payload_with_logprobs( chat_payload_with_logprobs(
...@@ -116,8 +122,9 @@ vllm_configs = { ...@@ -116,8 +122,9 @@ vllm_configs = {
marks=[ marks=[
pytest.mark.lmcache, pytest.mark.lmcache,
pytest.mark.gpu_1, pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.4 GiB (+10% safety)
pytest.mark.timeout(147), # 3x observed 49.0s wall time
pytest.mark.pre_merge, pytest.mark.pre_merge,
pytest.mark.timeout(360), # 3x estimated time (70s) + download time (150s)
pytest.mark.skipif( pytest.mark.skipif(
_is_cuda13(), _is_cuda13(),
reason="lmcache does not support CUDA 13 as of v0.3.11", reason="lmcache does not support CUDA 13 as of v0.3.11",
...@@ -138,8 +145,9 @@ vllm_configs = { ...@@ -138,8 +145,9 @@ vllm_configs = {
marks=[ marks=[
pytest.mark.lmcache, pytest.mark.lmcache,
pytest.mark.gpu_1, pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.4 GiB (+10% safety)
pytest.mark.timeout(148), # 3x observed 49.3s wall time
pytest.mark.pre_merge, pytest.mark.pre_merge,
pytest.mark.timeout(360), # 3x estimated time (70s) + download time (150s)
pytest.mark.skipif( pytest.mark.skipif(
_is_cuda13(), _is_cuda13(),
reason="lmcache does not support CUDA 13 as of v0.3.11", reason="lmcache does not support CUDA 13 as of v0.3.11",
...@@ -162,8 +170,9 @@ vllm_configs = { ...@@ -162,8 +170,9 @@ vllm_configs = {
script_name="agg_request_planes.sh", script_name="agg_request_planes.sh",
marks=[ marks=[
pytest.mark.gpu_1, pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.3 GiB (+10% safety)
pytest.mark.timeout(129), # 3x observed 43.0s wall time
pytest.mark.pre_merge, pytest.mark.pre_merge,
pytest.mark.timeout(300), # 3x measured time (43s) + download time (150s)
], ],
model="Qwen/Qwen3-0.6B", model="Qwen/Qwen3-0.6B",
script_args=["--tcp"], script_args=["--tcp"],
...@@ -178,8 +187,9 @@ vllm_configs = { ...@@ -178,8 +187,9 @@ vllm_configs = {
script_name="agg_request_planes.sh", script_name="agg_request_planes.sh",
marks=[ marks=[
pytest.mark.gpu_1, pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.3 GiB (+10% safety)
pytest.mark.timeout(127), # 3x observed 42.3s wall time
pytest.mark.pre_merge, pytest.mark.pre_merge,
pytest.mark.timeout(300), # 3x measured time (43s) + download time (150s)
], ],
model="Qwen/Qwen3-0.6B", model="Qwen/Qwen3-0.6B",
script_args=["--http"], script_args=["--http"],
...@@ -196,7 +206,7 @@ vllm_configs = { ...@@ -196,7 +206,7 @@ vllm_configs = {
pytest.mark.gpu_2, pytest.mark.gpu_2,
pytest.mark.pre_merge, pytest.mark.pre_merge,
pytest.mark.skip(reason="DYN-2263"), pytest.mark.skip(reason="DYN-2263"),
], ], # TODO: profile to get max_vram and timeout
model="Qwen/Qwen3-0.6B", model="Qwen/Qwen3-0.6B",
request_payloads=[ request_payloads=[
chat_payload_default( chat_payload_default(
...@@ -219,7 +229,7 @@ vllm_configs = { ...@@ -219,7 +229,7 @@ vllm_configs = {
pytest.mark.gpu_2, pytest.mark.gpu_2,
pytest.mark.pre_merge, pytest.mark.pre_merge,
pytest.mark.skip(reason="DYN-2264"), pytest.mark.skip(reason="DYN-2264"),
], ], # TODO: profile to get max_vram and timeout
model="Qwen/Qwen3-0.6B", model="Qwen/Qwen3-0.6B",
request_payloads=[ request_payloads=[
# Test approximate KV routing (--no-kv-events mode) # Test approximate KV routing (--no-kv-events mode)
...@@ -250,7 +260,10 @@ vllm_configs = { ...@@ -250,7 +260,10 @@ vllm_configs = {
name="disaggregated", name="disaggregated",
directory=vllm_dir, directory=vllm_dir,
script_name="disagg.sh", script_name="disagg.sh",
marks=[pytest.mark.gpu_2, pytest.mark.pre_merge], marks=[
pytest.mark.gpu_2,
pytest.mark.pre_merge,
], # TODO: profile to get max_vram and timeout
model="Qwen/Qwen3-0.6B", model="Qwen/Qwen3-0.6B",
request_payloads=[ request_payloads=[
chat_payload_default(), chat_payload_default(),
...@@ -266,6 +279,7 @@ vllm_configs = { ...@@ -266,6 +279,7 @@ vllm_configs = {
pytest.mark.vllm, pytest.mark.vllm,
pytest.mark.h100, pytest.mark.h100,
pytest.mark.nightly, pytest.mark.nightly,
# TODO: profile to get max_vram and timeout
], ],
model="deepseek-ai/DeepSeek-V2-Lite", model="deepseek-ai/DeepSeek-V2-Lite",
script_args=[ script_args=[
...@@ -289,7 +303,12 @@ vllm_configs = { ...@@ -289,7 +303,12 @@ vllm_configs = {
name="multimodal_disagg_qwen3vl_2b_e_pd", name="multimodal_disagg_qwen3vl_2b_e_pd",
directory=vllm_dir, directory=vllm_dir,
script_name="disagg_multimodal_e_pd.sh", script_name="disagg_multimodal_e_pd.sh",
marks=[pytest.mark.gpu_1, pytest.mark.pre_merge], marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(24.6), # observed peak 22.3 GiB (+10% safety)
pytest.mark.timeout(206), # 3x observed 68.4s wall time
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-VL-2B-Instruct", model="Qwen/Qwen3-VL-2B-Instruct",
script_args=["--model", "Qwen/Qwen3-VL-2B-Instruct", "--single-gpu"], script_args=["--model", "Qwen/Qwen3-VL-2B-Instruct", "--single-gpu"],
request_payloads=[ request_payloads=[
...@@ -318,7 +337,12 @@ vllm_configs = { ...@@ -318,7 +337,12 @@ vllm_configs = {
directory=vllm_dir, directory=vllm_dir,
script_name="agg_multimodal.sh", script_name="agg_multimodal.sh",
# post_merge because needs real NIXL not stub # post_merge because needs real NIXL not stub
marks=[pytest.mark.gpu_1, pytest.mark.post_merge], marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(10.2), # observed peak 9.3 GiB (+10% safety)
pytest.mark.timeout(131), # 3x observed 43.7s wall time
pytest.mark.post_merge,
],
model="Qwen/Qwen2-VL-2B-Instruct", model="Qwen/Qwen2-VL-2B-Instruct",
# Pass --frontend-decoding to enable Rust frontend image decoding + NIXL RDMA transfer # Pass --frontend-decoding to enable Rust frontend image decoding + NIXL RDMA transfer
script_args=[ script_args=[
...@@ -345,13 +369,20 @@ vllm_configs = { ...@@ -345,13 +369,20 @@ vllm_configs = {
) )
], ],
), ),
# NOTE: Pack all workers on 1 GPU for lower CI resource requirements # NOTE: Pack all workers on 1 GPU for lower CI resource requirements.
# NOTE: disagg_multimodal_epd.sh uses --kv-cache-memory-bytes=512MB for P/D
# workers. Per vLLM CacheConfig, kv_cache_memory_bytes (when not-None) ignores
# gpu_memory_utilization (ref: https://docs.vllm.ai/en/stable/api/vllm/config/cache/),
# so _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE has no effect. Regardless of GPU_MEM
# fractions (0.1/0.4/0.4), the 3 workers combined consistently use ~17.6 GiB
# total on this GPU.
"multimodal_disagg_qwen3vl_2b_epd": VLLMConfig( "multimodal_disagg_qwen3vl_2b_epd": VLLMConfig(
name="multimodal_disagg_qwen3vl_2b_epd", name="multimodal_disagg_qwen3vl_2b_epd",
directory=vllm_dir, directory=vllm_dir,
script_name="disagg_multimodal_epd.sh", script_name="disagg_multimodal_epd.sh",
marks=[ marks=[
pytest.mark.gpu_1, pytest.mark.gpu_1,
pytest.mark.max_vram_gib(19.4), # observed peak 17.6 GiB (+10% safety)
pytest.mark.post_merge, pytest.mark.post_merge,
pytest.mark.skip(reason="DYN-2265"), pytest.mark.skip(reason="DYN-2265"),
], ],
...@@ -389,7 +420,12 @@ vllm_configs = { ...@@ -389,7 +420,12 @@ vllm_configs = {
name="multimodal_agg_qwen", name="multimodal_agg_qwen",
directory=vllm_dir, directory=vllm_dir,
script_name="agg_multimodal.sh", script_name="agg_multimodal.sh",
marks=[pytest.mark.gpu_1, pytest.mark.post_merge], marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(21.6), # observed peak 19.6 GiB (+10% safety)
pytest.mark.timeout(150), # 3x observed 50.0s wall time
pytest.mark.post_merge,
],
model="Qwen/Qwen2.5-VL-7B-Instruct", model="Qwen/Qwen2.5-VL-7B-Instruct",
script_args=["--model", "Qwen/Qwen2.5-VL-7B-Instruct"], script_args=["--model", "Qwen/Qwen2.5-VL-7B-Instruct"],
delayed_start=0, delayed_start=0,
...@@ -418,6 +454,8 @@ vllm_configs = { ...@@ -418,6 +454,8 @@ vllm_configs = {
script_name="agg_multimodal.sh", script_name="agg_multimodal.sh",
marks=[ marks=[
pytest.mark.gpu_1, pytest.mark.gpu_1,
pytest.mark.max_vram_gib(18.9), # observed peak 17.1 GiB (+10% safety)
pytest.mark.timeout(128), # 3x observed 42.7s wall time
pytest.mark.nightly, pytest.mark.nightly,
# https://github.com/ai-dynamo/dynamo/issues/4501 # https://github.com/ai-dynamo/dynamo/issues/4501
pytest.mark.xfail(strict=False), pytest.mark.xfail(strict=False),
...@@ -456,7 +494,10 @@ vllm_configs = { ...@@ -456,7 +494,10 @@ vllm_configs = {
name="multimodal_video_agg", name="multimodal_video_agg",
directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"), directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"),
script_name="video_agg.sh", script_name="video_agg.sh",
marks=[pytest.mark.gpu_2, pytest.mark.nightly], marks=[
pytest.mark.gpu_2,
pytest.mark.nightly,
], # TODO: profile to get max_vram and timeout
model="llava-hf/LLaVA-NeXT-Video-7B-hf", model="llava-hf/LLaVA-NeXT-Video-7B-hf",
delayed_start=60, # Video models require longer loading time delayed_start=60, # Video models require longer loading time
script_args=["--model", "llava-hf/LLaVA-NeXT-Video-7B-hf"], script_args=["--model", "llava-hf/LLaVA-NeXT-Video-7B-hf"],
...@@ -483,7 +524,10 @@ vllm_configs = { ...@@ -483,7 +524,10 @@ vllm_configs = {
name="multimodal_video_disagg", name="multimodal_video_disagg",
directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"), directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"),
script_name="video_disagg.sh", script_name="video_disagg.sh",
marks=[pytest.mark.gpu_2, pytest.mark.nightly], marks=[
pytest.mark.gpu_2,
pytest.mark.nightly,
], # TODO: profile to get max_vram and timeout
model="llava-hf/LLaVA-NeXT-Video-7B-hf", model="llava-hf/LLaVA-NeXT-Video-7B-hf",
delayed_start=60, # Video models require longer loading time delayed_start=60, # Video models require longer loading time
script_args=["--model", "llava-hf/LLaVA-NeXT-Video-7B-hf"], script_args=["--model", "llava-hf/LLaVA-NeXT-Video-7B-hf"],
...@@ -512,7 +556,10 @@ vllm_configs = { ...@@ -512,7 +556,10 @@ vllm_configs = {
name="multimodal_audio_agg", name="multimodal_audio_agg",
directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"), directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"),
script_name="audio_agg.sh", script_name="audio_agg.sh",
marks=[pytest.mark.gpu_2, pytest.mark.nightly], marks=[
pytest.mark.gpu_2,
pytest.mark.nightly,
], # TODO: profile to get max_vram and timeout
model="Qwen/Qwen2-Audio-7B-Instruct", model="Qwen/Qwen2-Audio-7B-Instruct",
delayed_start=60, # Audio models require longer loading time delayed_start=60, # Audio models require longer loading time
script_args=["--model", "Qwen/Qwen2-Audio-7B-Instruct"], script_args=["--model", "Qwen/Qwen2-Audio-7B-Instruct"],
...@@ -539,7 +586,10 @@ vllm_configs = { ...@@ -539,7 +586,10 @@ vllm_configs = {
name="multimodal_audio_disagg", name="multimodal_audio_disagg",
directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"), directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"),
script_name="audio_disagg.sh", script_name="audio_disagg.sh",
marks=[pytest.mark.gpu_2, pytest.mark.nightly], marks=[
pytest.mark.gpu_2,
pytest.mark.nightly,
], # TODO: profile to get max_vram and timeout
model="Qwen/Qwen2-Audio-7B-Instruct", model="Qwen/Qwen2-Audio-7B-Instruct",
delayed_start=60, # Audio models require longer loading time delayed_start=60, # Audio models require longer loading time
script_args=["--model", "Qwen/Qwen2-Audio-7B-Instruct"], script_args=["--model", "Qwen/Qwen2-Audio-7B-Instruct"],
...@@ -566,7 +616,11 @@ vllm_configs = { ...@@ -566,7 +616,11 @@ vllm_configs = {
name="aggregated_toolcalling", name="aggregated_toolcalling",
directory=vllm_dir, directory=vllm_dir,
script_name="agg_multimodal.sh", script_name="agg_multimodal.sh",
marks=[pytest.mark.gpu_2, pytest.mark.multimodal, pytest.mark.nightly], marks=[
pytest.mark.gpu_2,
pytest.mark.multimodal,
pytest.mark.nightly,
], # TODO: profile to get max_vram and timeout
model="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8", model="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8",
script_args=[ script_args=[
"--model", "--model",
...@@ -646,10 +700,9 @@ vllm_configs = { ...@@ -646,10 +700,9 @@ vllm_configs = {
script_name="agg.sh", script_name="agg.sh",
marks=[ marks=[
pytest.mark.gpu_1, pytest.mark.gpu_1,
pytest.mark.max_vram_gib(21.9), # observed peak 19.9 GiB (+10% safety)
pytest.mark.timeout(233), # 3x observed 77.7s wall time
pytest.mark.post_merge, pytest.mark.post_merge,
pytest.mark.timeout(
420
), # 3x estimated time (60s) + download time (240s) for 7B model
], ],
model="deepseek-ai/deepseek-llm-7b-base", model="deepseek-ai/deepseek-llm-7b-base",
script_args=[ script_args=[
...@@ -669,6 +722,7 @@ vllm_configs = { ...@@ -669,6 +722,7 @@ vllm_configs = {
marks=[ marks=[
pytest.mark.gpu_2, pytest.mark.gpu_2,
pytest.mark.pre_merge, pytest.mark.pre_merge,
# TODO: profile to get max_vram
pytest.mark.timeout(300), pytest.mark.timeout(300),
], ],
model="Qwen/Qwen3-0.6B", model="Qwen/Qwen3-0.6B",
...@@ -681,7 +735,12 @@ vllm_configs = { ...@@ -681,7 +735,12 @@ vllm_configs = {
name="guided_decoding", name="guided_decoding",
directory=vllm_dir, directory=vllm_dir,
script_name="agg.sh", script_name="agg.sh",
marks=[pytest.mark.gpu_1, pytest.mark.pre_merge], marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.6), # observed peak 7.8 GiB (+10% safety)
pytest.mark.timeout(67), # 3x observed 22.3s wall time
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-0.6B", model="Qwen/Qwen3-0.6B",
request_payloads=[ request_payloads=[
chat_payload( chat_payload(
......
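The TODO at the top of this file describes a future collector that sums `max_vram_gib` values to decide which tests can share a GPU. One possible shape for that step is sketched below using a first-fit-decreasing heuristic. This is an assumption about how such a launcher might work, not code from the PR; `pack_by_vram` and the 48 GiB GPU size are hypothetical, and the budgets are copied from the markers above.

```python
def pack_by_vram(tests, gpu_vram_gib):
    """First-fit decreasing: group tests so each group's VRAM budgets fit one GPU."""
    batches = []  # each batch: {"free": remaining GiB, "tests": [names]}
    # Place the largest budgets first; sorted() is stable, so ties keep input order.
    for name, need in sorted(tests.items(), key=lambda kv: kv[1], reverse=True):
        if need > gpu_vram_gib:
            raise ValueError(f"{name} needs {need} GiB, more than the GPU has")
        for batch in batches:
            if need <= batch["free"]:
                batch["free"] -= need
                batch["tests"].append(name)
                break
        else:
            # No existing batch has room, so open a new one.
            batches.append({"free": gpu_vram_gib - need, "tests": [name]})
    return [b["tests"] for b in batches]


budgets = {
    "aggregated": 8.6,
    "aggregated_logprobs": 8.6,
    "guided_decoding": 8.6,
    "multimodal_agg_qwen": 21.6,
    "multimodal_disagg_qwen3vl_2b_e_pd": 24.6,
}
for batch in pack_by_vram(budgets, gpu_vram_gib=48.0):
    print(batch)
```

Summing static budgets is only safe if each test really stays under its marker. Tests that size their allocation as a fraction of total GPU memory would also need that fraction adjusted when they share a device.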
...@@ -187,6 +187,9 @@ class EngineProcess(ManagedProcess): ...@@ -187,6 +187,9 @@ class EngineProcess(ManagedProcess):
), ),
], ],
delayed_start=config.delayed_start, delayed_start=config.delayed_start,
# Must stay False: command[0] is "bash", so True would kill every
# bash process system-wide. Stale cleanup relies on stragglers list
# and process-group termination in __exit__ instead.
terminate_all_matching_process_names=False, terminate_all_matching_process_names=False,
stragglers=config.stragglers, stragglers=config.stragglers,
log_dir=request.node.name, log_dir=request.node.name,
......
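The new comment says cleanup relies on process-group termination rather than name matching, because the command is launched through `bash`. A minimal sketch of that technique with the standard library is shown here. It is an illustration under assumptions, not the `ManagedProcess.__exit__` implementation, which is not part of this diff; `run_in_own_group` and `terminate_group` are hypothetical names, and the code is POSIX-only.

```python
import os
import signal
import subprocess


def run_in_own_group(command):
    """Start command as the leader of a new session (and so a new process group)."""
    return subprocess.Popen(command, start_new_session=True)


def terminate_group(proc, grace_seconds=5.0):
    """SIGTERM every process in proc's group, then SIGKILL after the grace period."""
    pgid = proc.pid  # a session leader's process-group id equals its pid
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        return  # the whole group is already gone
    try:
        proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        try:
            os.killpg(pgid, signal.SIGKILL)
        except ProcessLookupError:
            pass
        proc.wait()
```

Because the signal targets one group id, only the launcher and its descendants are affected, and unrelated `bash` processes on the host are left alone.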
...@@ -38,6 +38,7 @@ class ServicePorts: ...@@ -38,6 +38,7 @@ class ServicePorts:
frontend_port: int frontend_port: int
system_ports: list[int] system_ports: list[int]
kv_event_port: int = 0
def _load_port_registry() -> dict: def _load_port_registry() -> dict:
......
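The profiler file that follows documents a bisection over `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` between 0.05 and 0.95. The core loop can be sketched as below. This is a simplified illustration, not the PR's `_find_min_vram`: `passes` is a hypothetical callable standing in for "run the test at this fraction and report success", and the `tolerance` stopping rule is an assumption.

```python
def find_min_fraction(passes, lo=0.05, hi=0.95, tolerance=0.05):
    """Bisect for the smallest memory fraction at which passes(fraction) is True.

    Returns None when the test fails even at the upper bound.
    """
    if not passes(hi):
        return None
    best = hi
    while hi - lo > tolerance:
        mid = (lo + hi) / 2
        if passes(mid):
            best = hi = mid  # passed: try a smaller budget next
        else:
            lo = mid  # failed (for example OOM): the test needs more memory
    return best


# Stand-in for a test that needs at least 30% of the GPU.
print(find_min_fraction(lambda fraction: fraction >= 0.30))
```

The search is only meaningful if passing is monotonic in the fraction, which is why the test being profiled has to honor the override variable.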
#!/usr/bin/env python3
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
"""Profile GPU VRAM usage during a pytest run.

How it works
~~~~~~~~~~~~
A background thread queries NVML (via ``pynvml``) every 100 ms (configurable
with ``--interval``) to record GPU memory usage while the test runs as a
subprocess. This captures *all* GPU memory (model weights, KV cache, CUDA
contexts, NCCL buffers), not just PyTorch allocations, without requiring any
in-process instrumentation. Using NVML directly (the same C library that
``nvidia-smi`` wraps) avoids the overhead of forking a subprocess each sample
and allows high-frequency sampling.

In **binary-search mode** (the default), the profiler sets the env var
``_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE`` to a value between 0.05 and 0.95 and
re-runs the test at each midpoint. If the test passes, the fraction is lowered;
if it OOMs, the fraction is raised. This is standard bisection to find the
minimum VRAM the test needs. The peak ``memory.used`` from the last passing run
(plus a 10 % safety margin) becomes the ``@pytest.mark.max_vram_gib`` recommendation.

**IMPORTANT**: The test being profiled **MUST** honor ``_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE``,
either directly (see ``test_mock_gpu_alloc.py``) or via launch scripts that
pass it as ``--gpu-memory-utilization`` to vLLM (e.g. ``agg.sh``). If the test
ignores this variable, every probe will pass at the same peak and the profiler
will warn that the binary search is unreliable.

Usage::

    python tests/utils/profile_pytest.py [options] pytest-args...

Examples (``-xvs`` is optional: stop on first failure, verbose, no capture)::

    python tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling
    python tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning_effort -xvs

Single-pass profiling (no binary search, just measure one run with the test's
default memory settings)::

    python tests/utils/profile_pytest.py --no-find-min-vram tests/frontend/test_vllm.py::test_tool_calling

The report is written to stdout after the test finishes.
The raw CSV samples are saved to the path given by ``--csv``, if specified.
Use ``--no-recommend`` to suppress the marker recommendation section.
"""
import argparse
import atexit
import json
import logging
import math
import os
import shutil
import subprocess
import sys
import tempfile
import threading
import time
from dataclasses import dataclass, field
import pynvml
logger = logging.getLogger(__name__)
# Safety margin for the max_vram_gib recommendation. Peak VRAM is multiplied
# by this factor before it is reported and compared against the reference GPU
# sizes, so the recommendation has headroom for variance across runs.
_VRAM_SAFETY_FACTOR = 1.1
# Phase detection: a memory jump exceeding this threshold (MiB) between
# consecutive samples marks a phase boundary.
_PHASE_JUMP_MIB = 200
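# Plateau detection: memory counts as stable when consecutive samples stay
# within _PLATEAU_TOLERANCE_MIB for at least _PLATEAU_MIN_SAMPLES samples.
# The tolerance also serves as the noise floor for the "GPU was used" and
# leak checks.
_PLATEAU_TOLERANCE_MIB = 50
_PLATEAU_MIN_SAMPLES = 3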
def _extract_model_from_markers(pytest_args: list[str]) -> str | None:
    """Extract the model name from @pytest.mark.model(...) via pytest-json-report.

    Runs ``pytest --collect-only`` with the json-report plugin to inspect markers
    without executing the test. Returns None if the plugin is missing or the
    test has no ``model`` marker.
    """
    fd, json_path = tempfile.mkstemp(prefix="_profile_collect_", suffix=".json")
    os.close(fd)
    try:
        result = subprocess.run(
            [
                sys.executable,
                "-m",
                "pytest",
                "--collect-only",
                "-q",
                "--rootdir=.",
                "--override-ini=testpaths=tests",
                f"--json-report-file={json_path}",
            ]
            + list(pytest_args),
            capture_output=True,
            text=True,
            timeout=30,
        )
        if result.returncode not in (0, 5):
            return None
        with open(json_path) as f:
            data = json.load(f)
        for collector in data.get("collectors", []):
            for marker in collector.get("markers", []):
                if marker.get("name") == "model" and marker.get("args"):
                    return marker["args"][0]
        for test in data.get("tests", []):
            for marker in test.get("markers", []):
                if marker.get("name") == "model" and marker.get("args"):
                    return marker["args"][0]
    except (subprocess.SubprocessError, OSError, json.JSONDecodeError, KeyError) as exc:
        logger.warning("model marker extraction failed: %s", exc)
        return None
    finally:
        try:
            os.remove(json_path)
        except OSError:
            pass
    return None
@dataclass
class GpuSample:
    timestamp: float  # time.monotonic() offset from start
    gpu_idx: int
    mem_used_mib: int
    mem_total_mib: int
    gpu_util_pct: int


@dataclass
class PhaseInfo:
    name: str
    start_sec: float
    end_sec: float
    mem_start_mib: int
    mem_peak_mib: int
    mem_end_mib: int
    description: str = ""


@dataclass
class GpuReport:
    gpu_idx: int
    mem_total_mib: int
    baseline_mib: int
    peak_mib: int
    peak_timestamp: float
    final_mib: int
    leaked_mib: int  # final - baseline
    phases: list[PhaseInfo] = field(default_factory=list)
_nvml_initialized = False
_nvml_handles: list = []


def _nvml_init() -> None:
    """Lazily initialize NVML and cache device handles."""
    global _nvml_initialized, _nvml_handles
    if _nvml_initialized:
        return
    pynvml.nvmlInit()
    _nvml_initialized = True
    count = pynvml.nvmlDeviceGetCount()
    _nvml_handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]
    atexit.register(_nvml_shutdown)


def _nvml_shutdown() -> None:
    global _nvml_initialized, _nvml_handles
    if _nvml_initialized:
        _nvml_handles = []
        pynvml.nvmlShutdown()
        _nvml_initialized = False


def _query_gpu_stats() -> list[tuple[int, int, int, int]]:
    """Return [(gpu_idx, mem_used_mib, mem_total_mib, util_pct), ...] via NVML."""
    _nvml_init()
    results = []
    for idx, handle in enumerate(_nvml_handles):
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        used_mib = int(mem.used) // (1024 * 1024)
        total_mib = int(mem.total) // (1024 * 1024)
        results.append((idx, used_mib, total_mib, int(util.gpu)))
    return results
class _Sampler:
    """Background thread that queries NVML at a fixed interval."""

    def __init__(self, interval: float = 0.1):
        self.interval = interval
        self.samples: list[GpuSample] = []
        self._stop = threading.Event()
        self._t0 = time.monotonic()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._t0 = time.monotonic()
        self._thread.start()

    def stop(self):
        self._stop.set()
        self._thread.join(timeout=self.interval * 3)

    def _run(self):
        while not self._stop.is_set():
            ts = time.monotonic() - self._t0
            try:
                for gpu_idx, mem_used, mem_total, util_pct in _query_gpu_stats():
                    self.samples.append(
                        GpuSample(ts, gpu_idx, mem_used, mem_total, util_pct)
                    )
            except pynvml.NVMLError:
                pass  # transient NVML error; skip this sample
            self._stop.wait(self.interval)
def _detect_phases(
    samples: list[GpuSample], baseline_end: float, test_end: float
) -> list[PhaseInfo]:
    """Heuristic phase detection from a single GPU's memory timeline.

    Looks for large jumps (model load, KV cache alloc) and identifies
    the inference peak and teardown regions.
    """
    if not samples:
        return []
    phases: list[PhaseInfo] = []
    baseline_samples = [s for s in samples if s.timestamp <= baseline_end]
    test_samples = [s for s in samples if baseline_end < s.timestamp <= test_end]
    teardown_samples = [s for s in samples if s.timestamp > test_end]
    if baseline_samples:
        bl = baseline_samples[-1].mem_used_mib
        phases.append(
            PhaseInfo(
                name="Baseline",
                start_sec=samples[0].timestamp,
                end_sec=baseline_end,
                mem_start_mib=baseline_samples[0].mem_used_mib,
                mem_peak_mib=max(s.mem_used_mib for s in baseline_samples),
                mem_end_mib=bl,
                description="Idle GPU before test starts",
            )
        )
    if not test_samples:
        return phases
    # Walk test samples and detect jumps
    prev_mem = baseline_samples[-1].mem_used_mib if baseline_samples else 0
    phase_start = test_samples[0].timestamp
    phase_start_mem = prev_mem
    phase_peak = prev_mem
    jump_count = 0
    phase_names = ["Model load", "KV cache alloc", "Inference"]
    for s in test_samples:
        delta = s.mem_used_mib - prev_mem
        phase_peak = max(phase_peak, s.mem_used_mib)
        if delta > _PHASE_JUMP_MIB and jump_count < len(phase_names) - 1:
            # Close current phase, start new one
            if phase_start < s.timestamp:
                name = phase_names[min(jump_count, len(phase_names) - 1)]
                phases.append(
                    PhaseInfo(
                        name=name,
                        start_sec=phase_start,
                        end_sec=s.timestamp,
                        mem_start_mib=phase_start_mem,
                        mem_peak_mib=phase_peak,
                        mem_end_mib=prev_mem,
                    )
                )
            jump_count += 1
            phase_start = s.timestamp
            phase_start_mem = s.mem_used_mib
            phase_peak = s.mem_used_mib
        prev_mem = s.mem_used_mib
    # Close final test phase
    name = phase_names[min(jump_count, len(phase_names) - 1)]
    phases.append(
        PhaseInfo(
            name=name,
            start_sec=phase_start,
            end_sec=test_end,
            mem_start_mib=phase_start_mem,
            mem_peak_mib=phase_peak,
            mem_end_mib=test_samples[-1].mem_used_mib,
        )
    )
    if teardown_samples:
        phases.append(
            PhaseInfo(
                name="Teardown",
                start_sec=test_end,
                end_sec=teardown_samples[-1].timestamp,
                mem_start_mib=teardown_samples[0].mem_used_mib,
                mem_peak_mib=max(s.mem_used_mib for s in teardown_samples),
                mem_end_mib=teardown_samples[-1].mem_used_mib,
                description="After pytest exits; should return to baseline",
            )
        )
    return phases
def _build_reports(
    samples: list[GpuSample], baseline_end: float, test_end: float
) -> list[GpuReport]:
    """Build per-GPU reports from collected samples."""
    gpu_indices = sorted({s.gpu_idx for s in samples})
    reports = []
    for idx in gpu_indices:
        gpu_samples = [s for s in samples if s.gpu_idx == idx]
        if not gpu_samples:
            continue
        baseline_samples = [s for s in gpu_samples if s.timestamp <= baseline_end]
        baseline_mib = baseline_samples[-1].mem_used_mib if baseline_samples else 0
        peak_sample = max(gpu_samples, key=lambda s: s.mem_used_mib)
        final_mib = gpu_samples[-1].mem_used_mib
        reports.append(
            GpuReport(
                gpu_idx=idx,
                mem_total_mib=gpu_samples[0].mem_total_mib,
                baseline_mib=baseline_mib,
                peak_mib=peak_sample.mem_used_mib,
                peak_timestamp=peak_sample.timestamp,
                final_mib=final_mib,
                leaked_mib=final_mib - baseline_mib,
                phases=_detect_phases(gpu_samples, baseline_end, test_end),
            )
        )
    return reports
def _format_mib(mib: int) -> str:
    # Compare the magnitude so large negative deltas also render in GiB.
    if abs(mib) >= 1024:
        return f"{mib / 1024:.1f} GiB"
    return f"{mib} MiB"
def _print_report(
    reports: list[GpuReport],
    pytest_rc: int,
    wall_secs: float,
    model_name: str | None = None,
):
    """Print a human-readable profiling report."""
    print("\n--- GPU MEMORY PROFILE ---")
    print(f" pytest exit code : {pytest_rc}")
    print(f" wall time : {wall_secs:.1f}s")
    print(f" GPUs sampled : {len(reports)}")
    if model_name:
        print(f" model : {model_name}")
    for r in reports:
        print(f"\n{'─' * 72}")
        print(f" GPU {r.gpu_idx} ({_format_mib(r.mem_total_mib)} total)")
        print(f"{'─' * 72}")
        print(f" Baseline : {_format_mib(r.baseline_mib)}")
        print(
            f" Peak : {_format_mib(r.peak_mib)} "
            f"({r.peak_mib * 100 // r.mem_total_mib}% of total) "
            f"@ t={r.peak_timestamp:.1f}s"
        )
        print(f" Final : {_format_mib(r.final_mib)}")
        delta = r.leaked_mib
        tag = "OK" if abs(delta) < _PLATEAU_TOLERANCE_MIB else "LEAKED"
        sign = "+" if delta > 0 else ""
        print(f" Delta (final-bl) : {sign}{_format_mib(delta)} [{tag}]")
        if r.phases:
            print()
            print(
                f" {'Phase':<16} {'Time':>12} {'Start':>10} {'Peak':>10} {'End':>10}"
            )
            print(f" {'─' * 16} {'─' * 12} {'─' * 10} {'─' * 10} {'─' * 10}")
            for p in r.phases:
                dur = p.end_sec - p.start_sec
                time_range = (
                    f"{p.start_sec:.0f}s-{p.end_sec:.0f}s"
                    if dur > 0
                    else f"{p.start_sec:.0f}s"
                )
                print(
                    f" {p.name:<16} {time_range:>12} "
                    f"{_format_mib(p.mem_start_mib):>10} "
                    f"{_format_mib(p.mem_peak_mib):>10} "
                    f"{_format_mib(p.mem_end_mib):>10}"
                )
    print()
def _write_csv(samples: list[GpuSample], path: str):
    with open(path, "w") as f:
        f.write("timestamp_s,gpu,mem_used_mib,mem_total_mib,gpu_util_pct\n")
        for s in samples:
            f.write(
                f"{s.timestamp:.2f},{s.gpu_idx},{s.mem_used_mib},"
                f"{s.mem_total_mib},{s.gpu_util_pct}\n"
            )


_GPU_REFERENCE_CARDS: list[tuple[int, str]] = [
    (4, "edge/embedded"),
    (8, "RTX 3060/4060"),
    (16, "T4"),
    (24, "L4"),
    (32, "V100-32GB"),
    (48, "A6000/A40"),
    (80, "A100/H100"),
]


@dataclass
class MarkerRecommendation:
    marker: str
    reason: str
def _recommend_markers(
    reports: list[GpuReport],
    wall_secs: float,
    model_name: str | None = None,
    num_runs: int = 1,
) -> tuple[list[MarkerRecommendation], list[str]]:
    """Generate marker recommendations from profiling data.

    Returns (recommendations, warnings).
    """
    recs: list[MarkerRecommendation] = []
    warnings: list[str] = []
    if model_name:
        recs.append(
            MarkerRecommendation(
                f'model("{model_name}")',
                "detected from the test's model marker",
            )
        )
    max_peak_mib = max((r.peak_mib for r in reports), default=0)
    # Largest per-GPU rise over that GPU's own baseline. Subtracting the maxima
    # of two different GPUs would understate usage when another GPU is busy.
    used_vram = max((r.peak_mib - r.baseline_mib for r in reports), default=0)
    gpus_with_vram = sum(
        1 for r in reports if (r.peak_mib - r.baseline_mib) > _PLATEAU_TOLERANCE_MIB
    )
    has_model_load = any(
        p.name == "Model load"
        for r in reports
        for p in r.phases
        if p.mem_peak_mib - p.mem_start_mib > _PHASE_JUMP_MIB
    )
    any_leaked = any(abs(r.leaked_mib) >= _PLATEAU_TOLERANCE_MIB for r in reports)
    # -- Test Type --
    if wall_secs < 1.0 and used_vram < _PLATEAU_TOLERANCE_MIB:
        recs.append(
            MarkerRecommendation("unit", f"wall time {wall_secs:.1f}s, no GPU usage")
        )
    elif wall_secs < 30.0 and not has_model_load:
        recs.append(
            MarkerRecommendation(
                "integration", f"wall time {wall_secs:.1f}s, no model load detected"
            )
        )
    else:
        reason = f"wall time avg {wall_secs:.1f}s based on {num_runs} run{'s' if num_runs != 1 else ''}"
        if has_model_load:
            reason += ", loads a real model"
        recs.append(MarkerRecommendation("e2e", reason))
    # -- Lifecycle --
    if wall_secs < 20.0:
        recs.append(
            MarkerRecommendation(
                "pre_merge", f"wall time {wall_secs:.1f}s (< 20s, fast enough per PR)"
            )
        )
    elif wall_secs < 300.0:
        warnings.append(
            f"Wall time {wall_secs:.1f}s is too slow for pre_merge (> 20s). "
            f"Consider post_merge or nightly instead."
        )
    else:
        warnings.append(
            f"Wall time {wall_secs:.1f}s is very slow (> 300s). "
            f"Consider nightly instead."
        )
    # -- Hardware: GPU count --
    if gpus_with_vram == 0:
        recs.append(MarkerRecommendation("gpu_0", "no GPU VRAM used"))
    else:
        marker = f"gpu_{gpus_with_vram}"
        recs.append(
            MarkerRecommendation(
                marker,
                f"{gpus_with_vram} GPU(s) used, peak {_format_mib(max_peak_mib)}",
            )
        )
    # -- Hardware: VRAM requirement --
    if used_vram > _PLATEAU_TOLERANCE_MIB:
        padded_peak_mib = int(max_peak_mib * _VRAM_SAFETY_FACTOR)
        padded_peak_gib = round(padded_peak_mib / 1024, 1)
        recs.append(
            MarkerRecommendation(
                f"max_vram_gib({padded_peak_gib})",
                f"peak {_format_mib(max_peak_mib)} GPU RAM used "
                f"(+10% safety: {_format_mib(padded_peak_mib)})",
            )
        )
        # Warn about GPU cards that would OOM
        for card_gib, card_name in _GPU_REFERENCE_CARDS:
            if padded_peak_gib > card_gib:
                warnings.append(f"Will OOM on {card_name} ({card_gib} GiB).")
    # -- Timeout --
    timeout_val = int(math.ceil(wall_secs * 3.0))
    timeout_val = max(timeout_val, 10)
    recs.append(
        MarkerRecommendation(
            f"timeout({timeout_val})",
            f"wall time {wall_secs:.1f}s, based on {num_runs} run{'s' if num_runs != 1 else ''}",
        )
    )
    # -- Memory leak warning --
    if any_leaked:
        leaked_reports = [
            r for r in reports if abs(r.leaked_mib) >= _PLATEAU_TOLERANCE_MIB
        ]
        for r in leaked_reports:
            warnings.append(
                f"GPU {r.gpu_idx}: VRAM not fully released "
                f"(baseline {_format_mib(r.baseline_mib)} -> "
                f"final {_format_mib(r.final_mib)}, "
                f"delta {_format_mib(r.leaked_mib)}). "
                f"Possible leak or teardown issue."
            )
    return recs, warnings
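The `max_vram_gib` and `timeout` values in the test configs follow from two small formulas in this function: peak MiB times the 1.1 safety factor, and three times the wall time with a 10 s floor. The standalone sketch below reproduces that arithmetic. `suggest_markers` is a hypothetical helper, and 7987 MiB is an assumed raw reading chosen because it displays as the "7.8 GiB" quoted in the marker comments.

```python
import math

VRAM_SAFETY_FACTOR = 1.1  # same value as _VRAM_SAFETY_FACTOR
TIMEOUT_MULTIPLIER = 3.0  # same multiplier as the timeout rule
MIN_TIMEOUT_SECONDS = 10  # same floor as the timeout rule


def suggest_markers(peak_mib, wall_secs):
    """Return (max_vram_gib, timeout_seconds) for one profiled run."""
    padded_mib = int(peak_mib * VRAM_SAFETY_FACTOR)  # truncate, as the profiler does
    max_vram_gib = round(padded_mib / 1024, 1)
    timeout = max(int(math.ceil(wall_secs * TIMEOUT_MULTIPLIER)), MIN_TIMEOUT_SECONDS)
    return max_vram_gib, timeout


# 7987 MiB (about 7.8 GiB) and 42.2 s give the "aggregated" config's markers.
print(suggest_markers(peak_mib=7987, wall_secs=42.2))  # (8.6, 127)
```

Because the wall times in the marker comments are rounded to one decimal, recomputing a timeout from the comment can differ by one second from the committed value.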
def _print_recommendations(
    recs: list[MarkerRecommendation],
    warnings: list[str],
    pytest_args: list[str] | None = None,
):
    print("--- Recommended markers (copy-paste into your test) ---")
    if pytest_args:
        print(
            f"# Measured using: tests/utils/profile_pytest.py {' '.join(pytest_args)}"
        )
    else:
        print("# Measured using: tests/utils/profile_pytest.py")
    for r in recs:
        print(f"@pytest.mark.{r.marker} # {r.reason}")
    # Show example so user knows where to place the markers
    test_name = None
    if pytest_args:
        test_name = next(
            (a.rsplit("::", 1)[-1] for a in pytest_args if "::" in a), None
        )
    print(f"def {test_name or 'test_something'}(...):")
    print(" ...")
    if warnings:
        print()
        for w in warnings:
            print(f" WARNING: {w}")
    print()
_DEFAULT_PROBE_TIMEOUT = 300  # 5 minutes max per profile run


def _run_once(
    pytest_args: list[str],
    interval: float = 0.1,
    baseline_seconds: float = 3.0,
    teardown_seconds: float = 5.0,
    extra_env: dict[str, str] | None = None,
    quiet: bool = False,
    run_label: str | None = None,
    timeout: float = _DEFAULT_PROBE_TIMEOUT,
) -> tuple[int, float, list[GpuReport], list[GpuSample]]:
    """Run pytest once with GPU sampling.

    When *run_label* is set, each line of pytest stdout/stderr is prefixed
    with ``[run_label]`` so multi-run output is easy to follow.

    Returns (exit_code, wall_secs, reports, raw_samples).
    """
    sampler = _Sampler(interval=interval)
    sampler.start()
    if not quiet:
        print(f"Sampling baseline for {baseline_seconds}s ...")
    time.sleep(baseline_seconds)
    baseline_end = time.monotonic() - sampler._t0
    pytest_cmd = [sys.executable, "-m", "pytest"] + list(pytest_args)
    if not quiet:
        print(f"Running: {' '.join(pytest_cmd)}")
    sys.stdout.flush()
    env = os.environ.copy()
    env.setdefault("HF_HUB_OFFLINE", "1")
    if extra_env:
        env.update(extra_env)
    capture = run_label is not None
    t_start = time.monotonic()
    timed_out = False
    try:
        result = subprocess.run(
            pytest_cmd,
            env=env,
            capture_output=capture,
            text=capture or None,
            timeout=timeout,
        )
        rc = result.returncode
    except subprocess.TimeoutExpired:
        timed_out = True
        rc = 1
        if not quiet or run_label:
            print(
                f" [TIMEOUT] pytest exceeded {timeout:.0f}s limit "
                f"(teardown likely hung)"
            )
    if not timed_out and capture:
        prefix = f"[{run_label}] "
        for line in result.stdout.splitlines():
            print(f"{prefix}{line}")
        for line in result.stderr.splitlines():
            print(f"{prefix}{line}", file=sys.stderr)
        sys.stdout.flush()
    wall_secs = time.monotonic() - t_start
    test_end = time.monotonic() - sampler._t0
    if not quiet:
        print(f"Sampling teardown for {teardown_seconds}s ...")
    time.sleep(teardown_seconds)
    sampler.stop()
    reports = _build_reports(sampler.samples, baseline_end, test_end)
    return rc, wall_secs, reports, sampler.samples


def _find_min_vram(
    pytest_args: list[str],
    interval: float = 0.1,
    baseline_seconds: float = 2.0,
    teardown_seconds: float = 2.0,
    recommend: bool = True,
    csv_path: str | None = None,
) -> int:
    """Binary search _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE to find the minimum VRAM a test needs.

    Sets _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE env var (honored by agg.sh and similar scripts),
    runs the test at each profile point, and bisects until the boundary is found.
    """
    gpu_info = _query_gpu_stats()
    if not gpu_info:
        raise RuntimeError("NVML returned no GPU data")
    used_mib = gpu_info[0][1]
    total_mib = gpu_info[0][2]
    free_mib = total_mib - used_mib
    total_gib = total_mib / 1024
    model_name = _extract_model_from_markers(pytest_args)

    print("\n--- FIND MINIMUM VRAM (binary search) ---")
    print(f" GPU total : {total_gib:.1f} GiB")
    print(
        f" GPU free : {free_mib / 1024:.1f} GiB "
        f"(in use: {used_mib / 1024:.1f} GiB)"
    )
    print(f" Test : {' '.join(pytest_args)}")
    if model_name:
        print(f" Model : {model_name}")

    # Warn if something is already consuming significant GPU memory
    hogged_pct = used_mib / total_mib * 100
    if hogged_pct > 10:
        print(f"\n {'!' * 72}")
        print(
            f" WARNING: {used_mib / 1024:.1f} GiB ({hogged_pct:.0f}%) of GPU memory "
            f"is already in use!"
        )
        print(" Another process is hogging the GPU. Results will be inaccurate")
        print(
            " because _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE is a fraction of TOTAL memory,"
        )
        print(" not FREE memory. Kill other GPU processes first.")
        print(f" {'!' * 72}")
        print()

    lo = 0.05
    hi = 0.95
    tolerance = 0.05
    max_iterations = math.ceil(math.log2((hi - lo) / tolerance))
    last_pass_util: float | None = None
    last_pass_peak_mib: int = 0
    elapsed_times: list[float] = []
    all_peak_mibs: list[int] = []
    pass_wall_times: list[float] = []

    print(f" Range : {lo:.0%} - {hi:.0%} (tolerance {tolerance:.0%})")
    print(
        f" Max iter: {max_iterations + 1} (1 validation + {max_iterations} bisections)"
    )
    print()

    # First, verify the test passes at hi (0.95)
    print(
        f" [profile 1/{max_iterations + 1}] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE={hi:.2f} "
        f"(allowed max GPU {hi * total_gib:.1f} GiB) [validation run]"
    )
    sys.stdout.flush()
    t_iter_start = time.monotonic()
    label = f"profile 1/{max_iterations + 1}"
    rc, wall, reports, raw_samples = _run_once(
        pytest_args,
        interval=interval,
        baseline_seconds=baseline_seconds,
        teardown_seconds=teardown_seconds,
        extra_env={"_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE": f"{hi:.2f}"},
        quiet=True,
        run_label=label,
    )
    iter_elapsed = time.monotonic() - t_iter_start
    elapsed_times.append(iter_elapsed)

    if rc != 0:
        print(
            f" [FAIL] allowed GPU = {hi * total_gib:.1f} GiB ({hi:.0%}), "
            f"test fails even at max utilization. Cannot determine minimum."
        )
        return rc

    peak_mib = max((r.peak_mib for r in reports), default=0)
    all_peak_mibs.append(peak_mib)
    last_pass_util = hi
    last_pass_peak_mib = peak_mib
    last_pass_reports = reports
    last_pass_samples = raw_samples
    pass_wall_times.append(wall)
    print(
        f" [PASS] allowed GPU = {hi * total_gib:.1f} GiB ({hi:.0%}), "
        f"peak GPU used = {_format_mib(peak_mib)}, wall {wall:.0f}s, "
        f"iter took {iter_elapsed:.0f}s"
    )

    # Use 2x the first profile's time (floored at 60s) as the timeout for
    # subsequent profiles. If a profile takes longer than this, it's likely
    # stuck in teardown.
    baseline_time = iter_elapsed
    probe_timeout = max(baseline_time * 2, 60)
    print(f" Profile timeout: {probe_timeout:.0f}s (2x first profile, min 60s)")

    iteration = 0
    while (hi - lo) > tolerance:
        iteration += 1
        probe_num = iteration + 1
        mid = (lo + hi) / 2
        remaining = max_iterations + 1 - probe_num
        avg_iter = sum(elapsed_times) / len(elapsed_times)
        eta_s = remaining * avg_iter
        label = f"profile {probe_num}/{max_iterations + 1}"
        print(
            f"\n [{label}] "
            f"_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE={mid:.2f} "
            f"(allowed max GPU {mid * total_gib:.1f} GiB) "
            f"[~{remaining} iters left, profiling ETA ~{eta_s:.0f}s]"
        )
        sys.stdout.flush()

        stop_progress = threading.Event()
        t_iter_start = time.monotonic()
        is_tty = sys.stderr.isatty()

        def _print_progress(t0: float, expected: float, stop: threading.Event) -> None:
            if not is_tty:
                return
            term_width = shutil.get_terminal_size((80, 24)).columns
            bar_total = max(term_width - 40, 10)
            while not stop.wait(2):
                elapsed = time.monotonic() - t0
                frac = min(elapsed / expected, 1.0) if expected > 0 else 0
                filled = int(frac * bar_total)
                bar = "\u2588" * filled + "\u2591" * (bar_total - filled)
                pct = frac * 100
                line = f" [{bar}] {elapsed:5.0f}s / ~{expected:.0f}s ({pct:3.0f}%)"
                sys.stderr.write(f"\r{line}")
                sys.stderr.flush()

        progress_thread = threading.Thread(
            target=_print_progress,
            args=(t_iter_start, baseline_time, stop_progress),
            daemon=True,
        )
        progress_thread.start()

        rc, wall, reports, raw_samples = _run_once(
            pytest_args,
            interval=interval,
            baseline_seconds=baseline_seconds,
            teardown_seconds=teardown_seconds,
            extra_env={"_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE": f"{mid:.2f}"},
            quiet=True,
            run_label=label,
            timeout=probe_timeout,
        )

        stop_progress.set()
        progress_thread.join(timeout=2)
        if is_tty:
            sys.stderr.write(
                "\r" + " " * shutil.get_terminal_size((80, 24)).columns + "\r"
            )
            sys.stderr.flush()

        iter_elapsed = time.monotonic() - t_iter_start
        elapsed_times.append(iter_elapsed)
        peak_mib = max((r.peak_mib for r in reports), default=0)
        all_peak_mibs.append(peak_mib)

        if rc == 0:
            last_pass_util = mid
            last_pass_peak_mib = peak_mib
            last_pass_reports = reports
            last_pass_samples = raw_samples
            pass_wall_times.append(wall)
            hi = mid
            print(
                f" [PASS] allowed GPU = {mid * total_gib:.1f} GiB ({mid:.0%}), "
                f"peak GPU used = {_format_mib(peak_mib)}, wall {wall:.0f}s, "
                f"iter took {iter_elapsed:.0f}s"
            )
        else:
            lo = mid
            print(
                f" [FAIL] allowed GPU = {mid * total_gib:.1f} GiB ({mid:.0%}), "
                f"OOM or error, iter took {iter_elapsed:.0f}s"
            )

    # Detect if _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE is being ignored: all peaks are nearly
    # identical despite wildly different utilization caps.
    if len(all_peak_mibs) >= 3:
        peak_range = max(all_peak_mibs) - min(all_peak_mibs)
        if peak_range < _PLATEAU_TOLERANCE_MIB:
            print(f"\n {'!' * 72}")
            print(
                f" WARNING: Peak VRAM was ~{_format_mib(all_peak_mibs[0])} across ALL "
                f"{len(all_peak_mibs)} probes (range: {peak_range} MiB)."
            )
            print(
                " This strongly suggests the test IGNORES the _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"
            )
            print(" env var. Binary search results are UNRELIABLE — no marker")
            print(" recommendation will be provided.")
            print(" ")
            print(
                " FIX: The test (or its launch script) must read _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"
            )
            print(" and pass --gpu-memory-utilization to vLLM / the engine.")
            print(" See tests/README.md 'GPU VRAM Profiler' for details.")
            print(f" {'!' * 72}")
            return 4

    # Results
    assert last_pass_util is not None
    min_vram_gib = last_pass_util * total_gib
    padded_peak_mib = int(last_pass_peak_mib * _VRAM_SAFETY_FACTOR)
    padded_peak_gib = round(padded_peak_mib / 1024, 1)

    # Extract a short test name from pytest args for the summary
    test_name = next(
        (a for a in pytest_args if "::" in a or a.endswith(".py")),
        " ".join(pytest_args),
    )
    test_short = test_name.rsplit("::", 1)[-1] if "::" in test_name else test_name

    print("\n--- RESULT ---")
    print(f" Lowest passing utilization : {last_pass_util:.0%}")
    print(
        f" Minimum VRAM needed : ~{min_vram_gib:.1f} GiB "
        f"(peak observed: {_format_mib(last_pass_peak_mib)}, "
        f"+10% safety: {_format_mib(padded_peak_mib)})"
    )
    print(f" {test_short}: @pytest.mark.max_vram_gib({padded_peak_gib})")

    # Full marker recommendations using average wall time across all passing runs
    if recommend:
        avg_pass_wall = sum(pass_wall_times) / len(pass_wall_times)
        recs, warnings = _recommend_markers(
            last_pass_reports, avg_pass_wall, model_name, num_runs=len(pass_wall_times)
        )
        _print_recommendations(recs, warnings, pytest_args=pytest_args)

    if csv_path and last_pass_samples:
        _write_csv(last_pass_samples, csv_path)
        print(f"Raw samples (last passing run) written to {csv_path}")

    return 0


def main(argv: list[str] | None = None) -> int:
    logging.basicConfig(
        level=logging.INFO,
        format="%(levelname)s: %(message)s",
    )
    parser = argparse.ArgumentParser(
        description="Profile GPU memory during a pytest run.",
        usage="%(prog)s [options] [-- ] pytest-args...",
    )
    parser.add_argument(
        "--interval",
        type=float,
        default=0.1,
        help="Sampling interval in seconds (default: 0.1)",
    )
    parser.add_argument(
        "--baseline-seconds",
        type=float,
        default=3.0,
        help="Seconds to sample baseline before launching pytest (default: 3.0)",
    )
    parser.add_argument(
        "--teardown-seconds",
        type=float,
        default=5.0,
        help="Seconds to sample after pytest exits to measure teardown (default: 5.0)",
    )
    parser.add_argument(
        "--csv",
        type=str,
        default=None,
        help="Write raw samples to this CSV file",
    )
    parser.add_argument(
        "--no-recommend",
        action="store_true",
        default=False,
        help="Suppress marker recommendations",
    )
    parser.add_argument(
        "--no-find-min-vram",
        action="store_true",
        default=False,
        help="Disable the default binary-search mode that finds minimum VRAM. "
        "When set, runs a single profiling pass instead.",
    )

    raw = argv if argv is not None else sys.argv[1:]
    if "--" in raw:
        split_idx = raw.index("--")
        args = parser.parse_args(raw[:split_idx])
        pytest_args = raw[split_idx + 1 :]
    else:
        args, pytest_args = parser.parse_known_args(raw)

    if not pytest_args:
        parser.error("No pytest arguments provided")

    # Validate that test file paths actually exist
    for arg in pytest_args:
        if arg.startswith("-"):
            continue
        test_path = arg.split("::")[0]
        looks_like_test_path = test_path.endswith(".py") or (os.path.sep in test_path)
        if looks_like_test_path and not os.path.exists(test_path):
            parser.error(f"Test path does not exist: {test_path}")

    gpu_info = _query_gpu_stats()
    if not gpu_info:
        raise RuntimeError("NVML returned no GPU data")
    used_mib = gpu_info[0][1]
    total_mib = gpu_info[0][2]
    hogged_pct = used_mib / total_mib * 100
    if hogged_pct > 10:
        print(
            f"\nWARNING: {used_mib / 1024:.1f} GiB ({hogged_pct:.0f}%) of GPU memory "
            f"is already in use! Results may be inaccurate.\n"
        )

    if not args.no_find_min_vram:
        return _find_min_vram(
            pytest_args,
            interval=args.interval,
            baseline_seconds=args.baseline_seconds,
            teardown_seconds=args.teardown_seconds,
            recommend=not args.no_recommend,
            csv_path=args.csv,
        )

    model_name = _extract_model_from_markers(pytest_args)
    rc, wall_secs, reports, samples = _run_once(
        pytest_args,
        interval=args.interval,
        baseline_seconds=args.baseline_seconds,
        teardown_seconds=args.teardown_seconds,
    )
    _print_report(reports, rc, wall_secs, model_name=model_name)

    if not args.no_recommend and reports:
        recs, warnings = _recommend_markers(reports, wall_secs, model_name=model_name)
        _print_recommendations(recs, warnings, pytest_args=pytest_args)

    if args.csv:
        _write_csv(samples, args.csv)
        print(f"Raw samples written to {args.csv}")

    return rc


if __name__ == "__main__":
    if (
        os.environ.get("CI")
        or os.environ.get("GITHUB_ACTIONS")
        or os.environ.get("GITLAB_CI")
    ):
        print("ERROR: profile_pytest.py must not run in CI.", file=sys.stderr)
        raise SystemExit(1)
    raise SystemExit(main())
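
For readers who want to reason about the search in isolation, the bisection in `_find_min_vram` can be sketched without any GPU or pytest involvement. The snippet below is a minimal illustration, not the profiler's actual code: `bisect_min_fraction` and the `passes` callback are hypothetical names, and `passes` stands in for "run the test with `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` set to this fraction and check for exit code 0". The bounds and tolerance are copied from the function above.

```python
import math
from typing import Callable, Optional


def bisect_min_fraction(
    passes: Callable[[float], bool],  # hypothetical stand-in for one profiled pytest run
    lo: float = 0.05,
    hi: float = 0.95,
    tolerance: float = 0.05,
) -> tuple[Optional[float], int]:
    """Return (lowest passing fraction found, number of probes run)."""
    probes = 1
    if not passes(hi):  # validation run at the upper bound
        return None, probes
    last_pass = hi
    while (hi - lo) > tolerance:
        mid = (lo + hi) / 2
        probes += 1
        if passes(mid):
            # The test fits at mid, so try a smaller budget next.
            last_pass = mid
            hi = mid
        else:
            # The test failed (OOM or error), so raise the floor.
            lo = mid
    return last_pass, probes


# Hypothetical test that needs at least 40% of total VRAM to pass.
min_frac, probes = bisect_min_fraction(lambda frac: frac >= 0.40)

# Same upper bound on probe count that the profiler prints as "Max iter".
max_probes = 1 + math.ceil(math.log2((0.95 - 0.05) / 0.05))
print(f"lowest passing fraction: {min_frac:.2f} ({probes} probes, max {max_probes})")
```

Because the search stops once the interval is narrower than the tolerance, the reported fraction can overshoot the true minimum by up to that tolerance, which is one reason the final marker is derived from the observed peak rather than from the fraction alone.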