feat: GPU VRAM profiler via memory fraction injection + profiled test markers (part 2 - vLLM only) (#6719)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

Unverified Commit 0b20745e authored Mar 18, 2026 by Keiven C Committed by GitHub Mar 18, 2026
20 changed files
--- a/examples/backends/vllm/launch/vllm_serve_embedding_cache.sh
+++ b/examples/backends/vllm/launch/vllm_serve_embedding_cache.sh
@@ -28,10 +28,12 @@ if [[ "$CAPACITY_GB" != "0" ]]; then
    }")
 fi
+GPU_MEM_UTIL="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-.9}"
 CUDA_VISIBLE_DEVICES=2 \
 vllm serve "$MODEL" \
    --enable-log-requests \
    --max-model-len 16384 \
-    --gpu-memory-utilization .9 \
+    --gpu-memory-utilization "$GPU_MEM_UTIL" \
    "${EC_ARGS[@]}" \
    "${EXTRA_ARGS[@]}"
--- a/examples/backends/vllm/mm_router_worker/launch.sh
+++ b/examples/backends/vllm/mm_router_worker/launch.sh
@@ -20,7 +20,7 @@ MODEL="${MODEL:-Qwen/Qwen3-VL-8B-Instruct}"
 NAMESPACE="${NAMESPACE:-dynamo}"
 HTTP_PORT="${HTTP_PORT:-8000}"
 BLOCK_SIZE="${BLOCK_SIZE:-16}"            # Must match vLLM backend KV block size
-GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.85}"
+GPU_MEMORY_UTILIZATION="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-${GPU_MEMORY_UTILIZATION:-0.85}}"
 MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
 NATS_SERVER="${NATS_SERVER:-nats://127.0.0.1:4222}"
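
Note on the nested default above: `${A:-${B:-default}}` resolves in priority order, so the profiler override wins over a user-supplied `GPU_MEMORY_UTILIZATION`, which in turn wins over the built-in `0.85`. A minimal sketch of that behavior follows; the function name `resolve_gpu_mem_util` is hypothetical and exists only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the commit): wraps the same nested
# parameter expansion used in launch.sh so it can be called repeatedly.
resolve_gpu_mem_util() {
    echo "${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-${GPU_MEMORY_UTILIZATION:-0.85}}"
}

unset _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE GPU_MEMORY_UTILIZATION

# Nothing set: falls through to the built-in default.
resolve_gpu_mem_util                                   # prints 0.85

# User value set, no profiler override: the user value is used.
GPU_MEMORY_UTILIZATION=0.70 resolve_gpu_mem_util       # prints 0.70

# Both set: the profiler override takes precedence.
_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.40 GPU_MEMORY_UTILIZATION=0.70 \
    resolve_gpu_mem_util                               # prints 0.40
```

Because `:-` treats an empty string like an unset variable, exporting the override as an empty value leaves the user value or default in effect.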

--- a/examples/common/gpu_utils.md
+++ b/examples/common/gpu_utils.md
@@ -57,7 +57,7 @@ controls the *overall* VRAM budget (and thus whether the model fits), but the
 KV cache portion is pinned to the explicit byte value.
 Consequence for profiling: if a script uses `--kv-cache-memory-bytes`,
-changing `DYN_GPU_MEMORY_FRACTION_OVERRIDE` (which maps to
+changing `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` (which maps to
 `--gpu-memory-utilization`) won't change the KV cache size, only the leftover
 headroom for activations and overhead.
@@ -256,14 +256,13 @@ to get 10 GiB of KV cache with a 5 GiB model.
 The helper functions in `gpu_utils.sh` handle these differences:
 - `gpu_gb_to_total_fraction`: for vLLM/sglang (fraction of total VRAM)
 - `gpu_gb_to_free_fraction`: for TensorRT-LLM (fraction of free VRAM)
-- `gpu_worker_fraction <engine>`: unified wrapper — reads `_EW_*` vars from
-  `estimate_worker_vram` and calls the right function for the engine.
+- `gpu_worker_fraction <engine> <total_gib> <kv_gib>`: converts estimated GiB
+  into the engine-appropriate fraction (total for vllm/sglang, free for trtllm).
-Launch scripts use `gpu_worker_fraction` so they all follow the same pattern:
+Launch scripts use `build_gpu_mem_args` which calls these internally:
 ```bash
-estimate_worker_vram "$MODEL" "$SEQ_LEN" "$CONCURRENCY" trtllm
-GPU_MEM_FRACTION=$(gpu_worker_fraction trtllm)
+GPU_MEM_FRACTION=$(build_gpu_mem_args trtllm --model "$MODEL" --max-model-len "$SEQ_LEN" --max-num-seqs "$CONCURRENCY")
 ```
 ---
@@ -291,7 +290,7 @@ kv_cache_gib = kv_bytes_per_token * max_model_len * max_concurrent_seqs / (1024^
 ---
-## `DYN_GPU_MEMORY_FRACTION_OVERRIDE`
+## `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE`
 Environment variable used by Dynamo's VRAM profiler to binary-search the minimum
 memory fraction a script needs.
@@ -299,8 +298,8 @@ memory fraction a script needs.
 - Maps to `--gpu-memory-utilization` in vLLM and `--mem-fraction-static` in sglang.
 - For TensorRT-LLM, maps to `kv_cache_config.free_gpu_memory_fraction` via
  `--override-engine-args`.
-- Launch scripts use `gpu_worker_fraction <engine>` to compute the default
-  fraction; the override bypasses this and splits the raw value between workers.
+- Launch scripts use `build_gpu_mem_args` to compute the default fraction;
+  the override bypasses the estimator and splits the raw value between workers.
 - Scripts that use `--kv-cache-memory-bytes` (vLLM) bypass the fraction-based KV
  cache sizing, making the profiler's fraction override ineffective for KV cache.
-  Those scripts should warn when `DYN_GPU_MEMORY_FRACTION_OVERRIDE` is set.
+  Those scripts should warn when `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` is set.
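
Note: the profiler itself is not part of this diff, so the following is only a sketch of what "binary-search the minimum memory fraction" could look like. `launch_ok` is a hypothetical stand-in for "run the launch script with `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` set and check that it comes up"; here it is stubbed with a fixed threshold so the example runs without a GPU. The search bounds and tolerance are also assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical probe: succeeds when the given fraction is "enough".
# A real probe would export _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE="$1",
# start the launch script, and report whether it served a request.
launch_ok() {
    awk -v f="$1" 'BEGIN { exit !(f >= 0.37) }'
}

# min_fraction [lo] [hi] [tol]
# Bisects [lo, hi], assuming lo fails, hi passes, and passing is monotonic.
# Prints the smallest known-passing fraction, rounded UP to two decimals
# so the printed value is never below a fraction that actually passed.
min_fraction() {
    local lo="${1:-0.05}" hi="${2:-0.95}" tol="${3:-0.01}" mid
    while awk -v lo="$lo" -v hi="$hi" -v t="$tol" 'BEGIN { exit !(hi - lo > t) }'; do
        mid=$(awk -v lo="$lo" -v hi="$hi" 'BEGIN { printf "%.4f", (lo + hi) / 2 }')
        if launch_ok "$mid"; then
            hi="$mid"    # passed: try a smaller fraction
        else
            lo="$mid"    # failed: need a larger fraction
        fi
    done
    awk -v hi="$hi" 'BEGIN {
        x = hi * 100; c = int(x); if (c < x) c++
        printf "%.2f\n", c / 100
    }'
}

min_fraction    # prints a value just above the stub threshold of 0.37
```

Each probe in a real run is a full engine launch, so the tolerance directly trades profiling time against how tight the recorded fraction is.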
--- a/examples/common/gpu_utils.sh
+++ b/examples/common/gpu_utils.sh
 #!/bin/bash
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 #
 # Shared GPU utility functions for launch scripts.
 #
-# Usage:
+# CLI:
+#   ./gpu_utils.sh <engine> --model <name> [options...]   Print GPU fraction
+#   ./gpu_utils.sh --self-test                            Run self-test suite
+#
+# Source:
 #   source "$(dirname "$(readlink -f "$0")")/../common/gpu_utils.sh"
 #   # or with SCRIPT_DIR already set:
 #   source "$SCRIPT_DIR/../common/gpu_utils.sh"
 #
-# Functions:
-#   get_model_params <model>           Set _MP_* vars for a known model's architecture
-#   estimate_worker_vram <model> ...   Set _EW_* vars with per-worker VRAM estimate
-#   gpu_worker_fraction <engine>       Convert _EW_* estimate → engine-appropriate fraction
-#   gpu_gb_to_total_fraction <gib>     Convert absolute GiB → fraction of TOTAL VRAM (vLLM/sglang)
-#   gpu_gb_to_free_fraction <gib>      Convert absolute GiB → fraction of FREE VRAM (TensorRT-LLM)
+# Functions (all return via stdout — no hidden globals):
+#   build_gpu_mem_args <engine> <model> ...     Prints fraction (or empty)
+#   get_model_params <model>                    Prints "pb wb layers kvh hd"
+#   estimate_worker_vram <model> ...            Prints "w_gib kv_gib oh_gib total_gib"
+#   gpu_worker_fraction <engine> <total> <kv>   Prints engine-appropriate fraction
+#   gpu_peak_to_engine_fraction <engine> <peak> Prints fraction (subtracts engine overhead)
+#   gpu_gb_to_total_fraction <gib>              Prints fraction of TOTAL VRAM (vLLM/sglang)
+#   gpu_gb_to_free_fraction <gib>               Prints fraction of FREE VRAM (TensorRT-LLM)
+# build_gpu_mem_args <engine> [options...]
+#
+# Prints the computed memory fraction to stdout (empty line if none).
+# Callers capture with:  GPU_MEM_FRACTION=$(build_gpu_mem_args ...)
+#
+# Priority:
+#   1. _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE  (profiler binary search)
+#   2. Engine flag passed to this function  (user already chose a value)
+#   3. estimate_worker_vram + gpu_worker_fraction  (model architecture)
+#   4. Empty  (let engine use its own default)
+#
+# Options (each flag accepts engine-specific aliases):
+#   --model NAME                 Model name (required).
+#     aliases: --model-path        (sglang, trtllm)
+#   --max-model-len N            Max tokens per sequence (default: 4096).
+#     aliases: --context-length    (sglang)
+#              --max-seq-len       (trtllm)
+#   --max-num-seqs N             Concurrent sequences to budget for (default: 2).
+#     aliases: --max-running-requests (sglang)
+#              --max-batch-size       (trtllm)
+#   --gpu-memory-utilization F   User override (vllm flag name).  Skipped when empty.
+#   --mem-fraction-static F      User override (sglang flag name).
+#   --workers-per-gpu N          Divide the fraction by N (for shared-GPU disagg).
+#
+# Usage:
+#   # Simple single-worker (agg.sh)
+#   GPU_MEM_FRACTION=$(build_gpu_mem_args vllm \
+#       --model "$MODEL" --max-model-len "$MAX_MODEL_LEN" --max-num-seqs "$MAX_CONCURRENT_SEQS")
+#   python -m dynamo.vllm --model "$MODEL" \
+#       ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
+#
+#   # Two workers sharing one GPU (disagg_same_gpu.sh)
+#   GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL" --workers-per-gpu 2)
+#   python -m dynamo.vllm ... --gpu-memory-utilization "${GPU_MEM_FRACTION}" &
+#
+#   # sglang
+#   GPU_MEM_FRACTION=$(build_gpu_mem_args sglang --model "$MODEL" --workers-per-gpu 2)
+#   python -m dynamo.sglang ... --mem-fraction-static "${GPU_MEM_FRACTION}" &
+#
+#   # trtllm (fraction goes into JSON, not CLI)
+#   GPU_MEM_FRACTION=$(build_gpu_mem_args trtllm --model "$MODEL" --workers-per-gpu 2)
+#   OVERRIDE_ARGS=(--override-engine-args "{\"kv_cache_config\":{\"free_gpu_memory_fraction\":${GPU_MEM_FRACTION}}}")
+build_gpu_mem_args() {
+    local engine="${1:?usage: build_gpu_mem_args <engine> --model <name> [options...]}"
+    shift
+    local model=""
+    local max_model_len="4096"
+    local max_seqs="2"
+    local workers_per_gpu=1
+    local user_frac=""
+    while [[ $# -gt 0 ]]; do
+        case "$1" in
+            --model|--model-path)
+                                model="$2";           shift 2 ;;
+            --max-model-len|--context-length|--max-seq-len)
+                                max_model_len="$2";   shift 2 ;;
+            --max-num-seqs|--max-running-requests|--max-batch-size)
+                                max_seqs="$2";        shift 2 ;;
+            --gpu-memory-utilization|--mem-fraction-static)
+                                user_frac="$2";       shift 2 ;;
+            --workers-per-gpu)  workers_per_gpu="$2"; shift 2 ;;
+            *) echo "build_gpu_mem_args: unknown option '$1'" >&2; return 1 ;;
+        esac
+    done
+    if [[ -z "$model" ]]; then
+        echo "build_gpu_mem_args: --model is required" >&2
+        return 1
+    fi
+    local frac=""
+    local from_estimator=false
+    local est_w="" est_kv="" est_oh="" est_total=""
+    if [[ -n "${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}" ]]; then
+        frac="$_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"
+    elif [[ -n "$user_frac" ]]; then
+        frac="$user_frac"
+    elif read -r est_w est_kv est_oh est_total <<< "$(estimate_worker_vram "$model" "$max_model_len" "$max_seqs" "$engine" 2>/dev/null)" && [[ -n "$est_total" ]]; then
+        frac=$(gpu_worker_fraction "$engine" "$est_total" "$est_kv")
+        from_estimator=true
+    fi
+    # --workers-per-gpu divides profiler/user/estimator results only
+    if [[ -n "$frac" && "$workers_per_gpu" -gt 1 ]]; then
+        frac=$(awk -v f="$frac" -v n="$workers_per_gpu" 'BEGIN { printf "%.2f", f / n }')
+    fi
+    echo "$frac"
+}
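
Note: the priority chain and the `--workers-per-gpu` division in `build_gpu_mem_args` can be exercised without a GPU by replacing the estimator with a stub. In the sketch below, `pick_fraction` and `stub_estimate` are hypothetical names, and the fixed `0.30` is an invented estimator result; only the ordering and the `%.2f` division mirror the function above.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for estimate_worker_vram + gpu_worker_fraction,
# which need nvidia-smi in the real script.
stub_estimate() { echo "0.30"; }

# pick_fraction <user_frac-or-empty> [workers_per_gpu]
pick_fraction() {
    local user_frac="${1:-}" workers="${2:-1}" frac=""
    if [ -n "${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}" ]; then
        frac="$_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"   # 1. profiler override
    elif [ -n "$user_frac" ]; then
        frac="$user_frac"                            # 2. user-chosen flag value
    else
        frac="$(stub_estimate)"                      # 3. estimator
    fi
    # Shared-GPU case: split whichever value won between the workers.
    if [ -n "$frac" ] && [ "$workers" -gt 1 ]; then
        frac=$(awk -v f="$frac" -v n="$workers" 'BEGIN { printf "%.2f", f / n }')
    fi
    echo "$frac"
}

unset _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE
pick_fraction "" 1                                             # prints 0.30
pick_fraction 0.70 1                                           # prints 0.70
_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.80 pick_fraction 0.70 2   # prints 0.40
```

The last call matches the self-test case in this commit where a profiler value of 0.80 split across two workers yields 0.40.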
 # get_model_params <model_name>
 #
-# Sets _MP_* variables for a known model's architecture:
-#   _MP_PARAMS_B       Total parameters in billions (all experts for MoE)
-#   _MP_WEIGHT_BYTES   Bytes per weight element (2=BF16/FP16, 1=FP8)
-#   _MP_LAYERS         Number of transformer layers
-#   _MP_KV_HEADS       Number of key-value heads (GQA groups)
-#   _MP_HEAD_DIM       Dimension per attention head
+# Prints "params_b weight_bytes layers kv_heads head_dim" to stdout.
+# Returns 1 (prints nothing) if the model is unknown.
+#
+# Fields:
+#   params_b       Total parameters in billions (all experts for MoE)
+#   weight_bytes   Bytes per weight element (2=BF16/FP16, 1=FP8)
+#   layers         Number of transformer layers
+#   kv_heads       Number of key-value heads (GQA groups)
+#   head_dim       Dimension per attention head
 #
 # KV cache is assumed BF16 (2 bytes per element) regardless of weight dtype,
 # since FP8 KV cache (--kv-cache-dtype fp8) is opt-in and not the default.
 #
-# To add a model: look up config.json on HuggingFace for num_hidden_layers,
-# num_key_value_heads, and head_dim. For VL/multimodal models, use the
-# text_config section. For MoE, _MP_PARAMS_B is the TOTAL param count
-# (all experts are loaded into VRAM).
+# To add a model:
+#   1. Find config.json at  https://huggingface.co/<model>/raw/main/config.json
+#      For VL/multimodal models, architecture params are under text_config.
+#   2. Map fields:
+#        layers    ← num_hidden_layers
+#        kv_heads  ← num_key_value_heads
+#        head_dim  ← head_dim  (or hidden_size / num_attention_heads)
+#   3. params_b: total parameter count in billions.  Derive from:
+#        - safetensors file size:  size_bytes / weight_bytes / 1e9
+#          (single file: ls -l model.safetensors; sharded: metadata.total_size
+#          in model.safetensors.index.json)
+#        - or the model card / paper
+#      For MoE: params_b is the TOTAL count (all experts loaded into VRAM).
+#   4. weight_bytes: 2 for BF16/FP16, 1 for FP8/INT8.
 #
 # Usage:
-#   get_model_params "Qwen/Qwen3-0.6B"
-#   echo "$_MP_LAYERS layers, $_MP_KV_HEADS KV heads"
+#   read -r pb wb layers kvh hd <<< "$(get_model_params "Qwen/Qwen3-0.6B")"
+#   echo "$layers layers, $kvh KV heads"
 get_model_params() {
    local model="${1:?usage: get_model_params <model_name>}"
+    local pb wb layers kvh hd
    case "$model" in
+        # https://huggingface.co/Qwen/Qwen3-0.6B/raw/main/config.json
        Qwen/Qwen3-0.6B)
-            _MP_PARAMS_B=0.6;  _MP_WEIGHT_BYTES=2
-            _MP_LAYERS=28;  _MP_KV_HEADS=8;   _MP_HEAD_DIM=128 ;;
+            pb=0.6;  wb=2; layers=28; kvh=8;  hd=128 ;;
+        # https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct/raw/main/config.json  (text_config)
+        # params_b from model.safetensors.index.json metadata.total_size / 2 / 1e9
+        Qwen/Qwen2-VL-2B-Instruct)
+            pb=2.2;  wb=2; layers=28; kvh=2;  hd=128 ;;
+        # https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct/raw/main/config.json  (text_config)
        Qwen/Qwen2.5-VL-7B-Instruct)
-            _MP_PARAMS_B=8.3;  _MP_WEIGHT_BYTES=2
-            _MP_LAYERS=28;  _MP_KV_HEADS=4;   _MP_HEAD_DIM=128 ;;
+            pb=8.3;  wb=2; layers=28; kvh=4;  hd=128 ;;
+        # https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct/raw/main/config.json  (text_config)
+        # params_b from model.safetensors size / 2 / 1e9
+        Qwen/Qwen3-VL-2B-Instruct)
+            pb=2.1;  wb=2; layers=28; kvh=8;  hd=128 ;;
+        # https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct/raw/main/config.json  (text_config)
        Qwen/Qwen3-VL-8B-Instruct)
-            _MP_PARAMS_B=9.2;  _MP_WEIGHT_BYTES=2
-            _MP_LAYERS=36;  _MP_KV_HEADS=8;   _MP_HEAD_DIM=128 ;;
+            pb=9.2;  wb=2; layers=36; kvh=8;  hd=128 ;;
+        # https://huggingface.co/Qwen/Qwen3-30B-A3B/raw/main/config.json
        Qwen/Qwen3-30B-A3B|\
        Qwen/Qwen3-30B-A3B-Instruct)
-            _MP_PARAMS_B=30.5; _MP_WEIGHT_BYTES=2
-            _MP_LAYERS=48;  _MP_KV_HEADS=4;   _MP_HEAD_DIM=128 ;;
+            pb=30.5; wb=2; layers=48; kvh=4;  hd=128 ;;
+        # Same architecture as Qwen3-30B-A3B but FP8 quantized (1 byte per weight)
        Qwen/Qwen3-VL-30B-A3B-Instruct-FP8)
-            _MP_PARAMS_B=30.5; _MP_WEIGHT_BYTES=1
-            _MP_LAYERS=48;  _MP_KV_HEADS=4;   _MP_HEAD_DIM=128 ;;
+            pb=30.5; wb=1; layers=48; kvh=4;  hd=128 ;;
+        # https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/raw/main/config.json
        meta-llama/Meta-Llama-3.1-8B-Instruct)
-            _MP_PARAMS_B=8.0;  _MP_WEIGHT_BYTES=2
-            _MP_LAYERS=32;  _MP_KV_HEADS=8;   _MP_HEAD_DIM=128 ;;
+            pb=8.0;  wb=2; layers=32; kvh=8;  hd=128 ;;
+        # https://huggingface.co/deepseek-ai/deepseek-llm-7b-base/raw/main/config.json
+        # MHA (not GQA): num_key_value_heads == num_attention_heads == 32
+        deepseek-ai/deepseek-llm-7b-base)
+            pb=6.9;  wb=2; layers=30; kvh=32; hd=128 ;;
+        # https://huggingface.co/llava-hf/llava-1.5-7b-hf/raw/main/config.json  (text_config)
+        # MHA: num_key_value_heads == num_attention_heads == 32
        llava-hf/llava-1.5-7b-hf)
-            _MP_PARAMS_B=7.1;  _MP_WEIGHT_BYTES=2
-            _MP_LAYERS=32;  _MP_KV_HEADS=32;  _MP_HEAD_DIM=128 ;;
+            pb=7.1;  wb=2; layers=32; kvh=32; hd=128 ;;
        *)
            echo "get_model_params: unknown model '$model'" >&2
            echo "Add it to get_model_params() in gpu_utils.sh" >&2
            return 1 ;;
    esac
+    echo "$pb $wb $layers $kvh $hd"
 }
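
Note: returning values on stdout instead of setting globals means callers have to capture the output and check the exit status in separate steps. The sketch below illustrates that calling pattern with a hypothetical two-field lookup, `lookup_dims`, standing in for `get_model_params`. It reads the fields through a here-document so that it also runs under plain `sh`; the commit itself uses the bash here-string form (`<<<`).

```shell
#!/usr/bin/env bash
# Hypothetical two-field analogue of get_model_params: prints "layers kv_heads"
# on success, prints nothing on stdout and returns 1 for an unknown key.
lookup_dims() {
    case "$1" in
        small) echo "28 8" ;;
        *)     echo "lookup_dims: unknown '$1'" >&2; return 1 ;;
    esac
}

describe() {
    local out layers kvh
    # Declare first, assign second: `local out=$(...)` would report the exit
    # status of `local` itself and hide a failure of lookup_dims.
    out=$(lookup_dims "$1") || return 1
    read -r layers kvh <<EOF
$out
EOF
    echo "$layers layers, $kvh KV heads"
}

describe small                                      # prints: 28 layers, 8 KV heads
describe nope 2>/dev/null || echo "unknown model"   # prints: unknown model
```

The same separation of declaration and assignment appears in `estimate_worker_vram` in this commit (`local mp_out` followed by `mp_out=$(get_model_params "$model") || return 1`).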
 # estimate_worker_vram <model> [max_model_len] [max_concurrent_seqs] [engine_or_overhead]
 #
-# Calls get_model_params, then sets:
-#   _EW_WEIGHTS_GIB    Estimated model weight memory
-#   _EW_KV_GIB         Estimated KV cache memory
-#   _EW_OVERHEAD_GIB   Overhead used (auto-computed or explicit)
-#   _EW_TOTAL_GIB      Estimated total per-worker VRAM (weights + kv + overhead)
+# Prints "weights_gib kv_gib overhead_gib total_gib" to stdout.
+# Returns 1 (prints nothing) if the model is unknown to get_model_params.
 #
 # Formula:
 #   weights = params_b * 1e9 * weight_bytes
@@ -102,68 +225,60 @@ get_model_params() {
 # See examples/common/gpu_utils.md for the full derivation.
 #
 # Usage:
-#   estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 vllm      # auto overhead
-#   estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 trtllm     # auto overhead
-#   estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 3.5        # explicit 3.5 GiB
-#   estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2            # default 2.0 GiB
-#   echo "$_EW_TOTAL_GIB GiB (w=$_EW_WEIGHTS_GIB kv=$_EW_KV_GIB oh=$_EW_OVERHEAD_GIB)"
+#   read -r w kv oh total <<< "$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 vllm)"
+#   echo "$total GiB (w=$w kv=$kv oh=$oh)"
 estimate_worker_vram() {
    local model="${1:?usage: estimate_worker_vram <model> [seq_len] [seqs] [engine_or_overhead]}"
    local seqlen="${2:-4096}"
    local seqs="${3:-2}"
    local engine_or_overhead="${4:-2.0}"
-    get_model_params "$model" || return 1
+    local mp_out
+    mp_out=$(get_model_params "$model") || return 1
+    local pb wb layers kvh hd
+    read -r pb wb layers kvh hd <<< "$mp_out"
    local overhead
    case "$engine_or_overhead" in
-        vllm)   overhead=$(awk -v p="$_MP_PARAMS_B" 'BEGIN { printf "%.1f", 1.2 + 1.0 * sqrt(p) }') ;;
+        vllm)   overhead=$(awk -v p="$pb" 'BEGIN { printf "%.1f", 1.2 + 1.0 * sqrt(p) }') ;;
-        sglang) overhead=$(awk -v p="$_MP_PARAMS_B" 'BEGIN { printf "%.1f", 2.5 + 1.5 * sqrt(p) }') ;;
+        sglang) overhead=$(awk -v p="$pb" 'BEGIN { printf "%.1f", 2.5 + 1.5 * sqrt(p) }') ;;
-        trtllm) overhead=$(awk -v p="$_MP_PARAMS_B" 'BEGIN { printf "%.1f", 2.0 + 1.2 * sqrt(p) }') ;;
+        trtllm) overhead=$(awk -v p="$pb" 'BEGIN { printf "%.1f", 2.0 + 1.2 * sqrt(p) }') ;;
        *)      overhead="$engine_or_overhead" ;;
    esac
-    _EW_OVERHEAD_GIB="$overhead"
-    read -r _EW_WEIGHTS_GIB _EW_KV_GIB _EW_TOTAL_GIB <<< "$(awk \
-        -v pb="$_MP_PARAMS_B" -v wbytes="$_MP_WEIGHT_BYTES" \
-        -v layers="$_MP_LAYERS" -v heads="$_MP_KV_HEADS" -v dim="$_MP_HEAD_DIM" \
+    awk -v pb="$pb" -v wbytes="$wb" \
+        -v layers="$layers" -v heads="$kvh" -v dim="$hd" \
        -v seqlen="$seqlen" -v seqs="$seqs" -v overhead="$overhead" \
        'BEGIN {
            gib = 1024 * 1024 * 1024
            w   = pb * 1e9 * wbytes / gib
            kv  = 2 * layers * heads * dim * 2 * seqlen * seqs / gib
-            printf "%.1f %.1f %.1f", w, kv, w + kv + overhead
+            printf "%.1f %.1f %.1f %.1f", w, kv, overhead, w + kv + overhead
-        }')"
+        }'
 }
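
Note: as a worked instance of the formula, the sketch below plugs in the `Qwen/Qwen3-0.6B` row from the model table (`0.6 2 28 8 128`) with 4096 tokens, 2 sequences, and the vllm overhead rule `1.2 + 1.0 * sqrt(params_b)`. The function name `estimate_demo` is hypothetical; the arithmetic is copied from `estimate_worker_vram`.

```shell
#!/usr/bin/env bash
# Hypothetical demo: the same awk arithmetic as estimate_worker_vram, with
# the Qwen/Qwen3-0.6B parameters hard-coded so no model lookup is needed.
estimate_demo() {
    awk -v pb=0.6 -v wbytes=2 -v layers=28 -v heads=8 -v dim=128 \
        -v seqlen=4096 -v seqs=2 \
        'BEGIN {
            gib      = 1024 * 1024 * 1024
            # vllm overhead rule, rounded to one decimal as in the script
            overhead = sprintf("%.1f", 1.2 + 1.0 * sqrt(pb)) + 0
            # weights: parameter count times bytes per weight
            w        = pb * 1e9 * wbytes / gib
            # KV cache: 2 (K and V) * layers * kv_heads * head_dim * 2 bytes (BF16)
            # per token, times tokens per sequence, times sequences
            kv       = 2 * layers * heads * dim * 2 * seqlen * seqs / gib
            printf "%.1f %.1f %.1f %.1f\n", w, kv, overhead, w + kv + overhead
        }'
}

estimate_demo    # prints: 1.1 0.9 2.0 4.0
```

The resulting total and KV figures (4.0 and 0.9 GiB) are the same values used in the `gpu_worker_fraction` usage examples in this file.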
-# gpu_worker_fraction <engine> [gpu_index]
+# gpu_worker_fraction <engine> <total_gib> <kv_gib> [gpu_index]
 #
-# Unified fraction calculator for all engines.  Reads the _EW_* variables
-# set by estimate_worker_vram and returns the engine-appropriate fraction.
+# Convert estimated GiB into the engine-appropriate GPU memory fraction.
 #
 # Engine semantics (see examples/common/gpu_utils.md):
-#   vllm/sglang  — fraction of TOTAL VRAM.  The engine budgets weights + KV +
-#                  activations inside this limit.  We pass _EW_TOTAL_GIB.
-#   trtllm       — fraction of FREE VRAM (after model load).  The engine uses
-#                  this only for KV cache.  We pass _EW_KV_GIB.
-#
-# This lets every launch script use the same pattern:
-#   estimate_worker_vram "$MODEL" "$SEQ_LEN" "$CONCURRENCY" "$OVERHEAD_GIB"
-#   GPU_MEM_FRACTION=$(gpu_worker_fraction "<engine>")
+#   vllm/sglang  — fraction of TOTAL VRAM (uses total_gib).
+#   trtllm       — fraction of FREE VRAM after model load (uses kv_gib).
 #
 # Usage:
-#   gpu_worker_fraction vllm        # uses _EW_TOTAL_GIB, fraction of total
-#   gpu_worker_fraction sglang      # same as vllm
-#   gpu_worker_fraction trtllm      # uses _EW_KV_GIB, fraction of free
-#   gpu_worker_fraction trtllm 1    # query GPU index 1
+#   gpu_worker_fraction vllm   4.0 0.9      # fraction of total
+#   gpu_worker_fraction trtllm 4.0 0.9      # fraction of free
+#   gpu_worker_fraction trtllm 4.0 0.9 1    # query GPU index 1
 gpu_worker_fraction() {
-    local engine="${1:?usage: gpu_worker_fraction <engine> [gpu_index]}"
-    local gpu_idx="${2:-0}"
+    local engine="${1:?usage: gpu_worker_fraction <engine> <total_gib> <kv_gib> [gpu_index]}"
+    local total_gib="${2:?usage: gpu_worker_fraction <engine> <total_gib> <kv_gib>}"
+    local kv_gib="${3:?usage: gpu_worker_fraction <engine> <total_gib> <kv_gib>}"
+    local gpu_idx="${4:-0}"
    case "$engine" in
        vllm|sglang)
-            gpu_gb_to_total_fraction "$_EW_TOTAL_GIB" "$gpu_idx" ;;
+            gpu_gb_to_total_fraction "$total_gib" "$gpu_idx" ;;
        trtllm)
-            gpu_gb_to_free_fraction "$_EW_KV_GIB" "$gpu_idx" ;;
+            gpu_gb_to_free_fraction "$kv_gib" "$gpu_idx" ;;
        *)
            echo "gpu_worker_fraction: unknown engine '$engine'" >&2
            echo "Supported: vllm, sglang, trtllm" >&2
@@ -171,6 +286,51 @@ gpu_worker_fraction() {
    esac
 }
+# gpu_peak_to_engine_fraction <engine> <peak_gib> [gpu_index]
+#
+# Convert a measured/profiled GPU peak (total VRAM including CUDA context,
+# activations, etc.) into the engine-specific memory fraction flag.
+#
+# Each engine's fraction controls only a SUBSET of GPU memory (e.g. vLLM's
+# --gpu-memory-utilization covers weights + KV cache but not CUDA context).
+# This function subtracts the engine-specific overhead so the fraction
+# targets the right internal budget, keeping the real peak stable across
+# re-profiles.
+#
+# Overhead constants (GiB outside the engine's budget):
+#   vllm   2.0   CUDA ctx ~0.6 + activations/sampler ~0.5 + PyTorch alloc ~0.5
+#   sglang 2.0   (assumed same as vllm; refine when profiled)
+#   trtllm 0.0   free-fraction is measured after model load, no subtraction needed
+#
+# Usage:
+#   gpu_peak_to_engine_fraction vllm 8.6       # on 48 GiB → 0.14
+#   gpu_peak_to_engine_fraction vllm 20.9      # on 48 GiB → 0.40
+#   gpu_peak_to_engine_fraction vllm 8.6 1     # query GPU index 1
+gpu_peak_to_engine_fraction() {
+    local engine=${1:?usage: gpu_peak_to_engine_fraction <engine> <peak_gib> [gpu_index]}
+    local peak_gib=${2:?usage: gpu_peak_to_engine_fraction <engine> <peak_gib> [gpu_index]}
+    local gpu_idx=${3:-0}
+    local overhead
+    case "$engine" in
+        vllm|sglang) overhead=2.0 ;;
+        trtllm)      overhead=0.0 ;;
+        *)
+            echo "gpu_peak_to_engine_fraction: unknown engine '$engine'" >&2
+            echo "Supported: vllm, sglang, trtllm" >&2
+            return 1 ;;
+    esac
+    local budget
+    budget=$(awk -v g="$peak_gib" -v oh="$overhead" \
+        'BEGIN { b = g - oh; if (b < 1) b = 1; printf "%.1f", b }')
+    case "$engine" in
+        vllm|sglang) gpu_gb_to_total_fraction "$budget" "$gpu_idx" ;;
+        trtllm)      gpu_gb_to_free_fraction  "$budget" "$gpu_idx" ;;
+    esac
+}
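
Note: the peak-to-fraction conversion is plain arithmetic once total VRAM is known. The sketch below takes total VRAM as an argument instead of querying the GPU, so it is a hypothetical stand-in for the `gpu_gb_to_total_fraction` step. Rounding up to two decimals is an assumption, chosen because it reproduces both 48 GiB examples in the usage comment; the rounding in the real helper is not visible in this diff.

```shell
#!/usr/bin/env bash
# peak_to_fraction <peak_gib> <overhead_gib> <total_vram_gib>
# Hypothetical helper: subtracts the out-of-budget overhead from a measured
# peak, floors the result at 1 GiB, and expresses it as a fraction of total
# VRAM, rounded UP to two decimals (assumed rounding rule).
peak_to_fraction() {
    awk -v g="$1" -v oh="$2" -v total="$3" 'BEGIN {
        b = g - oh
        if (b < 1) b = 1                  # same 1 GiB floor as the script
        x = b / total * 100
        c = int(x); if (c < x) c++        # ceiling to a whole percent
        printf "%.2f\n", c / 100
    }'
}

peak_to_fraction 8.6  2.0 48    # (8.6 - 2.0) / 48 = 0.1375, prints 0.14
peak_to_fraction 20.9 2.0 48    # (20.9 - 2.0) / 48 = 0.39375, prints 0.40
peak_to_fraction 1.5  2.0 48    # budget floored to 1 GiB, prints 0.03
```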
 # gpu_gb_to_total_fraction <gib> [gpu_index]
 #
 # For vLLM / sglang: --gpu-memory-utilization is a fraction of TOTAL GPU memory.
@@ -298,3 +458,189 @@ gpu_gb_to_free_fraction() {
    }'
 }
+# ---------------------------------------------------------------------------
+# Self-test: bash gpu_utils.sh --self-test
+# ---------------------------------------------------------------------------
+_gpu_utils_self_test() {
+    local pass=0 fail=0
+    _assert() {
+        local label="$1" expected="$2" actual="$3"
+        if [[ "$expected" == "$actual" ]]; then
+            ((pass++))
+            echo "  PASS  $label"
+        else
+            ((fail++))
+            echo "  FAIL  $label  (expected='$expected'  actual='$actual')"
+        fi
+    }
+    echo "=== get_model_params ==="
+    local out
+    out=$(get_model_params "Qwen/Qwen3-0.6B")
+    _assert "known model returns 5 fields" "0.6 2 28 8 128" "$out"
+    out=$(get_model_params "nope/unknown" 2>/dev/null)
+    _assert "unknown model returns empty" "" "$out"
+    get_model_params "nope/unknown" >/dev/null 2>&1
+    _assert "unknown model exits 1" "1" "$?"
+    echo ""
+    echo "=== estimate_worker_vram ==="
+    out=$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 vllm)
+    _assert "returns 4 space-separated fields" "4" "$(echo "$out" | wc -w | tr -d ' ')"
+    local w kv oh total
+    read -r w kv oh total <<< "$out"
+    _assert "weights > 0" "yes" "$(awk -v v="$w" 'BEGIN { print (v > 0) ? "yes" : "no" }')"
+    _assert "total > weights" "yes" "$(awk -v t="$total" -v w="$w" 'BEGIN { print (t > w) ? "yes" : "no" }')"
+    out=$(estimate_worker_vram "nope/unknown" 2>/dev/null)
+    _assert "unknown model returns empty" "" "$out"
+    local out_vllm out_sglang
+    out_vllm=$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 vllm)
+    out_sglang=$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 sglang)
+    _assert "sglang overhead > vllm overhead" "yes" \
+        "$(awk -v v="$out_vllm" -v s="$out_sglang" 'BEGIN {
+            split(v, a); split(s, b); print (b[3]+0 > a[3]+0) ? "yes" : "no"
+        }')"
+    echo ""
+    echo "=== build_gpu_mem_args: estimator path (known model) ==="
+    local frac
+    frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2)
+    _assert "FRACTION non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
+    echo ""
+    echo "=== build_gpu_mem_args: unknown model, no default ==="
+    frac=$(build_gpu_mem_args vllm --model "nope/unknown")
+    _assert "FRACTION empty" "" "$frac"
+    echo ""
+    echo "=== build_gpu_mem_args: profiler wins over all ==="
+    frac=$(_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.55 \
+        build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --gpu-memory-utilization 0.70)
+    _assert "FRACTION = profiler (beats user flag)" "0.55" "$frac"
+    echo ""
+    echo "=== build_gpu_mem_args: user flag wins over estimator ==="
+    frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --gpu-memory-utilization 0.70)
+    _assert "FRACTION = user flag" "0.70" "$frac"
+    echo ""
+    echo "=== build_gpu_mem_args: empty user flag falls through ==="
+    frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2 --gpu-memory-utilization "")
+    _assert "FRACTION = estimator" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
+    echo ""
+    echo "=== build_gpu_mem_args: --workers-per-gpu divides estimator ==="
+    local undivided
+    undivided=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2)
+    frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2 --workers-per-gpu 2)
+    local expected_half
+    expected_half=$(awk -v f="$undivided" 'BEGIN { printf "%.2f", f / 2 }')
+    _assert "FRACTION halved" "$expected_half" "$frac"
+    echo ""
+    echo "=== build_gpu_mem_args: --workers-per-gpu divides profiler ==="
+    frac=$(_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.80 \
+        build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --workers-per-gpu 2)
+    _assert "FRACTION = 0.80/2 = 0.40" "0.40" "$frac"
+    echo ""
+    echo "=== build_gpu_mem_args: sglang engine (sglang flag names) ==="
+    frac=$(build_gpu_mem_args sglang --model-path "Qwen/Qwen3-0.6B" --context-length 4096 --max-running-requests 2)
+    _assert "sglang FRACTION non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
+    echo ""
+    echo "=== build_gpu_mem_args: trtllm engine (trtllm flag names) ==="
+    frac=$(build_gpu_mem_args trtllm --model-path "Qwen/Qwen3-0.6B" --max-seq-len 4096 --max-batch-size 2)
+    _assert "trtllm FRACTION non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
+    echo ""
+    echo "=== build_gpu_mem_args: --mem-fraction-static user flag (sglang) ==="
+    frac=$(build_gpu_mem_args sglang --model-path "Qwen/Qwen3-0.6B" --mem-fraction-static 0.60)
+    _assert "FRACTION = user flag" "0.60" "$frac"
+    echo ""
+    echo "=== build_gpu_mem_args: missing --model ==="
+    build_gpu_mem_args vllm 2>/dev/null
+    _assert "missing --model exits 1" "1" "$?"
+    echo ""
+    echo "=== gpu_worker_fraction: explicit args ==="
+    local frac
+    frac=$(gpu_worker_fraction vllm 4.0 0.9)
+    _assert "vllm returns non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
+    frac=$(gpu_worker_fraction trtllm 4.0 0.9)
+    _assert "trtllm returns non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
+    gpu_worker_fraction badengine 4.0 0.9 >/dev/null 2>&1
+    _assert "bad engine exits 1" "1" "$?"
+    echo ""
+    echo "=========================================="
+    echo "Results: $pass passed, $fail failed"
+    echo "=========================================="
+    [[ "$fail" -eq 0 ]]
+}
+# CLI mode: only when executed directly (not sourced by another script)
+if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
+    if [[ "${1:-}" == "--self-test" ]]; then
+        _gpu_utils_self_test
+        exit $?
+    fi
+    if [[ $# -gt 0 ]]; then
+        build_gpu_mem_args "$@"
+        exit $?
+    fi
+    cat <<'HELP'
+gpu_utils.sh — GPU memory fraction estimator
+Usage:
+  ./gpu_utils.sh <engine> --model <name> [options...]
+  ./gpu_utils.sh --self-test
+Engines: vllm, sglang, trtllm
+Examples:
+  ./gpu_utils.sh vllm --model Qwen/Qwen3-0.6B
+  ./gpu_utils.sh vllm --model Qwen/Qwen3-0.6B --max-model-len 4096 --max-num-seqs 2
+  ./gpu_utils.sh vllm --model Qwen/Qwen3-0.6B --workers-per-gpu 2
+  ./gpu_utils.sh sglang --model Qwen/Qwen3-0.6B --context-length 8192
+  ./gpu_utils.sh trtllm --model meta-llama/Meta-Llama-3.1-8B-Instruct --max-seq-len 4096
+Options:
+  --model NAME               Model name (required)
+    aliases: --model-path
+  --max-model-len N          Max sequence length (default: 4096)
+    aliases: --context-length, --max-seq-len
+  --max-num-seqs N           Concurrent sequences (default: 2)
+    aliases: --max-running-requests, --max-batch-size
+  --gpu-memory-utilization F Override fraction (vllm flag)
+    aliases: --mem-fraction-static
+  --workers-per-gpu N        Divide fraction by N (shared-GPU disagg)
+  --self-test                Run built-in test suite
+Output: prints the fraction to stdout (empty if model is unknown).
+HELP
+    exit 0
+fi
--- a/examples/common/launch_utils.sh
+++ b/examples/common/launch_utils.sh
@@ -135,6 +135,12 @@ print_launch_banner() {
    echo "=========================================="
    echo "Model:       $_model"
    echo "Frontend:    http://localhost:$_port"
+    local _seq_len="${MAX_MODEL_LEN:-${CONTEXT_LENGTH:-${MAX_SEQ_LEN:-}}}"
+    local _frac="${GPU_MEM_FRACTION:-}"
+    [[ -n "$_seq_len" ]] && echo "Max seq len: $_seq_len"
+    [[ -n "$_frac" ]] && echo "GPU frac:    $_frac"
    for _line in "$@"; do
        echo "$_line"
    done

--- a/examples/multimodal/launch/audio_agg.sh
+++ b/examples/multimodal/launch/audio_agg.sh
@@ -4,6 +4,9 @@
 set -e
 trap 'echo Cleaning up...; kill 0' EXIT
+SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
+source "$SCRIPT_DIR/../../common/gpu_utils.sh"
 # Default values
 MODEL_NAME="Qwen/Qwen2-Audio-7B-Instruct"
 PROMPT_TEMPLATE=""
@@ -90,8 +93,10 @@ python -m dynamo.frontend --http-port 8000 &
 python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
 # run E/P/D workers
+GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
 CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
-VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill &
+VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
 # Wait for all background processes to complete
 wait
--- a/examples/multimodal/launch/audio_disagg.sh
+++ b/examples/multimodal/launch/audio_disagg.sh
@@ -4,6 +4,9 @@
 set -e
 trap 'echo Cleaning up...; kill 0' EXIT
+SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
+source "$SCRIPT_DIR/../../common/gpu_utils.sh"
 # Default values
 MODEL_NAME="Qwen/Qwen2-Audio-7B-Instruct"
 PROMPT_TEMPLATE=""
@@ -90,9 +93,11 @@ python -m dynamo.frontend --http-port 8000 &
 python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
 # run E/P/D workers
+GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
 CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
-DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg &
+DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
-DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg &
+DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
 # Wait for all background processes to complete
 wait
--- a/examples/multimodal/launch/video_agg.sh
+++ b/examples/multimodal/launch/video_agg.sh
@@ -4,6 +4,9 @@
 set -e
 trap 'echo Cleaning up...; kill 0' EXIT
+SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
+source "$SCRIPT_DIR/../../common/gpu_utils.sh"
 # Default values
 MODEL_NAME="llava-hf/LLaVA-NeXT-Video-7B-hf"
 PROMPT_TEMPLATE="USER: <video>\n<prompt> ASSISTANT:"
@@ -16,8 +19,10 @@ python -m dynamo.frontend --http-port=8000 &
 python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
 # run E/P/D workers
+GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
 CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE &
-VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill &
+VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
 # Wait for all background processes to complete
 wait
--- a/examples/multimodal/launch/video_disagg.sh
+++ b/examples/multimodal/launch/video_disagg.sh
@@ -4,6 +4,9 @@
 set -e
 trap 'echo Cleaning up...; kill 0' EXIT
+SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
+source "$SCRIPT_DIR/../../common/gpu_utils.sh"
 # Default values
 MODEL_NAME="llava-hf/LLaVA-NeXT-Video-7B-hf"
 PROMPT_TEMPLATE="USER: <video>\n<prompt> ASSISTANT:"
@@ -17,9 +20,11 @@ python -m dynamo.frontend --http-port=8000 &
 python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
 # run E/P/D workers
+GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
 CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE &
-DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg &
+DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
-DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg &
+DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
 # Wait for all background processes to complete
 wait
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -233,6 +233,7 @@ markers = [
    "gpu_2: marks tests to run on 2GPUs",
    "gpu_4: marks tests to run on 4GPUs",
    "gpu_8: marks tests to run on 8GPUs",
+    "max_vram_gib(N): peak VRAM in GiB (with 10% safety). Filter with --max-vram-gib=N",
    "e2e: marks tests as end-to-end tests",
    "integration: marks tests as integration tests",
    "unit: marks tests as unit tests",

--- a/tests/README.md
+++ b/tests/README.md
@@ -116,6 +116,7 @@ Markers are required for all tests. They are used for test selection in CI and l
 | Lifecycle [required]    | pre_merge, post_merge, nightly, weekly, release                  | When the test should run           |
 | Test Type [required]    | unit, integration, e2e, benchmark, performance, stress, multimodal | Nature of the test               |
 | Hardware [required]     | gpu_0, gpu_1, gpu_2, gpu_4, gpu_8, h100                         | Number/type of GPUs required       |
+| VRAM Requirement        | max_vram_gib(N)                                                              | Peak VRAM in GiB (with 10% safety). The pytest invocation can use `--max-vram-gib=N` to select only tests that fit on the available GPU. Does not prevent running on smaller GPUs (that will OOM). Use `profile_pytest.py` to measure. |
 | Component/Framework     | vllm, trtllm, sglang, kvbm, kvbm_concurrency, planner, router   | Backend or component specificity   |
 | Infrastructure          | k8s, deploy, fault_tolerance                                     | Infrastructure/environment needs   |
 | Execution               | parallel                                                         | Test can run in parallel with pytest-xdist. Must use dynamic port allocation (`alloc_ports`) and not share resources (e.g. filesystem) |
@@ -126,11 +127,30 @@ Markers are required for all tests. They are used for test selection in CI and l
 @pytest.mark.pre_merge
 @pytest.mark.integration
 @pytest.mark.gpu_1
+@pytest.mark.max_vram_gib(21)  # peak 18.5 GiB GPU RAM used (+10% safety: 20.4 GiB)
 @pytest.mark.vllm
 def test_kv_cache_behavior():
    ...
 ```
+### Filtering by VRAM
+The `max_vram_gib(N)` marker records how much GPU memory a test needs. The pytest invocation can use `--max-vram-gib=N` as a **selector** to run only tests that fit on the available GPU. Tests that exceed the budget are skipped at collection time (before any test starts). Tests without a `max_vram_gib` marker always run (no constraint assumed).
+Nothing prevents you from running without this flag — but if a test needs more VRAM than is physically available, it will OOM at runtime (e.g., vLLM raises `ValueError: No available memory for the cache blocks`).
+```bash
+# Run only tests that fit on a 48 GiB GPU — tests needing >48 GiB are skipped
+python3 -m pytest --max-vram-gib=48 tests/
+# GPU tests that have no max_vram_gib marker yet — need profiling
+# TODO: profile these tests and add max_vram_gib markers
+python3 -m pytest -m "(gpu_1 or gpu_2 or gpu_4 or gpu_8) and not max_vram_gib" tests/
+# No filter — run everything regardless of VRAM (tests that exceed available memory will OOM)
+python3 -m pytest tests/
+```
 ### Lifecycle Marker Note
 Use the marker for the earliest pipeline stage where the test must run (e.g., `@pytest.mark.pre_merge`). This ensures the test is included in that stage and all subsequent ones (e.g., nightly, release), as CI pipelines select tests marked for earlier stages.
@@ -416,6 +436,113 @@ GPU and model-loading overhead means Dynamo E2E tests are inherently slower than
 ---
+## GPU VRAM Profiler (`profile_pytest.py`)
+When writing or reviewing GPU tests, use `tests/utils/profile_pytest.py` to measure how much VRAM a test actually needs. The script runs the test repeatedly with different GPU memory caps and uses binary search to find the minimum VRAM required. It then prints recommended pytest markers you can copy into your test.
+### How it works
+The profiler sets the `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` environment variable (a fraction from 0.0 to 1.0 of total GPU RAM) and runs the test at each probe point. It bisects between "passes" and "OOM/fails" to find the boundary. After the search, it samples `nvidia-smi` to report peak VRAM, phase analysis, and marker recommendations.
+**Requirement:** The test being profiled **must** honor the `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` env var. Standalone tests that allocate CUDA memory directly should read `os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")` and cap their allocation accordingly; see `tests/utils/test_mock_gpu_alloc.py` for an example.
+### Engine-specific mapping
+`_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` is a generic env var (float 0.0-1.0) that launch scripts translate to the engine-specific CLI flag:
+| Engine  | CLI flag                         | Launch script support |
+|---------|----------------------------------|-----------------------|
+| vLLM    | `--gpu-memory-utilization`       | Implemented in `agg.sh`, `disagg.sh`, etc. |
+| SGLang  | `--mem-fraction-static`          | Not yet implemented (TODO) |
+| TRT-LLM | `--free-gpu-memory-fraction`    | Not yet implemented (has its own `DYN_TRTLLM_FREE_GPU_MEMORY_FRACTION`, TODO: unify) |
+Scripts that already hard-code their own memory fraction (e.g. `agg_multimodal.sh` with 0.85) have a TODO to honor `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` in the future. If the profiler detects constant VRAM across all probes (meaning the env var is ignored), it prints a warning and skips marker recommendations.
+### Usage
+```bash
+# Default mode: binary search for minimum VRAM (recommended)
+# -xvs is optional: stop on first failure, verbose, show output
+python tests/utils/profile_pytest.py "tests/serve/test_vllm.py::test_serve_deployment[aggregated]" -xvs
+# Single-pass profiling (no binary search, just measure one run using the default GPU memory setting)
+python tests/utils/profile_pytest.py --no-find-min-vram "tests/serve/test_vllm.py::test_serve_deployment[aggregated]"
+```
+### Example output
+```text
+========================================================================
+FIND MINIMUM VRAM (binary search)
+========================================================================
+  GPU total : 48.0 GiB
+  GPU free  : 48.0 GiB  (in use: 0.0 GiB)
+  Test      : tests/serve/test_vllm.py::test_serve_deployment[aggregated] -x
+  Range   : 5% - 95%  (tolerance 5%)
+  Max iter: 6 (1 validation + 5 bisections)
+  [probe 1/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.95 (45.6 GiB)  [validation run]
+  [PASS] peak 18.5 GiB, wall 41s, iter took 49s
+  ...
+  [probe 5/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.33 (15.9 GiB)
+  [FAIL] OOM or error at 33% (15.9 GiB), iter took 30s
+  [probe 6/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.36 (17.2 GiB)  [~0 left, ETA ~0s]
+  [PASS] peak 18.5 GiB, wall 41s, iter took 49s
+========================================================================
+MINIMUM VRAM RESULT
+========================================================================
+  Lowest passing utilization : 36%
+  Minimum VRAM needed        : ~17.2 GiB (peak observed: 18.5 GiB, +10% safety: 20.4 GiB)
+  # test_serve_deployment[aggregated]: @pytest.mark.max_vram_gib(21)
+  # Fits on: L4 (24 GiB), V100-32GB (32 GiB), A6000/A40 (48 GiB), A100/H100 (80 GiB)
+  # Will OOM on: edge/embedded (4 GiB), RTX 3060/4060 (8 GiB), T4 (16 GiB)
+========================================================================
+========================================================================
+Recommended markers to add to your pytest. You can copy-paste this:
+========================================================================
+# Measured using: tests/utils/profile_pytest.py tests/serve/test_vllm.py::test_serve_deployment[aggregated]
+@pytest.mark.e2e  # wall time 41.2s, loads a real model
+@pytest.mark.gpu_1  # 1 GPU(s) used, peak 18.5 GiB
+@pytest.mark.max_vram_gib(21)  # peak 18.5 GiB GPU RAM used (+10% safety: 20.4 GiB)
+@pytest.mark.timeout(124)  # 3x observed 41.2s
+  WARNING: Wall time 41.2s is too slow for pre_merge (> 20s). Consider post_merge or nightly instead.
+  WARNING: Will OOM on edge/embedded (4 GiB).
+  WARNING: Will OOM on RTX 3060/4060 (8 GiB).
+  WARNING: Will OOM on T4 (16 GiB).
+========================================================================
+```
+### How to use the recommendations
+1. **Copy the `@pytest.mark.*` lines** into your test function or `pytestmark` list.
+2. **VRAM marker** — `max_vram_gib(N)` records the peak GPU memory the test needs (with 10% safety margin). This marker does **not** skip tests on its own — if a test runs on a GPU that is too small, it will OOM and fail hard. Use `--max-vram-gib=N` to select only tests that fit on the available GPU (see [Filtering by VRAM](#filtering-by-vram) for examples). The WARNING lines in the profiler output tell you which GPU tiers would be too small (e.g., "Will OOM on T4 (16 GiB)").
+3. **Lifecycle markers** — the profiler recommends `pre_merge` only for tests under 20 seconds. For slower tests, it warns you to consider `post_merge` or `nightly` but does not choose for you — use your judgment based on how critical the test is for catching regressions early.
+4. **Timeout** — the recommended value is 3x the observed wall time. Adjust upward if your test has high variance (e.g., first-run model download, flaky network).
+5. **Test type** (`unit`, `integration`, `e2e`) — inferred from wall time and whether a real model was loaded. Override if you know better (e.g., a fast test that uses a mock engine is `integration`, not `e2e`).
+### Options
+| Flag | Description |
+|------|-------------|
+| `--no-find-min-vram` | Skip binary search; run a single profiling pass instead |
+| `--interval N` | GPU sampling interval in seconds (default: 1.0) |
+| `--baseline-seconds N` | Seconds to sample before launching pytest (default: 3.0) |
+| `--teardown-seconds N` | Seconds to sample after pytest exits (default: 5.0) |
+| `--csv FILE` | Write raw nvidia-smi samples to a CSV file |
+| `--no-recommend` | Suppress marker recommendations |
+---
 ## References
 - [pytest documentation](https://docs.pytest.org/en/stable/)
 - [Bazel Test Encyclopedia — test sizes and timeouts](https://docs.bazel.build/versions/2.0.0/test-encyclopedia.html)
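
Reviewer note on the "How it works" section of the README hunk above: the search it describes (one validation run at the top of the range, then bisection between "passes" and "OOM/fails") can be sketched as follows. This is only an illustration under stated assumptions, not the code in `profile_pytest.py`: the function name, the `passes` callback, and the exact stopping rule are hypothetical. The defaults are taken from the example output (range 5% to 95%, tolerance 5%, 1 validation + 5 bisections).

```python
from typing import Callable, Optional


def find_min_passing_fraction(
    passes: Callable[[float], bool],
    lo: float = 0.05,
    hi: float = 0.95,
    tolerance: float = 0.05,
    max_bisections: int = 5,
) -> Optional[float]:
    """Return the lowest fraction observed to pass, or None if even `hi` fails.

    `passes(fraction)` is a hypothetical callback that runs the test with
    _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=fraction and reports success.
    """
    # Validation run: if the test fails at the largest cap, bisecting is pointless.
    if not passes(hi):
        return None

    bisections = 0
    # Invariant: `hi` is known to pass; `lo` is known (or assumed) to fail.
    while hi - lo > tolerance and bisections < max_bisections:
        mid = (lo + hi) / 2
        if passes(mid):
            hi = mid  # test still fits, try a tighter cap
        else:
            lo = mid  # OOM or error, loosen the cap
        bisections += 1
    return hi


if __name__ == "__main__":
    total_gib = 48.0
    needed_gib = 17.0  # hypothetical requirement used as a stand-in for a real test
    result = find_min_passing_fraction(lambda f: f * total_gib >= needed_gib)
    print(f"lowest passing fraction: {result:.2f}")
```

With this toy oracle the probes are 0.95, 0.50, 0.275, 0.3875, 0.33125, and 0.359375, so the search ends at roughly 36% after six runs, the same shape as the example output in the README.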

--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -42,6 +42,7 @@ def pytest_configure(config):
        "gpu_2: marks tests to run on 2GPUs",
        "gpu_4: marks tests to run on 4GPUs",
        "gpu_8: marks tests to run on 8GPUs",
+        "max_vram_gib(N): peak VRAM in GiB (with 10% safety). Filter with --max-vram-gib=N",
        "e2e: marks tests as end-to-end tests",
        "integration: marks tests as integration tests",
        "unit: marks tests as unit tests",
@@ -101,6 +102,12 @@ def pytest_addoption(parser: pytest.Parser) -> None:
        help="Skip restarting NATS and etcd services before deployment. "
        "Default: deploy tests skip (for speed), fault-tolerance tests restart (for clean state).",
    )
+    parser.addoption(
+        "--max-vram-gib",
+        type=float,
+        default=None,
+        help="Skip tests whose @pytest.mark.max_vram_gib(N) exceeds this value (GiB).",
+    )
 LOG_FORMAT = "[TEST] %(asctime)s %(levelname)s %(name)s: %(message)s"
@@ -293,6 +300,17 @@ def pytest_collection_modifyitems(config, items):
                if _item_has_marker(item, marker_name):
                    item.add_marker(skip)
+    # Skip tests that exceed --max-vram-gib
+    vram_limit = config.getoption("--max-vram-gib", default=None)
+    if vram_limit is not None:
+        skip_vram = pytest.mark.skip(
+            reason=f"requires more than {vram_limit} GiB VRAM (--max-vram-gib={vram_limit})"
+        )
+        for item in items:
+            vram_mark = item.get_closest_marker("max_vram_gib")
+            if vram_mark and vram_mark.args and vram_mark.args[0] > vram_limit:
+                item.add_marker(skip_vram)
    # Collect models via explicit pytest mark from final filtered items only
    models_to_download = set()
    for item in items:
@@ -836,11 +854,17 @@ def dynamo_dynamic_ports(num_system_ports) -> Generator[ServicePorts, None, None
    - frontend_port: OpenAI-compatible HTTP/gRPC ingress (dynamo.frontend)
    - system_ports: List of worker metrics/system ports (configurable count via num_system_ports)
+    - kv_event_port: ZMQ port for vLLM KV event publishing (avoids collisions under xdist)
    """
    frontend_port = allocate_port(DefaultPort.FRONTEND.value)
    system_port_list = allocate_ports(num_system_ports, DefaultPort.SYSTEM1.value)
-    all_ports = [frontend_port, *system_port_list]
+    kv_event_port = allocate_port(DefaultPort.SYSTEM1.value)
+    all_ports = [frontend_port, *system_port_list, kv_event_port]
    try:
-        yield ServicePorts(frontend_port=frontend_port, system_ports=system_port_list)
+        yield ServicePorts(
+            frontend_port=frontend_port,
+            system_ports=system_port_list,
+            kv_event_port=kv_event_port,
+        )
    finally:
        deallocate_ports(all_ports)
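
Reviewer note on the `pytest_collection_modifyitems` hunk above: the skip decision reduces to a small predicate, sketched here as a standalone function so the edge cases are easy to see. The function name is hypothetical; the behavior follows the hunk and the README text (no `--max-vram-gib` means no filtering, and tests without a `max_vram_gib` marker always run).

```python
from typing import Optional


def exceeds_vram_limit(
    marker_gib: Optional[float], limit_gib: Optional[float]
) -> bool:
    """Return True when a test should be skipped for the given VRAM budget.

    marker_gib: value from @pytest.mark.max_vram_gib(N), or None if unmarked.
    limit_gib:  value passed via --max-vram-gib, or None if the flag is absent.
    """
    if limit_gib is None:
        return False  # no flag: run everything
    if marker_gib is None:
        return False  # unmarked test: no constraint assumed
    # Strict comparison: a test that needs exactly the limit still runs.
    return marker_gib > limit_gib
```

Note the strict `>`: a test marked `max_vram_gib(48)` is still selected by `--max-vram-gib=48`.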
--- a/tests/frontend/test_vllm.py
+++ b/tests/frontend/test_vllm.py
@@ -89,6 +89,8 @@ class VllmWorkerProcess(ManagedProcess):
            "dynamo.vllm",
            "--model",
            TEST_MODEL,
+            "--max-model-len",
+            "32768",  # 32768 uses ~1.5 GiB (original default 131072 used ~6 GiB KV cache)
            "--dyn-tool-call-parser",
            "harmony",
            "--dyn-reasoning-parser",
@@ -97,6 +99,10 @@ class VllmWorkerProcess(ManagedProcess):
            "32768",
        ]
+        gpu_util = os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")
+        if gpu_util:
+            command.extend(["--gpu-memory-utilization", gpu_util])
        env = os.environ.copy()
        env["DYN_LOG"] = "debug"
        env["DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS"] = '["generate"]'
@@ -222,7 +228,9 @@ def _validate_chat_response(response: requests.Response) -> Dict[str, Any]:
    return response_json
-@pytest.mark.timeout(300)  # ~3x measured total (~70s/test), rounded up
+# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning_effort
+@pytest.mark.max_vram_gib(20.4)  # observed peak 18.5 GiB (+10% safety)
+@pytest.mark.timeout(300)  # 3x observed ~70s wall time, rounded up
 @pytest.mark.post_merge
 def test_reasoning_effort(
    request, start_services: ServicePorts, predownload_models
@@ -288,7 +296,9 @@ def test_reasoning_effort(
        )
-@pytest.mark.timeout(180)  # ~3x measured total (~50s/test), rounded up
+# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling
+@pytest.mark.max_vram_gib(20.4)  # observed peak 18.5 GiB (+10% safety)
+@pytest.mark.timeout(113)  # 3x observed 37.4s wall time
 @pytest.mark.post_merge
 def test_tool_calling(
    request, start_services: ServicePorts, predownload_models
@@ -330,7 +340,9 @@ def test_tool_calling(
    ), "Expected get_current_weather tool to be called"
-@pytest.mark.timeout(180)  # ~3x measured total (~50s/test), rounded up
+# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling_second_round
+@pytest.mark.max_vram_gib(20.4)  # observed peak 18.5 GiB (+10% safety)
+@pytest.mark.timeout(115)  # 3x observed 38.1s wall time
 @pytest.mark.nightly
 def test_tool_calling_second_round(
    request, start_services: ServicePorts, predownload_models
@@ -394,7 +406,9 @@ def test_tool_calling_second_round(
    ), "Expected response to include temperature information from tool call result (20°C)"
-@pytest.mark.timeout(180)  # ~3x measured total (~57s/test), rounded up
+# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning
+@pytest.mark.max_vram_gib(20.4)  # observed peak 18.5 GiB (+10% safety)
+@pytest.mark.timeout(131)  # 3x observed 43.4s wall time
 @pytest.mark.nightly
 def test_reasoning(request, start_services: ServicePorts, predownload_models) -> None:
    """Test reasoning functionality with a mathematical problem."""

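
Reviewer note on the marker comments in the hunks above ("observed peak 18.5 GiB (+10% safety)", "3x observed 37.4s wall time"): the arithmetic behind those values can be sketched as below. The rounding rule is an assumption chosen to reproduce the numbers in this file (round up to 0.1 GiB and to whole seconds); the profiler itself may round differently, and the README example uses a whole-number marker, `max_vram_gib(21)`. The function name is hypothetical.

```python
import math
from typing import Tuple


def recommend_markers(
    peak_gib: float,
    wall_s: float,
    safety: float = 0.10,
    timeout_factor: float = 3.0,
) -> Tuple[float, int]:
    """Return (max_vram_gib, timeout_seconds) for the observed measurements."""
    # round(..., 6) guards against float noise such as 112.19999999999999
    # before rounding up.
    vram_gib = math.ceil(round(peak_gib * (1.0 + safety) * 10, 6)) / 10
    timeout_s = math.ceil(round(wall_s * timeout_factor, 6))
    return vram_gib, timeout_s


if __name__ == "__main__":
    # Values taken from the test_tool_calling markers in this diff.
    print(recommend_markers(18.5, 37.4))
```

For `test_tool_calling` this gives 18.5 GiB x 1.1 = 20.35, rounded up to 20.4, and 3 x 37.4 s = 112.2, rounded up to 113.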
--- a/tests/serve/common.py
+++ b/tests/serve/common.py
@@ -6,6 +6,7 @@
 import dataclasses
 import logging
 import os
+import time
 from collections.abc import Mapping
 from copy import deepcopy
 from typing import Any, Dict, Optional
@@ -51,6 +52,16 @@ def run_serve_deployment(
    if extra_env:
        merged_env.update(extra_env)
+    # Stagger engine startup under xdist to avoid vLLM profiling race
+    # (vLLM bug #10643: concurrent profilers miscount each other's memory).
+    worker_id = os.environ.get("PYTEST_XDIST_WORKER", "")
+    if worker_id.startswith("gw"):
+        worker_num = int(worker_id.removeprefix("gw"))
+        if worker_num > 0:
+            stagger_s = worker_num * 15
+            logger.info("Staggering startup by %ds (xdist %s)", stagger_s, worker_id)
+            time.sleep(stagger_s)
    if ports is not None:
        dynamic_frontend_port = int(ports.frontend_port)
        dynamic_system_ports = [int(p) for p in ports.system_ports]
@@ -76,6 +87,10 @@ def run_serve_deployment(
            for idx, port in enumerate(dynamic_system_ports, start=1):
                merged_env[f"DYN_SYSTEM_PORT{idx}"] = str(port)
+        # Unique ZMQ port for vLLM KV event publishing (avoids xdist collisions).
+        if ports.kv_event_port:
+            merged_env["DYN_VLLM_KV_EVENT_PORT"] = str(ports.kv_event_port)
        # Ensure EngineProcess health checks hit the correct frontend port.
        config = dataclasses.replace(config, frontend_port=dynamic_frontend_port)
    else:
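
Reviewer note on the xdist stagger added in this file: the delay is derived from the numeric suffix of `PYTEST_XDIST_WORKER` (`gw0`, `gw1`, ...) multiplied by 15 seconds. A minimal sketch of that mapping as a pure function follows. The function name and the regex-based parsing are illustrative choices, not what the hunk uses.

```python
import re


def xdist_stagger_seconds(worker_id: str, step_s: int = 15) -> int:
    """Map an xdist worker id such as "gw3" to a startup delay in seconds.

    Returns 0 when not running under xdist or when the id has no numeric suffix.
    """
    match = re.fullmatch(r"gw(\d+)", worker_id)
    if match is None:
        return 0
    return int(match.group(1)) * step_s
```

One difference from the hunk: `int(worker_id.removeprefix("gw"))` raises `ValueError` if the text after `gw` is not a number, whereas this sketch falls back to no delay.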

--- a/tests/serve/conftest.py
+++ b/tests/serve/conftest.py
@@ -9,9 +9,10 @@ from pytest_httpserver import HTTPServer
 from dynamo.common.utils.paths import WORKSPACE_DIR
 from tests.serve.lora_utils import MinioLoraConfig, MinioService
+from tests.utils.port_utils import allocate_port, deallocate_port
 # Shared constants for multimodal testing
-IMAGE_SERVER_PORT = 8765
+IMAGE_SERVER_PORT = allocate_port(8765)
 MULTIMODAL_IMG_PATH = os.path.join(
    WORKSPACE_DIR, "lib/llm/tests/data/media/llm-optimize-deploy-graphic.png"
 )
@@ -42,7 +43,8 @@ def get_multimodal_test_image_bytes() -> bytes:
 @pytest.fixture(scope="session")
 def httpserver_listen_address():
-    return ("127.0.0.1", IMAGE_SERVER_PORT)
+    yield ("127.0.0.1", IMAGE_SERVER_PORT)
+    deallocate_port(IMAGE_SERVER_PORT)
 @pytest.fixture(scope="function")
@@ -60,7 +62,7 @@ def image_server(httpserver: HTTPServer):
    Usage:
        def test_multimodal(image_server):
-            url = "http://localhost:8765/llm-graphic.png"
+            # Use MULTIMODAL_IMG_URL from this module
            # ... use url in your test payload
    """
    image_data = get_multimodal_test_image_bytes()

--- a/tests/serve/launch/multi_node_tp_headless.sh
+++ b/tests/serve/launch/multi_node_tp_headless.sh
@@ -12,6 +12,8 @@ trap 'echo "Cleaning up..."; kill 0' EXIT
 MODEL="${MODEL:-Qwen/Qwen3-0.6B}"
+GPU_MEM_FRACTION="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}"
 echo "Starting Dynamo frontend..."
 python3 -m dynamo.frontend &
@@ -22,7 +24,8 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
  --nnodes 2 \
  --node-rank 0 \
  --master-addr 127.0.0.1 \
-  --enforce-eager &
+  --enforce-eager \
+  ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
 echo "Starting dynamo.vllm headless worker (TP=2, nnodes=2, node-rank=1, GPU 1)..."
 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
@@ -32,6 +35,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
  --node-rank 1 \
  --master-addr 127.0.0.1 \
  --enforce-eager \
+  ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} \
  --headless &
 wait
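
Reviewer note on the `${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"}` pattern used in this script: the flag is appended only when the override is set and non-empty, which matches what `tests/frontend/test_vllm.py` does in Python with `os.environ.get`. A standalone sketch of the Python side follows. The helper name and the range check are assumptions added for illustration; the diff itself passes the value through unvalidated.

```python
import os
from typing import List, Mapping, Optional

ENV_VAR = "_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"


def gpu_mem_args(
    env: Optional[Mapping[str, str]] = None,
    flag: str = "--gpu-memory-utilization",
) -> List[str]:
    """Return the CLI arguments for the override, or [] when it is unset or empty."""
    env = os.environ if env is None else env
    raw = env.get(ENV_VAR, "")
    if not raw:
        return []  # same effect as the shell ${VAR:+...} expansion
    fraction = float(raw)  # raises ValueError on non-numeric input
    # Assumed sanity check, not present in the diff.
    if not 0.0 < fraction <= 1.0:
        raise ValueError(f"{ENV_VAR} must be in (0, 1], got {raw!r}")
    return [flag, raw]


if __name__ == "__main__":
    command = ["python3", "-m", "dynamo.vllm", "--model", "Qwen/Qwen3-0.6B"]
    command.extend(gpu_mem_args({ENV_VAR: "0.36"}))
    print(command)
```

Passing a different `flag` (for example `--mem-fraction-static`) would cover the other engines listed in the README table, once their launch paths honor the override.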
--- a/tests/serve/test_vllm.py
+++ b/tests/serve/test_vllm.py
@@ -54,10 +54,10 @@ vllm_dir = os.environ.get("VLLM_DIR") or os.path.join(
 # vLLM test configurations
 # NOTE: pytest.mark.gpu_1 tests take ~5.5 minutes total to run sequentially (with models pre-cached)
-# TODO: Now that these tests use dynamic ports, optimize the runtime by bin-packing and running
+# TODO: Now that these tests use dynamic ports and each config has a max_vram_gib marker,
-# multiple engine deployments in parallel (while keeping GPU contention under control). This may
+# optimize the runtime by bin-packing multiple engine deployments in parallel on the same GPU.
-# require annotating each config with approximate GPU RAM usage so a future collector/launcher can
+# A future collector/launcher can sum max_vram_gib values to decide how many tests fit
-# bin-pack safely.
+# concurrently without exceeding available VRAM.
 vllm_configs = {
    "aggregated": VLLMConfig(
        name="aggregated",
@@ -65,8 +65,9 @@ vllm_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
+            pytest.mark.max_vram_gib(8.6),  # observed peak 7.8 GiB (+10% safety)
+            pytest.mark.timeout(127),  # 3x observed 42.2s wall time
            pytest.mark.pre_merge,
-            pytest.mark.timeout(300),  # 3x measured time (43s) + download time (150s)
        ],
        model="Qwen/Qwen3-0.6B",
        request_payloads=[
@@ -90,7 +91,12 @@ vllm_configs = {
        name="aggregated_logprobs",
        directory=vllm_dir,
        script_name="agg.sh",
-        marks=[pytest.mark.gpu_1, pytest.mark.post_merge],
+        marks=[
+            pytest.mark.gpu_1,
+            pytest.mark.max_vram_gib(8.6),  # observed peak 7.8 GiB (+10% safety)
+            pytest.mark.timeout(73),  # 3x observed 24.3s wall time
+            pytest.mark.post_merge,
+        ],
        model="Qwen/Qwen3-0.6B",
        request_payloads=[
            chat_payload_with_logprobs(
@@ -116,8 +122,9 @@ vllm_configs = {
        marks=[
            pytest.mark.lmcache,
            pytest.mark.gpu_1,
+            pytest.mark.max_vram_gib(8.1),  # observed peak 7.4 GiB (+10% safety)
+            pytest.mark.timeout(147),  # 3x observed 49.0s wall time
            pytest.mark.pre_merge,
-            pytest.mark.timeout(360),  # 3x estimated time (70s) + download time (150s)
            pytest.mark.skipif(
                _is_cuda13(),
                reason="lmcache does not support CUDA 13 as of v0.3.11",
@@ -138,8 +145,9 @@ vllm_configs = {
        marks=[
            pytest.mark.lmcache,
            pytest.mark.gpu_1,
+            pytest.mark.max_vram_gib(8.1),  # observed peak 7.4 GiB (+10% safety)
+            pytest.mark.timeout(148),  # 3x observed 49.3s wall time
            pytest.mark.pre_merge,
-            pytest.mark.timeout(360),  # 3x estimated time (70s) + download time (150s)
            pytest.mark.skipif(
                _is_cuda13(),
                reason="lmcache does not support CUDA 13 as of v0.3.11",
@@ -162,8 +170,9 @@ vllm_configs = {
        script_name="agg_request_planes.sh",
        marks=[
            pytest.mark.gpu_1,
+            pytest.mark.max_vram_gib(8.1),  # observed peak 7.3 GiB (+10% safety)
+            pytest.mark.timeout(129),  # 3x observed 43.0s wall time
            pytest.mark.pre_merge,
-            pytest.mark.timeout(300),  # 3x measured time (43s) + download time (150s)
        ],
        model="Qwen/Qwen3-0.6B",
        script_args=["--tcp"],
@@ -178,8 +187,9 @@ vllm_configs = {
        script_name="agg_request_planes.sh",
        marks=[
            pytest.mark.gpu_1,
+            pytest.mark.max_vram_gib(8.1),  # observed peak 7.3 GiB (+10% safety)
+            pytest.mark.timeout(127),  # 3x observed 42.3s wall time
            pytest.mark.pre_merge,
-            pytest.mark.timeout(300),  # 3x measured time (43s) + download time (150s)
        ],
        model="Qwen/Qwen3-0.6B",
        script_args=["--http"],
@@ -196,7 +206,7 @@ vllm_configs = {
            pytest.mark.gpu_2,
            pytest.mark.pre_merge,
            pytest.mark.skip(reason="DYN-2263"),
-        ],
+        ],  # TODO: profile to get max_vram and timeout
        model="Qwen/Qwen3-0.6B",
        request_payloads=[
            chat_payload_default(
@@ -219,7 +229,7 @@ vllm_configs = {
            pytest.mark.gpu_2,
            pytest.mark.pre_merge,
            pytest.mark.skip(reason="DYN-2264"),
-        ],
+        ],  # TODO: profile to get max_vram and timeout
        model="Qwen/Qwen3-0.6B",
        request_payloads=[
            # Test approximate KV routing (--no-kv-events mode)
@@ -250,7 +260,10 @@ vllm_configs = {
        name="disaggregated",
        directory=vllm_dir,
        script_name="disagg.sh",
-        marks=[pytest.mark.gpu_2, pytest.mark.pre_merge],
+        marks=[
+            pytest.mark.gpu_2,
+            pytest.mark.pre_merge,
+        ],  # TODO: profile to get max_vram and timeout
        model="Qwen/Qwen3-0.6B",
        request_payloads=[
            chat_payload_default(),
@@ -266,6 +279,7 @@ vllm_configs = {
            pytest.mark.vllm,
            pytest.mark.h100,
            pytest.mark.nightly,
+            # TODO: profile to get max_vram and timeout
        ],
        model="deepseek-ai/DeepSeek-V2-Lite",
        script_args=[
@@ -289,7 +303,12 @@ vllm_configs = {
        name="multimodal_disagg_qwen3vl_2b_e_pd",
        directory=vllm_dir,
        script_name="disagg_multimodal_e_pd.sh",
-        marks=[pytest.mark.gpu_1, pytest.mark.pre_merge],
+        marks=[
+            pytest.mark.gpu_1,
+            pytest.mark.max_vram_gib(24.6),  # observed peak 22.3 GiB (+10% safety)
+            pytest.mark.timeout(206),  # 3x observed 68.4s wall time
+            pytest.mark.pre_merge,
+        ],
        model="Qwen/Qwen3-VL-2B-Instruct",
        script_args=["--model", "Qwen/Qwen3-VL-2B-Instruct", "--single-gpu"],
        request_payloads=[
@@ -318,7 +337,12 @@ vllm_configs = {
        directory=vllm_dir,
        script_name="agg_multimodal.sh",
        # post_merge because needs real NIXL not stub
-        marks=[pytest.mark.gpu_1, pytest.mark.post_merge],
+        marks=[
+            pytest.mark.gpu_1,
+            pytest.mark.max_vram_gib(10.2),  # observed peak 9.3 GiB (+10% safety)
+            pytest.mark.timeout(131),  # 3x observed 43.7s wall time
+            pytest.mark.post_merge,
+        ],
        model="Qwen/Qwen2-VL-2B-Instruct",
        # Pass --frontend-decoding to enable Rust frontend image decoding + NIXL RDMA transfer
        script_args=[
@@ -345,13 +369,20 @@ vllm_configs = {
            )
        ],
    ),
-    # NOTE: Pack all workers on 1 GPU for lower CI resource requirements
+    # NOTE: Pack all workers on 1 GPU for lower CI resource requirements.
+    # NOTE: disagg_multimodal_epd.sh uses --kv-cache-memory-bytes=512MB for P/D
+    # workers. Per vLLM CacheConfig, when kv_cache_memory_bytes is set (not None),
+    # the KV cache is sized from it and gpu_memory_utilization is ignored for
+    # KV cache sizing (ref: https://docs.vllm.ai/en/stable/api/vllm/config/cache/),
+    # so _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE has no effect. Regardless of GPU_MEM
+    # fractions (0.1/0.4/0.4), the 3 workers combined consistently use ~17.6 GiB
+    # total on this GPU.
    "multimodal_disagg_qwen3vl_2b_epd": VLLMConfig(
        name="multimodal_disagg_qwen3vl_2b_epd",
        directory=vllm_dir,
        script_name="disagg_multimodal_epd.sh",
        marks=[
            pytest.mark.gpu_1,
+            pytest.mark.max_vram_gib(19.4),  # observed peak 17.6 GiB (+10% safety)
            pytest.mark.post_merge,
            pytest.mark.skip(reason="DYN-2265"),
        ],
@@ -389,7 +420,12 @@ vllm_configs = {
        name="multimodal_agg_qwen",
        directory=vllm_dir,
        script_name="agg_multimodal.sh",
-        marks=[pytest.mark.gpu_1, pytest.mark.post_merge],
+        marks=[
+            pytest.mark.gpu_1,
+            pytest.mark.max_vram_gib(21.6),  # observed peak 19.6 GiB (+10% safety)
+            pytest.mark.timeout(150),  # 3x observed 50.0s wall time
+            pytest.mark.post_merge,
+        ],
        model="Qwen/Qwen2.5-VL-7B-Instruct",
        script_args=["--model", "Qwen/Qwen2.5-VL-7B-Instruct"],
        delayed_start=0,
@@ -418,6 +454,8 @@ vllm_configs = {
        script_name="agg_multimodal.sh",
        marks=[
            pytest.mark.gpu_1,
+            pytest.mark.max_vram_gib(18.9),  # observed peak 17.1 GiB (+10% safety)
+            pytest.mark.timeout(128),  # 3x observed 42.7s wall time
            pytest.mark.nightly,
            # https://github.com/ai-dynamo/dynamo/issues/4501
            pytest.mark.xfail(strict=False),
@@ -456,7 +494,10 @@ vllm_configs = {
        name="multimodal_video_agg",
        directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"),
        script_name="video_agg.sh",
-        marks=[pytest.mark.gpu_2, pytest.mark.nightly],
+        marks=[
+            pytest.mark.gpu_2,
+            pytest.mark.nightly,
+        ],  # TODO: profile to get max_vram and timeout
        model="llava-hf/LLaVA-NeXT-Video-7B-hf",
        delayed_start=60,  # Video models require longer loading time
        script_args=["--model", "llava-hf/LLaVA-NeXT-Video-7B-hf"],
@@ -483,7 +524,10 @@ vllm_configs = {
        name="multimodal_video_disagg",
        directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"),
        script_name="video_disagg.sh",
-        marks=[pytest.mark.gpu_2, pytest.mark.nightly],
+        marks=[
+            pytest.mark.gpu_2,
+            pytest.mark.nightly,
+        ],  # TODO: profile to get max_vram and timeout
        model="llava-hf/LLaVA-NeXT-Video-7B-hf",
        delayed_start=60,  # Video models require longer loading time
        script_args=["--model", "llava-hf/LLaVA-NeXT-Video-7B-hf"],
@@ -512,7 +556,10 @@ vllm_configs = {
        name="multimodal_audio_agg",
        directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"),
        script_name="audio_agg.sh",
-        marks=[pytest.mark.gpu_2, pytest.mark.nightly],
+        marks=[
+            pytest.mark.gpu_2,
+            pytest.mark.nightly,
+        ],  # TODO: profile to get max_vram and timeout
        model="Qwen/Qwen2-Audio-7B-Instruct",
        delayed_start=60,  # Audio models require longer loading time
        script_args=["--model", "Qwen/Qwen2-Audio-7B-Instruct"],
@@ -539,7 +586,10 @@ vllm_configs = {
        name="multimodal_audio_disagg",
        directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"),
        script_name="audio_disagg.sh",
-        marks=[pytest.mark.gpu_2, pytest.mark.nightly],
+        marks=[
+            pytest.mark.gpu_2,
+            pytest.mark.nightly,
+        ],  # TODO: profile to get max_vram and timeout
        model="Qwen/Qwen2-Audio-7B-Instruct",
        delayed_start=60,  # Audio models require longer loading time
        script_args=["--model", "Qwen/Qwen2-Audio-7B-Instruct"],
@@ -566,7 +616,11 @@ vllm_configs = {
        name="aggregated_toolcalling",
        directory=vllm_dir,
        script_name="agg_multimodal.sh",
-        marks=[pytest.mark.gpu_2, pytest.mark.multimodal, pytest.mark.nightly],
+        marks=[
+            pytest.mark.gpu_2,
+            pytest.mark.multimodal,
+            pytest.mark.nightly,
+        ],  # TODO: profile to get max_vram and timeout
        model="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8",
        script_args=[
            "--model",
@@ -646,10 +700,9 @@ vllm_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
+            pytest.mark.max_vram_gib(21.9),  # observed peak 19.9 GiB (+10% safety)
+            pytest.mark.timeout(233),  # 3x observed 77.7s wall time
            pytest.mark.post_merge,
-            pytest.mark.timeout(
-                420
-            ),  # 3x estimated time (60s) + download time (240s) for 7B model
        ],
        model="deepseek-ai/deepseek-llm-7b-base",
        script_args=[
@@ -669,6 +722,7 @@ vllm_configs = {
        marks=[
            pytest.mark.gpu_2,
            pytest.mark.pre_merge,
+            # TODO: profile to get max_vram
            pytest.mark.timeout(300),
        ],
        model="Qwen/Qwen3-0.6B",
@@ -681,7 +735,12 @@ vllm_configs = {
        name="guided_decoding",
        directory=vllm_dir,
        script_name="agg.sh",
-        marks=[pytest.mark.gpu_1, pytest.mark.pre_merge],
+        marks=[
+            pytest.mark.gpu_1,
+            pytest.mark.max_vram_gib(8.6),  # observed peak 7.8 GiB (+10% safety)
+            pytest.mark.timeout(67),  # 3x observed 22.3s wall time
+            pytest.mark.pre_merge,
+        ],
        model="Qwen/Qwen3-0.6B",
        request_payloads=[
            chat_payload(

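The `max_vram_gib` and `timeout` values in the hunks above follow the padding rules defined in `tests/utils/profile_pytest.py`: peak VRAM is multiplied by `_VRAM_SAFETY_FACTOR = 1.1`, and wall time is multiplied by 3 with a 10 s floor. A minimal sketch of that arithmetic follows. The helper names `recommended_max_vram_gib` and `recommended_timeout_s` are hypothetical and exist only for illustration, and the input is assumed to be the peak NVML `memory.used` value in MiB.

```python
import math

# Constants mirror the ones used by the profiler in this PR.
VRAM_SAFETY_FACTOR = 1.1   # same value as _VRAM_SAFETY_FACTOR
TIMEOUT_MULTIPLIER = 3.0   # "3x observed wall time"
MIN_TIMEOUT_S = 10         # floor applied to very fast tests


def recommended_max_vram_gib(peak_mib: int) -> float:
    """Pad the observed peak (MiB) by 10% and convert to GiB, one decimal."""
    padded_mib = int(peak_mib * VRAM_SAFETY_FACTOR)
    return round(padded_mib / 1024, 1)


def recommended_timeout_s(wall_secs: float) -> int:
    """Three times the observed wall time, rounded up, never below the floor."""
    return max(int(math.ceil(wall_secs * TIMEOUT_MULTIPLIER)), MIN_TIMEOUT_S)
```

For example, `recommended_timeout_s(68.4)` gives 206 and `recommended_timeout_s(50.0)` gives 150, matching the markers above. The comments in the diff show wall times and peaks rounded to one decimal, so recomputing from a displayed value can land one unit off from the committed marker.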
--- a/tests/utils/engine_process.py
+++ b/tests/utils/engine_process.py
@@ -187,6 +187,9 @@ class EngineProcess(ManagedProcess):
                ),
            ],
            delayed_start=config.delayed_start,
+            # Must stay False: command[0] is "bash", so True would kill every
+            # bash process system-wide.  Stale cleanup relies on stragglers list
+            # and process-group termination in __exit__ instead.
            terminate_all_matching_process_names=False,
            stragglers=config.stragglers,
            log_dir=request.node.name,

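The comment above relies on process-group termination rather than name matching, because matching on `bash` would hit unrelated shells. The sketch below illustrates that general technique on POSIX systems. It is not the `ManagedProcess` implementation from this repository; `run_in_own_group` and `terminate_group` are hypothetical names, and the 5 s grace period is an assumption.

```python
import os
import signal
import subprocess
import sys


def run_in_own_group(cmd: list[str]) -> subprocess.Popen:
    """Start cmd as the leader of a new session, and thus a new process group."""
    return subprocess.Popen(cmd, start_new_session=True)


def terminate_group(proc: subprocess.Popen, grace: float = 5.0) -> int:
    """Send SIGTERM to the whole group; escalate to SIGKILL after `grace` seconds."""
    pgid = proc.pid  # a session leader's process-group id equals its pid
    try:
        os.killpg(pgid, signal.SIGTERM)
    except ProcessLookupError:
        pass  # group already gone
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        os.killpg(pgid, signal.SIGKILL)
        return proc.wait()
```

Only the launched process and its descendants in the same group receive the signal, so other `bash` processes on the machine are left alone.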
--- a/tests/utils/port_utils.py
+++ b/tests/utils/port_utils.py
@@ -38,6 +38,7 @@ class ServicePorts:
    frontend_port: int
    system_ports: list[int]
+    kv_event_port: int = 0
 def _load_port_registry() -> dict:

--- a/tests/utils/profile_pytest.py
+++ b/tests/utils/profile_pytest.py
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+"""Profile GPU VRAM usage during a pytest run.
+How it works
+~~~~~~~~~~~~
+A background thread queries NVML (via ``pynvml``) every 100 ms (configurable
+with ``--interval``) to record GPU memory usage while the test runs as a
+subprocess.  This captures *all* GPU memory (model weights, KV cache, CUDA
+contexts, NCCL buffers — not just PyTorch allocations) without requiring any
+in-process instrumentation.  Using NVML directly (the same C library that
+``nvidia-smi`` wraps) avoids the overhead of forking a subprocess each sample
+and allows high-frequency sampling.
+In **binary-search mode** (the default), the profiler sets the env var
+``_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE`` to a value between 0.05 and 0.95 and
+re-runs the test at each midpoint.  If the test passes, the fraction is lowered;
+if it OOMs, the fraction is raised — standard bisection to find the minimum
+VRAM the test needs.  The peak ``memory.used`` from the last passing run
+(plus a 10 % safety margin) becomes the ``@pytest.mark.max_vram_gib`` recommendation.
+**IMPORTANT**: The test under profile **MUST** honor ``_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE``
+— either directly (see ``test_mock_gpu_alloc.py``) or via launch scripts that
+pass it as ``--gpu-memory-utilization`` to vLLM (e.g. ``agg.sh``).  If the test
+ignores this variable, every probe will pass at the same peak and the profiler
+will warn that the binary search is unreliable.
+Usage::
+    python tests/utils/profile_pytest.py [options] pytest-args...
+Examples (``-xvs`` is optional: stop on first failure, verbose, no capture)::
+    python tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling
+    python tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning_effort -xvs
+Single-pass profiling (no binary search; measure one run at the test's default VRAM settings)::
+    python tests/utils/profile_pytest.py --no-find-min-vram tests/frontend/test_vllm.py::test_tool_calling
+The report is written to stdout after the test finishes.
+The raw CSV samples are saved to ``--csv`` if specified.
+Use ``--no-recommend`` to suppress the marker recommendation section.
+"""
+import argparse
+import atexit
+import json
+import logging
+import math
+import os
+import shutil
+import subprocess
+import sys
+import tempfile
+import threading
+import time
+from dataclasses import dataclass, field
+import pynvml
+logger = logging.getLogger(__name__)
+# Safety margin for VRAM tier recommendations.  Peak VRAM is multiplied by
+# this factor before comparing against tier thresholds, so the recommended
+# tier has headroom for variance across runs.
+_VRAM_SAFETY_FACTOR = 1.1
+# Phase detection: a memory jump exceeding this threshold (MiB) between
+# consecutive samples marks a phase boundary.
+_PHASE_JUMP_MIB = 200
+# Plateau detection: memory counts as stable when it stays within
+# _PLATEAU_TOLERANCE_MIB for at least _PLATEAU_MIN_SAMPLES consecutive samples.
+_PLATEAU_TOLERANCE_MIB = 50
+_PLATEAU_MIN_SAMPLES = 3
+def _extract_model_from_markers(pytest_args: list[str]) -> str | None:
+    """Extract the model name from @pytest.mark.model(...) via pytest-json-report.
+    Runs ``pytest --collect-only`` with the json-report plugin to inspect markers
+    without executing the test.  Returns None if the plugin is missing or the
+    test has no ``model`` marker.
+    """
+    fd, json_path = tempfile.mkstemp(prefix="_profile_collect_", suffix=".json")
+    os.close(fd)
+    try:
+        result = subprocess.run(
+            [
+                sys.executable,
+                "-m",
+                "pytest",
+                "--collect-only",
+                "-q",
+                "--rootdir=.",
+                "--override-ini=testpaths=tests",
+                "--json-report",  # pytest-json-report only writes a file when enabled
+                f"--json-report-file={json_path}",
+            ]
+            + list(pytest_args),
+            capture_output=True,
+            text=True,
+            timeout=30,
+        )
+        if result.returncode not in (0, 5):
+            return None
+        with open(json_path) as f:
+            data = json.load(f)
+        for collector in data.get("collectors", []):
+            for marker in collector.get("markers", []):
+                if marker.get("name") == "model" and marker.get("args"):
+                    return marker["args"][0]
+        for test in data.get("tests", []):
+            for marker in test.get("markers", []):
+                if marker.get("name") == "model" and marker.get("args"):
+                    return marker["args"][0]
+    except (subprocess.SubprocessError, OSError, json.JSONDecodeError, KeyError) as exc:
+        logger.warning("model marker extraction failed: %s", exc)
+        return None
+    finally:
+        try:
+            os.remove(json_path)
+        except OSError:
+            pass
+    return None
+@dataclass
+class GpuSample:
+    timestamp: float  # time.monotonic() offset from start
+    gpu_idx: int
+    mem_used_mib: int
+    mem_total_mib: int
+    gpu_util_pct: int
+@dataclass
+class PhaseInfo:
+    name: str
+    start_sec: float
+    end_sec: float
+    mem_start_mib: int
+    mem_peak_mib: int
+    mem_end_mib: int
+    description: str = ""
+@dataclass
+class GpuReport:
+    gpu_idx: int
+    mem_total_mib: int
+    baseline_mib: int
+    peak_mib: int
+    peak_timestamp: float
+    final_mib: int
+    leaked_mib: int  # final - baseline
+    phases: list[PhaseInfo] = field(default_factory=list)
+_nvml_initialized = False
+_nvml_handles: list = []
+def _nvml_init() -> None:
+    """Lazily initialize NVML and cache device handles."""
+    global _nvml_initialized, _nvml_handles
+    if _nvml_initialized:
+        return
+    pynvml.nvmlInit()
+    _nvml_initialized = True
+    count = pynvml.nvmlDeviceGetCount()
+    _nvml_handles = [pynvml.nvmlDeviceGetHandleByIndex(i) for i in range(count)]
+    atexit.register(_nvml_shutdown)
+def _nvml_shutdown() -> None:
+    global _nvml_initialized, _nvml_handles
+    if _nvml_initialized:
+        _nvml_handles = []
+        pynvml.nvmlShutdown()
+        _nvml_initialized = False
+def _query_gpu_stats() -> list[tuple[int, int, int, int]]:
+    """Return [(gpu_idx, mem_used_mib, mem_total_mib, util_pct), ...] via NVML."""
+    _nvml_init()
+    results = []
+    for idx, handle in enumerate(_nvml_handles):
+        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
+        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
+        used_mib = int(mem.used) // (1024 * 1024)
+        total_mib = int(mem.total) // (1024 * 1024)
+        results.append((idx, used_mib, total_mib, int(util.gpu)))
+    return results
+class _Sampler:
+    """Background thread that queries NVML at a fixed interval."""
+    def __init__(self, interval: float = 0.1):
+        self.interval = interval
+        self.samples: list[GpuSample] = []
+        self._stop = threading.Event()
+        self._t0 = time.monotonic()
+        self._thread = threading.Thread(target=self._run, daemon=True)
+    def start(self):
+        self._t0 = time.monotonic()
+        self._thread.start()
+    def stop(self):
+        self._stop.set()
+        self._thread.join(timeout=self.interval * 3)
+    def _run(self):
+        while not self._stop.is_set():
+            ts = time.monotonic() - self._t0
+            try:
+                for gpu_idx, mem_used, mem_total, util_pct in _query_gpu_stats():
+                    self.samples.append(
+                        GpuSample(ts, gpu_idx, mem_used, mem_total, util_pct)
+                    )
+            except pynvml.NVMLError:
+                pass  # transient NVML error; skip this sample
+            self._stop.wait(self.interval)
+def _detect_phases(
+    samples: list[GpuSample], baseline_end: float, test_end: float
+) -> list[PhaseInfo]:
+    """Heuristic phase detection from a single GPU's memory timeline.
+    Looks for large jumps (model load, KV cache alloc) and identifies
+    the inference peak and teardown regions.
+    """
+    if not samples:
+        return []
+    phases: list[PhaseInfo] = []
+    baseline_samples = [s for s in samples if s.timestamp <= baseline_end]
+    test_samples = [s for s in samples if baseline_end < s.timestamp <= test_end]
+    teardown_samples = [s for s in samples if s.timestamp > test_end]
+    if baseline_samples:
+        bl = baseline_samples[-1].mem_used_mib
+        phases.append(
+            PhaseInfo(
+                name="Baseline",
+                start_sec=samples[0].timestamp,
+                end_sec=baseline_end,
+                mem_start_mib=baseline_samples[0].mem_used_mib,
+                mem_peak_mib=max(s.mem_used_mib for s in baseline_samples),
+                mem_end_mib=bl,
+                description="Idle GPU before test starts",
+            )
+        )
+    if not test_samples:
+        return phases
+    # Walk test samples and detect jumps
+    prev_mem = baseline_samples[-1].mem_used_mib if baseline_samples else 0
+    phase_start = test_samples[0].timestamp
+    phase_start_mem = prev_mem
+    phase_peak = prev_mem
+    jump_count = 0
+    phase_names = ["Model load", "KV cache alloc", "Inference"]
+    for s in test_samples:
+        delta = s.mem_used_mib - prev_mem
+        phase_peak = max(phase_peak, s.mem_used_mib)
+        if delta > _PHASE_JUMP_MIB and jump_count < len(phase_names) - 1:
+            # Close current phase, start new one
+            if phase_start < s.timestamp:
+                name = phase_names[min(jump_count, len(phase_names) - 1)]
+                phases.append(
+                    PhaseInfo(
+                        name=name,
+                        start_sec=phase_start,
+                        end_sec=s.timestamp,
+                        mem_start_mib=phase_start_mem,
+                        mem_peak_mib=phase_peak,
+                        mem_end_mib=prev_mem,
+                    )
+                )
+            jump_count += 1
+            phase_start = s.timestamp
+            phase_start_mem = s.mem_used_mib
+            phase_peak = s.mem_used_mib
+        prev_mem = s.mem_used_mib
+    # Close final test phase
+    name = phase_names[min(jump_count, len(phase_names) - 1)]
+    phases.append(
+        PhaseInfo(
+            name=name,
+            start_sec=phase_start,
+            end_sec=test_end,
+            mem_start_mib=phase_start_mem,
+            mem_peak_mib=phase_peak,
+            mem_end_mib=test_samples[-1].mem_used_mib,
+        )
+    )
+    if teardown_samples:
+        phases.append(
+            PhaseInfo(
+                name="Teardown",
+                start_sec=test_end,
+                end_sec=teardown_samples[-1].timestamp,
+                mem_start_mib=teardown_samples[0].mem_used_mib,
+                mem_peak_mib=max(s.mem_used_mib for s in teardown_samples),
+                mem_end_mib=teardown_samples[-1].mem_used_mib,
+                description="After pytest exits; should return to baseline",
+            )
+        )
+    return phases
+def _build_reports(
+    samples: list[GpuSample], baseline_end: float, test_end: float
+) -> list[GpuReport]:
+    """Build per-GPU reports from collected samples."""
+    gpu_indices = sorted({s.gpu_idx for s in samples})
+    reports = []
+    for idx in gpu_indices:
+        gpu_samples = [s for s in samples if s.gpu_idx == idx]
+        if not gpu_samples:
+            continue
+        baseline_samples = [s for s in gpu_samples if s.timestamp <= baseline_end]
+        baseline_mib = baseline_samples[-1].mem_used_mib if baseline_samples else 0
+        peak_sample = max(gpu_samples, key=lambda s: s.mem_used_mib)
+        final_mib = gpu_samples[-1].mem_used_mib
+        reports.append(
+            GpuReport(
+                gpu_idx=idx,
+                mem_total_mib=gpu_samples[0].mem_total_mib,
+                baseline_mib=baseline_mib,
+                peak_mib=peak_sample.mem_used_mib,
+                peak_timestamp=peak_sample.timestamp,
+                final_mib=final_mib,
+                leaked_mib=final_mib - baseline_mib,
+                phases=_detect_phases(gpu_samples, baseline_end, test_end),
+            )
+        )
+    return reports
+def _format_mib(mib: int) -> str:
+    if mib >= 1024:
+        return f"{mib / 1024:.1f} GiB"
+    return f"{mib} MiB"
+def _print_report(
+    reports: list[GpuReport],
+    pytest_rc: int,
+    wall_secs: float,
+    model_name: str | None = None,
+):
+    """Print a human-readable profiling report."""
+    print("\n--- GPU MEMORY PROFILE ---")
+    print(f"  pytest exit code : {pytest_rc}")
+    print(f"  wall time        : {wall_secs:.1f}s")
+    print(f"  GPUs sampled     : {len(reports)}")
+    if model_name:
+        print(f"  model            : {model_name}")
+    for r in reports:
+        print(f"\n{'─' * 72}")
+        print(f"  GPU {r.gpu_idx}  ({_format_mib(r.mem_total_mib)} total)")
+        print(f"{'─' * 72}")
+        print(f"  Baseline         : {_format_mib(r.baseline_mib)}")
+        print(
+            f"  Peak             : {_format_mib(r.peak_mib)}  "
+            f"({r.peak_mib * 100 // r.mem_total_mib}% of total)  "
+            f"@ t={r.peak_timestamp:.1f}s"
+        )
+        print(f"  Final            : {_format_mib(r.final_mib)}")
+        delta = r.leaked_mib
+        tag = "OK" if abs(delta) < _PLATEAU_TOLERANCE_MIB else "LEAKED"
+        sign = "+" if delta > 0 else ""
+        print(f"  Delta (final-bl) : {sign}{_format_mib(delta)}  [{tag}]")
+        if r.phases:
+            print()
+            print(
+                f"  {'Phase':<16} {'Time':>12}  {'Start':>10} {'Peak':>10} {'End':>10}"
+            )
+            print(f"  {'─' * 16} {'─' * 12}  {'─' * 10} {'─' * 10} {'─' * 10}")
+            for p in r.phases:
+                dur = p.end_sec - p.start_sec
+                time_range = (
+                    f"{p.start_sec:.0f}s-{p.end_sec:.0f}s"
+                    if dur > 0
+                    else f"{p.start_sec:.0f}s"
+                )
+                print(
+                    f"  {p.name:<16} {time_range:>12}  "
+                    f"{_format_mib(p.mem_start_mib):>10} "
+                    f"{_format_mib(p.mem_peak_mib):>10} "
+                    f"{_format_mib(p.mem_end_mib):>10}"
+                )
+    print()
+def _write_csv(samples: list[GpuSample], path: str):
+    with open(path, "w") as f:
+        f.write("timestamp_s,gpu,mem_used_mib,mem_total_mib,gpu_util_pct\n")
+        for s in samples:
+            f.write(
+                f"{s.timestamp:.2f},{s.gpu_idx},{s.mem_used_mib},"
+                f"{s.mem_total_mib},{s.gpu_util_pct}\n"
+            )
+_GPU_REFERENCE_CARDS: list[tuple[int, str]] = [
+    (4, "edge/embedded"),
+    (8, "RTX 3060/4060"),
+    (16, "T4"),
+    (24, "L4"),
+    (32, "V100-32GB"),
+    (48, "A6000/A40"),
+    (80, "A100/H100"),
+]
+@dataclass
+class MarkerRecommendation:
+    marker: str
+    reason: str
+def _recommend_markers(
+    reports: list[GpuReport],
+    wall_secs: float,
+    model_name: str | None = None,
+    num_runs: int = 1,
+) -> tuple[list[MarkerRecommendation], list[str]]:
+    """Generate marker recommendations from profiling data.
+    Returns (recommendations, warnings).
+    """
+    recs: list[MarkerRecommendation] = []
+    warnings: list[str] = []
+    if model_name:
+        recs.append(
+            MarkerRecommendation(
+                f'model("{model_name}")',
+                "detected from the test's model marker",
+            )
+        )
+    max_peak_mib = max((r.peak_mib for r in reports), default=0)
+    max_baseline_mib = max((r.baseline_mib for r in reports), default=0)
+    used_vram = max_peak_mib - max_baseline_mib
+    gpus_with_vram = sum(
+        1 for r in reports if (r.peak_mib - r.baseline_mib) > _PLATEAU_TOLERANCE_MIB
+    )
+    has_model_load = any(
+        p.name == "Model load"
+        for r in reports
+        for p in r.phases
+        if p.mem_peak_mib - p.mem_start_mib > _PHASE_JUMP_MIB
+    )
+    any_leaked = any(abs(r.leaked_mib) >= _PLATEAU_TOLERANCE_MIB for r in reports)
+    # -- Test Type --
+    if wall_secs < 1.0 and used_vram < _PLATEAU_TOLERANCE_MIB:
+        recs.append(
+            MarkerRecommendation("unit", f"wall time {wall_secs:.1f}s, no GPU usage")
+        )
+    elif wall_secs < 30.0 and not has_model_load:
+        recs.append(
+            MarkerRecommendation(
+                "integration", f"wall time {wall_secs:.1f}s, no model load detected"
+            )
+        )
+    else:
+        reason = f"wall time avg {wall_secs:.1f}s based on {num_runs} run{'s' if num_runs != 1 else ''}"
+        if has_model_load:
+            reason += ", loads a real model"
+        recs.append(MarkerRecommendation("e2e", reason))
+    # -- Lifecycle --
+    if wall_secs < 20.0:
+        recs.append(
+            MarkerRecommendation(
+                "pre_merge", f"wall time {wall_secs:.1f}s (< 20s, fast enough per PR)"
+            )
+        )
+    elif wall_secs < 300.0:
+        warnings.append(
+            f"Wall time {wall_secs:.1f}s is too slow for pre_merge (> 20s). "
+            f"Consider post_merge or nightly instead."
+        )
+    else:
+        warnings.append(
+            f"Wall time {wall_secs:.1f}s is very slow (> 300s). "
+            f"Consider nightly instead."
+        )
+    # -- Hardware: GPU count --
+    if gpus_with_vram == 0:
+        recs.append(MarkerRecommendation("gpu_0", "no GPU VRAM used"))
+    else:
+        marker = f"gpu_{gpus_with_vram}"
+        recs.append(
+            MarkerRecommendation(
+                marker,
+                f"{gpus_with_vram} GPU(s) used, peak {_format_mib(max_peak_mib)}",
+            )
+        )
+    # -- Hardware: VRAM requirement --
+    if used_vram > _PLATEAU_TOLERANCE_MIB:
+        padded_peak_mib = int(max_peak_mib * _VRAM_SAFETY_FACTOR)
+        padded_peak_gib = round(padded_peak_mib / 1024, 1)
+        recs.append(
+            MarkerRecommendation(
+                f"max_vram_gib({padded_peak_gib})",
+                f"peak {_format_mib(max_peak_mib)} GPU RAM used "
+                f"(+10% safety: {_format_mib(padded_peak_mib)})",
+            )
+        )
+        # Warn about GPU cards that would OOM
+        for card_gib, card_name in _GPU_REFERENCE_CARDS:
+            if padded_peak_gib > card_gib:
+                warnings.append(f"Will OOM on {card_name} ({card_gib} GiB).")
+    # -- Timeout --
+    timeout_val = int(math.ceil(wall_secs * 3.0))
+    timeout_val = max(timeout_val, 10)
+    recs.append(
+        MarkerRecommendation(
+            f"timeout({timeout_val})",
+            f"wall time {wall_secs:.1f}s, based on {num_runs} run{'s' if num_runs != 1 else ''}",
+        )
+    )
+    # -- Memory leak warning --
+    if any_leaked:
+        leaked_reports = [
+            r for r in reports if abs(r.leaked_mib) >= _PLATEAU_TOLERANCE_MIB
+        ]
+        for r in leaked_reports:
+            warnings.append(
+                f"GPU {r.gpu_idx}: VRAM not fully released "
+                f"(baseline {_format_mib(r.baseline_mib)} -> "
+                f"final {_format_mib(r.final_mib)}, "
+                f"delta {_format_mib(r.leaked_mib)}). "
+                f"Possible leak or teardown issue."
+            )
+    return recs, warnings
+def _print_recommendations(
+    recs: list[MarkerRecommendation],
+    warnings: list[str],
+    pytest_args: list[str] | None = None,
+):
+    print("--- Recommended markers (copy-paste into your test) ---")
+    if pytest_args:
+        print(
+            f"# Measured using: tests/utils/profile_pytest.py {' '.join(pytest_args)}"
+        )
+    else:
+        print("# Measured using: tests/utils/profile_pytest.py")
+    for r in recs:
+        print(f"@pytest.mark.{r.marker}  # {r.reason}")
+    # Show example so user knows where to place the markers
+    test_name = None
+    if pytest_args:
+        test_name = next(
+            (a.rsplit("::", 1)[-1] for a in pytest_args if "::" in a), None
+        )
+    print(f"def {test_name or 'test_something'}(...):")
+    print("    ...")
+    if warnings:
+        print()
+        for w in warnings:
+            print(f"  WARNING: {w}")
+    print()
+_DEFAULT_PROBE_TIMEOUT = 300  # 5 minutes max per profile run
+def _run_once(
+    pytest_args: list[str],
+    interval: float = 0.1,
+    baseline_seconds: float = 3.0,
+    teardown_seconds: float = 5.0,
+    extra_env: dict[str, str] | None = None,
+    quiet: bool = False,
+    run_label: str | None = None,
+    timeout: float = _DEFAULT_PROBE_TIMEOUT,
+) -> tuple[int, float, list[GpuReport], list[GpuSample]]:
+    """Run pytest once with GPU sampling.
+    When *run_label* is set, each line of pytest stdout/stderr is prefixed
+    with ``[run_label]`` so multi-run output is easy to follow.
+    Returns (exit_code, wall_secs, reports, raw_samples).
+    """
+    sampler = _Sampler(interval=interval)
+    sampler.start()
+    if not quiet:
+        print(f"Sampling baseline for {baseline_seconds}s ...")
+    time.sleep(baseline_seconds)
+    baseline_end = time.monotonic() - sampler._t0
+    pytest_cmd = [sys.executable, "-m", "pytest"] + list(pytest_args)
+    if not quiet:
+        print(f"Running: {' '.join(pytest_cmd)}")
+    sys.stdout.flush()
+    env = os.environ.copy()
+    env.setdefault("HF_HUB_OFFLINE", "1")
+    if extra_env:
+        env.update(extra_env)
+    capture = run_label is not None
+    t_start = time.monotonic()
+    timed_out = False
+    try:
+        result = subprocess.run(
+            pytest_cmd,
+            env=env,
+            capture_output=capture,
+            text=capture or None,
+            timeout=timeout,
+        )
+        rc = result.returncode
+    except subprocess.TimeoutExpired:
+        timed_out = True
+        rc = 1
+        if not quiet or run_label:
+            print(
+                f"  [TIMEOUT] pytest exceeded {timeout:.0f}s limit "
+                f"(teardown likely hung)"
+            )
+    if not timed_out and capture:
+        prefix = f"[{run_label}] "
+        for line in result.stdout.splitlines():
+            print(f"{prefix}{line}")
+        for line in result.stderr.splitlines():
+            print(f"{prefix}{line}", file=sys.stderr)
+    sys.stdout.flush()
+    wall_secs = time.monotonic() - t_start
+    test_end = time.monotonic() - sampler._t0
+    if not quiet:
+        print(f"Sampling teardown for {teardown_seconds}s ...")
+    time.sleep(teardown_seconds)
+    sampler.stop()
+    reports = _build_reports(sampler.samples, baseline_end, test_end)
+    return rc, wall_secs, reports, sampler.samples
+def _find_min_vram(
+    pytest_args: list[str],
+    interval: float = 0.1,
+    baseline_seconds: float = 2.0,
+    teardown_seconds: float = 2.0,
+    recommend: bool = True,
+    csv_path: str | None = None,
+) -> int:
+    """Binary search _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE to find the minimum VRAM a test needs.
+    Sets _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE env var (honored by agg.sh and similar scripts),
+    runs the test at each profile point, and bisects until the boundary is found.
+    """
+    gpu_info = _query_gpu_stats()
+    if not gpu_info:
+        raise RuntimeError("NVML returned no GPU data")
+    used_mib = gpu_info[0][1]
+    total_mib = gpu_info[0][2]
+    free_mib = total_mib - used_mib
+    total_gib = total_mib / 1024
+    model_name = _extract_model_from_markers(pytest_args)
+    print("\n--- FIND MINIMUM VRAM (binary search) ---")
+    print(f"  GPU total : {total_gib:.1f} GiB")
+    print(
+        f"  GPU free  : {free_mib / 1024:.1f} GiB  "
+        f"(in use: {used_mib / 1024:.1f} GiB)"
+    )
+    print(f"  Test      : {' '.join(pytest_args)}")
+    if model_name:
+        print(f"  Model     : {model_name}")
+    # Warn if something is already consuming significant GPU memory
+    hogged_pct = used_mib / total_mib * 100
+    if hogged_pct > 10:
+        print(f"\n  {'!' * 72}")
+        print(
+            f"  WARNING: {used_mib / 1024:.1f} GiB ({hogged_pct:.0f}%) of GPU memory "
+            f"is already in use!"
+        )
+        print("  Another process is hogging the GPU. Results will be inaccurate")
+        print(
+            "  because _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE is a fraction of TOTAL memory,"
+        )
+        print("  not FREE memory. Kill other GPU processes first.")
+        print(f"  {'!' * 72}")
+    print()
+    lo = 0.05
+    hi = 0.95
+    tolerance = 0.05
+    max_iterations = math.ceil(math.log2((hi - lo) / tolerance))
+    last_pass_util: float | None = None
+    last_pass_peak_mib: int = 0
+    elapsed_times: list[float] = []
+    all_peak_mibs: list[int] = []
+    pass_wall_times: list[float] = []
+    print(f"  Range   : {lo:.0%} - {hi:.0%}  (tolerance {tolerance:.0%})")
+    print(
+        f"  Max iter: {max_iterations + 1} (1 validation + {max_iterations} bisections)"
+    )
+    print()
+    # First, verify the test passes at hi (0.95)
+    print(
+        f"  [profile 1/{max_iterations + 1}] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE={hi:.2f} "
+        f"(allowed max GPU {hi * total_gib:.1f} GiB)  [validation run]"
+    )
+    sys.stdout.flush()
+    t_iter_start = time.monotonic()
+    label = f"profile 1/{max_iterations + 1}"
+    rc, wall, reports, raw_samples = _run_once(
+        pytest_args,
+        interval=interval,
+        baseline_seconds=baseline_seconds,
+        teardown_seconds=teardown_seconds,
+        extra_env={"_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE": f"{hi:.2f}"},
+        quiet=True,
+        run_label=label,
+    )
+    iter_elapsed = time.monotonic() - t_iter_start
+    elapsed_times.append(iter_elapsed)
+    if rc != 0:
+        print(
+            f"  [FAIL] allowed GPU = {hi * total_gib:.1f} GiB ({hi:.0%}), "
+            f"test fails even at max utilization. Cannot determine minimum."
+        )
+        return rc
+    peak_mib = max((r.peak_mib for r in reports), default=0)
+    all_peak_mibs.append(peak_mib)
+    last_pass_util = hi
+    last_pass_peak_mib = peak_mib
+    last_pass_reports = reports
+    last_pass_samples = raw_samples
+    pass_wall_times.append(wall)
+    print(
+        f"  [PASS] allowed GPU = {hi * total_gib:.1f} GiB ({hi:.0%}), "
+        f"peak GPU used = {_format_mib(peak_mib)}, wall {wall:.0f}s, "
+        f"iter took {iter_elapsed:.0f}s"
+    )
+    # Use 2x the first profile's time, with a 60s floor, as the timeout for
+    # subsequent profiles. A profile that exceeds it is likely stuck in teardown.
+    baseline_time = iter_elapsed
+    probe_timeout = max(baseline_time * 2, 60)
+    print(f"  Profile timeout: {probe_timeout:.0f}s (2x first profile, min 60s)")
+    iteration = 0
+    while (hi - lo) > tolerance:
+        iteration += 1
+        probe_num = iteration + 1
+        mid = (lo + hi) / 2
+        remaining = max_iterations + 1 - probe_num
+        avg_iter = sum(elapsed_times) / len(elapsed_times)
+        eta_s = remaining * avg_iter
+        label = f"profile {probe_num}/{max_iterations + 1}"
+        print(
+            f"\n  [{label}] "
+            f"_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE={mid:.2f} "
+            f"(allowed max GPU {mid * total_gib:.1f} GiB)  "
+            f"[~{remaining} iters left, profiling ETA ~{eta_s:.0f}s]"
+        )
+        sys.stdout.flush()
+        stop_progress = threading.Event()
+        t_iter_start = time.monotonic()
+        is_tty = sys.stderr.isatty()
+        def _print_progress(t0: float, expected: float, stop: threading.Event) -> None:
+            """Redraw a progress bar on stderr every 2s until `stop` is set (TTY only)."""
+            if not is_tty:
+                return
+            term_width = shutil.get_terminal_size((80, 24)).columns
+            bar_total = max(term_width - 40, 10)
+            while not stop.wait(2):
+                elapsed = time.monotonic() - t0
+                frac = min(elapsed / expected, 1.0) if expected > 0 else 0
+                filled = int(frac * bar_total)
+                bar = "\u2588" * filled + "\u2591" * (bar_total - filled)
+                pct = frac * 100
+                line = f"    [{bar}] {elapsed:5.0f}s / ~{expected:.0f}s ({pct:3.0f}%)"
+                sys.stderr.write(f"\r{line}")
+                sys.stderr.flush()
+        progress_thread = threading.Thread(
+            target=_print_progress,
+            args=(t_iter_start, baseline_time, stop_progress),
+            daemon=True,
+        )
+        progress_thread.start()
+        rc, wall, reports, raw_samples = _run_once(
+            pytest_args,
+            interval=interval,
+            baseline_seconds=baseline_seconds,
+            teardown_seconds=teardown_seconds,
+            extra_env={"_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE": f"{mid:.2f}"},
+            quiet=True,
+            run_label=label,
+            timeout=probe_timeout,
+        )
+        stop_progress.set()
+        progress_thread.join(timeout=2)
+        if is_tty:
+            sys.stderr.write(
+                "\r" + " " * shutil.get_terminal_size((80, 24)).columns + "\r"
+            )
+            sys.stderr.flush()
+        iter_elapsed = time.monotonic() - t_iter_start
+        elapsed_times.append(iter_elapsed)
+        peak_mib = max((r.peak_mib for r in reports), default=0)
+        all_peak_mibs.append(peak_mib)
+        if rc == 0:
+            last_pass_util = mid
+            last_pass_peak_mib = peak_mib
+            last_pass_reports = reports
+            last_pass_samples = raw_samples
+            pass_wall_times.append(wall)
+            hi = mid
+            print(
+                f"  [PASS] allowed GPU = {mid * total_gib:.1f} GiB ({mid:.0%}), "
+                f"peak GPU used = {_format_mib(peak_mib)}, wall {wall:.0f}s, "
+                f"iter took {iter_elapsed:.0f}s"
+            )
+        else:
+            lo = mid
+            print(
+                f"  [FAIL] allowed GPU = {mid * total_gib:.1f} GiB ({mid:.0%}), "
+                f"OOM or error, iter took {iter_elapsed:.0f}s"
+            )
+    # Detect if _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE is being ignored: all peaks are nearly
+    # identical despite wildly different utilization caps.
+    if len(all_peak_mibs) >= 3:
+        peak_range = max(all_peak_mibs) - min(all_peak_mibs)
+        if peak_range < _PLATEAU_TOLERANCE_MIB:
+            print(f"\n  {'!' * 72}")
+            print(
+                f"  WARNING: Peak VRAM was ~{_format_mib(all_peak_mibs[0])} across ALL "
+                f"{len(all_peak_mibs)} probes (range: {peak_range} MiB)."
+            )
+            print(
+                "  This strongly suggests the test IGNORES the _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"
+            )
+            print("  env var.  Binary search results are UNRELIABLE — no marker")
+            print("  recommendation will be provided.")
+            print("  ")
+            print(
+                "  FIX: The test (or its launch script) must read _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"
+            )
+            print("  and pass --gpu-memory-utilization to vLLM / the engine.")
+            print("  See tests/README.md 'GPU VRAM Profiler' for details.")
+            print(f"  {'!' * 72}")
+            return 4
+    # Results
+    assert last_pass_util is not None
+    min_vram_gib = last_pass_util * total_gib
+    padded_peak_mib = int(last_pass_peak_mib * _VRAM_SAFETY_FACTOR)
+    padded_peak_gib = round(padded_peak_mib / 1024, 1)
+    # Extract a short test name from pytest args for the summary
+    test_name = next(
+        (a for a in pytest_args if "::" in a or a.endswith(".py")),
+        " ".join(pytest_args),
+    )
+    test_short = test_name.rsplit("::", 1)[-1] if "::" in test_name else test_name
+    print("\n--- RESULT ---")
+    print(f"  Lowest passing utilization : {last_pass_util:.0%}")
+    print(
+        f"  Minimum VRAM needed        : ~{min_vram_gib:.1f} GiB "
+        f"(peak observed: {_format_mib(last_pass_peak_mib)}, "
+        f"+{_VRAM_SAFETY_FACTOR - 1:.0%} safety: {_format_mib(padded_peak_mib)})"
+    )
+    print(f"  {test_short}: @pytest.mark.max_vram_gib({padded_peak_gib})")
+    # Full marker recommendations using average wall time across all passing runs
+    if recommend:
+        avg_pass_wall = sum(pass_wall_times) / len(pass_wall_times)
+        recs, warnings = _recommend_markers(
+            last_pass_reports, avg_pass_wall, model_name, num_runs=len(pass_wall_times)
+        )
+        _print_recommendations(recs, warnings, pytest_args=pytest_args)
+    if csv_path and last_pass_samples:
+        _write_csv(last_pass_samples, csv_path)
+        print(f"Raw samples (last passing run) written to {csv_path}")
+    return 0
+def main(argv: list[str] | None = None) -> int:
+    logging.basicConfig(
+        level=logging.INFO,
+        format="%(levelname)s: %(message)s",
+    )
+    parser = argparse.ArgumentParser(
+        description="Profile GPU memory during a pytest run.",
+        usage="%(prog)s [options] [--] pytest-args...",
+    )
+    parser.add_argument(
+        "--interval",
+        type=float,
+        default=0.1,
+        help="Sampling interval in seconds (default: 0.1)",
+    )
+    parser.add_argument(
+        "--baseline-seconds",
+        type=float,
+        default=3.0,
+        help="Seconds to sample baseline before launching pytest (default: 3.0)",
+    )
+    parser.add_argument(
+        "--teardown-seconds",
+        type=float,
+        default=5.0,
+        help="Seconds to sample after pytest exits to measure teardown (default: 5.0)",
+    )
+    parser.add_argument(
+        "--csv",
+        type=str,
+        default=None,
+        help="Write raw samples to this CSV file",
+    )
+    parser.add_argument(
+        "--no-recommend",
+        action="store_true",
+        default=False,
+        help="Suppress marker recommendations",
+    )
+    parser.add_argument(
+        "--no-find-min-vram",
+        action="store_true",
+        default=False,
+        help="Disable the default binary-search mode that finds minimum VRAM. "
+        "When set, runs a single profiling pass instead.",
+    )
+    raw = argv if argv is not None else sys.argv[1:]
+    if "--" in raw:
+        split_idx = raw.index("--")
+        args = parser.parse_args(raw[:split_idx])
+        pytest_args = raw[split_idx + 1 :]
+    else:
+        args, pytest_args = parser.parse_known_args(raw)
+    if not pytest_args:
+        parser.error("No pytest arguments provided")
+    # Validate that test file paths actually exist
+    for arg in pytest_args:
+        if arg.startswith("-"):
+            continue
+        test_path = arg.split("::")[0]
+        looks_like_test_path = test_path.endswith(".py") or (os.path.sep in test_path)
+        if looks_like_test_path and not os.path.exists(test_path):
+            parser.error(f"Test path does not exist: {test_path}")
+    gpu_info = _query_gpu_stats()
+    if not gpu_info:
+        raise RuntimeError("NVML returned no GPU data")
+    used_mib = gpu_info[0][1]
+    total_mib = gpu_info[0][2]
+    hogged_pct = used_mib / total_mib * 100
+    if hogged_pct > 10:
+        print(
+            f"\nWARNING: {used_mib / 1024:.1f} GiB ({hogged_pct:.0f}%) of GPU memory "
+            f"is already in use! Results may be inaccurate.\n"
+        )
+    if not args.no_find_min_vram:
+        return _find_min_vram(
+            pytest_args,
+            interval=args.interval,
+            baseline_seconds=args.baseline_seconds,
+            teardown_seconds=args.teardown_seconds,
+            recommend=not args.no_recommend,
+            csv_path=args.csv,
+        )
+    model_name = _extract_model_from_markers(pytest_args)
+    rc, wall_secs, reports, samples = _run_once(
+        pytest_args,
+        interval=args.interval,
+        baseline_seconds=args.baseline_seconds,
+        teardown_seconds=args.teardown_seconds,
+    )
+    _print_report(reports, rc, wall_secs, model_name=model_name)
+    if not args.no_recommend and reports:
+        recs, warnings = _recommend_markers(reports, wall_secs, model_name=model_name)
+        _print_recommendations(recs, warnings, pytest_args=pytest_args)
+    if args.csv:
+        _write_csv(samples, args.csv)
+        print(f"Raw samples written to {args.csv}")
+    return rc
+if __name__ == "__main__":
+    if (
+        os.environ.get("CI")
+        or os.environ.get("GITHUB_ACTIONS")
+        or os.environ.get("GITLAB_CI")
+    ):
+        print("ERROR: profile_pytest.py must not run in CI.", file=sys.stderr)
+        raise SystemExit(1)
+    raise SystemExit(main())
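Note on the search loop in `_find_min_vram`: the control flow is a plain bisection over the memory fraction, preceded by one validation probe at the upper bound. The sketch below isolates that logic so the probe count and convergence can be checked without a GPU. `find_min_fraction` and the `passes` callback are hypothetical names used only for illustration. In the real script the callback's role is played by running pytest via `_run_once` with `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` set and checking the return code.

```python
import math
from typing import Callable, Optional, Tuple


def find_min_fraction(
    passes: Callable[[float], bool],
    lo: float = 0.05,
    hi: float = 0.95,
    tolerance: float = 0.05,
) -> Tuple[Optional[float], int]:
    """Bisect for the lowest fraction at which `passes` returns True.

    Returns (lowest passing fraction found, number of probes run).
    The fraction is None if the validation probe at `hi` fails.
    """
    probes = 1
    if not passes(hi):  # validation run: nothing to search if this fails
        return None, probes

    last_pass = hi
    while (hi - lo) > tolerance:
        mid = (lo + hi) / 2
        probes += 1
        if passes(mid):
            last_pass = mid
            hi = mid  # passing: try a smaller budget
        else:
            lo = mid  # failing: need a larger budget
    return last_pass, probes


def max_bisections(lo: float, hi: float, tolerance: float) -> int:
    """Upper bound on bisection steps, computed as in the script."""
    return math.ceil(math.log2((hi - lo) / tolerance))


if __name__ == "__main__":
    # Hypothetical workload that needs at least 30% of total VRAM.
    fraction, probes = find_min_fraction(lambda f: f >= 0.30)
    print(f"lowest passing fraction: {fraction}, probes: {probes}")
    print(f"bisection bound: {max_bisections(0.05, 0.95, 0.05)}")
```

Because the interval halves on every step regardless of the outcome, the number of probes is fixed by `lo`, `hi`, and `tolerance`: one validation run plus `max_bisections(...)` bisections. The reported fraction is the last one that passed, so it overestimates the true threshold by at most `tolerance`.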