feat: GPU-parallel test runner with VRAM-aware scheduling (#7560)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

Unverified Commit 6dc85fbc authored Apr 02, 2026 by Keiven C Committed by GitHub Apr 02, 2026
20 changed files
--- a/examples/backends/vllm/mm_router_worker/launch.sh
+++ b/examples/backends/vllm/mm_router_worker/launch.sh
@@ -20,9 +20,17 @@ MODEL="${MODEL:-Qwen/Qwen3-VL-8B-Instruct}"
 NAMESPACE="${NAMESPACE:-dynamo}"
 HTTP_PORT="${HTTP_PORT:-8000}"
 BLOCK_SIZE="${BLOCK_SIZE:-16}"            # Must match vLLM backend KV block size
-GPU_MEMORY_UTILIZATION="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-${GPU_MEMORY_UTILIZATION:-0.85}}"
+GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.85}"
 MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"

+# KV cache override for parallel-safe GPU memory control
+KV_BYTES="${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"
+if [[ -n "$KV_BYTES" ]]; then
+    GPU_MEM_ARGS="--kv-cache-memory-bytes $KV_BYTES --gpu-memory-utilization 0.01"
+else
+    GPU_MEM_ARGS="--gpu-memory-utilization ${GPU_MEMORY_UTILIZATION}"
+fi
+
 NATS_SERVER="${NATS_SERVER:-nats://127.0.0.1:4222}"
 ETCD_ENDPOINTS="${ETCD_ENDPOINTS:-http://127.0.0.1:2379}"

@@ -121,7 +129,7 @@ env "${COMMON_ENV[@]}" \
        --enable-multimodal \
        --block-size "${BLOCK_SIZE}" \
        --enforce-eager \
-        --gpu-memory-utilization "${GPU_MEMORY_UTILIZATION}" \
+        $GPU_MEM_ARGS \
        --max-model-len "${MAX_MODEL_LEN}" \
        --served-model-name "${MODEL}__internal" \
        ${VLLM_EXTRA_ARGS} &

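The branch added to `launch.sh` above can be restated as a small Python sketch. The function name `gpu_mem_args` and the dict-based environment are illustrative assumptions, not part of the commit; only the env var names, flags, and the `0.85` / `0.01` values come from the hunk.

```python
def gpu_mem_args(env: dict[str, str]) -> list[str]:
    """Hypothetical mirror of the launch.sh branch: absolute KV cap if set, else fraction."""
    kv_bytes = env.get("_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES", "")
    if kv_bytes:
        # Pin the KV cache to an exact byte count; the script pairs this
        # with a token fraction of 0.01.
        return [
            "--kv-cache-memory-bytes", kv_bytes,
            "--gpu-memory-utilization", "0.01",
        ]
    # Like ${GPU_MEMORY_UTILIZATION:-0.85}: unset or empty falls back to 0.85.
    util = env.get("GPU_MEMORY_UTILIZATION") or "0.85"
    return ["--gpu-memory-utilization", util]


if __name__ == "__main__":
    print(gpu_mem_args({}))
    print(gpu_mem_args({"_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES": "1073741824"}))
```

Returning a list rather than a single string sidesteps the word-splitting that the unquoted `$GPU_MEM_ARGS` relies on in the shell version.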
--- a/examples/common/gpu_utils.md
+++ b/examples/common/gpu_utils.md
-# GPU Memory Parameters by Engine
+# GPU Memory Control

-How vLLM, sglang, and TensorRT-LLM interpret memory-related parameters, and how
-to estimate total GPU VRAM usage for each.
+How vLLM, SGLang, and TensorRT-LLM allocate GPU memory, and how we override
+it for deterministic parallel test execution.

 ---

-## Quick Reference
-
-| Parameter | vLLM | sglang | TensorRT-LLM |
-|---|---|---|---|
-| Memory fraction | `--gpu-memory-utilization` | `--mem-fraction-static` | `free_gpu_memory_fraction` (YAML/override) |
-| Fraction base | Total VRAM | Total VRAM | Free VRAM (after model load) |
-| Default fraction | 0.90 | 0.90 | 0.90 |
-| Max sequence length | `--max-model-len` | `--context-length` | `max_seq_len` (YAML/override) |
-| KV cache size override | `--kv-cache-memory-bytes` | N/A | `max_gpu_total_bytes` (broken in 1.3.0rc5) |
-
---
-
-## 1. vLLM
-
-### How `--gpu-memory-utilization` works
-
-This is a fraction of **total** GPU VRAM. The engine budgets everything within
-this limit:
-
-```
-budget = total_vram * gpu_memory_utilization
-
-KV cache = budget - model_weights - peak_activations - framework_overhead
-```
-
-At startup, vLLM profiles actual model weight and activation memory, then
-pre-allocates the remaining budget as KV cache blocks. The KV pool size is fixed
-for the lifetime of the engine.
-
-### How `--max-model-len` works
-
-Sets the maximum total sequence length (input + output tokens). Longer sequences
-require more KV cache per request. If the requested `max-model-len` needs more
-KV cache than the budget allows, vLLM errors at startup:
-
-```
-ValueError: ... X GiB KV cache is needed, which is larger than the available
-KV cache memory (Y GiB). ...
-```
-
-Reducing `--max-model-len` is the most effective way to reduce VRAM when the
-model fits but the KV cache doesn't.
-
-### How `--kv-cache-memory-bytes` works
-
-When set, this overrides the automatic KV cache sizing from
-`gpu-memory-utilization`. The engine allocates exactly this many bytes for KV
-cache regardless of the fraction. This means `gpu-memory-utilization` still
-controls the *overall* VRAM budget (and thus whether the model fits), but the
-KV cache portion is pinned to the explicit byte value.
-
-Consequence for profiling: if a script uses `--kv-cache-memory-bytes`,
-changing `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` (which maps to
-`--gpu-memory-utilization`) won't change the KV cache size, only the leftover
-headroom for activations and overhead.
-
-### Estimating total GPU usage
-
-```
-total_vram ≈ model_weights + kv_cache + activations + overhead
-
-model_weights ≈ num_params * bytes_per_param
-                (e.g. 7B * 2 bytes for BF16 ≈ 14 GiB)
+## Why absolute caps, not fractions

-kv_cache_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
-                     (the factor of 2 is for K and V tensors)
+Memory fractions (`--gpu-memory-utilization`, `--mem-fraction-static`) are
+unreliable for parallel / CI workloads:

-kv_cache_total = kv_cache_per_token * max_model_len * max_concurrent_seqs
+- **Non-deterministic** — same fraction produces different KV cache sizes
+  depending on what else is on the GPU at init time.
+- **Profiling race** — concurrent engines each see "nearly all memory free",
+  allocate based on that, and OOM.
+- **Not portable** — a fraction tuned for 48 GiB is wrong on 24 or 80 GiB.
+- **Different semantics** — vLLM/SGLang use fraction of *total* VRAM;
+  TensorRT-LLM uses fraction of *free* VRAM after model load.

-overhead ≈ engine-dependent (auto-computed by estimate_worker_vram):
-           vllm:   1.2 + 1.0 * sqrt(params_b) GiB  (0.6B≈2.0, 8B≈4.0)
-           sglang: 1.5 + 1.0 * sqrt(params_b) GiB  (0.6B≈2.3, 8B≈4.3)
-           trtllm: 2.0 + 1.2 * sqrt(params_b) GiB  (0.6B≈2.9, 8B≈5.4)
-```
+Instead, we use **absolute KV cache caps**:

-Rule of thumb: set `gpu-memory-utilization` so that
-`total_vram * fraction >= model_weights + 2 GiB`. The rest becomes KV cache.
+| Engine | Deterministic override | Env var |
+|--------|----------------------|---------|
+| vLLM | `--kv-cache-memory-bytes N` | `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` |
+| SGLang | `--max-total-tokens N` | `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` |
+| TensorRT-LLM | *(future TODO)* | — |

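To make the "not portable" point concrete, here is a rough sketch of the vLLM-style budget arithmetic (KV cache = total VRAM × fraction − weights − other overhead). The function name and the 14 GiB / 4 GiB figures are illustrative assumptions, not measurements from this commit.

```python
def kv_cache_gib(total_vram_gib: float, fraction: float,
                 weights_gib: float, other_gib: float) -> float:
    """KV cache left over under a vLLM-style fraction budget (may be negative)."""
    budget = total_vram_gib * fraction
    return budget - weights_gib - other_gib


if __name__ == "__main__":
    # Same fraction, same model, three GPU sizes (illustrative numbers).
    for total in (24, 48, 80):
        kv = kv_cache_gib(total, fraction=0.5, weights_gib=14, other_gib=4)
        status = "does not fit" if kv <= 0 else f"{kv:.0f} GiB KV cache"
        print(f"{total} GiB GPU: {status}")
```

Under these assumed numbers the same fraction fails outright on the smallest card and yields very different KV pools on the other two, which is the behavior an absolute cap avoids.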
 ---

-## 2. sglang
-
-### How `--mem-fraction-static` works
-
-Like vLLM, this is a fraction of **total** GPU VRAM:
-
-```
-budget = total_vram * mem_fraction_static
-
-KV cache pool = budget - model_weights
-```
-
-The budget covers model weights and the KV cache pool. Activations and CUDA
-graph buffers are allocated *outside* this budget from the remaining VRAM.
-This is slightly different from vLLM (which includes activations in the budget).
-
-sglang recommends keeping 5-8 GiB free for activations and overhead. If you
-see OOM errors, decrease `--mem-fraction-static` by 0.01-0.05 increments.
-
-### How `--context-length` and `--max-running-requests` work
-
-Unlike vLLM (where `--max-model-len` directly affects KV cache sizing), sglang's
-`--context-length` and `--max-running-requests` do **not** affect KV cache
-allocation. The KV cache pool is sized entirely from `--mem-fraction-static`:
-
-```
-kv_cache_pool = total_vram * mem_fraction_static - model_weights
-```
-
-Profiling confirmed this: changing `--context-length` from 512 to 40960 produced
-identical `max_total_num_tokens` values (269,136 on a 48 GiB GPU at fraction 0.95).
-
-These flags only affect **request scheduling**:
- `--context-length` caps the per-request token usage from the KV pool
- `--max-running-requests` limits concurrent request slots (allocated from
-  memory outside the `--mem-fraction-static` budget)
-
-Setting `--max-running-requests` too high at high fractions can cause OOM because
-the request slot pool competes for the small amount of memory left after KV cache
-allocation.
-
-### Estimating total GPU usage
-
-```
-total_vram ≈ model_weights + kv_cache_pool + activations_and_overhead
-
-kv_cache_pool = total_vram * mem_fraction_static - model_weights
+## Quick Reference

-activations_and_overhead ≈ 1-2 GiB for small models (0.6B-4B)
-                           ~3-5 GiB for larger models (7B+)
-  (CUDA context, graphs, request pools — allocated outside mem_fraction_static)
-```
+| | vLLM | SGLang | TensorRT-LLM |
+|---|---|---|---|
+| Fraction flag | `--gpu-memory-utilization` | `--mem-fraction-static` | `free_gpu_memory_fraction` |
+| Fraction base | Total VRAM | Total VRAM | Free VRAM (post-load) |
+| Default | 0.90 | 0.90 | 0.90 |
+| Max seq len | `--max-model-len` | `--context-length` | `max_seq_len` |
+| KV cache override | `--kv-cache-memory-bytes` | `--max-total-tokens` | `max_gpu_total_bytes` *(broken in 1.3.0rc5)* |

 ---

-## 3. TensorRT-LLM
-
-### How `free_gpu_memory_fraction` works
-
-This is a fraction of **free** VRAM (not total). The engine:
-
-1. Loads model weights and builds the TRT engine (fixed cost).
-2. Queries remaining free GPU memory.
-3. Allocates `free_memory * free_gpu_memory_fraction` for the KV cache pool.
-
-```
-kv_cache = free_vram_after_model_load * free_gpu_memory_fraction
-```
-
-This means the same fraction yields different absolute KV cache sizes depending
-on how much VRAM the model consumed. A 5 GiB model on a 48 GiB GPU leaves
-~43 GiB free; fraction=0.24 gives ~10 GiB KV cache. A 30 GiB model leaves
-~18 GiB free; fraction=0.24 gives only ~4 GiB.
-
-Set via YAML config, CLI, or env var:
-
-```bash
--override-engine-args '{"kv_cache_config":{"free_gpu_memory_fraction": 0.24}}'
-DYN_TRTLLM_OVERRIDE_ENGINE_ARGS='{"kv_cache_config":{"free_gpu_memory_fraction": 0.24}}'
-```
-
-### How `max_seq_len` works
-
-Maximum total sequence length. Defaults to the model's native context.
-Sequences exceeding this limit are rejected at runtime.
-
-**VRAM impact: none (PyTorch backend).** Reducing max_seq_len from 40960 to
-2048 had zero effect on total VRAM or KV cache size in testing (Qwen3-0.6B,
-trtllm 1.3.0rc5). The PyTorch backend does not pre-allocate internal buffers
-proportional to max_seq_len; KV cache size is determined solely by
-`free_gpu_memory_fraction`. This differs from vLLM/sglang where reducing
-context length measurably reduces memory.
-
-Override via:
-
-```bash
--override-engine-args '{"max_seq_len": 4096}'
-```
+## Per-Engine Notes

-### Override gotcha: sub-dict replacement
+### vLLM

-Overriding any field inside `kv_cache_config` **replaces the entire sub-dict**.
-If your YAML has `enable_block_reuse: true` and you override only
-`free_gpu_memory_fraction`, you lose `enable_block_reuse`. Always re-include
-all fields you need:
+`--gpu-memory-utilization` sets a budget as a fraction of total VRAM.
+KV cache = budget - weights - activations - overhead. The pool is fixed at startup.

-```json
-{"kv_cache_config": {"free_gpu_memory_fraction": 0.15, "enable_block_reuse": true}}
-```
+`--kv-cache-memory-bytes` overrides automatic sizing and **skips memory
+profiling** ([PR #21489]). The KV cache is pinned to the exact byte value —
+no profiling race, no CUDAGraph estimation errors, safe for concurrent
+instances ([#10643]). When set, `--gpu-memory-utilization` only affects
+headroom for activations, not KV cache size.

-### How `max_num_tokens` works
+`--max-model-len` caps sequence length. Reducing it is the fastest way to
+cut VRAM when the model fits but KV cache doesn't.

-Maximum batched input tokens per iteration. Primarily a throughput knob.
+[PR #21489]: https://github.com/vllm-project/vllm/pull/21489
+[#10643]: https://github.com/vllm-project/vllm/issues/10643

-**VRAM impact: none.** Reducing from 8192 → 256 had no measurable effect on
-total VRAM (41,643 vs 41,465 MiB — within noise; the slight *increase* is
-because smaller activation footprint lets the fraction claim marginally more
-KV cache).
+### SGLang

-### `max_gpu_total_bytes` (broken)
+`--mem-fraction-static` sets a budget as a fraction of total VRAM.
+KV cache pool = budget - weights. Activations and CUDA graph buffers are
+*outside* this budget (unlike vLLM).

-Intended as an absolute byte cap for KV cache. As of trtllm 1.3.0rc5, this
-field is **ignored**. Setting 5 GiB cap with `free_gpu_memory_fraction=0.95`
-still allocated ~42 GiB of KV cache. Setting `free_gpu_memory_fraction=0.0`
-with only `max_gpu_total_bytes` causes `"Impossible to fit any sequence in
-kvCache"`. Do not rely on this field.
+`--max-total-tokens` caps the KV token pool directly, regardless of fraction.
+When set, the token cap is the binding constraint.

-### Override precedence
+`--context-length` and `--max-running-requests` affect request scheduling
+only — they do **not** change KV cache allocation.

-```
--override-engine-args JSON  >  --extra-engine-args YAML  >  CLI flags
-```
+### TensorRT-LLM

-The `DYN_TRTLLM_OVERRIDE_ENGINE_ARGS` env var is equivalent to
-`--override-engine-args` and avoids shell quoting issues with scripts whose
-arg parsers consume unknown flags before passing `"$@"`.
+`free_gpu_memory_fraction` is a fraction of **free** VRAM after model load.
+Set via YAML or `--override-engine-args '{"kv_cache_config":{"free_gpu_memory_fraction": 0.24}}'`.

-### Estimating total GPU usage
-
-```
-total_vram ≈ model_weights + engine_overhead + kv_cache
-
-model_weights ≈ num_params * bytes_per_param / tensor_parallel_size
-engine_overhead ≈ 2.0 + 1.2 * sqrt(params_b) GiB  (CUDA context + TRT buffers + activations)
-kv_cache = free_vram_after_model_load * free_gpu_memory_fraction
-```
-
-Engine overhead is auto-computed by `estimate_worker_vram` when called with the
-`trtllm` engine name.  Examples: 0.6B → 2.9 GiB, 8B → 5.4 GiB, 30B → 8.6 GiB.
-
-### Empirical validation (Qwen3-0.6B, RTX 6000 Ada 48 GiB, trtllm 1.3.0rc5)
-
-Controlled test: single worker via agg.sh, one override at a time.
-
-| # | Override | Total VRAM | KV Cache | Tokens |
-|---|---------|-----------|----------|--------|
-| 1 | Baseline (YAML frac=0.85) | 41,465 MiB | 38.04 GiB | 356,160 |
-| 2 | `free_gpu_memory_fraction=0.15` | 9,383 MiB | 6.71 GiB | 62,848 |
-| 3 | `max_num_tokens=256` | 41,643 MiB | 38.26 GiB | 358,208 |
-| 4 | `max_seq_len=4096` | 41,469 MiB | 38.05 GiB | 356,192 |
-| 5 | `max_seq_len=2048` | 41,469 MiB | 38.05 GiB | 356,192 |
-| 6 | seq=4096 + frac=0.15 | 9,383 MiB | 6.71 GiB | 62,848 |
-| 7 | tokens=256 + seq=4096 + frac=0.15 | 9,377 MiB | 6.75 GiB | 63,200 |
-
-**Conclusion:** `free_gpu_memory_fraction` is the **sole effective knob** for
-trtllm VRAM control. Neither `max_seq_len` nor `max_num_tokens` reduce memory.
-Combined overrides (test 7) produce no additional benefit over fraction alone
-(test 2).
+Deterministic KV cache control via `build_gpu_mem_args` is a future TODO.

 ---

-## Why vLLM/sglang fractions are NOT interchangeable with TensorRT-LLM
-
-Consider wanting 10 GiB of KV cache on a 48 GiB GPU with a 5 GiB model:
-
-| Engine | Fraction meaning | Calculation | Result |
-|---|---|---|---|
-| vLLM | 10/48 = 0.21 of total | `48 * 0.21 = 10 GiB` budget (minus model = 5 GiB KV) | Wrong — need higher fraction |
-| sglang | Same as vLLM | Same math | Same problem |
-| TensorRT-LLM | 10/43 = 0.23 of free | `43 * 0.23 = 10 GiB` KV cache | Correct |
-
-For vLLM/sglang, you actually need `(model + kv) / total = (5 + 10) / 48 = 0.31`
-to get 10 GiB of KV cache with a 5 GiB model.
+## `build_gpu_mem_args` and Env Vars

-The helper functions in `gpu_utils.sh` handle these differences:
- `gpu_gb_to_total_fraction`: for vLLM/sglang (fraction of total VRAM)
- `gpu_gb_to_free_fraction`: for TensorRT-LLM (fraction of free VRAM)
- `gpu_worker_fraction <engine> <total_gib> <kv_gib>`: converts estimated GiB
-  into the engine-appropriate fraction (total for vllm/sglang, free for trtllm).
-
-Launch scripts use `build_gpu_mem_args` which calls these internally:
+Launch scripts source `gpu_utils.sh` and call `build_gpu_mem_args` to pick
+up env-var overrides during profiling and parallel execution:

 ```bash
-GPU_MEM_FRACTION=$(build_gpu_mem_args trtllm --model "$MODEL" --max-model-len "$SEQ_LEN" --max-num-seqs "$CONCURRENCY")
-```
-
---
-
-## KV Cache Memory Per Token
+source "$SCRIPT_DIR/../../../common/gpu_utils.sh"

-The formula for KV cache memory per token is the same across all engines:
+GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
+python -m dynamo.vllm --model "$MODEL" $GPU_MEM_ARGS &

-```
-kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
+GPU_MEM_ARGS=$(build_gpu_mem_args sglang)
+python -m dynamo.sglang --model-path "$MODEL" $GPU_MEM_ARGS &
 ```

-| Model | Layers | KV Heads | Head Dim | Dtype | Per Token |
-|---|---|---|---|---|---|
-| Qwen3-0.6B | 28 | 8 | 128 | BF16 | 112 KiB |
-| Llama-3.1-8B | 32 | 8 | 128 | BF16 | 128 KiB |
-| Llama-3.1-70B | 80 | 8 | 128 | BF16 | 320 KiB |
-| Qwen2.5-VL-7B | 28 | 4 | 128 | BF16 | 56 KiB |
+When the env var is set, `build_gpu_mem_args` returns the corresponding flag.
+Otherwise it returns empty and the engine uses its default allocation.

-To estimate KV cache for a given context length:
-
-```
-kv_cache_gib = kv_bytes_per_token * max_model_len * max_concurrent_seqs / (1024^3)
-```
-
---
+| Env var | Engine | CLI flag produced |
+|---------|--------|-------------------|
+| `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` | vLLM | `--kv-cache-memory-bytes N --gpu-memory-utilization 0.01` |
+| `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` | SGLang | `--max-total-tokens N` |

-## `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE`
+For multi-worker single-GPU scripts, pass `--workers-per-gpu N` to divide
+the allocation: `build_gpu_mem_args vllm --workers-per-gpu 2`.

-Environment variable used by Dynamo's VRAM profiler to binary-search the minimum
-memory fraction a script needs.
+**Profiler** (`profile_pytest.py`): binary-searches the KV cap to find the
+minimum passing value, applies a 2x safety factor, and outputs pytest markers
+(`@pytest.mark.requested_vllm_kv_cache_bytes(N)` or
+`@pytest.mark.requested_sglang_kv_tokens(N)`).

- Maps to `--gpu-memory-utilization` in vLLM and `--mem-fraction-static` in sglang.
- For TensorRT-LLM, maps to `kv_cache_config.free_gpu_memory_fraction` via
-  `--override-engine-args`.
- Launch scripts use `build_gpu_mem_args` to compute the default fraction;
-  the override bypasses the estimator and splits the raw value between workers.
- Scripts that use `--kv-cache-memory-bytes` (vLLM) bypass the fraction-based KV
-  cache sizing, making the profiler's fraction override ineffective for KV cache.
-  Those scripts should warn when `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` is set.
+**Scheduler** (`pytest_parallel_gpu.py`): reads the markers at runtime and
+sets the env var per-test. See `tests/README.md` for details.
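The profiler's search can be pictured roughly as follows. This is a minimal sketch, not the code in `profile_pytest.py`: the function names, the `tolerance` parameter, and the assumption that passing is monotonic in the cap (any cap above a passing one also passes) are all hypothetical.

```python
from typing import Callable


def min_passing_cap(passes: Callable[[int], bool], lo: int, hi: int,
                    tolerance: int = 1) -> int:
    """Smallest passing cap in (lo, hi], accurate to within `tolerance`.

    Assumes `passes` is monotonic and that `hi` itself passes.
    """
    if tolerance < 1:
        raise ValueError("tolerance must be >= 1")
    if not passes(hi):
        raise ValueError("upper bound does not pass; widen the search range")
    while hi - lo > tolerance:
        mid = lo + (hi - lo) // 2
        if passes(mid):
            hi = mid   # mid works: the minimum is at or below mid
        else:
            lo = mid   # mid fails: the minimum is above mid
    return hi


def recommended_cap(min_cap: int, safety_factor: int = 2) -> int:
    """Apply the 2x safety factor mentioned in the text above."""
    return min_cap * safety_factor


if __name__ == "__main__":
    # Stand-in for "run the test with this KV cap": passes at >= 700 units.
    cap = min_passing_cap(lambda c: c >= 700, lo=0, hi=4096, tolerance=64)
    print(cap, recommended_cap(cap))
```

A coarse tolerance keeps the number of trial runs small, which matters when each trial is a full engine launch; the safety factor then absorbs the resulting slack.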
--- a/examples/common/gpu_utils.sh
+++ b/examples/common/gpu_utils.sh
--- a/examples/common/launch_utils.sh
+++ b/examples/common/launch_utils.sh
@@ -137,9 +137,9 @@ print_launch_banner() {
    echo "Frontend:    http://localhost:$_port"

    local _seq_len="${MAX_MODEL_LEN:-${CONTEXT_LENGTH:-${MAX_SEQ_LEN:-}}}"
-    local _frac="${GPU_MEM_FRACTION:-}"
+    local _mem_args="${GPU_MEM_ARGS:-}"
    [[ -n "$_seq_len" ]] && echo "Max seq len: $_seq_len"
-    [[ -n "$_frac" ]] && echo "GPU frac:    $_frac"
+    [[ -n "$_mem_args" ]] && echo "GPU mem:     $_mem_args"

    for _line in "$@"; do
        echo "$_line"

--- a/examples/multimodal/launch/audio_agg.sh
+++ b/examples/multimodal/launch/audio_agg.sh
@@ -93,10 +93,10 @@ python -m dynamo.frontend --http-port 8000 &
 python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &

 # run E/P/D workers
-GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
+GPU_MEM_ARGS=$(build_gpu_mem_args vllm)

 CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
-VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
+VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill $GPU_MEM_ARGS &

 # Wait for all background processes to complete
 wait
--- a/examples/multimodal/launch/audio_disagg.sh
+++ b/examples/multimodal/launch/audio_disagg.sh
@@ -93,11 +93,11 @@ python -m dynamo.frontend --http-port 8000 &
 python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &

 # run E/P/D workers
-GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
+GPU_MEM_ARGS=$(build_gpu_mem_args vllm)

 CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
-DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
-DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
+DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg $GPU_MEM_ARGS &
+DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg $GPU_MEM_ARGS &

 # Wait for all background processes to complete
 wait
--- a/examples/multimodal/launch/video_agg.sh
+++ b/examples/multimodal/launch/video_agg.sh
@@ -19,10 +19,10 @@ python -m dynamo.frontend --http-port=8000 &
 python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &

 # run E/P/D workers
-GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
+GPU_MEM_ARGS=$(build_gpu_mem_args vllm)

 CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE &
-VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
+VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill $GPU_MEM_ARGS &

 # Wait for all background processes to complete
 wait
--- a/examples/multimodal/launch/video_disagg.sh
+++ b/examples/multimodal/launch/video_disagg.sh
@@ -20,11 +20,11 @@ python -m dynamo.frontend --http-port=8000 &
 python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &

 # run E/P/D workers
-GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
+GPU_MEM_ARGS=$(build_gpu_mem_args vllm)

 CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE &
-DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
-DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
+DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg $GPU_MEM_ARGS &
+DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg $GPU_MEM_ARGS &

 # Wait for all background processes to complete
 wait
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -234,7 +234,10 @@ markers = [
    "gpu_8: marks tests to run on 8GPUs",
    "xpu_1: marks tests to run on XPU",
    "xpu_2: marks tests to run on 2XPUs",
-    "max_vram_gib(N): peak VRAM in GiB (with 10% safety). Filter with --max-vram-gib=N",
+    # The next three markers (profiled_vram_gib and requested_*) support parallel pytest execution:
+    "profiled_vram_gib(N): actual peak VRAM observed by nvidia-smi during profiling. Used for --max-vram-gib filtering and scheduler budget tracking",
+    "requested_vllm_kv_cache_bytes(N): exact KV cache bytes for vLLM (skips memory profiling). Sets _PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES. Most deterministic method for parallel execution",
+    "requested_sglang_kv_tokens(N): max KV cache tokens for SGLang parallel execution. Sets _PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS to cap --max-total-tokens and prevent over-allocation",
    "e2e: marks tests as end-to-end tests",
    "integration: marks tests as integration tests",
    "unit: marks tests as unit tests",

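How a scheduler might turn these markers into per-test environment variables can be sketched as below. The helper `env_for_test` and the plain-dict marker representation are hypothetical; the marker names come from this hunk and the env var names from the table in `gpu_utils.md`.

```python
# Marker name -> env var consumed by the launch scripts (names per gpu_utils.md).
MARKER_TO_ENV = {
    "requested_vllm_kv_cache_bytes": "_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES",
    "requested_sglang_kv_tokens": "_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS",
}


def env_for_test(markers: dict[str, int]) -> dict[str, str]:
    """Hypothetical helper: build the env overrides for one test's subprocess.

    `markers` maps marker name to its single argument; markers without a
    corresponding env var (e.g. profiled_vram_gib) are ignored here.
    """
    return {
        MARKER_TO_ENV[name]: str(value)
        for name, value in markers.items()
        if name in MARKER_TO_ENV
    }


if __name__ == "__main__":
    print(env_for_test({"profiled_vram_gib": 9,
                        "requested_vllm_kv_cache_bytes": 1_073_741_824}))
```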
--- a/tests/README.md
+++ b/tests/README.md
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -25,6 +25,11 @@ from tests.utils.test_output import resolve_test_output_path

 _logger = logging.getLogger(__name__)

+# Typed stash keys for GPU-parallel config (avoids setting unknown attrs on Config)
+_gpu_parallel_gpus_key: pytest.StashKey[list[dict]] = pytest.StashKey()
+_gpu_indices_key: pytest.StashKey[list[int] | None] = pytest.StashKey()
+_gpu_slots_key: pytest.StashKey[int | None] = pytest.StashKey()
+

 def pytest_addoption(parser: pytest.Parser) -> None:
    """Add shared command-line options for all tests.
@@ -59,7 +64,18 @@ def pytest_addoption(parser: pytest.Parser) -> None:
        "--max-vram-gib",
        type=float,
        default=None,
-        help="Skip tests whose @pytest.mark.max_vram_gib(N) exceeds this value (GiB).",
+        help="Only run tests whose @pytest.mark.profiled_vram_gib(N) value fits within this limit (GiB). "
+        "Without -n: runs tests sequentially. "
+        "With -n N: runs N tests concurrently as subprocesses with VRAM-aware scheduling. "
+        "With -n auto: calculates max concurrent slots from GPU VRAM / max_vram_gib.",
+    )
+    parser.addoption(
+        "--gpus",
+        "--gpu",
+        type=str,
+        default="all",
+        help="Comma-separated GPU indices or 'all' (default: all). "
+        "Controls which GPUs the parallel test runner distributes tests across.",
    )
    parser.addoption(
        "--dry-run",
@@ -79,6 +95,130 @@ logging.basicConfig(
 )


+# ---------------------------------------------------------------------------
+# GPU-serial and GPU-parallel: VRAM-aware test scheduling
+#
+# Parallel mode activates only when --max-vram-gib is combined with -n N or -n auto, e.g.:
+#   pytest --max-vram-gib=48 -n auto -m "gpu_1 and sglang" tests/serve/
+# ---------------------------------------------------------------------------
+
+
+def pytest_configure(config: pytest.Config) -> None:
+    """Detect GPUs for --max-vram-gib planning and parallel execution."""
+    vram_limit = config.getoption("max_vram_gib", default=None)
+    if vram_limit is None:
+        return
+    # Delayed import: vram_utils requires pynvml, and importing it at module level
+    # would break conftest loading on CPU-only CI runners (e.g. ARM deploy tests) that lack nvidia-ml-py.
+    from tests.utils.pytest_parallel_gpu import _parse_gpu_indices
+    from tests.utils.vram_utils import auto_worker_count, detect_gpus
+
+    gpus = detect_gpus()
+    if gpus:
+        config.stash[_gpu_parallel_gpus_key] = gpus
+
+    # Parse --gpus into a list of indices (or None for all)
+    gpus_raw = config.getoption("gpus", default="all")
+    if gpus_raw and gpus_raw.strip().lower() != "all":
+        config.stash[_gpu_indices_key] = _parse_gpu_indices(gpus_raw, gpus)
+        selected_gpus = [
+            g for g in gpus if g["index"] in config.stash[_gpu_indices_key]
+        ]
+    else:
+        config.stash[_gpu_indices_key] = None  # all GPUs
+        selected_gpus = gpus
+
+    # If -n is set with --max-vram-gib, save the slot count and disable xdist
+    # so our subprocess orchestrator handles parallelism instead.
+    # xdist's pytest_configure(trylast=True) checks _is_distribution_mode()
+    # which reads dist/tx (not numprocesses), so we must also clear dist.
+    numproc = config.getoption("numprocesses", default=None)
+    if numproc is not None and numproc != 0:
+        if isinstance(numproc, str) or numproc == -1:
+            config.stash[_gpu_slots_key] = (
+                auto_worker_count(selected_gpus, vram_limit) if selected_gpus else 1
+            )
+        else:
+            config.stash[_gpu_slots_key] = int(numproc)
+        config.option.numprocesses = 0
+        config.option.dist = "no"
+
+
+@pytest.hookimpl(tryfirst=True)
+def pytest_runtestloop(session: pytest.Session) -> bool | None:
+    """Intercept the test loop for GPU-parallel execution.
+
+    When --max-vram-gib and -n are both present, run tests as independent
+    subprocesses via the GPU orchestrator instead of the normal pytest loop.
+    Must run before the default pytest loop (tryfirst) so we can return True
+    to prevent the default sequential execution.
+    """
+    config = session.config
+    num_slots = config.stash.get(_gpu_slots_key, None)
+    vram_limit = config.getoption("max_vram_gib", default=None)
+
+    if num_slots is None or vram_limit is None:
+        return None  # serial execution: let normal pytest handle it
+
+    # Delay the parallel-execution imports for the same reason as in pytest_configure (vram_utils needs pynvml).
+    from tests.utils.pytest_parallel_gpu import run_parallel
+    from tests.utils.vram_utils import load_test_meta
+
+    # Collect test IDs from the already-filtered session items
+    test_ids = [item.nodeid for item in session.items]
+    if not test_ids:
+        return True
+
+    meta = load_test_meta()
+    is_stream = config.getoption("capture", default="fd") == "no"
+    gpu_indices = config.stash.get(_gpu_indices_key, None)
+
+    # Forward original CLI args to child pytest subprocesses so they
+    # inherit options like -s, -v, --tb, --durations, --image, etc.
+    extra_args: list[str] = []
+    if is_stream:
+        extra_args.append("-s")
+    verbose = config.getoption("verbose", default=0)
+    if verbose >= 2:
+        extra_args.append("-vv")
+    elif verbose >= 1:
+        extra_args.append("-v")
+    tb_style = config.getoption("tbstyle", default="short")
+    if tb_style and tb_style != "short":
+        extra_args.append(f"--tb={tb_style}")
+    durations = config.getoption("durations", default=None)
+    if durations is not None:
+        extra_args.append(f"--durations={durations}")
+    durations_min = config.getoption("durations_min", default=None)
+    if durations_min is not None:
+        extra_args.append(f"--durations-min={durations_min}")
+    for opt_name, cli_flag in [
+        ("image", "--image"),
+        ("namespace", "--namespace"),
+        ("framework", "--framework"),
+        ("profile", "--profile"),
+    ]:
+        val = config.getoption(opt_name, default=None)
+        if val is not None:
+            extra_args.extend([cli_flag, str(val)])
+    if config.getoption("skip_service_restart", default=None):
+        extra_args.append("--skip-service-restart")
+
+    rc = run_parallel(
+        test_ids=test_ids,
+        meta=meta,
+        max_vram_gib=vram_limit,
+        num_slots=num_slots,
+        gpu_indices=gpu_indices,
+        extra_pytest_args=extra_args or None,
+        stream=is_stream,
+    )
+
+    if rc != 0:
+        session.testsfailed = 1
+    return True  # we handled the test loop
+
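For `-n auto`, the help text says slots are derived from GPU VRAM divided by `max_vram_gib`. One plausible reading is sketched below; the real `auto_worker_count` in `tests/utils/vram_utils.py` is not shown in this diff and may differ. The function name `estimate_slots` and the `total_gib` key are assumptions (only the `index` key appears in the code above).

```python
import math


def estimate_slots(gpus: list[dict], max_vram_gib: float) -> int:
    """Hypothetical slot estimate: how many max-size tests fit across the GPUs.

    Each GPU contributes floor(total_gib / max_vram_gib) slots; at least one
    slot is always returned so the run can still proceed serially.
    """
    if max_vram_gib <= 0:
        raise ValueError("max_vram_gib must be positive")
    slots = sum(math.floor(g["total_gib"] / max_vram_gib) for g in gpus)
    return max(1, slots)


if __name__ == "__main__":
    gpus = [{"index": 0, "total_gib": 48}, {"index": 1, "total_gib": 80}]
    print(estimate_slots(gpus, max_vram_gib=20))
```

Counting per GPU rather than dividing the pooled total avoids scheduling a test into leftover memory that is split across two devices.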
+
 @pytest.fixture()
 def set_ucx_tls_no_mm():
    """Set UCX env defaults for all tests."""
@@ -205,8 +345,10 @@ def _enable_offline_with_mistral_patch():
    except (ImportError, AttributeError):
        return  # transformers version without _patch_mistral_regex — nothing to do

-    # Write a sitecustomize.py so subprocesses also get the patch
-    patch_dir = os.path.join(tempfile.gettempdir(), "dynamo_test_hf_patch")
+    # Write a sitecustomize.py so subprocesses also get the patch.
+    # Use a per-worker dir under xdist to avoid write races.
+    worker_id = os.environ.get("PYTEST_XDIST_WORKER", "main")
+    patch_dir = os.path.join(tempfile.gettempdir(), f"dynamo_test_hf_patch_{worker_id}")
    os.makedirs(patch_dir, exist_ok=True)
    with open(os.path.join(patch_dir, "sitecustomize.py"), "w") as f:
        f.write(
@@ -239,26 +381,33 @@ def _enable_offline_with_mistral_patch():
 def _disable_offline_with_mistral_patch():
    """Undo _enable_offline_with_mistral_patch."""
    os.environ.pop("HF_HUB_OFFLINE", None)
-    patch_dir = os.path.join(tempfile.gettempdir(), "dynamo_test_hf_patch")
+    worker_id = os.environ.get("PYTEST_XDIST_WORKER", "main")
+    patch_dir = os.path.join(tempfile.gettempdir(), f"dynamo_test_hf_patch_{worker_id}")
    pythonpath = os.environ.get("PYTHONPATH", "")
    os.environ["PYTHONPATH"] = pythonpath.replace(f"{patch_dir}:", "").replace(
        patch_dir, ""
    )


+_download_lock_path = os.path.join(tempfile.gettempdir(), "pytest_model_download.lock")
+
+
 @pytest.fixture(scope="session")
 def predownload_models(pytestconfig):
-    """Fixture wrapper around download_models for models used in collected tests"""
-    # Get models from pytest config if available, otherwise fall back to TEST_MODELS
+    """Fixture wrapper around download_models for models used in collected tests.
+
+    Uses a file lock so that under xdist, only one worker downloads at a time
+    and the rest reuse the HuggingFace cache.
+    """
    models = getattr(pytestconfig, "models_to_download", None)
-    if models:
-        logging.info(
-            f"Downloading {len(models)} models needed for collected tests\nModels: {models}"
-        )
-        download_models(model_list=list(models))
-    else:
-        # Fallback to original behavior if extraction failed
-        download_models()
+    with FileLock(_download_lock_path):
+        if models:
+            logging.info(
+                f"Downloading {len(models)} models needed for collected tests\nModels: {models}"
+            )
+            download_models(model_list=list(models))
+        else:
+            download_models()

    _enable_offline_with_mistral_patch()
    yield
@@ -267,21 +416,20 @@ def predownload_models(pytestconfig):

 @pytest.fixture(scope="session")
 def predownload_tokenizers(pytestconfig):
-    """Fixture wrapper around download_models for tokenizers used in collected tests"""
-    # Get models from pytest config if available, otherwise fall back to TEST_MODELS
+    """Fixture wrapper around download_models for tokenizers used in collected tests.
+
+    Uses a file lock so that under xdist, only one worker downloads at a time.
+    """
    models = getattr(pytestconfig, "models_to_download", None)
-    if models:
-        logging.info(
-            f"Downloading tokenizers for {len(models)} models needed for collected tests\nModels: {models}"
-        )
-        download_models(model_list=list(models), ignore_weights=True)
-    else:
-        # Fallback to original behavior if extraction failed
-        download_models(ignore_weights=True)
+    with FileLock(_download_lock_path):
+        if models:
+            logging.info(
+                f"Downloading tokenizers for {len(models)} models needed for collected tests\nModels: {models}"
+            )
+            download_models(model_list=list(models), ignore_weights=True)
+        else:
+            download_models(ignore_weights=True)

-    # Skip redundant HuggingFace API calls in worker subprocesses since
-    # tokenizers are already cached. This avoids flaky timeouts from slow
-    # HF API responses (the RepoInfo fetch still happens even for cached models).
    _enable_offline_with_mistral_patch()
    yield
    _disable_offline_with_mistral_patch()
@@ -337,26 +485,41 @@ def pytest_collection_modifyitems(config, items):
                if _item_has_marker(item, marker_name):
                    item.add_marker(skip)

-    # Skip tests that exceed --max-vram-gib
+    # Deselect tests based on --max-vram-gib:
+    #   - Tests whose profiled VRAM exceeds the limit are removed
+    #   - Tests WITHOUT a VRAM marker are also removed (unknown VRAM = unsafe)
+    # Using deselect (not skip) so they never reach the xdist scheduler.
    vram_limit = config.getoption("--max-vram-gib", default=None)
    if vram_limit is not None:
-        skip_vram = pytest.mark.skip(
-            reason=f"requires more than {vram_limit} GiB VRAM (--max-vram-gib={vram_limit})"
-        )
+        keep = []
+        deselected = []
        for item in items:
-            vram_mark = item.get_closest_marker("max_vram_gib")
-            if vram_mark and vram_mark.args and vram_mark.args[0] > vram_limit:
-                item.add_marker(skip_vram)
+            vram_mark = item.get_closest_marker("profiled_vram_gib")
+            if vram_mark and vram_mark.args and vram_mark.args[0] <= vram_limit:
+                keep.append(item)
+            else:
+                deselected.append(item)
+        if deselected:
+            config.hook.pytest_deselected(items=deselected)
+            items[:] = keep
+
+    # Write test metadata for the GPU orchestrator to read.
+    if vram_limit is not None:
+        # Delayed import: see the vram_utils/pynvml note in pytest_configure.
+        from tests.utils.vram_utils import print_gpu_plan, write_test_meta
+
+        write_test_meta(items)

-    # --dry-run: print run/skip breakdown and exit without executing tests
+    # --dry-run: print run/skip breakdown and exit without executing tests.
+    # At this point, items only contains tests that passed --max-vram-gib
+    # filtering (deselected items were already removed above).
    if config.getoption("--dry-run", default=False):
        would_run = []
        would_skip = []
-        unmarked = []
        for item in items:
-            vram_mark = item.get_closest_marker("max_vram_gib")
+            vram_mark = item.get_closest_marker("profiled_vram_gib")
            vram_val = vram_mark.args[0] if vram_mark and vram_mark.args else None
-            name = item.nodeid.split("::", 1)[1] if "::" in item.nodeid else item.nodeid
+            name = item.nodeid

            skip_reasons = []
            for marker in item.iter_markers("skip"):
@@ -365,39 +528,28 @@ def pytest_collection_modifyitems(config, items):
                    reason = marker.args[0]
                skip_reasons.append(reason or "no reason given")

-            vram_skipped = (
-                vram_limit is not None
-                and vram_val is not None
-                and vram_val > vram_limit
-            )
-            if vram_skipped:
-                skip_reasons.insert(0, f"{vram_val} GiB > {vram_limit} GiB VRAM limit")
-
            if skip_reasons:
                would_skip.append((name, vram_val, skip_reasons))
-            elif vram_val is not None:
-                would_run.append((name, vram_val))
            else:
-                unmarked.append(name)
+                would_run.append((name, vram_val))

        print(f"\n{'=' * 60}")
-        print(
-            f"--max-vram-gib={vram_limit or 'not set'}  |  {len(items)} tests selected"
-        )
+        print(f"--max-vram-gib={vram_limit or 'not set'}  |  {len(items)} tests")
        print(f"{'=' * 60}")
        if would_run:
            print(f"\nWould RUN ({len(would_run)}):")
            for name, gib in would_run:
-                print(f"  {name}  ({gib} GiB)")
+                gib_str = f"  ({gib} GiB)" if gib is not None else ""
+                print(f"  {name}{gib_str}")
        if would_skip:
            print(f"\nWould SKIP ({len(would_skip)}):")
            for name, vram_val, reasons in would_skip:
                vram_str = f"  ({vram_val} GiB)" if vram_val is not None else ""
                print(f"  {name}{vram_str}  -- {'; '.join(reasons)}")
-        if unmarked:
-            print(f"\nNo VRAM marker — always run ({len(unmarked)}):")
-            for name in unmarked:
-                print(f"  {name}")
+
+        gpus = config.stash.get(_gpu_parallel_gpus_key, None)
+        if gpus and vram_limit is not None:
+            print_gpu_plan(gpus, vram_limit, would_run)
        print()
        items.clear()
        return
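
Review note: the `--max-vram-gib` hunk above keeps a test only when it has a `profiled_vram_gib` marker at or below the limit, and deselects everything else, including unmarked tests. A minimal standalone sketch of that rule follows. The function name `partition_by_vram` and the `(test_id, vram)` tuple shape are hypothetical and not part of this commit; the real code works on pytest items.

```python
from typing import Iterable, Optional


def partition_by_vram(
    tests: Iterable[tuple[str, Optional[float]]], vram_limit: float
) -> tuple[list[str], list[str]]:
    """Split (test_id, profiled_vram_gib) pairs into (keep, deselected).

    Hypothetical helper for illustration; None stands for "no VRAM marker".
    """
    keep: list[str] = []
    deselected: list[str] = []
    for test_id, vram_gib in tests:
        # Unknown VRAM is treated as unsafe, so unmarked tests are dropped too.
        if vram_gib is not None and vram_gib <= vram_limit:
            keep.append(test_id)
        else:
            deselected.append(test_id)
    return keep, deselected
```

Note that a marker value of `0.0` (as used for configs with no GPU model load) still counts as a known value and is kept.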

--- a/tests/frontend/test_vllm.py
+++ b/tests/frontend/test_vllm.py
@@ -99,9 +99,16 @@ class VllmWorkerProcess(ManagedProcess):
            "32768",
        ]

-        gpu_util = os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")
-        if gpu_util:
-            command.extend(["--gpu-memory-utilization", gpu_util])
+        kv_bytes = os.environ.get("_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES")
+        if kv_bytes:
+            command.extend(
+                [
+                    "--kv-cache-memory-bytes",
+                    kv_bytes,
+                    "--gpu-memory-utilization",
+                    "0.01",
+                ]
+            )

        env = os.environ.copy()
        env["DYN_LOG"] = "debug"
@@ -229,7 +236,8 @@ def _validate_chat_response(response: requests.Response) -> Dict[str, Any]:


 # Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning_effort
-@pytest.mark.max_vram_gib(20.4)  # observed peak 18.5 GiB (+10% safety)
+@pytest.mark.profiled_vram_gib(20.4)  # actual profiled peak
+# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
 @pytest.mark.timeout(300)  # 3x observed ~70s wall time, rounded up
 @pytest.mark.post_merge
 def test_reasoning_effort(
@@ -297,7 +305,8 @@ def test_reasoning_effort(


 # Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling
-@pytest.mark.max_vram_gib(20.4)  # observed peak 18.5 GiB (+10% safety)
+@pytest.mark.profiled_vram_gib(20.4)  # actual profiled peak
+# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
 @pytest.mark.timeout(113)  # 3x observed 37.4s wall time
 @pytest.mark.post_merge
 def test_tool_calling(
@@ -341,7 +350,8 @@ def test_tool_calling(


 # Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling_second_round
-@pytest.mark.max_vram_gib(20.4)  # observed peak 18.5 GiB (+10% safety)
+@pytest.mark.profiled_vram_gib(20.4)  # actual profiled peak
+# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
 @pytest.mark.timeout(115)  # 3x observed 38.1s wall time
 @pytest.mark.nightly
 def test_tool_calling_second_round(
@@ -407,7 +417,8 @@ def test_tool_calling_second_round(


 # Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning
-@pytest.mark.max_vram_gib(20.4)  # observed peak 18.5 GiB (+10% safety)
+@pytest.mark.profiled_vram_gib(20.4)  # actual profiled peak
+# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
 @pytest.mark.timeout(131)  # 3x observed 43.4s wall time
 @pytest.mark.nightly
 def test_reasoning(request, start_services: ServicePorts, predownload_models) -> None:
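
Review note: the same KV-cache override pattern appears in `launch.sh`, `multi_node_tp_headless.sh`, and `VllmWorkerProcess`. When `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` is set, the command passes `--kv-cache-memory-bytes` and sets `--gpu-memory-utilization` to `0.01`. Otherwise it falls back to the utilization fraction, if one is given. A rough Python sketch of that branch is below. The helper name `vllm_gpu_mem_args` and its signature are assumptions for illustration; the flag names and the env var name are taken from the diff.

```python
from typing import Mapping, Optional

# Env var name as it appears in this commit.
ENV_KV_BYTES = "_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES"


def vllm_gpu_mem_args(
    env: Mapping[str, str], default_util: Optional[str] = None
) -> list[str]:
    """Return CLI args for GPU memory control (hypothetical helper)."""
    kv_bytes = env.get(ENV_KV_BYTES, "")
    if kv_bytes:
        # Explicit KV cache size; utilization is set to the 0.01 used in the diff.
        return [
            "--kv-cache-memory-bytes",
            kv_bytes,
            "--gpu-memory-utilization",
            "0.01",
        ]
    if default_util:
        # No override: fall back to the fraction-based setting.
        return ["--gpu-memory-utilization", default_util]
    return []
```

Returning a list rather than a single string avoids relying on shell-style word splitting when the args are appended to a command.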

--- a/tests/serve/common.py
+++ b/tests/serve/common.py
@@ -18,6 +18,7 @@ from tests.conftest import ServicePorts
 from tests.utils.client import send_request
 from tests.utils.constants import DefaultPort
 from tests.utils.engine_process import EngineConfig, EngineProcess
+from tests.utils.port_utils import allocate_port, deallocate_port

 DEFAULT_TIMEOUT = 10

@@ -93,6 +94,7 @@ def run_serve_deployment(

        # Ensure EngineProcess health checks hit the correct frontend port.
        config = dataclasses.replace(config, frontend_port=dynamic_frontend_port)
+
    else:
        # Backward compat: infer from config/extra_env if no explicit ports are passed.
        dynamic_frontend_port = int(config.frontend_port)
@@ -108,76 +110,86 @@ def run_serve_deployment(
            int(merged_env.get("DYN_SYSTEM_PORT2") or DefaultPort.SYSTEM2.value),
        ]

-    with EngineProcess.from_script(
-        config, request, extra_env=merged_env
-    ) as server_process:
-        for _payload in config.request_payloads:
-            logger.info("TESTING: Payload: %s", _payload.__class__.__name__)
-
-            # Make a per-iteration copy so tests can safely override ports/fields
-            # without mutating shared config instances across parametrized cases.
-            payload = deepcopy(_payload)
-            # inject model
-            if hasattr(payload, "with_model"):
-                payload = payload.with_model(config.model)
-
-            # Default behavior: requests go to the frontend port, except metrics which target
-            # worker system ports (mapped from DefaultPort -> per-test ports).
-            if getattr(payload, "endpoint", "") == "/metrics":
-                if payload.port == DefaultPort.SYSTEM1.value:
-                    if len(dynamic_system_ports) < 1:
-                        raise RuntimeError(
-                            "Payload targets SYSTEM_PORT1 but no system ports were provided "
-                            f"(payload={payload.__class__.__name__})"
-                        )
-                    payload.port = dynamic_system_ports[0]
-                elif payload.port == DefaultPort.SYSTEM2.value:
-                    if len(dynamic_system_ports) < 2:
-                        raise RuntimeError(
-                            "Payload targets SYSTEM_PORT2 but only 1 system port was provided "
-                            f"(payload={payload.__class__.__name__})"
-                        )
-                    payload.port = dynamic_system_ports[1]
-            else:
-                payload.port = dynamic_frontend_port
-
-            # Optional extra system ports for specialized payloads (e.g. LoRA control-plane APIs).
-            # BasePayload always defines `system_ports` (usually empty); map defaults
-            # (SYSTEM_PORT1/2) to per-test system ports when present.
-            if payload.system_ports:
-                mapped_system_ports: list[int] = []
-                for p in payload.system_ports:
-                    if p == DefaultPort.SYSTEM1.value:
+    # Disagg scripts need a unique bootstrap port so parallel runs don't collide.
+    disagg_bootstrap_port: int | None = None
+    if config.script_name and "disagg" in config.script_name:
+        disagg_bootstrap_port = allocate_port(12000)
+        merged_env["DYN_DISAGG_BOOTSTRAP_PORT"] = str(disagg_bootstrap_port)
+
+    try:
+        with EngineProcess.from_script(
+            config, request, extra_env=merged_env
+        ) as server_process:
+            for _payload in config.request_payloads:
+                logger.info("TESTING: Payload: %s", _payload.__class__.__name__)
+
+                # Make a per-iteration copy so tests can safely override ports/fields
+                # without mutating shared config instances across parametrized cases.
+                payload = deepcopy(_payload)
+                # inject model
+                if hasattr(payload, "with_model"):
+                    payload = payload.with_model(config.model)
+
+                # Default behavior: requests go to the frontend port, except metrics which target
+                # worker system ports (mapped from DefaultPort -> per-test ports).
+                if getattr(payload, "endpoint", "") == "/metrics":
+                    if payload.port == DefaultPort.SYSTEM1.value:
                        if len(dynamic_system_ports) < 1:
                            raise RuntimeError(
-                                "Payload.system_ports includes SYSTEM_PORT1 but no system ports were provided "
+                                "Payload targets SYSTEM_PORT1 but no system ports were provided "
                                f"(payload={payload.__class__.__name__})"
                            )
-                        mapped_system_ports.append(dynamic_system_ports[0])
-                    elif p == DefaultPort.SYSTEM2.value:
+                        payload.port = dynamic_system_ports[0]
+                    elif payload.port == DefaultPort.SYSTEM2.value:
                        if len(dynamic_system_ports) < 2:
                            raise RuntimeError(
-                                "Payload.system_ports includes SYSTEM_PORT2 but only 1 system port was provided "
+                                "Payload targets SYSTEM_PORT2 but only 1 system port was provided "
                                f"(payload={payload.__class__.__name__})"
                            )
-                        mapped_system_ports.append(dynamic_system_ports[1])
-                    else:
-                        mapped_system_ports.append(p)
-                payload.system_ports = mapped_system_ports
-
-            for _ in range(payload.repeat_count):
-                response = send_request(
-                    url=payload.url(),
-                    payload=payload.body,
-                    timeout=payload.timeout,
-                    method=payload.method,
-                    stream=payload.http_stream,
-                )
-                server_process.check_response(payload, response)
-
-            # Call final_validation if the payload has one (e.g., CachedTokensChatPayload)
-            if hasattr(payload, "final_validation"):
-                payload.final_validation()
+                        payload.port = dynamic_system_ports[1]
+                else:
+                    payload.port = dynamic_frontend_port
+
+                # Optional extra system ports for specialized payloads (e.g. LoRA control-plane APIs).
+                # BasePayload always defines `system_ports` (usually empty); map defaults
+                # (SYSTEM_PORT1/2) to per-test system ports when present.
+                if payload.system_ports:
+                    mapped_system_ports: list[int] = []
+                    for p in payload.system_ports:
+                        if p == DefaultPort.SYSTEM1.value:
+                            if len(dynamic_system_ports) < 1:
+                                raise RuntimeError(
+                                    "Payload.system_ports includes SYSTEM_PORT1 but no system ports were provided "
+                                    f"(payload={payload.__class__.__name__})"
+                                )
+                            mapped_system_ports.append(dynamic_system_ports[0])
+                        elif p == DefaultPort.SYSTEM2.value:
+                            if len(dynamic_system_ports) < 2:
+                                raise RuntimeError(
+                                    "Payload.system_ports includes SYSTEM_PORT2 but only 1 system port was provided "
+                                    f"(payload={payload.__class__.__name__})"
+                                )
+                            mapped_system_ports.append(dynamic_system_ports[1])
+                        else:
+                            mapped_system_ports.append(p)
+                    payload.system_ports = mapped_system_ports
+
+                for _ in range(payload.repeat_count):
+                    response = send_request(
+                        url=payload.url(),
+                        payload=payload.body,
+                        timeout=payload.timeout,
+                        method=payload.method,
+                        stream=payload.http_stream,
+                    )
+                    server_process.check_response(payload, response)
+
+                # Call final_validation if the payload has one (e.g., CachedTokensChatPayload)
+                if hasattr(payload, "final_validation"):
+                    payload.final_validation()
+    finally:
+        if disagg_bootstrap_port is not None:
+            deallocate_port(disagg_bootstrap_port)


 def params_with_model_mark(configs: Mapping[str, EngineConfig]):

--- a/tests/serve/launch/multi_node_tp_headless.sh
+++ b/tests/serve/launch/multi_node_tp_headless.sh
@@ -12,7 +12,11 @@ trap 'echo "Cleaning up..."; kill 0' EXIT

 MODEL="${MODEL:-Qwen/Qwen3-0.6B}"

-GPU_MEM_FRACTION="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}"
+KV_BYTES="${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"
+GPU_MEM_ARGS=""
+if [[ -n "$KV_BYTES" ]]; then
+    GPU_MEM_ARGS="--kv-cache-memory-bytes $KV_BYTES --gpu-memory-utilization 0.01"
+fi

 echo "Starting Dynamo frontend..."
 python3 -m dynamo.frontend &
@@ -25,7 +29,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
  --node-rank 0 \
  --master-addr 127.0.0.1 \
  --enforce-eager \
-  ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
+  $GPU_MEM_ARGS &

 echo "Starting dynamo.vllm headless worker (TP=2, nnodes=2, node-rank=1, GPU 1)..."
 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
@@ -35,7 +39,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
  --node-rank 1 \
  --master-addr 127.0.0.1 \
  --enforce-eager \
-  ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} \
+  $GPU_MEM_ARGS \
  --headless &

 wait
--- a/tests/serve/test_sglang.py
+++ b/tests/serve/test_sglang.py
@@ -45,9 +45,9 @@ sglang_dir = os.environ.get("SGLANG_DIR") or os.path.join(

 # SGLang test configurations
 # NOTE: pytest.mark.gpu_1 tests take ~167s (2m 47s) total to run sequentially (with models pre-cached)
-# TODO: Now that these tests use dynamic ports and each config has a max_vram_gib marker,
+# TODO: Now that these tests use dynamic ports and most configs have a profiled_vram_gib marker,
 # optimize the runtime by bin-packing multiple engine deployments in parallel on the same GPU.
-# A future collector/launcher can sum max_vram_gib values to decide how many tests fit
+# A future collector/launcher can sum profiled_vram_gib values to decide how many tests fit
 # concurrently without exceeding available VRAM.
 sglang_configs = {
    "aggregated": SGLangConfig(
@@ -58,8 +58,13 @@ sglang_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(6.1),  # observed peak 5.6 GiB (+10% safety)
-            pytest.mark.timeout(240),  # profiled 34.4s on A6000
+            pytest.mark.profiled_vram_gib(
+                3.7
+            ),  # actual peak at recommended token count
+            pytest.mark.requested_sglang_kv_tokens(
+                96
+            ),  # KV cache cap (2x safety over min=48)
+            pytest.mark.timeout(195),  # profiled 33s on RTX 6000 Ada
            pytest.mark.pre_merge,
        ],
        model="Qwen/Qwen3-0.6B",
@@ -160,7 +165,8 @@ sglang_configs = {
        script_name="template_verifier.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.timeout(240),  # profiled 11.7s on A6000 (no GPU model load)
+            pytest.mark.profiled_vram_gib(0.0),  # no GPU model load
+            pytest.mark.timeout(120),  # profiled 12s on RTX 6000 Ada
            pytest.mark.pre_merge,
            pytest.mark.nightly,
        ],
@@ -175,8 +181,8 @@ sglang_configs = {
    ),
    # NOTE: Pack all workers on 1 GPU for lower CI resource requirements.
    # NOTE: multimodal_epd.sh uses explicit --mem-fraction-static via DYN_ENCODE_GPU_MEM
-    # / DYN_WORKER_GPU_MEM env vars, so _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE has no effect.
-    # Regardless of fraction overrides, the workers combined consistently use ~23.6 GiB.
+    # / DYN_WORKER_GPU_MEM env vars. The profiler override is distributed proportionally,
+    # but the workers combined consistently use ~23.6 GiB regardless of fraction overrides.
    "multimodal_e_pd_qwen": SGLangConfig(
        # E/P/D architecture: Encode, Prefill, Decode workers all on GPU 0
        name="multimodal_e_pd_qwen",
@@ -184,16 +190,15 @@ sglang_configs = {
        script_name="multimodal_epd.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(13.3),  # observed peak 12.1 GiB (+10% safety)
-            pytest.mark.timeout(360),  # profiled 31.0s on A6000
+            # No profiled_vram_gib: uses hard-coded --mem-fraction-static via
+            # DYN_ENCODE_GPU_MEM / DYN_WORKER_GPU_MEM, so VRAM scales with GPU size.
+            pytest.mark.timeout(210),  # profiled 35s on RTX 6000 Ada
            pytest.mark.pre_merge,
        ],
        model="Qwen/Qwen3-VL-2B-Instruct",
        script_args=["--model", "Qwen/Qwen3-VL-2B-Instruct", "--single-gpu"],
        timeout=360,
        env={
-            "DYN_ENCODE_WORKER_GPU": "0",
-            "DYN_WORKER_GPU": "0",
            "DYN_ENCODE_GPU_MEM": "0.1",
            "DYN_WORKER_GPU_MEM": "0.4",
        },
@@ -226,8 +231,11 @@ sglang_configs = {
        script_name="multimodal_disagg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(17.7),  # observed peak 16.1 GiB (+10% safety)
-            pytest.mark.timeout(360),  # profiled 36.0s on A6000
+            pytest.mark.profiled_vram_gib(16.1),  # actual profiled peak
+            pytest.mark.requested_sglang_kv_tokens(
+                1024
+            ),  # KV cache cap (2x safety over min=512)
+            pytest.mark.timeout(222),  # profiled 37s on RTX 6000 Ada
            pytest.mark.pre_merge,
        ],
        model="Qwen/Qwen3-VL-2B-Instruct",
@@ -261,8 +269,13 @@ sglang_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(21.0),  # observed peak 19.1 GiB (+10% safety)
-            pytest.mark.timeout(300),  # profiled 41.3s on A6000
+            pytest.mark.profiled_vram_gib(
+                19.1
+            ),  # actual peak at recommended token count
+            pytest.mark.requested_sglang_kv_tokens(
+                768
+            ),  # KV cache cap (2x safety over min=384)
+            pytest.mark.timeout(182),  # profiled 30s on RTX 6000 Ada
            pytest.mark.pre_merge,
            pytest.mark.nightly,
        ],
@@ -300,8 +313,13 @@ sglang_configs = {
        script_name="agg_embed.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(12.1),  # observed peak 11.0 GiB (+10% safety)
-            pytest.mark.timeout(270),  # profiled 25.5s on A6000
+            pytest.mark.profiled_vram_gib(
+                9.8
+            ),  # actual peak at recommended token count
+            pytest.mark.requested_sglang_kv_tokens(
+                128
+            ),  # KV cache cap (2x safety over min=64)
+            pytest.mark.timeout(147),  # profiled 24s on RTX 6000 Ada
            pytest.mark.pre_merge,
            pytest.mark.nightly,
        ],
@@ -338,8 +356,13 @@ sglang_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(16.2),  # observed peak 14.8 GiB (+10% safety)
-            pytest.mark.timeout(420),  # profiled 73s on A6000
+            pytest.mark.profiled_vram_gib(
+                14.7
+            ),  # actual peak at recommended token count
+            pytest.mark.requested_sglang_kv_tokens(
+                64
+            ),  # KV cache cap (2x safety over min=32)
+            pytest.mark.timeout(341),  # profiled 57s on RTX 6000 Ada
            pytest.mark.post_merge,
        ],
        model="deepseek-ai/deepseek-llm-7b-base",
@@ -362,7 +385,7 @@ sglang_configs = {
            pytest.mark.post_merge,
            pytest.mark.timeout(240),
            pytest.mark.skip(reason="DYN-2261"),
-            # TODO: profile to get max_vram (currently skipped)
+            # TODO: profile once DYN-2261 is fixed (uses agg.sh, profiler works)
        ],
        model="Qwen/Qwen3-0.6B",
        env={"DYN_ENABLE_ANTHROPIC_API": "1"},
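
Review note: the TODO at the top of this file describes summing `profiled_vram_gib` values to decide how many tests fit on one GPU at a time. One simple way to do that is first-fit decreasing bin packing, sketched below. This is only an illustration of the idea, not the scheduler this commit ships (`run_parallel` is not shown in this part of the diff). The function name `pack_first_fit` and the notion of sequential "waves" are assumptions.

```python
def pack_first_fit(tests: dict[str, float], gpu_vram_gib: float) -> list[list[str]]:
    """Group tests into waves whose summed VRAM fits on one GPU.

    First-fit decreasing: place the largest tests first, each into the first
    wave that still has room. Hypothetical helper for illustration.
    """
    waves: list[list[str]] = []
    free: list[float] = []  # remaining GiB per wave, parallel to `waves`
    ordered = sorted(tests.items(), key=lambda kv: kv[1], reverse=True)
    for name, gib in ordered:
        if gib > gpu_vram_gib:
            raise ValueError(f"{name} needs {gib} GiB, GPU has {gpu_vram_gib} GiB")
        for i, room in enumerate(free):
            if gib <= room:
                waves[i].append(name)
                free[i] = room - gib
                break
        else:
            # No existing wave has room, so open a new one.
            waves.append([name])
            free.append(gpu_vram_gib - gib)
    return waves
```

A real scheduler would likely also reserve headroom per GPU and start tests as memory frees up rather than waiting for a whole wave to finish.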

--- a/tests/serve/test_vllm.py
+++ b/tests/serve/test_vllm.py
@@ -54,9 +54,9 @@ vllm_dir = os.environ.get("VLLM_DIR") or os.path.join(

 # vLLM test configurations
 # NOTE: pytest.mark.gpu_1 tests take ~5.5 minutes total to run sequentially (with models pre-cached)
-# TODO: Now that these tests use dynamic ports and each config has a max_vram_gib marker,
+# TODO: Now that these tests use dynamic ports and most configs have VRAM markers,
 # optimize the runtime by bin-packing multiple engine deployments in parallel on the same GPU.
-# A future collector/launcher can sum max_vram_gib values to decide how many tests fit
+# A future collector/launcher can sum profiled_vram_gib values to decide how many tests fit
 # concurrently without exceeding available VRAM.
 vllm_configs = {
    "aggregated": VLLMConfig(
@@ -65,8 +65,13 @@ vllm_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(8.6),  # observed peak 7.8 GiB (+10% safety)
-            pytest.mark.timeout(300),  # ~7x observed 42.2s; old value before profiling
+            pytest.mark.profiled_vram_gib(3.8),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_119_388_000
+            ),  # KV cache cap (2x safety over min=559_693_824)
+            pytest.mark.timeout(
+                360
+            ),  # ~8.5x observed 42.2s; bumped for GPU-parallel headroom
            pytest.mark.pre_merge,
        ],
        model="Qwen/Qwen3-0.6B",
@@ -93,7 +98,10 @@ vllm_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(8.6),  # observed peak 7.8 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(3.8),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_119_388_000
+            ),  # KV cache cap (2x safety over min=559_693_824)
            pytest.mark.timeout(120),  # ~5x observed 24.3s; CI machines are slower
            pytest.mark.post_merge,
        ],
@@ -122,7 +130,10 @@ vllm_configs = {
        marks=[
            pytest.mark.lmcache,
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(8.1),  # observed peak 7.4 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(3.8),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_119_388_000
+            ),  # KV cache cap (2x safety over min=559_693_824)
            pytest.mark.timeout(360),  # ~7x observed 49.0s; old value before profiling
            pytest.mark.pre_merge,
            pytest.mark.skipif(
@@ -145,7 +156,10 @@ vllm_configs = {
        marks=[
            pytest.mark.lmcache,
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(8.1),  # observed peak 7.4 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(3.8),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_119_388_000
+            ),  # KV cache cap (2x safety over min=559_693_824)
            pytest.mark.timeout(360),  # ~7x observed 49.3s; old value before profiling
            pytest.mark.pre_merge,
            pytest.mark.skipif(
@@ -170,8 +184,13 @@ vllm_configs = {
        script_name="agg_request_planes.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(8.1),  # observed peak 7.3 GiB (+10% safety)
-            pytest.mark.timeout(300),  # ~7x observed 43.0s; old value before profiling
+            pytest.mark.profiled_vram_gib(3.8),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_119_388_000
+            ),  # KV cache cap (2x safety over min=559_693_824)
+            pytest.mark.timeout(
+                360
+            ),  # ~8x observed 43.0s; bumped for GPU-parallel headroom
            pytest.mark.pre_merge,
        ],
        model="Qwen/Qwen3-0.6B",
@@ -187,8 +206,13 @@ vllm_configs = {
        script_name="agg_request_planes.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(8.1),  # observed peak 7.3 GiB (+10% safety)
-            pytest.mark.timeout(300),  # ~7x observed 42.3s; old value before profiling
+            pytest.mark.profiled_vram_gib(3.8),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_119_388_000
+            ),  # KV cache cap (2x safety over min=559_693_824)
+            pytest.mark.timeout(
+                360
+            ),  # ~8.5x observed 42.3s; bumped for GPU-parallel headroom
            pytest.mark.pre_merge,
        ],
        model="Qwen/Qwen3-0.6B",
@@ -299,13 +323,17 @@ vllm_configs = {
        ],
    ),
    # NOTE: Pack all workers on 1 GPU for lower CI resource requirements
+    # NOTE: disagg_multimodal_e_pd.sh uses explicit --gpu-memory-utilization via
+    # DYN_ENCODE_GPU_MEM / DYN_PD_GPU_MEM env vars in single-GPU mode.
+    # PD worker honors build_gpu_mem_args for parallel execution.
    "multimodal_e_pd_qwen": VLLMConfig(
        name="multimodal_e_pd_qwen",
        directory=vllm_dir,
        script_name="disagg_multimodal_e_pd.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(24.6),  # observed peak 22.3 GiB (+10% safety)
+            # No profiled_vram_gib / requested_vllm_kv_cache_bytes: single-GPU mode
+            # uses hardcoded fractions (encode=0.1, PD=0.7) that scale with GPU size.
            pytest.mark.timeout(340),  # ~5x observed 68.4s; 2B model loads slower on CI
            pytest.mark.pre_merge,
        ],
@@ -339,7 +367,10 @@ vllm_configs = {
        # post_merge because needs real NIXL not stub
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(10.2),  # observed peak 9.3 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(9.6),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_710_490_000
+            ),  # KV cache cap (2x safety over min=855_244_800)
            pytest.mark.timeout(220),  # ~5x observed 43.7s; 2B model loads slower on CI
            pytest.mark.post_merge,
        ],
@@ -373,21 +404,25 @@ vllm_configs = {
    # NOTE: disagg_multimodal_epd.sh uses --kv-cache-memory-bytes=512MB for P/D
    # workers. Per vLLM CacheConfig, kv_cache_memory_bytes (when not-None) ignores
    # gpu_memory_utilization (ref: https://docs.vllm.ai/en/stable/api/vllm/config/cache/),
-    # so _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE has no effect. Regardless of GPU_MEM
+    # so --gpu-memory-utilization overrides have no effect. Regardless of GPU_MEM
    # fractions (0.1/0.4/0.4), the 3 workers combined consistently use ~17.6 GiB
    # total on this GPU.
+    # NOTE: disagg_multimodal_epd.sh uses explicit --gpu-memory-utilization via
+    # DYN_ENCODE_GPU_MEM / DYN_PREFILL_GPU_MEM / DYN_DECODE_GPU_MEM env vars.
+    # P/D workers honor build_gpu_mem_args for parallel execution.
    "multimodal_disagg_qwen": VLLMConfig(
        name="multimodal_disagg_qwen",
        directory=vllm_dir,
        script_name="disagg_multimodal_epd.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(19.4),  # observed peak 17.6 GiB (+10% safety)
+            # No profiled_vram_gib / requested_vllm_kv_cache_bytes: single-GPU mode
+            # uses hardcoded fractions via DYN_*_GPU_MEM that scale with GPU size.
            pytest.mark.pre_merge,
        ],
        model="Qwen/Qwen3-VL-2B-Instruct",
        script_args=["--model", "Qwen/Qwen3-VL-2B-Instruct", "--single-gpu"],
-        timeout=360,
+        timeout=300,
        env={
            "DYN_ENCODE_WORKER_GPU": "0",
            "DYN_PREFILL_WORKER_GPU": "0",
@@ -421,7 +456,10 @@ vllm_configs = {
        script_name="agg_multimodal.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(21.6),  # observed peak 19.6 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(19.9),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                922_354_000
+            ),  # KV cache cap (2x safety over min=461_176_832)
            pytest.mark.timeout(
                360
            ),  # ~7x observed 50.0s; 7B model loads ~48s on CI (A10G/L4)
@@ -455,7 +493,10 @@ vllm_configs = {
        script_name="agg_multimodal.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(18.9),  # observed peak 17.1 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(14.9),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                922_354_000
+            ),  # KV cache cap (2x safety over min=461_176_832)
            pytest.mark.timeout(
                300
            ),  # ~7x observed 42.7s; 7B model loads ~48s on CI (A10G/L4)
@@ -703,7 +744,10 @@ vllm_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(21.9),  # observed peak 19.9 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(18.3),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                4_074_898_000
+            ),  # KV cache cap (2x safety over min=2_037_448_704)
            pytest.mark.timeout(
                420
            ),  # 7B model loads ~48s on CI (A10G/L4) vs ~15s locally
@@ -742,7 +786,10 @@ vllm_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(8.6),  # observed peak 7.8 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(3.8),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_119_388_000
+            ),  # KV cache cap (2x safety over min=559_693_824)
            pytest.mark.timeout(110),  # ~5x observed 22.3s; CI machines are slower
            pytest.mark.pre_merge,
        ],
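
Each `requested_vllm_kv_cache_bytes` mark above is annotated as a 2x safety factor over a profiled minimum. The sketch below reproduces that arithmetic. The helper name `kv_cache_cap_bytes` and the round-up-to-1,000-bytes step are assumptions inferred from the four value pairs in this diff; they are not code from the commit.

```python
def kv_cache_cap_bytes(min_bytes: int, safety: int = 2, granularity: int = 1000) -> int:
    """Hypothetical helper: scale the profiled minimum, then round up.

    Integer arithmetic only, so large byte counts stay exact.
    """
    scaled = min_bytes * safety
    # Ceiling division to the next multiple of `granularity`.
    return ((scaled + granularity - 1) // granularity) * granularity


# Profiled minimums quoted in the mark comments of this diff.
for min_bytes in (559_693_824, 855_244_800, 461_176_832, 2_037_448_704):
    print(f"min={min_bytes:_} -> cap={kv_cache_cap_bytes(min_bytes):_}")
```

Under these assumptions the helper yields the same caps as the marks, so a newly profiled test could derive its cap the same way.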

--- a/tests/utils/profile_pytest.py
+++ b/tests/utils/profile_pytest.py
--- a/tests/utils/pytest_parallel_gpu.py
+++ b/tests/utils/pytest_parallel_gpu.py
--- a/tests/utils/test_mock_gpu_alloc.py
+++ b/tests/utils/test_mock_gpu_alloc.py
@@ -32,27 +32,27 @@ ALLOC_MIB = 4096  # 4 GiB
 @pytest.mark.gpu_1
 @pytest.mark.timeout(30)
 def test_mock_4gb_gpu_alloc():
-    """Allocate 4 GiB of GPU VRAM, hold 2s, release. Honors _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE."""
+    """Allocate 4 GiB of GPU VRAM, hold 2s, release. Honors _PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES."""
    if not torch.cuda.is_available():
        pytest.skip("CUDA not available")

    device = 0
    total_mib = torch.cuda.get_device_properties(device).total_memory / (1024 * 1024)

-    gpu_util = os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")
-    if gpu_util is not None:
-        cap_mib = total_mib * float(gpu_util)
+    kv_bytes_str = os.environ.get("_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES")
+    if kv_bytes_str is not None:
+        cap_mib = int(kv_bytes_str) / (1024 * 1024)
        logger.info(
-            "_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=%.2f -> cap %.0f MiB (%.1f GiB) of %.0f MiB total",
-            float(gpu_util),
+            "_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES=%s -> cap %.0f MiB (%.1f GiB) of %.0f MiB total",
+            kv_bytes_str,
            cap_mib,
            cap_mib / 1024,
            total_mib,
        )
        if ALLOC_MIB > cap_mib:
            raise RuntimeError(
-                f"Requested {ALLOC_MIB} MiB exceeds _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE "
-                f"cap of {cap_mib:.0f} MiB ({gpu_util})"
+                f"Requested {ALLOC_MIB} MiB exceeds KV cache cap "
+                f"of {cap_mib:.0f} MiB ({kv_bytes_str} bytes)"
            )

    num_elements = (ALLOC_MIB * 1024 * 1024) // 4

--- a/tests/utils/vram_utils.py
+++ b/tests/utils/vram_utils.py
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""GPU VRAM utilities for parallel test execution.
+
+Functions:
+    detect_gpus()                  Enumerate GPUs via pynvml
+    auto_worker_count(gpus, limit) Calculate slot count for -n auto
+    write_test_meta(items)         Serialize profiled/requested vram + timeout
+    load_test_meta()               Read the serialized test metadata
+    print_gpu_plan(gpus, limit, would_run)  Dry-run GPU plan summary
+
+Usage:
+    # Sequential (filter only)
+    pytest --max-vram-gib=10 -m "gpu_1 and vllm" tests/serve/
+
+    # Parallel (VRAM-aware scheduling)
+    pytest --max-vram-gib=10 -n auto -m "gpu_1 and vllm" tests/serve/
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import os
+import tempfile
+
+import pynvml
+
+_logger = logging.getLogger(__name__)
+
+# When 2+ tests run concurrently, reserve 15% of GPU VRAM for CUDA context
+# overhead across processes.  A single test gets the full GPU (0% margin).
+VRAM_MULTI_PROC_MARGIN = 0.15
+
+_TEST_META_FILENAME = "pytest_gpu_parallel_test_meta.json"
+
+
+def detect_gpus() -> list[dict]:
+    """Return a list of dicts with 'index', 'name', 'total_mib' per GPU.
+
+    Uses pynvml (already a dependency via profile_pytest.py).
+    Returns an empty list if NVML cannot be initialized (e.g. no NVIDIA
+    driver) or if no GPUs are present.
+    """
+    try:
+        pynvml.nvmlInit()
+    except pynvml.NVMLError:
+        return []
+    try:
+        count = pynvml.nvmlDeviceGetCount()
+        gpus = []
+        for i in range(count):
+            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
+            name = pynvml.nvmlDeviceGetName(handle)
+            if isinstance(name, bytes):  # older pynvml releases return bytes
+                name = name.decode()
+            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
+            gpus.append(
+                {
+                    "index": i,
+                    "name": name,
+                    "total_mib": mem.total // (1024 * 1024),
+                }
+            )
+        return gpus
+    finally:
+        pynvml.nvmlShutdown()
+
+
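+def auto_worker_count(
+    gpus: list[dict],
+    vram_limit: float,
+    test_profiled_gibs: list[float] | None = None,
+) -> int:
+    """Calculate slot count for -n auto.
+
+    Slots per GPU = floor(budget / divisor), with a minimum of 1, where budget
+    is the smallest GPU's VRAM less VRAM_MULTI_PROC_MARGIN. The divisor is the
+    smallest nonzero profiled test size (if provided) to maximize parallelism,
+    falling back to vram_limit when no test sizes are available.
+    Returns len(gpus) * slots per GPU.
+    """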
+    if not gpus or vram_limit <= 0:
+        return len(gpus) or 1
+    min_gpu_gib = min(g["total_mib"] for g in gpus) / 1024.0
+    budget_gib = min_gpu_gib * (1.0 - VRAM_MULTI_PROC_MARGIN)
+    divisor = vram_limit
+    if test_profiled_gibs:
+        nonzero = [g for g in test_profiled_gibs if g > 0]
+        if nonzero:
+            divisor = min(nonzero)
+    workers_per_gpu = max(1, int(budget_gib / divisor)) if divisor > 0 else 1
+    return len(gpus) * workers_per_gpu
+
+
+def write_test_meta(items, dest_dir: str | None = None) -> None:
+    """Serialize VRAM, KV cache, timeout, and skip markers to JSON.
+
+    Called from pytest_collection_modifyitems so the GPU orchestrator can
+    read test metadata without re-collecting.
+    """
+    test_meta: dict[str, dict] = {}
+    for item in items:
+        meta: dict = {}
+        profiled_mark = item.get_closest_marker("profiled_vram_gib")
+        if profiled_mark and profiled_mark.args:
+            meta["profiled_vram_gib"] = profiled_mark.args[0]
+        kv_bytes_mark = item.get_closest_marker("requested_vllm_kv_cache_bytes")
+        if kv_bytes_mark and kv_bytes_mark.args:
+            meta["requested_vllm_kv_cache_bytes"] = kv_bytes_mark.args[0]
+        timeout_mark = item.get_closest_marker("timeout")
+        if timeout_mark and timeout_mark.args:
+            meta["timeout"] = timeout_mark.args[0]
+        kv_tokens_mark = item.get_closest_marker("requested_sglang_kv_tokens")
+        if kv_tokens_mark and kv_tokens_mark.args:
+            meta["requested_sglang_kv_tokens"] = kv_tokens_mark.args[0]
+        skip_mark = item.get_closest_marker("skip")
+        if skip_mark:
+            reason = skip_mark.kwargs.get("reason", "")
+            if not reason and skip_mark.args:
+                reason = skip_mark.args[0]
+            meta["skip_reason"] = reason or "skipped"
+        if meta:
+            test_meta[item.nodeid] = meta
+    if test_meta:
+        path = os.path.join(dest_dir or tempfile.gettempdir(), _TEST_META_FILENAME)
+        with open(path, "w") as f:
+            json.dump(test_meta, f)
+
+
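+def load_test_meta(dest_dir: str | None = None) -> dict[str, dict]:
+    """Load the nodeid -> {profiled_vram_gib, timeout, ...} map.
+
+    Pass the same dest_dir given to write_test_meta. Returns {} if the file
+    is missing or is not valid JSON.
+    """
+    path = os.path.join(dest_dir or tempfile.gettempdir(), _TEST_META_FILENAME)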
+    try:
+        with open(path) as f:
+            return json.load(f)
+    except (FileNotFoundError, json.JSONDecodeError):
+        return {}
+
+
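+def print_gpu_plan(
+    gpus: list[dict], vram_limit: float, would_run: list[tuple[str, float | None]]
+) -> None:
+    """Print the GPU-parallel plan section for --dry-run output."""
+    if not gpus:  # min() below would raise ValueError on an empty list
+        print("\nGPU-Parallel Plan: no GPUs detected")
+        return
+    min_gpu_gib = min(g["total_mib"] for g in gpus) / 1024.0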
+    budget_gib = min_gpu_gib * (1.0 - VRAM_MULTI_PROC_MARGIN)
+    profiled_gibs = [gib for _, gib in would_run if gib is not None and gib > 0]
+    min_test_gib = min(profiled_gibs) if profiled_gibs else vram_limit
+    auto_slots = max(1, int(budget_gib / min_test_gib)) if min_test_gib > 0 else 1
+
+    print(f"\n{'=' * 60}")
+    print("GPU-Parallel Plan")
+    print(f"{'=' * 60}")
+    for gpu in gpus:
+        gib = gpu["total_mib"] / 1024
+        print(f"  GPU {gpu['index']}: {gpu['name']} ({gib:.1f} GiB)")
+    print(f"\n  Usable VRAM: {budget_gib:.0f} GiB")
+    print("\n  Run options:")
+    print("    (no -n)  : sequential, 1 test at a time")
+    print(
+        f"    -n auto  : up to {auto_slots} slots per GPU "
+        f"({budget_gib:.0f} / {min_test_gib:.0f} GiB smallest test)"
+    )
+    print(f"    -n N     : N concurrent slots across {len(gpus)} GPU(s)")
+    print("\n  Usage:")
+    print(
+        f"    pytest --max-vram-gib={vram_limit:.0f} -n {auto_slots} "
+        f'-m "gpu_1 and vllm" tests/serve/'
+    )
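
The slot arithmetic shared by `auto_worker_count` and `print_gpu_plan` is easier to follow with concrete numbers. The sketch below is an illustrative standalone copy of the per-GPU calculation, not the module itself. The function name `slots_per_gpu` and the 24 GiB GPU size are assumptions chosen for the example; the test sizes come from the `profiled_vram_gib` marks in this diff.

```python
VRAM_MULTI_PROC_MARGIN = 0.15  # same value as the constant in vram_utils.py


def slots_per_gpu(gpu_total_mib: int, divisor_gib: float) -> int:
    """Illustrative copy of the per-GPU slot arithmetic (hypothetical name)."""
    # Reserve the multi-process margin, then see how many tests of the
    # given size fit into what is left. Always allow at least one slot.
    budget_gib = (gpu_total_mib / 1024.0) * (1.0 - VRAM_MULTI_PROC_MARGIN)
    return max(1, int(budget_gib / divisor_gib))


# Assumed 24 GiB GPU: budget = 24 * 0.85 = 20.4 GiB.
small = slots_per_gpu(24 * 1024, 3.8)   # smallest profiled test in this diff
large = slots_per_gpu(24 * 1024, 19.9)  # largest profiled test in this diff
print(small, large)
```

With these inputs the 3.8 GiB tests pack five to a GPU, while a 19.9 GiB test gets a single slot. Because the divisor is the smallest profiled test, the slot count is an upper bound on concurrency: larger tests still have to fit within the remaining budget.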