feat: GPU-parallel test runner with VRAM-aware scheduling (#7560)

Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>

Unverified Commit 6dc85fbc authored Apr 02, 2026 by Keiven C Committed by GitHub Apr 02, 2026
20 changed files
--- a/examples/backends/vllm/mm_router_worker/launch.sh
+++ b/examples/backends/vllm/mm_router_worker/launch.sh
@@ -20,9 +20,17 @@ MODEL="${MODEL:-Qwen/Qwen3-VL-8B-Instruct}"
 NAMESPACE="${NAMESPACE:-dynamo}"
 HTTP_PORT="${HTTP_PORT:-8000}"
 BLOCK_SIZE="${BLOCK_SIZE:-16}"            # Must match vLLM backend KV block size
-GPU_MEMORY_UTILIZATION="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-${GPU_MEMORY_UTILIZATION:-0.85}}"
+GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.85}"
 MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"

+# KV cache override for parallel-safe GPU memory control
+KV_BYTES="${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"
+if [[ -n "$KV_BYTES" ]]; then
+    GPU_MEM_ARGS="--kv-cache-memory-bytes $KV_BYTES --gpu-memory-utilization 0.01"
+else
+    GPU_MEM_ARGS="--gpu-memory-utilization ${GPU_MEMORY_UTILIZATION}"
+fi
+
 NATS_SERVER="${NATS_SERVER:-nats://127.0.0.1:4222}"
 ETCD_ENDPOINTS="${ETCD_ENDPOINTS:-http://127.0.0.1:2379}"

@@ -121,7 +129,7 @@ env "${COMMON_ENV[@]}" \
        --enable-multimodal \
        --block-size "${BLOCK_SIZE}" \
        --enforce-eager \
-        --gpu-memory-utilization "${GPU_MEMORY_UTILIZATION}" \
+        $GPU_MEM_ARGS \
        --max-model-len "${MAX_MODEL_LEN}" \
        --served-model-name "${MODEL}__internal" \
        ${VLLM_EXTRA_ARGS} &
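
The selection logic in the hunk above can be sketched in isolation. This is a minimal illustration, not code from the commit: the helper name `select_gpu_mem_args` is hypothetical, and reading the result into a Bash array is an assumption made here so that each flag and value stays a separate word without relying on unquoted expansion.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the commit): mirrors the if/else added to launch.sh.
select_gpu_mem_args() {
    local kv_bytes="${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"
    local util="${GPU_MEMORY_UTILIZATION:-0.85}"
    if [[ -n "$kv_bytes" ]]; then
        # Absolute KV cache cap; the 0.01 fraction is copied from the diff.
        echo "--kv-cache-memory-bytes $kv_bytes --gpu-memory-utilization 0.01"
    else
        # No override: fall back to the fraction-based default.
        echo "--gpu-memory-utilization $util"
    fi
}

# Split the printed flags into an array, one word per element.
read -r -a GPU_MEM_ARGS <<< "$(select_gpu_mem_args)"
printf '%s\n' "${GPU_MEM_ARGS[@]}"
```

With an array, the launch line would pass `"${GPU_MEM_ARGS[@]}"` instead of the unquoted `$GPU_MEM_ARGS`; both work for these flag values because none of them contain spaces.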

--- a/examples/common/gpu_utils.md
+++ b/examples/common/gpu_utils.md
-# GPU Memory Parameters by Engine
+# GPU Memory Control

-How vLLM, sglang, and TensorRT-LLM interpret memory-related parameters, and how
-to estimate total GPU VRAM usage for each.
+How vLLM, SGLang, and TensorRT-LLM allocate GPU memory, and how we override
+it for deterministic parallel test execution.

 ---

-## Quick Reference
-
-| Parameter | vLLM | sglang | TensorRT-LLM |
-|---|---|---|---|
-| Memory fraction | `--gpu-memory-utilization` | `--mem-fraction-static` | `free_gpu_memory_fraction` (YAML/override) |
-| Fraction base | Total VRAM | Total VRAM | Free VRAM (after model load) |
-| Default fraction | 0.90 | 0.90 | 0.90 |
-| Max sequence length | `--max-model-len` | `--context-length` | `max_seq_len` (YAML/override) |
-| KV cache size override | `--kv-cache-memory-bytes` | N/A | `max_gpu_total_bytes` (broken in 1.3.0rc5) |
-
----
-
-## 1. vLLM
-
-### How `--gpu-memory-utilization` works
-
-This is a fraction of **total** GPU VRAM. The engine budgets everything within
-this limit:
-
-```
-budget = total_vram * gpu_memory_utilization
-
-KV cache = budget - model_weights - peak_activations - framework_overhead
-```
-
-At startup, vLLM profiles actual model weight and activation memory, then
-pre-allocates the remaining budget as KV cache blocks. The KV pool size is fixed
-for the lifetime of the engine.
-
-### How `--max-model-len` works
-
-Sets the maximum total sequence length (input + output tokens). Longer sequences
-require more KV cache per request. If the requested `max-model-len` needs more
-KV cache than the budget allows, vLLM errors at startup:
-
-```
-ValueError: ... X GiB KV cache is needed, which is larger than the available
-KV cache memory (Y GiB). ...
-```
-
-Reducing `--max-model-len` is the most effective way to reduce VRAM when the
-model fits but the KV cache doesn't.
-
-### How `--kv-cache-memory-bytes` works
-
-When set, this overrides the automatic KV cache sizing from
-`gpu-memory-utilization`. The engine allocates exactly this many bytes for KV
-cache regardless of the fraction. This means `gpu-memory-utilization` still
-controls the *overall* VRAM budget (and thus whether the model fits), but the
-KV cache portion is pinned to the explicit byte value.
-
-Consequence for profiling: if a script uses `--kv-cache-memory-bytes`,
-changing `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` (which maps to
-`--gpu-memory-utilization`) won't change the KV cache size, only the leftover
-headroom for activations and overhead.
-
-### Estimating total GPU usage
-
-```
-total_vram ≈ model_weights + kv_cache + activations + overhead
-
-model_weights ≈ num_params * bytes_per_param
-                (e.g. 7B * 2 bytes for BF16 ≈ 14 GiB)
+## Why absolute caps, not fractions

-kv_cache_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
-                     (the factor of 2 is for K and V tensors)
+Memory fractions (`--gpu-memory-utilization`, `--mem-fraction-static`) are
+unreliable for parallel / CI workloads:

-kv_cache_total = kv_cache_per_token * max_model_len * max_concurrent_seqs
+- **Non-deterministic** — same fraction produces different KV cache sizes
+  depending on what else is on the GPU at init time.
+- **Profiling race** — concurrent engines each see "nearly all memory free",
+  allocate based on that, and OOM.
+- **Not portable** — a fraction tuned for 48 GiB is wrong on 24 or 80 GiB.
+- **Different semantics** — vLLM/SGLang use fraction of *total* VRAM;
+  TensorRT-LLM uses fraction of *free* VRAM after model load.

-overhead ≈ engine-dependent (auto-computed by estimate_worker_vram):
-           vllm:   1.2 + 1.0 * sqrt(params_b) GiB  (0.6B≈2.0, 8B≈4.0)
-           sglang: 1.5 + 1.0 * sqrt(params_b) GiB  (0.6B≈2.3, 8B≈4.3)
-           trtllm: 2.0 + 1.2 * sqrt(params_b) GiB  (0.6B≈2.9, 8B≈5.4)
-```
+Instead, we use **absolute KV cache caps**:

-Rule of thumb: set `gpu-memory-utilization` so that
-`total_vram * fraction >= model_weights + 2 GiB`. The rest becomes KV cache.
+| Engine | Deterministic override | Env var |
+|--------|----------------------|---------|
+| vLLM | `--kv-cache-memory-bytes N` | `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` |
+| SGLang | `--max-total-tokens N` | `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` |
+| TensorRT-LLM | *(future TODO)* | — |

 ---

-## 2. sglang
-
-### How `--mem-fraction-static` works
-
-Like vLLM, this is a fraction of **total** GPU VRAM:
-
-```
-budget = total_vram * mem_fraction_static
-
-KV cache pool = budget - model_weights
-```
-
-The budget covers model weights and the KV cache pool. Activations and CUDA
-graph buffers are allocated *outside* this budget from the remaining VRAM.
-This is slightly different from vLLM (which includes activations in the budget).
-
-sglang recommends keeping 5-8 GiB free for activations and overhead. If you
-see OOM errors, decrease `--mem-fraction-static` by 0.01-0.05 increments.
-
-### How `--context-length` and `--max-running-requests` work
-
-Unlike vLLM (where `--max-model-len` directly affects KV cache sizing), sglang's
-`--context-length` and `--max-running-requests` do **not** affect KV cache
-allocation. The KV cache pool is sized entirely from `--mem-fraction-static`:
-
-```
-kv_cache_pool = total_vram * mem_fraction_static - model_weights
-```
-
-Profiling confirmed this: changing `--context-length` from 512 to 40960 produced
-identical `max_total_num_tokens` values (269,136 on a 48 GiB GPU at fraction 0.95).
-
-These flags only affect **request scheduling**:
-- `--context-length` caps the per-request token usage from the KV pool
-- `--max-running-requests` limits concurrent request slots (allocated from
-  memory outside the `--mem-fraction-static` budget)
-
-Setting `--max-running-requests` too high at high fractions can cause OOM because
-the request slot pool competes for the small amount of memory left after KV cache
-allocation.
-
-### Estimating total GPU usage
-
-```
-total_vram ≈ model_weights + kv_cache_pool + activations_and_overhead
-
-kv_cache_pool = total_vram * mem_fraction_static - model_weights
+## Quick Reference

-activations_and_overhead ≈ 1-2 GiB for small models (0.6B-4B)
-                           ~3-5 GiB for larger models (7B+)
-  (CUDA context, graphs, request pools — allocated outside mem_fraction_static)
-```
+| | vLLM | SGLang | TensorRT-LLM |
+|---|---|---|---|
+| Fraction flag | `--gpu-memory-utilization` | `--mem-fraction-static` | `free_gpu_memory_fraction` |
+| Fraction base | Total VRAM | Total VRAM | Free VRAM (post-load) |
+| Default | 0.90 | 0.90 | 0.90 |
+| Max seq len | `--max-model-len` | `--context-length` | `max_seq_len` |
+| KV cache override | `--kv-cache-memory-bytes` | `--max-total-tokens` | *(broken in 1.3.0rc5)* |

 ---

-## 3. TensorRT-LLM
-
-### How `free_gpu_memory_fraction` works
-
-This is a fraction of **free** VRAM (not total). The engine:
-
-1. Loads model weights and builds the TRT engine (fixed cost).
-2. Queries remaining free GPU memory.
-3. Allocates `free_memory * free_gpu_memory_fraction` for the KV cache pool.
-
-```
-kv_cache = free_vram_after_model_load * free_gpu_memory_fraction
-```
-
-This means the same fraction yields different absolute KV cache sizes depending
-on how much VRAM the model consumed. A 5 GiB model on a 48 GiB GPU leaves
-~43 GiB free; fraction=0.24 gives ~10 GiB KV cache. A 30 GiB model leaves
-~18 GiB free; fraction=0.24 gives only ~4 GiB.
-
-Set via YAML config, CLI, or env var:
-
-```bash
--override-engine-args '{"kv_cache_config":{"free_gpu_memory_fraction": 0.24}}'
-DYN_TRTLLM_OVERRIDE_ENGINE_ARGS='{"kv_cache_config":{"free_gpu_memory_fraction": 0.24}}'
-```
-
-### How `max_seq_len` works
-
-Maximum total sequence length. Defaults to the model's native context.
-Sequences exceeding this limit are rejected at runtime.
-
-**VRAM impact: none (PyTorch backend).** Reducing max_seq_len from 40960 to
-2048 had zero effect on total VRAM or KV cache size in testing (Qwen3-0.6B,
-trtllm 1.3.0rc5). The PyTorch backend does not pre-allocate internal buffers
-proportional to max_seq_len; KV cache size is determined solely by
-`free_gpu_memory_fraction`. This differs from vLLM/sglang where reducing
-context length measurably reduces memory.
-
-Override via:
-
-```bash
--override-engine-args '{"max_seq_len": 4096}'
-```
+## Per-Engine Notes

-### Override gotcha: sub-dict replacement
+### vLLM

-Overriding any field inside `kv_cache_config` **replaces the entire sub-dict**.
-If your YAML has `enable_block_reuse: true` and you override only
-`free_gpu_memory_fraction`, you lose `enable_block_reuse`. Always re-include
-all fields you need:
+`--gpu-memory-utilization` sets a budget as a fraction of total VRAM.
+KV cache = budget - weights - activations - overhead. Pool is fixed at startup.

-```json
-{"kv_cache_config": {"free_gpu_memory_fraction": 0.15, "enable_block_reuse": true}}
-```
+`--kv-cache-memory-bytes` overrides automatic sizing and **skips memory
+profiling** ([PR #21489]). The KV cache is pinned to the exact byte value —
+no profiling race, no CUDAGraph estimation errors, safe for concurrent
+instances ([#10643]). When set, `--gpu-memory-utilization` only affects
+headroom for activations, not KV cache size.

-### How `max_num_tokens` works
+`--max-model-len` caps sequence length. Reducing it is the fastest way to
+cut VRAM when the model fits but KV cache doesn't.

-Maximum batched input tokens per iteration. Primarily a throughput knob.
+[PR #21489]: https://github.com/vllm-project/vllm/pull/21489
+[#10643]: https://github.com/vllm-project/vllm/issues/10643

-**VRAM impact: none.** Reducing from 8192 → 256 had no measurable effect on
-total VRAM (41,643 vs 41,465 MiB — within noise; the slight *increase* is
-because smaller activation footprint lets the fraction claim marginally more
-KV cache).
+### SGLang

-### `max_gpu_total_bytes` (broken)
+`--mem-fraction-static` sets a budget as a fraction of total VRAM.
+KV cache pool = budget - weights. Activations and CUDA graph buffers are
+*outside* this budget (unlike vLLM).

-Intended as an absolute byte cap for KV cache. As of trtllm 1.3.0rc5, this
-field is **ignored**. Setting 5 GiB cap with `free_gpu_memory_fraction=0.95`
-still allocated ~42 GiB of KV cache. Setting `free_gpu_memory_fraction=0.0`
-with only `max_gpu_total_bytes` causes `"Impossible to fit any sequence in
-kvCache"`. Do not rely on this field.
+`--max-total-tokens` caps the KV token pool directly, regardless of fraction.
+When set, the token cap is the binding constraint.

-### Override precedence
+`--context-length` and `--max-running-requests` affect request scheduling
+only — they do **not** change KV cache allocation.

-```
--override-engine-args JSON  >  --extra-engine-args YAML  >  CLI flags
-```
+### TensorRT-LLM

-The `DYN_TRTLLM_OVERRIDE_ENGINE_ARGS` env var is equivalent to
-`--override-engine-args` and avoids shell quoting issues with scripts whose
-arg parsers consume unknown flags before passing `"$@"`.
+`free_gpu_memory_fraction` is a fraction of **free** VRAM after model load.
+Set via YAML or `--override-engine-args '{"kv_cache_config":{"free_gpu_memory_fraction": 0.24}}'`.

-### Estimating total GPU usage
-
-```
-total_vram ≈ model_weights + engine_overhead + kv_cache
-
-model_weights ≈ num_params * bytes_per_param / tensor_parallel_size
-engine_overhead ≈ 2.0 + 1.2 * sqrt(params_b) GiB  (CUDA context + TRT buffers + activations)
-kv_cache = free_vram_after_model_load * free_gpu_memory_fraction
-```
-
-Engine overhead is auto-computed by `estimate_worker_vram` when called with the
-`trtllm` engine name.  Examples: 0.6B → 2.9 GiB, 8B → 5.4 GiB, 30B → 8.6 GiB.
-
-### Empirical validation (Qwen3-0.6B, RTX 6000 Ada 48 GiB, trtllm 1.3.0rc5)
-
-Controlled test: single worker via agg.sh, one override at a time.
-
-| # | Override | Total VRAM | KV Cache | Tokens |
-|---|---------|-----------|----------|--------|
-| 1 | Baseline (YAML frac=0.85) | 41,465 MiB | 38.04 GiB | 356,160 |
-| 2 | `free_gpu_memory_fraction=0.15` | 9,383 MiB | 6.71 GiB | 62,848 |
-| 3 | `max_num_tokens=256` | 41,643 MiB | 38.26 GiB | 358,208 |
-| 4 | `max_seq_len=4096` | 41,469 MiB | 38.05 GiB | 356,192 |
-| 5 | `max_seq_len=2048` | 41,469 MiB | 38.05 GiB | 356,192 |
-| 6 | seq=4096 + frac=0.15 | 9,383 MiB | 6.71 GiB | 62,848 |
-| 7 | tokens=256 + seq=4096 + frac=0.15 | 9,377 MiB | 6.75 GiB | 63,200 |
-
-**Conclusion:** `free_gpu_memory_fraction` is the **sole effective knob** for
-trtllm VRAM control. Neither `max_seq_len` nor `max_num_tokens` reduce memory.
-Combined overrides (test 7) produce no additional benefit over fraction alone
-(test 2).
+Deterministic KV cache control via `build_gpu_mem_args` is a future TODO.

 ---

-## Why vLLM/sglang fractions are NOT interchangeable with TensorRT-LLM
-
-Consider wanting 10 GiB of KV cache on a 48 GiB GPU with a 5 GiB model:
-
-| Engine | Fraction meaning | Calculation | Result |
-|---|---|---|---|
-| vLLM | 10/48 = 0.21 of total | `48 * 0.21 = 10 GiB` budget (minus model = 5 GiB KV) | Wrong — need higher fraction |
-| sglang | Same as vLLM | Same math | Same problem |
-| TensorRT-LLM | 10/43 = 0.23 of free | `43 * 0.23 = 10 GiB` KV cache | Correct |
-
-For vLLM/sglang, you actually need `(model + kv) / total = (5 + 10) / 48 = 0.31`
-to get 10 GiB of KV cache with a 5 GiB model.
+## `build_gpu_mem_args` and Env Vars

-The helper functions in `gpu_utils.sh` handle these differences:
-- `gpu_gb_to_total_fraction`: for vLLM/sglang (fraction of total VRAM)
-- `gpu_gb_to_free_fraction`: for TensorRT-LLM (fraction of free VRAM)
-- `gpu_worker_fraction <engine> <total_gib> <kv_gib>`: converts estimated GiB
-  into the engine-appropriate fraction (total for vllm/sglang, free for trtllm).
-
-Launch scripts use `build_gpu_mem_args` which calls these internally:
+Launch scripts source `gpu_utils.sh` and call `build_gpu_mem_args` to pick
+up env-var overrides during profiling and parallel execution:

 ```bash
-GPU_MEM_FRACTION=$(build_gpu_mem_args trtllm --model "$MODEL" --max-model-len "$SEQ_LEN" --max-num-seqs "$CONCURRENCY")
-```
-
----
-
-## KV Cache Memory Per Token
+source "$SCRIPT_DIR/../../../common/gpu_utils.sh"

-The formula for KV cache memory per token is the same across all engines:
+GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
+python -m dynamo.vllm --model "$MODEL" $GPU_MEM_ARGS &

-```
-kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_element
+GPU_MEM_ARGS=$(build_gpu_mem_args sglang)
+python -m dynamo.sglang --model-path "$MODEL" $GPU_MEM_ARGS &
 ```

-| Model | Layers | KV Heads | Head Dim | Dtype | Per Token |
-|---|---|---|---|---|---|
-| Qwen3-0.6B | 28 | 8 | 128 | BF16 | 112 KiB |
-| Llama-3.1-8B | 32 | 8 | 128 | BF16 | 128 KiB |
-| Llama-3.1-70B | 80 | 8 | 128 | BF16 | 320 KiB |
-| Qwen2.5-VL-7B | 28 | 4 | 128 | BF16 | 56 KiB |
+When the env var is set, `build_gpu_mem_args` returns the corresponding flag.
+Otherwise it returns empty and the engine uses its default allocation.

-To estimate KV cache for a given context length:
-
-```
-kv_cache_gib = kv_bytes_per_token * max_model_len * max_concurrent_seqs / (1024^3)
-```
-
----
+| Env var | Engine | CLI flag produced |
+|---------|--------|-------------------|
+| `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` | vLLM | `--kv-cache-memory-bytes N --gpu-memory-utilization 0.01` |
+| `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` | SGLang | `--max-total-tokens N` |

-## `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE`
+For multi-worker single-GPU scripts, pass `--workers-per-gpu N` to divide
+the allocation: `build_gpu_mem_args vllm --workers-per-gpu 2`.

-Environment variable used by Dynamo's VRAM profiler to binary-search the minimum
-memory fraction a script needs.
+**Profiler** (`profile_pytest.py`): binary-searches the KV cap to find the
+minimum passing value, applies a 2x safety factor, outputs pytest markers
+(`@pytest.mark.requested_vllm_kv_cache_bytes(N)` or
+`@pytest.mark.requested_sglang_kv_tokens(N)`).

-- Maps to `--gpu-memory-utilization` in vLLM and `--mem-fraction-static` in sglang.
-- For TensorRT-LLM, maps to `kv_cache_config.free_gpu_memory_fraction` via
-  `--override-engine-args`.
-- Launch scripts use `build_gpu_mem_args` to compute the default fraction;
-  the override bypasses the estimator and splits the raw value between workers.
-- Scripts that use `--kv-cache-memory-bytes` (vLLM) bypass the fraction-based KV
-  cache sizing, making the profiler's fraction override ineffective for KV cache.
-  Those scripts should warn when `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` is set.
+**Scheduler** (`pytest_parallel_gpu.py`): reads the markers at runtime and
+sets the env var per-test. See `tests/README.md` for details.
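
The profiler behaviour described above (binary search for the minimum passing KV cap, then a 2x safety factor) can be sketched as follows. This is an illustration under stated assumptions, not the logic of `profile_pytest.py`: the predicate `passes_with_cap`, the threshold `MIN_NEEDED`, and the search bounds are all hypothetical stand-ins, and the search assumes that passing is monotonic in the cap and that the upper bound passes.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for "run the test with this KV cap".
# The test is assumed to pass once the cap reaches MIN_NEEDED (illustrative value).
MIN_NEEDED=700
passes_with_cap() {
    (( $1 >= MIN_NEEDED ))
}

# find_min_cap <lo> <hi>: smallest cap in [lo, hi] for which the predicate passes.
find_min_cap() {
    local lo="$1" hi="$2" mid
    while (( lo < hi )); do
        mid=$(( (lo + hi) / 2 ))
        if passes_with_cap "$mid"; then
            hi="$mid"            # mid passes: the minimum is at or below mid
        else
            lo=$(( mid + 1 ))    # mid fails: the minimum is above mid
        fi
    done
    echo "$lo"
}

min_cap=$(find_min_cap 1 4096)
requested=$(( min_cap * 2 ))     # 2x safety factor, as described in the doc
echo "min=$min_cap requested=$requested"
```

In this sketch the doubled value is what would be recorded in a marker such as `@pytest.mark.requested_vllm_kv_cache_bytes(N)`; the units (bytes or tokens) depend on the engine.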
--- a/examples/common/gpu_utils.sh
+++ b/examples/common/gpu_utils.sh
@@ -2,470 +2,62 @@
 # SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 #
-# Shared GPU utility functions for launch scripts.
+# Shared GPU utility functions for launch scripts (source, don't execute).
 #
-# CLI:
-#   ./gpu_utils.sh <engine> --model <name> [options...]   Print GPU fraction
-#   ./gpu_utils.sh --self-test                            Run self-test suite
-#
-# Source:
+# Usage:
 #   source "$(dirname "$(readlink -f "$0")")/../common/gpu_utils.sh"
 #   # or with SCRIPT_DIR already set:
 #   source "$SCRIPT_DIR/../common/gpu_utils.sh"
 #
-# Functions (all return via stdout — no hidden globals):
-#   build_gpu_mem_args <engine> <model> ...     Prints fraction (or empty)
-#   get_model_params <model>                    Prints "pb wb layers kvh hd"
-#   estimate_worker_vram <model> ...            Prints "w_gib kv_gib oh_gib total_gib"
-#   gpu_worker_fraction <engine> <total> <kv>   Prints engine-appropriate fraction
-#   gpu_peak_to_engine_fraction <engine> <peak> Prints fraction (subtracts engine overhead)
-#   gpu_gb_to_total_fraction <gib>              Prints fraction of TOTAL VRAM (vLLM/sglang)
-#   gpu_gb_to_free_fraction <gib>               Prints fraction of FREE VRAM (TensorRT-LLM)
-
-# build_gpu_mem_args <engine> [options...]
-#
-# Prints the computed memory fraction to stdout (empty line if none).
-# Callers capture with:  GPU_MEM_FRACTION=$(build_gpu_mem_args ...)
+# Functions (all return via stdout):
+#   build_gpu_mem_args <engine> [--workers-per-gpu N]
+#       Returns engine-specific CLI args for GPU memory control based on
+#       environment variable overrides. Empty if no overrides.
 #
-# Priority:
-#   1. _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE  (profiler binary search)
-#   2. Engine flag passed to this function  (user already chose a value)
-#   3. estimate_worker_vram + gpu_worker_fraction  (model architecture)
-#   4. Empty  (let engine use its own default)
-#
-# Options (each flag accepts engine-specific aliases):
-#   --model NAME                 Model name (required).
-#     aliases: --model-path        (sglang, trtllm)
-#   --max-model-len N            Max tokens per sequence (default: 4096).
-#     aliases: --context-length    (sglang)
-#              --max-seq-len       (trtllm)
-#   --max-num-seqs N             Concurrent sequences to budget for (default: 2).
-#     aliases: --max-running-requests (sglang)
-#              --max-batch-size       (trtllm)
-#   --gpu-memory-utilization F   User override (vllm flag name).  Skipped when empty.
-#   --mem-fraction-static F      User override (sglang flag name).
-#   --workers-per-gpu N          Divide the fraction by N (for shared-GPU disagg).
+#       vLLM:   _PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES   → --kv-cache-memory-bytes N --gpu-memory-utilization 0.01
+#       SGLang: _PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS → --max-total-tokens N
 #
 # Usage:
-#   # Simple single-worker (agg.sh)
-#   GPU_MEM_FRACTION=$(build_gpu_mem_args vllm \
-#       --model "$MODEL" --max-model-len "$MAX_MODEL_LEN" --max-num-seqs "$MAX_CONCURRENT_SEQS")
-#   python -m dynamo.vllm --model "$MODEL" \
-#       ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
-#
-#   # Two workers sharing one GPU (disagg_same_gpu.sh)
-#   GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL" --workers-per-gpu 2)
-#   python -m dynamo.vllm ... --gpu-memory-utilization "${GPU_MEM_FRACTION}" &
+#   GPU_MEM_ARGS=$(build_gpu_mem_args sglang)
+#   python -m dynamo.sglang --model-path "$MODEL" $GPU_MEM_ARGS &
 #
-#   # sglang
-#   GPU_MEM_FRACTION=$(build_gpu_mem_args sglang --model "$MODEL" --workers-per-gpu 2)
-#   python -m dynamo.sglang ... --mem-fraction-static "${GPU_MEM_FRACTION}" &
-#
-#   # trtllm (fraction goes into JSON, not CLI)
-#   GPU_MEM_FRACTION=$(build_gpu_mem_args trtllm --model "$MODEL" --workers-per-gpu 2)
-#   OVERRIDE_ARGS=(--override-engine-args "{\"kv_cache_config\":{\"free_gpu_memory_fraction\":${GPU_MEM_FRACTION}}}")
+#   GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
+#   python -m dynamo.vllm --model "$MODEL" $GPU_MEM_ARGS &
 build_gpu_mem_args() {
-    local engine="${1:?usage: build_gpu_mem_args <engine> --model <name> [options...]}"
+    local engine="${1:?usage: build_gpu_mem_args <engine> [--workers-per-gpu N]}"
    shift

-    local model=""
-    local max_model_len="4096"
-    local max_seqs="2"
    local workers_per_gpu=1
-    local user_frac=""
-
    while [[ $# -gt 0 ]]; do
        case "$1" in
-            --model|--model-path)
-                                model="$2";           shift 2 ;;
-            --max-model-len|--context-length|--max-seq-len)
-                                max_model_len="$2";   shift 2 ;;
-            --max-num-seqs|--max-running-requests|--max-batch-size)
-                                max_seqs="$2";        shift 2 ;;
-            --gpu-memory-utilization|--mem-fraction-static)
-                                user_frac="$2";       shift 2 ;;
-            --workers-per-gpu)  workers_per_gpu="$2"; shift 2 ;;
+            --workers-per-gpu) workers_per_gpu="$2"; shift 2 ;;
            *) echo "build_gpu_mem_args: unknown option '$1'" >&2; return 1 ;;
        esac
    done

-    if [[ -z "$model" ]]; then
-        echo "build_gpu_mem_args: --model is required" >&2
-        return 1
-    fi
-
-    local frac=""
-    local from_estimator=false
-    local est_w="" est_kv="" est_oh="" est_total=""
-    if [[ -n "${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}" ]]; then
-        frac="$_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"
-    elif [[ -n "$user_frac" ]]; then
-        frac="$user_frac"
-    elif read -r est_w est_kv est_oh est_total <<< "$(estimate_worker_vram "$model" "$max_model_len" "$max_seqs" "$engine" 2>/dev/null)" && [[ -n "$est_total" ]]; then
-        frac=$(gpu_worker_fraction "$engine" "$est_total" "$est_kv")
-        from_estimator=true
-    fi
-
-    # --workers-per-gpu divides profiler/user/estimator results only
-    if [[ -n "$frac" && "$workers_per_gpu" -gt 1 ]]; then
-        frac=$(awk -v f="$frac" -v n="$workers_per_gpu" 'BEGIN { printf "%.2f", f / n }')
-    fi
-
-    echo "$frac"
-}
-
-# get_model_params <model_name>
-#
-# Prints "params_b weight_bytes layers kv_heads head_dim" to stdout.
-# Returns 1 (prints nothing) if the model is unknown.
-#
-# Fields:
-#   params_b       Total parameters in billions (all experts for MoE)
-#   weight_bytes   Bytes per weight element (2=BF16/FP16, 1=FP8)
-#   layers         Number of transformer layers
-#   kv_heads       Number of key-value heads (GQA groups)
-#   head_dim       Dimension per attention head
-#
-# KV cache is assumed BF16 (2 bytes per element) regardless of weight dtype,
-# since FP8 KV cache (--kv-cache-dtype fp8) is opt-in and not the default.
-#
-# To add a model:
-#   1. Find config.json at  https://huggingface.co/<model>/raw/main/config.json
-#      For VL/multimodal models, architecture params are under text_config.
-#   2. Map fields:
-#        layers    ← num_hidden_layers
-#        kv_heads  ← num_key_value_heads
-#        head_dim  ← head_dim  (or hidden_size / num_attention_heads)
-#   3. params_b: total parameter count in billions.  Derive from:
-#        - safetensors file size:  size_bytes / weight_bytes / 1e9
-#          (single file: ls -l model.safetensors; sharded: metadata.total_size
-#          in model.safetensors.index.json)
-#        - or the model card / paper
-#      For MoE: params_b is the TOTAL count (all experts loaded into VRAM).
-#   4. weight_bytes: 2 for BF16/FP16, 1 for FP8/INT8.
-#
-# Usage:
-#   read -r pb wb layers kvh hd <<< "$(get_model_params "Qwen/Qwen3-0.6B")"
-#   echo "$layers layers, $kvh KV heads"
-get_model_params() {
-    local model="${1:?usage: get_model_params <model_name>}"
-    local pb wb layers kvh hd
-    case "$model" in
-        # https://huggingface.co/Qwen/Qwen3-0.6B/raw/main/config.json
-        Qwen/Qwen3-0.6B)
-            pb=0.6;  wb=2; layers=28; kvh=8;  hd=128 ;;
-        # https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct/raw/main/config.json  (text_config)
-        # params_b from model.safetensors.index.json metadata.total_size / 2 / 1e9
-        Qwen/Qwen2-VL-2B-Instruct)
-            pb=2.2;  wb=2; layers=28; kvh=2;  hd=128 ;;
-        # https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct/raw/main/config.json  (text_config)
-        Qwen/Qwen2.5-VL-7B-Instruct)
-            pb=8.3;  wb=2; layers=28; kvh=4;  hd=128 ;;
-        # https://huggingface.co/Qwen/Qwen3-VL-2B-Instruct/raw/main/config.json  (text_config)
-        # params_b from model.safetensors size / 2 / 1e9
-        Qwen/Qwen3-VL-2B-Instruct)
-            pb=2.1;  wb=2; layers=28; kvh=8;  hd=128 ;;
-        # https://huggingface.co/Qwen/Qwen3-VL-8B-Instruct/raw/main/config.json  (text_config)
-        Qwen/Qwen3-VL-8B-Instruct)
-            pb=9.2;  wb=2; layers=36; kvh=8;  hd=128 ;;
-        # https://huggingface.co/Qwen/Qwen3-30B-A3B/raw/main/config.json
-        Qwen/Qwen3-30B-A3B|\
-        Qwen/Qwen3-30B-A3B-Instruct)
-            pb=30.5; wb=2; layers=48; kvh=4;  hd=128 ;;
-        # Same architecture as Qwen3-30B-A3B but FP8 quantized (1 byte per weight)
-        Qwen/Qwen3-VL-30B-A3B-Instruct-FP8)
-            pb=30.5; wb=1; layers=48; kvh=4;  hd=128 ;;
-        # https://huggingface.co/meta-llama/Meta-Llama-3.1-8B-Instruct/raw/main/config.json
-        meta-llama/Meta-Llama-3.1-8B-Instruct)
-            pb=8.0;  wb=2; layers=32; kvh=8;  hd=128 ;;
-        # https://huggingface.co/deepseek-ai/deepseek-llm-7b-base/raw/main/config.json
-        # MHA (not GQA): num_key_value_heads == num_attention_heads == 32
-        deepseek-ai/deepseek-llm-7b-base)
-            pb=6.9;  wb=2; layers=30; kvh=32; hd=128 ;;
-        # https://huggingface.co/Qwen/Qwen3-Embedding-4B/raw/main/config.json
-        # params_b from model.safetensors.index.json metadata.total_size / 2 / 1e9
-        # head_dim = hidden_size(2560) / num_attention_heads(32) = 80
-        Qwen/Qwen3-Embedding-4B)
-            pb=4.0;  wb=2; layers=36; kvh=8;  hd=80 ;;
-        # https://huggingface.co/llava-hf/llava-1.5-7b-hf/raw/main/config.json  (text_config)
-        # MHA: num_key_value_heads == num_attention_heads == 32
-        llava-hf/llava-1.5-7b-hf)
-            pb=7.1;  wb=2; layers=32; kvh=32; hd=128 ;;
-        *)
-            echo "get_model_params: unknown model '$model'" >&2
-            echo "Add it to get_model_params() in gpu_utils.sh" >&2
-            return 1 ;;
-    esac
-    echo "$pb $wb $layers $kvh $hd"
-}
-
-# estimate_worker_vram <model> [max_model_len] [max_concurrent_seqs] [engine_or_overhead]
-#
-# Prints "weights_gib kv_gib overhead_gib total_gib" to stdout.
-# Returns 1 (prints nothing) if the model is unknown to get_model_params.
-#
-# Formula:
-#   weights = params_b * 1e9 * weight_bytes
-#   kv      = 2 * layers * kv_heads * head_dim * 2(BF16) * seq_len * seqs
-#   total   = weights + kv + overhead
-#
-# Arguments:
-#   model               HuggingFace model name (required)
-#   max_model_len       Max tokens per sequence (default: 4096)
-#   max_concurrent_seqs Concurrent sequences to budget for (default: 2)
-#   engine_or_overhead  Engine name OR explicit GiB value (default: 2.0)
-#
-# If the 4th argument is an engine name (vllm, sglang, trtllm), overhead is
-# auto-computed from model parameters:
-#   overhead = base + scale * sqrt(params_b)
-#
-# Per-engine constants (calibrated from measurements on RTX 6000 Ada 48 GiB):
-#   vllm:   base=1.2, scale=1.0  → 0.6B≈2.0, 8B≈4.0, 30B≈6.7
-#   sglang: base=1.5, scale=1.0  → 0.6B≈2.3, 8B≈4.3, 30B≈7.0
-#   trtllm: base=2.0, scale=1.2  → 0.6B≈2.9, 8B≈5.4, 30B≈8.6
-#
-# sglang overhead was re-calibrated via profile_pytest.py bisection on
-# RTX 6000 Ada 48 GiB. Observed CUDA overhead (outside --mem-fraction-static):
-#   Qwen3-0.6B: ~1.8 GiB. Previous coefficients (2.5, 1.5) over-estimated by ~2x.
-#
-# If the 4th argument is a number, it's used directly (backward compatible).
-# If omitted, defaults to 2.0 (backward compatible).
-#
-# See examples/common/gpu_utils.md for the full derivation.
-#
-# Usage:
-#   read -r w kv oh total <<< "$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 vllm)"
-#   echo "$total GiB (w=$w kv=$kv oh=$oh)"
-estimate_worker_vram() {
-    local model="${1:?usage: estimate_worker_vram <model> [seq_len] [seqs] [engine_or_overhead]}"
-    local seqlen="${2:-4096}"
-    local seqs="${3:-2}"
-    local engine_or_overhead="${4:-2.0}"
-
-    local mp_out
-    mp_out=$(get_model_params "$model") || return 1
-    local pb wb layers kvh hd
-    read -r pb wb layers kvh hd <<< "$mp_out"
-
-    local overhead
-    case "$engine_or_overhead" in
-        vllm)   overhead=$(awk -v p="$pb" 'BEGIN { printf "%.1f", 1.2 + 1.0 * sqrt(p) }') ;;
-        sglang) overhead=$(awk -v p="$pb" 'BEGIN { printf "%.1f", 1.5 + 1.0 * sqrt(p) }') ;;
-        trtllm) overhead=$(awk -v p="$pb" 'BEGIN { printf "%.1f", 2.0 + 1.2 * sqrt(p) }') ;;
-        *)      overhead="$engine_or_overhead" ;;
-    esac
-
-    awk -v pb="$pb" -v wbytes="$wb" \
-        -v layers="$layers" -v heads="$kvh" -v dim="$hd" \
-        -v seqlen="$seqlen" -v seqs="$seqs" -v overhead="$overhead" \
-        'BEGIN {
-            gib = 1024 * 1024 * 1024
-            w   = pb * 1e9 * wbytes / gib
-            kv  = 2 * layers * heads * dim * 2 * seqlen * seqs / gib
-            printf "%.1f %.1f %.1f %.1f", w, kv, overhead, w + kv + overhead
-        }'
-}
-
-# gpu_worker_fraction <engine> <total_gib> <kv_gib> [gpu_index]
-#
-# Convert estimated GiB into the engine-appropriate GPU memory fraction.
-#
-# Engine semantics (see examples/common/gpu_utils.md):
-#   vllm/sglang  — fraction of TOTAL VRAM (uses total_gib).
-#   trtllm       — fraction of FREE VRAM after model load (uses kv_gib).
-#
-# Usage:
-#   gpu_worker_fraction vllm   4.0 0.9      # fraction of total
-#   gpu_worker_fraction trtllm 4.0 0.9      # fraction of free
-#   gpu_worker_fraction trtllm 4.0 0.9 1    # query GPU index 1
-gpu_worker_fraction() {
-    local engine="${1:?usage: gpu_worker_fraction <engine> <total_gib> <kv_gib> [gpu_index]}"
-    local total_gib="${2:?usage: gpu_worker_fraction <engine> <total_gib> <kv_gib>}"
-    local kv_gib="${3:?usage: gpu_worker_fraction <engine> <total_gib> <kv_gib>}"
-    local gpu_idx="${4:-0}"
-    case "$engine" in
-        vllm|sglang)
-            gpu_gb_to_total_fraction "$total_gib" "$gpu_idx" ;;
-        trtllm)
-            gpu_gb_to_free_fraction "$kv_gib" "$gpu_idx" ;;
-        *)
-            echo "gpu_worker_fraction: unknown engine '$engine'" >&2
-            echo "Supported: vllm, sglang, trtllm" >&2
-            return 1 ;;
-    esac
-}
-
-# gpu_peak_to_engine_fraction <engine> <peak_gib> [gpu_index]
-#
-# Convert a measured/profiled GPU peak (total VRAM including CUDA context,
-# activations, etc.) into the engine-specific memory fraction flag.
-#
-# Each engine's fraction controls only a SUBSET of GPU memory (e.g. vLLM's
-# --gpu-memory-utilization covers weights + KV cache but not CUDA context).
-# This function subtracts the engine-specific overhead so the fraction
-# targets the right internal budget, keeping the real peak stable across
-# re-profiles.
-#
-# Overhead constants (GiB outside the engine's budget):
-#   vllm   2.0   CUDA ctx ~0.6 + activations/sampler ~0.5 + PyTorch alloc ~0.5
-#   sglang 2.0   (assumed same as vllm; refine when profiled)
-#   trtllm 0.0   free-fraction is measured after model load, no subtraction needed
-#
-# Usage:
-#   gpu_peak_to_engine_fraction vllm 8.6       # on 48 GiB → 0.14
-#   gpu_peak_to_engine_fraction vllm 20.9      # on 48 GiB → 0.40
-#   gpu_peak_to_engine_fraction vllm 8.6 1     # query GPU index 1
-gpu_peak_to_engine_fraction() {
-    local engine=${1:?usage: gpu_peak_to_engine_fraction <engine> <peak_gib> [gpu_index]}
-    local peak_gib=${2:?usage: gpu_peak_to_engine_fraction <engine> <peak_gib> [gpu_index]}
-    local gpu_idx=${3:-0}
-
-    local overhead
-    case "$engine" in
-        vllm|sglang) overhead=2.0 ;;
-        trtllm)      overhead=0.0 ;;
-        *)
-            echo "gpu_peak_to_engine_fraction: unknown engine '$engine'" >&2
-            echo "Supported: vllm, sglang, trtllm" >&2
-            return 1 ;;
-    esac
-
-    local budget
-    budget=$(awk -v g="$peak_gib" -v oh="$overhead" \
-        'BEGIN { b = g - oh; if (b < 1) b = 1; printf "%.1f", b }')
-
-    case "$engine" in
-        vllm|sglang) gpu_gb_to_total_fraction "$budget" "$gpu_idx" ;;
-        trtllm)      gpu_gb_to_free_fraction  "$budget" "$gpu_idx" ;;
-    esac
-}
-
-# gpu_gb_to_total_fraction <gib> [gpu_index]
-#
-# For vLLM / sglang: --gpu-memory-utilization is a fraction of TOTAL GPU memory.
-# The engine budgets model weights + KV cache + activations within that limit.
-#
-# Prints the fraction of total GPU VRAM that <gib> GiB represents.
-# Useful for converting portable absolute memory requirements to
-# engine-specific fraction parameters (--gpu-memory-utilization, etc).
-#
-# Examples:
-#   gpu_gb_to_total_fraction 4        # on 48 GiB GPU → 0.09
-#   gpu_gb_to_total_fraction 16       # on 48 GiB GPU → 0.34
-#   gpu_gb_to_total_fraction 4 1      # query GPU index 1 instead of 0
-#
-# The result is ceil-rounded to 2 decimal places with a minimum of 0.05
-# and a maximum of 0.95.
-gpu_gb_to_total_fraction() {
-    local gib=${1:?usage: gpu_gb_to_total_fraction <gib> [gpu_index]}
-    local gpu_idx=${2:-0}
-
-    local total_mib
-    total_mib=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits -i "$gpu_idx" 2>/dev/null)
-    if [[ -z "$total_mib" || "$total_mib" -eq 0 ]]; then
-        echo "gpu_gb_to_total_fraction: failed to query GPU $gpu_idx total memory" >&2
-        return 1
+    # --- SGLang: token-based KV cache cap ---
+    if [[ "$engine" == "sglang" && -n "${_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS:-}" ]]; then
+        echo "--max-total-tokens ${_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS}"
+        return 0
    fi

-    local total_gib
-    total_gib=$(awk -v t="$total_mib" 'BEGIN { printf "%.1f", t / 1024 }')
-
-    if awk -v gib="$gib" -v total="$total_mib" 'BEGIN { exit (gib * 1024 > total) ? 0 : 1 }'; then
-        echo "" >&2
-        echo "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!" >&2
-        echo "WARNING: Requested ${gib} GiB but GPU $gpu_idx only has ${total_gib} GiB total." >&2
-        echo "The model likely won't fit. Consider a GPU with more VRAM" >&2
-        echo "or reduce the model size (quantization, smaller model, etc)." >&2
-        echo "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!" >&2
-        echo "" >&2
+    # --- vLLM: byte-based KV cache cap ---
+    # --gpu-memory-utilization 0.01 keeps vLLM's startup check from rejecting the
+    # launch when co-resident tests hold >10% of VRAM: vLLM compares free memory
+    # against this fraction (default 0.9) *before* applying the byte cap.
+    if [[ "$engine" == "vllm" && -n "${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}" ]]; then
+        local kv_bytes="$_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES"
+        if [[ "$workers_per_gpu" -gt 1 ]]; then
+            kv_bytes=$(awk -v b="$kv_bytes" -v n="$workers_per_gpu" 'BEGIN { printf "%d", b / n }')
+        fi
+        echo "--kv-cache-memory-bytes $kv_bytes --gpu-memory-utilization 0.01"
+        return 0
    fi

-    # fraction = gib * 1024 / total_mib, ceil to 2 decimals, clamp [0.05, 0.95]
-    awk -v gib="$gib" -v total="$total_mib" 'BEGIN {
-        frac = (gib * 1024) / total
-        # ceil to 2 decimal places
-        frac = int(frac * 100 + 0.99) / 100
-        if (frac < 0.05) frac = 0.05
-        if (frac > 0.95) frac = 0.95
-        printf "%.2f\n", frac
-    }'
+    # No override — engine uses its default allocation
+    echo ""
 }

-# gpu_gb_to_free_fraction <gib> [gpu_index]
-#
-# For TensorRT-LLM: --free-gpu-memory-fraction (CLI) and
-# kv_cache_config.free_gpu_memory_fraction (YAML) are fractions of FREE
-# memory AFTER model weights are loaded — NOT fractions of total VRAM.
-# The engine loads model weights first, queries remaining free memory,
-# then allocates  fraction * free_after_model  for the KV cache.
-#
-# Why gpu_gb_to_total_fraction won't work for TensorRT-LLM:
-#   gpu_gb_to_total_fraction(10) on a 48 GiB GPU → 0.21 (fraction of total).
-#   Passing 0.21 as free_gpu_memory_fraction after a 5 GiB model loads
-#   would allocate 0.21 * 43 GiB ≈ 9 GiB — close but not exact.
-#   For larger models the error grows: a 30 GiB model leaves 18 GiB free,
-#   so 0.21 * 18 ≈ 3.8 GiB — far less than the 10 GiB intended.
-#
-# This function queries CURRENT free memory from nvidia-smi and computes
-# gib / free_mib. The result is a best-effort estimate: TensorRT-LLM will
-# see less free memory than we measure here (model weights haven't loaded
-# yet), so the actual KV cache allocation will be smaller than <gib>.
-# For rough sizing this is fine; for precise control use the YAML config
-# with a known model size.
-#
-# For disagg_same_gpu (two workers sharing one GPU), launch workers
-# sequentially: start the first, wait for it to finish loading (poll
-# nvidia-smi or logs), then query free memory again and compute the
-# fraction for the second worker. This gives predictable per-worker
-# KV cache sizes on any GPU.
-#
-# Override at launch via CLI or env var:
-#   --override-engine-args '{"kv_cache_config":{"free_gpu_memory_fraction": 0.15}}'
-#   DYN_TRTLLM_OVERRIDE_ENGINE_ARGS='{"kv_cache_config":{"free_gpu_memory_fraction": 0.15}}'
-#
-# GOTCHA: overriding any field inside kv_cache_config REPLACES the entire
-# sub-dict from the YAML. You must re-include all fields you care about
-# (e.g. enable_block_reuse, dtype) or they'll be lost.
-#
-# Examples:
-#   gpu_gb_to_free_fraction 10       # on 48 GiB GPU with 46 GiB free → 0.22
-#   gpu_gb_to_free_fraction 10 1     # query GPU index 1 instead of 0
-#
-# The result is ceil-rounded to 2 decimal places, clamped [0.01, 0.95].
-# The floor is 0.01 (not 0.05 like gpu_gb_to_total_fraction) because this
-# fraction only controls KV cache, so small values are valid.
-gpu_gb_to_free_fraction() {
-    local gib=${1:?usage: gpu_gb_to_free_fraction <gib> [gpu_index]}
-    local gpu_idx=${2:-0}
-
-    local free_mib
-    free_mib=$(nvidia-smi --query-gpu=memory.free --format=csv,noheader,nounits -i "$gpu_idx" 2>/dev/null)
-    if [[ -z "$free_mib" || "$free_mib" -eq 0 ]]; then
-        echo "gpu_gb_to_free_fraction: failed to query GPU $gpu_idx free memory" >&2
-        return 1
-    fi
-
-    local free_gib
-    free_gib=$(awk -v f="$free_mib" 'BEGIN { printf "%.1f", f / 1024 }')
-
-    if awk -v gib="$gib" -v free="$free_mib" 'BEGIN { exit (gib * 1024 > free) ? 0 : 1 }'; then
-        echo "" >&2
-        echo "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!" >&2
-        echo "WARNING: Requested ${gib} GiB KV cache but GPU $gpu_idx only has ${free_gib} GiB free." >&2
-        echo "After model loading, even less will be available." >&2
-        echo "!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!" >&2
-        echo "" >&2
-    fi
-
-    # fraction = gib * 1024 / free_mib, ceil to 2 decimals, clamp [0.01, 0.95]
-    awk -v gib="$gib" -v free="$free_mib" 'BEGIN {
-        frac = (gib * 1024) / free
-        frac = int(frac * 100 + 0.99) / 100
-        if (frac < 0.01) frac = 0.01
-        if (frac > 0.95) frac = 0.95
-        printf "%.2f\n", frac
-    }'
-}

 # ---------------------------------------------------------------------------
 # Self-test: bash gpu_utils.sh --self-test
@@ -483,125 +75,51 @@ _gpu_utils_self_test() {
        fi
    }

-    echo "=== get_model_params ==="
+    local result

-    local out
-    out=$(get_model_params "Qwen/Qwen3-0.6B")
-    _assert "known model returns 5 fields" "0.6 2 28 8 128" "$out"
-
-    out=$(get_model_params "nope/unknown" 2>/dev/null)
-    _assert "unknown model returns empty" "" "$out"
-
-    get_model_params "nope/unknown" >/dev/null 2>&1
-    _assert "unknown model exits 1" "1" "$?"
+    echo "=== vLLM: kv bytes override ==="
+    result=$(_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES=942054000 \
+        build_gpu_mem_args vllm)
+    _assert "kv bytes" "--kv-cache-memory-bytes 942054000 --gpu-memory-utilization 0.01" "$result"

    echo ""
-    echo "=== estimate_worker_vram ==="
-
-    out=$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 vllm)
-    _assert "returns 4 space-separated fields" "4" "$(echo "$out" | wc -w | tr -d ' ')"
-
-    local w kv oh total
-    read -r w kv oh total <<< "$out"
-    _assert "weights > 0" "yes" "$(awk -v v="$w" 'BEGIN { print (v > 0) ? "yes" : "no" }')"
-    _assert "total > weights" "yes" "$(awk -v t="$total" -v w="$w" 'BEGIN { print (t > w) ? "yes" : "no" }')"
-
-    out=$(estimate_worker_vram "nope/unknown" 2>/dev/null)
-    _assert "unknown model returns empty" "" "$out"
-
-    local out_vllm out_sglang
-    out_vllm=$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 vllm)
-    out_sglang=$(estimate_worker_vram "Qwen/Qwen3-0.6B" 4096 2 sglang)
-    _assert "sglang overhead > vllm overhead" "yes" \
-        "$(awk -v v="$out_vllm" -v s="$out_sglang" 'BEGIN {
-            split(v, a); split(s, b); print (b[3]+0 > a[3]+0) ? "yes" : "no"
-        }')"
+    echo "=== vLLM: kv bytes with --workers-per-gpu 2 ==="
+    result=$(_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES=942054000 \
+        build_gpu_mem_args vllm --workers-per-gpu 2)
+    _assert "kv bytes / 2" "--kv-cache-memory-bytes 471027000 --gpu-memory-utilization 0.01" "$result"

    echo ""
-    echo "=== build_gpu_mem_args: estimator path (known model) ==="
-
-    local frac
-    frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2)
-    _assert "FRACTION non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
+    echo "=== vLLM: no override = empty ==="
+    result=$(build_gpu_mem_args vllm)
+    _assert "empty (engine default)" "" "$result"

    echo ""
-    echo "=== build_gpu_mem_args: unknown model, no default ==="
-
-    frac=$(build_gpu_mem_args vllm --model "nope/unknown")
-    _assert "FRACTION empty" "" "$frac"
+    echo "=== vLLM: sglang token env ignored ==="
+    result=$(_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS=23824 \
+        build_gpu_mem_args vllm)
+    _assert "vllm ignores token cap" "" "$result"

    echo ""
-    echo "=== build_gpu_mem_args: profiler wins over all ==="
-
-    frac=$(_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.55 \
-        build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --gpu-memory-utilization 0.70)
-    _assert "FRACTION = profiler (beats user flag)" "0.55" "$frac"
+    echo "=== sglang: token cap env ==="
+    result=$(_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS=1024 \
+        build_gpu_mem_args sglang)
+    _assert "token cap" "--max-total-tokens 1024" "$result"

    echo ""
-    echo "=== build_gpu_mem_args: user flag wins over estimator ==="
-
-    frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --gpu-memory-utilization 0.70)
-    _assert "FRACTION = user flag" "0.70" "$frac"
+    echo "=== sglang: no override = empty ==="
+    result=$(build_gpu_mem_args sglang)
+    _assert "empty (engine default)" "" "$result"

    echo ""
-    echo "=== build_gpu_mem_args: empty user flag falls through ==="
-
-    frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2 --gpu-memory-utilization "")
-    _assert "FRACTION = estimator" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
+    echo "=== sglang: vllm kv bytes env ignored ==="
+    result=$(_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES=942054000 \
+        build_gpu_mem_args sglang)
+    _assert "sglang ignores kv bytes" "" "$result"

    echo ""
-    echo "=== build_gpu_mem_args: --workers-per-gpu divides estimator ==="
-
-    local undivided
-    undivided=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2)
-    frac=$(build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --max-model-len 4096 --max-num-seqs 2 --workers-per-gpu 2)
-    local expected_half
-    expected_half=$(awk -v f="$undivided" 'BEGIN { printf "%.2f", f / 2 }')
-    _assert "FRACTION halved" "$expected_half" "$frac"
-
-    echo ""
-    echo "=== build_gpu_mem_args: --workers-per-gpu divides profiler ==="
-
-    frac=$(_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.80 \
-        build_gpu_mem_args vllm --model "Qwen/Qwen3-0.6B" --workers-per-gpu 2)
-    _assert "FRACTION = 0.80/2 = 0.40" "0.40" "$frac"
-
-    echo ""
-    echo "=== build_gpu_mem_args: sglang engine (sglang flag names) ==="
-
-    frac=$(build_gpu_mem_args sglang --model-path "Qwen/Qwen3-0.6B" --context-length 4096 --max-running-requests 2)
-    _assert "sglang FRACTION non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
-
-    echo ""
-    echo "=== build_gpu_mem_args: trtllm engine (trtllm flag names) ==="
-
-    frac=$(build_gpu_mem_args trtllm --model-path "Qwen/Qwen3-0.6B" --max-seq-len 4096 --max-batch-size 2)
-    _assert "trtllm FRACTION non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
-
-    echo ""
-    echo "=== build_gpu_mem_args: --mem-fraction-static user flag (sglang) ==="
-
-    frac=$(build_gpu_mem_args sglang --model-path "Qwen/Qwen3-0.6B" --mem-fraction-static 0.60)
-    _assert "FRACTION = user flag" "0.60" "$frac"
-
-    echo ""
-    echo "=== build_gpu_mem_args: missing --model ==="
-
-    build_gpu_mem_args vllm 2>/dev/null
-    _assert "missing --model exits 1" "1" "$?"
-
-    echo ""
-    echo "=== gpu_worker_fraction: explicit args ==="
-
-    local frac
-    frac=$(gpu_worker_fraction vllm 4.0 0.9)
-    _assert "vllm returns non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
-
-    frac=$(gpu_worker_fraction trtllm 4.0 0.9)
-    _assert "trtllm returns non-empty" "yes" "$([[ -n "$frac" ]] && echo yes || echo no)"
-
-    gpu_worker_fraction badengine 4.0 0.9 >/dev/null 2>&1
-    _assert "bad engine exits 1" "1" "$?"
+    echo "=== missing engine ==="
+    (build_gpu_mem_args 2>/dev/null)
+    _assert "missing engine exits 1" "1" "$?"

    echo ""
    echo "=========================================="
@@ -610,46 +128,8 @@ _gpu_utils_self_test() {
    [[ "$fail" -eq 0 ]]
 }

-# CLI mode: only when executed directly (not sourced by another script)
-if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
-    if [[ "${1:-}" == "--self-test" ]]; then
-        _gpu_utils_self_test
-        exit $?
-    fi
-    if [[ $# -gt 0 ]]; then
-        build_gpu_mem_args "$@"
-        exit $?
-    fi
-
-    cat <<'HELP'
-gpu_utils.sh — GPU memory fraction estimator
-
-Usage:
-  ./gpu_utils.sh <engine> --model <name> [options...]
-  ./gpu_utils.sh --self-test
-
-Engines: vllm, sglang, trtllm
-
-Examples:
-  ./gpu_utils.sh vllm --model Qwen/Qwen3-0.6B
-  ./gpu_utils.sh vllm --model Qwen/Qwen3-0.6B --max-model-len 4096 --max-num-seqs 2
-  ./gpu_utils.sh vllm --model Qwen/Qwen3-0.6B --workers-per-gpu 2
-  ./gpu_utils.sh sglang --model Qwen/Qwen3-0.6B --context-length 8192
-  ./gpu_utils.sh trtllm --model meta-llama/Meta-Llama-3.1-8B-Instruct --max-seq-len 4096
-
-Options:
-  --model NAME               Model name (required)
-    aliases: --model-path
-  --max-model-len N          Max sequence length (default: 4096)
-    aliases: --context-length, --max-seq-len
-  --max-num-seqs N           Concurrent sequences (default: 2)
-    aliases: --max-running-requests, --max-batch-size
-  --gpu-memory-utilization F Override fraction (vllm flag)
-    aliases: --mem-fraction-static
-  --workers-per-gpu N        Divide fraction by N (shared-GPU disagg)
-  --self-test                Run built-in test suite
-
-Output: prints the fraction to stdout (empty if model is unknown).
-HELP
-    exit 0
+# Self-test: run "bash gpu_utils.sh --self-test", or source this file and call _gpu_utils_self_test
+if [[ "${BASH_SOURCE[0]}" == "$0" && "${1:-}" == "--self-test" ]]; then
+    _gpu_utils_self_test
+    exit $?
 fi
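
The override precedence that the new `build_gpu_mem_args` self-tests exercise can be restated outside bash. The block below is a minimal Python sketch of that logic, not part of the commit: the function signature and the `env` parameter are assumptions made for illustration, and only the two environment variable names and the emitted flags come from the diff above.

```python
import os


def build_gpu_mem_args(engine, workers_per_gpu=1, env=None):
    """Return extra CLI flags for one worker as a string ('' = engine default).

    Illustrative re-statement of the bash helper; `env` is a hypothetical
    parameter that lets callers inject variables instead of os.environ.
    """
    env = os.environ if env is None else env

    # SGLang: token-based KV cache cap, passed through unchanged.
    tokens = env.get("_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS", "")
    if engine == "sglang" and tokens:
        return f"--max-total-tokens {tokens}"

    # vLLM: byte-based KV cache cap, split evenly across workers on one GPU.
    kv_bytes = env.get("_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES", "")
    if engine == "vllm" and kv_bytes:
        per_worker = int(kv_bytes)
        if workers_per_gpu > 1:
            per_worker //= workers_per_gpu  # truncates, like awk's printf "%d"
        return f"--kv-cache-memory-bytes {per_worker} --gpu-memory-utilization 0.01"

    # No matching override: the engine keeps its default allocation.
    return ""
```

Each engine reads only its own variable, so a vLLM worker ignores the SGLang token cap and vice versa, which is what the "env ignored" self-test cases check.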
--- a/examples/common/launch_utils.sh
+++ b/examples/common/launch_utils.sh
@@ -137,9 +137,9 @@ print_launch_banner() {
    echo "Frontend:    http://localhost:$_port"

    local _seq_len="${MAX_MODEL_LEN:-${CONTEXT_LENGTH:-${MAX_SEQ_LEN:-}}}"
-    local _frac="${GPU_MEM_FRACTION:-}"
+    local _mem_args="${GPU_MEM_ARGS:-}"
    [[ -n "$_seq_len" ]] && echo "Max seq len: $_seq_len"
-    [[ -n "$_frac" ]] && echo "GPU frac:    $_frac"
+    [[ -n "$_mem_args" ]] && echo "GPU mem:     $_mem_args"

    for _line in "$@"; do
        echo "$_line"

--- a/examples/multimodal/launch/audio_agg.sh
+++ b/examples/multimodal/launch/audio_agg.sh
@@ -93,10 +93,10 @@ python -m dynamo.frontend --http-port 8000 &
 python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &

 # run E/P/D workers
-GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
+GPU_MEM_ARGS=$(build_gpu_mem_args vllm)

 CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
-VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
+VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill $GPU_MEM_ARGS &

 # Wait for all background processes to complete
 wait
--- a/examples/multimodal/launch/audio_disagg.sh
+++ b/examples/multimodal/launch/audio_disagg.sh
@@ -93,11 +93,11 @@ python -m dynamo.frontend --http-port 8000 &
 python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &

 # run E/P/D workers
-GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
+GPU_MEM_ARGS=$(build_gpu_mem_args vllm)

 CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
-DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
-DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
+DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg $GPU_MEM_ARGS &
+DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg $GPU_MEM_ARGS &

 # Wait for all background processes to complete
 wait
--- a/examples/multimodal/launch/video_agg.sh
+++ b/examples/multimodal/launch/video_agg.sh
@@ -19,10 +19,10 @@ python -m dynamo.frontend --http-port=8000 &
 python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &

 # run E/P/D workers
-GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
+GPU_MEM_ARGS=$(build_gpu_mem_args vllm)

 CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE &
-VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
+VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill $GPU_MEM_ARGS &

 # Wait for all background processes to complete
 wait
--- a/examples/multimodal/launch/video_disagg.sh
+++ b/examples/multimodal/launch/video_disagg.sh
@@ -20,11 +20,11 @@ python -m dynamo.frontend --http-port=8000 &
 python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &

 # run E/P/D workers
-GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
+GPU_MEM_ARGS=$(build_gpu_mem_args vllm)

 CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE &
-DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
-DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
+DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg $GPU_MEM_ARGS &
+DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg $GPU_MEM_ARGS &

 # Wait for all background processes to complete
 wait
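
The launch scripts above expand `$GPU_MEM_ARGS` unquoted, so bash splits it into separate arguments and an empty value contributes none. A Python harness that assembles the same worker command would have to do that splitting itself. The sketch below assumes a hypothetical helper named `worker_argv`; only the script path and flag names are taken from the diff.

```python
import shlex


def worker_argv(model, worker_type, gpu_mem_args=""):
    """Build the argv list for components/worker.py (hypothetical helper)."""
    argv = ["python3", "components/worker.py",
            "--model", model, "--worker-type", worker_type]
    # shlex.split("") returns [], so an empty override adds no arguments,
    # mirroring how an unquoted empty $GPU_MEM_ARGS disappears in bash.
    argv += shlex.split(gpu_mem_args)
    return argv
```

Passing the string as a single list element instead would hand the worker one argument containing spaces, which is the failure that quoting `"$GPU_MEM_ARGS"` in bash would also cause.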
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -234,7 +234,10 @@ markers = [
    "gpu_8: marks tests to run on 8GPUs",
    "xpu_1: marks tests to run on XPU",
    "xpu_2: marks tests to run on 2XPUs",
-    "max_vram_gib(N): peak VRAM in GiB (with 10% safety). Filter with --max-vram-gib=N",
+    # The next three markers (profiled_vram_gib and requested_*) drive GPU-parallel pytest execution:
+    "profiled_vram_gib(N): actual peak VRAM observed by nvidia-smi during profiling. Used for --max-vram-gib filtering and scheduler budget tracking",
+    "requested_vllm_kv_cache_bytes(N): exact KV cache bytes for vLLM (skips memory profiling). Sets _PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES. Most deterministic method for parallel execution",
+    "requested_sglang_kv_tokens(N): max KV cache tokens for SGLang parallel execution. Sets _PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS to cap --max-total-tokens and prevent over-allocation",
    "e2e: marks tests as end-to-end tests",
    "integration: marks tests as integration tests",
    "unit: marks tests as unit tests",

--- a/tests/README.md
+++ b/tests/README.md
@@ -114,43 +114,96 @@ Markers are required for all tests. They are used for test selection in CI and l
 | Lifecycle [required]    | pre_merge, post_merge, nightly, weekly, release                  | When the test should run           |
 | Test Type [required]    | unit, integration, e2e, benchmark, performance, stress, multimodal | Nature of the test               |
 | Hardware [required]     | gpu_0, gpu_1, gpu_2, gpu_4, gpu_8, h100                         | Number/type of GPUs required       |
-| VRAM Requirement        | max_vram_gib(N)                                                              | Peak VRAM in GiB (with 10% safety). The pytest invocation can use `--max-vram-gib=N` to select only tests that fit on the available GPU. Does not prevent running on smaller GPUs (that will OOM). Use `profile_pytest.py` to measure. |
+| VRAM (profiled)         | profiled_vram_gib(N)                                                         | Actual peak VRAM observed by nvidia-smi during profiling (includes CUDA overhead). Used for `--max-vram-gib=N` filtering and GPU-parallel scheduler budget tracking. |
+| vLLM KV cache bytes     | requested_vllm_kv_cache_bytes(N)                                             | (vLLM only) Exact KV cache bytes. Sets `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` → `--kv-cache-memory-bytes`. Deterministic, parallel-safe. |
+| SGLang KV tokens        | requested_sglang_kv_tokens(N)                                                          | (SGLang only) Max KV cache tokens. Sets `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` → `--max-total-tokens`. Deterministic, parallel-safe. |
 | Component/Framework     | vllm, trtllm, sglang, kvbm, kvbm_concurrency, planner, router   | Backend or component specificity   |
 | Infrastructure          | k8s, deploy, fault_tolerance                                     | Infrastructure/environment needs   |
 | Execution               | parallel                                                         | Test can run in parallel with pytest-xdist. Must use dynamic port allocation (`alloc_ports`) and not share resources (e.g. filesystem) |
 | Other                   | slow, skip, xfail, custom_build, model, aiconfigurator           | Special handling                   |

-### Example
+### Example (vLLM)
 ```python
 @pytest.mark.pre_merge
 @pytest.mark.integration
 @pytest.mark.gpu_1
-@pytest.mark.max_vram_gib(21)  # peak 18.5 GiB GPU RAM used (+10% safety: 20.4 GiB)
+@pytest.mark.profiled_vram_gib(20.5)  # actual nvidia-smi peak
+@pytest.mark.requested_vllm_kv_cache_bytes(942_054_000)  # KV cache cap (2x safety over min=471_027_000)
 @pytest.mark.vllm
 def test_kv_cache_behavior():
    ...
 ```

-### Filtering by VRAM
+### Example (SGLang with token cap)
+```python
+@pytest.mark.pre_merge
+@pytest.mark.e2e
+@pytest.mark.gpu_1
+@pytest.mark.profiled_vram_gib(3.7)   # actual nvidia-smi peak at recommended token count
+@pytest.mark.requested_sglang_kv_tokens(96)     # KV cache cap (2x safety over min=48)
+@pytest.mark.timeout(265)
+@pytest.mark.sglang
+def test_sglang_aggregated():
+    ...
+```

-The `max_vram_gib(N)` marker records how much GPU memory a test needs. The pytest invocation can use `--max-vram-gib=N` as a **selector** to run only tests that fit on the available GPU. Tests that exceed the budget are skipped at collection time (before any test starts). Tests without a `max_vram_gib` marker always run (no constraint assumed).
+### VRAM Markers and Filtering

-This is for the following use cases:
- **MIG partitioned GPUs:** when running tests in parallel on MIG slices (e.g., 2x 40 GiB partitions on an 80 GiB GPU), each slice has limited VRAM.
- **Smaller CI GPUs:** some CI jobs use L4 GPUs with only 24 GiB of VRAM.
+Markers differ by engine:

-Nothing prevents you from running without this flag — but if a test needs more VRAM than is physically available, it will OOM at runtime (e.g., vLLM raises `ValueError: No available memory for the cache blocks`).
+**vLLM** uses byte-based KV cache control:
+- **`profiled_vram_gib(N)`** — actual peak from nvidia-smi. Used for `--max-vram-gib` filtering and scheduler budget.
+- **`requested_vllm_kv_cache_bytes(N)`** — exact KV cache bytes. Sets `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` → `--kv-cache-memory-bytes`. Deterministic and parallel-safe.

-```bash
-# Preview which gpu_1 vllm tests fit on a 16 GiB MIG partition (no tests are executed)
-python3 -m pytest --max-vram-gib=16 --dry-run -m "gpu_1 and vllm" tests/serve/test_vllm.py
+**SGLang** uses token-based control:
+- **`profiled_vram_gib(N)`** — actual peak from nvidia-smi at the recommended token count. Used for `--max-vram-gib` filtering and scheduler budget.
+- **`requested_sglang_kv_tokens(N)`** — max KV cache tokens. Sets `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` → `--max-total-tokens`. SGLang's default `--mem-fraction-static` is never overridden; the token cap is the sole allocation control. Deterministic and parallel-safe (see `examples/common/gpu_utils.md`).
+
+`--max-vram-gib=N` deselects tests whose `profiled_vram_gib` exceeds N. Tests without a VRAM marker are also deselected (unknown VRAM = unsafe for parallel). To add a test to the pool, profile it with `tests/utils/profile_pytest.py` (see [GPU VRAM Profiler](#gpu-vram-profiler-profile_pytestpy)).
+
+### GPU-Parallel Execution
+
+GPU tests run concurrently via a custom VRAM-aware scheduler (`tests/utils/pytest_parallel_gpu.py`). This is separate from `pytest-xdist` because:
+
+1. **VRAM budget**: xdist has no GPU memory awareness — two 20 GiB tests on a 48 GiB GPU will OOM.
+2. **Profiling race**: engines snapshot free memory during init, so concurrent startups skew each other's measurements. The scheduler staggers launches (VRAM stability check) and retries transient failures.
+3. **Engine-specific allocation**: each test gets a constrained allocation so it uses only its budgeted share. xdist has no mechanism for this.
+   - **vLLM**: `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES = N` → `--kv-cache-memory-bytes` (from `requested_vllm_kv_cache_bytes` marker). Byte-based cap is deterministic and doesn't depend on current free memory, making it inherently parallel-safe.
+   - **SGLang**: `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS = N` → `--max-total-tokens` (from `requested_sglang_kv_tokens` marker). Token-based cap is deterministic and doesn't depend on current free memory, making it inherently parallel-safe.

-# Same, but for 24 GiB L4 CI GPUs
+```bash
+# Dry-run: preview which tests fit and the GPU plan
 python3 -m pytest --max-vram-gib=24 --dry-run -m "gpu_1 and vllm" tests/serve/test_vllm.py

-# GPU tests that have no max_vram_gib marker yet — need profiling
-# TODO: profile these tests and add max_vram_gib markers
-python3 -m pytest --dry-run -m "(gpu_1 or gpu_2 or gpu_4 or gpu_8) and not max_vram_gib" tests/serve/test_vllm.py
+# Run pre-merge vllm tests in parallel
+python3 -m pytest --max-vram-gib=6 -n auto -m "gpu_1 and vllm and not nightly and not post_merge" tests/serve/test_vllm.py
+
+# Run all (pre+post merge) with live output
+python3 -m pytest --max-vram-gib=48 -n auto -sv -m "gpu_1 and vllm and not nightly" tests/serve/test_vllm.py tests/frontend/test_vllm.py
+
+# SGLang tests
+python3 -m pytest --max-vram-gib=48 -n auto -m "gpu_1 and sglang" tests/serve/test_sglang.py
+
+# Tests that still need profiling
+python3 -m pytest --dry-run -m "(gpu_1 or gpu_2) and not profiled_vram_gib" tests/serve/
+```
+
+Example output (6 SGLang tests, RTX 6000 Ada 48 GiB):
+```
+GPU parallel: 6 tests, 7 concurrent slots, GPU0 (48 GiB, 43 GiB multi-proc budget)
+
+[w0] tests/serve/test_sglang.py::...completions_only-2]     profiled= 14.9 GiB  req_kv_tokens=  1024  timeout=420s
+[w1] tests/serve/test_sglang.py::...multimodal_agg_qwen-2]  profiled= 20.2 GiB  req_kv_tokens=   512  timeout=280s
+[w2] tests/serve/test_sglang.py::...aggregated-2]            profiled=  6.0 GiB  req_kv_tokens=  1024  timeout=240s
+...
+
+[w0] tests/serve/...completions_only-2] (GPU0, profiled 14.9 GiB, req_kv_tokens=  1024) RUNNING
+[w1] tests/serve/...multimodal_agg_qwen-2] (GPU0, profiled 20.2 GiB, req_kv_tokens=   512) RUNNING
+[elapsed 10s] GPU0: 0.6/48 GiB [w0(10s), w1(5s)] [queued: w2, w3, w4, w5]
+[w1] tests/serve/...multimodal_agg_qwen-2] PASSED [31s]
+[w0] tests/serve/...completions_only-2] PASSED [76s]
+...
+=============== 6 passed in 111.00s (1:51) (vs 228s seq, 2.1x) ===============
 ```

 ### Lifecycle Marker Note
@@ -294,13 +347,20 @@ pytest -m "pre_merge and parallel and not (vllm or sglang or trtllm) and gpu_0"
 pytest -m "pre_merge and not parallel and not (vllm or sglang or trtllm) and gpu_0" -v --tb=short
 ```

-> **Parallel vs sequential:** CPU-only tests (`gpu_0`) marked `parallel` run with `pytest-xdist` (`-n auto` or `-n <workers>`, `--dist=loadscope`). Tests not marked `parallel`, and all GPU tests (`gpu_1`, `gpu_2`, etc.), run sequentially (no `-n` flag). See [`.github/actions/pytest/action.yml`](../.github/actions/pytest/action.yml).
+> **Parallel vs sequential:** CPU-only tests (`gpu_0`) marked `parallel` run with `pytest-xdist` (`-n auto` or `-n <workers>`, `--dist=loadscope`). GPU tests (`gpu_1`, `gpu_2`, etc.) run sequentially by default, but can run in parallel with `--max-vram-gib=N -n auto` (uses a custom VRAM-aware scheduler, not xdist). See [`.github/actions/pytest/action.yml`](../.github/actions/pytest/action.yml).

 **Full E2E suite** -- launches engines for every test configuration; slowest, requires GPU and a framework container (typically <30min depending on framework and model):
 ```bash
+# Sequential (default)
 pytest -m "vllm and e2e and gpu_1" -v --tb=short
 pytest -m "sglang and e2e and gpu_1" -v --tb=short
 pytest -m "trtllm and e2e and gpu_1" -v --tb=short
+
+# GPU-parallel (VRAM-aware scheduling, ~2x faster on 48 GiB GPU)
+# Only tests with profiled_vram_gib markers are selected; -n auto calculates
+# concurrent slots from GPU VRAM / smallest test. See "GPU-Parallel Execution" above.
+python3 -m pytest --max-vram-gib=48 -n auto -m "gpu_1 and sglang" tests/serve/test_sglang.py -v
+python3 -m pytest --max-vram-gib=48 -n auto -m "gpu_1 and vllm" tests/serve/test_vllm.py -v
 ```

 **Post-merge equivalent** -- CI runs `(pre_merge or post_merge)` after merge, which adds slower tests on top of the pre_merge set. **Running the full post-merge suite locally can take several hours per framework** (model downloads, GPU inference, multi-GPU coordination). For day-to-day development, before you submit to CI, use the `pre_merge` commands above for quicker feedback. See [`.github/workflows/post-merge-ci.yml`](../.github/workflows/post-merge-ci.yml) for exact markers:
@@ -444,66 +504,83 @@ When writing or reviewing GPU tests, use `tests/utils/profile_pytest.py` to meas

 ### How it works

-The profiler sets the `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` environment variable (a fraction from 0.0 to 1.0 of total GPU RAM) and runs the test at each probe point. It bisects between "passes" and "OOM/fails" to find the boundary. After the search, it samples `nvidia-smi` to report peak VRAM, phase analysis, and marker recommendations.
+The profiler automatically detects the engine type and uses the appropriate binary search:
+
+- **vLLM**: bisects `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` (bytes) → `--kv-cache-memory-bytes`. Finds the minimum KV cache bytes where the test passes, applies a 2x safety factor. Outputs `profiled_vram_gib` and `requested_vllm_kv_cache_bytes` markers.
+- **SGLang**: bisects `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` (token count) → `--max-total-tokens`. Finds the minimum KV cache tokens where the test passes, applies a 2x safety factor, then runs a final probe at the safe token count to measure the actual VRAM. Outputs `profiled_vram_gib` and `requested_sglang_kv_tokens` markers.

-**Requirement:** The test under profile **must** honor the `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` env var. For standalone tests that allocate CUDA memory directly, check `os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")` and cap your allocation accordingly — see `tests/utils/test_mock_gpu_alloc.py` for an example.
+**Requirement (vLLM):** The launch script must honor `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES`. This is handled by `build_gpu_mem_args` in `gpu_utils.sh` (returns `--kv-cache-memory-bytes N`).
+
+**Requirement (SGLang):** The launch script must honor `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS`. This is handled by `build_gpu_mem_args` in `gpu_utils.sh` (returns `--max-total-tokens N`).

 ### Engine-specific mapping

-`_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` is a generic env var (float 0.0-1.0) that launch scripts translate to the engine-specific CLI flag:
+Launch scripts call `build_gpu_mem_args` (from `examples/common/gpu_utils.sh`) which checks env var overrides and returns the appropriate CLI flags:
+
+```bash
+GPU_MEM_ARGS=$(build_gpu_mem_args sglang)
+python -m dynamo.sglang --model-path "$MODEL" $GPU_MEM_ARGS &
+```
+
+Env vars control engine allocation during profiling and parallel test execution:

-| Engine  | CLI flag                         | Launch script support |
-|---------|----------------------------------|-----------------------|
-| vLLM    | `--gpu-memory-utilization`       | Implemented in `agg.sh`, `disagg.sh`, etc. via `build_gpu_mem_args` |
-| SGLang  | `--mem-fraction-static`          | Implemented in `agg.sh`, `agg_embed.sh`, `disagg.sh`, `agg_router.sh`, `disagg_same_gpu.sh` via `build_gpu_mem_args`. Multimodal scripts (`multimodal_epd.sh`, `multimodal_disagg.sh`) split the override proportionally between workers. |
-| TRT-LLM | `--free-gpu-memory-fraction`    | Not yet implemented (has its own `DYN_TRTLLM_FREE_GPU_MEMORY_FRACTION`, TODO: unify) |
+**`_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES`** (integer) — vLLM only:

-**Note on sglang:** Unlike vLLM (where `--max-model-len` affects KV cache sizing), sglang's `--mem-fraction-static` is the sole knob for KV cache allocation. `--context-length` and `--max-running-requests` only affect request scheduling, not memory allocation. See `examples/common/gpu_utils.md` for details.
+| Engine  | Returned CLI flag                | Notes |
+|---------|----------------------------------|-------|
+| vLLM    | `--kv-cache-memory-bytes N`      | Exact byte cap on KV cache; deterministic and parallel-safe |

-If the profiler detects constant VRAM across all probes (meaning the env var is ignored), it prints a warning and skips marker recommendations.
+**`_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS`** (integer) — SGLang only:
+
+| Engine  | Returned CLI flag                | Notes |
+|---------|----------------------------------|-------|
+| SGLang  | `--max-total-tokens N`           | Token-based KV cache cap |
+
+Both use absolute caps (bytes and tokens) — deterministic and independent of current free memory, which is critical for parallel test execution. See `examples/common/gpu_utils.md`.

 ### Usage

 ```bash
-# Default mode: binary search for minimum VRAM (recommended)
-# -xvs is optional: stop on first failure, verbose, show output
+# vLLM: binary search for minimum KV cache bytes
 python tests/utils/profile_pytest.py tests/serve/test_vllm.py::test_serve_deployment[aggregated] -xvs

+# Profile on a specific GPU (default: 0)
+python tests/utils/profile_pytest.py --gpu 1 tests/serve/test_vllm.py::test_serve_deployment[aggregated] -xvs
+
+# SGLang: binary search for minimum KV cache tokens (automatic)
+python tests/utils/profile_pytest.py tests/serve/test_sglang.py::test_sglang_deployment[aggregated-2] -xvs
+
 # Single-pass profiling (no binary search, just measure one run using default RAM)
 python tests/utils/profile_pytest.py --no-find-min-vram tests/serve/test_vllm.py::test_serve_deployment[aggregated]
 ```

-### Example output
+### Example output (vLLM)

 ```bash
 ========================================================================
-FIND MINIMUM VRAM (binary search)
+FIND MINIMUM KV CACHE BYTES (vLLM, deterministic) (binary search)
 ========================================================================
  GPU total : 48.0 GiB
-  GPU free  : 48.0 GiB  (in use: 0.0 GiB)
+  GPU free  : 47.4 GiB  (in use: 0.6 GiB)
  Test      : tests/serve/test_vllm.py::test_serve_deployment[aggregated] -x

-  Range   : 5% - 95%  (tolerance 5%)
-  Max iter: 6 (1 validation + 5 bisections)
-
-  [probe 1/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.95 (45.6 GiB)  [validation run]
-  [PASS] peak 18.5 GiB, wall 41s, iter took 49s
+  [probe 1] Validation run: kv_cache=23296 MiB (50% of free)
+  [PASS] peak 2.9 GiB, wall 42s, iter took 49s
  ...
-  [probe 5/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.33 (15.9 GiB)
-  [FAIL] OOM or error at 33% (15.9 GiB), iter took 30s
+  [probe 6/15] kv_cache=449 MiB (471,027,000 bytes)
+  [PASS] peak 2.9 GiB, wall 41s, iter took 49s

-  [probe 6/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.36 (17.2 GiB)  [~0 left, ETA ~0s]
-  [PASS] peak 18.5 GiB, wall 41s, iter took 49s
+  [probe 7/15] kv_cache=224 MiB (235,513,856 bytes)
+  [FAIL] OOM, iter took 30s

 ========================================================================
-MINIMUM VRAM RESULT
-========================================================================
-  Lowest passing utilization : 36%
-  Minimum VRAM needed        : ~17.2 GiB (peak observed: 18.5 GiB, +10% safety: 20.4 GiB)
+  Minimum KV cache : 449 MiB (471,027,000 bytes)
+  Safe KV cache    : 898 MiB (942,054,000 bytes) (2x safety)
+  Peak VRAM        : 2.9 GiB

-  # test_serve_deployment[aggregated]: @pytest.mark.max_vram_gib(21)
-  # Fits on: L4 (24 GiB), V100-32GB (32 GiB), A6000/A40 (48 GiB), A100/H100 (80 GiB)
-  # Will OOM on: edge/embedded (4 GiB), RTX 3060/4060 (8 GiB), T4 (16 GiB)
+  Recommended markers:
+    @pytest.mark.profiled_vram_gib(2.9)
+    @pytest.mark.requested_vllm_kv_cache_bytes(942_054_000),  # KV cache cap (2x safety over min=471_027_000)
 ========================================================================

 ========================================================================
@@ -511,14 +588,41 @@ Recommended markers to add to your pytest. You can copy-paste this:
 ========================================================================
 # Measured using: tests/utils/profile_pytest.py tests/serve/test_vllm.py::test_serve_deployment[aggregated]
 @pytest.mark.e2e  # wall time 41.2s, loads a real model
-@pytest.mark.gpu_1  # 1 GPU(s) used, peak 18.5 GiB
-@pytest.mark.max_vram_gib(21)  # peak 18.5 GiB GPU RAM used (+10% safety: 20.4 GiB)
+@pytest.mark.gpu_1  # 1 GPU(s) used, peak 2.9 GiB
+@pytest.mark.profiled_vram_gib(2.9)  # actual nvidia-smi peak
+@pytest.mark.requested_vllm_kv_cache_bytes(942_054_000)  # KV cache cap (2x safety over min=471_027_000)
 @pytest.mark.timeout(124)  # 3x observed 41.2s

  WARNING: Wall time 41.2s is too slow for pre_merge (> 20s). Consider post_merge or nightly instead.
-  WARNING: Will OOM on edge/embedded (4 GiB).
-  WARNING: Will OOM on RTX 3060/4060 (8 GiB).
-  WARNING: Will OOM on T4 (16 GiB).
+========================================================================
+```
+
+### Example output (SGLang — token-based bisection)
+
+```bash
+========================================================================
+FIND MINIMUM KV TOKENS (SGLang) (binary search)
+========================================================================
+  GPU total : 48.0 GiB
+  GPU free  : 47.4 GiB  (in use: 0.6 GiB)
+  Test      : tests/serve/test_sglang.py::test_sglang_deployment[aggregated-2] -xvs
+
+  [probe 1] Validation run (no token cap)
+  [PASS] peak 43.0 GiB, wall 36s, max_total_tokens=366688, iter took 44s
+  ...
+  [probe 14/15] tokens=48  [~1 left, ETA ~45s]
+  [PASS] tokens=48, peak 3.7 GiB, wall 26s, iter took 34s
+  [final probe] Measuring VRAM at safe_tokens=96
+  [PASS] tokens=96, peak 3.7 GiB, wall 27s
+
+========================================================================
+MINIMUM KV TOKENS RESULT
+========================================================================
+  Minimum tokens  : 48 (raw bisection result)
+  Recommended     : 96 (2x safety)
+  Peak VRAM       : 3.7 GiB (at 96 tokens)
+  @pytest.mark.profiled_vram_gib(3.7)
+  @pytest.mark.requested_sglang_kv_tokens(96),  # KV cache cap (2x safety over min=48)
 ========================================================================
 ```

@@ -526,7 +630,7 @@ Recommended markers to add to your pytest. You can copy-paste this:

 1. **Copy the `@pytest.mark.*` lines** into your test function or `pytestmark` list.

-2. **VRAM marker** — `max_vram_gib(N)` records the peak GPU memory the test needs (with 10% safety margin). This marker does **not** skip tests on its own — if a test runs on a GPU that is too small, it will OOM and fail hard. Use `--max-vram-gib=N` to select only tests that fit on the available GPU (see [Filtering by VRAM](#filtering-by-vram) for examples). The WARNING lines in the profiler output tell you which GPU tiers would be too small (e.g., "Will OOM on T4 (16 GiB)").
+2. **VRAM markers**: `profiled_vram_gib(N)` records the actual nvidia-smi peak and is used for filtering and scheduling; `requested_vllm_kv_cache_bytes(N)` or `requested_sglang_kv_tokens(N)` caps the engine's KV cache allocation so that parallel execution is deterministic. Use `--max-vram-gib=N` to deselect tests whose profiled VRAM exceeds N or that have no `profiled_vram_gib` marker (see [Filtering by VRAM](#filtering-by-vram)). The WARNING lines in the profiler output tell you which GPU tiers would be too small (e.g., "Will OOM on T4 (16 GiB)").

 3. **Lifecycle markers** — the profiler recommends `pre_merge` only for tests under 20 seconds. For slower tests, it warns you to consider `post_merge` or `nightly` but does not choose for you — use your judgment based on how critical the test is for catching regressions early.

@@ -538,6 +642,7 @@ Recommended markers to add to your pytest. You can copy-paste this:

 | Flag | Description |
 |------|-------------|
+| `--kv-bytes` | No-op (kept for backward compat). vLLM always bisects on `--kv-cache-memory-bytes` |
 | `--no-find-min-vram` | Skip binary search; run a single profiling pass instead |
 | `--interval N` | GPU sampling interval in seconds (default: 1.0) |
 | `--baseline-seconds N` | Seconds to sample before launching pytest (default: 3.0) |

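Note on the profiler's search: the README text above describes bisecting on a KV cache cap until the smallest passing value is found, then applying a 2x safety factor. The sketch below illustrates that idea in isolation. It assumes pass/fail is monotonic in the cap. `find_min_passing`, `recommend_cap`, and `fake_test_passes` are hypothetical names for illustration, not functions from `tests/utils/profile_pytest.py`, and the real profiler also bounds the number of probes.

```python
from typing import Callable


def find_min_passing(lo: int, hi: int, passes: Callable[[int], bool]) -> int:
    """Return the smallest value in [lo, hi] for which passes() is True.

    Assumes passes() is monotonic (False below some threshold, True at or
    above it) and that passes(hi) is True, which plays the role of the
    profiler's initial validation run.
    """
    if not passes(hi):
        raise ValueError("validation run failed at the upper bound")
    while lo < hi:
        mid = (lo + hi) // 2
        if passes(mid):
            hi = mid  # mid works, so the minimum is mid or lower
        else:
            lo = mid + 1  # mid fails, so the minimum is above mid
    return lo


def recommend_cap(min_passing: int, safety_factor: int = 2) -> int:
    """Apply the safety factor to the raw bisection result."""
    return min_passing * safety_factor


# Hypothetical stand-in for "launch the engine with this cap and run the test".
def fake_test_passes(kv_tokens: int) -> bool:
    return kv_tokens >= 48


minimum = find_min_passing(1, 366_688, fake_test_passes)
print(minimum, recommend_cap(minimum))  # prints: 48 96
```

With the stand-in predicate, the search converges on 48 and the recommended cap is 96, matching the shape of the SGLang example output (minimum 48, marker value 96).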
--- a/tests/conftest.py
+++ b/tests/conftest.py
@@ -25,6 +25,11 @@ from tests.utils.test_output import resolve_test_output_path

 _logger = logging.getLogger(__name__)

+# Typed stash keys for GPU-parallel config (avoids setting unknown attrs on Config)
+_gpu_parallel_gpus_key: pytest.StashKey[list[dict]] = pytest.StashKey()
+_gpu_indices_key: pytest.StashKey[list[int] | None] = pytest.StashKey()
+_gpu_slots_key: pytest.StashKey[int | None] = pytest.StashKey()
+

 def pytest_addoption(parser: pytest.Parser) -> None:
    """Add shared command-line options for all tests.
@@ -59,7 +64,18 @@ def pytest_addoption(parser: pytest.Parser) -> None:
        "--max-vram-gib",
        type=float,
        default=None,
-        help="Skip tests whose @pytest.mark.max_vram_gib(N) exceeds this value (GiB).",
+        help="Only run tests whose @pytest.mark.profiled_vram_gib(N) is at most this value (GiB); "
+        "tests without the marker are deselected. Without -n: runs tests sequentially. "
+        "With -n N: runs N tests concurrently as subprocesses with VRAM-aware scheduling. "
+        "With -n auto: calculates max concurrent slots from GPU VRAM / max_vram_gib.",
+    )
+    parser.addoption(
+        "--gpus",
+        "--gpu",
+        type=str,
+        default="all",
+        help="Comma-separated GPU indices or 'all' (default: all). "
+        "Controls which GPUs the parallel test runner distributes tests across.",
    )
    parser.addoption(
        "--dry-run",
@@ -79,6 +95,130 @@ logging.basicConfig(
 )


+# ---------------------------------------------------------------------------
+# GPU-serial and GPU-parallel: VRAM-aware test scheduling
+#
+# Parallel mode is activated only when both --max-vram-gib and -n (N or auto) are passed:
+#   pytest --max-vram-gib=48 -n auto -m "gpu_1 and sglang" tests/serve/
+# ---------------------------------------------------------------------------
+
+
+def pytest_configure(config: pytest.Config) -> None:
+    """Detect GPUs for --max-vram-gib planning and parallel execution."""
+    vram_limit = config.getoption("max_vram_gib", default=None)
+    if vram_limit is None:
+        return
+    # Delayed import: vram_utils requires pynvml, and a module-level import would
+    # break conftest loading on CPU-only CI runners (e.g. ARM deploy tests) that lack nvidia-ml-py.
+    from tests.utils.pytest_parallel_gpu import _parse_gpu_indices
+    from tests.utils.vram_utils import auto_worker_count, detect_gpus
+
+    gpus = detect_gpus()
+    if gpus:
+        config.stash[_gpu_parallel_gpus_key] = gpus
+
+    # Parse --gpus into a list of indices (or None for all)
+    gpus_raw = config.getoption("gpus", default="all")
+    if gpus_raw and gpus_raw.strip().lower() != "all":
+        config.stash[_gpu_indices_key] = _parse_gpu_indices(gpus_raw, gpus)
+        selected_gpus = [
+            g for g in gpus if g["index"] in config.stash[_gpu_indices_key]
+        ]
+    else:
+        config.stash[_gpu_indices_key] = None  # all GPUs
+        selected_gpus = gpus
+
+    # If -n is set with --max-vram-gib, save the slot count and disable xdist
+    # so our subprocess orchestrator handles parallelism instead.
+    # xdist's pytest_configure(trylast=True) checks _is_distribution_mode()
+    # which reads dist/tx (not numprocesses), so we must also clear dist.
+    numproc = config.getoption("numprocesses", default=None)
+    if numproc is not None and numproc != 0:
+        if isinstance(numproc, str) or numproc == -1:
+            config.stash[_gpu_slots_key] = (
+                auto_worker_count(selected_gpus, vram_limit) if selected_gpus else 1
+            )
+        else:
+            config.stash[_gpu_slots_key] = int(numproc)
+        config.option.numprocesses = 0
+        config.option.dist = "no"
+
+
+@pytest.hookimpl(tryfirst=True)
+def pytest_runtestloop(session: pytest.Session) -> bool | None:
+    """Intercept the test loop for GPU-parallel execution.
+
+    When --max-vram-gib and -n are both present, run tests as independent
+    subprocesses via the GPU orchestrator instead of the normal pytest loop.
+    Must run before the default pytest loop (tryfirst) so we can return True
+    to prevent the default sequential execution.
+    """
+    config = session.config
+    num_slots = config.stash.get(_gpu_slots_key, None)
+    vram_limit = config.getoption("max_vram_gib", default=None)
+
+    if num_slots is None or vram_limit is None:
+        return None  # serial execution: let normal pytest handle it
+
+    # Parallel-execution imports are delayed; see the pynvml note in pytest_configure.
+    from tests.utils.pytest_parallel_gpu import run_parallel
+    from tests.utils.vram_utils import load_test_meta
+
+    # Collect test IDs from the already-filtered session items
+    test_ids = [item.nodeid for item in session.items]
+    if not test_ids:
+        return True
+
+    meta = load_test_meta()
+    is_stream = config.getoption("capture", default="fd") == "no"
+    gpu_indices = config.stash.get(_gpu_indices_key, None)
+
+    # Forward original CLI args to child pytest subprocesses so they
+    # inherit options like -s, -v, --tb, --durations, --image, etc.
+    extra_args: list[str] = []
+    if is_stream:
+        extra_args.append("-s")
+    verbose = config.getoption("verbose", default=0)
+    if verbose >= 2:
+        extra_args.append("-vv")
+    elif verbose >= 1:
+        extra_args.append("-v")
+    tb_style = config.getoption("tbstyle", default="short")
+    if tb_style:  # forward unconditionally: pytest's built-in default is "auto", not "short"
+        extra_args.append(f"--tb={tb_style}")
+    durations = config.getoption("durations", default=None)
+    if durations is not None:
+        extra_args.append(f"--durations={durations}")
+    durations_min = config.getoption("durations_min", default=None)
+    if durations_min is not None:
+        extra_args.append(f"--durations-min={durations_min}")
+    for opt_name, cli_flag in [
+        ("image", "--image"),
+        ("namespace", "--namespace"),
+        ("framework", "--framework"),
+        ("profile", "--profile"),
+    ]:
+        val = config.getoption(opt_name, default=None)
+        if val is not None:
+            extra_args.extend([cli_flag, str(val)])
+    if config.getoption("skip_service_restart", default=None):
+        extra_args.append("--skip-service-restart")
+
+    rc = run_parallel(
+        test_ids=test_ids,
+        meta=meta,
+        max_vram_gib=vram_limit,
+        num_slots=num_slots,
+        gpu_indices=gpu_indices,
+        extra_pytest_args=extra_args or None,
+        stream=is_stream,
+    )
+
+    if rc != 0:
+        session.testsfailed = 1
+    return True  # we handled the test loop
+
+
 @pytest.fixture()
 def set_ucx_tls_no_mm():
    """Set UCX env defaults for all tests."""
@@ -205,8 +345,10 @@ def _enable_offline_with_mistral_patch():
    except (ImportError, AttributeError):
        return  # transformers version without _patch_mistral_regex — nothing to do

-    # Write a sitecustomize.py so subprocesses also get the patch
-    patch_dir = os.path.join(tempfile.gettempdir(), "dynamo_test_hf_patch")
+    # Write a sitecustomize.py so subprocesses also get the patch.
+    # Use a per-worker dir under xdist to avoid write races.
+    worker_id = os.environ.get("PYTEST_XDIST_WORKER", "main")
+    patch_dir = os.path.join(tempfile.gettempdir(), f"dynamo_test_hf_patch_{worker_id}")
    os.makedirs(patch_dir, exist_ok=True)
    with open(os.path.join(patch_dir, "sitecustomize.py"), "w") as f:
        f.write(
@@ -239,26 +381,33 @@ def _enable_offline_with_mistral_patch():
 def _disable_offline_with_mistral_patch():
    """Undo _enable_offline_with_mistral_patch."""
    os.environ.pop("HF_HUB_OFFLINE", None)
-    patch_dir = os.path.join(tempfile.gettempdir(), "dynamo_test_hf_patch")
+    worker_id = os.environ.get("PYTEST_XDIST_WORKER", "main")
+    patch_dir = os.path.join(tempfile.gettempdir(), f"dynamo_test_hf_patch_{worker_id}")
    pythonpath = os.environ.get("PYTHONPATH", "")
    os.environ["PYTHONPATH"] = pythonpath.replace(f"{patch_dir}:", "").replace(
        patch_dir, ""
    )


+_download_lock_path = os.path.join(tempfile.gettempdir(), "pytest_model_download.lock")
+
+
 @pytest.fixture(scope="session")
 def predownload_models(pytestconfig):
-    """Fixture wrapper around download_models for models used in collected tests"""
-    # Get models from pytest config if available, otherwise fall back to TEST_MODELS
+    """Fixture wrapper around download_models for models used in collected tests.
+
+    Uses a file lock so that under xdist, only one worker downloads at a time
+    and the rest reuse the HuggingFace cache.
+    """
    models = getattr(pytestconfig, "models_to_download", None)
-    if models:
-        logging.info(
-            f"Downloading {len(models)} models needed for collected tests\nModels: {models}"
-        )
-        download_models(model_list=list(models))
-    else:
-        # Fallback to original behavior if extraction failed
-        download_models()
+    with FileLock(_download_lock_path):
+        if models:
+            logging.info(
+                f"Downloading {len(models)} models needed for collected tests\nModels: {models}"
+            )
+            download_models(model_list=list(models))
+        else:
+            download_models()

    _enable_offline_with_mistral_patch()
    yield
@@ -267,21 +416,20 @@ def predownload_models(pytestconfig):

 @pytest.fixture(scope="session")
 def predownload_tokenizers(pytestconfig):
-    """Fixture wrapper around download_models for tokenizers used in collected tests"""
-    # Get models from pytest config if available, otherwise fall back to TEST_MODELS
+    """Fixture wrapper around download_models for tokenizers used in collected tests.
+
+    Uses a file lock so that under xdist, only one worker downloads at a time.
+    """
    models = getattr(pytestconfig, "models_to_download", None)
-    if models:
-        logging.info(
-            f"Downloading tokenizers for {len(models)} models needed for collected tests\nModels: {models}"
-        )
-        download_models(model_list=list(models), ignore_weights=True)
-    else:
-        # Fallback to original behavior if extraction failed
-        download_models(ignore_weights=True)
+    with FileLock(_download_lock_path):
+        if models:
+            logging.info(
+                f"Downloading tokenizers for {len(models)} models needed for collected tests\nModels: {models}"
+            )
+            download_models(model_list=list(models), ignore_weights=True)
+        else:
+            download_models(ignore_weights=True)

-    # Skip redundant HuggingFace API calls in worker subprocesses since
-    # tokenizers are already cached. This avoids flaky timeouts from slow
-    # HF API responses (the RepoInfo fetch still happens even for cached models).
    _enable_offline_with_mistral_patch()
    yield
    _disable_offline_with_mistral_patch()
@@ -337,26 +485,41 @@ def pytest_collection_modifyitems(config, items):
                if _item_has_marker(item, marker_name):
                    item.add_marker(skip)

-    # Skip tests that exceed --max-vram-gib
+    # Deselect tests based on --max-vram-gib:
+    #   - Tests whose profiled VRAM exceeds the limit are removed
+    #   - Tests WITHOUT a VRAM marker are also removed (unknown VRAM = unsafe)
+    # Using deselect (not skip) so they never reach the GPU orchestrator.
    vram_limit = config.getoption("--max-vram-gib", default=None)
    if vram_limit is not None:
-        skip_vram = pytest.mark.skip(
-            reason=f"requires more than {vram_limit} GiB VRAM (--max-vram-gib={vram_limit})"
-        )
+        keep = []
+        deselected = []
        for item in items:
-            vram_mark = item.get_closest_marker("max_vram_gib")
-            if vram_mark and vram_mark.args and vram_mark.args[0] > vram_limit:
-                item.add_marker(skip_vram)
+            vram_mark = item.get_closest_marker("profiled_vram_gib")
+            if vram_mark and vram_mark.args and vram_mark.args[0] <= vram_limit:
+                keep.append(item)
+            else:
+                deselected.append(item)
+        if deselected:
+            config.hook.pytest_deselected(items=deselected)
+            items[:] = keep
+
+    # Write test metadata for the GPU orchestrator to read.
+    if vram_limit is not None:
+        # Delayed: see vram_utils pynvml note in pytest_configure
+        from tests.utils.vram_utils import print_gpu_plan, write_test_meta
+
+        write_test_meta(items)

-    # --dry-run: print run/skip breakdown and exit without executing tests
+    # --dry-run: print run/skip breakdown and exit without executing tests.
+    # At this point, items only contains tests that passed --max-vram-gib
+    # filtering (deselected items were already removed above).
    if config.getoption("--dry-run", default=False):
        would_run = []
        would_skip = []
-        unmarked = []
        for item in items:
-            vram_mark = item.get_closest_marker("max_vram_gib")
+            vram_mark = item.get_closest_marker("profiled_vram_gib")
            vram_val = vram_mark.args[0] if vram_mark and vram_mark.args else None
-            name = item.nodeid.split("::", 1)[1] if "::" in item.nodeid else item.nodeid
+            name = item.nodeid

            skip_reasons = []
            for marker in item.iter_markers("skip"):
@@ -365,39 +528,28 @@ def pytest_collection_modifyitems(config, items):
                    reason = marker.args[0]
                skip_reasons.append(reason or "no reason given")

-            vram_skipped = (
-                vram_limit is not None
-                and vram_val is not None
-                and vram_val > vram_limit
-            )
-            if vram_skipped:
-                skip_reasons.insert(0, f"{vram_val} GiB > {vram_limit} GiB VRAM limit")
-
            if skip_reasons:
                would_skip.append((name, vram_val, skip_reasons))
-            elif vram_val is not None:
-                would_run.append((name, vram_val))
            else:
-                unmarked.append(name)
+                would_run.append((name, vram_val))

        print(f"\n{'=' * 60}")
-        print(
-            f"--max-vram-gib={vram_limit or 'not set'}  |  {len(items)} tests selected"
-        )
+        print(f"--max-vram-gib={vram_limit or 'not set'}  |  {len(items)} tests")
        print(f"{'=' * 60}")
        if would_run:
            print(f"\nWould RUN ({len(would_run)}):")
            for name, gib in would_run:
-                print(f"  {name}  ({gib} GiB)")
+                gib_str = f"  ({gib} GiB)" if gib is not None else ""
+                print(f"  {name}{gib_str}")
        if would_skip:
            print(f"\nWould SKIP ({len(would_skip)}):")
            for name, vram_val, reasons in would_skip:
                vram_str = f"  ({vram_val} GiB)" if vram_val is not None else ""
                print(f"  {name}{vram_str}  -- {'; '.join(reasons)}")
-        if unmarked:
-            print(f"\nNo VRAM marker — always run ({len(unmarked)}):")
-            for name in unmarked:
-                print(f"  {name}")
+
+        gpus = config.stash.get(_gpu_parallel_gpus_key, None)
+        if gpus and vram_limit is not None:
+            print_gpu_plan(gpus, vram_limit, would_run)
        print()
        items.clear()
        return

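Note on scheduling: the conftest changes above hand the filtered test IDs to `run_parallel`, whose implementation is not shown in this section. As a rough illustration of what "VRAM-aware" admission can look like, here is a minimal sketch that greedily admits queued tests while their summed `profiled_vram_gib` stays within a budget. `QueuedTest` and `pick_runnable` are hypothetical names, and the greedy in-order policy is an assumption; the real scheduler also staggers launches and retries transient failures, which this sketch omits.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueuedTest:
    node_id: str
    profiled_gib: float  # value of the profiled_vram_gib marker


def pick_runnable(
    queued: list[QueuedTest],
    in_use_gib: float,
    budget_gib: float,
    free_slots: int,
) -> list[QueuedTest]:
    """Admit queued tests in order, skipping any that would exceed the budget."""
    admitted: list[QueuedTest] = []
    for test in queued:
        if len(admitted) >= free_slots:
            break  # no worker slot left
        if in_use_gib + test.profiled_gib <= budget_gib:
            admitted.append(test)
            in_use_gib += test.profiled_gib
    return admitted


# Profiled sizes taken from the example output; the fourth entry is made up.
queue = [
    QueuedTest("w0", 14.9),
    QueuedTest("w1", 20.2),
    QueuedTest("w2", 6.0),
    QueuedTest("w3", 20.2),
]
started = pick_runnable(queue, in_use_gib=0.0, budget_gib=43.0, free_slots=7)
print([t.node_id for t in started])  # prints: ['w0', 'w1', 'w2']
```

The fourth test stays queued because 41.1 GiB is already committed and another 20.2 GiB would exceed the 43 GiB budget.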
--- a/tests/frontend/test_vllm.py
+++ b/tests/frontend/test_vllm.py
@@ -99,9 +99,16 @@ class VllmWorkerProcess(ManagedProcess):
            "32768",
        ]

-        gpu_util = os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")
-        if gpu_util:
-            command.extend(["--gpu-memory-utilization", gpu_util])
+        kv_bytes = os.environ.get("_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES")
+        if kv_bytes:
+            command.extend(
+                [
+                    "--kv-cache-memory-bytes",
+                    kv_bytes,
+                    "--gpu-memory-utilization",
+                    "0.01",
+                ]
+            )

        env = os.environ.copy()
        env["DYN_LOG"] = "debug"
@@ -229,7 +236,8 @@ def _validate_chat_response(response: requests.Response) -> Dict[str, Any]:


 # Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning_effort
-@pytest.mark.max_vram_gib(20.4)  # observed peak 18.5 GiB (+10% safety)
+@pytest.mark.profiled_vram_gib(20.4)  # observed peak 18.5 GiB (+10% safety), carried over from max_vram_gib
+# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
 @pytest.mark.timeout(300)  # 3x observed ~70s wall time, rounded up
 @pytest.mark.post_merge
 def test_reasoning_effort(
@@ -297,7 +305,8 @@ def test_reasoning_effort(


 # Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling
-@pytest.mark.max_vram_gib(20.4)  # observed peak 18.5 GiB (+10% safety)
+@pytest.mark.profiled_vram_gib(20.4)  # observed peak 18.5 GiB (+10% safety), carried over from max_vram_gib
+# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
 @pytest.mark.timeout(113)  # 3x observed 37.4s wall time
 @pytest.mark.post_merge
 def test_tool_calling(
@@ -341,7 +350,8 @@ def test_tool_calling(


 # Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling_second_round
-@pytest.mark.max_vram_gib(20.4)  # observed peak 18.5 GiB (+10% safety)
+@pytest.mark.profiled_vram_gib(20.4)  # actual profiled peak
+# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
 @pytest.mark.timeout(115)  # 3x observed 38.1s wall time
 @pytest.mark.nightly
 def test_tool_calling_second_round(
@@ -407,7 +417,8 @@ def test_tool_calling_second_round(


 # Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning
-@pytest.mark.max_vram_gib(20.4)  # observed peak 18.5 GiB (+10% safety)
+@pytest.mark.profiled_vram_gib(20.4)  # actual profiled peak
+# TODO: profile with --kv-bytes once pre-existing 500 panic is fixed (JoinError::Panic "Cannot drop a runtime in a context where blocking is not allowed")
 @pytest.mark.timeout(131)  # 3x observed 43.4s wall time
 @pytest.mark.nightly
 def test_reasoning(request, start_services: ServicePorts, predownload_models) -> None:
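
Note on the override used above: when `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` is set, the worker is launched with an explicit `--kv-cache-memory-bytes` plus a token `--gpu-memory-utilization 0.01`. A comment elsewhere in this diff notes that, per vLLM's CacheConfig, the byte cap takes precedence over the fraction. The selection logic can be sketched roughly as follows. This is a minimal illustration only: `kv_override_args` and its `default_fraction` parameter are hypothetical names, not helpers from this PR.

```python
import os

# Environment variable name as it appears in the diff.
KV_BYTES_ENV = "_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES"


def kv_override_args(env=None, default_fraction=None):
    """Hypothetical helper: build the GPU-memory CLI args for a vLLM worker.

    Mirrors the branch in the diff: an explicit KV byte cap wins; otherwise
    fall back to a utilization fraction if the caller supplies one.
    """
    env = os.environ if env is None else env
    kv_bytes = env.get(KV_BYTES_ENV, "")
    if kv_bytes:
        return [
            "--kv-cache-memory-bytes",
            kv_bytes,
            "--gpu-memory-utilization",
            "0.01",
        ]
    if default_fraction is not None:
        return ["--gpu-memory-utilization", str(default_fraction)]
    return []
```

Passing `env` explicitly keeps the helper testable without mutating the process environment.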

--- a/tests/serve/common.py
+++ b/tests/serve/common.py
@@ -18,6 +18,7 @@ from tests.conftest import ServicePorts
 from tests.utils.client import send_request
 from tests.utils.constants import DefaultPort
 from tests.utils.engine_process import EngineConfig, EngineProcess
+from tests.utils.port_utils import allocate_port, deallocate_port

 DEFAULT_TIMEOUT = 10

@@ -93,6 +94,7 @@ def run_serve_deployment(

        # Ensure EngineProcess health checks hit the correct frontend port.
        config = dataclasses.replace(config, frontend_port=dynamic_frontend_port)
+
    else:
        # Backward compat: infer from config/extra_env if no explicit ports are passed.
        dynamic_frontend_port = int(config.frontend_port)
@@ -108,76 +110,86 @@ def run_serve_deployment(
            int(merged_env.get("DYN_SYSTEM_PORT2") or DefaultPort.SYSTEM2.value),
        ]

-    with EngineProcess.from_script(
-        config, request, extra_env=merged_env
-    ) as server_process:
-        for _payload in config.request_payloads:
-            logger.info("TESTING: Payload: %s", _payload.__class__.__name__)
-
-            # Make a per-iteration copy so tests can safely override ports/fields
-            # without mutating shared config instances across parametrized cases.
-            payload = deepcopy(_payload)
-            # inject model
-            if hasattr(payload, "with_model"):
-                payload = payload.with_model(config.model)
-
-            # Default behavior: requests go to the frontend port, except metrics which target
-            # worker system ports (mapped from DefaultPort -> per-test ports).
-            if getattr(payload, "endpoint", "") == "/metrics":
-                if payload.port == DefaultPort.SYSTEM1.value:
-                    if len(dynamic_system_ports) < 1:
-                        raise RuntimeError(
-                            "Payload targets SYSTEM_PORT1 but no system ports were provided "
-                            f"(payload={payload.__class__.__name__})"
-                        )
-                    payload.port = dynamic_system_ports[0]
-                elif payload.port == DefaultPort.SYSTEM2.value:
-                    if len(dynamic_system_ports) < 2:
-                        raise RuntimeError(
-                            "Payload targets SYSTEM_PORT2 but only 1 system port was provided "
-                            f"(payload={payload.__class__.__name__})"
-                        )
-                    payload.port = dynamic_system_ports[1]
-            else:
-                payload.port = dynamic_frontend_port
-
-            # Optional extra system ports for specialized payloads (e.g. LoRA control-plane APIs).
-            # BasePayload always defines `system_ports` (usually empty); map defaults
-            # (SYSTEM_PORT1/2) to per-test system ports when present.
-            if payload.system_ports:
-                mapped_system_ports: list[int] = []
-                for p in payload.system_ports:
-                    if p == DefaultPort.SYSTEM1.value:
+    # Disagg scripts need a unique bootstrap port so parallel runs don't collide.
+    disagg_bootstrap_port: int | None = None
+    if config.script_name and "disagg" in config.script_name:
+        disagg_bootstrap_port = allocate_port(12000)
+        merged_env["DYN_DISAGG_BOOTSTRAP_PORT"] = str(disagg_bootstrap_port)
+
+    try:
+        with EngineProcess.from_script(
+            config, request, extra_env=merged_env
+        ) as server_process:
+            for _payload in config.request_payloads:
+                logger.info("TESTING: Payload: %s", _payload.__class__.__name__)
+
+                # Make a per-iteration copy so tests can safely override ports/fields
+                # without mutating shared config instances across parametrized cases.
+                payload = deepcopy(_payload)
+                # inject model
+                if hasattr(payload, "with_model"):
+                    payload = payload.with_model(config.model)
+
+                # Default behavior: requests go to the frontend port, except metrics which target
+                # worker system ports (mapped from DefaultPort -> per-test ports).
+                if getattr(payload, "endpoint", "") == "/metrics":
+                    if payload.port == DefaultPort.SYSTEM1.value:
                        if len(dynamic_system_ports) < 1:
                            raise RuntimeError(
-                                "Payload.system_ports includes SYSTEM_PORT1 but no system ports were provided "
+                                "Payload targets SYSTEM_PORT1 but no system ports were provided "
                                f"(payload={payload.__class__.__name__})"
                            )
-                        mapped_system_ports.append(dynamic_system_ports[0])
-                    elif p == DefaultPort.SYSTEM2.value:
+                        payload.port = dynamic_system_ports[0]
+                    elif payload.port == DefaultPort.SYSTEM2.value:
                        if len(dynamic_system_ports) < 2:
                            raise RuntimeError(
-                                "Payload.system_ports includes SYSTEM_PORT2 but only 1 system port was provided "
+                                "Payload targets SYSTEM_PORT2 but only 1 system port was provided "
                                f"(payload={payload.__class__.__name__})"
                            )
-                        mapped_system_ports.append(dynamic_system_ports[1])
-                    else:
-                        mapped_system_ports.append(p)
-                payload.system_ports = mapped_system_ports
-
-            for _ in range(payload.repeat_count):
-                response = send_request(
-                    url=payload.url(),
-                    payload=payload.body,
-                    timeout=payload.timeout,
-                    method=payload.method,
-                    stream=payload.http_stream,
-                )
-                server_process.check_response(payload, response)
-
-            # Call final_validation if the payload has one (e.g., CachedTokensChatPayload)
-            if hasattr(payload, "final_validation"):
-                payload.final_validation()
+                        payload.port = dynamic_system_ports[1]
+                else:
+                    payload.port = dynamic_frontend_port
+
+                # Optional extra system ports for specialized payloads (e.g. LoRA control-plane APIs).
+                # BasePayload always defines `system_ports` (usually empty); map defaults
+                # (SYSTEM_PORT1/2) to per-test system ports when present.
+                if payload.system_ports:
+                    mapped_system_ports: list[int] = []
+                    for p in payload.system_ports:
+                        if p == DefaultPort.SYSTEM1.value:
+                            if len(dynamic_system_ports) < 1:
+                                raise RuntimeError(
+                                    "Payload.system_ports includes SYSTEM_PORT1 but no system ports were provided "
+                                    f"(payload={payload.__class__.__name__})"
+                                )
+                            mapped_system_ports.append(dynamic_system_ports[0])
+                        elif p == DefaultPort.SYSTEM2.value:
+                            if len(dynamic_system_ports) < 2:
+                                raise RuntimeError(
+                                    "Payload.system_ports includes SYSTEM_PORT2 but only 1 system port was provided "
+                                    f"(payload={payload.__class__.__name__})"
+                                )
+                            mapped_system_ports.append(dynamic_system_ports[1])
+                        else:
+                            mapped_system_ports.append(p)
+                    payload.system_ports = mapped_system_ports
+
+                for _ in range(payload.repeat_count):
+                    response = send_request(
+                        url=payload.url(),
+                        payload=payload.body,
+                        timeout=payload.timeout,
+                        method=payload.method,
+                        stream=payload.http_stream,
+                    )
+                    server_process.check_response(payload, response)
+
+                # Call final_validation if the payload has one (e.g., CachedTokensChatPayload)
+                if hasattr(payload, "final_validation"):
+                    payload.final_validation()
+    finally:
+        if disagg_bootstrap_port is not None:
+            deallocate_port(disagg_bootstrap_port)


 def params_with_model_mark(configs: Mapping[str, EngineConfig]):

--- a/tests/serve/launch/multi_node_tp_headless.sh
+++ b/tests/serve/launch/multi_node_tp_headless.sh
@@ -12,7 +12,11 @@ trap 'echo "Cleaning up..."; kill 0' EXIT

 MODEL="${MODEL:-Qwen/Qwen3-0.6B}"

-GPU_MEM_FRACTION="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}"
+KV_BYTES="${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"
+GPU_MEM_ARGS=""
+if [[ -n "$KV_BYTES" ]]; then
+    GPU_MEM_ARGS="--kv-cache-memory-bytes $KV_BYTES --gpu-memory-utilization 0.01"
+fi

 echo "Starting Dynamo frontend..."
 python3 -m dynamo.frontend &
@@ -25,7 +29,7 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
  --node-rank 0 \
  --master-addr 127.0.0.1 \
  --enforce-eager \
-  ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
+  $GPU_MEM_ARGS &

 echo "Starting dynamo.vllm headless worker (TP=2, nnodes=2, node-rank=1, GPU 1)..."
 CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
@@ -35,7 +39,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
  --node-rank 1 \
  --master-addr 127.0.0.1 \
  --enforce-eager \
-  ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} \
+  $GPU_MEM_ARGS \
  --headless &

 wait
--- a/tests/serve/test_sglang.py
+++ b/tests/serve/test_sglang.py
@@ -45,9 +45,9 @@ sglang_dir = os.environ.get("SGLANG_DIR") or os.path.join(

 # SGLang test configurations
 # NOTE: pytest.mark.gpu_1 tests take ~167s (2m 47s) total to run sequentially (with models pre-cached)
-# TODO: Now that these tests use dynamic ports and each config has a max_vram_gib marker,
+# TODO: Now that these tests use dynamic ports and each config has a profiled_vram_gib marker,
 # optimize the runtime by bin-packing multiple engine deployments in parallel on the same GPU.
-# A future collector/launcher can sum max_vram_gib values to decide how many tests fit
+# A future collector/launcher can sum profiled_vram_gib values to decide how many tests fit
 # concurrently without exceeding available VRAM.
 sglang_configs = {
    "aggregated": SGLangConfig(
@@ -58,8 +58,13 @@ sglang_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(6.1),  # observed peak 5.6 GiB (+10% safety)
-            pytest.mark.timeout(240),  # profiled 34.4s on A6000
+            pytest.mark.profiled_vram_gib(
+                3.7
+            ),  # actual peak at recommended token count
+            pytest.mark.requested_sglang_kv_tokens(
+                96
+            ),  # KV cache cap (2x safety over min=48)
+            pytest.mark.timeout(195),  # profiled 33s on RTX 6000 Ada
            pytest.mark.pre_merge,
        ],
        model="Qwen/Qwen3-0.6B",
@@ -160,7 +165,8 @@ sglang_configs = {
        script_name="template_verifier.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.timeout(240),  # profiled 11.7s on A6000 (no GPU model load)
+            pytest.mark.profiled_vram_gib(0.0),  # no GPU model load
+            pytest.mark.timeout(120),  # profiled 12s on RTX 6000 Ada
            pytest.mark.pre_merge,
            pytest.mark.nightly,
        ],
@@ -175,8 +181,8 @@ sglang_configs = {
    ),
    # NOTE: Pack all workers on 1 GPU for lower CI resource requirements.
    # NOTE: multimodal_epd.sh uses explicit --mem-fraction-static via DYN_ENCODE_GPU_MEM
-    # / DYN_WORKER_GPU_MEM env vars, so _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE has no effect.
-    # Regardless of fraction overrides, the workers combined consistently use ~23.6 GiB.
+    # / DYN_WORKER_GPU_MEM env vars. The profiler override distributes proportionally
+    # but workers combined consistently use ~23.6 GiB regardless of fraction overrides.
    "multimodal_e_pd_qwen": SGLangConfig(
        # E/P/D architecture: Encode, Prefill, Decode workers all on GPU 0
        name="multimodal_e_pd_qwen",
@@ -184,16 +190,15 @@ sglang_configs = {
        script_name="multimodal_epd.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(13.3),  # observed peak 12.1 GiB (+10% safety)
-            pytest.mark.timeout(360),  # profiled 31.0s on A6000
+            # No profiled_vram_gib: uses hard-coded --mem-fraction-static via
+            # DYN_ENCODE_GPU_MEM / DYN_WORKER_GPU_MEM, so VRAM scales with GPU size.
+            pytest.mark.timeout(210),  # profiled 35s on RTX 6000 Ada
            pytest.mark.pre_merge,
        ],
        model="Qwen/Qwen3-VL-2B-Instruct",
        script_args=["--model", "Qwen/Qwen3-VL-2B-Instruct", "--single-gpu"],
        timeout=360,
        env={
-            "DYN_ENCODE_WORKER_GPU": "0",
-            "DYN_WORKER_GPU": "0",
            "DYN_ENCODE_GPU_MEM": "0.1",
            "DYN_WORKER_GPU_MEM": "0.4",
        },
@@ -226,8 +231,11 @@ sglang_configs = {
        script_name="multimodal_disagg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(17.7),  # observed peak 16.1 GiB (+10% safety)
-            pytest.mark.timeout(360),  # profiled 36.0s on A6000
+            pytest.mark.profiled_vram_gib(16.1),  # actual profiled peak
+            pytest.mark.requested_sglang_kv_tokens(
+                1024
+            ),  # KV cache cap (2x safety over min=512)
+            pytest.mark.timeout(222),  # profiled 37s on RTX 6000 Ada
            pytest.mark.pre_merge,
        ],
        model="Qwen/Qwen3-VL-2B-Instruct",
@@ -261,8 +269,13 @@ sglang_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(21.0),  # observed peak 19.1 GiB (+10% safety)
-            pytest.mark.timeout(300),  # profiled 41.3s on A6000
+            pytest.mark.profiled_vram_gib(
+                19.1
+            ),  # actual peak at recommended token count
+            pytest.mark.requested_sglang_kv_tokens(
+                768
+            ),  # KV cache cap (2x safety over min=384)
+            pytest.mark.timeout(182),  # profiled 30s on RTX 6000 Ada
            pytest.mark.pre_merge,
            pytest.mark.nightly,
        ],
@@ -300,8 +313,13 @@ sglang_configs = {
        script_name="agg_embed.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(12.1),  # observed peak 11.0 GiB (+10% safety)
-            pytest.mark.timeout(270),  # profiled 25.5s on A6000
+            pytest.mark.profiled_vram_gib(
+                9.8
+            ),  # actual peak at recommended token count
+            pytest.mark.requested_sglang_kv_tokens(
+                128
+            ),  # KV cache cap (2x safety over min=64)
+            pytest.mark.timeout(147),  # profiled 24s on RTX 6000 Ada
            pytest.mark.pre_merge,
            pytest.mark.nightly,
        ],
@@ -338,8 +356,13 @@ sglang_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(16.2),  # observed peak 14.8 GiB (+10% safety)
-            pytest.mark.timeout(420),  # profiled 73s on A6000
+            pytest.mark.profiled_vram_gib(
+                14.7
+            ),  # actual peak at recommended token count
+            pytest.mark.requested_sglang_kv_tokens(
+                64
+            ),  # KV cache cap (2x safety over min=32)
+            pytest.mark.timeout(341),  # profiled 57s on RTX 6000 Ada
            pytest.mark.post_merge,
        ],
        model="deepseek-ai/deepseek-llm-7b-base",
@@ -362,7 +385,7 @@ sglang_configs = {
            pytest.mark.post_merge,
            pytest.mark.timeout(240),
            pytest.mark.skip(reason="DYN-2261"),
-            # TODO: profile to get max_vram (currently skipped)
+            # TODO: profile once DYN-2261 is fixed (uses agg.sh, profiler works)
        ],
        model="Qwen/Qwen3-0.6B",
        env={"DYN_ENABLE_ANTHROPIC_API": "1"},
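
Note on the bin-packing TODO at the top of this file: summing `profiled_vram_gib` values to decide which tests can share a GPU can be approximated with a first-fit-decreasing pass. The sketch below is not the scheduler from this PR (its internals are not shown in this part of the diff); `pack_tests` and the test names are hypothetical, and the VRAM figures are the SGLang marker values from the hunks above.

```python
def pack_tests(tests, vram_limit_gib):
    """Group (name, profiled_vram_gib) pairs into batches that fit the limit.

    First-fit decreasing: place the largest tests first, each into the first
    batch with enough remaining VRAM, opening a new batch when none fits.
    """
    batches = []  # each entry: {"names": [...], "used": float}
    for name, gib in sorted(tests, key=lambda t: t[1], reverse=True):
        if gib > vram_limit_gib:
            raise ValueError(f"{name} needs {gib} GiB, limit is {vram_limit_gib}")
        for batch in batches:
            if batch["used"] + gib <= vram_limit_gib:
                batch["names"].append(name)
                batch["used"] += gib
                break
        else:
            batches.append({"names": [name], "used": gib})
    return [batch["names"] for batch in batches]


# Hypothetical names paired with profiled_vram_gib values from the diff.
sglang_tests = [("a", 3.7), ("b", 16.1), ("c", 19.1), ("d", 9.8), ("e", 14.7)]
plan_24 = pack_tests(sglang_tests, vram_limit_gib=24)
plan_48 = pack_tests(sglang_tests, vram_limit_gib=48)
```

Each inner list is a set of tests that could run concurrently on one GPU. A real scheduler would likely also reserve headroom for measurement variance.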

--- a/tests/serve/test_vllm.py
+++ b/tests/serve/test_vllm.py
@@ -54,9 +54,9 @@ vllm_dir = os.environ.get("VLLM_DIR") or os.path.join(

 # vLLM test configurations
 # NOTE: pytest.mark.gpu_1 tests take ~5.5 minutes total to run sequentially (with models pre-cached)
-# TODO: Now that these tests use dynamic ports and each config has a max_vram_gib marker,
+# TODO: Now that these tests use dynamic ports and each config has VRAM markers,
 # optimize the runtime by bin-packing multiple engine deployments in parallel on the same GPU.
-# A future collector/launcher can sum max_vram_gib values to decide how many tests fit
+# A future collector/launcher can sum profiled_vram_gib values to decide how many tests fit
 # concurrently without exceeding available VRAM.
 vllm_configs = {
    "aggregated": VLLMConfig(
@@ -65,8 +65,13 @@ vllm_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(8.6),  # observed peak 7.8 GiB (+10% safety)
-            pytest.mark.timeout(300),  # ~7x observed 42.2s; old value before profiling
+            pytest.mark.profiled_vram_gib(3.8),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_119_388_000
+            ),  # KV cache cap (2x safety over min=559_693_824)
+            pytest.mark.timeout(
+                360
+            ),  # ~8.5x observed 42.2s; bumped for GPU-parallel headroom
            pytest.mark.pre_merge,
        ],
        model="Qwen/Qwen3-0.6B",
@@ -93,7 +98,10 @@ vllm_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(8.6),  # observed peak 7.8 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(3.8),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_119_388_000
+            ),  # KV cache cap (2x safety over min=559_693_824)
            pytest.mark.timeout(120),  # ~5x observed 24.3s; CI machines are slower
            pytest.mark.post_merge,
        ],
@@ -122,7 +130,10 @@ vllm_configs = {
        marks=[
            pytest.mark.lmcache,
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(8.1),  # observed peak 7.4 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(3.8),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_119_388_000
+            ),  # KV cache cap (2x safety over min=559_693_824)
            pytest.mark.timeout(360),  # ~7x observed 49.0s; old value before profiling
            pytest.mark.pre_merge,
            pytest.mark.skipif(
@@ -145,7 +156,10 @@ vllm_configs = {
        marks=[
            pytest.mark.lmcache,
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(8.1),  # observed peak 7.4 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(3.8),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_119_388_000
+            ),  # KV cache cap (2x safety over min=559_693_824)
            pytest.mark.timeout(360),  # ~7x observed 49.3s; old value before profiling
            pytest.mark.pre_merge,
            pytest.mark.skipif(
@@ -170,8 +184,13 @@ vllm_configs = {
        script_name="agg_request_planes.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(8.1),  # observed peak 7.3 GiB (+10% safety)
-            pytest.mark.timeout(300),  # ~7x observed 43.0s; old value before profiling
+            pytest.mark.profiled_vram_gib(3.8),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_119_388_000
+            ),  # KV cache cap (2x safety over min=559_693_824)
+            pytest.mark.timeout(
+                360
+            ),  # ~8x observed 43.0s; bumped for GPU-parallel headroom
            pytest.mark.pre_merge,
        ],
        model="Qwen/Qwen3-0.6B",
@@ -187,8 +206,13 @@ vllm_configs = {
        script_name="agg_request_planes.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(8.1),  # observed peak 7.3 GiB (+10% safety)
-            pytest.mark.timeout(300),  # ~7x observed 42.3s; old value before profiling
+            pytest.mark.profiled_vram_gib(3.8),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_119_388_000
+            ),  # KV cache cap (2x safety over min=559_693_824)
+            pytest.mark.timeout(
+                360
+            ),  # ~8.5x observed 42.3s; bumped for GPU-parallel headroom
            pytest.mark.pre_merge,
        ],
        model="Qwen/Qwen3-0.6B",
@@ -299,13 +323,17 @@ vllm_configs = {
        ],
    ),
    # NOTE: Pack all workers on 1 GPU for lower CI resource requirements
+    # NOTE: disagg_multimodal_e_pd.sh uses explicit --gpu-memory-utilization via
+    # DYN_ENCODE_GPU_MEM / DYN_PD_GPU_MEM env vars in single-GPU mode.
+    # PD worker honors build_gpu_mem_args for parallel execution.
    "multimodal_e_pd_qwen": VLLMConfig(
        name="multimodal_e_pd_qwen",
        directory=vllm_dir,
        script_name="disagg_multimodal_e_pd.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(24.6),  # observed peak 22.3 GiB (+10% safety)
+            # No profiled_vram_gib / requested_vllm_kv_cache_bytes: single-GPU mode
+            # uses hardcoded fractions (encode=0.1, PD=0.7) that scale with GPU size.
            pytest.mark.timeout(340),  # ~5x observed 68.4s; 2B model loads slower on CI
            pytest.mark.pre_merge,
        ],
@@ -339,7 +367,10 @@ vllm_configs = {
        # post_merge because needs real NIXL not stub
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(10.2),  # observed peak 9.3 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(9.6),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_710_490_000
+            ),  # KV cache cap (2x safety over min=855_244_800)
            pytest.mark.timeout(220),  # ~5x observed 43.7s; 2B model loads slower on CI
            pytest.mark.post_merge,
        ],
@@ -373,21 +404,25 @@ vllm_configs = {
    # NOTE: disagg_multimodal_epd.sh uses --kv-cache-memory-bytes=512MB for P/D
    # workers. Per vLLM CacheConfig, kv_cache_memory_bytes (when not-None) ignores
    # gpu_memory_utilization (ref: https://docs.vllm.ai/en/stable/api/vllm/config/cache/),
-    # so _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE has no effect. Regardless of GPU_MEM
+    # so KV cache overrides have no effect. Regardless of GPU_MEM
    # fractions (0.1/0.4/0.4), the 3 workers combined consistently use ~17.6 GiB
    # total on this GPU.
+    # NOTE: disagg_multimodal_epd.sh uses explicit --gpu-memory-utilization via
+    # DYN_ENCODE_GPU_MEM / DYN_PREFILL_GPU_MEM / DYN_DECODE_GPU_MEM env vars.
+    # P/D workers honor build_gpu_mem_args for parallel execution.
    "multimodal_disagg_qwen": VLLMConfig(
        name="multimodal_disagg_qwen",
        directory=vllm_dir,
        script_name="disagg_multimodal_epd.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(19.4),  # observed peak 17.6 GiB (+10% safety)
+            # No profiled_vram_gib / requested_vllm_kv_cache_bytes: single-GPU mode
+            # uses hardcoded fractions via DYN_*_GPU_MEM that scale with GPU size.
            pytest.mark.pre_merge,
        ],
        model="Qwen/Qwen3-VL-2B-Instruct",
        script_args=["--model", "Qwen/Qwen3-VL-2B-Instruct", "--single-gpu"],
-        timeout=360,
+        timeout=300,
        env={
            "DYN_ENCODE_WORKER_GPU": "0",
            "DYN_PREFILL_WORKER_GPU": "0",
@@ -421,7 +456,10 @@ vllm_configs = {
        script_name="agg_multimodal.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(21.6),  # observed peak 19.6 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(19.9),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                922_354_000
+            ),  # KV cache cap (2x safety over min=461_176_832)
            pytest.mark.timeout(
                360
            ),  # ~7x observed 50.0s; 7B model loads ~48s on CI (A10G/L4)
@@ -455,7 +493,10 @@ vllm_configs = {
        script_name="agg_multimodal.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(18.9),  # observed peak 17.1 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(14.9),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                922_354_000
+            ),  # KV cache cap (2x safety over min=461_176_832)
            pytest.mark.timeout(
                300
            ),  # ~7x observed 42.7s; 7B model loads ~48s on CI (A10G/L4)
@@ -703,7 +744,10 @@ vllm_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(21.9),  # observed peak 19.9 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(18.3),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                4_074_898_000
+            ),  # KV cache cap (2x safety over min=2_037_448_704)
            pytest.mark.timeout(
                420
            ),  # 7B model loads ~48s on CI (A10G/L4) vs ~15s locally
@@ -742,7 +786,10 @@ vllm_configs = {
        script_name="agg.sh",
        marks=[
            pytest.mark.gpu_1,
-            pytest.mark.max_vram_gib(8.6),  # observed peak 7.8 GiB (+10% safety)
+            pytest.mark.profiled_vram_gib(3.8),  # actual profiled peak with kv-bytes
+            pytest.mark.requested_vllm_kv_cache_bytes(
+                1_119_388_000
+            ),  # KV cache cap (2x safety over min=559_693_824)
            pytest.mark.timeout(110),  # ~5x observed 22.3s; CI machines are slower
            pytest.mark.pre_merge,
        ],
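
Note on the `requested_vllm_kv_cache_bytes` values above: each marker comment reads "2x safety over min=...". The marker values are consistent with doubling the minimum passing value and rounding up to the next multiple of 1000. The rounding step is inferred from the numbers in this diff rather than stated in it, so treat the sketch below as an assumption; `recommend_kv_bytes` is a hypothetical name.

```python
KV_SAFETY_FACTOR = 2  # mirrors _KV_SAFETY_FACTOR = 2.0 in profile_pytest.py


def recommend_kv_bytes(min_passing_bytes, round_to=1000):
    """Hypothetical: pad the minimum passing KV size, then round up.

    The round-up granularity (1000) is an inference from the marker values.
    """
    padded = min_passing_bytes * KV_SAFETY_FACTOR
    # Integer ceiling division avoids floating-point error on large byte counts.
    return -(-padded // round_to) * round_to


# Minimum values copied from the marker comments in the diff.
for min_bytes in (559_693_824, 855_244_800, 461_176_832, 2_037_448_704):
    print(f"min={min_bytes:_} -> requested={recommend_kv_bytes(min_bytes):_}")
```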

--- a/tests/utils/profile_pytest.py
+++ b/tests/utils/profile_pytest.py
@@ -14,17 +14,18 @@ in-process instrumentation.  Using NVML directly (the same C library that
 ``nvidia-smi`` wraps) avoids the overhead of forking a subprocess each sample
 and allows high-frequency sampling.

-In **binary-search mode** (the default), the profiler sets the env var
-``_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE`` to a value between 0.05 and 0.95 and
-re-runs the test at each midpoint.  If the test passes, the fraction is lowered;
-if it OOMs, the fraction is raised — standard bisection to find the minimum
-VRAM the test needs.  The peak ``memory.used`` from the last passing run
-(plus a 10 % safety margin) becomes the ``@pytest.mark.max_vram_gib`` recommendation.
-
-**IMPORTANT**: The test under profile **MUST** honor ``_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE``
-— either directly (see ``test_mock_gpu_alloc.py``) or via launch scripts that
-pass it as ``--gpu-memory-utilization`` to vLLM (e.g. ``agg.sh``).  If the test
-ignores this variable, every probe will pass at the same peak and the profiler
+In **binary-search mode** (the default), the profiler bisects the KV cache
+allocation — ``_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES`` for vLLM (bytes) or
+``_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS`` for SGLang (tokens).
+If the test passes, the allocation is lowered; if it OOMs, it is raised —
+standard bisection to find the minimum the test needs.  A safety factor
+is applied and the peak ``memory.used`` from the last passing run becomes
+the ``@pytest.mark.profiled_vram_gib`` recommendation.
+
+**IMPORTANT**: The test under profile **MUST** read the appropriate KV cache
+override — either directly (see ``test_mock_gpu_alloc.py``) or via launch
+scripts that call ``build_gpu_mem_args`` (e.g. ``agg.sh``).  If the test
+ignores the override, every probe will pass at the same peak and the profiler
 will warn that the binary search is unreliable.

 Usage::
@@ -51,6 +52,7 @@ import json
 import logging
 import math
 import os
+import re
 import shutil
 import subprocess
 import sys
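
Note on the binary-search mode described in the docstring above: the profiler lowers the KV allocation when a probe passes and raises it when the probe OOMs. A minimal sketch of that bisection follows, assuming the pass/fail outcome is monotonic in the allocation (any value above a passing one also passes). `bisect_min_passing` and the `passes` callback are hypothetical names; the real profiler runs pytest for each probe and also applies an early-stop rule that is not modeled here.

```python
def bisect_min_passing(passes, lo, hi, tolerance=1):
    """Return the smallest value in (lo, hi], within tolerance, that passes.

    Assumes passes(hi) is True, passes(lo) is False, and that passing is
    monotonic in the value being bisected.
    """
    if not passes(hi):
        raise ValueError("upper bound does not pass; widen the search range")
    while hi - lo > tolerance:
        mid = (lo + hi) // 2
        if passes(mid):
            hi = mid  # probe passed: try a smaller allocation
        else:
            lo = mid  # probe failed (e.g. OOM): need a larger allocation
    return hi


# Toy predicate standing in for "the test passes with this many KV bytes".
threshold = 559_693_824
minimum = bisect_min_passing(lambda v: v >= threshold, lo=0, hi=8 * 2**30)
```

In practice each probe is an expensive test run, so a coarser `tolerance` trades precision for fewer probes.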
@@ -68,6 +70,11 @@ logger = logging.getLogger(__name__)
 # tier has headroom for variance across runs.
 _VRAM_SAFETY_FACTOR = 1.1

+# Safety margin for KV cache recommendations (both SGLang tokens and vLLM bytes).
+# The minimum passing value is multiplied by this factor to provide headroom for
+# prompt length variation, scheduling jitter, and multi-turn conversations.
+_KV_SAFETY_FACTOR = 2.0
+
 # Phase detection: a memory jump exceeding this threshold (MiB) between
 # consecutive samples marks a phase boundary.
 _PHASE_JUMP_MIB = 200
@@ -77,6 +84,11 @@ _PHASE_JUMP_MIB = 200
 _PLATEAU_TOLERANCE_MIB = 50
 _PLATEAU_MIN_SAMPLES = 3

+# Early-stop threshold for binary search: if the last 3 probes have peak
+# VRAM within this range, the bisection is in the noise floor (model weights
+# dominate) and further probes won't yield meaningful data.
+_EARLY_STOP_RANGE_MIB = 768  # 0.75 GiB
+

 def _extract_model_from_markers(pytest_args: list[str]) -> str | None:
    """Extract the model name from @pytest.mark.model(...) via pytest-json-report.
@@ -446,6 +458,9 @@ def _recommend_markers(
    wall_secs: float,
    model_name: str | None = None,
    num_runs: int = 1,
+    requested_sglang_kv_tokens: int | None = None,
+    requested_vllm_kv_cache_bytes: int | None = None,
+    min_kv_value: int | None = None,
 ) -> tuple[list[MarkerRecommendation], list[str]]:
    """Generate marker recommendations from profiling data.

@@ -523,17 +538,37 @@ def _recommend_markers(
            )
        )

-    # -- Hardware: VRAM requirement --
+    # -- Hardware: VRAM requirements (two markers) --
    if used_vram > _PLATEAU_TOLERANCE_MIB:
+        max_peak_gib = round(max_peak_mib / 1024, 1)
        padded_peak_mib = int(max_peak_mib * _VRAM_SAFETY_FACTOR)
        padded_peak_gib = round(padded_peak_mib / 1024, 1)
+
+        # profiled_vram_gib: actual nvidia-smi peak (for scheduling/filtering)
        recs.append(
            MarkerRecommendation(
-                f"max_vram_gib({padded_peak_gib})",
-                f"peak {_format_mib(max_peak_mib)} GPU RAM used "
-                f"(+10% safety: {_format_mib(padded_peak_mib)})",
+                f"profiled_vram_gib({max_peak_gib})",
+                f"actual nvidia-smi peak {_format_mib(max_peak_mib)}",
            )
        )
+        if requested_sglang_kv_tokens is not None:
+            min_label = f" over min={min_kv_value}" if min_kv_value is not None else ""
+            recs.append(
+                MarkerRecommendation(
+                    f"requested_sglang_kv_tokens({requested_sglang_kv_tokens})",
+                    f"KV cache cap ({_KV_SAFETY_FACTOR:.0f}x safety{min_label})",
+                )
+            )
+        if requested_vllm_kv_cache_bytes is not None:
+            min_label = (
+                f" over min={min_kv_value:_}" if min_kv_value is not None else ""
+            )
+            recs.append(
+                MarkerRecommendation(
+                    f"requested_vllm_kv_cache_bytes({requested_vllm_kv_cache_bytes:_})",
+                    f"KV cache cap ({_KV_SAFETY_FACTOR:.0f}x safety{min_label})",
+                )
+            )

        # Warn about GPU cards that would OOM
        for card_gib, card_name in _GPU_REFERENCE_CARDS:
@@ -541,7 +576,7 @@ def _recommend_markers(
                warnings.append(f"Will OOM on {card_name} ({card_gib} GiB).")

    # -- Timeout --
-    timeout_val = int(math.ceil(wall_secs * 3.0))
+    timeout_val = int(math.ceil(wall_secs * 6.0))
    timeout_val = max(timeout_val, 10)
    recs.append(
        MarkerRecommendation(
@@ -598,6 +633,46 @@ def _print_recommendations(
    print()


+_SGLANG_NODEID_MARKERS = ["test_sglang", "sglang"]
+
+
+def _is_sglang_test(pytest_args: list[str]) -> bool:
+    """Check if any pytest arg looks like an SGLang test node ID."""
+    return any(
+        marker in arg for arg in pytest_args for marker in _SGLANG_NODEID_MARKERS
+    )
+
+
+_OOM_PATTERNS = [
+    "OutOfMemoryError",
+    "CUDA out of memory",
+    "CUDA error: out of memory",
+    "not enough memory",
+    "Cannot allocate",
+    "oom-kill",
+]
+
+
+def _looks_like_oom(stdout: str) -> bool:
+    """Check if captured output contains OOM-like errors."""
+    stdout_lower = stdout.lower()
+    return any(pat.lower() in stdout_lower for pat in _OOM_PATTERNS)
+
+
+_SGLANG_MAX_TOKENS_RE = re.compile(r"max_total_tokens=(\d+)")
+
+
+def _extract_requested_sglang_kv_tokens(stdout: str) -> int | None:
+    """Extract max_total_tokens from SGLang engine output.
+
+    SGLang logs: "Got total KV blocks from scheduler: N (max_total_tokens=M, page_size=P)"
+    """
+    match = _SGLANG_MAX_TOKENS_RE.search(stdout)
+    if match:
+        return int(match.group(1))
+    return None
+
+
 _DEFAULT_PROBE_TIMEOUT = 300  # 5 minutes max per profile run


@@ -610,13 +685,13 @@ def _run_once(
    quiet: bool = False,
    run_label: str | None = None,
    timeout: float = _DEFAULT_PROBE_TIMEOUT,
-) -> tuple[int, float, list[GpuReport], list[GpuSample]]:
+) -> tuple[int, float, list[GpuReport], list[GpuSample], str]:
    """Run pytest once with GPU sampling.

    When *run_label* is set, each line of pytest stdout/stderr is prefixed
    with ``[run_label]`` so multi-run output is easy to follow.

-    Returns (exit_code, wall_secs, reports, raw_samples).
+    Returns (exit_code, wall_secs, reports, raw_samples, captured_stdout).
    """
    sampler = _Sampler(interval=interval)
    sampler.start()
@@ -639,6 +714,7 @@ def _run_once(
    capture = run_label is not None
    t_start = time.monotonic()
    timed_out = False
+    captured_stdout = ""
    try:
        result = subprocess.run(
            pytest_cmd,
@@ -648,6 +724,8 @@ def _run_once(
            timeout=timeout,
        )
        rc = result.returncode
+        if capture:
+            captured_stdout = result.stdout or ""
    except subprocess.TimeoutExpired:
        timed_out = True
        rc = 1
@@ -658,9 +736,9 @@ def _run_once(
            )
    if not timed_out and capture:
        prefix = f"[{run_label}] "
-        for line in result.stdout.splitlines():
+        for line in captured_stdout.splitlines():
            print(f"{prefix}{line}")
-        for line in result.stderr.splitlines():
+        for line in (result.stderr or "").splitlines():
            print(f"{prefix}{line}", file=sys.stderr)
    sys.stdout.flush()
    wall_secs = time.monotonic() - t_start
@@ -672,7 +750,7 @@ def _run_once(

    sampler.stop()
    reports = _build_reports(sampler.samples, baseline_end, test_end)
-    return rc, wall_secs, reports, sampler.samples
+    return rc, wall_secs, reports, sampler.samples, captured_stdout


 def _find_min_vram(
@@ -682,23 +760,46 @@ def _find_min_vram(
    teardown_seconds: float = 2.0,
    recommend: bool = True,
    csv_path: str | None = None,
+    kv_bytes_mode: bool = False,
+    gpu_index: int = 0,
 ) -> int:
-    """Binary search _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE to find the minimum VRAM a test needs.
+    """Binary search to find the minimum VRAM a test needs.

-    Sets _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE env var (honored by agg.sh and similar scripts),
-    runs the test at each profile point, and bisects until the boundary is found.
+    Two modes, one shared pattern:
+
+    KV bisection (deterministic, no profiling race):
+      vLLM:   bisects _PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES (bytes)
+      SGLang: bisects _PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS (tokens)
+      Both use the same _KV_SAFETY_FACTOR (2x) and the same bisect loop.
+      The only differences are env var name, units, display, and bounds.
    """
+    is_sglang = _is_sglang_test(pytest_args)
+
    gpu_info = _query_gpu_stats()
    if not gpu_info:
        raise RuntimeError("NVML returned no GPU data")
-    used_mib = gpu_info[0][1]
-    total_mib = gpu_info[0][2]
+    if gpu_index >= len(gpu_info):
+        raise RuntimeError(
+            f"GPU {gpu_index} not found (available: 0..{len(gpu_info) - 1})"
+        )
+    used_mib = gpu_info[gpu_index][1]
+    total_mib = gpu_info[gpu_index][2]
    free_mib = total_mib - used_mib
    total_gib = total_mib / 1024

+    # Base env: pin subprocess to the selected GPU
+    _gpu_env = {"CUDA_VISIBLE_DEVICES": str(gpu_index)}
+
    model_name = _extract_model_from_markers(pytest_args)

-    print("\n--- FIND MINIMUM VRAM (binary search) ---")
+    if not is_sglang:
+        kv_bytes_mode = True
+
+    if kv_bytes_mode:
+        mode_label = "KV CACHE BYTES (vLLM, deterministic)"
+    else:
+        mode_label = "KV TOKENS (SGLang)"
+    print(f"\n--- FIND MINIMUM {mode_label} (binary search) ---")
    print(f"  GPU total : {total_gib:.1f} GiB")
    print(
        f"  GPU free  : {free_mib / 1024:.1f} GiB  "
@@ -708,7 +809,6 @@ def _find_min_vram(
    if model_name:
        print(f"  Model     : {model_name}")

-    # Warn if something is already consuming significant GPU memory
    hogged_pct = used_mib / total_mib * 100
    if hogged_pct > 10:
        print(f"\n  {'!' * 72}")
@@ -716,91 +816,169 @@ def _find_min_vram(
            f"  WARNING: {used_mib / 1024:.1f} GiB ({hogged_pct:.0f}%) of GPU memory "
            f"is already in use!"
        )
-        print("  Another process is hogging the GPU. Results will be inaccurate")
-        print(
-            "  because _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE is a fraction of TOTAL memory,"
-        )
-        print("  not FREE memory. Kill other GPU processes first.")
+        print("  Another process is hogging the GPU. Free memory is reduced,")
+        print("  which limits KV cache headroom. Kill other GPU processes first.")
        print(f"  {'!' * 72}")
    print()

-    lo = 0.05
-    hi = 0.95
-    tolerance = 0.05
-    max_iterations = math.ceil(math.log2((hi - lo) / tolerance))
-    last_pass_util: float | None = None
-    last_pass_peak_mib: int = 0
-    elapsed_times: list[float] = []
-    all_peak_mibs: list[int] = []
-    pass_wall_times: list[float] = []
-
-    print(f"  Range   : {lo:.0%} - {hi:.0%}  (tolerance {tolerance:.0%})")
-    print(
-        f"  Max iter: {max_iterations + 1} (1 validation + {max_iterations} bisections)"
-    )
-    print()
+    # -- Validation run --
+    validation_env: dict[str, str] = dict(_gpu_env)
+    if kv_bytes_mode:
+        # Start at 50% of free GPU memory (floor: 1 GiB). If it passes, that is
+        # the upper bound and we search downward. If it OOMs, halve (up to 4
+        # retries) until a probe passes, then search downward from there.
+        max_kv_bytes = int(max(free_mib // 2, 1024) * 1024 * 1024)
+        validation_env["_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES"] = str(max_kv_bytes)
+        validation_desc = f"kv_cache={max_kv_bytes // (1024**2)} MiB (50% of free)"
+    else:
+        validation_desc = "no token cap, default fraction"

-    # First, verify the test passes at hi (0.95)
-    print(
-        f"  [profile 1/{max_iterations + 1}] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE={hi:.2f} "
-        f"(allowed max GPU {hi * total_gib:.1f} GiB)  [validation run]"
-    )
+    print(f"  [probe 1] Validation run ({validation_desc})")
    sys.stdout.flush()
    t_iter_start = time.monotonic()
-    label = f"profile 1/{max_iterations + 1}"
-    rc, wall, reports, raw_samples = _run_once(
+    rc, wall, reports, raw_samples, stdout = _run_once(
        pytest_args,
        interval=interval,
        baseline_seconds=baseline_seconds,
        teardown_seconds=teardown_seconds,
-        extra_env={"_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE": f"{hi:.2f}"},
+        extra_env=validation_env or None,
        quiet=True,
-        run_label=label,
+        run_label="probe 1",
    )
    iter_elapsed = time.monotonic() - t_iter_start
-    elapsed_times.append(iter_elapsed)
+
+    # kv-bytes mode: if validation fails, check whether it's OOM (over-allocated)
+    # or a genuine test failure (unrelated to KV cache). Only retry with less KV
+    # if the output looks like OOM; otherwise the test is broken and retrying won't help.
+    if rc != 0 and kv_bytes_mode:
+        if _looks_like_oom(stdout):
+            for attempt in range(4):
+                max_kv_bytes //= 2
+                if max_kv_bytes < 64 * 1024 * 1024:
+                    break
+                validation_env["_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES"] = str(
+                    max_kv_bytes
+                )
+                print(
+                    f"  [OOM] Reducing KV cache to {max_kv_bytes // (1024**2)} MiB "
+                    f"(retry {attempt + 1}/4)"
+                )
+                sys.stdout.flush()
+                t_iter_start = time.monotonic()
+                rc, wall, reports, raw_samples, stdout = _run_once(
+                    pytest_args,
+                    interval=interval,
+                    baseline_seconds=baseline_seconds,
+                    teardown_seconds=teardown_seconds,
+                    extra_env=validation_env,
+                    quiet=True,
+                    run_label=f"probe 1 (retry {attempt + 1})",
+                )
+                iter_elapsed = time.monotonic() - t_iter_start
+                if rc == 0:
+                    break
+        else:
+            print(
+                "  [FAIL] Test failed but NOT from OOM — the test appears genuinely broken."
+            )
+            print(
+                "  Hint: check the test output above for the root cause "
+                "(EngineDeadError, timeout, assertion, etc.)."
+            )
+
    if rc != 0:
-        print(
-            f"  [FAIL] allowed GPU = {hi * total_gib:.1f} GiB ({hi:.0%}), "
-            f"test fails even at max utilization. Cannot determine minimum."
+        reason = (
+            "OOM at all KV sizes"
+            if _looks_like_oom(stdout)
+            else "test broken (not OOM)"
        )
+        print(f"  [FAIL] Cannot determine minimum KV cache: {reason}.")
        return rc

    peak_mib = max((r.peak_mib for r in reports), default=0)
-    all_peak_mibs.append(peak_mib)
-    last_pass_util = hi
-    last_pass_peak_mib = peak_mib
+
+    if kv_bytes_mode:
+        # Search range: 64 MiB up to the validated max_kv_bytes, in bytes.
+        # Lower bound at 64 MiB to skip probes that always fail (no model
+        # can serve even 1 request with < 64 MiB KV cache).
+        lo: float | int = 64 * 1024 * 1024  # 64 MiB minimum
+        hi: float | int = max_kv_bytes
+        tolerance: float | int = 16 * 1024 * 1024  # 16 MiB tolerance
+        print(
+            f"  [PASS] peak {_format_mib(peak_mib)}, wall {wall:.0f}s, "
+            f"iter took {iter_elapsed:.0f}s"
+        )
+    else:
+        max_tokens = _extract_requested_sglang_kv_tokens(stdout)
+        if max_tokens is None:
+            print(
+                "  [ERROR] Could not extract max_total_tokens from SGLang output.\n"
+                "  The launch script must log 'max_total_tokens=N' (SGLang does this by default)."
+            )
+            return 4
+        page_size = 16
+        lo = page_size
+        hi = max_tokens
+        tolerance = page_size * 2
+        print(
+            f"  [PASS] peak {_format_mib(peak_mib)}, wall {wall:.0f}s, "
+            f"max_total_tokens={max_tokens}, iter took {iter_elapsed:.0f}s"
+        )
+
+    baseline_time = iter_elapsed
+    probe_timeout = max(baseline_time * 2, 60)
+    print(f"  Profile timeout: {probe_timeout:.0f}s (2x first probe)")
+
+    max_iterations = (
+        max(1, math.ceil(math.log2((hi - lo) / tolerance))) if hi > lo else 0
+    )
+    last_pass_value: float | int = hi
+    last_pass_peak_mib: int = peak_mib
    last_pass_reports = reports
    last_pass_samples = raw_samples
-    pass_wall_times.append(wall)
+    elapsed_times: list[float] = [iter_elapsed]
+    pass_wall_times: list[float] = [wall]
+    all_peak_mibs: list[int] = [peak_mib]
+
+    if kv_bytes_mode:
+        print(
+            f"\n  Range   : {int(lo) // (1024**2)} - {int(hi) // (1024**2)} MiB  (tolerance {int(tolerance) // (1024**2)} MiB)"
+        )
+    else:
+        print(f"\n  Range   : {lo} - {hi} tokens  (tolerance {tolerance} tokens)")
    print(
-        f"  [PASS] allowed GPU = {hi * total_gib:.1f} GiB ({hi:.0%}), "
-        f"peak GPU used = {_format_mib(peak_mib)}, wall {wall:.0f}s, "
-        f"iter took {iter_elapsed:.0f}s"
+        f"  Max iter: {max_iterations + 1} (1 validation + {max_iterations} bisections)"
    )
+    print()

-    # Use 2x the first profile's time as the timeout for subsequent profiles.
-    # If a profile takes longer than this, it's likely stuck in teardown.
-    baseline_time = iter_elapsed
-    probe_timeout = max(baseline_time * 2, 60)
-    print(f"  Profile timeout: {probe_timeout:.0f}s (2x first profile)")
-
+    # -- Binary search loop --
    iteration = 0
    while (hi - lo) > tolerance:
        iteration += 1
        probe_num = iteration + 1
-        mid = (lo + hi) / 2
        remaining = max_iterations + 1 - probe_num
        avg_iter = sum(elapsed_times) / len(elapsed_times)
        eta_s = remaining * avg_iter

-        label = f"profile {probe_num}/{max_iterations + 1}"
-        print(
-            f"\n  [{label}] "
-            f"_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE={mid:.2f} "
-            f"(allowed max GPU {mid * total_gib:.1f} GiB)  "
-            f"[~{remaining} iters left, profiling ETA ~{eta_s:.0f}s]"
-        )
+        if kv_bytes_mode:
+            mid_int = (int(lo) + int(hi)) // 2
+            mid_int = max(mid_int, 1024 * 1024)  # minimum 1 MiB
+            probe_env = {
+                **_gpu_env,
+                "_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES": str(mid_int),
+            }
+            probe_desc = f"kv_cache={mid_int // (1024**2)} MiB ({mid_int:,} bytes)"
+        else:
+            mid_int = ((int(lo) + int(hi)) // 2 // page_size) * page_size
+            mid_int = max(mid_int, page_size)
+            probe_env = {
+                **_gpu_env,
+                "_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS": str(mid_int),
+            }
+            probe_desc = f"tokens={mid_int}"
+
+        label = f"probe {probe_num}/{max_iterations + 1}"
+        print(f"  [{label}] {probe_desc}  [~{remaining} left, ETA ~{eta_s:.0f}s]")
        sys.stdout.flush()

        stop_progress = threading.Event()
@@ -829,12 +1007,12 @@ def _find_min_vram(
        )
        progress_thread.start()

-        rc, wall, reports, raw_samples = _run_once(
+        rc, wall, reports, raw_samples, stdout = _run_once(
            pytest_args,
            interval=interval,
            baseline_seconds=baseline_seconds,
            teardown_seconds=teardown_seconds,
-            extra_env={"_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE": f"{mid:.2f}"},
+            extra_env=probe_env,
            quiet=True,
            run_label=label,
            timeout=probe_timeout,
@@ -853,77 +1031,173 @@ def _find_min_vram(
        peak_mib = max((r.peak_mib for r in reports), default=0)
        all_peak_mibs.append(peak_mib)

+        mid_value = mid_int
        if rc == 0:
-            last_pass_util = mid
+            last_pass_value = mid_value
            last_pass_peak_mib = peak_mib
            last_pass_reports = reports
            last_pass_samples = raw_samples
            pass_wall_times.append(wall)
-            hi = mid
+            hi = mid_value
            print(
-                f"  [PASS] allowed GPU = {mid * total_gib:.1f} GiB ({mid:.0%}), "
-                f"peak GPU used = {_format_mib(peak_mib)}, wall {wall:.0f}s, "
-                f"iter took {iter_elapsed:.0f}s"
+                f"  [PASS] {probe_desc}, peak {_format_mib(peak_mib)}, "
+                f"wall {wall:.0f}s, iter took {iter_elapsed:.0f}s"
            )
        else:
-            lo = mid
+            lo = mid_value
+            print(f"  [FAIL] {probe_desc}, iter took {iter_elapsed:.0f}s")
+
+        # Early termination: if last 3 probes have peak VRAM within
+        # _EARLY_STOP_RANGE_MIB, further bisection is in the noise floor.
+        if len(all_peak_mibs) >= 4:
+            recent = all_peak_mibs[-3:]
+            peak_range = max(recent) - min(recent)
+            if peak_range < _EARLY_STOP_RANGE_MIB:
+                print(
+                    f"  [EARLY STOP] Peak VRAM stable at ~{_format_mib(recent[-1])} "
+                    f"for last 3 probes (range {peak_range} MiB < "
+                    f"{_EARLY_STOP_RANGE_MIB} MiB threshold) "
+                    f"-- stopping bisection early"
+                )
+                break
+
+    # -- Results --
+    test_name = next(
+        (a for a in pytest_args if "::" in a or a.endswith(".py")),
+        " ".join(pytest_args),
+    )
+    test_short = test_name.rsplit("::", 1)[-1] if "::" in test_name else test_name
+    peak_gib = round(last_pass_peak_mib / 1024, 1)
+
+    print(f"\n{'=' * 72}")
+    if kv_bytes_mode:
+        min_kv_bytes = int(last_pass_value)
+        safe_kv_bytes = int(min_kv_bytes * _KV_SAFETY_FACTOR)
+        # Round up to the next multiple of 1000 bytes for clean marker values
+        safe_kv_bytes = ((safe_kv_bytes + 999) // 1000) * 1000
+        safe_kv_mib = safe_kv_bytes // (1024 * 1024)
+        min_kv_mib = min_kv_bytes // (1024 * 1024)
+
+        # Final validation probe at safe_kv_bytes to get accurate profiled_vram_gib.
+        # The bisection's last pass was at min_kv_bytes; the recommended marker uses
+        # safe_kv_bytes which allocates more KV cache and thus more VRAM.
+        print(f"  [final probe] Measuring VRAM at safe_kv_bytes={safe_kv_mib} MiB")
+        sys.stdout.flush()
+        rc_final, wall_final, reports_final, samples_final, stdout_final = _run_once(
+            pytest_args,
+            interval=interval,
+            baseline_seconds=baseline_seconds,
+            teardown_seconds=teardown_seconds,
+            extra_env={
+                **_gpu_env,
+                "_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES": str(safe_kv_bytes),
+            },
+            quiet=True,
+            run_label="final",
+            timeout=probe_timeout,
+        )
+        if rc_final == 0:
+            last_pass_peak_mib = max((r.peak_mib for r in reports_final), default=0)
+            last_pass_reports = reports_final
+            last_pass_samples = samples_final
+            pass_wall_times.append(wall_final)
+            peak_gib = round(last_pass_peak_mib / 1024, 1)
            print(
-                f"  [FAIL] allowed GPU = {mid * total_gib:.1f} GiB ({mid:.0%}), "
-                f"OOM or error, iter took {iter_elapsed:.0f}s"
+                f"  [PASS] kv_cache={safe_kv_mib} MiB, "
+                f"peak {_format_mib(last_pass_peak_mib)}, wall {wall_final:.0f}s"
            )
-
-    # Detect if _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE is being ignored: all peaks are nearly
-    # identical despite wildly different utilization caps.
-    if len(all_peak_mibs) >= 3:
-        peak_range = max(all_peak_mibs) - min(all_peak_mibs)
-        if peak_range < _PLATEAU_TOLERANCE_MIB:
-            print(f"\n  {'!' * 72}")
+        else:
            print(
-                f"  WARNING: Peak VRAM was ~{_format_mib(all_peak_mibs[0])} across ALL "
-                f"{len(all_peak_mibs)} probes (range: {peak_range} MiB)."
+                f"  [FAIL] kv_cache={safe_kv_mib} MiB failed unexpectedly, "
+                f"using VRAM from min_kv_bytes={min_kv_mib} MiB instead"
            )
+
+        print(f"\n{'=' * 72}")
+        print("MINIMUM KV CACHE RESULT")
+        print(f"{'=' * 72}")
+        print(f"  Minimum KV cache : {min_kv_mib} MiB ({min_kv_bytes:,} bytes)")
+        print(
+            f"  Safe KV cache    : {safe_kv_mib} MiB ({safe_kv_bytes:,} bytes) ({_KV_SAFETY_FACTOR:.0f}x safety)"
+        )
+        print(
+            f"  Peak VRAM        : {_format_mib(last_pass_peak_mib)} (at {safe_kv_mib} MiB)"
+        )
+        print()
+        print("  Recommended markers:")
+        print(f"    @pytest.mark.profiled_vram_gib({peak_gib})")
+        print(
+            f"    @pytest.mark.requested_vllm_kv_cache_bytes({safe_kv_bytes:_}),  # KV cache cap ({_KV_SAFETY_FACTOR:.0f}x safety over min={min_kv_bytes:_})"
+        )
+        print(f"{'=' * 72}")
+
+    else:
+        min_tokens = int(last_pass_value)
+        safe_tokens = int(min_tokens * _KV_SAFETY_FACTOR)
+        page_size = 16
+        safe_tokens = ((safe_tokens + page_size - 1) // page_size) * page_size
+
+        # Final validation probe at safe_tokens to get accurate profiled_vram_gib.
+        # The bisection's last pass was at min_tokens; the recommended marker uses
+        # safe_tokens which allocates more KV cache and thus more VRAM.
+        print(f"  [final probe] Measuring VRAM at safe_tokens={safe_tokens}")
+        sys.stdout.flush()
+        rc_final, wall_final, reports_final, samples_final, stdout_final = _run_once(
+            pytest_args,
+            interval=interval,
+            baseline_seconds=baseline_seconds,
+            teardown_seconds=teardown_seconds,
+            extra_env={
+                **_gpu_env,
+                "_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS": str(safe_tokens),
+            },
+            quiet=True,
+            run_label="final",
+            timeout=probe_timeout,
+        )
+        if rc_final == 0:
+            last_pass_peak_mib = max((r.peak_mib for r in reports_final), default=0)
+            last_pass_reports = reports_final
+            last_pass_samples = samples_final
+            pass_wall_times.append(wall_final)
+            peak_gib = round(last_pass_peak_mib / 1024, 1)
            print(
-                "  This strongly suggests the test IGNORES the _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"
+                f"  [PASS] tokens={safe_tokens}, peak {_format_mib(last_pass_peak_mib)}, "
+                f"wall {wall_final:.0f}s"
            )
-            print("  env var.  Binary search results are UNRELIABLE — no marker")
-            print("  recommendation will be provided.")
-            print("  ")
+        else:
            print(
-                "  FIX: The test (or its launch script) must read _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE"
+                f"  [FAIL] tokens={safe_tokens} failed unexpectedly, "
+                f"using VRAM from min_tokens={min_tokens} instead"
            )
-            print("  and pass --gpu-memory-utilization to vLLM / the engine.")
-            print("  See tests/README.md 'GPU VRAM Profiler' for details.")
-            print(f"  {'!' * 72}")
-            return 4
-
-    # Results
-    assert last_pass_util is not None
-    min_vram_gib = last_pass_util * total_gib

-    padded_peak_mib = int(last_pass_peak_mib * _VRAM_SAFETY_FACTOR)
-    padded_peak_gib = round(padded_peak_mib / 1024, 1)
-
-    # Extract a short test name from pytest args for the summary
-    test_name = next(
-        (a for a in pytest_args if "::" in a or a.endswith(".py")),
-        " ".join(pytest_args),
-    )
-    test_short = test_name.rsplit("::", 1)[-1] if "::" in test_name else test_name
-
-    print("\n--- RESULT ---")
-    print(f"  Lowest passing utilization : {last_pass_util:.0%}")
-    print(
-        f"  Minimum VRAM needed        : ~{min_vram_gib:.1f} GiB "
-        f"(peak observed: {_format_mib(last_pass_peak_mib)}, "
-        f"+10% safety: {_format_mib(padded_peak_mib)})"
-    )
-    print(f"  {test_short}: @pytest.mark.max_vram_gib({padded_peak_gib})")
+        print(f"\n{'=' * 72}")
+        print("MINIMUM KV TOKENS RESULT")
+        print(f"{'=' * 72}")
+        print(f"  Minimum tokens  : {min_tokens} (raw bisection result)")
+        print(f"  Recommended     : {safe_tokens} ({_KV_SAFETY_FACTOR:.0f}x safety)")
+        print(
+            f"  Peak VRAM       : {_format_mib(last_pass_peak_mib)} (at {safe_tokens} tokens)"
+        )
+        print(f"  {test_short}: @pytest.mark.profiled_vram_gib({peak_gib})")
+        print(
+            f"  {test_short}: @pytest.mark.requested_sglang_kv_tokens({safe_tokens}),  # KV cache cap ({_KV_SAFETY_FACTOR:.0f}x safety over min={min_tokens})"
+        )
+    print(f"{'=' * 72}")

-    # Full marker recommendations using average wall time across all passing runs
+    # Marker recommendations
+    requested_sglang_kv_tokens = safe_tokens if not kv_bytes_mode else None
+    requested_vllm_kv_cache_bytes = safe_kv_bytes if kv_bytes_mode else None
+    min_kv_value = int(last_pass_value)
    if recommend:
        avg_pass_wall = sum(pass_wall_times) / len(pass_wall_times)
        recs, warnings = _recommend_markers(
-            last_pass_reports, avg_pass_wall, model_name, num_runs=len(pass_wall_times)
+            last_pass_reports,
+            avg_pass_wall,
+            model_name,
+            num_runs=len(pass_wall_times),
+            requested_sglang_kv_tokens=requested_sglang_kv_tokens,
+            requested_vllm_kv_cache_bytes=requested_vllm_kv_cache_bytes,
+            min_kv_value=min_kv_value,
        )
        _print_recommendations(recs, warnings, pytest_args=pytest_args)

@@ -980,6 +1254,22 @@ def main(argv: list[str] | None = None) -> int:
        help="Disable the default binary-search mode that finds minimum VRAM. "
        "When set, runs a single profiling pass instead.",
    )
+    parser.add_argument(
+        "--kv-bytes",
+        action="store_true",
+        default=False,
+        help="(No-op, kept for backward compat.) vLLM always uses KV byte "
+        "bisection via _PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES. "
+        "Outputs @pytest.mark.requested_vllm_kv_cache_bytes(N).",
+    )
+    parser.add_argument(
+        "--gpu",
+        "--gpus",
+        type=int,
+        default=0,
+        help="GPU index to profile on (default: 0). "
+        "Sets CUDA_VISIBLE_DEVICES for the subprocess.",
+    )

    raw = argv if argv is not None else sys.argv[1:]

@@ -1002,19 +1292,26 @@ def main(argv: list[str] | None = None) -> int:
        if looks_like_test_path and not os.path.exists(test_path):
            parser.error(f"Test path does not exist: {test_path}")

+    gpu_idx = args.gpu
    gpu_info = _query_gpu_stats()
    if not gpu_info:
        raise RuntimeError("NVML returned no GPU data")
+    if gpu_idx >= len(gpu_info):
+        raise RuntimeError(
+            f"GPU {gpu_idx} not found (available: 0..{len(gpu_info) - 1})"
+        )

-    used_mib = gpu_info[0][1]
-    total_mib = gpu_info[0][2]
+    used_mib = gpu_info[gpu_idx][1]
+    total_mib = gpu_info[gpu_idx][2]
    hogged_pct = used_mib / total_mib * 100
    if hogged_pct > 10:
        print(
-            f"\nWARNING: {used_mib / 1024:.1f} GiB ({hogged_pct:.0f}%) of GPU memory "
-            f"is already in use! Results may be inaccurate.\n"
+            f"\nWARNING: GPU {gpu_idx}: {used_mib / 1024:.1f} GiB ({hogged_pct:.0f}%) "
+            f"of GPU memory is already in use! Results may be inaccurate.\n"
        )

+    gpu_env = {"CUDA_VISIBLE_DEVICES": str(gpu_idx)}
+
    if not args.no_find_min_vram:
        return _find_min_vram(
            pytest_args,
@@ -1023,21 +1320,34 @@ def main(argv: list[str] | None = None) -> int:
            teardown_seconds=args.teardown_seconds,
            recommend=not args.no_recommend,
            csv_path=args.csv,
+            kv_bytes_mode=args.kv_bytes,
+            gpu_index=gpu_idx,
        )

    model_name = _extract_model_from_markers(pytest_args)
+    is_sglang = _is_sglang_test(pytest_args)

-    rc, wall_secs, reports, samples = _run_once(
+    rc, wall_secs, reports, samples, stdout = _run_once(
        pytest_args,
        interval=args.interval,
        baseline_seconds=args.baseline_seconds,
        teardown_seconds=args.teardown_seconds,
+        extra_env=gpu_env,
+        run_label="profile" if is_sglang else None,
    )

    _print_report(reports, rc, wall_secs, model_name=model_name)

    if not args.no_recommend and reports:
-        recs, warnings = _recommend_markers(reports, wall_secs, model_name=model_name)
+        requested_sglang_kv_tokens = None
+        if is_sglang:
+            requested_sglang_kv_tokens = _extract_requested_sglang_kv_tokens(stdout)
+        recs, warnings = _recommend_markers(
+            reports,
+            wall_secs,
+            model_name=model_name,
+            requested_sglang_kv_tokens=requested_sglang_kv_tokens,
+        )
        _print_recommendations(recs, warnings, pytest_args=pytest_args)

    if args.csv:

--- a/tests/utils/pytest_parallel_gpu.py
+++ b/tests/utils/pytest_parallel_gpu.py
+#!/usr/bin/env python3
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""GPU-parallel test runner (used by conftest.py, not invoked directly).
+
+Runs pytest tests as independent subprocesses with VRAM-aware scheduling.
+Each test gets CUDA_VISIBLE_DEVICES and KV cache overrides
+(_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES / _PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS)
+so the engine allocates only its declared VRAM budget.
+
+Usage (always via pytest):
+    pytest --max-vram-gib=6 -n auto -m "gpu_1 and vllm" tests/serve/
+    pytest --max-vram-gib=6 -n 4 -sv -m "gpu_1 and vllm" tests/serve/
+
+Flags:
+    --max-vram-gib=N   Only run tests with profiled_vram_gib <= N
+    -n N / -n auto     Run N tests concurrently (auto = GPU budget / smallest test)
+    -s                 Stream subprocess output live with [wN] prefixes
+    -v / -vv           Passed through to subprocesses for verbose test names
+
+A 10-second cooldown between launches avoids the vLLM profiling race
+(bug #10643). Tests that fail due to the profiling race are retried up to 3 times.
+"""
+
+from __future__ import annotations
+
+import argparse
+import os
+import subprocess
+import sys
+import tempfile
+import threading
+import time
+from dataclasses import dataclass, field
+from pathlib import Path
+
+import pynvml
+
+_repo_root = str(Path(__file__).resolve().parents[2])
+if _repo_root not in sys.path:
+    sys.path.insert(0, _repo_root)
+
+from tests.utils.vram_utils import (  # noqa: E402
+    VRAM_MULTI_PROC_MARGIN,
+    auto_worker_count,
+    detect_gpus,
+    load_test_meta,
+)
+
+
+@dataclass
+class _TestEntry:
+    """A test scheduled for GPU-parallel execution."""
+
+    id: str
+    name: str
+    profiled_gib: float
+    timeout: float
+    requested_vllm_kv_cache_bytes: int | None = None
+    requested_sglang_kv_tokens: int | None = None
+    skip_reason: str | None = None
+    w_id: int = 0
+    assigned_gpu: int | None = None
+    retries: int = 0
+
+
+@dataclass
+class _CompletedTest:
+    """Result record for a finished test subprocess."""
+
+    test: _TestEntry
+    duration: float
+    passed: bool
+    skipped: bool = False
+    skip_reason: str | None = None
+    fail_reason: str | None = None
+
+
+@dataclass
+class _TentativeGpu:
+    """Scratch copy of GPU budget/free state used during scheduling."""
+
+    budget: float
+    free: float
+    count: int
+
+
+@dataclass
+class _GpuState:
+    """Per-GPU bookkeeping for VRAM budget tracking."""
+
+    index: int
+    total_gib: float
+    budget_multi: float
+    budget_used: float = 0.0
+    running_count: int = 0
+
+
+@dataclass
+class _RunningTest:
+    """State for a test subprocess currently executing on a GPU."""
+
+    proc: subprocess.Popen[str]
+    test: _TestEntry
+    start_time: float
+    captured: list[str] = field(default_factory=list)
+    reader_thread: threading.Thread | None = None
+
+
+def _print(msg: str = "") -> None:
+    """Print to stderr so pytest doesn't capture it."""
+    print(msg, file=sys.stderr, flush=True)
+
+
+def _fmt_req(test: _TestEntry) -> str:
+    """Format the resource request value for display."""
+    if test.requested_sglang_kv_tokens is not None:
+        return f"req_kv_tokens={int(test.requested_sglang_kv_tokens)}"
+    if test.requested_vllm_kv_cache_bytes is not None:
+        gib = int(test.requested_vllm_kv_cache_bytes) / (1024**3)
+        return f"req_kv={gib:.2f} GiB"
+    return "req_kv=None"
+
+
+_JUNIT_DIR = os.path.join(tempfile.gettempdir(), "gpu_parallel_junit")
+_JUNIT_COMBINED = os.path.join(_JUNIT_DIR, "combined.xml")
+
+
+def _parse_junit_skipped(junit_path: str) -> str | None:
+    """Check JUnit XML for a skipped test. Returns skip reason or None."""
+    import xml.etree.ElementTree as ET
+
+    try:
+        tree = ET.parse(junit_path)
+    except (ET.ParseError, FileNotFoundError):
+        return None
+    root = tree.getroot()
+    suite = root if root.tag == "testsuite" else root.find("testsuite")
+    if suite is None:
+        return None
+    for tc in suite.findall("testcase"):
+        skip_el = tc.find("skipped")
+        if skip_el is not None:
+            return skip_el.get("message", "skipped")
+    return None
+
+
+def _aggregate_junit_xml(junit_dir: str) -> str | None:
+    """Merge per-test JUnit XML files into one combined testsuite."""
+    import xml.etree.ElementTree as ET
+
+    xmls = sorted(Path(junit_dir).glob("*.xml"))
+    xmls = [x for x in xmls if x.name != "combined.xml"]
+    if not xmls:
+        return None
+
+    total_tests = total_errors = total_failures = total_skipped = 0
+    total_time = 0.0
+    testcases = []
+
+    for xml_path in xmls:
+        try:
+            tree = ET.parse(xml_path)
+        except ET.ParseError:
+            continue
+        root = tree.getroot()
+        suite = root if root.tag == "testsuite" else root.find("testsuite")
+        if suite is None:
+            continue
+        total_tests += int(suite.get("tests", 0))
+        total_errors += int(suite.get("errors", 0))
+        total_failures += int(suite.get("failures", 0))
+        # Keep the skipped total so the merged suite attributes match its testcases.
+        total_skipped += int(suite.get("skipped", 0))
+        total_time += float(suite.get("time", 0))
+        testcases.extend(suite.findall("testcase"))
+
+    combined = ET.Element(
+        "testsuite",
+        {
+            "name": "gpu-parallel",
+            "tests": str(total_tests),
+            "errors": str(total_errors),
+            "failures": str(total_failures),
+            "skipped": str(total_skipped),
+            "time": f"{total_time:.3f}",
+        },
+    )
+    for tc in testcases:
+        combined.append(tc)
+
+    out = _JUNIT_COMBINED
+    ET.ElementTree(combined).write(out, encoding="unicode", xml_declaration=True)
+    return out
+
+
+def _collect_tests(pytest_args: list[str], max_vram_gib: float) -> list[str]:
+    """Run pytest --collect-only to get test IDs, filtered by --max-vram-gib."""
+    _strip_flags = {"-v", "-vv", "-vvv", "--verbose", "-s", "--capture=no"}
+    collect_args = [a for a in pytest_args if a not in _strip_flags]
+    cmd = [
+        sys.executable,
+        "-m",
+        "pytest",
+        f"--max-vram-gib={max_vram_gib}",
+        "--collect-only",
+        "-q",
+        *collect_args,
+    ]
+    result = subprocess.run(cmd, capture_output=True, text=True)
+    test_ids = []
+    for line in result.stdout.splitlines():
+        # Node IDs start at column 0; test before stripping so indented
+        # lines that merely mention "::" are not collected.
+        if "::" in line and not line.startswith(" "):
+            test_ids.append(line.strip())
+    return test_ids
+
+
+def _get_gpu_used_gib(gpu_index: int = 0) -> float:
+    """Query actual GPU memory used via pynvml."""
+    try:
+        pynvml.nvmlInit()
+        try:
+            handle = pynvml.nvmlDeviceGetHandleByIndex(gpu_index)
+            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
+        finally:
+            # Balance nvmlInit() even if a query raises.
+            pynvml.nvmlShutdown()
+        return mem.used / (1024**3)
+    except pynvml.NVMLError:
+        return 0.0
+
+
+_RETRYABLE_INIT_MARKERS = [
+    "Error in memory profiling",  # vLLM profiling race assertion
+    "Free memory on device",  # not enough free VRAM at startup
+    "Engine core initialization failed",  # engine init crash
+    "exited with code 0 while waiting for health check",  # engine started but died during init
+    "exited with code -15 while waiting for health check",  # SIGTERM during init
+    "exited with code -9 while waiting for health check",  # SIGKILL (OOM killer) during init
+]
+_MAX_RETRIES = 3
+
+
+def _capture_output(pipe, captured: list[str], prefix: str | None = None) -> None:
+    """Read all lines from a pipe into `captured`. Runs in a thread.
+
+    If prefix is set, also prints each line live (-s mode).
+    """
+    for line in iter(pipe.readline, ""):
+        line = line.rstrip("\n")
+        if line:
+            captured.append(line)
+            if prefix is not None:
+                _print(f"{prefix} {line}")
+    pipe.close()
+
+
+def _parse_gpu_indices(raw: str, available: list[dict]) -> list[int]:
+    """Parse --gpus value into a list of GPU indices.
+
+    Accepts 'all' or comma-separated indices (e.g. '0,1').
+    """
+    avail_indices = [g["index"] for g in available]
+    if raw.strip().lower() == "all":
+        return avail_indices
+    indices = []
+    for part in raw.split(","):
+        part = part.strip()
+        if not part:
+            continue
+        idx = int(part)
+        if idx not in avail_indices:
+            raise ValueError(f"GPU {idx} not found (available: {avail_indices})")
+        indices.append(idx)
+    return indices or avail_indices
+
+
+def run_parallel(
+    test_ids: list[str],
+    meta: dict[str, dict],
+    max_vram_gib: float,
+    num_slots: int,
+    gpu_indices: list[int] | None = None,
+    extra_pytest_args: list[str] | None = None,
+    stream: bool = False,
+) -> int:
+    """Run tests in parallel with VRAM-aware scheduling across multiple GPUs.
+
+    Flags (mimic pytest semantics):
+      -s       Stream subprocess output live with [wN] prefixes.
+      -v/-vv   Passed through to subprocesses for verbose test names / diffs.
+               No effect on the orchestrator's output.
+
+    Without -s, output is buffered and printed after each test completes.
+    Returns exit code: 0 if all pass, 1 if any fail.
+    """
+    gpus = detect_gpus()
+    if not gpus:
+        _print("ERROR: No GPUs detected")
+        return 1
+
+    if gpu_indices is None:
+        gpu_indices = [g["index"] for g in gpus]
+
+    gpu_by_idx = {g["index"]: g for g in gpus}
+    gpu_states: dict[int, _GpuState] = {}
+    for gi in gpu_indices:
+        if gi not in gpu_by_idx:
+            _print(
+                f"ERROR: GPU{gi} not found "
+                f"(available: {[g['index'] for g in gpus]})"
+            )
+            return 1
+        total = gpu_by_idx[gi]["total_mib"] / 1024.0
+        gpu_states[gi] = _GpuState(
+            index=gi,
+            total_gib=total,
+            budget_multi=total * (1.0 - VRAM_MULTI_PROC_MARGIN),
+        )
+
+    tests: list[_TestEntry] = []
+    for tid in test_ids:
+        m = meta.get(tid, {})
+        tests.append(
+            _TestEntry(
+                id=tid,
+                name=tid,
+                profiled_gib=m.get("profiled_vram_gib", max_vram_gib),
+                requested_vllm_kv_cache_bytes=m.get("requested_vllm_kv_cache_bytes"),
+                timeout=m.get("timeout", 600),
+                requested_sglang_kv_tokens=m.get("requested_sglang_kv_tokens"),
+                skip_reason=m.get("skip_reason"),
+            )
+        )
+
+    # Separate skip-marked tests — they won't actually run, so don't
+    # validate KV markers or consume GPU budget.
+    skipped_tests = [t for t in tests if t.skip_reason is not None]
+    tests = [t for t in tests if t.skip_reason is None]
+
+    # Sort by timeout descending (longest first to minimize tail latency)
+    tests.sort(key=lambda t: t.timeout, reverse=True)
+
+    # Reject tests without a KV marker — without explicit memory control
+    # they'd each grab the engine's default (e.g. vLLM 90%) and OOM when
+    # run concurrently. Tests with profiled_gib=0 are exempt (mock/CPU-only).
+    no_kv = [
+        t
+        for t in tests
+        if t.requested_vllm_kv_cache_bytes is None
+        and t.requested_sglang_kv_tokens is None
+        and t.profiled_gib > 0
+    ]
+    if no_kv:
+        _print(
+            f"\nERROR: {len(no_kv)} test(s) lack a requested_vllm_kv_cache_bytes "
+            f"or requested_sglang_kv_tokens marker and cannot run in parallel:"
+        )
+        for t in no_kv:
+            _print(f"  {t.name}")
+        _print(
+            "\nAdd the appropriate marker via profile_pytest.py --kv-bytes, "
+            "then rerun."
+        )
+        return 1
+
+    # Identify tests in metadata that exceed the VRAM budget
+    test_id_set = set(test_ids)
+    over_budget = []
+    for nodeid, m in meta.items():
+        if nodeid not in test_id_set:
+            profiled = m.get("profiled_vram_gib")
+            if profiled is not None and profiled > max_vram_gib:
+                over_budget.append((nodeid, profiled))
+
+    # Assign permanent worker IDs (w0, w1, ...) to all tests including skipped
+    all_tests = tests + skipped_tests
+    for idx, test in enumerate(all_tests):
+        test.w_id = idx
+
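+    os.makedirs(_JUNIT_DIR, exist_ok=True)
+    # Drop reports left by earlier runs so they are not merged into combined.xml.
+    for stale in Path(_JUNIT_DIR).glob("*.xml"):
+        stale.unlink(missing_ok=True)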
+
+    # --- Plan header ---
+    n_run = len(tests)
+    n_skip = len(skipped_tests)
+    count_str = f"{n_run} tests"
+    if n_skip:
+        count_str += f", {n_skip} skipped"
+
+    if len(gpu_states) == 1:
+        gi = next(iter(gpu_states))
+        gs = gpu_states[gi]
+        _print(
+            f"\nGPU parallel: {count_str}, {num_slots} concurrent slots, "
+            f"GPU{gi} ({gs.total_gib:.0f} GiB, "
+            f"{gs.budget_multi:.0f} GiB multi-proc budget)"
+        )
+    else:
+        gpu_list = ",".join(str(gi) for gi in sorted(gpu_states))
+        sizes = {int(gs.total_gib) for gs in gpu_states.values()}
+        budgets = {int(gs.budget_multi) for gs in gpu_states.values()}
+        if len(sizes) == 1 and len(budgets) == 1:
+            size_str = (
+                f"{next(iter(sizes))} GiB each, "
+                f"{next(iter(budgets))} GiB multi-proc budget"
+            )
+        else:
+            size_str = ", ".join(
+                f"GPU{gi}: {gs.total_gib:.0f}/{gs.budget_multi:.0f} GiB"
+                for gi, gs in sorted(gpu_states.items())
+            )
+        _print(
+            f"\nGPU parallel: {count_str}, {num_slots} concurrent slots, "
+            f"GPUs {gpu_list} ({size_str})"
+        )
+
+    _print()
+    for test in tests:
+        _print(
+            f"[w{test.w_id}] {test.name}  "
+            f"profiled={test.profiled_gib:.1f} GiB, "
+            f"{_fmt_req(test)}, "
+            f"timeout={int(test.timeout)}s"
+        )
+    if over_budget:
+        _print()
+        _print(
+            f"Over budget ({len(over_budget)} -- profiled > max_vram_gib {max_vram_gib:.0f} GiB):"
+        )
+        for name, profiled in sorted(over_budget, key=lambda x: x[1], reverse=True):
+            _print(f"  {name}  (profiled={profiled:.1f} GiB)")
+    _print()
+
+    # --- Report skip-marked tests immediately (like xdist SKIPPED) ---
+    completed: list[_CompletedTest] = []
+    for test in skipped_tests:
+        _print(f"[w{test.w_id}] {test.name} SKIPPED - {test.skip_reason}")
+        completed.append(
+            _CompletedTest(
+                test=test,
+                duration=0,
+                passed=False,
+                skipped=True,
+                skip_reason=test.skip_reason,
+            )
+        )
+
+    # --- Scheduling state ---
+    t0 = time.monotonic()
+    pending = list(tests)
+    running: dict[int, _RunningTest] = {}
+    next_status = t0 + 10
+    # vLLM needs a stagger because --gpu-memory-utilization triggers a memory
+    # profiling step that snapshots free memory — concurrent launches corrupt
+    # each other's snapshots (bug #10643). SGLang uses --max-total-tokens
+    # which is deterministic, so no stagger is needed.
+    _VLLM_LAUNCH_STAGGER_S = 5.0
+    last_vllm_launch: dict[int, float] = {}  # gpu_index -> monotonic timestamp
+
+    def _build_status(now: float) -> str:
+        """Build multi-GPU status string for periodic output."""
+        elapsed = int(now - t0)
+        gpu_parts = []
+        for gi in sorted(gpu_states):
+            gs = gpu_states[gi]
+            actual = _get_gpu_used_gib(gi)
+            workers = sorted(
+                w for w, run_info in running.items() if run_info.test.assigned_gpu == gi
+            )
+            wstr = ", ".join(
+                f"w{w}({int(now - running[w].start_time)}s)" for w in workers
+            )
+            part = f"GPU{gi}: {actual:.1f}/{gs.total_gib:.0f} GiB"
+            if wstr:
+                part += f" [{wstr}]"
+            gpu_parts.append(part)
+        return f"[elapsed {elapsed}s] {', '.join(gpu_parts)}"
+
+    def _launch_test(test: _TestEntry, env_base: dict) -> _RunningTest:
+        """Build env, spawn subprocess, start output streamer thread."""
+        env = env_base.copy()
+        env["CUDA_VISIBLE_DEVICES"] = str(test.assigned_gpu)
+        if test.requested_sglang_kv_tokens is not None:
+            env["_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS"] = str(
+                int(test.requested_sglang_kv_tokens)
+            )
+        elif test.requested_vllm_kv_cache_bytes is not None:
+            env["_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES"] = str(
+                int(test.requested_vllm_kv_cache_bytes)
+            )
+
+        safe_name = test.name.replace("/", "_").replace("::", "__")
+        junit_path = os.path.join(_JUNIT_DIR, f"{safe_name}.xml")
+        has_tb = extra_pytest_args and any(
+            a.startswith("--tb") for a in extra_pytest_args
+        )
+        cmd = [
+            sys.executable,
+            "-m",
+            "pytest",
+            test.id,
+            "-x",
+            *([] if has_tb else ["--tb=short"]),
+            f"--timeout={int(test.timeout)}",
+            f"--junitxml={junit_path}",
+        ]
+        if extra_pytest_args:
+            cmd.extend(extra_pytest_args)
+
+        proc = subprocess.Popen(
+            cmd,
+            env=env,
+            stdout=subprocess.PIPE,
+            stderr=subprocess.STDOUT,
+            text=True,
+        )
+        run_info = _RunningTest(proc=proc, test=test, start_time=time.monotonic())
+        w_id = test.w_id
+        stream_prefix = f"[w{w_id}]" if stream else None
+        t = threading.Thread(
+            target=_capture_output,
+            args=(proc.stdout, run_info.captured, stream_prefix),
+            daemon=True,
+        )
+        t.start()
+        run_info.reader_thread = t
+        return run_info
+
+    env_base = os.environ.copy()
+
+    while pending or running:
+        now = time.monotonic()
+
+        # Check for completed subprocesses
+        for w_id in list(running.keys()):
+            run_info = running[w_id]
+            rc = run_info.proc.poll()
+            if rc is not None:
+                if run_info.reader_thread is not None:
+                    run_info.reader_thread.join(timeout=5)
+                duration = now - run_info.start_time
+                passed = rc == 0
+                test = run_info.test
+                gi = test.assigned_gpu
+
+                # Detect retryable init errors (profiling race, OOM at startup)
+                if not passed and test.retries < _MAX_RETRIES:
+                    matched_marker = None
+                    for line in run_info.captured:
+                        for marker in _RETRYABLE_INIT_MARKERS:
+                            if marker in line:
+                                matched_marker = marker
+                                break
+                        if matched_marker:
+                            break
+                    if matched_marker:
+                        test.retries += 1
+                        _print(
+                            f"[w{w_id}] retrying ({test.retries}/{_MAX_RETRIES})"
+                            f" — {matched_marker}"
+                        )
+                        if gi is not None:
+                            gpu_states[gi].budget_used -= test.profiled_gib
+                            gpu_states[gi].running_count -= 1
+                        del running[w_id]
+                        test.assigned_gpu = None
+                        pending.insert(0, test)
+                        continue
+
+                # Detect runtime skips via JUnit XML (subprocess exit 0
+                # covers both "all passed" and "all skipped").
+                skipped = False
+                skip_reason: str | None = None
+                if passed:
+                    safe_name = test.name.replace("/", "_").replace("::", "__")
+                    junit_path = os.path.join(_JUNIT_DIR, f"{safe_name}.xml")
+                    skip_reason = _parse_junit_skipped(junit_path)
+                    if skip_reason is not None:
+                        passed = False
+                        skipped = True
+
+                # Dump buffered output on failure only (matches pytest behavior).
+                # With -s, output was already streamed live.
+                fail_reason = ""
+                if not passed and not skipped:
+                    if not stream:
+                        prefix = f"[w{w_id}]"
+                        for line in run_info.captured:
+                            _print(f"{prefix} {line}")
+                    for line in reversed(run_info.captured):
+                        stripped = line.strip()
+                        if stripped and not stripped.startswith("="):
+                            fail_reason = stripped
+                            break
+
+                if skipped:
+                    _print(f"[w{w_id}] {test.name} SKIPPED - {skip_reason}")
+                else:
+                    status = "PASSED" if passed else "FAILED"
+                    _print(f"[w{w_id}] {test.name} {status} [{duration:.0f}s]")
+
+                if gi is not None:
+                    gpu_states[gi].budget_used -= test.profiled_gib
+                    gpu_states[gi].running_count -= 1
+                completed.append(
+                    _CompletedTest(
+                        test=test,
+                        duration=duration,
+                        passed=passed,
+                        skipped=skipped,
+                        skip_reason=skip_reason,
+                        fail_reason=fail_reason,
+                    )
+                )
+                del running[w_id]
+
+                # Print status immediately after completion
+                parts = [_build_status(now)]
+                if pending:
+                    queued_str = ", ".join(f"w{t.w_id}" for t in pending)
+                    parts.append(f"[queued: {queued_str}]")
+                _print(" ".join(parts))
+                next_status = now + 10
+
+        # --- Launch pending tests ---
+        # For each pending test, find the GPU with most available budget.
+        # Gate on BOTH budget tracking AND actual GPU free memory.
+        # vLLM stagger is per-GPU only — tests on different GPUs launch
+        # simultaneously.
+        if pending and len(running) < num_slots:
+            actual_free = {
+                gi: gs.total_gib - _get_gpu_used_gib(gi)
+                for gi, gs in gpu_states.items()
+            }
+            tentative = {
+                gi: _TentativeGpu(
+                    budget=gs.budget_used,
+                    free=actual_free[gi],
+                    count=gs.running_count,
+                )
+                for gi, gs in gpu_states.items()
+            }
+
+            to_launch: list[tuple[int, int]] = []  # (pending_idx, gpu_idx)
+            n_total = len(running)
+            for i, test in enumerate(pending):
+                if n_total + len(to_launch) >= num_slots:
+                    break
+                best_gi: int | None = None
+                best_avail = -1.0
+                for gi, gs in gpu_states.items():
+                    ts = tentative[gi]
+                    will_be_multi = ts.count >= 1
+                    cap = gs.budget_multi if will_be_multi else gs.total_gib
+                    avail = cap - ts.budget
+                    if avail < test.profiled_gib:
+                        continue
+                    if ts.free < test.profiled_gib:
+                        continue
+                    if avail > best_avail:
+                        best_gi = gi
+                        best_avail = avail
+                if best_gi is not None:
+                    to_launch.append((i, best_gi))
+                    tentative[best_gi].budget += test.profiled_gib
+                    tentative[best_gi].free -= test.profiled_gib
+                    tentative[best_gi].count += 1
+
+            # Pop from pending in reverse to preserve indices, then reverse
+            # back so longest-timeout tests launch first.
+            batch: list[_TestEntry] = []
+            for pending_idx, assigned_gpu in reversed(to_launch):
+                entry = pending.pop(pending_idx)
+                entry.assigned_gpu = assigned_gpu
+                batch.append(entry)
+            batch.reverse()
+
+            for entry in batch:
+                w_id = entry.w_id
+                gi = entry.assigned_gpu
+                assert gi is not None
+                is_vllm = (
+                    entry.requested_sglang_kv_tokens is None and entry.profiled_gib > 0
+                )
+
+                # Per-GPU vLLM stagger — only between vLLM tests on the
+                # same GPU.  Tests on different GPUs launch simultaneously.
+                if is_vllm:
+                    last_t = last_vllm_launch.get(gi, 0)
+                    wait = _VLLM_LAUNCH_STAGGER_S - (time.monotonic() - last_t)
+                    if wait > 0:
+                        time.sleep(wait)
+
+                gpu_states[gi].budget_used += entry.profiled_gib
+                gpu_states[gi].running_count += 1
+                run_info = _launch_test(entry, env_base)
+                running[w_id] = run_info
+
+                if is_vllm:
+                    last_vllm_launch[gi] = time.monotonic()
+
+                retry_str = f" (retry {entry.retries})" if entry.retries else ""
+                _print(
+                    f"[w{w_id}] {entry.name} "
+                    f"(GPU{gi}, profiled={entry.profiled_gib:.1f} GiB, "
+                    f"{_fmt_req(entry)}) RUNNING{retry_str}"
+                )
+
+                now = time.monotonic()
+                if now >= next_status and (running or pending):
+                    parts = [_build_status(now)]
+                    if pending:
+                        queued_str = ", ".join(f"w{t.w_id}" for t in pending)
+                        parts.append(f"[queued: {queued_str}]")
+                    _print(" ".join(parts))
+                    next_status = now + 10
+
+        # Periodic status (print even when waiting for VRAM to free up)
+        if now >= next_status and (running or pending):
+            parts = [_build_status(now)]
+            if pending:
+                queued_str = ", ".join(f"w{t.w_id}" for t in pending)
+                if not running:
+                    next_needed = pending[0].profiled_gib
+                    parts.append(f"[waiting for {next_needed:.1f} GiB free]")
+                parts.append(f"[queued: {queued_str}]")
+            _print(" ".join(parts))
+            next_status = now + 10
+
+        if running or pending:
+            time.sleep(1.0)
+
+    # Summary
+    wall_time = time.monotonic() - t0
+    sequential_time = sum(c.duration for c in completed if not c.skipped)
+    n_passed = sum(1 for c in completed if c.passed)
+    n_skipped = sum(1 for c in completed if c.skipped)
+    n_failed = sum(1 for c in completed if not c.passed and not c.skipped)
+
+    completed.sort(key=lambda c: c.test.w_id)
+
+    _print()
+    _print(f"{'=' * 27} short test summary info {'=' * 27}")
+    for c in completed:
+        test = c.test
+        w_id = test.w_id
+        if c.skipped:
+            reason = c.skip_reason or "skipped"
+            _print(f"SKIPPED [w{w_id}] {test.name} - {reason}")
+        elif c.passed:
+            duration = int(c.duration)
+            timeout = int(test.timeout)
+            retries = test.retries
+            retry_str = f" ({retries} retries)" if retries else ""
+            _print(f"PASSED [w{w_id}] {test.name} [{duration}s/{timeout}s]{retry_str}")
+        else:
+            duration = int(c.duration)
+            timeout = int(test.timeout)
+            retries = test.retries
+            retry_str = f" ({retries} retries)" if retries else ""
+            fail_str = f" - {c.fail_reason}" if c.fail_reason else ""
+            _print(
+                f"FAILED [w{w_id}] {test.name} "
+                f"[{duration}s/{timeout}s]{retry_str}{fail_str}"
+            )
+
+    n_summary_parts = []
+    if n_failed:
+        n_summary_parts.append(f"{n_failed} failed")
+    n_summary_parts.append(f"{n_passed} passed")
+    if n_skipped:
+        n_summary_parts.append(f"{n_skipped} skipped")
+
+    wall_int = int(wall_time)
+    h, remainder = divmod(wall_int, 3600)
+    m, s = divmod(remainder, 60)
+    time_str = f"{wall_time:.2f}s"
+    if h:
+        time_str += f" ({h}:{m:02d}:{s:02d})"
+    elif m:
+        time_str += f" ({m:01d}:{s:02d})"
+
+    summary = ", ".join(n_summary_parts) + f" in {time_str}"
+    if n_passed > 1 and sequential_time > 0:
+        speedup = sequential_time / wall_time
+        summary += f" (vs {sequential_time:.0f}s seq, {speedup:.1f}x)"
+
+    pad = max(0, (78 - len(summary) - 2) // 2)
+    _print(f"{'=' * pad} {summary} {'=' * pad}")
+
+    combined = _aggregate_junit_xml(_JUNIT_DIR)
+    if combined:
+        _print(f"JUnit XML: {combined}")
+
+    return 0 if n_failed == 0 else 1
+
+
+# ---------------------------------------------------------------------------
+# Standalone CLI
+# ---------------------------------------------------------------------------
+
+
+def main() -> int:
+    parser = argparse.ArgumentParser(
+        description="Run GPU tests in parallel with VRAM-aware scheduling.",
+        usage="%(prog)s --max-vram-gib=N [-n SLOTS] [--gpu=0,1] [pytest-args...]",
+    )
+    parser.add_argument(
+        "--max-vram-gib",
+        type=float,
+        required=True,
+        help="Only run tests with profiled_vram_gib <= N.",
+    )
+    parser.add_argument(
+        "-n",
+        type=str,
+        default="auto",
+        help="Number of concurrent slots. 'auto' = gpu_usable / max_vram_gib.",
+    )
+    parser.add_argument(
+        "--gpu",
+        "--gpus",
+        type=str,
+        default="all",
+        help="Comma-separated GPU indices or 'all' (default: all).",
+    )
+
+    raw = sys.argv[1:]
+    if "--" in raw:
+        split = raw.index("--")
+        args = parser.parse_args(raw[:split])
+        pytest_args = raw[split + 1 :]
+    else:
+        args, pytest_args = parser.parse_known_args(raw)
+
+    if not pytest_args:
+        parser.error("No pytest arguments provided")
+
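+    # Detect -s on its own, --capture=no, or -s inside a cluster of boolean
+    # short flags such as -sv. A plain substring test would also match
+    # unrelated arguments like --strict-markers.
+    is_stream = any(
+        a in ("-s", "--capture=no")
+        or (
+            a.startswith("-")
+            and not a.startswith("--")
+            and "s" in a
+            and set(a[1:]) <= set("svqxl")
+        )
+        for a in pytest_args
+    )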
+
+    gpus = detect_gpus()
+    if not gpus:
+        _print("ERROR: No GPUs detected")
+        return 1
+
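+    # argparse derives dest from the first long option ("--gpu"), so the
+    # parsed value is args.gpu even when the user passes --gpus.
+    gpu_indices = _parse_gpu_indices(args.gpu, gpus)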
+
+    _print(f"Collecting tests with --max-vram-gib={args.max_vram_gib}...")
+    test_ids = _collect_tests(pytest_args, args.max_vram_gib)
+    if not test_ids:
+        _print("No tests collected.")
+        return 0
+
+    meta = load_test_meta()
+
+    if args.n == "auto":
+        profiled_gibs = [
+            meta.get(tid, {}).get("profiled_vram_gib", args.max_vram_gib)
+            for tid in test_ids
+        ]
+        selected_gpus = [g for g in gpus if g["index"] in gpu_indices]
+        num_slots = auto_worker_count(selected_gpus, args.max_vram_gib, profiled_gibs)
+    else:
+        num_slots = int(args.n)
+
+    return run_parallel(
+        test_ids=test_ids,
+        meta=meta,
+        max_vram_gib=args.max_vram_gib,
+        num_slots=num_slots,
+        gpu_indices=gpu_indices,
+        stream=is_stream,
+    )
+
+
+if __name__ == "__main__":
+    sys.exit(main())
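
The launch loop in `run_parallel` picks, for each pending test, the GPU with the most remaining budget. It skips any GPU where either the tracked budget or the measured free memory is too small. A minimal standalone sketch of that selection rule follows. `GpuSlot` and `pick_gpu` are hypothetical names used only for illustration, and the numbers assume 48 GiB cards with the 15% multi-process margin.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class GpuSlot:
    """Hypothetical stand-in for the runner's per-GPU scheduling state."""

    total_gib: float     # physical VRAM on the device
    budget_multi: float  # reduced cap applied once tests share the GPU
    budget_used: float   # sum of profiled GiB of tests already placed
    free_gib: float      # free VRAM measured on the device
    running: int         # number of tests already placed on this GPU


def pick_gpu(slots: dict[int, GpuSlot], need_gib: float) -> int | None:
    """Return the GPU index with the most remaining budget that fits need_gib."""
    best_idx: int | None = None
    best_avail = -1.0
    for idx, slot in slots.items():
        # A lone test may use the whole GPU; sharing applies the reduced cap.
        cap = slot.budget_multi if slot.running >= 1 else slot.total_gib
        avail = cap - slot.budget_used
        # Gate on both the bookkeeping budget and the measured free memory.
        if avail < need_gib or slot.free_gib < need_gib:
            continue
        if avail > best_avail:
            best_idx, best_avail = idx, avail
    return best_idx


# Illustrative state: two 48 GiB GPUs, each with 0.85 * 48 = 40.8 GiB shared cap.
slots = {
    0: GpuSlot(total_gib=48.0, budget_multi=40.8, budget_used=30.0, free_gib=17.0, running=2),
    1: GpuSlot(total_gib=48.0, budget_multi=40.8, budget_used=6.0, free_gib=41.0, running=1),
}
```

With this state, `pick_gpu(slots, 8.0)` selects GPU 1 because it has the larger remaining budget. A 36 GiB request fits neither GPU, so the function returns `None` and the test would stay queued until something finishes.
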
--- a/tests/utils/test_mock_gpu_alloc.py
+++ b/tests/utils/test_mock_gpu_alloc.py
@@ -32,27 +32,27 @@ ALLOC_MIB = 4096  # 4 GiB
 @pytest.mark.gpu_1
 @pytest.mark.timeout(30)
 def test_mock_4gb_gpu_alloc():
-    """Allocate 4 GiB of GPU VRAM, hold 2s, release. Honors _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE."""
+    """Allocate 4 GiB of GPU VRAM, hold 2s, release. Honors _PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES."""
    if not torch.cuda.is_available():
        pytest.skip("CUDA not available")

    device = 0
    total_mib = torch.cuda.get_device_properties(device).total_memory / (1024 * 1024)

-    gpu_util = os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")
-    if gpu_util is not None:
-        cap_mib = total_mib * float(gpu_util)
+    kv_bytes_str = os.environ.get("_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES")
+    if kv_bytes_str is not None:
+        cap_mib = int(kv_bytes_str) / (1024 * 1024)
        logger.info(
-            "_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=%.2f -> cap %.0f MiB (%.1f GiB) of %.0f MiB total",
-            float(gpu_util),
+            "_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES=%s -> cap %.0f MiB (%.1f GiB) of %.0f MiB total",
+            kv_bytes_str,
            cap_mib,
            cap_mib / 1024,
            total_mib,
        )
        if ALLOC_MIB > cap_mib:
            raise RuntimeError(
-                f"Requested {ALLOC_MIB} MiB exceeds _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE "
-                f"cap of {cap_mib:.0f} MiB ({gpu_util})"
+                f"Requested {ALLOC_MIB} MiB exceeds KV cache cap "
+                f"of {cap_mib:.0f} MiB ({kv_bytes_str} bytes)"
            )

    num_elements = (ALLOC_MIB * 1024 * 1024) // 4
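
The mock test above now derives its allocation cap from an absolute byte count instead of a fraction of total VRAM. The conversion can be sketched in isolation as below. The helper names `cap_mib_from_env` and `check_alloc` are hypothetical and not part of the test file; only the environment variable name comes from the diff.

```python
from __future__ import annotations

import os
from typing import Mapping

ENV_VAR = "_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES"


def cap_mib_from_env(environ: Mapping[str, str] = os.environ) -> float | None:
    """Return the cap in MiB, or None when the override is not set."""
    raw = environ.get(ENV_VAR)
    if raw is None:
        return None
    return int(raw) / (1024 * 1024)


def check_alloc(alloc_mib: int, environ: Mapping[str, str] = os.environ) -> None:
    """Raise if the requested allocation exceeds the byte-based cap."""
    cap_mib = cap_mib_from_env(environ)
    if cap_mib is not None and alloc_mib > cap_mib:
        raise RuntimeError(
            f"Requested {alloc_mib} MiB exceeds KV cache cap of {cap_mib:.0f} MiB"
        )


# Illustrative environment: a 6 GiB override expressed in bytes.
example_env = {ENV_VAR: str(6 * 1024**3)}
```

Passing the environment as a mapping keeps the sketch testable without mutating `os.environ`. A 4096 MiB request passes under the 6 GiB cap in `example_env`, while an 8192 MiB request raises `RuntimeError`.
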

--- a/tests/utils/vram_utils.py
+++ b/tests/utils/vram_utils.py
+# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+
+"""GPU VRAM utilities for parallel test execution.
+
+Functions:
+    detect_gpus()                  Enumerate GPUs via pynvml
+    auto_worker_count(gpus, limit) Calculate slot count for -n auto
+    write_test_meta(items)         Serialize profiled/requested vram + timeout
+    load_test_meta()               Read the serialized test metadata
+    print_gpu_plan(gpus, limit, would_run)  Dry-run GPU plan summary
+
+Usage:
+    # Sequential (filter only)
+    pytest --max-vram-gib=10 -m "gpu_1 and vllm" tests/serve/
+
+    # Parallel (VRAM-aware scheduling)
+    pytest --max-vram-gib=10 -n auto -m "gpu_1 and vllm" tests/serve/
+"""
+
+from __future__ import annotations
+
+import json
+import logging
+import os
+import tempfile
+
+import pynvml
+
+_logger = logging.getLogger(__name__)
+
+# When 2+ tests run concurrently, reserve 15% of GPU VRAM for CUDA context
+# overhead across processes.  A single test gets the full GPU (0% margin).
+VRAM_MULTI_PROC_MARGIN = 0.15
+
+_TEST_META_FILENAME = "pytest_gpu_parallel_test_meta.json"
+
+
+def detect_gpus() -> list[dict]:
+    """Return a list of dicts with 'index', 'name', and 'total_mib' per GPU.
+
+    Uses pynvml (already a dependency via profile_pytest.py).  pynvml is
+    imported at module level, so it must be installed.  Returns an empty list
+    if NVML cannot be initialized (e.g. no NVIDIA driver) or no GPUs are found.
+    """
+    try:
+        pynvml.nvmlInit()
+    except pynvml.NVMLError:
+        return []
+    try:
+        count = pynvml.nvmlDeviceGetCount()
+        gpus = []
+        for i in range(count):
+            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
+            name = pynvml.nvmlDeviceGetName(handle)
+            if isinstance(name, bytes):  # some older pynvml releases return bytes
+                name = name.decode()
+            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
+            gpus.append(
+                {
+                    "index": i,
+                    "name": name,
+                    "total_mib": mem.total // (1024 * 1024),
+                }
+            )
+        return gpus
+    finally:
+        pynvml.nvmlShutdown()
+
+
+def auto_worker_count(
+    gpus: list[dict],
+    vram_limit: float,
+    test_profiled_gibs: list[float] | None = None,
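+) -> int:
+    """Calculate the total slot count (across all GPUs) for -n auto.
+
+    The per-GPU budget is the smallest GPU's VRAM minus VRAM_MULTI_PROC_MARGIN.
+    It is divided by the smallest nonzero profiled test size (if provided) to
+    maximize parallelism, falling back to vram_limit when no test sizes are
+    available.  Example: one 48 GiB GPU and a smallest test of 10 GiB give
+    int(48 * 0.85 / 10) = 4 slots.
+    """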
+    if not gpus or vram_limit <= 0:
+        return len(gpus) or 1
+    min_gpu_gib = min(g["total_mib"] for g in gpus) / 1024.0
+    budget_gib = min_gpu_gib * (1.0 - VRAM_MULTI_PROC_MARGIN)
+    divisor = vram_limit
+    if test_profiled_gibs:
+        nonzero = [g for g in test_profiled_gibs if g > 0]
+        if nonzero:
+            divisor = min(nonzero)
+    workers_per_gpu = max(1, int(budget_gib / divisor)) if divisor > 0 else 1
+    return len(gpus) * workers_per_gpu
+
+
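+def write_test_meta(items, dest_dir: str | None = None) -> None:
+    """Serialize per-test marker metadata to JSON, keyed by node ID.
+
+    Records profiled_vram_gib, requested_vllm_kv_cache_bytes,
+    requested_sglang_kv_tokens, timeout, and skip reason.  Called from
+    pytest_collection_modifyitems so the GPU orchestrator can read test
+    metadata without re-collecting.  Writes to dest_dir, or to the system temp
+    directory when dest_dir is None; writes nothing if no test has metadata.
+    """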
+    test_meta: dict[str, dict] = {}
+    for item in items:
+        meta: dict = {}
+        profiled_mark = item.get_closest_marker("profiled_vram_gib")
+        if profiled_mark and profiled_mark.args:
+            meta["profiled_vram_gib"] = profiled_mark.args[0]
+        kv_bytes_mark = item.get_closest_marker("requested_vllm_kv_cache_bytes")
+        if kv_bytes_mark and kv_bytes_mark.args:
+            meta["requested_vllm_kv_cache_bytes"] = kv_bytes_mark.args[0]
+        timeout_mark = item.get_closest_marker("timeout")
+        if timeout_mark and timeout_mark.args:
+            meta["timeout"] = timeout_mark.args[0]
+        kv_tokens_mark = item.get_closest_marker("requested_sglang_kv_tokens")
+        if kv_tokens_mark and kv_tokens_mark.args:
+            meta["requested_sglang_kv_tokens"] = kv_tokens_mark.args[0]
+        skip_mark = item.get_closest_marker("skip")
+        if skip_mark:
+            reason = skip_mark.kwargs.get("reason", "")
+            if not reason and skip_mark.args:
+                reason = skip_mark.args[0]
+            meta["skip_reason"] = reason or "skipped"
+        if meta:
+            test_meta[item.nodeid] = meta
+    if test_meta:
+        path = os.path.join(dest_dir or tempfile.gettempdir(), _TEST_META_FILENAME)
+        with open(path, "w") as f:
+            json.dump(test_meta, f)
+
+
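+def load_test_meta(src_dir: str | None = None) -> dict[str, dict]:
+    """Load the nodeid -> {profiled_vram_gib, timeout, ...} map.
+
+    Pass the same directory that was given to write_test_meta(dest_dir=...);
+    defaults to the system temp directory.  Returns {} if the file is missing
+    or is not valid JSON.
+    """
+    path = os.path.join(src_dir or tempfile.gettempdir(), _TEST_META_FILENAME)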
+    try:
+        with open(path) as f:
+            return json.load(f)
+    except (FileNotFoundError, json.JSONDecodeError):
+        return {}
+
+
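+def print_gpu_plan(
+    gpus: list[dict], vram_limit: float, would_run: list[tuple[str, float | None]]
+) -> None:
+    """Print the GPU-parallel plan section for --dry-run output."""
+    if not gpus:
+        # detect_gpus() may return []; min() below would raise on an empty list.
+        print("\nGPU-Parallel Plan: no GPUs detected")
+        return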
+    min_gpu_gib = min(g["total_mib"] for g in gpus) / 1024.0
+    budget_gib = min_gpu_gib * (1.0 - VRAM_MULTI_PROC_MARGIN)
+    profiled_gibs = [gib for _, gib in would_run if gib is not None and gib > 0]
+    min_test_gib = min(profiled_gibs) if profiled_gibs else vram_limit
+    auto_slots = max(1, int(budget_gib / min_test_gib)) if min_test_gib > 0 else 1
+
+    print(f"\n{'=' * 60}")
+    print("GPU-Parallel Plan")
+    print(f"{'=' * 60}")
+    for gpu in gpus:
+        gib = gpu["total_mib"] / 1024
+        print(f"  GPU {gpu['index']}: {gpu['name']} ({gib:.1f} GiB)")
+    print(f"\n  Usable VRAM: {budget_gib:.0f} GiB")
+    print("\n  Run options:")
+    print("    (no -n)  : sequential, 1 test at a time")
+    print(
+        f"    -n auto  : up to {auto_slots} slots per GPU "
+        f"({budget_gib:.0f} / {min_test_gib:.0f} GiB smallest test)"
+    )
+    print(f"    -n N     : N concurrent slots across {len(gpus)} GPU(s)")
+    print("\n  Usage:")
+    print(
+        f"    pytest --max-vram-gib={vram_limit:.0f} -n {auto_slots * len(gpus)} "
+        f'-m "gpu_1 and vllm" tests/serve/'
+    )