Commit 0b20745e authored by Keiven C, committed by GitHub

feat: GPU VRAM profiler via memory fraction injection + profiled test markers (part 2 - vLLM only) (#6719)
Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
parent d047851e
......@@ -28,10 +28,12 @@ if [[ "$CAPACITY_GB" != "0" ]]; then
}")
fi
GPU_MEM_UTIL="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-.9}"
CUDA_VISIBLE_DEVICES=2 \
vllm serve "$MODEL" \
--enable-log-requests \
--max-model-len 16384 \
--gpu-memory-utilization .9 \
--gpu-memory-utilization "$GPU_MEM_UTIL" \
"${EC_ARGS[@]}" \
"${EXTRA_ARGS[@]}"
......@@ -20,7 +20,7 @@ MODEL="${MODEL:-Qwen/Qwen3-VL-8B-Instruct}"
NAMESPACE="${NAMESPACE:-dynamo}"
HTTP_PORT="${HTTP_PORT:-8000}"
BLOCK_SIZE="${BLOCK_SIZE:-16}" # Must match vLLM backend KV block size
GPU_MEMORY_UTILIZATION="${GPU_MEMORY_UTILIZATION:-0.85}"
GPU_MEMORY_UTILIZATION="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-${GPU_MEMORY_UTILIZATION:-0.85}}"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-8192}"
NATS_SERVER="${NATS_SERVER:-nats://127.0.0.1:4222}"
......
......@@ -57,7 +57,7 @@ controls the *overall* VRAM budget (and thus whether the model fits), but the
KV cache portion is pinned to the explicit byte value.
Consequence for profiling: if a script uses `--kv-cache-memory-bytes`,
changing `DYN_GPU_MEMORY_FRACTION_OVERRIDE` (which maps to
changing `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` (which maps to
`--gpu-memory-utilization`) won't change the KV cache size, only the leftover
headroom for activations and overhead.
......@@ -256,14 +256,13 @@ to get 10 GiB of KV cache with a 5 GiB model.
The helper functions in `gpu_utils.sh` handle these differences:
- `gpu_gb_to_total_fraction`: for vLLM/sglang (fraction of total VRAM)
- `gpu_gb_to_free_fraction`: for TensorRT-LLM (fraction of free VRAM)
- `gpu_worker_fraction <engine>`: unified wrapper — reads `_EW_*` vars from
`estimate_worker_vram` and calls the right function for the engine.
- `gpu_worker_fraction <engine> <total_gib> <kv_gib>`: converts estimated GiB
into the engine-appropriate fraction (total for vllm/sglang, free for trtllm).
Launch scripts use `gpu_worker_fraction` so they all follow the same pattern:
Launch scripts use `build_gpu_mem_args` which calls these internally:
```bash
estimate_worker_vram "$MODEL" "$SEQ_LEN" "$CONCURRENCY" trtllm
GPU_MEM_FRACTION=$(gpu_worker_fraction trtllm)
GPU_MEM_FRACTION=$(build_gpu_mem_args trtllm --model "$MODEL" --max-model-len "$SEQ_LEN" --max-num-seqs "$CONCURRENCY")
```
---
......@@ -291,7 +290,7 @@ kv_cache_gib = kv_bytes_per_token * max_model_len * max_concurrent_seqs / (1024^
---
## `DYN_GPU_MEMORY_FRACTION_OVERRIDE`
## `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE`
Environment variable used by Dynamo's VRAM profiler to binary-search the minimum
memory fraction a script needs.
......@@ -299,8 +298,8 @@ memory fraction a script needs.
- Maps to `--gpu-memory-utilization` in vLLM and `--mem-fraction-static` in sglang.
- For TensorRT-LLM, maps to `kv_cache_config.free_gpu_memory_fraction` via
`--override-engine-args`.
- Launch scripts use `gpu_worker_fraction <engine>` to compute the default
fraction; the override bypasses this and splits the raw value between workers.
- Launch scripts use `build_gpu_mem_args` to compute the default fraction;
the override bypasses the estimator and splits the raw value between workers.
- Scripts that use `--kv-cache-memory-bytes` (vLLM) bypass the fraction-based KV
cache sizing, making the profiler's fraction override ineffective for KV cache.
Those scripts should warn when `DYN_GPU_MEMORY_FRACTION_OVERRIDE` is set.
Those scripts should warn when `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` is set.
......@@ -135,6 +135,12 @@ print_launch_banner() {
echo "=========================================="
echo "Model: $_model"
echo "Frontend: http://localhost:$_port"
local _seq_len="${MAX_MODEL_LEN:-${CONTEXT_LENGTH:-${MAX_SEQ_LEN:-}}}"
local _frac="${GPU_MEM_FRACTION:-}"
[[ -n "$_seq_len" ]] && echo "Max seq len: $_seq_len"
[[ -n "$_frac" ]] && echo "GPU frac: $_frac"
for _line in "$@"; do
echo "$_line"
done
......
......@@ -4,6 +4,9 @@
set -e
trap 'echo Cleaning up...; kill 0' EXIT
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source "$SCRIPT_DIR/../../common/gpu_utils.sh"
# Default values
MODEL_NAME="Qwen/Qwen2-Audio-7B-Instruct"
PROMPT_TEMPLATE=""
......@@ -90,8 +93,10 @@ python -m dynamo.frontend --http-port 8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill &
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
# Wait for all background processes to complete
wait
......@@ -4,6 +4,9 @@
set -e
trap 'echo Cleaning up...; kill 0' EXIT
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source "$SCRIPT_DIR/../../common/gpu_utils.sh"
# Default values
MODEL_NAME="Qwen/Qwen2-Audio-7B-Instruct"
PROMPT_TEMPLATE=""
......@@ -90,9 +93,11 @@ python -m dynamo.frontend --http-port 8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
CUDA_VISIBLE_DEVICES=0 python3 components/audio_encode_worker.py --model $MODEL_NAME &
DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg &
DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg &
DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
# Wait for all background processes to complete
wait
......@@ -4,6 +4,9 @@
set -e
trap 'echo Cleaning up...; kill 0' EXIT
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source "$SCRIPT_DIR/../../common/gpu_utils.sh"
# Default values
MODEL_NAME="llava-hf/LLaVA-NeXT-Video-7B-hf"
PROMPT_TEMPLATE="USER: <video>\n<prompt> ASSISTANT:"
......@@ -16,8 +19,10 @@ python -m dynamo.frontend --http-port=8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE &
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill &
VLLM_NIXL_SIDE_CHANNEL_PORT=20097 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
# Wait for all background processes to complete
wait
......@@ -4,6 +4,9 @@
set -e
trap 'echo Cleaning up...; kill 0' EXIT
SCRIPT_DIR="$(dirname "$(readlink -f "$0")")"
source "$SCRIPT_DIR/../../common/gpu_utils.sh"
# Default values
MODEL_NAME="llava-hf/LLaVA-NeXT-Video-7B-hf"
PROMPT_TEMPLATE="USER: <video>\n<prompt> ASSISTANT:"
......@@ -17,9 +20,11 @@ python -m dynamo.frontend --http-port=8000 &
python3 components/processor.py --model $MODEL_NAME --prompt-template "$PROMPT_TEMPLATE" &
# run E/P/D workers
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME")
CUDA_VISIBLE_DEVICES=0 python3 components/video_encode_worker.py --model $MODEL_NAME --num-frames-to-sample $NUM_FRAMES_TO_SAMPLE &
DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg &
DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg &
DYN_VLLM_KV_EVENT_PORT=20081 VLLM_NIXL_SIDE_CHANNEL_PORT=20098 CUDA_VISIBLE_DEVICES=1 python3 components/worker.py --model $MODEL_NAME --worker-type prefill --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
DYN_VLLM_KV_EVENT_PORT=20082 VLLM_NIXL_SIDE_CHANNEL_PORT=20099 CUDA_VISIBLE_DEVICES=2 python3 components/worker.py --model $MODEL_NAME --worker-type decode --enable-disagg ${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
# Wait for all background processes to complete
wait
......@@ -233,6 +233,7 @@ markers = [
"gpu_2: marks tests to run on 2GPUs",
"gpu_4: marks tests to run on 4GPUs",
"gpu_8: marks tests to run on 8GPUs",
"max_vram_gib(N): peak VRAM in GiB (with 10% safety). Filter with --max-vram-gib=N",
"e2e: marks tests as end-to-end tests",
"integration: marks tests as integration tests",
"unit: marks tests as unit tests",
......
......@@ -116,6 +116,7 @@ Markers are required for all tests. They are used for test selection in CI and l
| Lifecycle [required] | pre_merge, post_merge, nightly, weekly, release | When the test should run |
| Test Type [required] | unit, integration, e2e, benchmark, performance, stress, multimodal | Nature of the test |
| Hardware [required] | gpu_0, gpu_1, gpu_2, gpu_4, gpu_8, h100 | Number/type of GPUs required |
| VRAM Requirement | max_vram_gib(N) | Peak VRAM in GiB, including a 10% safety margin. Pass `--max-vram-gib=N` to pytest to select only tests that fit on the available GPU. The marker alone does not stop a test from running on a smaller GPU, where it will OOM. Use `profile_pytest.py` to measure. |
| Component/Framework | vllm, trtllm, sglang, kvbm, kvbm_concurrency, planner, router | Backend or component specificity |
| Infrastructure | k8s, deploy, fault_tolerance | Infrastructure/environment needs |
| Execution | parallel | Test can run in parallel with pytest-xdist. Must use dynamic port allocation (`alloc_ports`) and not share resources (e.g. filesystem) |
......@@ -126,11 +127,30 @@ Markers are required for all tests. They are used for test selection in CI and l
@pytest.mark.pre_merge
@pytest.mark.integration
@pytest.mark.gpu_1
@pytest.mark.max_vram_gib(21) # peak 18.5 GiB GPU RAM used (+10% safety: 20.4 GiB)
@pytest.mark.vllm
def test_kv_cache_behavior():
...
```
### Filtering by VRAM
The `max_vram_gib(N)` marker records how much GPU memory a test needs. The pytest invocation can use `--max-vram-gib=N` as a **selector** to run only tests that fit on the available GPU. Tests that exceed the budget are skipped at collection time (before any test starts). Tests without a `max_vram_gib` marker always run (no constraint assumed).
Nothing prevents you from running without this flag — but if a test needs more VRAM than is physically available, it will OOM at runtime (e.g., vLLM raises `ValueError: No available memory for the cache blocks`).
```bash
# Run only tests that fit on a 48 GiB GPU — tests needing >48 GiB are skipped
python3 -m pytest --max-vram-gib=48 tests/
# GPU tests that have no max_vram_gib marker yet — need profiling
# TODO: profile these tests and add max_vram_gib markers
python3 -m pytest -m "(gpu_1 or gpu_2 or gpu_4 or gpu_8) and not max_vram_gib" tests/
# No filter — run everything regardless of VRAM (tests that exceed available memory will OOM)
python3 -m pytest tests/
```
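The selection rule described above can be sketched in a few lines of Python. This is a simplified illustration of the comparison only, not the conftest hook itself; the function `fits_budget` and the sample test names are hypothetical.

```python
from typing import Optional


def fits_budget(marker_gib: Optional[float], limit_gib: Optional[float]) -> bool:
    """Return True if a test should run under the given --max-vram-gib limit."""
    # No limit given, or no marker on the test: nothing to compare, so it runs.
    if limit_gib is None or marker_gib is None:
        return True
    # A test is skipped only when its marker strictly exceeds the limit.
    return marker_gib <= limit_gib


# Hypothetical test names mapped to their max_vram_gib value (None = unmarked).
tests = {"test_small": 8.6, "test_large": 60.0, "test_unmarked": None}
selected = [name for name, gib in tests.items() if fits_budget(gib, 48)]
print(selected)
```

With a 48 GiB budget, the marked 60 GiB test is dropped while the unmarked test still runs, which is why unmarked GPU tests are worth profiling.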
### Lifecycle Marker Note
Use the marker for the earliest pipeline stage where the test must run (e.g., `@pytest.mark.pre_merge`). This ensures the test is included in that stage and all subsequent ones (e.g., nightly, release), as CI pipelines select tests marked for earlier stages.
......@@ -416,6 +436,113 @@ GPU and model-loading overhead means Dynamo E2E tests are inherently slower than
---
## GPU VRAM Profiler (`profile_pytest.py`)
When writing or reviewing GPU tests, use `tests/utils/profile_pytest.py` to measure how much VRAM a test actually needs. The script runs the test repeatedly with different GPU memory caps and uses binary search to find the minimum VRAM required. It then prints recommended pytest markers you can copy into your test.
### How it works
The profiler sets the `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` environment variable (a fraction from 0.0 to 1.0 of total GPU RAM) and runs the test at each probe point. It bisects between "passes" and "OOM/fails" to find the boundary. After the search, it samples `nvidia-smi` to report peak VRAM, phase analysis, and marker recommendations.
**Requirement:** The test under profile **must** honor the `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` env var. For standalone tests that allocate CUDA memory directly, check `os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")` and cap your allocation accordingly — see `tests/utils/test_mock_gpu_alloc.py` for an example.
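The bisection idea can be sketched as follows. This is a minimal illustration, not the code in `profile_pytest.py`: the function name `find_min_fraction` and its parameters are assumptions, the defaults are borrowed from the example output (range 5% to 95%, tolerance 5%, one validation run plus five bisections), and the `passes` callback stands in for "run the test with the env var set to this fraction".

```python
from typing import Callable, Optional


def find_min_fraction(
    passes: Callable[[float], bool],
    lo: float = 0.05,
    hi: float = 0.95,
    tolerance: float = 0.05,
    max_bisections: int = 5,
) -> Optional[float]:
    """Bisect for the lowest fraction at which passes(fraction) is True."""
    # Validation run: if the test fails even at the upper bound, give up.
    if not passes(hi):
        return None
    best = hi
    for _ in range(max_bisections):
        if hi - lo <= tolerance:
            break
        mid = (lo + hi) / 2
        if passes(mid):
            best = hi = mid  # passed: try a smaller cap next
        else:
            lo = mid  # failed (OOM or error): a larger cap is needed
    return best


# Stand-in for a real test run: pretend the test passes at >= 35% of VRAM.
result = find_min_fraction(lambda frac: frac >= 0.35)
print(result)
```

The search assumes pass/fail is monotonic in the fraction; a flaky test that fails for unrelated reasons would push the reported minimum upward.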
### Engine-specific mapping
`_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` is a generic env var (float 0.0-1.0) that launch scripts translate to the engine-specific CLI flag:
| Engine | CLI flag | Launch script support |
|---------|----------------------------------|-----------------------|
| vLLM | `--gpu-memory-utilization` | Implemented in `agg.sh`, `disagg.sh`, etc. |
| SGLang | `--mem-fraction-static` | Not yet implemented (TODO) |
| TRT-LLM | `--free-gpu-memory-fraction` | Not yet implemented (has its own `DYN_TRTLLM_FREE_GPU_MEMORY_FRACTION`, TODO: unify) |
Scripts that already hard-code their own memory fraction (e.g. `agg_multimodal.sh` with 0.85) have a TODO to honor `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` in the future. If the profiler detects constant VRAM across all probes (meaning the env var is ignored), it prints a warning and skips marker recommendations.
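For a Python-side launcher, the translation could look roughly like the sketch below. The helper `vram_override_args` and the `ENGINE_FLAGS` table are hypothetical names introduced for illustration; the flags are copied from the table above, and TRT-LLM is left out because its mapping is marked as a TODO.

```python
import os
from typing import List, Mapping

# Flags copied from the mapping table; only the vLLM path is listed as implemented.
ENGINE_FLAGS = {
    "vllm": "--gpu-memory-utilization",
    "sglang": "--mem-fraction-static",
}


def vram_override_args(engine: str, environ: Mapping[str, str] = os.environ) -> List[str]:
    """Return extra CLI args for the engine, or [] when no override applies."""
    frac = environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE", "")
    # Unset or empty override, or an engine without a known flag: add nothing.
    if not frac or engine not in ENGINE_FLAGS:
        return []
    return [ENGINE_FLAGS[engine], frac]


print(vram_override_args("vllm", {"_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE": "0.36"}))
```

Returning an empty list when the variable is unset keeps the engine's own default in effect, which matches the `${GPU_MEM_FRACTION:+...}` pattern used by the shell launch scripts.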
### Usage
```bash
# Default mode: binary search for minimum VRAM (recommended)
# -xvs is optional: stop on first failure, verbose, show output
# Quote the test ID so the shell does not treat [aggregated] as a glob pattern
python tests/utils/profile_pytest.py "tests/serve/test_vllm.py::test_serve_deployment[aggregated]" -xvs
# Single-pass profiling (no binary search, just measure one run using default RAM)
python tests/utils/profile_pytest.py --no-find-min-vram "tests/serve/test_vllm.py::test_serve_deployment[aggregated]"
```
### Example output
```text
========================================================================
FIND MINIMUM VRAM (binary search)
========================================================================
GPU total : 48.0 GiB
GPU free : 48.0 GiB (in use: 0.0 GiB)
Test : tests/serve/test_vllm.py::test_serve_deployment[aggregated] -x
Range : 5% - 95% (tolerance 5%)
Max iter: 6 (1 validation + 5 bisections)
[probe 1/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.95 (45.6 GiB) [validation run]
[PASS] peak 18.5 GiB, wall 41s, iter took 49s
...
[probe 5/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.33 (15.9 GiB)
[FAIL] OOM or error at 33% (15.9 GiB), iter took 30s
[probe 6/6] _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE=0.36 (17.2 GiB) [~0 left, ETA ~0s]
[PASS] peak 18.5 GiB, wall 41s, iter took 49s
========================================================================
MINIMUM VRAM RESULT
========================================================================
Lowest passing utilization : 36%
Minimum VRAM needed : ~17.2 GiB (peak observed: 18.5 GiB, +10% safety: 20.4 GiB)
# test_serve_deployment[aggregated]: @pytest.mark.max_vram_gib(21)
# Fits on: L4 (24 GiB), V100-32GB (32 GiB), A6000/A40 (48 GiB), A100/H100 (80 GiB)
# Will OOM on: edge/embedded (4 GiB), RTX 3060/4060 (8 GiB), T4 (16 GiB)
========================================================================
========================================================================
Recommended markers to add to your pytest. You can copy-paste this:
========================================================================
# Measured using: tests/utils/profile_pytest.py tests/serve/test_vllm.py::test_serve_deployment[aggregated]
@pytest.mark.e2e # wall time 41.2s, loads a real model
@pytest.mark.gpu_1 # 1 GPU(s) used, peak 18.5 GiB
@pytest.mark.max_vram_gib(21) # peak 18.5 GiB GPU RAM used (+10% safety: 20.4 GiB)
@pytest.mark.timeout(124) # 3x observed 41.2s
WARNING: Wall time 41.2s is too slow for pre_merge (> 20s). Consider post_merge or nightly instead.
WARNING: Will OOM on edge/embedded (4 GiB).
WARNING: Will OOM on RTX 3060/4060 (8 GiB).
WARNING: Will OOM on T4 (16 GiB).
========================================================================
```
### How to use the recommendations
1. **Copy the `@pytest.mark.*` lines** into your test function or `pytestmark` list.
2. **VRAM marker** — `max_vram_gib(N)` records the peak GPU memory the test needs (with 10% safety margin). This marker does **not** skip tests on its own — if a test runs on a GPU that is too small, it will OOM and fail hard. Use `--max-vram-gib=N` to select only tests that fit on the available GPU (see [Filtering by VRAM](#filtering-by-vram) for examples). The WARNING lines in the profiler output tell you which GPU tiers would be too small (e.g., "Will OOM on T4 (16 GiB)").
3. **Lifecycle markers** — the profiler recommends `pre_merge` only for tests under 20 seconds. For slower tests, it warns you to consider `post_merge` or `nightly` but does not choose for you — use your judgment based on how critical the test is for catching regressions early.
4. **Timeout** — the recommended value is 3x the observed wall time. Adjust upward if your test has high variance (e.g., first-run model download, flaky network).
5. **Test type** (`unit`, `integration`, `e2e`) — inferred from wall time and whether a real model was loaded. Override if you know better (e.g., a fast test that uses a mock engine is `integration`, not `e2e`).
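The arithmetic behind the VRAM and timeout recommendations can be reproduced with a short sketch. The function `recommend` is hypothetical, and rounding up to a whole number is an assumption inferred from the example output (18.5 GiB peak giving `max_vram_gib(21)`, 41.2s giving `timeout(124)`); some markers in this change keep one decimal place instead (for example `max_vram_gib(20.4)`).

```python
import math


def recommend(peak_gib: float, wall_s: float) -> dict:
    """Derive marker values from an observed peak VRAM and wall time."""
    padded_gib = peak_gib * 1.10  # 10% safety margin on the observed peak
    return {
        "max_vram_gib": math.ceil(padded_gib),  # round up to a whole GiB
        "timeout": math.ceil(wall_s * 3),  # 3x the observed wall time
    }


rec = recommend(peak_gib=18.5, wall_s=41.2)
print(rec)
```

Rounding up rather than to nearest keeps the marker conservative: a test is never labeled as needing less memory than its padded peak.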
### Options
| Flag | Description |
|------|-------------|
| `--no-find-min-vram` | Skip binary search; run a single profiling pass instead |
| `--interval N` | GPU sampling interval in seconds (default: 1.0) |
| `--baseline-seconds N` | Seconds to sample before launching pytest (default: 3.0) |
| `--teardown-seconds N` | Seconds to sample after pytest exits (default: 5.0) |
| `--csv FILE` | Write raw nvidia-smi samples to a CSV file |
| `--no-recommend` | Suppress marker recommendations |
---
## References
- [pytest documentation](https://docs.pytest.org/en/stable/)
- [Bazel Test Encyclopedia — test sizes and timeouts](https://docs.bazel.build/versions/2.0.0/test-encyclopedia.html)
......
......@@ -42,6 +42,7 @@ def pytest_configure(config):
"gpu_2: marks tests to run on 2GPUs",
"gpu_4: marks tests to run on 4GPUs",
"gpu_8: marks tests to run on 8GPUs",
"max_vram_gib(N): peak VRAM in GiB (with 10% safety). Filter with --max-vram-gib=N",
"e2e: marks tests as end-to-end tests",
"integration: marks tests as integration tests",
"unit: marks tests as unit tests",
......@@ -101,6 +102,12 @@ def pytest_addoption(parser: pytest.Parser) -> None:
help="Skip restarting NATS and etcd services before deployment. "
"Default: deploy tests skip (for speed), fault-tolerance tests restart (for clean state).",
)
parser.addoption(
"--max-vram-gib",
type=float,
default=None,
help="Skip tests whose @pytest.mark.max_vram_gib(N) exceeds this value (GiB).",
)
LOG_FORMAT = "[TEST] %(asctime)s %(levelname)s %(name)s: %(message)s"
......@@ -293,6 +300,17 @@ def pytest_collection_modifyitems(config, items):
if _item_has_marker(item, marker_name):
item.add_marker(skip)
# Skip tests that exceed --max-vram-gib
vram_limit = config.getoption("--max-vram-gib", default=None)
if vram_limit is not None:
skip_vram = pytest.mark.skip(
reason=f"requires more than {vram_limit} GiB VRAM (--max-vram-gib={vram_limit})"
)
for item in items:
vram_mark = item.get_closest_marker("max_vram_gib")
if vram_mark and vram_mark.args and vram_mark.args[0] > vram_limit:
item.add_marker(skip_vram)
# Collect models via explicit pytest mark from final filtered items only
models_to_download = set()
for item in items:
......@@ -836,11 +854,17 @@ def dynamo_dynamic_ports(num_system_ports) -> Generator[ServicePorts, None, None
- frontend_port: OpenAI-compatible HTTP/gRPC ingress (dynamo.frontend)
- system_ports: List of worker metrics/system ports (configurable count via num_system_ports)
- kv_event_port: ZMQ port for vLLM KV event publishing (avoids collisions under xdist)
"""
frontend_port = allocate_port(DefaultPort.FRONTEND.value)
system_port_list = allocate_ports(num_system_ports, DefaultPort.SYSTEM1.value)
all_ports = [frontend_port, *system_port_list]
kv_event_port = allocate_port(DefaultPort.SYSTEM1.value)
all_ports = [frontend_port, *system_port_list, kv_event_port]
try:
yield ServicePorts(frontend_port=frontend_port, system_ports=system_port_list)
yield ServicePorts(
frontend_port=frontend_port,
system_ports=system_port_list,
kv_event_port=kv_event_port,
)
finally:
deallocate_ports(all_ports)
......@@ -89,6 +89,8 @@ class VllmWorkerProcess(ManagedProcess):
"dynamo.vllm",
"--model",
TEST_MODEL,
"--max-model-len",
"32768", # 32768 uses ~1.5 GiB (original default 131072 used ~6 GiB KV cache)
"--dyn-tool-call-parser",
"harmony",
"--dyn-reasoning-parser",
......@@ -97,6 +99,10 @@ class VllmWorkerProcess(ManagedProcess):
"32768",
]
gpu_util = os.environ.get("_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE")
if gpu_util:
command.extend(["--gpu-memory-utilization", gpu_util])
env = os.environ.copy()
env["DYN_LOG"] = "debug"
env["DYN_SYSTEM_USE_ENDPOINT_HEALTH_STATUS"] = '["generate"]'
......@@ -222,7 +228,9 @@ def _validate_chat_response(response: requests.Response) -> Dict[str, Any]:
return response_json
@pytest.mark.timeout(300) # ~3x measured total (~70s/test), rounded up
# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning_effort
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.timeout(300) # 3x observed ~70s wall time, rounded up
@pytest.mark.post_merge
def test_reasoning_effort(
request, start_services: ServicePorts, predownload_models
......@@ -288,7 +296,9 @@ def test_reasoning_effort(
)
@pytest.mark.timeout(180) # ~3x measured total (~50s/test), rounded up
# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.timeout(113) # 3x observed 37.4s wall time
@pytest.mark.post_merge
def test_tool_calling(
request, start_services: ServicePorts, predownload_models
......@@ -330,7 +340,9 @@ def test_tool_calling(
), "Expected get_current_weather tool to be called"
@pytest.mark.timeout(180) # ~3x measured total (~50s/test), rounded up
# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_tool_calling_second_round
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.timeout(115) # 3x observed 38.1s wall time
@pytest.mark.nightly
def test_tool_calling_second_round(
request, start_services: ServicePorts, predownload_models
......@@ -394,7 +406,9 @@ def test_tool_calling_second_round(
), "Expected response to include temperature information from tool call result (20°C)"
@pytest.mark.timeout(180) # ~3x measured total (~57s/test), rounded up
# Measured using: tests/utils/profile_pytest.py tests/frontend/test_vllm.py::test_reasoning
@pytest.mark.max_vram_gib(20.4) # observed peak 18.5 GiB (+10% safety)
@pytest.mark.timeout(131) # 3x observed 43.4s wall time
@pytest.mark.nightly
def test_reasoning(request, start_services: ServicePorts, predownload_models) -> None:
"""Test reasoning functionality with a mathematical problem."""
......
......@@ -6,6 +6,7 @@
import dataclasses
import logging
import os
import time
from collections.abc import Mapping
from copy import deepcopy
from typing import Any, Dict, Optional
......@@ -51,6 +52,16 @@ def run_serve_deployment(
if extra_env:
merged_env.update(extra_env)
# Stagger engine startup under xdist to avoid vLLM profiling race
# (vLLM bug #10643: concurrent profilers miscount each other's memory).
worker_id = os.environ.get("PYTEST_XDIST_WORKER", "")
if worker_id.startswith("gw"):
worker_num = int(worker_id.removeprefix("gw"))
if worker_num > 0:
stagger_s = worker_num * 15
logger.info("Staggering startup by %ds (xdist %s)", stagger_s, worker_id)
time.sleep(stagger_s)
if ports is not None:
dynamic_frontend_port = int(ports.frontend_port)
dynamic_system_ports = [int(p) for p in ports.system_ports]
......@@ -76,6 +87,10 @@ def run_serve_deployment(
for idx, port in enumerate(dynamic_system_ports, start=1):
merged_env[f"DYN_SYSTEM_PORT{idx}"] = str(port)
# Unique ZMQ port for vLLM KV event publishing (avoids xdist collisions).
if ports.kv_event_port:
merged_env["DYN_VLLM_KV_EVENT_PORT"] = str(ports.kv_event_port)
# Ensure EngineProcess health checks hit the correct frontend port.
config = dataclasses.replace(config, frontend_port=dynamic_frontend_port)
else:
......
......@@ -9,9 +9,10 @@ from pytest_httpserver import HTTPServer
from dynamo.common.utils.paths import WORKSPACE_DIR
from tests.serve.lora_utils import MinioLoraConfig, MinioService
from tests.utils.port_utils import allocate_port, deallocate_port
# Shared constants for multimodal testing
IMAGE_SERVER_PORT = 8765
IMAGE_SERVER_PORT = allocate_port(8765)
MULTIMODAL_IMG_PATH = os.path.join(
WORKSPACE_DIR, "lib/llm/tests/data/media/llm-optimize-deploy-graphic.png"
)
......@@ -42,7 +43,8 @@ def get_multimodal_test_image_bytes() -> bytes:
@pytest.fixture(scope="session")
def httpserver_listen_address():
return ("127.0.0.1", IMAGE_SERVER_PORT)
yield ("127.0.0.1", IMAGE_SERVER_PORT)
deallocate_port(IMAGE_SERVER_PORT)
@pytest.fixture(scope="function")
......@@ -60,7 +62,7 @@ def image_server(httpserver: HTTPServer):
Usage:
def test_multimodal(image_server):
url = "http://localhost:8765/llm-graphic.png"
# Use MULTIMODAL_IMG_URL from this module
# ... use url in your test payload
"""
image_data = get_multimodal_test_image_bytes()
......
......@@ -12,6 +12,8 @@ trap 'echo "Cleaning up..."; kill 0' EXIT
MODEL="${MODEL:-Qwen/Qwen3-0.6B}"
GPU_MEM_FRACTION="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}"
echo "Starting Dynamo frontend..."
python3 -m dynamo.frontend &
......@@ -22,7 +24,8 @@ CUDA_VISIBLE_DEVICES=0 python3 -m dynamo.vllm \
--nnodes 2 \
--node-rank 0 \
--master-addr 127.0.0.1 \
--enforce-eager &
--enforce-eager \
${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
echo "Starting dynamo.vllm headless worker (TP=2, nnodes=2, node-rank=1, GPU 1)..."
CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
......@@ -32,6 +35,7 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.vllm \
--node-rank 1 \
--master-addr 127.0.0.1 \
--enforce-eager \
${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} \
--headless &
wait
......@@ -54,10 +54,10 @@ vllm_dir = os.environ.get("VLLM_DIR") or os.path.join(
# vLLM test configurations
# NOTE: pytest.mark.gpu_1 tests take ~5.5 minutes total to run sequentially (with models pre-cached)
# TODO: Now that these tests use dynamic ports, optimize the runtime by bin-packing and running
# multiple engine deployments in parallel (while keeping GPU contention under control). This may
# require annotating each config with approximate GPU RAM usage so a future collector/launcher can
# bin-pack safely.
# TODO: Now that these tests use dynamic ports and each config has a max_vram_gib marker,
# optimize the runtime by bin-packing multiple engine deployments in parallel on the same GPU.
# A future collector/launcher can sum max_vram_gib values to decide how many tests fit
# concurrently without exceeding available VRAM.
vllm_configs = {
"aggregated": VLLMConfig(
name="aggregated",
......@@ -65,8 +65,9 @@ vllm_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.6), # observed peak 7.8 GiB (+10% safety)
pytest.mark.timeout(127), # 3x observed 42.2s wall time
pytest.mark.pre_merge,
pytest.mark.timeout(300), # 3x measured time (43s) + download time (150s)
],
model="Qwen/Qwen3-0.6B",
request_payloads=[
......@@ -90,7 +91,12 @@ vllm_configs = {
name="aggregated_logprobs",
directory=vllm_dir,
script_name="agg.sh",
marks=[pytest.mark.gpu_1, pytest.mark.post_merge],
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.6), # observed peak 7.8 GiB (+10% safety)
pytest.mark.timeout(73), # 3x observed 24.3s wall time
pytest.mark.post_merge,
],
model="Qwen/Qwen3-0.6B",
request_payloads=[
chat_payload_with_logprobs(
......@@ -116,8 +122,9 @@ vllm_configs = {
marks=[
pytest.mark.lmcache,
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.4 GiB (+10% safety)
pytest.mark.timeout(147), # 3x observed 49.0s wall time
pytest.mark.pre_merge,
pytest.mark.timeout(360), # 3x estimated time (70s) + download time (150s)
pytest.mark.skipif(
_is_cuda13(),
reason="lmcache does not support CUDA 13 as of v0.3.11",
......@@ -138,8 +145,9 @@ vllm_configs = {
marks=[
pytest.mark.lmcache,
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.4 GiB (+10% safety)
pytest.mark.timeout(148), # 3x observed 49.3s wall time
pytest.mark.pre_merge,
pytest.mark.timeout(360), # 3x estimated time (70s) + download time (150s)
pytest.mark.skipif(
_is_cuda13(),
reason="lmcache does not support CUDA 13 as of v0.3.11",
......@@ -162,8 +170,9 @@ vllm_configs = {
script_name="agg_request_planes.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.3 GiB (+10% safety)
pytest.mark.timeout(129), # 3x observed 43.0s wall time
pytest.mark.pre_merge,
pytest.mark.timeout(300), # 3x measured time (43s) + download time (150s)
],
model="Qwen/Qwen3-0.6B",
script_args=["--tcp"],
......@@ -178,8 +187,9 @@ vllm_configs = {
script_name="agg_request_planes.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.1), # observed peak 7.3 GiB (+10% safety)
pytest.mark.timeout(127), # 3x observed 42.3s wall time
pytest.mark.pre_merge,
pytest.mark.timeout(300), # 3x measured time (43s) + download time (150s)
],
model="Qwen/Qwen3-0.6B",
script_args=["--http"],
......@@ -196,7 +206,7 @@ vllm_configs = {
pytest.mark.gpu_2,
pytest.mark.pre_merge,
pytest.mark.skip(reason="DYN-2263"),
],
], # TODO: profile to get max_vram and timeout
model="Qwen/Qwen3-0.6B",
request_payloads=[
chat_payload_default(
......@@ -219,7 +229,7 @@ vllm_configs = {
pytest.mark.gpu_2,
pytest.mark.pre_merge,
pytest.mark.skip(reason="DYN-2264"),
],
], # TODO: profile to get max_vram and timeout
model="Qwen/Qwen3-0.6B",
request_payloads=[
# Test approximate KV routing (--no-kv-events mode)
......@@ -250,7 +260,10 @@ vllm_configs = {
name="disaggregated",
directory=vllm_dir,
script_name="disagg.sh",
marks=[pytest.mark.gpu_2, pytest.mark.pre_merge],
marks=[
pytest.mark.gpu_2,
pytest.mark.pre_merge,
], # TODO: profile to get max_vram and timeout
model="Qwen/Qwen3-0.6B",
request_payloads=[
chat_payload_default(),
......@@ -266,6 +279,7 @@ vllm_configs = {
pytest.mark.vllm,
pytest.mark.h100,
pytest.mark.nightly,
# TODO: profile to get max_vram and timeout
],
model="deepseek-ai/DeepSeek-V2-Lite",
script_args=[
......@@ -289,7 +303,12 @@ vllm_configs = {
name="multimodal_disagg_qwen3vl_2b_e_pd",
directory=vllm_dir,
script_name="disagg_multimodal_e_pd.sh",
marks=[pytest.mark.gpu_1, pytest.mark.pre_merge],
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(24.6), # observed peak 22.3 GiB (+10% safety)
pytest.mark.timeout(206), # 3x observed 68.4s wall time
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-VL-2B-Instruct",
script_args=["--model", "Qwen/Qwen3-VL-2B-Instruct", "--single-gpu"],
request_payloads=[
......@@ -318,7 +337,12 @@ vllm_configs = {
directory=vllm_dir,
script_name="agg_multimodal.sh",
# post_merge because needs real NIXL not stub
marks=[pytest.mark.gpu_1, pytest.mark.post_merge],
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(10.2), # observed peak 9.3 GiB (+10% safety)
pytest.mark.timeout(131), # 3x observed 43.7s wall time
pytest.mark.post_merge,
],
model="Qwen/Qwen2-VL-2B-Instruct",
# Pass --frontend-decoding to enable Rust frontend image decoding + NIXL RDMA transfer
script_args=[
......@@ -345,13 +369,20 @@ vllm_configs = {
)
],
),
# NOTE: Pack all workers on 1 GPU for lower CI resource requirements
# NOTE: Pack all workers on 1 GPU for lower CI resource requirements.
# NOTE: disagg_multimodal_epd.sh uses --kv-cache-memory-bytes=512MB for P/D
# workers. Per vLLM CacheConfig, kv_cache_memory_bytes (when not-None) ignores
# gpu_memory_utilization (ref: https://docs.vllm.ai/en/stable/api/vllm/config/cache/),
# so _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE has no effect. Regardless of GPU_MEM
# fractions (0.1/0.4/0.4), the 3 workers combined consistently use ~17.6 GiB
# total on this GPU.
"multimodal_disagg_qwen3vl_2b_epd": VLLMConfig(
name="multimodal_disagg_qwen3vl_2b_epd",
directory=vllm_dir,
script_name="disagg_multimodal_epd.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(19.4), # observed peak 17.6 GiB (+10% safety)
pytest.mark.post_merge,
pytest.mark.skip(reason="DYN-2265"),
],
......@@ -389,7 +420,12 @@ vllm_configs = {
name="multimodal_agg_qwen",
directory=vllm_dir,
script_name="agg_multimodal.sh",
marks=[pytest.mark.gpu_1, pytest.mark.post_merge],
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(21.6), # observed peak 19.6 GiB (+10% safety)
pytest.mark.timeout(150), # 3x observed 50.0s wall time
pytest.mark.post_merge,
],
model="Qwen/Qwen2.5-VL-7B-Instruct",
script_args=["--model", "Qwen/Qwen2.5-VL-7B-Instruct"],
delayed_start=0,
......@@ -418,6 +454,8 @@ vllm_configs = {
script_name="agg_multimodal.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(18.9), # observed peak 17.1 GiB (+10% safety)
pytest.mark.timeout(128), # 3x observed 42.7s wall time
pytest.mark.nightly,
# https://github.com/ai-dynamo/dynamo/issues/4501
pytest.mark.xfail(strict=False),
......@@ -456,7 +494,10 @@ vllm_configs = {
name="multimodal_video_agg",
directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"),
script_name="video_agg.sh",
marks=[pytest.mark.gpu_2, pytest.mark.nightly],
marks=[
pytest.mark.gpu_2,
pytest.mark.nightly,
], # TODO: profile to get max_vram and timeout
model="llava-hf/LLaVA-NeXT-Video-7B-hf",
delayed_start=60, # Video models require longer loading time
script_args=["--model", "llava-hf/LLaVA-NeXT-Video-7B-hf"],
......@@ -483,7 +524,10 @@ vllm_configs = {
name="multimodal_video_disagg",
directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"),
script_name="video_disagg.sh",
marks=[pytest.mark.gpu_2, pytest.mark.nightly],
marks=[
pytest.mark.gpu_2,
pytest.mark.nightly,
], # TODO: profile to get max_vram and timeout
model="llava-hf/LLaVA-NeXT-Video-7B-hf",
delayed_start=60, # Video models require longer loading time
script_args=["--model", "llava-hf/LLaVA-NeXT-Video-7B-hf"],
......@@ -512,7 +556,10 @@ vllm_configs = {
name="multimodal_audio_agg",
directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"),
script_name="audio_agg.sh",
marks=[pytest.mark.gpu_2, pytest.mark.nightly],
marks=[
pytest.mark.gpu_2,
pytest.mark.nightly,
], # TODO: profile to get max_vram and timeout
model="Qwen/Qwen2-Audio-7B-Instruct",
delayed_start=60, # Audio models require longer loading time
script_args=["--model", "Qwen/Qwen2-Audio-7B-Instruct"],
......@@ -539,7 +586,10 @@ vllm_configs = {
name="multimodal_audio_disagg",
directory=os.path.join(WORKSPACE_DIR, "examples/multimodal"),
script_name="audio_disagg.sh",
marks=[pytest.mark.gpu_2, pytest.mark.nightly],
marks=[
pytest.mark.gpu_2,
pytest.mark.nightly,
], # TODO: profile to get max_vram and timeout
model="Qwen/Qwen2-Audio-7B-Instruct",
delayed_start=60, # Audio models require longer loading time
script_args=["--model", "Qwen/Qwen2-Audio-7B-Instruct"],
......@@ -566,7 +616,11 @@ vllm_configs = {
name="aggregated_toolcalling",
directory=vllm_dir,
script_name="agg_multimodal.sh",
marks=[pytest.mark.gpu_2, pytest.mark.multimodal, pytest.mark.nightly],
marks=[
pytest.mark.gpu_2,
pytest.mark.multimodal,
pytest.mark.nightly,
], # TODO: profile to get max_vram and timeout
model="Qwen/Qwen3-VL-30B-A3B-Instruct-FP8",
script_args=[
"--model",
......@@ -646,10 +700,9 @@ vllm_configs = {
script_name="agg.sh",
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(21.9), # observed peak 19.9 GiB (+10% safety)
pytest.mark.timeout(233), # 3x observed 77.7s wall time
pytest.mark.post_merge,
pytest.mark.timeout(
420
), # 3x estimated time (60s) + download time (240s) for 7B model
],
model="deepseek-ai/deepseek-llm-7b-base",
script_args=[
......@@ -669,6 +722,7 @@ vllm_configs = {
marks=[
pytest.mark.gpu_2,
pytest.mark.pre_merge,
# TODO: profile to get max_vram
pytest.mark.timeout(300),
],
model="Qwen/Qwen3-0.6B",
......@@ -681,7 +735,12 @@ vllm_configs = {
name="guided_decoding",
directory=vllm_dir,
script_name="agg.sh",
marks=[pytest.mark.gpu_1, pytest.mark.pre_merge],
marks=[
pytest.mark.gpu_1,
pytest.mark.max_vram_gib(8.6), # observed peak 7.8 GiB (+10% safety)
pytest.mark.timeout(67), # 3x observed 22.3s wall time
pytest.mark.pre_merge,
],
model="Qwen/Qwen3-0.6B",
request_payloads=[
chat_payload(
......
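The max_vram_gib and timeout markers in the hunks above are annotated as "observed peak (+10% safety)" and "3x observed wall time". A minimal sketch of that arithmetic follows. The helper names and the round-up rule are assumptions for illustration, not code from this commit. Because the comments show rounded observations, a result can differ from the committed value by one unit in the last place.

```python
import math


def max_vram_gib_marker(observed_peak_gib: float, safety: float = 0.10) -> float:
    """Observed peak plus a safety margin, rounded up to 0.1 GiB.

    Hypothetical helper: the round-up rule is an assumption.
    """
    tenths = observed_peak_gib * (1.0 + safety) * 10.0
    # round() first so float noise (e.g. 193.60000000000002) cannot push ceil up.
    return math.ceil(round(tenths, 6)) / 10.0


def timeout_marker(observed_wall_s: float, factor: float = 3.0) -> int:
    """Observed wall time times a factor, rounded up to whole seconds.

    Hypothetical helper: the round-up rule is an assumption.
    """
    return math.ceil(round(observed_wall_s * factor, 6))


if __name__ == "__main__":
    print(max_vram_gib_marker(17.6))  # 19.4
    print(timeout_marker(42.3))       # 127
```

With these inputs the sketch reproduces max_vram_gib(19.4) for an observed peak of 17.6 GiB and timeout(127) for an observed 42.3s wall time, as they appear in the diff.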
......@@ -187,6 +187,9 @@ class EngineProcess(ManagedProcess):
),
],
delayed_start=config.delayed_start,
# Must stay False: command[0] is "bash", so True would kill every
# bash process system-wide. Stale cleanup relies on stragglers list
# and process-group termination in __exit__ instead.
terminate_all_matching_process_names=False,
stragglers=config.stragglers,
log_dir=request.node.name,
......
......@@ -38,6 +38,7 @@ class ServicePorts:
frontend_port: int
system_ports: list[int]
kv_event_port: int = 0
def _load_port_registry() -> dict:
......