Commit 6dc85fbc, authored by Keiven C, committed by GitHub

feat: GPU-parallel test runner with VRAM-aware scheduling (#7560)


Signed-off-by: Keiven Chang <keivenchang@users.noreply.github.com>
parent 4ea21079
......@@ -29,8 +29,8 @@ JINJA_TEMPLATE_PATH = str(
pytestmark = [
pytest.mark.unit,
pytest.mark.sglang,
pytest.mark.gpu_1, # needs sglang installed (GPU node) but uses no GPU
pytest.mark.max_vram_gib(0),
pytest.mark.gpu_1, # needs sglang & GPU packages installed but does not actually use GPU
pytest.mark.profiled_vram_gib(0), # These unit tests do not actually use GPU VRAM
pytest.mark.pre_merge,
]
# Create SGLang-specific CLI args fixture
......
......@@ -70,12 +70,12 @@ Each engine backend has its own CLI flag to control what fraction of GPU memory
| SGLang | `--mem-fraction-static` | — | 0.88
| TRT-LLM | `--free-gpu-memory-fraction` | `DYN_TRTLLM_FREE_GPU_MEMORY_FRACTION` | 0.9
Dynamo launch scripts recognize a generic env var, `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` (float 0.0-1.0), and translate it to the engine-specific flag. This is used by `tests/utils/profile_pytest.py` to binary-search the minimum VRAM a test needs. Currently implemented for vLLM launch scripts; SGLang and TRT-LLM support is planned.
Dynamo launch scripts use absolute KV cache overrides for deterministic, parallel-safe GPU memory control. For vLLM, `_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES` maps to `--kv-cache-memory-bytes`. For SGLang, `_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS` maps to `--max-total-tokens`. These are set by `tests/utils/profile_pytest.py` during binary-search profiling and by `tests/utils/pytest_parallel_gpu.py` at runtime.
Setting a lower memory fraction leaves more headroom for other CUDA allocations (e.g. activation buffers, NCCL buffers) at the cost of a smaller KV cache. Setting it higher allows more concurrent requests but risks OOM from non-KV-cache allocations. Typical production values are 0.85-0.95.
> [!Important]
> In vLLM, when `--kv-cache-memory-bytes` is set to an explicit value (not None), it **overrides and ignores** `--gpu-memory-utilization` for KV cache sizing ([vLLM CacheConfig docs](https://docs.vllm.ai/en/stable/api/vllm/config/cache/)). This means `_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE` has no effect on actual VRAM usage for scripts that set `--kv-cache-memory-bytes`. For example, `disagg_multimodal_epd.sh` uses `--kv-cache-memory-bytes=512MB` for its prefill/decode workers, so their VRAM consumption is fixed regardless of the memory fraction.
> In vLLM, when `--kv-cache-memory-bytes` is set to an explicit value (not None), it **overrides and ignores** `--gpu-memory-utilization` for KV cache sizing ([vLLM CacheConfig docs](https://docs.vllm.ai/en/stable/api/vllm/config/cache/)). This is exactly why we use `--kv-cache-memory-bytes` for parallel-safe allocation: it provides a deterministic, absolute KV cache cap that is immune to profiling races.
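
The mapping from profiler env vars to engine flags described above can be sketched roughly as follows. This is a minimal illustration, not the real `build_gpu_mem_args` from `gpu_utils.sh` (which this diff does not show); the function name `sketch_gpu_mem_args` is hypothetical, and printing nothing when no override is set is an assumption about the fallback behavior.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for build_gpu_mem_args (the real helper is in
# gpu_utils.sh and is not shown here). Prints the engine-specific flag for an
# absolute KV cache override, or nothing when no override is set.
sketch_gpu_mem_args() {
    local engine="$1"
    case "$engine" in
        vllm)
            if [[ -n "${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}" ]]; then
                printf '%s\n' "--kv-cache-memory-bytes ${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES}"
            fi
            ;;
        sglang)
            if [[ -n "${_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS:-}" ]]; then
                printf '%s\n' "--max-total-tokens ${_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS}"
            fi
            ;;
    esac
}

# Example values only: a 1 GiB KV cache cap for vLLM, 1024 tokens for SGLang.
_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES=1073741824
_PROFILE_OVERRIDE_SGLANG_MAX_TOTAL_TOKENS=1024
sketch_gpu_mem_args vllm
sketch_gpu_mem_args sglang
```

Because the helper prints nothing when the override is unset, an unquoted `$GPU_MEM_ARGS` in a launch command expands to no arguments and the engine's own default applies.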
## Disaggregated Router
......
......@@ -55,7 +55,7 @@ if [ "$ENABLE_OTEL" = true ]; then
TRACE_ARGS+=(--enable-trace --otlp-traces-endpoint localhost:4317)
fi
GPU_MEM_FRACTION=$(build_gpu_mem_args sglang --model "$MODEL")
GPU_MEM_ARGS=$(build_gpu_mem_args sglang)
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
print_launch_banner "Launching Aggregated Serving" "$MODEL" "$HTTP_PORT"
......@@ -75,7 +75,7 @@ python3 -m dynamo.sglang \
--trust-remote-code \
--skip-tokenizer-init \
--enable-metrics \
${GPU_MEM_FRACTION:+--mem-fraction-static "$GPU_MEM_FRACTION"} \
$GPU_MEM_ARGS \
"${TRACE_ARGS[@]}" \
"${EXTRA_ARGS[@]}" &
......
......@@ -40,7 +40,7 @@ while [[ $# -gt 0 ]]; do
esac
done
GPU_MEM_FRACTION=$(build_gpu_mem_args sglang --model "$MODEL" 2>/dev/null || true)
GPU_MEM_ARGS=$(build_gpu_mem_args sglang)
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
print_launch_banner --no-curl "Launching Embedding Worker" "$MODEL" "$HTTP_PORT"
......@@ -68,8 +68,8 @@ python3 -m dynamo.sglang \
--tp 1 \
--trust-remote-code \
--use-sglang-tokenizer \
${GPU_MEM_FRACTION:+--mem-fraction-static "$GPU_MEM_FRACTION"} \
--enable-metrics \
$GPU_MEM_ARGS \
"${EXTRA_ARGS[@]}" &
# Exit on first worker failure; kill 0 in the EXIT trap tears down the rest
......
......@@ -54,7 +54,7 @@ fi
MODEL="Qwen/Qwen3-0.6B"
GPU_MEM_FRACTION=$(build_gpu_mem_args sglang --model "$MODEL")
GPU_MEM_ARGS=$(build_gpu_mem_args sglang)
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
print_launch_banner "Launching Aggregated + KV Routing (2 GPUs)" "$MODEL" "$HTTP_PORT"
......@@ -84,9 +84,9 @@ python3 -m dynamo.sglang \
--page-size 16 \
--tp 1 \
--trust-remote-code \
${GPU_MEM_FRACTION:+--mem-fraction-static "$GPU_MEM_FRACTION"} \
"${KV_EVENTS_ARGS_1[@]}" \
--enable-metrics \
$GPU_MEM_ARGS \
"${TRACE_ARGS[@]}" &
OTEL_SERVICE_NAME=dynamo-worker-2 DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT_WORKER2:-8082} \
......@@ -96,9 +96,9 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--page-size 16 \
--tp 1 \
--trust-remote-code \
${GPU_MEM_FRACTION:+--mem-fraction-static "$GPU_MEM_FRACTION"} \
"${KV_EVENTS_ARGS_2[@]}" \
--enable-metrics \
$GPU_MEM_ARGS \
"${TRACE_ARGS[@]}" &
# Exit on first worker failure; kill 0 in the EXIT trap tears down the rest
......
......@@ -48,7 +48,9 @@ fi
MODEL="Qwen/Qwen3-0.6B"
GPU_MEM_FRACTION=$(build_gpu_mem_args sglang --model "$MODEL")
GPU_MEM_ARGS=$(build_gpu_mem_args sglang)
DISAGG_BOOTSTRAP_PORT="${DYN_DISAGG_BOOTSTRAP_PORT:-12345}"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
print_launch_banner "Launching Disaggregated Serving (2 GPUs)" "$MODEL" "$HTTP_PORT"
......@@ -71,12 +73,12 @@ python3 -m dynamo.sglang \
--tp 1 \
--trust-remote-code \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 12345 \
--disaggregation-bootstrap-port "$DISAGG_BOOTSTRAP_PORT" \
--host 0.0.0.0 \
--port 40000 \
--disaggregation-transfer-backend nixl \
${GPU_MEM_FRACTION:+--mem-fraction-static "$GPU_MEM_FRACTION"} \
--enable-metrics \
$GPU_MEM_ARGS \
"${TRACE_ARGS[@]}" &
# run decode worker
......@@ -88,11 +90,11 @@ CUDA_VISIBLE_DEVICES=1 python3 -m dynamo.sglang \
--tp 1 \
--trust-remote-code \
--disaggregation-mode decode \
--disaggregation-bootstrap-port 12345 \
--disaggregation-bootstrap-port "$DISAGG_BOOTSTRAP_PORT" \
--host 0.0.0.0 \
--disaggregation-transfer-backend nixl \
${GPU_MEM_FRACTION:+--mem-fraction-static "$GPU_MEM_FRACTION"} \
--enable-metrics \
$GPU_MEM_ARGS \
"${TRACE_ARGS[@]}" &
# Exit on first worker failure; kill 0 in the EXIT trap tears down the rest
......
......@@ -3,9 +3,8 @@
# SPDX-License-Identifier: Apache-2.0
#
# Disaggregated prefill/decode on a SINGLE GPU.
# Per-worker VRAM is estimated from model parameters below. Override individual
# knobs (CONTEXT_LENGTH, MAX_RUNNING_REQUESTS) via env vars, or set
# _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE to bypass the calculation entirely.
# Per-worker VRAM is controlled via build_gpu_mem_args (see gpu_utils.sh).
# Override individual knobs (CONTEXT_LENGTH, MAX_RUNNING_REQUESTS) via env vars.
#
# Measured reference (Qwen/Qwen3-0.6B, --context-length 4096, RTX 6000 Ada 48 GiB):
# estimate (from gpu_utils.sh) : ~5.7 GiB per worker (w=1.1 + kv=0.9 + oh=3.7)
......@@ -26,10 +25,12 @@ MODEL="Qwen/Qwen3-0.6B"
CONTEXT_LENGTH="${CONTEXT_LENGTH:-4096}"
MAX_RUNNING_REQUESTS="${MAX_RUNNING_REQUESTS:-2}"
GPU_MEM_FRACTION=$(build_gpu_mem_args sglang --model "$MODEL" --max-model-len "$CONTEXT_LENGTH" --max-num-seqs "$MAX_RUNNING_REQUESTS" --workers-per-gpu 2)
GPU_MEM_ARGS=$(build_gpu_mem_args sglang --workers-per-gpu 2)
source "$SCRIPT_DIR/../../../common/launch_utils.sh"
DISAGG_BOOTSTRAP_PORT="${DYN_DISAGG_BOOTSTRAP_PORT:-12345}"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
print_launch_banner "Launching Disaggregated (same GPU)" "$MODEL" "$HTTP_PORT" \
"Workers: 2 (prefill + decode, fraction is per worker)"
......@@ -47,10 +48,10 @@ python3 -m dynamo.sglang \
--tp 1 \
--trust-remote-code \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 12345 \
--disaggregation-bootstrap-port "$DISAGG_BOOTSTRAP_PORT" \
--host 0.0.0.0 \
--disaggregation-transfer-backend nixl \
--mem-fraction-static "${GPU_MEM_FRACTION}" \
$GPU_MEM_ARGS \
--context-length "$CONTEXT_LENGTH" \
--chunked-prefill-size "$CONTEXT_LENGTH" \
--max-prefill-tokens "$CONTEXT_LENGTH" \
......@@ -77,10 +78,10 @@ python3 -m dynamo.sglang \
--tp 1 \
--trust-remote-code \
--disaggregation-mode decode \
--disaggregation-bootstrap-port 12345 \
--disaggregation-bootstrap-port "$DISAGG_BOOTSTRAP_PORT" \
--host 0.0.0.0 \
--disaggregation-transfer-backend nixl \
--mem-fraction-static "${GPU_MEM_FRACTION}" \
$GPU_MEM_ARGS \
--context-length "$CONTEXT_LENGTH" \
--chunked-prefill-size "$CONTEXT_LENGTH" \
--max-prefill-tokens "$CONTEXT_LENGTH" \
......
......@@ -82,18 +82,24 @@ else
DYN_DECODE_WORKER_GPU=${DYN_DECODE_WORKER_GPU:-2}
fi
# Per-worker CUDA_VISIBLE_DEVICES pinning. In single-gpu mode, inherit from parent
# (the parallel test runner sets CUDA_VISIBLE_DEVICES); overriding would defeat GPU assignment.
if [[ "$SINGLE_GPU" == "true" ]]; then
_ENCODE_CUDA_PIN=""
_PREFILL_CUDA_PIN=""
_DECODE_CUDA_PIN=""
else
_ENCODE_CUDA_PIN="CUDA_VISIBLE_DEVICES=$DYN_ENCODE_WORKER_GPU"
_PREFILL_CUDA_PIN="CUDA_VISIBLE_DEVICES=$DYN_PREFILL_WORKER_GPU"
_DECODE_CUDA_PIN="CUDA_VISIBLE_DEVICES=$DYN_DECODE_WORKER_GPU"
fi
# GPU memory fractions for workers (used with --mem-fraction-static)
DYN_ENCODE_GPU_MEM=${DYN_ENCODE_GPU_MEM:-0.9}
DYN_PREFILL_GPU_MEM=${DYN_PREFILL_GPU_MEM:-0.9}
DYN_DECODE_GPU_MEM=${DYN_DECODE_GPU_MEM:-0.9}
# Profiler override: scale prefill/decode fractions proportionally.
# Encode worker has no --mem-fraction-static in single-gpu mode, so it's unaffected.
if [[ -n "${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}" && "$SINGLE_GPU" == "true" ]]; then
_TOTAL_FRAC=$(awk -v p="$DYN_PREFILL_GPU_MEM" -v d="$DYN_DECODE_GPU_MEM" 'BEGIN { printf "%.4f", p + d }')
DYN_PREFILL_GPU_MEM=$(awk -v o="$_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE" -v p="$DYN_PREFILL_GPU_MEM" -v t="$_TOTAL_FRAC" 'BEGIN { printf "%.2f", o * p / t }')
DYN_DECODE_GPU_MEM=$(awk -v o="$_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE" -v d="$DYN_DECODE_GPU_MEM" -v t="$_TOTAL_FRAC" 'BEGIN { printf "%.2f", o * d / t }')
fi
GPU_MEM_ARGS=$(build_gpu_mem_args sglang --workers-per-gpu 3)
ENCODE_EXTRA_ARGS=""
PREFILL_EXTRA_ARGS=""
......@@ -104,14 +110,16 @@ if [[ "$SINGLE_GPU" == "true" ]]; then
# functional-test size so the last worker can initialize without OOM.
# --context-length keeps the per-request token pool allocation small.
ENCODE_EXTRA_ARGS=""
PREFILL_EXTRA_ARGS="--mem-fraction-static ${DYN_PREFILL_GPU_MEM} --delete-ckpt-after-loading --max-running-requests 2 --context-length 2048 --max-total-tokens 1024"
DECODE_EXTRA_ARGS="--mem-fraction-static ${DYN_DECODE_GPU_MEM} --delete-ckpt-after-loading --max-running-requests 2 --context-length 2048 --max-total-tokens 1024"
PREFILL_EXTRA_ARGS="--mem-fraction-static ${DYN_PREFILL_GPU_MEM} --delete-ckpt-after-loading --max-running-requests 2 --context-length 2048 --max-total-tokens 1024 $GPU_MEM_ARGS"
DECODE_EXTRA_ARGS="--mem-fraction-static ${DYN_DECODE_GPU_MEM} --delete-ckpt-after-loading --max-running-requests 2 --context-length 2048 --max-total-tokens 1024 $GPU_MEM_ARGS"
fi
# Prevent port collisions: the test framework exports DYN_SYSTEM_PORT which all
# child processes would inherit. Unset it so only workers that need it set their own.
unset DYN_SYSTEM_PORT
DISAGG_BOOTSTRAP_PORT="${DYN_DISAGG_BOOTSTRAP_PORT:-12345}"
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
print_launch_banner --multimodal "Launching Disaggregated Multimodal E/P/D" "$MODEL_NAME" "$HTTP_PORT"
......@@ -122,7 +130,7 @@ python3 -m dynamo.frontend &
# run SGLang multimodal encode worker (frontend-facing: encodes images, routes to worker)
echo "Starting encode worker on GPU $DYN_ENCODE_WORKER_GPU (GPU mem: $DYN_ENCODE_GPU_MEM)..."
DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT1:-8081} \
CUDA_VISIBLE_DEVICES=$DYN_ENCODE_WORKER_GPU python3 -m dynamo.sglang --multimodal-encode-worker --model-path "$MODEL_NAME" $SERVED_MODEL_ARG --chat-template "$CHAT_TEMPLATE" --skip-tokenizer-init $ENCODE_EXTRA_ARGS &
env ${_ENCODE_CUDA_PIN:+"$_ENCODE_CUDA_PIN"} python3 -m dynamo.sglang --multimodal-encode-worker --model-path "$MODEL_NAME" $SERVED_MODEL_ARG --chat-template "$CHAT_TEMPLATE" --skip-tokenizer-init $ENCODE_EXTRA_ARGS &
if [[ "$SINGLE_GPU" == "true" ]]; then
# Wait for encode worker to initialize before starting prefill worker.
......@@ -136,7 +144,7 @@ fi
# See https://github.com/sgl-project/sglang/pull/11203.
echo "Starting prefill worker on GPU $DYN_PREFILL_WORKER_GPU (GPU mem: $DYN_PREFILL_GPU_MEM)..."
DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT2:-8082} \
CUDA_VISIBLE_DEVICES=$DYN_PREFILL_WORKER_GPU python3 -m dynamo.sglang \
env ${_PREFILL_CUDA_PIN:+"$_PREFILL_CUDA_PIN"} python3 -m dynamo.sglang \
--multimodal-worker \
--model-path "$MODEL_NAME" \
$SERVED_MODEL_ARG \
......@@ -145,7 +153,7 @@ CUDA_VISIBLE_DEVICES=$DYN_PREFILL_WORKER_GPU python3 -m dynamo.sglang \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode prefill \
--disaggregation-bootstrap-port 12345 \
--disaggregation-bootstrap-port "$DISAGG_BOOTSTRAP_PORT" \
--host 0.0.0.0 \
--disable-radix-cache \
--disaggregation-transfer-backend nixl \
......@@ -159,7 +167,7 @@ fi
# run SGLang multimodal decode worker
echo "Starting decode worker on GPU $DYN_DECODE_WORKER_GPU (GPU mem: $DYN_DECODE_GPU_MEM)..."
CUDA_VISIBLE_DEVICES=$DYN_DECODE_WORKER_GPU python3 -m dynamo.sglang \
env ${_DECODE_CUDA_PIN:+"$_DECODE_CUDA_PIN"} python3 -m dynamo.sglang \
--multimodal-worker \
--model-path "$MODEL_NAME" \
$SERVED_MODEL_ARG \
......@@ -168,7 +176,7 @@ CUDA_VISIBLE_DEVICES=$DYN_DECODE_WORKER_GPU python3 -m dynamo.sglang \
--trust-remote-code \
--skip-tokenizer-init \
--disaggregation-mode decode \
--disaggregation-bootstrap-port 12345 \
--disaggregation-bootstrap-port "$DISAGG_BOOTSTRAP_PORT" \
--host 0.0.0.0 \
--disaggregation-transfer-backend nixl \
$DECODE_EXTRA_ARGS &
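
The per-worker pinning in this script relies on `env ${VAR:+"$VAR"}` expanding to nothing when the pin is empty, so the child inherits whatever `CUDA_VISIBLE_DEVICES` the parent (for example the parallel test runner) already set. A small sketch of that idiom, with a hypothetical `run_pinned` wrapper and `sh -c` standing in for the real worker process:

```shell
#!/usr/bin/env bash
# Hypothetical wrapper illustrating the optional-pin idiom used above.
run_pinned() {
    local pin="$1"
    shift
    # ${pin:+"$pin"} yields no word at all when pin is empty, so env receives
    # no assignment and the child inherits the parent's environment unchanged.
    env ${pin:+"$pin"} "$@"
}

export CUDA_VISIBLE_DEVICES=3

# Single-GPU mode: empty pin, child sees the inherited value.
run_pinned "" sh -c 'echo "$CUDA_VISIBLE_DEVICES"'

# Multi-GPU mode: explicit pin overrides the inherited value for this child only.
run_pinned "CUDA_VISIBLE_DEVICES=1" sh -c 'echo "$CUDA_VISIBLE_DEVICES"'
```

Leaving the expansion unquoted on the outside matters: `env "" cmd` would try to execute an empty command name, whereas the `:+` form drops the argument entirely.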
......
......@@ -72,27 +72,36 @@ if [[ -n "$SERVED_MODEL_NAME" ]]; then
fi
# GPU assignments (override via environment variables)
DYN_ENCODE_WORKER_GPU=${DYN_ENCODE_WORKER_GPU:-0}
DYN_WORKER_GPU=${DYN_WORKER_GPU:-1}
if [[ "$SINGLE_GPU" == "true" ]]; then
DYN_ENCODE_WORKER_GPU=${DYN_ENCODE_WORKER_GPU:-0}
DYN_WORKER_GPU=${DYN_WORKER_GPU:-0}
else
DYN_ENCODE_WORKER_GPU=${DYN_ENCODE_WORKER_GPU:-0}
DYN_WORKER_GPU=${DYN_WORKER_GPU:-1}
fi
# Per-worker CUDA_VISIBLE_DEVICES pinning. In single-gpu mode, inherit from parent
# (the parallel test runner sets CUDA_VISIBLE_DEVICES); overriding would defeat GPU assignment.
if [[ "$SINGLE_GPU" == "true" ]]; then
_ENCODE_CUDA_PIN=""
_WORKER_CUDA_PIN=""
else
_ENCODE_CUDA_PIN="CUDA_VISIBLE_DEVICES=$DYN_ENCODE_WORKER_GPU"
_WORKER_CUDA_PIN="CUDA_VISIBLE_DEVICES=$DYN_WORKER_GPU"
fi
# GPU memory fractions for workers (used with --mem-fraction-static)
DYN_ENCODE_GPU_MEM=${DYN_ENCODE_GPU_MEM:-0.9}
DYN_WORKER_GPU_MEM=${DYN_WORKER_GPU_MEM:-0.9}
# Profiler override: split _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE between workers
# preserving the ratio set by the env vars.
if [[ -n "${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-}" && "$SINGLE_GPU" == "true" ]]; then
_TOTAL_FRAC=$(awk -v e="$DYN_ENCODE_GPU_MEM" -v w="$DYN_WORKER_GPU_MEM" 'BEGIN { printf "%.4f", e + w }')
DYN_ENCODE_GPU_MEM=$(awk -v o="$_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE" -v e="$DYN_ENCODE_GPU_MEM" -v t="$_TOTAL_FRAC" 'BEGIN { printf "%.2f", o * e / t }')
DYN_WORKER_GPU_MEM=$(awk -v o="$_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE" -v w="$DYN_WORKER_GPU_MEM" -v t="$_TOTAL_FRAC" 'BEGIN { printf "%.2f", o * w / t }')
fi
GPU_MEM_ARGS=$(build_gpu_mem_args sglang --workers-per-gpu 2)
ENCODE_EXTRA_ARGS=""
WORKER_EXTRA_ARGS=""
if [[ "$SINGLE_GPU" == "true" ]]; then
ENCODE_EXTRA_ARGS="--mem-fraction-static ${DYN_ENCODE_GPU_MEM}"
WORKER_EXTRA_ARGS="--mem-fraction-static ${DYN_WORKER_GPU_MEM} --delete-ckpt-after-loading --max-running-requests 2 --chunked-prefill-size 4096 --max-prefill-tokens 4096"
WORKER_EXTRA_ARGS="--mem-fraction-static ${DYN_WORKER_GPU_MEM} --delete-ckpt-after-loading --max-running-requests 2 --chunked-prefill-size 4096 --max-prefill-tokens 4096 $GPU_MEM_ARGS"
fi
# Prevent port collisions: the test framework exports DYN_SYSTEM_PORT which all
......@@ -114,7 +123,7 @@ python3 -m dynamo.frontend &
# run SGLang multimodal encode worker (frontend-facing: encodes images, routes to worker)
echo "Starting encode worker on GPU $DYN_ENCODE_WORKER_GPU (GPU mem: $DYN_ENCODE_GPU_MEM)..."
DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT1:-8081} \
CUDA_VISIBLE_DEVICES=$DYN_ENCODE_WORKER_GPU python3 -m dynamo.sglang --multimodal-encode-worker --model-path "$MODEL_NAME" $SERVED_MODEL_ARG --chat-template "$CHAT_TEMPLATE" --skip-tokenizer-init $ENCODE_EXTRA_ARGS &
env ${_ENCODE_CUDA_PIN:+"$_ENCODE_CUDA_PIN"} python3 -m dynamo.sglang --multimodal-encode-worker --model-path "$MODEL_NAME" $SERVED_MODEL_ARG --chat-template "$CHAT_TEMPLATE" --skip-tokenizer-init $ENCODE_EXTRA_ARGS &
if [[ "$SINGLE_GPU" == "true" ]]; then
# Wait for encode worker to initialize before starting PD worker.
......@@ -128,7 +137,7 @@ fi
# See https://github.com/sgl-project/sglang/pull/11203.
echo "Starting PD worker on GPU $DYN_WORKER_GPU (GPU mem: $DYN_WORKER_GPU_MEM)..."
DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT2:-8082} \
CUDA_VISIBLE_DEVICES=$DYN_WORKER_GPU python3 -m dynamo.sglang \
env ${_WORKER_CUDA_PIN:+"$_WORKER_CUDA_PIN"} python3 -m dynamo.sglang \
--multimodal-worker \
--model-path "$MODEL_NAME" \
$SERVED_MODEL_ARG \
......
......@@ -3,9 +3,9 @@
# SPDX-License-Identifier: Apache-2.0
#
# Disaggregated prefill/decode on a SINGLE GPU.
# Per-worker VRAM is estimated from model parameters below. Override individual
# knobs (MAX_SEQ_LEN, MAX_CONCURRENT_SEQS) via env vars, or set
# _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE to bypass the calculation entirely.
# Per-worker VRAM is controlled via env vars (MAX_SEQ_LEN, MAX_CONCURRENT_SEQS).
# TODO: unify with build_gpu_mem_args once trtllm --override-engine-args JSON
# merging is supported.
#
# NOTE — trtllm fraction semantics differ from vllm/sglang:
# vllm/sglang: fraction of TOTAL VRAM (weights + KV + activations all inside)
......@@ -30,7 +30,9 @@ MODEL="Qwen/Qwen3-0.6B"
MAX_SEQ_LEN="${MAX_SEQ_LEN:-4096}"
MAX_CONCURRENT_SEQS="${MAX_CONCURRENT_SEQS:-2}"
GPU_MEM_FRACTION=$(build_gpu_mem_args trtllm --model "$MODEL" --max-model-len "$MAX_SEQ_LEN" --max-num-seqs "$MAX_CONCURRENT_SEQS" --workers-per-gpu 2)
# TODO: unify with build_gpu_mem_args once trtllm --override-engine-args JSON
# merging is supported.
GPU_MEM_FRACTION="${GPU_MEM_FRACTION:-}"
# Environment variables with defaults
export DYNAMO_HOME=${DYNAMO_HOME:-"/workspace"}
......@@ -68,7 +70,10 @@ done
# Override free_gpu_memory_fraction only when GPU_MEM_FRACTION is set, so the
# script controls KV cache size the same way vllm (--gpu-memory-utilization) and
# sglang (--mem-fraction-static) pass memory parameters from the launch script.
OVERRIDE_PAIRS="\"kv_cache_config\": {\"free_gpu_memory_fraction\": ${GPU_MEM_FRACTION}}"
OVERRIDE_PAIRS=""
if [[ -n "$GPU_MEM_FRACTION" ]]; then
OVERRIDE_PAIRS="\"kv_cache_config\": {\"free_gpu_memory_fraction\": ${GPU_MEM_FRACTION}}"
fi
if [ "$ENABLE_OTEL" = true ]; then
export DYN_LOGGING_JSONL=true
export OTEL_EXPORT_ENABLED=1
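
The conditional `OVERRIDE_PAIRS` assignment above avoids emitting `"free_gpu_memory_fraction": ` with an empty value, which would be invalid JSON. A rough sketch of the idea follows; the `build_override_json` name and the way the pairs are wrapped in braces are assumptions, since the rest of the script is not visible in this hunk.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the engine override JSON, including the KV cache
# fraction only when one was provided.
build_override_json() {
    local frac="${1:-}"
    local pairs=""
    if [[ -n "$frac" ]]; then
        pairs="\"kv_cache_config\": {\"free_gpu_memory_fraction\": ${frac}}"
    fi
    # Assumed wrapping: the real script may merge further pairs before this step.
    printf '{%s}\n' "$pairs"
}

build_override_json 0.3   # fraction set: kv_cache_config is included
build_override_json ""    # fraction unset: empty object, engine default applies
```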
......
......@@ -33,7 +33,7 @@ done
MAX_MODEL_LEN="${MAX_MODEL_LEN:-4096}"
MAX_CONCURRENT_SEQS="${MAX_CONCURRENT_SEQS:-2}"
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL" --max-model-len "$MAX_MODEL_LEN" --max-num-seqs "$MAX_CONCURRENT_SEQS")
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
print_launch_banner "Launching Aggregated Serving (1 GPU)" "$MODEL" "$HTTP_PORT"
......@@ -48,7 +48,8 @@ DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT:-8081} \
python -m dynamo.vllm --model "$MODEL" --enforce-eager \
--max-model-len "$MAX_MODEL_LEN" \
--max-num-seqs "$MAX_CONCURRENT_SEQS" \
${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} "${EXTRA_ARGS[@]}" &
$GPU_MEM_ARGS \
"${EXTRA_ARGS[@]}" &
# Exit on first worker failure; kill 0 in the EXIT trap tears down the rest
wait_any_exit
......@@ -17,7 +17,7 @@ MODEL="Qwen/Qwen3-0.6B"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-4096}"
MAX_CONCURRENT_SEQS="${MAX_CONCURRENT_SEQS:-2}"
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL" --max-model-len "$MAX_MODEL_LEN" --max-num-seqs "$MAX_CONCURRENT_SEQS")
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
print_launch_banner "Launching Aggregated Serving + LMCache (1 GPU)" "$MODEL" "$HTTP_PORT"
......@@ -28,7 +28,8 @@ DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT:-8081} \
python -m dynamo.vllm --model "$MODEL" --enforce-eager \
--max-model-len "$MAX_MODEL_LEN" \
--max-num-seqs "$MAX_CONCURRENT_SEQS" \
${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' &
$GPU_MEM_ARGS \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' &
# Exit on first worker failure; kill 0 in the EXIT trap tears down the rest
wait_any_exit
......@@ -27,7 +27,7 @@ MODEL="Qwen/Qwen3-0.6B"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-4096}"
MAX_CONCURRENT_SEQS="${MAX_CONCURRENT_SEQS:-2}"
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL" --max-model-len "$MAX_MODEL_LEN" --max-num-seqs "$MAX_CONCURRENT_SEQS")
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
print_launch_banner "Launching Aggregated + LMCache + Multiproc (1 GPU)" "$MODEL" "$HTTP_PORT"
......@@ -39,7 +39,8 @@ DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT:-8081} \
python -m dynamo.vllm --model "$MODEL" --enforce-eager \
--max-model-len "$MAX_MODEL_LEN" \
--max-num-seqs "$MAX_CONCURRENT_SEQS" \
${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' &
$GPU_MEM_ARGS \
--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}' &
# Exit on first worker failure; kill 0 in the EXIT trap tears down the rest
wait_any_exit
......@@ -69,7 +69,7 @@ case "$MODEL_NAME" in
MODEL_EXTRA_ARGS="--tensor-parallel-size=8" ;;
esac
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL_NAME" --max-model-len "$MAX_MODEL_LEN" --max-num-seqs "$MAX_CONCURRENT_SEQS")
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
# Start vLLM worker with vision model
# --enforce-eager: Quick deployment (remove for production)
......@@ -79,7 +79,9 @@ DYN_SYSTEM_PORT=${DYN_SYSTEM_PORT:-8081} \
python -m dynamo.vllm --enable-multimodal --model $MODEL_NAME \
--max-model-len "$MAX_MODEL_LEN" \
--max-num-seqs "$MAX_CONCURRENT_SEQS" \
${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} $MODEL_EXTRA_ARGS "${EXTRA_ARGS[@]}"
$GPU_MEM_ARGS \
$MODEL_EXTRA_ARGS \
"${EXTRA_ARGS[@]}"
# Exit on first worker failure; kill 0 in the EXIT trap tears down the rest
wait_any_exit
......@@ -50,7 +50,7 @@ MAX_CONCURRENT_SEQS="${MAX_CONCURRENT_SEQS:-2}"
export DYN_REQUEST_PLANE=$REQUEST_PLANE
echo "Using request plane mode: $REQUEST_PLANE"
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL" --max-model-len "$MAX_MODEL_LEN" --max-num-seqs "$MAX_CONCURRENT_SEQS")
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
HTTP_PORT="${DYN_HTTP_PORT:-8000}"
print_launch_banner "Launching Aggregated Serving + Request Planes (1 GPU)" "$MODEL" "$HTTP_PORT"
......@@ -62,7 +62,7 @@ DYN_HEALTH_CHECK_ENABLED=true \
python -m dynamo.vllm --model "$MODEL" --enforce-eager \
--max-model-len "$MAX_MODEL_LEN" \
--max-num-seqs "$MAX_CONCURRENT_SEQS" \
${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} &
$GPU_MEM_ARGS &
# Exit on first worker failure; kill 0 in the EXIT trap tears down the rest
wait_any_exit
......@@ -61,10 +61,16 @@ fi
export DYN_REQUEST_PLANE=tcp
# Configure model-specific args
GPU_MEM=${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-0.80}
GPU_MEM="0.80"
KV_BYTES="${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"
if [[ -n "$KV_BYTES" ]]; then
GPU_MEM_ARGS="--kv-cache-memory-bytes $KV_BYTES --gpu-memory-utilization 0.01"
else
GPU_MEM_ARGS="--gpu-memory-utilization $GPU_MEM"
fi
MODEL_SPECIFIC_ARGS=""
if [[ "$MODEL_NAME" == "meta-llama/Llama-4-Maverick-17B-128E-Instruct-FP8" ]]; then
MODEL_SPECIFIC_ARGS="--tensor-parallel-size=8 --max-model-len=208960 --gpu-memory-utilization $GPU_MEM"
MODEL_SPECIFIC_ARGS="--tensor-parallel-size=8 --max-model-len=208960 $GPU_MEM_ARGS"
fi
if [[ $HEAD_NODE -eq 1 ]]; then
......
......@@ -3,9 +3,8 @@
# SPDX-License-Identifier: Apache-2.0
#
# Disaggregated prefill/decode on a SINGLE GPU.
# Per-worker VRAM is estimated from model parameters below. Override individual
# knobs (MAX_MODEL_LEN, MAX_CONCURRENT_SEQS) via env vars, or set
# _PROFILE_PYTEST_VRAM_FRAC_OVERRIDE to bypass the calculation entirely.
# Per-worker VRAM is controlled via build_gpu_mem_args (see gpu_utils.sh).
# Override individual knobs (MAX_MODEL_LEN, MAX_CONCURRENT_SEQS) via env vars.
#
# Measured reference (Qwen/Qwen3-0.6B, --max-model-len 4096, RTX 6000 Ada 48 GiB):
# estimate (from gpu_utils.sh) : ~4.0 GiB per worker (~8.0 GiB total)
......@@ -26,7 +25,7 @@ MODEL="Qwen/Qwen3-0.6B"
MAX_MODEL_LEN="${MAX_MODEL_LEN:-4096}"
MAX_CONCURRENT_SEQS="${MAX_CONCURRENT_SEQS:-2}"
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL" --max-model-len "$MAX_MODEL_LEN" --max-num-seqs "$MAX_CONCURRENT_SEQS" --workers-per-gpu 2)
GPU_MEM_ARGS=$(build_gpu_mem_args vllm --workers-per-gpu 2)
source "$SCRIPT_DIR/../../../common/launch_utils.sh"
......@@ -49,7 +48,7 @@ python3 -m dynamo.vllm \
--enforce-eager \
--disaggregation-mode decode \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
--gpu-memory-utilization "${GPU_MEM_FRACTION}" \
$GPU_MEM_ARGS \
--max-model-len "$MAX_MODEL_LEN" &
# Wait for decode worker to initialize before starting prefill worker
......@@ -70,7 +69,7 @@ python3 -m dynamo.vllm \
--enforce-eager \
--disaggregation-mode prefill \
--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' \
--gpu-memory-utilization "${GPU_MEM_FRACTION}" \
$GPU_MEM_ARGS \
--max-model-len "$MAX_MODEL_LEN" \
--kv-events-config '{"publisher":"zmq","topic":"kv-events","endpoint":"tcp://*:20081","enable_kv_cache_events":true}' &
......
......@@ -114,7 +114,13 @@ mkdir -p $LOG_DIR
# the GPU memory required for vLLM's reservation and for runtime spikes (not
# reserved by vLLM) can differ and cause the model to fail to start;
# adjust '--gpu-memory-utilization' as needed
GPU_MEM_UTIL="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-0.91}"
GPU_MEM_UTIL="0.91"
KV_BYTES="${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"
if [[ -n "$KV_BYTES" ]]; then
GPU_MEM_ARGS="--kv-cache-memory-bytes $KV_BYTES --gpu-memory-utilization 0.01"
else
GPU_MEM_ARGS="--gpu-memory-utilization $GPU_MEM_UTIL"
fi
dp_start_rank=$((NODE_RANK * GPUS_PER_NODE))
VLLM_NIXL_SIDE_CHANNEL_PORT=20096 \
......@@ -131,7 +137,7 @@ python3 -m dynamo.vllm \
--max-model-len 4096 \
--data-parallel-address $MASTER_ADDR \
--data-parallel-rpc-port 13345 \
--gpu-memory-utilization "$GPU_MEM_UTIL" \
$GPU_MEM_ARGS \
--enforce-eager \
--kv-events-config "{\"publisher\":\"zmq\",\"topic\":\"kv-events\",\"endpoint\":\"tcp://*:20080\",\"enable_kv_cache_events\":true}" 2>&1 | tee $LOG_DIR/dsr1_dep_${dp_start_rank}.log &
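
The scripts in this change pass `$GPU_MEM_ARGS` unquoted and rely on word splitting to turn the string into separate flags. One possible alternative, shown here only as a sketch, is to keep the flags in a bash array so each argument stays intact regardless of `IFS` or globbing. The `set_gpu_mem_args` function name is hypothetical; the `0.91` default and the `0.01` utilization value are copied from the script above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: populate GPU_MEM_ARGS as an array rather than a string.
set_gpu_mem_args() {
    local kv_bytes="${1:-}"
    if [[ -n "$kv_bytes" ]]; then
        # Absolute KV cache cap; the utilization flag is kept at the value the
        # script above uses alongside it.
        GPU_MEM_ARGS=(--kv-cache-memory-bytes "$kv_bytes" --gpu-memory-utilization 0.01)
    else
        GPU_MEM_ARGS=(--gpu-memory-utilization 0.91)
    fi
}

set_gpu_mem_args "${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"

# printf stands in for the real worker command and prints one argument per
# line, showing how "${GPU_MEM_ARGS[@]}" expands.
printf '%s\n' "${GPU_MEM_ARGS[@]}"
```

The trade-off is that an array cannot be returned through `$(...)` command substitution, which is presumably why a string-returning helper was used across these scripts.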
......
......@@ -63,13 +63,12 @@ python -m dynamo.frontend &
MAX_MODEL_LEN="${MAX_MODEL_LEN:-4096}"
MAX_CONCURRENT_SEQS="${MAX_CONCURRENT_SEQS:-2}"
GPU_MEM_FRACTION=$(build_gpu_mem_args vllm --model "$MODEL" --max-model-len "$MAX_MODEL_LEN" --max-num-seqs "$MAX_CONCURRENT_SEQS")
GPU_MEM_ARGS=$(build_gpu_mem_args vllm)
DYN_SYSTEM_ENABLED=true DYN_SYSTEM_PORT=${SYSTEM_PORT} \
python -m dynamo.vllm --model "$MODEL" --enforce-eager \
--max-model-len "$MAX_MODEL_LEN" \
--max-num-seqs "$MAX_CONCURRENT_SEQS" \
${GPU_MEM_FRACTION:+--gpu-memory-utilization "$GPU_MEM_FRACTION"} \
$GPU_MEM_ARGS \
--enable-lora \
--max-lora-rank 64 &
......
......@@ -28,12 +28,18 @@ if [[ "$CAPACITY_GB" != "0" ]]; then
}")
fi
GPU_MEM_UTIL="${_PROFILE_PYTEST_VRAM_FRAC_OVERRIDE:-.9}"
GPU_MEM_UTIL=".9"
KV_BYTES="${_PROFILE_OVERRIDE_VLLM_KV_CACHE_BYTES:-}"
if [[ -n "$KV_BYTES" ]]; then
GPU_MEM_ARGS="--kv-cache-memory-bytes $KV_BYTES --gpu-memory-utilization 0.01"
else
GPU_MEM_ARGS="--gpu-memory-utilization $GPU_MEM_UTIL"
fi
CUDA_VISIBLE_DEVICES=2 \
vllm serve "$MODEL" \
--enable-log-requests \
--max-model-len 16384 \
--gpu-memory-utilization "$GPU_MEM_UTIL" \
$GPU_MEM_ARGS \
"${EC_ARGS[@]}" \
"${EXTRA_ARGS[@]}"