Unverified commit 75bf1e09, authored by Alec, committed by GitHub

docs: restructure vLLM docs and add startup banners to launch scripts (#6698)


Signed-off-by: alec-flowers <aflowers@nvidia.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: ishandhanani <ishandhanani@gmail.com>
parent 47ed1227
......@@ -105,7 +105,7 @@ curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
"stream": true,
"max_tokens": 30
}'
......
......@@ -9,7 +9,7 @@ Dynamo SGLang supports three types of diffusion-based generation: **LLM diffusio
## Overview
| Type | Worker Flag | API Endpoint |
|------|------------|--------------|
| ---------------- | --------------------------- | ----------------------------------------- |
| LLM Diffusion | `--dllm-algorithm <algo>` | `/v1/chat/completions`, `/v1/completions` |
| Image Diffusion | `--image-diffusion-worker` | `/v1/images/generations` |
| Video Generation | `--video-generation-worker` | `/v1/videos` |
......@@ -40,7 +40,7 @@ curl -X POST http://localhost:8001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "inclusionAI/LLaDA2.0-mini-preview",
"messages": [{"role": "user", "content": "Hello! How are you?"}],
"messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}],
"temperature": 0.7,
"max_tokens": 512
}'
......@@ -66,7 +66,7 @@ curl http://localhost:8000/v1/images/generations \
-H "Content-Type: application/json" \
-d '{
"model": "black-forest-labs/FLUX.1-dev",
"prompt": "A sunset over the ocean",
"prompt": "Explain why Roger Federer is considered one of the greatest tennis players of all time",
"size": "1024x1024",
"response_format": "url",
"nvext": {
......@@ -94,7 +94,7 @@ Use `--wan-size 1b` (default, 1 GPU) or `--wan-size 14b` (2 GPUs). See the launc
curl http://localhost:8000/v1/videos \
-H "Content-Type: application/json" \
-d '{
"prompt": "A curious raccoon exploring a garden",
"prompt": "Roger Federer winning his 19th grand slam",
"model": "Wan-AI/Wan2.1-T2V-1.3B-Diffusers",
"seconds": 2,
"size": "832x480",
......
......@@ -99,8 +99,8 @@ curl http://localhost:8000/v1/chat/completions \
{
"role": "user",
"content": [
{"type": "text", "text": "Describe the image."},
{"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/test2017/000000155781.jpg"}}
{"type": "text", "text": "Explain why Roger Federer is considered one of the greatest tennis players of all time"},
{"type": "image_url", "image_url": {"url": "https://media.newyorker.com/photos/63249cff39ac97c4c23ff5d0/master/w_2560%2Cc_limit/Marzorati%2520-%2520Federer%2520Retirement%25202.jpg"}}
]
}
],
......@@ -115,7 +115,7 @@ curl http://localhost:8000/v1/chat/completions \
For advanced multimodal deployments with separate encoder, prefill, and decode workers (E/PD and E/P/D patterns), see the dedicated [SGLang Multimodal](../../features/multimodal/multimodal-sglang.md) documentation.
| Pattern | Script | Description |
|---------|--------|-------------|
| ------- | ------------------------------- | --------------------------------------------- |
| E/PD | `./launch/multimodal_epd.sh` | Separate vision encoder + combined PD worker |
| E/P/D | `./launch/multimodal_disagg.sh` | Separate encoder, prefill, and decode workers |
......@@ -157,6 +157,7 @@ For full details on all diffusion worker types (LLM, image, video), see [Diffusi
### Kubernetes Deployment
For complete K8s deployment examples, see:
- [SGLang K8s deployment guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy)
- [SGLang aggregated router K8s example](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/sglang/deploy/agg_router.yaml)
- [Kubernetes Deployment Guide](../../kubernetes/README.md)
......
......@@ -50,7 +50,7 @@ curl -H 'Content-Type: application/json' \
-d '{
"model": "<model_name>",
"max_completion_tokens": 100,
"messages": [{"role": "user", "content": "Hello"}]
"messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}]
}' \
http://localhost:8000/v1/chat/completions
......@@ -259,7 +259,7 @@ Send a request with `x-request-id` for easy lookup:
curl -H 'Content-Type: application/json' \
-H 'x-request-id: my-trace-001' \
-d '{"model": "Qwen/Qwen3-0.6B", "max_completion_tokens": 50,
"messages": [{"role": "user", "content": "Hello"}]}' \
"messages": [{"role": "user", "content": "Explain why Roger Federer is considered one of the greatest tennis players of all time"}]}' \
http://localhost:8000/v1/chat/completions
```
......
......@@ -4,63 +4,30 @@
title: vLLM
---
This directory contains reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
# LLM Deployment using vLLM
## Use the Latest Release
Dynamo vLLM integrates [vLLM](https://github.com/vllm-project/vllm) engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation while maintaining full compatibility with vLLM's native engine arguments. Dynamo leverages vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
We recommend using the [latest stable release](https://github.com/ai-dynamo/dynamo/releases/latest) of Dynamo to avoid breaking changes.
## Installation
---
## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#run-single-node-examples)
- [Advanced Examples](#advanced-examples)
- [Deploy on Kubernetes](#kubernetes-deployment)
- [Configuration](#configuration)
## Feature Support Matrix
### Core Dynamo Features
| Feature | vLLM | Notes |
|---------|------|-------|
| [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md) | 🚧 | WIP |
| [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../components/planner/planner-guide.md) | ✅ | |
| [**Load Based Planner**](../../components/planner/README.md) | 🚧 | WIP |
| [**KVBM**](../../components/kvbm/README.md) | ✅ | |
| [**LMCache**](../../integrations/lmcache-integration.md) | ✅ | |
| [**Prompt Embeddings**](./prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag |
### Large Scale P/D and WideEP Features
| Feature | vLLM | Notes |
|--------------------|------|-----------------------------------------------------------------------|
| **WideEP** | ✅ | Support for PPLX / DeepEP not verified |
| **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Container functional on main |
### Install Latest Release
## vLLM Quick Start
We recommend using [uv](https://github.com/astral-sh/uv) to install:
Below we provide a guide that lets you run all of the common deployment patterns on a single node.
```bash
uv venv --python 3.12 --seed
uv pip install "ai-dynamo[vllm]"
```
### Start Infrastructure Services (Local Development Only)
This installs Dynamo with the compatible vLLM version.
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml):
### Development Setup
```bash
docker compose -f deploy/docker-compose.yml up -d
```
For development, use the [devcontainer](https://github.com/ai-dynamo/dynamo/tree/main/.devcontainer) which has all dependencies pre-installed.
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--discovery-backend file` to use file system based discovery.
> - **NATS** is optional - only needed if using KV routing with events. For vLLM, KV events are currently enabled by default when prefix caching is active (**deprecated** — use `--kv-events-config` explicitly). Use `--no-router-kv-events` on the frontend for prediction-based routing without events
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
---
### Pull or build container
<Accordion title="Build and run container">
We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
......@@ -69,123 +36,48 @@ python container/render.py --framework vllm --output-short-filename
docker build -f container/rendered.Dockerfile -t dynamo:latest-vllm .
```
### Run container
```bash
./container/run.sh -it --framework VLLM [--mount-workspace]
```
This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks.
## Run Single Node Examples
> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
### Aggregated Serving
```bash
# requires one gpu
cd examples/backends/vllm
bash launch/agg.sh
```
### Aggregated Serving with KV Routing
```bash
# requires two gpus
cd examples/backends/vllm
bash launch/agg_router.sh
```
### Disaggregated Serving
```bash
# requires two gpus
cd examples/backends/vllm
bash launch/disagg.sh
```
</Accordion>
### Disaggregated Serving with KV Routing
## Feature Support Matrix
```bash
# requires three gpus
cd examples/backends/vllm
bash launch/disagg_router.sh
```
| Feature | Status | Notes |
|---------|--------|-------|
| [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ | Prefill/decode separation with NIXL KV transfer |
| [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../components/planner/planner-guide.md) | ✅ | |
| [**KVBM**](../../components/kvbm/README.md) | ✅ | |
| [**LMCache**](../../integrations/lmcache-integration.md) | ✅ | |
| [**Multimodal Support**](vllm-omni.md) | ✅ | Via vLLM-Omni integration |
| [**Observability**](vllm-observability.md) | ✅ | Metrics and monitoring |
| **WideEP** | ✅ | Support for DeepEP |
| **DP Rank Routing** | ✅ | [Hybrid load balancing](https://docs.vllm.ai/en/stable/serving/data_parallel_deployment/?h=external+dp#hybrid-load-balancing) via external DP rank control |
| [**LoRA**](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/launch/lora/README.md) | ✅ | Dynamic loading/unloading from S3-compatible storage |
| **GB200 Support** | ✅ | Container functional on main |
### Single Node Data Parallel Attention / Expert Parallelism
## Quick Start
This example is not meant to be performant but showcases Dynamo routing to data parallel workers
Start infrastructure services for local development:
```bash
# requires four gpus
cd examples/backends/vllm
bash launch/dep.sh
docker compose -f deploy/docker-compose.yml up -d
```
> [!TIP]
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
## Advanced Examples
Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!
### Speculative Decoding with Aggregated Serving (Meta-Llama-3.1-8B-Instruct + Eagle3)
Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model using **aggregated speculative decoding** on a single node.
This setup demonstrates how to use Dynamo to create an instance using Eagle-based speculative decoding under the **VLLM aggregated serving framework** for faster inference while maintaining accuracy.
**Guide:** [Speculative Decoding Quickstart](../../features/speculative-decoding/speculative-decoding-vllm.md)
> **See also:** [Speculative Decoding Feature Overview](../../features/speculative-decoding/README.md) for cross-backend documentation.
### Kubernetes Deployment
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)
## Configuration
vLLM workers are configured through command-line arguments. Key parameters include:
- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--disaggregation-mode <mode>`: Worker role for disaggregated serving. Accepted values: `prefill`, `decode`, `agg` (default)
- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
- `--kv-transfer-config`: JSON string specifying the vLLM KVTransferConfig (e.g., `--kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}'`). See vLLM documentation for details.
- `--enable-prompt-embeds`: **Enable prompt embeddings feature** (opt-in, default: disabled)
- **Required for:** Accepting pre-computed prompt embeddings via API
- **Default behavior:** Prompt embeddings DISABLED - requests with `prompt_embeds` will fail
- **Error without flag:** `ValueError: You must set --enable-prompt-embeds to input prompt_embeds`
See `args.py` for the full list of configuration options and their defaults.
The [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the vLLM CLI args points to running 'vllm serve --help' to see what CLI args can be added. We use the same argument parser as vLLM.
### Hashing Consistency for KV Events
When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:
- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's builtin hashing for prefix caching.
- If your vLLM version supports it, configure a deterministic prefix caching algorithm, for example:
Launch an aggregated serving deployment:
```bash
vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg.sh
```
See the high-level notes in [Router Design](../../design-docs/router-design.md#deterministic-event-ids) on deterministic event IDs.
## Request Migration
Dynamo supports [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for configuration details.
## Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
### Cancellation Support Matrix
| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |
## Next Steps
For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request-cancellation.md) documentation.
- **[Reference Guide](vllm-reference-guide.md)**: Configuration, arguments, and operational details
- **[Examples](vllm-examples.md)**: All deployment patterns with launch scripts
- **[Observability](vllm-observability.md)**: Metrics and monitoring
- **[vLLM-Omni](vllm-omni.md)**: Multimodal model serving
- **[Kubernetes Deployment](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md)**: Kubernetes deployment guide
- **[vLLM Documentation](https://docs.vllm.ai/en/stable/)**: Upstream vLLM serve arguments
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: DeepSeek-R1
---
Dynamo supports running DeepSeek-R1 with data parallel attention and wide expert parallelism. Each data parallel attention rank is a separate Dynamo component that emits its own KV events and metrics. vLLM controls expert parallelism with the `--enable-expert-parallel` flag.
## Instructions
The following script can be adapted to run DeepSeek-R1 in a variety of configurations. The current configuration uses 2 nodes, 16 GPUs, and a data-parallel size (dp) of 16. Follow the Getting Started section of the [README](README.md) on each node, and then run the matching command on each node.
node 0
```bash
./launch/dsr1_dep.sh --num-nodes 2 --node-rank 0 --gpus-per-node 8 --master-addr <node 0 addr>
```
node 1
```bash
./launch/dsr1_dep.sh --num-nodes 2 --node-rank 1 --gpus-per-node 8 --master-addr <node 0 addr>
```
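As a rough illustration of the arithmetic behind these flags, the sketch below computes the data-parallel size and the ranks each node would host. It assumes one DP rank per GPU and contiguous numbering across nodes; that layout and the helper name `dp_ranks_for_node` are assumptions made for illustration, not something read from `dsr1_dep.sh`.

```python
def dp_ranks_for_node(num_nodes: int, gpus_per_node: int, node_rank: int) -> range:
    """Return the DP ranks hosted on one node.

    Assumes one rank per GPU and contiguous numbering across nodes
    (an illustrative assumption, not taken from the launch script).
    """
    if not 0 <= node_rank < num_nodes:
        raise ValueError("node_rank must be in [0, num_nodes)")
    start = node_rank * gpus_per_node
    return range(start, start + gpus_per_node)


# Values from the two commands above: 2 nodes with 8 GPUs each.
num_nodes, gpus_per_node = 2, 8
print("dp size:", num_nodes * gpus_per_node)
for node_rank in range(num_nodes):
    ranks = dp_ranks_for_node(num_nodes, gpus_per_node, node_rank)
    print(f"node {node_rank}: DP ranks {ranks.start}-{ranks.stop - 1}")
```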
### Testing the Deployment
On node 0 (where the frontend was started) send a test request to verify your deployment:
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": false,
"max_tokens": 30
}'
```
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: GPT-OSS
---
Dynamo supports disaggregated serving of gpt-oss-120b with vLLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single H100 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
## Overview
This deployment uses disaggregated serving in vLLM where:
- **Prefill Worker**: Processes input prompts efficiently using 4 GPUs with tensor parallelism
- **Decode Worker**: Generates output tokens using 4 GPUs, optimized for token generation throughput
- **Frontend**: Provides an OpenAI-compatible API endpoint with round-robin routing
## Prerequisites
This guide assumes readers already know how to deploy Dynamo disaggregated serving with vLLM, as illustrated in [the vLLM Backend guide](README.md).
## Instructions
### 1. Launch the Deployment
GPT-OSS is a reasoning model with tool calling support. To
ensure responses are parsed correctly, launch each worker
with the appropriate `--dyn-reasoning-parser` and `--dyn-tool-call-parser` values.
**Start frontend**
```bash
python3 -m dynamo.frontend --http-port 8000 &
```
**Run decode worker**
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m dynamo.vllm \
--model openai/gpt-oss-120b \
--tensor-parallel-size 4 \
--dyn-reasoning-parser gpt_oss \
--dyn-tool-call-parser harmony
```
**Run prefill worker**
```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m dynamo.vllm \
--model openai/gpt-oss-120b \
--tensor-parallel-size 4 \
--disaggregation-mode prefill \
--dyn-reasoning-parser gpt_oss \
--dyn-tool-call-parser harmony
```
### 2. Verify the Deployment is Ready
Poll the `/health` endpoint to verify that both the prefill and decode worker endpoints have started:
```
curl http://localhost:8000/health
```
Make sure that both of the `generate` endpoints are available before sending an inference request:
```
{
"status": "healthy",
"endpoints": [
"dyn://dynamo.backend.generate"
],
"instances": [
{
"component": "backend",
"endpoint": "generate",
"namespace": "dynamo",
"instance_id": 7587889712474989333,
"transport": {
"nats_tcp": "dynamo_backend.generate-694d997dbae9a315"
}
},
{
"component": "prefill",
"endpoint": "generate",
"namespace": "dynamo",
"instance_id": 7587889712474989350,
"transport": {
"nats_tcp": "dynamo_prefill.generate-694d997dbae9a326"
}
},
...
]
}
```
If only one worker endpoint is listed, the other may still be starting up. Monitor the worker logs to track startup progress.
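The readiness check can also be scripted. The sketch below inspects a parsed `/health` response and reports which worker components have no `generate` instance yet. The field names (`instances`, `component`, `endpoint`) and the component names `backend` and `prefill` come from the sample response above; the helper name `missing_components` is hypothetical.

```python
def missing_components(health: dict, required=("backend", "prefill")) -> list:
    """Return the required components that have no `generate` instance yet."""
    ready = {
        instance.get("component")
        for instance in health.get("instances", [])
        if instance.get("endpoint") == "generate"
    }
    return [name for name in required if name not in ready]


# Example: only the decode worker ("backend") has registered so far.
sample = {
    "status": "healthy",
    "instances": [
        {"component": "backend", "endpoint": "generate", "namespace": "dynamo"},
    ],
}
print(missing_components(sample))  # ['prefill']
```

An empty list means both workers are registered and inference requests can be sent.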
### 3. Test the Deployment
Send a test request to verify the deployment:
```bash
curl -X POST http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-oss-120b",
"input": "Explain the concept of disaggregated serving in LLM inference in 3 sentences.",
"max_output_tokens": 200,
"stream": false
}'
```
The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters such as `max_output_tokens` and `temperature` according to your needs.
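For applications that call the endpoint from code, the same request can be assembled with the Python standard library. This is a minimal sketch that mirrors the curl payload above; the helper names `build_responses_request` and `send_request` are hypothetical, and the base URL assumes the frontend from step 1 is listening on port 8000.

```python
import json
import urllib.request


def build_responses_request(prompt: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a POST request for /v1/responses with the same fields as the curl example."""
    payload = {
        "model": "openai/gpt-oss-120b",
        "input": prompt,
        "max_output_tokens": 200,
        "stream": False,
    }
    return urllib.request.Request(
        f"{base_url}/v1/responses",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_request(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and return the parsed JSON body (needs a running deployment)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


request = build_responses_request(
    "Explain the concept of disaggregated serving in LLM inference in 3 sentences."
)
print(request.get_method(), request.full_url)
```

Calling `send_request(request)` against a live deployment returns the parsed response body.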
### 4. Reasoning and Tool Calling
Dynamo supports reasoning and tool calling in the OpenAI Chat Completions endpoint. In a typical application built on top of Dynamo,
the application defines a set of tools that help the assistant give accurate answers. The exchange is usually
multi-turn, because it involves tool selection followed by generation based on the tool result. Below is an example
of sending multi-round requests to complete a user query with reasoning and tool calling:
**Application setup (pseudocode)**
```Python
# The tool defined by the application
def get_system_health():
    for component in system.components:
        if not component.health():
            return False
    return True

# The declaration in ChatCompletion tool style (serialized to JSON in the request)
tool_choice = {
    "type": "function",
    "function": {
        "name": "get_system_health",
        "description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
        "parameters": {
            "type": "object",
            "properties": {}
        }
    }
}

# On user query, perform below workflow.
def user_query(app_request):
    # first round
    # create chat completion with prompt and tool choice
    request = ...
    response = send(request)

    # finish_reason lives on the first choice of the response
    if response["choices"][0]["finish_reason"] == "tool_calls":
        # second round
        function, params = parse_tool_call(response)
        function_result = function(params)
        # create request with prompt, assistant response, and function result
        request = ...
        response = send(request)
    return app_response(response)
```
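The workflow above can be made concrete with a small runnable sketch. The transport is injected as a `send` callable so the loop can be exercised without a server. `fake_send`, `run_tool_round`, and the `TOOLS` registry are hypothetical names introduced for illustration, and the stand-in tool simply returns a fixed status. The message and response shapes follow the Chat Completions examples in this guide.

```python
import json


def get_system_health() -> dict:
    """Stand-in tool; a real application would query its own components."""
    return {"status": "ok"}


# Hypothetical registry mapping tool names to Python callables.
TOOLS = {"get_system_health": get_system_health}


def run_tool_round(messages: list, response: dict) -> list:
    """Return the conversation extended with the tool calls and their results."""
    message = response["choices"][0]["message"]
    follow_up = list(messages)
    follow_up.append({"role": "assistant", "tool_calls": message["tool_calls"]})
    for call in message["tool_calls"]:
        func = TOOLS[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"] or "{}")
        follow_up.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(func(**args)),
        })
    return follow_up


def user_query(prompt: str, send) -> str:
    """Run at most two rounds: the initial request, then one tool-result round."""
    messages = [{"role": "user", "content": prompt}]
    response = send(messages)
    if response["choices"][0]["finish_reason"] == "tool_calls":
        messages = run_tool_round(messages, response)
        response = send(messages)
    return response["choices"][0]["message"]["content"]


def fake_send(messages: list) -> dict:
    """Fake transport: asks for the tool first, then answers once a tool result exists."""
    if messages[-1]["role"] == "tool":
        reply = {"role": "assistant", "content": "All systems are up."}
        return {"choices": [{"message": reply, "finish_reason": "stop"}]}
    call = {"id": "call-1", "type": "function",
            "function": {"name": "get_system_health", "arguments": "{}"}}
    reply = {"role": "assistant", "tool_calls": [call]}
    return {"choices": [{"message": reply, "finish_reason": "tool_calls"}]}


print(user_query("Hey, quick check: is everything up and running?", fake_send))
```

In a real application, `send` would POST the messages together with the `tools` declaration to `/v1/chat/completions`, as in the curl requests in this section.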
**First request with tools**
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
{
"model": "openai/gpt-oss-120b",
"messages": [
{
"role": "user",
"content": "Hey, quick check: is everything up and running?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_system_health",
"description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
"parameters": {
"type": "object",
"properties": {}
}
}
}
],
"response_format": {
"type": "text"
},
"stream": false,
"max_tokens": 300
}'
```
**First response with tool choice**
```JSON
{
"id": "chatcmpl-d1c12219-6298-4c83-a6e3-4e7cef16e1a9",
"choices": [
{
"index": 0,
"message": {
"tool_calls": [
{
"id": "call-1",
"type": "function",
"function": {
"name": "get_system_health",
"arguments": "{}"
}
}
],
"role": "assistant",
"reasoning_content": "We need to check system health. Use function."
},
"finish_reason": "tool_calls"
}
],
"created": 1758758741,
"model": "openai/gpt-oss-120b",
"object": "chat.completion",
"usage": null
}
```
**Second request with tool calling result**
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
{
"model": "openai/gpt-oss-120b",
"messages": [
{
"role": "user",
"content": "Hey, quick check: is everything up and running?"
},
{
"role": "assistant",
"tool_calls": [
{
"id": "call-1",
"type": "function",
"function": {
"name": "get_system_health",
"arguments": "{}"
}
}
]
},
{
"role": "tool",
"tool_call_id": "call-1",
"content": "{\"status\":\"ok\",\"uptime_seconds\":372045}"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_system_health",
"description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
"parameters": {
"type": "object",
"properties": {}
}
}
}
],
"response_format": {
"type": "text"
},
"stream": false,
"max_tokens": 300
}'
```
**Second response with final message**
```JSON
{
"id": "chatcmpl-9ebfe64a-68b9-4c1d-9742-644cf770ad0e",
"choices": [
{
"index": 0,
"message": {
"content": "All systems are green—everything’s up and running smoothly! 🚀 Let me know if you need anything else.",
"role": "assistant",
"reasoning_content": "The user asks: \"Hey, quick check: is everything up and running?\" We have just checked system health, it's ok. Provide friendly response confirming everything's up."
},
"finish_reason": "stop"
}
],
"created": 1758758853,
"model": "openai/gpt-oss-120b",
"object": "chat.completion",
"usage": null
}
```
\ No newline at end of file
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Multi-Node
---
This guide covers deploying vLLM across multiple nodes using Dynamo's distributed capabilities.
## Prerequisites
Multi-node deployments require:
- Multiple nodes with GPU resources
- Network connectivity between nodes (the faster, the better)
- Firewall rules allowing NATS/ETCD communication
## Infrastructure Setup
### Step 1: Start NATS/ETCD on Head Node
Start the required services on your head node. These endpoints must be accessible from all worker nodes:
```bash
# On head node (node-1)
docker compose -f deploy/docker-compose.yml up -d
```
Default ports:
- NATS: 4222
- ETCD: 2379
### Step 2: Configure Environment Variables
Set the head node IP address and service endpoints. **Set this on all nodes** for easy copy-paste:
```bash
# Set this on ALL nodes - replace with your actual head node IP
export HEAD_NODE_IP="<your-head-node-ip>"
# Service endpoints (set on all nodes)
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
```
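If workers are started from a Python launcher rather than a shell, the same two variables can be derived from the head node address. This is a minimal sketch: the helper name `service_endpoints` and the address `10.0.0.5` are placeholders, and the ports are the defaults listed in Step 1.

```python
import os


def service_endpoints(head_node_ip: str, nats_port: int = 4222, etcd_port: int = 2379) -> dict:
    """Build the NATS and etcd endpoint variables for a given head node address."""
    return {
        "NATS_SERVER": f"nats://{head_node_ip}:{nats_port}",
        "ETCD_ENDPOINTS": f"{head_node_ip}:{etcd_port}",
    }


# Merge into a copy of the current environment, e.g. to pass as `env=` to subprocess.Popen.
env = {**os.environ, **service_endpoints("10.0.0.5")}  # placeholder address
print(env["NATS_SERVER"], env["ETCD_ENDPOINTS"])
```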
## Deployment Patterns
### Multi-node Aggregated Serving
Deploy vLLM workers across multiple nodes for horizontal scaling:
**Node 1 (Head Node)**: Run ingress and first worker
```bash
# Start ingress
python -m dynamo.frontend --router-mode kv &
# Start vLLM worker
python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 8 \
--enforce-eager
```
**Node 2**: Run additional worker
```bash
# Start vLLM worker
python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 8 \
--enforce-eager
```
### Multi-node Disaggregated Serving
Deploy prefill and decode workers on separate nodes for optimized resource utilization:
**Node 1**: Run ingress and decode worker
```bash
# Start ingress
python -m dynamo.frontend --router-mode kv &
# Start decode worker
python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 8 \
--enforce-eager \
--disaggregation-mode decode
```
**Node 2**: Run prefill worker
```bash
# Start prefill worker
python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 8 \
--enforce-eager \
--disaggregation-mode prefill
```
### Multi-node Tensor/Pipeline Parallelism
When the total parallelism (TP × PP) exceeds the number of GPUs on a single node,
you need multiple nodes to host a **single** model instance. One node runs the full
`dynamo.vllm` process (head) while additional nodes run in `--headless` mode,
spawning only vLLM workers.
See [`examples/backends/vllm/launch/multi_node_tp.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/multi_node_tp.sh) for a ready-to-use launch script that supports both head and worker roles via `--head` / `--worker` flags. The model, TP size, and node count are configurable via `MODEL`, `TENSOR_PARALLEL_SIZE`, and `NNODES` environment variables.
For details on the flags used for multi-node distributed execution (`--master-addr`, `--master-port`, `--nnodes`, `--node-rank`), see the [vLLM multiprocessing docs](https://docs.vllm.ai/en/stable/serving/parallelism_scaling/#running-vllm-with-multiprocessing).
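To decide whether a model instance needs this mode, compare TP × PP with the GPUs available per node. The sketch below does that arithmetic; the helper name `nodes_required` is hypothetical, and it assumes every node contributes the same number of GPUs.

```python
import math


def nodes_required(tensor_parallel: int, pipeline_parallel: int, gpus_per_node: int) -> int:
    """Number of nodes needed to host one model instance of size TP x PP."""
    total_gpus = tensor_parallel * pipeline_parallel
    return math.ceil(total_gpus / gpus_per_node)


# TP=8 fits on one 8-GPU node; TP=16, or TP=8 with PP=2, needs a second (headless) node.
for tp, pp in [(8, 1), (16, 1), (8, 2)]:
    print(f"TP={tp} PP={pp}: {nodes_required(tp, pp, gpus_per_node=8)} node(s)")
```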
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Prompt Embeddings
---
Dynamo supports prompt embeddings (also known as prompt embeds) as a secure alternative input method to traditional text prompts. By allowing applications to use pre-computed embeddings for inference, this feature not only offers greater flexibility in prompt engineering but also significantly enhances privacy and data security. With prompt embeddings, sensitive user data can be transformed into embeddings before ever reaching the inference server, reducing the risk of exposing confidential information during the AI workflow.
## How It Works
| Path | What Happens |
|------|--------------|
| **Text prompt** | Tokenize → Embedding Layer → Transformer |
| **Prompt embeds** | Validate → Bypass Embedding → Transformer |
## Architecture
```mermaid
flowchart LR
subgraph FE["Frontend (Rust)"]
A[Request] --> B{prompt_embeds?}
B -->|No| C[🔴 Tokenize text]
B -->|Yes| D[🟢 Validate base64+size]
C --> E[token_ids, ISL=N]
D --> F[token_ids=empty, skip ISL]
end
subgraph RT["Router (NATS)"]
G[Route PreprocessedRequest]
end
subgraph WK["Worker (Python)"]
H[TokensPrompt#40;token_ids#41;]
I[Decode → EmbedsPrompt#40;tensor#41;]
end
subgraph VLLM["vLLM Engine"]
J[🔴 Embedding Layer]
K[🟢 Bypass Embedding]
L[Transformer Layers]
M[LM Head → Response]
end
E --> G
F --> G
G -->|Normal| H
G -->|Embeds| I
H --> J --> L
I --> K --> L
L --> M
```
| Layer | **Normal Flow** | **Prompt Embeds** |
|---|---|---|
| **Frontend (Rust)** | 🔴 Tokenize text → token_ids, compute ISL | 🟢 Validate base64+size, skip tokenization |
| **Router (NATS)** | Forward token_ids in PreprocessedRequest | Forward prompt_embeds string |
| **Worker (Python)** | `TokensPrompt(token_ids)` | Decode base64 → `EmbedsPrompt(tensor)` |
| **vLLM Engine** | 🔴 Embedding Layer → Transformer | 🟢 Bypass Embedding → Transformer |
## Quick Start
Send pre-computed prompt embeddings directly to vLLM, bypassing tokenization.
### 1. Enable Feature
```bash
python -m dynamo.vllm --model <model-name> --enable-prompt-embeds
```
> **Required:** The `--enable-prompt-embeds` flag must be set or requests will fail.
### 2. Send Request
```python
import torch
import base64
import io
from openai import OpenAI
# Prepare embeddings (sequence_length, hidden_dim)
embeddings = torch.randn(10, 4096, dtype=torch.float32)
# Encode
buffer = io.BytesIO()
torch.save(embeddings, buffer)
buffer.seek(0)
embeddings_base64 = base64.b64encode(buffer.read()).decode()
# Send
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
prompt="", # Can be empty or present; prompt_embeds takes precedence
max_tokens=100,
extra_body={"prompt_embeds": embeddings_base64}
)
```
## Configuration
### Docker Compose
```yaml
vllm-worker:
  command:
    - python
    - -m
    - dynamo.vllm
    - --model
    - meta-llama/Meta-Llama-3.1-8B-Instruct
    - --enable-prompt-embeds  # Add this
```
### Kubernetes
```yaml
extraPodSpec:
  mainContainer:
    args:
      - "--model"
      - "meta-llama/Meta-Llama-3.1-8B-Instruct"
      - "--enable-prompt-embeds"  # Add this
```
### NATS Configuration
NATS needs a 15MB maximum payload limit, which default deployments already configure:
```yaml
# Docker Compose - deploy/docker-compose.yml
nats-server:
  command: ["-js", "--trace", "-m", "8222", "--max_payload", "15728640"]
# Kubernetes - deploy/cloud/helm/platform/values.yaml
nats:
  config:
    merge:
      max_payload: 15728640
```
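As a rough check on why 15MB is enough, the arithmetic below compares the configured `max_payload` with the base64-encoded size of the largest allowed tensor. It is a sketch under one assumption: that the "10MB" decoded limit means 10 MiB. The rest of the `PreprocessedRequest` also consumes some of the remaining headroom.

```python
import base64

NATS_MAX_PAYLOAD = 15728640           # value from the configuration above
MAX_DECODED_BYTES = 10 * 1024 * 1024  # assumption: "10MB" means 10 MiB


def base64_encoded_len(n_bytes: int) -> int:
    """Length of standard, padded base64 output for n_bytes of input."""
    return 4 * ((n_bytes + 2) // 3)


# Sanity-check the formula against the standard library on a small input.
assert base64_encoded_len(10) == len(base64.b64encode(b"x" * 10))

encoded = base64_encoded_len(MAX_DECODED_BYTES)
headroom = NATS_MAX_PAYLOAD - encoded

print("max_payload is 15 MiB:", NATS_MAX_PAYLOAD == 15 * 1024 * 1024)
print("encoded size of a 10 MiB tensor:", encoded)
print("bytes left for the rest of the request:", headroom)
```

Base64 expands data by a factor of 4/3, so a payload limit equal to the decoded limit would reject the largest valid requests.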
## API Reference
### Request
```json
{
  "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
  "prompt": "",
  "prompt_embeds": "<base64-encoded-pytorch-tensor>",
  "max_tokens": 100
}
```
**Requirements:**
- **Format:** PyTorch tensor serialized with `torch.save()` and base64-encoded
- **Size:** 100 bytes - 10MB (decoded)
- **Shape:** `(seq_len, hidden_dim)` or `(batch, seq_len, hidden_dim)`
- **Dtype:** `torch.float32` (recommended)
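
The format and size requirements above can be checked on the client before a request is sent. The helper below is a minimal sketch, not Dynamo's validation code: `check_prompt_embeds` is a hypothetical name, the 10MB limit is assumed to mean 10 MiB, and the error strings only mirror the messages listed in the Errors table.

```python
import base64
import binascii

MIN_DECODED_BYTES = 100
MAX_DECODED_BYTES = 10 * 1024 * 1024  # assumption: "10MB" means 10 MiB


def check_prompt_embeds(payload: str) -> int:
    """Return the decoded size in bytes, or raise ValueError if out of bounds."""
    try:
        raw = base64.b64decode(payload, validate=True)
    except (binascii.Error, ValueError) as exc:
        raise ValueError("prompt_embeds must be valid base64") from exc
    if len(raw) < MIN_DECODED_BYTES:
        raise ValueError("decoded data must be at least 100 bytes")
    if len(raw) > MAX_DECODED_BYTES:
        raise ValueError("exceeds maximum size of 10MB")
    return len(raw)


# Example with placeholder bytes standing in for a torch.save() buffer.
sample = base64.b64encode(b"\x00" * 200).decode()
print(check_prompt_embeds(sample))
```

This only checks encoding and size; whether the bytes deserialize to a tensor of the right shape is still decided by the worker.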
**Field Precedence:**
- Both `prompt` and `prompt_embeds` can be provided in the same request
- When both are present, **`prompt_embeds` takes precedence** and `prompt` is ignored
- The `prompt` field can be empty (`""`) when using `prompt_embeds`
### Response
Standard OpenAI format. The `prompt_tokens` value is extracted from the embedding shape rather than from tokenization:
```json
{
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 15,
    "total_tokens": 25
  }
}
```
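To illustrate how a prompt token count can be read off the embedding shape, here is a small sketch. It assumes the count is the `seq_len` dimension for both accepted shapes; `prompt_tokens_from_shape` is a hypothetical helper, and how Dynamo treats a batch dimension larger than 1 is not stated here.

```python
from typing import Sequence


def prompt_tokens_from_shape(shape: Sequence[int]) -> int:
    """Return seq_len for (seq_len, hidden_dim) or (batch, seq_len, hidden_dim)."""
    if len(shape) == 2:
        return shape[0]
    if len(shape) == 3:
        return shape[1]
    raise ValueError(f"expected a 2D or 3D shape, got {len(shape)} dimensions")


prompt_tokens = prompt_tokens_from_shape((10, 4096))
completion_tokens = 15
print(prompt_tokens, prompt_tokens + completion_tokens)
```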
## Errors
| Error | Fix |
|-------|-----|
| `ValueError: You must set --enable-prompt-embeds` | Add `--enable-prompt-embeds` to worker |
| `prompt_embeds must be valid base64` | Use `.decode('utf-8')` after `base64.b64encode()` |
| `decoded data must be at least 100 bytes` | Increase sequence length |
| `exceeds maximum size of 10MB` | Reduce sequence length |
| `must be a torch.Tensor` | Use `torch.save()` not NumPy |
| `size of tensor must match` | Use correct hidden dimension for model |
## Examples
### Streaming
```python
stream = client.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    prompt="",
    max_tokens=100,
    stream=True,
    extra_body={"prompt_embeds": embeddings_base64},
)
for chunk in stream:
    if chunk.choices:
        print(chunk.choices[0].text, end="", flush=True)
```
### Load from File
```python
embeddings = torch.load("embeddings.pt")
buffer = io.BytesIO()
torch.save(embeddings, buffer)
buffer.seek(0)
embeddings_base64 = base64.b64encode(buffer.read()).decode()
# Use in request...
```
## Limitations
- ❌ Requires `--enable-prompt-embeds` flag (disabled by default)
- ❌ PyTorch format only (NumPy not supported)
- ❌ 10MB decoded size limit
- ❌ Cannot mix with multimodal data (images/video)
## Testing
The feature is covered by unit and integration tests:
- **Unit Tests:** 31 tests (11 Rust + 20 Python)
  - Validation, decoding, format handling, error cases, usage statistics
- **Integration Tests:** 21 end-to-end tests
  - Core functionality, performance, formats, concurrency, usage statistics
Run integration tests:
```bash
# Start worker with flag
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enable-prompt-embeds
# Run tests
pytest tests/integration/test_prompt_embeds_integration.py -v
```
## See Also
- [vLLM Backend](README.md)
- [vLLM Configuration](README.md#configuration)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Examples
---
# vLLM Examples
For quick start instructions, see the [vLLM README](README.md). This document provides all deployment patterns for running vLLM with Dynamo, including aggregated, disaggregated, KV-routed, and expert-parallel configurations.
## Table of Contents
- [Infrastructure Setup](#infrastructure-setup)
- [LLM Serving](#llm-serving)
- [Advanced Examples](#advanced-examples)
- [Kubernetes Deployment](#kubernetes-deployment)
- [Troubleshooting](#troubleshooting)
## Infrastructure Setup
For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
<Note>
- **etcd** is optional but is the default local discovery backend. File-based discovery is also available (see `python -m dynamo.vllm --help` for `--discovery-backend` options).
- **NATS** is only needed when using KV routing with events. Prediction-based routing does not require NATS.
- **On Kubernetes**, neither is required when using the Dynamo operator.
</Note>
<Tip>
Each launch script runs the frontend and worker(s) in a single terminal. You can run each command separately in different terminals for better log visibility. AI agents working with Dynamo can run the launch script in the background and use the `curl` commands to test the deployment.
</Tip>
## LLM Serving
### Aggregated Serving
The simplest deployment pattern: a single worker handles both prefill and decode. Requires 1 GPU.
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg.sh
```
### Aggregated Serving with KV Routing
Two workers behind a [KV-aware router](../../components/router/README.md) that maximizes cache reuse. Requires 2 GPUs.
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg_router.sh
```
This launches the frontend in KV routing mode with two workers publishing KV events over ZMQ.
### Disaggregated Serving
Separates prefill and decode into independent workers connected via NIXL for KV cache transfer. Requires 2 GPUs.
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg.sh
```
### Disaggregated Serving with KV Routing
Scales to 2 prefill + 2 decode workers with KV-aware routing on both pools. Requires 4 GPUs.
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/disagg_router.sh
```
The frontend runs in KV routing mode and automatically detects prefill workers to activate an internal prefill router.
### Data Parallel / Expert Parallelism
Launches 4 data-parallel workers with expert parallelism behind a KV-aware router. Uses a Mixture-of-Experts model (`Qwen/Qwen3-30B-A3B`). Requires 4 GPUs.
```bash
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/dep.sh
```
<Tip>
Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
</Tip>
## Advanced Examples
### Speculative Decoding
Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model for faster inference while maintaining accuracy.
**Guide:** [Speculative Decoding Quickstart](../../features/speculative-decoding/speculative-decoding-vllm.md)
> **See also:** [Speculative Decoding Feature Overview](../../features/speculative-decoding/README.md) for cross-backend documentation.
### Multimodal
Serve multimodal models using the vLLM-Omni integration.
**Guide:** [vLLM-Omni](vllm-omni.md)
### Multi-Node
Deploy vLLM across multiple nodes using Dynamo's distributed capabilities. Multi-node deployments require network connectivity between nodes and firewall rules that allow NATS and etcd traffic.
Start NATS and etcd on the head node so all worker nodes can reach them:
```bash
# On head node
docker compose -f deploy/docker-compose.yml up -d
# Set on ALL nodes
export HEAD_NODE_IP="<your-head-node-ip>"
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
```
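Before launching workers, it can help to confirm that each node can actually reach the head node's ports. The sketch below derives the same two endpoint values as the exports above and performs a plain TCP connection test; `head_node_endpoints` and `is_reachable` are hypothetical helpers, and a successful connect shows only that the port is open, not that NATS or etcd is healthy.

```python
import socket
from typing import Dict


def head_node_endpoints(head_node_ip: str) -> Dict[str, str]:
    """Build the environment values used in the exports above."""
    return {
        "NATS_SERVER": f"nats://{head_node_ip}:4222",
        "ETCD_ENDPOINTS": f"{head_node_ip}:2379",
    }


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# "10.0.0.1" is a placeholder; substitute your head node IP.
print(head_node_endpoints("10.0.0.1"))
```

On a worker node, `is_reachable(head_ip, 4222)` and `is_reachable(head_ip, 2379)` give a quick firewall check.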
For multi-node tensor/pipeline parallelism (when TP x PP exceeds GPUs on a single node), see [`launch/multi_node_tp.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/multi_node_tp.sh). For details on distributed execution, see the [vLLM multiprocessing docs](https://docs.vllm.ai/en/stable/serving/parallelism_scaling/#running-vllm-with-multiprocessing).
### DeepSeek-R1
Dynamo supports DeepSeek R1 with data parallel attention and wide expert parallelism. Each DP attention rank is a separate Dynamo component emitting its own KV events and metrics.
Run on 2 nodes (16 GPUs, dp=16):
```bash
# Node 0
cd $DYNAMO_HOME/examples/backends/vllm
./launch/dsr1_dep.sh --num-nodes 2 --node-rank 0 --gpus-per-node 8 --master-addr <node-0-addr>
# Node 1
./launch/dsr1_dep.sh --num-nodes 2 --node-rank 1 --gpus-per-node 8 --master-addr <node-0-addr>
```
See [`launch/dsr1_dep.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/vllm/launch/dsr1_dep.sh) for configurable options.
## Kubernetes Deployment
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see the [vLLM Kubernetes Deployment Guide](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/README.md).
See also the [Kubernetes Deployment Guide](../../kubernetes/README.md) for general Dynamo K8s documentation.
## Troubleshooting
### Workers Fail to Start with NIXL Errors
Ensure NIXL is installed and the side-channel ports are not in conflict. Each worker in a multi-worker setup needs a unique `VLLM_NIXL_SIDE_CHANNEL_PORT`.
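
One way to avoid port conflicts is to derive each worker's port from a base value plus its index. The sketch below only builds the per-worker environment; `nixl_side_channel_env` and the base port of 20000 are illustrative assumptions, not values taken from the launch scripts.

```python
from typing import Dict, List


def nixl_side_channel_env(num_workers: int, base_port: int = 20000) -> List[Dict[str, str]]:
    """Return one environment mapping per worker, each with a distinct port."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    return [
        {"VLLM_NIXL_SIDE_CHANNEL_PORT": str(base_port + index)}
        for index in range(num_workers)
    ]


for index, env in enumerate(nixl_side_channel_env(2)):
    print(f"worker {index}: {env}")
```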
### KV Router Not Routing Correctly
Ensure `PYTHONHASHSEED=0` is set for all vLLM processes when using KV-aware routing. See [Hashing Consistency](vllm-reference-guide.md#hashing-consistency-for-kv-events) for details.
### GPU OOM on Startup
If a previous run left orphaned GPU processes, the next launch may run out of GPU memory. Check for lingering processes and stop them:
```bash
nvidia-smi # look for lingering python processes
kill -9 <PID>
```
## See Also
- **[vLLM README](README.md)**: Quick start and feature overview
- **[Reference Guide](vllm-reference-guide.md)**: Configuration, arguments, and operational details
- **[Observability](vllm-observability.md)**: Metrics and monitoring
- **[Benchmarking](../../benchmarks/benchmarking.md)**: Performance benchmarking tools
- **[Tuning Disaggregated Performance](../../performance/tuning.md)**: P/D tuning guide
......@@ -8,7 +8,7 @@ title: Prometheus
When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both vLLM engine metrics (prefixed with `vllm:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
**For the complete and authoritative list of all vLLM metrics**, always refer to the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).
**For the complete and authoritative list of all vLLM metrics**, always refer to the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/stable/design/metrics.html).
**For LMCache metrics and integration**, see the [LMCache Integration Guide](../../integrations/lmcache-integration.md).
......@@ -18,10 +18,9 @@ When running vLLM through Dynamo, vLLM engine metrics are automatically passed t
## Environment Variables and Flags
| Variable/Flag | Description | Default | Example |
|---------------|-------------|---------|---------|
| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_SYSTEM_PORT` | System metrics/health port. Required to expose `/metrics` endpoint. | `-1` (disabled) | `8081` |
| `--kv-transfer-config` | KV transfer configuration JSON. Use LMCache connector to enable LMCache metrics. | - | `--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'` |
## Getting Started Quickly
......@@ -33,30 +32,16 @@ For visualizing metrics with Prometheus and Grafana, start the observability sta
### Launch Dynamo Components
Launch a frontend and vLLM backend to test metrics:
The launch scripts in `examples/backends/vllm/launch/` already enable metrics on port 8081 by default. For example:
```bash
# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
$ python -m dynamo.frontend
# Enable system metrics server on port 8081
$ DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model_name> \
--enforce-eager --no-enable-prefix-caching --max-num-seqs 3
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg.sh
```
Wait for the vLLM worker to start, then send requests and check metrics:
Once the deployment is running, send a request and check metrics:
```bash
# Send a request
curl -H 'Content-Type: application/json' \
-d '{
"model": "<model_name>",
"max_completion_tokens": 100,
"messages": [{"role": "user", "content": "Hello"}]
}' \
http://localhost:8000/v1/chat/completions
# Check metrics from the worker
curl -s localhost:8081/metrics | grep "^vllm:"
```
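
The same prefix filter can be applied from Python, which is convenient in tests or scripts. This is a minimal sketch using only the standard library; `filter_metrics` and `fetch_metrics` are hypothetical helpers, and the URL assumes the default port 8081 described above.

```python
from typing import List, Tuple
from urllib.request import urlopen


def filter_metrics(text: str, prefixes: Tuple[str, ...] = ("vllm:",)) -> List[str]:
    """Keep only sample lines whose metric name starts with one of the prefixes."""
    return [line for line in text.splitlines() if line.startswith(prefixes)]


def fetch_metrics(url: str = "http://localhost:8081/metrics", timeout: float = 5.0) -> str:
    """Download the Prometheus text exposition from a running worker."""
    with urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8")


# Offline example using a made-up exposition snippet.
sample = "\n".join([
    "# HELP vllm:num_requests_running Number of running requests",
    'vllm:num_requests_running{model_name="m"} 1.0',
    'dynamo_component_requests_total{endpoint="generate"} 3',
    'lmcache:num_hit_tokens{model_name="m"} 42',
])
print(filter_metrics(sample))
print(filter_metrics(sample, prefixes=("vllm:", "lmcache:")))
```

Against a live deployment, `filter_metrics(fetch_metrics())` returns the same lines as the `grep "^vllm:"` command.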
......@@ -80,7 +65,7 @@ vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
```
**Note:** The specific metrics shown above are examples and may vary depending on your vLLM version. Always inspect your actual `/metrics` endpoint or refer to the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) for the current list.
**Note:** The specific metrics shown above are examples and may vary depending on your vLLM version. Always inspect your actual `/metrics` endpoint or refer to the [official documentation](https://docs.vllm.ai/en/stable/design/metrics.html) for the current list.
### Metric Categories
......@@ -92,7 +77,7 @@ vLLM provides metrics in the following categories (all prefixed with `vllm:`):
- **Scheduler metrics** - Scheduling and queue management
- **Disaggregation metrics** - Metrics specific to disaggregated deployments (when enabled)
**Note:** Specific metrics are subject to change between vLLM versions. Always refer to the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) or inspect the `/metrics` endpoint for your vLLM version.
**Note:** Specific metrics are subject to change between vLLM versions. Always refer to the [official documentation](https://docs.vllm.ai/en/stable/design/metrics.html) or inspect the `/metrics` endpoint for your vLLM version.
## Available Metrics
......@@ -103,28 +88,22 @@ The official vLLM documentation includes complete metric definitions with:
- Information about v1 metrics migration
- Future work and deprecated metrics
For the complete and authoritative list of all vLLM metrics, see the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).
For the complete and authoritative list of all vLLM metrics, see the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/stable/design/metrics.html).
## LMCache Metrics
When LMCache is enabled with `--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'` and `DYN_SYSTEM_PORT` is set, LMCache metrics (prefixed with `lmcache:`) are automatically exposed via Dynamo's `/metrics` endpoint alongside vLLM and Dynamo metrics.
### Minimum Requirements
When LMCache is enabled, LMCache metrics (prefixed with `lmcache:`) are automatically exposed via Dynamo's `/metrics` endpoint alongside vLLM and Dynamo metrics.
To access LMCache metrics, both of these are required:
1. `--kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'` - Enables LMCache in vLLM
2. `DYN_SYSTEM_PORT=8081` - Enables Dynamo's metrics HTTP endpoint
To try it out, use the LMCache launch script:
**Example:**
```bash
DYN_SYSTEM_PORT=8081 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --kv-transfer-config '{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
cd $DYNAMO_HOME/examples/backends/vllm
bash launch/agg_lmcache.sh
```
### Viewing LMCache Metrics
Send a request and view LMCache metrics:
```bash
# View all LMCache metrics
curl -s localhost:8081/metrics | grep "^lmcache:"
```
......@@ -146,13 +125,13 @@ Troubleshooting LMCache-related metrics and logs (including `PrometheusLogger in
- Metrics are filtered by the `vllm:` and `lmcache:` prefixes before being exposed (when LMCache is enabled)
- The integration uses Dynamo's `register_engine_metrics_callback()` function with the global `REGISTRY`
- Metrics appear after vLLM engine initialization completes
- vLLM v1 metrics are different from v0 - see the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) for migration details
- vLLM v1 metrics are different from v0 - see the [official documentation](https://docs.vllm.ai/en/stable/design/metrics.html) for migration details
## Related Documentation
### vLLM Metrics
- [Official vLLM Metrics Design Documentation](https://docs.vllm.ai/en/latest/design/metrics.html)
- [vLLM Production Metrics User Guide](https://docs.vllm.ai/en/latest/usage/metrics.html)
- [Official vLLM Metrics Design Documentation](https://docs.vllm.ai/en/stable/design/metrics.html)
- [vLLM Production Metrics User Guide](https://docs.vllm.ai/en/stable/usage/metrics.html)
- [vLLM GitHub - Metrics Implementation](https://github.com/vllm-project/vllm/tree/main/vllm/v1/metrics)
### Dynamo Metrics
......
......@@ -162,14 +162,13 @@ The `/v1/videos` endpoint also accepts NVIDIA extensions via the `nvext` field f
## CLI Reference
| Flag | Description |
|---|---|
| `--omni` | Enable the vLLM-Omni orchestrator (required for all omni workloads) |
| `--output-modalities <modality>` | Output modality: `text`, `image`, or `video` |
| `--stage-configs-path <path>` | Path to stage config YAML (optional; vLLM-Omni uses model defaults if omitted) |
| _(no `--kv-transfer-config`)_ | KV connector is disabled by default; omit the flag for omni workers |
| `--media-output-fs-url <url>` | Filesystem URL for storing generated media (default: `file:///tmp/dynamo_media`) |
| `--media-output-http-url <url>` | Base URL for rewriting media paths in responses (optional) |
For the full list of Omni-related flags (including `--omni`, `--output-modalities`, `--stage-configs-path`, `--media-output-fs-url`, `--media-output-http-url`, and the `--omni-*` diffusion flags), run:
```bash
python -m dynamo.vllm --help
```
See also the [Argument Reference](vllm-reference-guide.md#argument-reference) in the Reference Guide.
## Storage Configuration
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Reference Guide
subtitle: Configuration, arguments, and operational details for the vLLM backend
---
# Reference Guide
## Overview
The vLLM backend in Dynamo integrates [vLLM](https://github.com/vllm-project/vllm) engines into Dynamo's distributed runtime, enabling disaggregated serving, KV-aware routing, and request cancellation. Dynamo leverages vLLM's native KV cache events, NIXL-based transfer mechanisms, and metric reporting.
Dynamo vLLM uses vLLM's native argument parser — all vLLM engine arguments are passed through directly. Dynamo adds its own arguments for disaggregation mode, KV transfer, and prompt embeddings.
## Argument Reference
The vLLM backend accepts all upstream vLLM engine arguments plus Dynamo-specific arguments. The authoritative source is always the CLI:
```bash
python -m dynamo.vllm --help
```
The `--help` output is organized into the following groups:
- **Dynamo Runtime Options** — Namespace, discovery backend, request/event plane, endpoint types, tool/reasoning parsers, and custom chat templates. These are common across all Dynamo backends and use `DYN_*` env vars.
- **Dynamo vLLM Options** — Disaggregation mode, tokenizer selection, sleep mode, multimodal flags, vLLM-Omni pipeline configuration, headless mode, and ModelExpress. These use `DYN_VLLM_*` env vars.
- **vLLM Engine Options** — All native vLLM arguments (`--model`, `--tensor-parallel-size`, `--kv-transfer-config`, `--kv-events-config`, `--enable-prefix-caching`, etc.). See the [vLLM serve args documentation](https://docs.vllm.ai/en/stable/configuration/serve_args.html).
### Prompt Embeddings
Dynamo supports [vLLM prompt embeddings](https://docs.vllm.ai/en/stable/features/prompt_embeds.html) — pre-computed embeddings bypass tokenization in the Rust frontend and are decoded to tensors in the worker.
- Enable with `--enable-prompt-embeds` (disabled by default)
- Embeddings are sent as base64-encoded PyTorch tensors via the `prompt_embeds` field in the Completions API
- NATS must be configured with a 15MB max payload for large embeddings (already set in default deployments)
## Hashing Consistency for KV Events
When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:
- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's built-in hashing for prefix caching.
- If your vLLM version supports it, configure a deterministic prefix caching algorithm:
```bash
vllm serve ... --enable-prefix-caching --prefix-caching-hash-algo sha256
```
See the high-level notes in [Router Design](../../design-docs/router-design.md#deterministic-event-ids) on deterministic event IDs.
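
The effect of `PYTHONHASHSEED` can be observed directly with Python's built-in `hash()`. The sketch below starts fresh interpreters with a fixed seed and compares the results; it demonstrates string hash randomization in general and is not vLLM's block-hashing code. `hash_in_fresh_process` is a hypothetical helper.

```python
import os
import subprocess
import sys


def hash_in_fresh_process(text: str, seed: str) -> int:
    """Return hash(text) as computed by a new interpreter with PYTHONHASHSEED=seed."""
    env = dict(os.environ, PYTHONHASHSEED=seed)
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(hash(sys.argv[1]))", text],
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)


first = hash_in_fresh_process("prefix-block", "0")
second = hash_in_fresh_process("prefix-block", "0")
print("same hash across processes with PYTHONHASHSEED=0:", first == second)
```

With `PYTHONHASHSEED=random`, which is the default behavior, separate processes usually disagree on the value, which is the kind of mismatch the router would see between workers.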
## Graceful Shutdown
vLLM workers use Dynamo's graceful shutdown mechanism. When a `SIGTERM` or `SIGINT` is received:
1. **Discovery unregister**: The worker is removed from service discovery so no new requests are routed to it
2. **Grace period**: In-flight requests are allowed to complete (configurable via `DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS`, default 5s)
3. **Resource cleanup**: Engine resources and temporary files (Prometheus dirs, LoRA adapters) are released
All vLLM endpoints use `graceful_shutdown=True`, meaning they wait for in-flight requests to finish before exiting. An internal `VllmEngineMonitor` also checks engine health every 2 seconds and initiates shutdown if the engine becomes unresponsive.
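
The three-step sequence above can be sketched as follows. This is an illustration of the ordering only, not Dynamo's implementation: `graceful_shutdown` and its callback parameters are hypothetical, and the only names taken from this guide are the `DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS` variable and its 5s default.

```python
import os
import threading
from typing import Callable


def grace_period_secs(default: float = 5.0) -> float:
    """Read the grace period from the environment, falling back to the default."""
    return float(os.environ.get("DYN_GRACEFUL_SHUTDOWN_GRACE_PERIOD_SECS", default))


def graceful_shutdown(
    unregister: Callable[[], None],
    inflight_done: threading.Event,
    cleanup: Callable[[], None],
    grace: float,
) -> bool:
    """Run the shutdown steps in order; return True if in-flight work drained in time."""
    unregister()                                  # 1. stop receiving new requests
    drained = inflight_done.wait(timeout=grace)   # 2. wait for in-flight requests
    cleanup()                                     # 3. release resources either way
    return drained


steps = []
done = threading.Event()
done.set()  # pretend all in-flight requests have already finished
drained = graceful_shutdown(
    lambda: steps.append("unregister"),
    done,
    lambda: steps.append("cleanup"),
    grace=grace_period_secs(),
)
print(steps, drained)
```

In a real process this function would be called from a `SIGTERM` or `SIGINT` handler.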
For more details, see [Graceful Shutdown](../../fault-tolerance/graceful-shutdown.md).
## Health Checks
Each worker type has a specialized health check payload that validates the full inference pipeline:
| Worker Type | Health Check Strategy |
|------------|----------------------|
| Decode / Aggregated | Short generation request (`max_tokens=1`) using the model's BOS token |
| Prefill | Same payload structure as decode, adapted for prefill request format |
| vLLM-Omni | Short generation request via AsyncOmni with the model's BOS token |
Health checks are registered with the Dynamo runtime and called by the frontend or Kubernetes liveness probes. The payload can be overridden via `DYN_HEALTH_CHECK_PAYLOAD` environment variable. See [Health Checks](../../observability/health-checks.md) for the broader health check architecture.
## Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources.
| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |
For more details, see the [Request Cancellation Architecture](../../fault-tolerance/request-cancellation.md) documentation.
## Request Migration
Dynamo supports [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for configuration details.
## See Also
- **[Examples](vllm-examples.md)**: All deployment patterns with launch scripts
- **[vLLM README](README.md)**: Quick start and feature overview
- **[Observability](vllm-observability.md)**: Metrics and monitoring setup
- **[Router Guide](../../components/router/router-guide.md)**: KV-aware routing configuration
- **[Fault Tolerance](../../fault-tolerance/README.md)**: Request migration, cancellation, and graceful shutdown
......@@ -97,7 +97,7 @@ curl http://localhost:8000/v1/chat/completions \
"content": [
{
"type": "text",
"text": "Describe the image."
"text": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
},
{
"type": "image_url",
......@@ -160,7 +160,7 @@ curl http://localhost:8000/v1/chat/completions \
"content": [
{
"type": "text",
"text": "Describe the image."
"text": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
},
{
"type": "image_url",
......@@ -223,7 +223,7 @@ curl http://localhost:8000/v1/chat/completions \
"content": [
{
"type": "text",
"text": "Describe the image."
"text": "Explain why Roger Federer is considered one of the greatest tennis players of all time"
},
{
"type": "image_url",
......
......@@ -291,16 +291,12 @@ navigation:
# -- Backend detail pages --
- section: vLLM Details
contents:
- page: DeepSeek-R1
path: backends/vllm/deepseek-r1.md
- page: GPT-OSS
path: backends/vllm/gpt-oss.md
- page: Multi-Node
path: backends/vllm/multi-node.md
- page: Prometheus
path: backends/vllm/prometheus.md
- page: Prompt Embeddings
path: backends/vllm/prompt-embeddings.md
- page: Reference Guide
path: backends/vllm/vllm-reference-guide.md
- page: Examples
path: backends/vllm/vllm-examples.md
- page: Observability
path: backends/vllm/vllm-observability.md
- page: vLLM-Omni
path: backends/vllm/vllm-omni.md
- section: TensorRT-LLM Details
......
......@@ -163,7 +163,7 @@ When LMCache is enabled with `--kv-transfer-config '{"kv_connector":"LMCacheConn
- `DYN_SYSTEM_PORT=8081` - Enables metrics HTTP endpoint
- `PROMETHEUS_MULTIPROC_DIR` (optional) - If not set, Dynamo manages it internally
For detailed information on LMCache metrics, including the complete list of available metrics and how to access them, see the **[LMCache Metrics section](../backends/vllm/prometheus.md#lmcache-metrics)** in the vLLM Prometheus Metrics Guide.
For detailed information on LMCache metrics, including the complete list of available metrics and how to access them, see the **[LMCache Metrics section](../backends/vllm/vllm-observability.md#lmcache-metrics)** in the vLLM Prometheus Metrics Guide.
## Troubleshooting
......
......@@ -84,7 +84,7 @@ Dynamo exposes several categories of metrics:
- **Frontend Metrics** (`dynamo_frontend_*`) - Request handling, token processing, and latency measurements
- **Component Metrics** (`dynamo_component_*`) - Request counts, processing times, byte transfers, and system uptime
- **Specialized Component Metrics** (e.g., `dynamo_preprocessor_*`) - Component-specific metrics
- **Engine Metrics** (Pass-through) - Backend engines expose their own metrics: [vLLM](../backends/vllm/prometheus.md) (`vllm:*`), [SGLang](../backends/sglang/sglang-observability.md) (`sglang:*`), [TensorRT-LLM](../backends/trtllm/prometheus.md) (`trtllm_*`)
- **Engine Metrics** (Pass-through) - Backend engines expose their own metrics: [vLLM](../backends/vllm/vllm-observability.md) (`vllm:*`), [SGLang](../backends/sglang/sglang-observability.md) (`sglang:*`), [TensorRT-LLM](../backends/trtllm/prometheus.md) (`trtllm_*`)
## Runtime Hierarchy
......
......@@ -71,7 +71,7 @@ echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],"
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Explain why Roger Federer is considered one of the greatest tennis players of all time\"}],"
echo " \"max_tokens\": 32"
echo " }'"
echo ""
......
......@@ -49,7 +49,7 @@ echo " curl http://localhost:${HTTP_PORT}/v1/embeddings \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"input\": \"Hello world\""
echo " \"input\": \"Explain why Roger Federer is considered one of the greatest tennis players of all time\""
echo " }'"
echo ""
echo "=========================================="
......
......@@ -69,7 +69,7 @@ echo " curl http://localhost:${HTTP_PORT}/v1/chat/completions \\"
echo " -H 'Content-Type: application/json' \\"
echo " -d '{"
echo " \"model\": \"${MODEL}\","
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Hello!\"}],"
echo " \"messages\": [{\"role\": \"user\", \"content\": \"Explain why Roger Federer is considered one of the greatest tennis players of all time\"}],"
echo " \"max_tokens\": 32"
echo " }'"
echo ""
......