"tests/vscode:/vscode.git/clone" did not exist on "ba3ac23560cb4a986b0e26c87162b68a778da286"
Unverified Commit 39d645e5 authored by Jonathan Tong's avatar Jonathan Tong Committed by GitHub
Browse files

docs: migrate Fern docs from fern/ into docs/ (#6206)


Signed-off-by: default avatarJont828 <jt572@cornell.edu>
parent d381e6ff
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# SGLang Prometheus Metrics
## Overview
When running SGLang through Dynamo, SGLang engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both SGLang engine metrics (prefixed with `sglang:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
**For the complete and authoritative list of all SGLang metrics**, always refer to the [official SGLang Production Metrics documentation](https://docs.sglang.io/references/production_metrics.html).
**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
**For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana.md).
## Environment Variables
| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_SYSTEM_PORT` | System metrics/health port | `-1` (disabled) | `8081` |
## Getting Started Quickly
This is a single machine example.
### Start Observability Stack
For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README.md#getting-started-quickly) for instructions.
### Launch Dynamo Components
Launch a frontend and SGLang backend to test metrics:
```bash
# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
$ python -m dynamo.frontend
# Enable system metrics server on port 8081
$ DYN_SYSTEM_PORT=8081 python -m dynamo.sglang --model <model_name> --enable-metrics
```
Wait for the SGLang worker to start, then send requests and check metrics:
```bash
# Send a request
curl -H 'Content-Type: application/json' \
-d '{
"model": "<model_name>",
"max_completion_tokens": 100,
"messages": [{"role": "user", "content": "Hello"}]
}' \
http://localhost:8000/v1/chat/completions
# Check metrics from the worker
curl -s localhost:8081/metrics | grep "^sglang:"
```
## Exposed Metrics
SGLang exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All SGLang engine metrics use the `sglang:` prefix and include labels (e.g., `model_name`, `engine_type`, `tp_rank`, `pp_rank`) to identify the source.
**Example Prometheus Exposition Format text:**
```
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
# TYPE sglang:prompt_tokens_total counter
sglang:prompt_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 8128902.0
# HELP sglang:generation_tokens_total Number of generation tokens processed.
# TYPE sglang:generation_tokens_total counter
sglang:generation_tokens_total{model_name="meta-llama/Llama-3.1-8B-Instruct"} 7557572.0
# HELP sglang:cache_hit_rate The cache hit rate
# TYPE sglang:cache_hit_rate gauge
sglang:cache_hit_rate{model_name="meta-llama/Llama-3.1-8B-Instruct"} 0.0075
```
**Note:** The specific metrics shown above are examples and may vary depending on your SGLang version. Always inspect your actual `/metrics` endpoint or refer to the [official documentation](https://docs.sglang.io/references/production_metrics.html) for the current list.
### Metric Categories
SGLang provides metrics in the following categories (all prefixed with `sglang:`):
- **Throughput metrics** - Token processing rates
- **Resource usage** - System resource consumption
- **Latency metrics** - Request and token latency measurements
- **Disaggregation metrics** - Metrics specific to disaggregated deployments (when enabled)
**Note:** Specific metrics are subject to change between SGLang versions. Always refer to the [official documentation](https://docs.sglang.io/references/production_metrics.html) or inspect the `/metrics` endpoint for your SGLang version.
## Available Metrics
The official SGLang documentation includes complete metric definitions with:
- HELP and TYPE descriptions
- Counter, Gauge, and Histogram metric types
- Metric labels (e.g., `model_name`, `engine_type`, `tp_rank`, `pp_rank`)
- Setup guide for Prometheus + Grafana monitoring
- Troubleshooting tips and configuration examples
For the complete and authoritative list of all SGLang metrics, see the [official SGLang Production Metrics documentation](https://docs.sglang.io/references/production_metrics.html).
## Implementation Details
- SGLang uses multiprocess metrics collection via `prometheus_client.multiprocess.MultiProcessCollector`
- Metrics are filtered by the `sglang:` prefix before being exposed
- The integration uses Dynamo's `register_engine_metrics_callback()` function
- Metrics appear after SGLang engine initialization completes
## Related Documentation
### SGLang Metrics
- [Official SGLang Production Metrics](https://docs.sglang.io/references/production_metrics.html)
- [SGLang GitHub - Metrics Collector](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/metrics/collector.py)
### Dynamo Metrics
- [Dynamo Metrics Guide](../../observability/metrics.md) - Complete documentation on Dynamo runtime metrics
- [Prometheus and Grafana Setup](../../observability/prometheus-grafana.md) - Visualization setup instructions
- Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside SGLang metrics
- Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
- Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
- Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# SGLang Disaggregated Serving
This document explains how SGLang's disaggregated prefill-decode architecture works, both standalone and within Dynamo.
## Overview
Disaggregated serving separates the prefill and decode phases of LLM inference into different workers. This architecture allows for:
- Independent scaling of prefill and decode resources
- Better resource utilization (prefill is compute-bound, decode is memory-bound)
- Efficient KV cache transfer between workers using RDMA
## How Dynamo Integrates with SGLang Disaggregation
**SGLang's standalone approach:**
1. The load balancer receives a request from the client
2. A random `(prefill, decode)` pair is selected from the pool of available workers
3. Request is sent to both `prefill` and `decode` workers via asyncio tasks
4. Internally disaggregation is done from prefill → decode
**Dynamo's approach:**
Because Dynamo has a discovery mechanism, we do not use a load balancer. Instead:
1. Route to a decode worker first
2. Choose a prefill worker via round-robin or KV-aware selection
3. Send the request to both workers
4. SGLang's bootstrap server (part of the `tokenizer_manager`) is used in conjunction with NIXL/Mooncake to handle the KV transfer
## Disaggregation Flow
The following diagram shows the complete request flow for disaggregated serving:
```mermaid
sequenceDiagram
participant Client
participant Decode
participant Prefill
Note over Decode,Prefill: 0. Setup Phase (One-Time)
Decode->>Prefill: Register RDMA connection info (base GPU memory pointers)
Note over Client,Prefill: Per-Request Phase
Client->>Decode: 1. Send request
Decode->>Prefill: 2. Forward request + get bootstrap_room
Prefill-->>Decode: Return bootstrap_room ID
Note over Decode: 3. Allocate GPU memory for KV cache
Decode->>Prefill: Send allocation info (page indices, metadata buffer)
Note over Prefill: 4. Prefill forward pass
par Decode polls
loop Poll transfer
Note over Decode: 5. Poll for KV arrival
end
and Prefill transfers
Note over Prefill: 6. RDMA write KV to decode
Prefill->>Decode: Transfer KV cache + metadata
end
Note over Prefill: 7. Poll RDMA handles
Note over Prefill: Transfer complete, deallocate metadata
Note over Decode: 8. KV received, start decode
loop Generate tokens
Note over Decode: Decode forward pass
Decode-->>Client: Stream output token
end
```
### Key Steps Explained
**Setup Phase (One-Time)**
- Decode workers register their RDMA connection information with prefill workers
- This includes base GPU memory pointers for direct memory access
**Per-Request Flow**
1. **Request initiation**: Client sends request to decode worker
2. **Bootstrap room allocation**: Decode forwards to prefill and receives a bootstrap_room ID for coordination
3. **Memory allocation**: Decode allocates GPU memory pages for incoming KV cache
4. **Prefill execution**: Prefill worker processes the prompt and generates KV cache
5. **KV transfer**: Prefill uses RDMA to write KV cache directly to decode's GPU memory (while decode polls for completion)
6. **Cleanup**: Prefill deallocates transfer metadata after confirming completion
7. **Decode phase**: Decode worker generates tokens using the transferred KV cache
8. **Streaming**: Tokens are streamed back to the client as they're generated
### Performance Characteristics
- **RDMA transfer**: Zero-copy GPU-to-GPU transfer with minimal CPU involvement
- **Parallel operations**: Decode can poll while prefill transfers data
- **One-time setup**: RDMA connections established once, reused for all requests
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# LLM Deployment using TensorRT-LLM
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
---
## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#single-node-examples)
- [Advanced Examples](#advanced-examples)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [Client](#client)
- [Benchmarking](#benchmarking)
- [Multimodal Support](#multimodal-support)
- [Video Diffusion Support](#video-diffusion-support-experimental)
- [Logits Processing](#logits-processing)
- [DP Rank Routing](#dp-rank-routing-attention-data-parallelism)
- [Performance Sweep](#performance-sweep)
- [Known Issues and Mitigations](#known-issues-and-mitigations)
## Feature Support Matrix
### Core Dynamo Features
| Feature | TensorRT-LLM | Notes |
|---------|--------------|-------|
| [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/components/planner/planner_guide.md) | ✅ | |
| [**Load Based Planner**](../../../docs/components/planner/README.md) | 🚧 | Planned |
| [**KVBM**](../../../docs/components/kvbm/README.md) | ✅ | |
### Large Scale P/D and WideEP Features
| Feature | TensorRT-LLM | Notes |
|--------------------|--------------|-----------------------------------------------------------------|
| **WideEP** | ✅ | |
| **DP Rank Routing**| ✅ | |
| **GB200 Support** | ✅ | |
## TensorRT-LLM Quick Start
Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
### Start Infrastructure Services (Local Development Only)
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):
```bash
docker compose -f deploy/docker-compose.yml up -d
```
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
### Build container
```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
# On an x86 machine:
python container/render.py --framework=trtllm --target=runtime --output-short-filename
docker build -t dynamo:trtllm-latest -f container/rendered.Dockerfile .
# On an ARM machine:
python container/render.py --framework=trtllm --target=runtime --platform=arm64 --output-short-filename
docker build -t dynamo:trtllm-latest -f container/rendered.Dockerfile .
```
### Run container
```bash
./container/run.sh --framework trtllm -it
```
## Single Node Examples
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router_guide.md).
### Aggregated
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh
```
### Aggregated with KV Routing
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/agg_router.sh
```
### Disaggregated
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg.sh
```
### Disaggregated with KV Routing
> [!IMPORTANT]
> In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_router.sh
```
### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export AGG_ENGINE_ARGS=./engine_configs/deepseek-r1/agg/mtp/mtp_agg.yaml
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
# nvidia/DeepSeek-R1-FP4 is a large model
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
./launch/agg.sh
```
Notes:
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
## Advanced Examples
Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
### Multinode Deployment
For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)**
### Kubernetes Deployment
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](../../../examples/backends/trtllm/deploy/README.md).
### Client
See [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
### Benchmarking
To benchmark your deployment with AIPerf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
## KV Cache Transfer in Disaggregated Serving
Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-transfer.md).
## Request Migration
Dynamo supports [request migration](../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../../docs/fault_tolerance/request_migration.md) documentation for configuration details.
## Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
### Cancellation Support Matrix
| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |
For more details, see the [Request Cancellation Architecture](../../fault_tolerance/request_cancellation.md) documentation.
## Client
See [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
## Benchmarking
To benchmark your deployment with AIPerf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
## Multimodal support
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal_trtllm.md).
## Video Diffusion Support (Experimental)
Dynamo supports video generation using diffusion models through the `--modality video_diffusion` flag.
### Requirements
- **visual_gen**: Part of TensorRT-LLM, located at `tensorrt_llm/visual_gen/`. Currently available **only** on the [`feat/visual_gen`](https://github.com/NVIDIA/TensorRT-LLM/tree/feat/visual_gen/tensorrt_llm/visual_gen) branch (not yet merged to main or any release). Install from source:
```bash
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM && git checkout feat/visual_gen
cd tensorrt_llm/visual_gen && pip install -e .
```
- **dynamo-runtime with video API**: The Dynamo runtime must include `ModelType.Videos` support. Ensure you're using a compatible version.
### Supported Models
| Diffusers Pipeline | Description | Example Model |
|--------------------|-------------|---------------|
| `WanPipeline` | Wan 2.1/2.2 Text-to-Video | `Wan-AI/Wan2.1-T2V-1.3B-Diffusers` |
The pipeline type is **auto-detected** from the model's `model_index.json` — no `--model-type` flag is needed.
### Quick Start
```bash
python -m dynamo.trtllm \
--modality video_diffusion \
--model-path Wan-AI/Wan2.1-T2V-1.3B-Diffusers \
--output-dir /tmp/videos
```
### API Endpoint
Video generation uses the `/v1/videos/generations` endpoint:
```bash
curl -X POST http://localhost:8000/v1/videos/generations \
-H "Content-Type: application/json" \
-d '{
"prompt": "A cat playing piano",
"model": "wan_t2v",
"size": "832x480",
"seconds": 4,
"fps": 24
}'
```
### Configuration Options
| Flag | Description | Default |
|------|-------------|---------|
| `--output-dir` | Directory for generated videos | `/tmp/dynamo_videos` |
| `--default-height` | Default video height | `480` |
| `--default-width` | Default video width | `832` |
| `--default-num-frames` | Default frame count | `81` |
| `--enable-teacache` | Enable TeaCache optimization | `False` |
| `--disable-torch-compile` | Disable torch.compile | `False` |
### Limitations
- Video diffusion is experimental and not recommended for production use
- Only text-to-video is supported in this release (image-to-video planned)
- Requires GPU with sufficient VRAM for the diffusion model
## Logits Processing
Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.
### How it works
- **Interface**: Implement `dynamo.logits_processing.BaseLogitsProcessor` which defines `__call__(input_ids, logits)` and modifies `logits` in-place.
- **TRT-LLM adapter**: Use `dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)` to convert Dynamo processors into TRT-LLM-compatible processors and assign them to `SamplingParams.logits_processor`.
- **Examples**: See example processors in `lib/bindings/python/src/dynamo/logits_processing/examples/` ([temperature](../../../lib/bindings/python/src/dynamo/logits_processing/examples/temperature.py), [hello_world](../../../lib/bindings/python/src/dynamo/logits_processing/examples/hello_world.py)).
### Quick test: HelloWorld processor
You can enable a test-only processor that forces the model to respond with "Hello world!". This is useful to verify the wiring without modifying your model or engine code.
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
./launch/agg.sh
```
Notes:
- When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
- Expected chat response contains "Hello world".
### Bring your own processor
Implement a processor by conforming to `BaseLogitsProcessor` and modify logits in-place. For example, temperature scaling:
```python
from typing import Sequence
import torch
from dynamo.logits_processing import BaseLogitsProcessor
class TemperatureProcessor(BaseLogitsProcessor):
def __init__(self, temperature: float = 1.0):
if temperature <= 0:
raise ValueError("Temperature must be positive")
self.temperature = temperature
def __call__(self, input_ids: Sequence[int], logits: torch.Tensor):
if self.temperature == 1.0:
return
logits.div_(self.temperature)
```
Wire it into TRT-LLM by adapting and attaching to `SamplingParams`:
```python
from dynamo.trtllm.logits_processing.adapter import create_trtllm_adapters
from dynamo.logits_processing.examples import TemperatureProcessor
processors = [TemperatureProcessor(temperature=0.7)]
sampling_params.logits_processor = create_trtllm_adapters(processors)
```
### Current limitations
- Per-request processing only (batch size must be 1); beam width > 1 is not supported.
- Processors must modify logits in-place and not return a new tensor.
- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).
## DP Rank Routing (Attention Data Parallelism)
TensorRT-LLM supports [attention data parallelism](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models) (attention DP) for models like DeepSeek. When enabled, multiple attention DP ranks run within a single worker, each with its own KV cache. Dynamo can route requests to specific DP ranks based on KV cache state.
### Dynamo vs TRT-LLM Internal Routing
- **Dynamo DP Rank Routing**: The router selects the optimal DP rank based on KV cache overlap and instructs TRT-LLM to use that rank with strict routing (`attention_dp_relax=False`). Use this with `--router-mode kv` for cache-aware routing.
- **TRT-LLM Internal Routing**: TRT-LLM's scheduler assigns DP ranks internally. Use this with `--router-mode round-robin` or `random` when KV-aware routing isn't needed.
### Enabling DP Rank Routing
```bash
# Worker with attention DP
# (TP=2 acts as the "world size", in effect creating 2 attention DP ranks)
CUDA_VISIBLE_DEVICES=0,1 python3 -m dynamo.trtllm \
--model-path <MODEL_PATH> \
--tensor-parallel-size 2 \
--enable-attention-dp \
--publish-events-and-metrics
# Frontend with KV routing
python3 -m dynamo.frontend --router-mode kv
```
The `--enable-attention-dp` flag sets `attention_dp_size = tensor_parallel_size` and configures Dynamo to publish KV events per DP rank. The router automatically creates routing targets for each `(worker_id, dp_rank)` combination.
> [!NOTE]
> Attention DP requires TRT-LLM's PyTorch backend. AutoDeploy does not support attention DP.
## Performance Sweep
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](../../../examples/backends/trtllm/performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
## Dynamo KV Block Manager Integration
Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/components/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .
## Known Issues and Mitigations
### KV Cache Exhaustion Causing Worker Deadlock (Disaggregated Serving)
**Issue:** In disaggregated serving mode, TensorRT-LLM workers can become stuck and unresponsive after sustained high-load traffic. Once in this state, workers require a pod/process restart to recover.
**Symptoms:**
- Workers function normally initially but hang after heavy load testing
- Inference requests get stuck and eventually timeout
- Logs show warnings: `num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache`
- Error logs may contain: `asyncio.exceptions.InvalidStateError: invalid state`
**Root Cause:** When `max_tokens_in_buffer` in the cache transceiver config is smaller than the maximum input sequence length (ISL) being processed, KV cache exhaustion can occur under heavy load. This causes context transfers to timeout, leaving workers stuck waiting for phantom transfers and entering an irrecoverable deadlock state.
**Mitigation:** Ensure `max_tokens_in_buffer` exceeds your maximum expected input sequence length. Update your engine configuration files (e.g., `prefill.yaml` and `decode.yaml`):
```yaml
cache_transceiver_config:
backend: DEFAULT
max_tokens_in_buffer: 65536 # Must exceed max ISL
```
For example, see `examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml`.
**Related Issue:** [#4327](https://github.com/ai-dynamo/dynamo/issues/4327)
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Gemma 3 with Variable Sliding Window Attention
This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Since google/gemma-3-1b-it is a small model, each aggregated, decode, or prefill worker only requires one H100 GPU or one GB200 GPU.
VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. An example of this is Gemma 3, which incorporates both global attention layers and sliding window layers.
> [!Note]
> - Ensure that required services such as `nats` and `etcd` are running before starting.
> - Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.
> - It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.
## Aggregated Serving
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export MODEL_PATH=google/gemma-3-1b-it
export SERVED_MODEL_NAME=$MODEL_PATH
export AGG_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml
./launch/agg.sh
```
## Aggregated Serving with KV Routing
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export MODEL_PATH=google/gemma-3-1b-it
export SERVED_MODEL_NAME=$MODEL_PATH
export AGG_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_agg.yaml
./launch/agg_router.sh
```
## Disaggregated Serving
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export MODEL_PATH=google/gemma-3-1b-it
export SERVED_MODEL_NAME=$MODEL_PATH
export PREFILL_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml
export DECODE_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml
./launch/disagg.sh
```
## Disaggregated Serving with KV Routing
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export MODEL_PATH=google/gemma-3-1b-it
export SERVED_MODEL_NAME=$MODEL_PATH
export PREFILL_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_prefill.yaml
export DECODE_ENGINE_ARGS=$DYNAMO_HOME/examples/backends/trtllm/engine_configs/gemma3/vswa_decode.yaml
./launch/disagg_router.sh
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running gpt-oss-120b Disaggregated with TensorRT-LLM
Dynamo supports disaggregated serving of gpt-oss-120b with TensorRT-LLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single B200 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
## Overview
This deployment uses disaggregated serving in TensorRT-LLM where:
- **Prefill Worker**: Processes input prompts efficiently using 4 GPUs with tensor parallelism
- **Decode Worker**: Generates output tokens using 4 GPUs, optimized for token generation throughput
- **Frontend**: Provides OpenAI-compatible API endpoint with round-robin routing
The disaggregated approach optimizes for both low-latency (maximizing tokens per second per user) and high-throughput (maximizing total tokens per GPU per second) use cases by separating the compute-intensive prefill phase from the memory-bound decode phase.
## Prerequisites
- 1x NVIDIA B200 node with 8 GPUs (this guide focuses on single-node B200 deployment)
- CUDA Toolkit 12.8 or later
- Docker with [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) installed
- Fast SSD storage for model weights (~240GB required)
- HuggingFace account and [access token](https://huggingface.co/settings/tokens)
- [HuggingFace CLI](https://huggingface.co/docs/huggingface_hub/en/guides/cli)
Ensure that the `etcd` and `nats` services are running with the following command:
```bash
docker compose -f deploy/docker-compose.yml up
```
## Instructions
### 1. Download the Model
```bash
export MODEL_PATH=<LOCAL_MODEL_DIRECTORY>
export HF_TOKEN=<INSERT_TOKEN_HERE>
pip install -U "huggingface_hub[cli]"
huggingface-cli download openai/gpt-oss-120b --exclude "original/*" --exclude "metal/*" --local-dir $MODEL_PATH
```
### 2. Run the Container
Set the container image:
```bash
export DYNAMO_CONTAINER_IMAGE=nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:my-tag
```
Launch the Dynamo TensorRT-LLM container with the necessary configurations:
```bash
docker run \
--gpus all \
-it \
--rm \
--network host \
--volume $MODEL_PATH:/model \
--volume $PWD:/workspace \
--shm-size=10G \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
--ulimit nofile=65536:65536 \
--cap-add CAP_SYS_PTRACE \
--ipc host \
-e HF_TOKEN=$HF_TOKEN \
-e TRTLLM_ENABLE_PDL=1 \
-e TRT_LLM_DISABLE_LOAD_WEIGHTS_IN_PARALLEL=True \
$DYNAMO_CONTAINER_IMAGE
```
This command:
- Automatically removes the container when stopped (`--rm`)
- Allows container to interact with host's IPC resources for optimal performance (`--ipc=host`)
- Runs the container in interactive mode (`-it`)
- Sets up shared memory and stack limits for optimal performance
- Mounts your model directory into the container at `/model`
- Mounts the current Dynamo workspace into the container at `/workspace/dynamo`
- Enables [PDL](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programmatic-dependent-launch-and-synchronization) and disables parallel weight loading
- Sets HuggingFace token as environment variable in the container
### 3. Understanding the Configuration
The deployment uses configuration files and command-line arguments to control behavior:
#### Configuration Files
**Prefill Configuration (`examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml`)**:
- `enable_attention_dp: false` - Attention data parallelism disabled for prefill
- `enable_chunked_prefill: true` - Enables efficient chunked prefill processing
- `moe_config.backend: CUTLASS` - Uses optimized CUTLASS kernels for MoE layers
- `cache_transceiver_config.backend: ucx` - Uses UCX for efficient KV cache transfer
- `cuda_graph_config.max_batch_size: 32` - Maximum batch size for CUDA graphs
**Decode Configuration (`examples/backends/trtllm/engine_configs/gpt-oss-120b/decode.yaml`)**:
- `enable_attention_dp: true` - Attention data parallelism enabled for decode
- `disable_overlap_scheduler: false` - Enables overlapping for decode efficiency
- `moe_config.backend: CUTLASS` - Uses optimized CUTLASS kernels for MoE layers
- `cache_transceiver_config.backend: ucx` - Uses UCX for efficient KV cache transfer
- `cuda_graph_config.max_batch_size: 128` - Maximum batch size for CUDA graphs
#### Command-Line Arguments
Both workers receive these key arguments:
- `--tensor-parallel-size 4` - Uses 4 GPUs for tensor parallelism
- `--expert-parallel-size 4` - Expert parallelism across 4 GPUs
- `--free-gpu-memory-fraction 0.9` - Allocates 90% of GPU memory
Prefill-specific arguments:
- `--max-num-tokens 20000` - Maximum tokens for prefill processing
- `--max-batch-size 32` - Maximum batch size for prefill
Decode-specific arguments:
- `--max-num-tokens 16384` - Maximum tokens for decode processing
- `--max-batch-size 128` - Maximum batch size for decode
### 4. Launch the Deployment
Note that GPT-OSS is a reasoning model with tool calling support. To ensure the response is being processed correctly, the worker should be launched with proper ```--dyn-reasoning-parser``` and ```--dyn-tool-call-parser```.
You can use the provided launch script or run the components manually:
#### Option A: Using the Launch Script
```bash
cd /workspace/examples/backends/trtllm
./launch/gpt_oss_disagg.sh
```
#### Option B: Manual Launch
1. **Start frontend**:
```bash
# Start frontend with round-robin routing
python3 -m dynamo.frontend --router-mode round-robin --http-port 8000 &
```
2. **Launch prefill worker**:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python3 -m dynamo.trtllm \
--model-path /model \
--served-model-name openai/gpt-oss-120b \
--extra-engine-args examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml \
--dyn-reasoning-parser gpt_oss \
--dyn-tool-call-parser harmony \
--disaggregation-mode prefill \
--max-num-tokens 20000 \
--max-batch-size 32 \
--free-gpu-memory-fraction 0.9 \
--tensor-parallel-size 4 \
--expert-parallel-size 4 &
```
3. **Launch decode worker**:
```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 python3 -m dynamo.trtllm \
--model-path /model \
--served-model-name openai/gpt-oss-120b \
--extra-engine-args examples/backends/trtllm/engine_configs/gpt-oss-120b/decode.yaml \
--dyn-reasoning-parser gpt_oss \
--dyn-tool-call-parser harmony \
--disaggregation-mode decode \
--max-num-tokens 16384 \
--free-gpu-memory-fraction 0.9 \
--tensor-parallel-size 4 \
--expert-parallel-size 4
```
### 6. Verify the Deployment is Ready
Poll the `/health` endpoint to verify that both the prefill and decode worker endpoints have started:
```
curl http://localhost:8000/health
```
Make sure that both of the endpoints are available before sending an inference request:
```
{
"endpoints": [
"dyn://dynamo.tensorrt_llm.generate",
"dyn://dynamo.prefill.generate"
],
"status": "healthy"
}
```
If only one worker endpoint is listed, the other may still be starting up. Monitor the worker logs to track startup progress.
### 7. Test the Deployment
Send a test request to verify the deployment:
```bash
curl -X POST http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-oss-120b",
"input": "Explain the concept of disaggregated serving in LLM inference in 3 sentences.",
"max_output_tokens": 200,
"stream": false
}'
```
The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters like `max_tokens`, `temperature`, and others according to your needs.
### 8. Reasoning and Tool Calling
Dynamo has supported reasoning and tool calling in OpenAI Chat Completion endpoint. A typical workflow for application built on top of Dynamo
is that the application has a set of tools to aid the assistant provide accurate answer, and it is ususally
multi-turn as it involves tool selection and generation based on the tool result.
In addition, the reasoning effort can be configured through ```chat_template_args```. Increasing the reasoning effort makes the model more accurate but also slower. It supports three levels: ```low```, ```medium```, and ```high```.
Below is an example of sending multi-round requests to complete a user query with reasoning and tool calling:
**Application setup (pseudocode)**
```Python
# The tool defined by the application
def get_system_health():
for component in system.components:
if not component.health():
return False
return True
# The JSON representation of the declaration in ChatCompletion tool style
tool_choice = '{
"type": "function",
"function": {
"name": "get_system_health",
"description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
"parameters": {
"type": "object",
"properties": {}
}
}
}'
# On user query, perform below workflow.
def user_query(app_request):
# first round
# create chat completion with prompt and tool choice
request = ...
response = send(request)
if response["finish_reason"] == "tool_calls":
# second round
function, params = parse_tool_call(response)
function_result = function(params)
# create request with prompt, assistant response, and function result
request = ...
response = send(request)
return app_response(response)
```
**First request with tools**
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
{
"model": "openai/gpt-oss-120b",
"messages": [
{
"role": "user",
"content": "Hey, quick check: is everything up and running?"
}
],
"chat_template_args": {
"reasoning_effort": "low"
},
"tools": [
{
"type": "function",
"function": {
"name": "get_system_health",
"description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
"parameters": {
"type": "object",
"properties": {}
}
}
}
],
"response_format": {
"type": "text"
},
"stream": false,
"max_tokens": 300
}'
```
**First response with tool choice**
```JSON
{
"id": "chatcmpl-d1c12219-6298-4c83-a6e3-4e7cef16e1a9",
"choices": [
{
"index": 0,
"message": {
"tool_calls": [
{
"id": "call-1",
"type": "function",
"function": {
"name": "get_system_health",
"arguments": "{}"
}
}
],
"role": "assistant",
"reasoning_content": "We need to check system health. Use function."
},
"finish_reason": "tool_calls"
}
],
"created": 1758758741,
"model": "openai/gpt-oss-120b",
"object": "chat.completion",
"usage": null
}
```
**Second request with tool calling result**
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
{
"model": "openai/gpt-oss-120b",
"messages": [
{
"role": "user",
"content": "Hey, quick check: is everything up and running?"
},
{
"role": "assistant",
"tool_calls": [
{
"id": "call-1",
"type": "function",
"function": {
"name": "get_system_health",
"arguments": "{}"
}
}
]
},
{
"role": "tool",
"tool_call_id": "call-1",
"content": "{\"status\":\"ok\",\"uptime_seconds\":372045}"
}
],
"chat_template_args": {
"reasoning_effort": "low"
},
"tools": [
{
"type": "function",
"function": {
"name": "get_system_health",
"description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
"parameters": {
"type": "object",
"properties": {}
}
}
}
],
"response_format": {
"type": "text"
},
"stream": false,
"max_tokens": 300
}'
```
**Second response with final message**
```JSON
{
"id": "chatcmpl-9ebfe64a-68b9-4c1d-9742-644cf770ad0e",
"choices": [
{
"index": 0,
"message": {
"content": "All systems are green—everything’s up and running smoothly! 🚀 Let me know if you need anything else.",
"role": "assistant",
"reasoning_content": "The user asks: \"Hey, quick check: is everything up and running?\" We have just checked system health, it's ok. Provide friendly response confirming everything's up."
},
"finish_reason": "stop"
}
],
"created": 1758758853,
"model": "openai/gpt-oss-120b",
"object": "chat.completion",
"usage": null
}
```
## Benchmarking
### Performance Testing with AIPerf
The Dynamo container includes [AIPerf](https://github.com/ai-dynamo/aiperf/tree/main?tab=readme-ov-file#aiperf), NVIDIA's tool for benchmarking generative AI models. This tool helps measure throughput, latency, and other performance metrics for your deployment.
**Run the following benchmark from inside the container** (after completing the deployment steps above):
```bash
# Create a directory for benchmark results
mkdir -p /tmp/benchmark-results
# Run the benchmark - this command tests the deployment with high-concurrency synthetic workload
aiperf profile \
--model openai/gpt-oss-120b \
--tokenizer /model \
--endpoint-type chat \
--endpoint /v1/chat/completions \
--streaming \
--url localhost:8000 \
--synthetic-input-tokens-mean 32000 \
--synthetic-input-tokens-stddev 0 \
--output-tokens-mean 256 \
--output-tokens-stddev 0 \
--extra-inputs max_tokens:256 \
--extra-inputs min_tokens:256 \
--extra-inputs ignore_eos:true \
--extra-inputs "{\"nvext\":{\"ignore_eos\":true}}" \
--concurrency 256 \
--request-count 6144 \
--warmup-request-count 1000 \
--num-dataset-entries 8000 \
--random-seed 100 \
--artifact-dir /tmp/benchmark-results \
-H 'Authorization: Bearer NOT USED' \
-H 'Accept: text/event-stream'
```
### What This Benchmark Does
This command:
- **Tests chat completions** with streaming responses against the disaggregated deployment
- **Simulates high load** with 256 concurrent requests and 6144 total requests
- **Uses long context inputs** (32K tokens) to test prefill performance
- **Generates consistent outputs** (256 tokens) to measure decode throughput
- **Includes warmup period** (1000 requests) to stabilize performance metrics
- **Saves detailed results** to `/tmp/benchmark-results` for analysis
Key parameters you can adjust:
- `--concurrency`: Number of simultaneous requests (impacts GPU utilization)
- `--synthetic-input-tokens-mean`: Average input length (tests prefill capacity)
- `--output-tokens-mean`: Average output length (tests decode throughput)
- `--request-count`: Total number of requests for the benchmark
### Installing AIPerf Outside the Container
If you prefer to run benchmarks from outside the container:
```bash
# Install AIPerf
pip install aiperf
# Then run the same benchmark command, adjusting the tokenizer path if needed
```
## Architecture Overview
The disaggregated architecture separates prefill and decode phases:
```mermaid
flowchart TD
Client["Users/Clients<br/>(HTTP)"] --> Frontend["Frontend<br/>Round-Robin Router"]
Frontend --> Prefill["Prefill Worker<br/>(GPUs 0-3)"]
Frontend --> Decode["Decode Worker<br/>(GPUs 4-7)"]
Prefill -.->|KV Cache Transfer<br/>via UCX| Decode
```
## Key Features
1. **Disaggregated Serving**: Separates compute-intensive prefill from memory-bound decode operations
2. **Optimized Resource Usage**: Different parallelism strategies for prefill vs decode
3. **Scalable Architecture**: Easy to adjust worker counts based on workload
4. **TensorRT-LLM Optimizations**: Leverages TensorRT-LLM's efficient kernels and memory management
## Troubleshooting
### Common Issues
1. **CUDA Out-of-Memory Errors**
- Reduce `--max-num-tokens` in the launch commands (currently 20000 for prefill, 16384 for decode)
- Lower `--free-gpu-memory-fraction` from 0.9 to 0.8 or 0.7
- Ensure model checkpoints are compatible with the expected format
2. **Workers Not Connecting**
- Ensure etcd and NATS services are running: `docker ps | grep -E "(etcd|nats)"`
- Check network connectivity between containers
- Verify CUDA_VISIBLE_DEVICES settings match your GPU configuration
- Check that no other processes are using the assigned GPUs
3. **Performance Issues**
- Monitor GPU utilization with `nvidia-smi` while the deployment is running
- Check worker logs for bottlenecks or errors
- Ensure that batch sizes in manual commands match those in configuration files
- Adjust chunked prefill settings based on your workload
- For connection issues, ensure port 8000 is not being used by another application
4. **Container Startup Issues**
- Verify that the NVIDIA Container Toolkit is properly installed
- Check Docker daemon is running with GPU support
- Ensure sufficient disk space for model weights and container images
## Next Steps
- **Production Deployment**: For multi-node deployments, see the [Multi-node Guide](../../../examples/basics/multinode/README.md)
- **Advanced Configuration**: Explore TensorRT-LLM engine building options for further optimization
- **Monitoring**: Set up Prometheus and Grafana for production monitoring
- **Performance Benchmarking**: Use AIPerf to measure and optimize your deployment performance
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# KV Cache Transfer in Disaggregated Serving
In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:
## Using NIXL for KV Cache Transfer
Start the disaggregated service: See [Disaggregated Serving](./README.md#disaggregated) to learn how to start the deployment.
## Default Method: NIXL
By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as backend for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
### Specify Backends for NIXL
TensorRT-LLM supports two NIXL communication backends: UCX and LIBFABRIC. By default, UCX is used if no backend is explicitly specified. Dynamo currently only supports the UCX backend, as LIBFABRIC support is still a work in progress. Please do not change the NIXL backend in the Dynamo runtime image.
## Alternative Method: UCX
TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. To enable UCX as the KV cache transfer backend, set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.
> [!Note]
> The environment variable `TRTLLM_USE_UCX_KVCACHE=1` with `cache_transceiver_config.backend: DEFAULT` does not enable UCX. You must explicitly set `backend: UCX` in the configuration.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Llama 4 Maverick Instruct with Eagle Speculative Decoding on SLURM
This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](./multinode/multinode-examples.md) to set up the environment for the following scenarios:
- **Aggregated Serving:**
Deploy the entire Llama 4 model on a single GB200x4 node for end-to-end serving.
- **Disaggregated Serving:**
Distribute the workload across two GB200x4 nodes:
- One node runs the decode worker.
- The other node runs the prefill worker.
## Notes
* Make sure the (`eagle3_one_model: true`) is set in the LLM API config inside the `examples/backends/trtllm/engine_configs/llama4/eagle` folder.
## Setup
Assuming you have already allocated your nodes via `salloc`, and are
inside an interactive shell on one of the allocated nodes, set the
following environment variables based:
```bash
cd $DYNAMO_HOME/examples/backends/trtllm
export IMAGE="<dynamo_trtllm_image>"
# export MOUNTS="${PWD}/:/mnt,/lustre:/lustre"
export MOUNTS="${PWD}/:/mnt"
export MODEL_PATH="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
export SERVED_MODEL_NAME="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
```
See [this](./multinode/multinode-examples.md#setup) section from multinode guide to learn more about the above options.
## Aggregated Serving
```bash
export NUM_NODES=1
export ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/eagle/eagle_agg.yml"
./multinode/srun_aggregated.sh
```
## Disaggregated Serving
```bash
export NUM_PREFILL_NODES=1
export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/eagle/eagle_prefill.yml"
export NUM_DECODE_NODES=1
export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4/eagle/eagle_decode.yml"
./multinode/srun_disaggregated.sh
```
## Example Request
See [here](./multinode/multinode-examples.md#example-request) to learn how to send a request to the deployment.
```
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8",
"messages": [{"role": "user", "content": "Why is NVIDIA a great company?"}],
"max_tokens": 1024
}' -w "\n"
# output:
{"id":"cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8","choices":[{"text":"NVIDIA is considered a great company for several reasons:\n\n1. **Technological Innovation**: NVIDIA is a leader in the field of graphics processing units (GPUs) and has been at the forefront of technological innovation.
...
and the broader tech industry.\n\nThese factors combined have contributed to NVIDIA's status as a great company in the technology sector.","index":0,"logprobs":null,"finish_reason":"stop"}],"created":1753329671,"model":"nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8","system_fingerprint":null,"object":"text_completion","usage":{"prompt_tokens":16,"completion_tokens":562,"total_tokens":578,"prompt_tokens_details":null,"completion_tokens_details":null}}
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Example: Multi-node TRTLLM Workers with Dynamo on Slurm
> **Note:** The scripts referenced in this example (such as `srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
To run a single Dynamo+TRTLLM Worker that spans multiple nodes (ex: TP16),
the set of nodes need to be launched together in the same MPI world, such as
via `mpirun` or `srun`. This is true regardless of whether the worker is
aggregated, prefill-only, or decode-only.
In this document we will demonstrate two examples launching multinode workers
on a slurm cluster with `srun`:
1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16
worker across 4 GB200 nodes
2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node
TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode
worker (4 nodes) across a total of 8 GB200 nodes.
NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and
`start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or
using `mpirun` directly, with relative ease.
## Setup
For simplicity of the example, we will make some assumptions about your slurm cluster:
1. First, we assume you have access to a slurm cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance
testing, you should aim to allocate groups of nodes that are performantly
inter-connected, such as those in an NVL72 setup.
2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this
example will use `srun` arguments like `--container-image`,
`--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
If your cluster supports similar container based plugins, you may be able to
modify the script to use that instead.
3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as
described [here](https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container).
This is the image that can be set to the `IMAGE` environment variable in later steps.
4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
will allocate 8 nodes below as a reference command to have enough capacity
to run both examples. If you plan to only run the aggregated example, you
will only need 4 nodes. If you customize the configurations to require a
different number of nodes, you can adjust the number of allocated nodes
accordingly. Pre-allocating nodes is technically not a requirement,
but it makes iterations of testing/experimenting easier.
Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
```bash
# Set partition manually based on your slurm cluster's partition names
PARTITION=""
# Set account manually if this command doesn't work on your cluster
ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami) format=account)"
salloc \
--partition="${PARTITION}" \
--account="${ACCOUNT}" \
--job-name="${ACCOUNT}-dynamo.trtllm" \
-t 05:00:00 \
--nodes 8
```
5. Lastly, we will assume you are inside an interactive shell on one of your allocated
nodes, which may be the default behavior after executing the `salloc` command above
depending on the cluster setup. If not, then you should SSH into one of the allocated nodes.
### Environment Variable Setup
This example aims to automate as much of the environment setup as possible,
but all slurm clusters and environments are different, and you may need to
dive into the scripts to make modifications based on your specific environment.
Assuming you have already allocated your nodes via `salloc`, and are
inside an interactive shell on one of the allocated nodes, set the
following environment variables based:
```bash
# NOTE: IMAGE must be set manually for now
# To build an iamge, see the steps here:
# https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container
export IMAGE="<dynamo_trtllm_image>"
# MOUNTS are the host:container path pairs that are mounted into the containers
# launched by each `srun` command.
#
# If you want to reference files, such as $MODEL_PATH below, in a
# different location, you can customize MOUNTS or specify additional
# comma-separated mount pairs here.
#
# NOTE: Currently, this example assumes that the local bash scripts and configs
# referenced are mounted into into /mnt inside the container. If you want to
# customize the location of the scripts, make sure to modify `srun_aggregated.sh`
# accordingly for the new locations of `start_frontend_services.sh` and
# `start_trtllm_worker.sh`.
#
# For example, assuming your cluster had a `/lustre` directory on the host, you
# could add that as a mount like so:
#
# export MOUNTS="${PWD}/../../../../:/mnt,/lustre:/lustre"
export MOUNTS="${PWD}/../../../../:/mnt"
# NOTE: In general, Deepseek R1 is very large, so it is recommended to
# pre-download the model weights and save them in some shared location,
# NFS storage, HF_HOME, etc. and modify the `--model-path` below
# to reuse the pre-downloaded weights instead.
#
# On Blackwell systems (ex: GB200), it is recommended to use the FP4 weights:
# https://huggingface.co/nvidia/DeepSeek-R1-FP4
#
# On Hopper systems, FP4 isn't supported so you'll need to use the default weights:
# https://huggingface.co/deepseek-ai/DeepSeek-R1
export MODEL_PATH="nvidia/DeepSeek-R1-FP4"
# The name the model will be served/queried under, matching what's
# returned by the /v1/models endpoint.
#
# By default this is inferred from MODEL_PATH, but when using locally downloaded
# model weights, it can be nice to have explicit control over the name.
export SERVED_MODEL_NAME="nvidia/DeepSeek-R1-FP4"
```
## Aggregated WideEP
Assuming you have at least 4 nodes allocated following the setup steps above,
follow these steps below to launch an **aggregated** deployment across 4 nodes:
```bash
# Default set in srun_aggregated.sh, but can customize here.
# export ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/agg/wide_ep/wide_ep_agg.yaml"
# Customize NUM_NODES to match the desired parallelism in ENGINE_CONFIG
# The product of NUM_NODES*NUM_GPUS_PER_NODE should match the number of
# total GPUs necessary to satisfy the requested parallelism. For example,
# 4 nodes x 4 gpus/node = 16 gpus total for TP16/EP16.
# export NUM_NODES=4
# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
# export NUM_GPUS_PER_NODE=4
# Launches:
# - frontend + etcd/nats on current (head) node
# - one large aggregated trtllm worker across multiple nodes via MPI tasks
./srun_aggregated.sh
```
## Disaggregated WideEP
Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode)
following the setup above, follow these steps below to launch a **disaggregated**
deployment across 8 nodes:
> [!Tip]
> Make sure you have a fresh environment and don't still have the aggregated
> example above still deployed on the same set of nodes.
```bash
# Defaults set in srun_disaggregated.sh, but can customize here.
# export PREFILL_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/wide_ep/wide_ep_prefill.yaml"
# export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/deepseek-r1/disagg/wide_ep/wide_ep_decode.yaml"
# Customize NUM_PREFILL_NODES to match the desired parallelism in PREFILL_ENGINE_CONFIG
# Customize NUM_DECODE_NODES to match the desired parallelism in DECODE_ENGINE_CONFIG
# The products of NUM_PREFILL_NODES*NUM_GPUS_PER_NODE and
# NUM_DECODE_NODES*NUM_GPUS_PER_NODE should match the respective number of
# GPUs necessary to satisfy the requested parallelism in each config.
# export NUM_PREFILL_NODES=4
# export NUM_DECODE_NODES=4
# GB200 nodes have 4 gpus per node, but for other types of nodes you can configure this.
# export NUM_GPUS_PER_NODE=4
# Launches:
# - frontend + etcd/nats on current (head) node.
# - one large prefill trtllm worker across multiple nodes via MPI tasks
# - one large decode trtllm worker across multiple nodes via MPI tasks
./srun_disaggregated.sh
```
> [!Tip]
> To launch multiple replicas of the configured prefill/decode workers, you can set
> NUM_PREFILL_WORKERS and NUM_DECODE_WORKERS respectively (default: 1).
## Understanding the Output
1. The `srun_aggregated.sh` launches two `srun` jobs. The first launches
etcd, NATS, and the OpenAI frontend on the head node only
called "node1" in the example output below. The second launches
a single TP16 Dynamo+TRTLLM worker spread across 4 nodes, each node
using 4 GPUs each.
```
# Frontend/etcd/nats services
srun: launching StepId=453374.17 on host node1, 1 tasks: 0
...
# TP16 TRTLLM worker split across 4 nodes with 4 gpus each
srun: launching StepId=453374.18 on host node1, 4 tasks: [0-3]
srun: launching StepId=453374.18 on host node2, 4 tasks: [4-7]
srun: launching StepId=453374.18 on host node3, 4 tasks: [8-11]
srun: launching StepId=453374.18 on host node4, 4 tasks: [12-15]
```
2. The OpenAI frontend will listen for and dynamically discover workers as
they register themselves with Dynamo's distributed runtime:
```
0: 2025-06-13T02:36:48.160Z INFO dynamo_run::input::http: Watching for remote model at models
0: 2025-06-13T02:36:48.161Z INFO dynamo_llm::http::service::service_v2: Starting HTTP service on: 0.0.0.0:8000 address="0.0.0.0:8000"
```
3. The TRTLLM worker will consist of N (N=16 for TP16) MPI ranks, 1 rank on each
GPU on each node, which will each output their progress while loading the model.
You can see each rank's output prefixed with the rank at the start of each log line
until the model succesfully finishes loading:
```
8: rank8 run mgmn worker node with mpi_world_size: 16 ...
10: rank10 run mgmn worker node with mpi_world_size: 16 ...
9: rank9 run mgmn worker node with mpi_world_size: 16 ...
11: rank11 run mgmn worker node with mpi_world_size: 16 ...
...
15: Model init total -- 55.42s
11: Model init total -- 55.91s
12: Model init total -- 55.24s
```
4. After the model fully finishes loading on all ranks, the worker will register itself,
and the OpenAI frontend will detect it, signaled by this output:
```
0: 2025-06-13T02:46:35.040Z INFO dynamo_llm::discovery::watcher: added model model_name="nvidia/DeepSeek-R1-FP4"
```
5. At this point, with the worker fully initialized and detected by the frontend,
it is now ready for inference.
6. For `srun_disaggregated.sh`, it follows a very similar flow, but instead launches
three srun jobs instead of two. One for frontend, one for prefill worker,
and one for decode worker.
## Example Request
To verify the deployed model is working, send a `curl` request:
```bash
# NOTE: $HOST assumes running on head node, but can be changed to $HEAD_NODE_IP instead.
HOST=localhost
PORT=8000
# "model" here should match the model name returned by the /v1/models endpoint
curl -w "%{http_code}" ${HOST}:${PORT}/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "'${SERVED_MODEL_NAME}'",
"messages": [
{
"role": "user",
"content": "Tell me a story as if we were playing dungeons and dragons."
}
],
"stream": true,
"max_tokens": 30
}'
```
## Cleanup
To cleanup background `srun` processes launched by `srun_aggregated.sh` or
`srun_disaggregated.sh`, you can run:
```bash
pkill srun
```
## Known Issues
- This example has only been tested on a 4xGB200 node setup with 16 GPUs using
FP4 weights. In theory, the example should work on alternative setups such as
H100 nodes with FP8 weights, but this hasn't been tested yet.
- WideEP configs in this directory are still being tested. A WideEP specific
example with documentation will be added once ready.
- There are known issues where WideEP workers may not cleanly shut down:
- This may lead to leftover shared memory files in `/dev/shm/moe_*`. For
now, you must manually clean these up before deploying again on the
same set of nodes.
- Similarly, there may be GPU memory left in-use after killing the `srun`
jobs. After cleaning up any leftover shared memory files as described
above, the GPU memory may slowly come back. You can run `watch nvidia-smi`
to check on this behavior. If you don't free the GPU memory before the
next deployment, you may get a CUDA OOM error while loading the model.
- There is mention of this issue in the relevant TRT-LLM blog
[here](https://github.com/NVIDIA/TensorRT-LLM/blob/6021a439ab9c29f4c46f721eeb59f6b992c425ea/docs/source/blogs/tech_blog/blog4_Scaling_Expert_Parallelism_in_TensorRT-LLM.md#miscellaneous).
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# LLM Deployment using vLLM
This directory contains reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
## Use the Latest Release
We recommend using the latest stable release of Dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
---
## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#run-single-node-examples)
- [Advanced Examples](#advanced-examples)
- [Deploy on Kubernetes](#kubernetes-deployment)
- [Configuration](#configuration)
## Feature Support Matrix
### Core Dynamo Features
| Feature | vLLM | Notes |
|---------|------|-------|
| [**Disaggregated Serving**](../../../docs/design_docs/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../../docs/design_docs/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../components/router/README.md) | ✅ | |
| [**SLA-Based Planner**](../../../docs/components/planner/planner_guide.md) | ✅ | |
| [**Load Based Planner**](../../../docs/components/planner/README.md) | 🚧 | WIP |
| [**KVBM**](../../../docs/components/kvbm/README.md) | ✅ | |
| [**LMCache**](../../integrations/lmcache_integration.md) | ✅ | |
| [**Prompt Embeddings**](./prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag |
### Large Scale P/D and WideEP Features
| Feature | vLLM | Notes |
|--------------------|------|-----------------------------------------------------------------------|
| **WideEP** | ✅ | Support for PPLX / DeepEP not verified |
| **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Container functional on main |
## vLLM Quick Start
Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
### Start Infrastructure Services (Local Development Only)
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):
```bash
docker compose -f deploy/docker-compose.yml up -d
```
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
### Pull or build container
We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
```bash
python container/render.py --framework=vllm --target=runtime --output-short-filename
docker build -t dynamo:vllm-latest -f container/rendered.Dockerfile .
```
### Run container
```bash
./container/run.sh -it --framework VLLM [--mount-workspace]
```
This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks.
## Run Single Node Examples
> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
### Aggregated Serving
```bash
# requires one gpu
cd examples/backends/vllm
bash launch/agg.sh
```
### Aggregated Serving with KV Routing
```bash
# requires two gpus
cd examples/backends/vllm
bash launch/agg_router.sh
```
### Disaggregated Serving
```bash
# requires two gpus
cd examples/backends/vllm
bash launch/disagg.sh
```
### Disaggregated Serving with KV Routing
```bash
# requires three gpus
cd examples/backends/vllm
bash launch/disagg_router.sh
```
### Single Node Data Parallel Attention / Expert Parallelism
This example is not meant to be performant but showcases Dynamo routing to data parallel workers
```bash
# requires four gpus
cd examples/backends/vllm
bash launch/dep.sh
```
> [!TIP]
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
## Advanced Examples
Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!
### Speculative Decoding with Aggregated Serving (Meta-Llama-3.1-8B-Instruct + Eagle3)
Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model using **aggregated speculative decoding** on a single node.
This setup demonstrates how to use Dynamo to create an instance using Eagle-based speculative decoding under the **VLLM aggregated serving framework** for faster inference while maintaining accuracy.
**Guide:** [Speculative Decoding Quickstart](../../features/speculative_decoding/speculative_decoding_vllm.md)
> **See also:** [Speculative Decoding Feature Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.
### Kubernetes Deployment
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](../../../examples/backends/vllm/deploy/README.md)
## Configuration
vLLM workers are configured through command-line arguments. Key parameters include:
- `--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
- `--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
- `--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
- `--connector`: Specify which kv_transfer_config you want vllm to use `[nixl, lmcache, kvbm, none]`. This is a helper flag which overwrites the engines KVTransferConfig.
- `--enable-prompt-embeds`: **Enable prompt embeddings feature** (opt-in, default: disabled)
- **Required for:** Accepting pre-computed prompt embeddings via API
- **Default behavior:** Prompt embeddings DISABLED - requests with `prompt_embeds` will fail
- **Error without flag:** `ValueError: You must set --enable-prompt-embeds to input prompt_embeds`
See `args.py` for the full list of configuration options and their defaults.
The [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the vLLM CLI args points to running 'vllm serve --help' to see what CLI args can be added. We use the same argument parser as vLLM.
### Hashing Consistency for KV Events
When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:
- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's builtin hashing for prefix caching.
- If your vLLM version supports it, configure a deterministic prefix caching algorithm, for example:
```bash
vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
```
See the high-level notes in [Router Design](../../design_docs/router_design.md#deterministic-event-ids) on deterministic event IDs.
## Request Migration
Dynamo supports [request migration](../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../../docs/fault_tolerance/request_migration.md) documentation for configuration details.
## Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
### Cancellation Support Matrix
| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |
For more details, see the [Request Cancellation Architecture](../../../docs/fault_tolerance/request_cancellation.md) documentation.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running Deepseek R1 with Wide EP
Dynamo supports running Deepseek R1 with data parallel attention and wide expert parallelism. Each data parallel attention rank is a separate dynamo component that will emit its own KV Events and Metrics. vLLM controls the expert parallelism using the flag `--enable-expert-parallel`
## Instructions
The following script can be adapted to run Deepseek R1 with a variety of different configuration. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [ReadMe](README.md) Getting Started section on each node, and then run these two commands.
node 0
```bash
./launch/dsr1_dep.sh --num-nodes 2 --node-rank 0 --gpus-per-node 8 --master-addr <node 0 addr>
```
node 1
```bash
./launch/dsr1_dep.sh --num-nodes 2 --node-rank 1 --gpus-per-node 8 --master-addr <node 0 addr>
```
### Testing the Deployment
On node 0 (where the frontend was started) send a test request to verify your deployment:
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": false,
"max_tokens": 30
}'
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running gpt-oss-120b Disaggregated with vLLM
Dynamo supports disaggregated serving of gpt-oss-120b with vLLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single H100 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
## Overview
This deployment uses disaggregated serving in vLLM where:
- **Prefill Worker**: Processes input prompts efficiently using 4 GPUs with tensor parallelism
- **Decode Worker**: Generates output tokens using 4 GPUs, optimized for token generation throughput
- **Frontend**: Provides OpenAI-compatible API endpoint with round-robin routing
## Prerequisites
This guide assumes readers already knows how to deploy Dynamo disaggregated serving with vLLM as illustrated in [README.md](/docs/backends/vllm/README.md)
## Instructions
### 1. Launch the Deployment
Note that GPT-OSS is a reasoning model with tool calling support. To
ensure the response is being processed correctly, the worker should be
launched with proper `--dyn-reasoning-parser` and `--dyn-tool-call-parser`.
**Start frontend**
```bash
python3 -m dynamo.frontend --http-port 8000 &
```
**Run decode worker**
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m dynamo.vllm \
--model openai/gpt-oss-120b \
--tensor-parallel-size 4 \
--dyn-reasoning-parser gpt_oss \
--dyn-tool-call-parser harmony
```
**Run prefill workers**
```bash
CUDA_VISIBLE_DEVICES=4,5,6,7 python -m dynamo.vllm \
--model openai/gpt-oss-120b \
--tensor-parallel-size 4 \
--is-prefill-worker \
--dyn-reasoning-parser gpt_oss \
--dyn-tool-call-parser harmony
```
### 2. Verify the Deployment is Ready
Poll the `/health` endpoint to verify that both the prefill and decode worker endpoints have started:
```
curl http://localhost:8000/health
```
Make sure that both of the `generate` endpoints are available before sending an inference request:
```
{
"status": "healthy",
"endpoints": [
"dyn://dynamo.backend.generate"
],
"instances": [
{
"component": "backend",
"endpoint": "generate",
"namespace": "dynamo",
"instance_id": 7587889712474989333,
"transport": {
"nats_tcp": "dynamo_backend.generate-694d997dbae9a315"
}
},
{
"component": "prefill",
"endpoint": "generate",
"namespace": "dynamo",
"instance_id": 7587889712474989350,
"transport": {
"nats_tcp": "dynamo_prefill.generate-694d997dbae9a326"
}
},
...
]
}
```
If only one worker endpoint is listed, the other may still be starting up. Monitor the worker logs to track startup progress.
### 3. Test the Deployment
Send a test request to verify the deployment:
```bash
curl -X POST http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{
"model": "openai/gpt-oss-120b",
"input": "Explain the concept of disaggregated serving in LLM inference in 3 sentences.",
"max_output_tokens": 200,
"stream": false
}'
```
The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters like `max_tokens`, `temperature`, and others according to your needs.
### 4. Reasoning and Tool Calling
Dynamo has supported reasoning and tool calling in OpenAI Chat Completion endpoint. A typical workflow for application built on top of Dynamo
is that the application has a set of tools to aid the assistant provide accurate answer, and it is ususally
multi-turn as it involves tool selection and generation based on the tool result. Below is an example
of sending multi-round requests to complete a user query with reasoning and tool calling:
**Application setup (pseudocode)**
```Python
# The tool defined by the application
def get_system_health():
for component in system.components:
if not component.health():
return False
return True
# The JSON representation of the declaration in ChatCompletion tool style
tool_choice = '{
"type": "function",
"function": {
"name": "get_system_health",
"description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
"parameters": {
"type": "object",
"properties": {}
}
}
}'
# On user query, perform below workflow.
def user_query(app_request):
# first round
# create chat completion with prompt and tool choice
request = ...
response = send(request)
if response["finish_reason"] == "tool_calls":
# second round
function, params = parse_tool_call(response)
function_result = function(params)
# create request with prompt, assistant response, and function result
request = ...
response = send(request)
return app_response(response)
```
**First request with tools**
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
{
"model": "openai/gpt-oss-120b",
"messages": [
{
"role": "user",
"content": "Hey, quick check: is everything up and running?"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_system_health",
"description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
"parameters": {
"type": "object",
"properties": {}
}
}
}
],
"response_format": {
"type": "text"
},
"stream": false,
"max_tokens": 300
}'
```
**First response with tool choice**
```JSON
{
"id": "chatcmpl-d1c12219-6298-4c83-a6e3-4e7cef16e1a9",
"choices": [
{
"index": 0,
"message": {
"tool_calls": [
{
"id": "call-1",
"type": "function",
"function": {
"name": "get_system_health",
"arguments": "{}"
}
}
],
"role": "assistant",
"reasoning_content": "We need to check system health. Use function."
},
"finish_reason": "tool_calls"
}
],
"created": 1758758741,
"model": "openai/gpt-oss-120b",
"object": "chat.completion",
"usage": null
}
```
**Second request with tool calling result**
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '
{
"model": "openai/gpt-oss-120b",
"messages": [
{
"role": "user",
"content": "Hey, quick check: is everything up and running?"
},
{
"role": "assistant",
"tool_calls": [
{
"id": "call-1",
"type": "function",
"function": {
"name": "get_system_health",
"arguments": "{}"
}
}
]
},
{
"role": "tool",
"tool_call_id": "call-1",
"content": "{\"status\":\"ok\",\"uptime_seconds\":372045}"
}
],
"tools": [
{
"type": "function",
"function": {
"name": "get_system_health",
"description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
"parameters": {
"type": "object",
"properties": {}
}
}
}
],
"response_format": {
"type": "text"
},
"stream": false,
"max_tokens": 300
}'
```
**Second response with final message**
```JSON
{
"id": "chatcmpl-9ebfe64a-68b9-4c1d-9742-644cf770ad0e",
"choices": [
{
"index": 0,
"message": {
"content": "All systems are green—everything’s up and running smoothly! 🚀 Let me know if you need anything else.",
"role": "assistant",
"reasoning_content": "The user asks: \"Hey, quick check: is everything up and running?\" We have just checked system health, it's ok. Provide friendly response confirming everything's up."
},
"finish_reason": "stop"
}
],
"created": 1758758853,
"model": "openai/gpt-oss-120b",
"object": "chat.completion",
"usage": null
}
```
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Multi-node Examples
This guide covers deploying vLLM across multiple nodes using Dynamo's distributed capabilities.
## Prerequisites
Multi-node deployments require:
- Multiple nodes with GPU resources
- Network connectivity between nodes (faster the better)
- Firewall rules allowing NATS/ETCD communication
## Infrastructure Setup
### Step 1: Start NATS/ETCD on Head Node
Start the required services on your head node. These endpoints must be accessible from all worker nodes:
```bash
# On head node (node-1)
docker compose -f deploy/docker-compose.yml up -d
```
Default ports:
- NATS: 4222
- ETCD: 2379
### Step 2: Configure Environment Variables
Set the head node IP address and service endpoints. **Set this on all nodes** for easy copy-paste:
```bash
# Set this on ALL nodes - replace with your actual head node IP
export HEAD_NODE_IP="<your-head-node-ip>"
# Service endpoints (set on all nodes)
export NATS_SERVER="nats://${HEAD_NODE_IP}:4222"
export ETCD_ENDPOINTS="${HEAD_NODE_IP}:2379"
```
## Deployment Patterns
### Multi-node Aggregated Serving
Deploy vLLM workers across multiple nodes for horizontal scaling:
**Node 1 (Head Node)**: Run ingress and first worker
```bash
# Start ingress
python -m dynamo.frontend --router-mode kv
# Start vLLM worker
python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 8 \
--enforce-eager
```
**Node 2**: Run additional worker
```bash
# Start vLLM worker
python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \
--tensor-parallel-size 8 \
--enforce-eager
```
### Multi-node Disaggregated Serving
Deploy prefill and decode workers on separate nodes for optimized resource utilization:
**Node 1**: Run ingress and decode worker
```bash
# Start ingress
python -m dynamo.frontend --router-mode kv &
# Start prefill worker
python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct
--tensor-parallel-size 8 \
--enforce-eager
```
**Node 2**: Run prefill worker
```bash
# Start decode worker
python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct
--tensor-parallel-size 8 \
--enforce-eager \
--is-prefill-worker
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# vLLM Prometheus Metrics
## Overview
When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both vLLM engine metrics (prefixed with `vllm:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
**For the complete and authoritative list of all vLLM metrics**, always refer to the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).
**For LMCache metrics and integration**, see the [LMCache Integration Guide](../../integrations/lmcache_integration.md).
**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
**For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana.md).
## Environment Variables and Flags
| Variable/Flag | Description | Default | Example |
|---------------|-------------|---------|---------|
| `DYN_SYSTEM_PORT` | System metrics/health port. Required to expose `/metrics` endpoint. | `-1` (disabled) | `8081` |
| `--connector` | KV connector to use. Use `lmcache` to enable LMCache metrics. | `nixl` | `--connector lmcache` |
## Getting Started Quickly
This is a single machine example.
### Start Observability Stack
For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README.md#getting-started-quickly) for instructions.
### Launch Dynamo Components
Launch a frontend and vLLM backend to test metrics:
```bash
# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
$ python -m dynamo.frontend
# Enable system metrics server on port 8081
$ DYN_SYSTEM_PORT=8081 python -m dynamo.vllm --model <model_name> \
--enforce-eager --no-enable-prefix-caching --max-num-seqs 3
```
Wait for the vLLM worker to start, then send requests and check metrics:
```bash
# Send a request
curl -H 'Content-Type: application/json' \
-d '{
"model": "<model_name>",
"max_completion_tokens": 100,
"messages": [{"role": "user", "content": "Hello"}]
}' \
http://localhost:8000/v1/chat/completions
# Check metrics from the worker
curl -s localhost:8081/metrics | grep "^vllm:"
```
## Exposed Metrics
vLLM exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All vLLM engine metrics use the `vllm:` prefix and include labels (e.g., `model_name`, `finished_reason`, `scheduling_event`) to identify the source.
**Example Prometheus Exposition Format text:**
```
# HELP vllm:request_success_total Number of successfully finished requests.
# TYPE vllm:request_success_total counter
vllm:request_success_total{finished_reason="length",model_name="meta-llama/Llama-3.1-8B"} 15.0
vllm:request_success_total{finished_reason="stop",model_name="meta-llama/Llama-3.1-8B"} 150.0
# HELP vllm:time_to_first_token_seconds Histogram of time to first token in seconds.
# TYPE vllm:time_to_first_token_seconds histogram
vllm:time_to_first_token_seconds_bucket{le="0.001",model_name="meta-llama/Llama-3.1-8B"} 0.0
vllm:time_to_first_token_seconds_bucket{le="0.005",model_name="meta-llama/Llama-3.1-8B"} 5.0
vllm:time_to_first_token_seconds_count{model_name="meta-llama/Llama-3.1-8B"} 165.0
vllm:time_to_first_token_seconds_sum{model_name="meta-llama/Llama-3.1-8B"} 89.38
```
**Note:** The specific metrics shown above are examples and may vary depending on your vLLM version. Always inspect your actual `/metrics` endpoint or refer to the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) for the current list.
### Metric Categories
vLLM provides metrics in the following categories (all prefixed with `vllm:`):
- **Request metrics** - Request success, failure, and completion tracking
- **Performance metrics** - Latency, throughput, and timing measurements
- **Resource usage** - System resource consumption
- **Scheduler metrics** - Scheduling and queue management
- **Disaggregation metrics** - Metrics specific to disaggregated deployments (when enabled)
**Note:** Specific metrics are subject to change between vLLM versions. Always refer to the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) or inspect the `/metrics` endpoint for your vLLM version.
## Available Metrics
The official vLLM documentation includes complete metric definitions with:
- Detailed explanations and design rationale
- Counter, Gauge, and Histogram metric types
- Metric labels (e.g., `model_name`, `finished_reason`, `scheduling_event`)
- Information about v1 metrics migration
- Future work and deprecated metrics
For the complete and authoritative list of all vLLM metrics, see the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).
## LMCache Metrics
When LMCache is enabled with `--connector lmcache` and `DYN_SYSTEM_PORT` is set, LMCache metrics (prefixed with `lmcache:`) are automatically exposed via Dynamo's `/metrics` endpoint alongside vLLM and Dynamo metrics.
### Minimum Requirements
To access LMCache metrics, both of these are required:
1. `--connector lmcache` - Enables LMCache in vLLM
2. `DYN_SYSTEM_PORT=8081` - Enables Dynamo's metrics HTTP endpoint
**Example:**
```bash
DYN_SYSTEM_PORT=8081 \
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --connector lmcache
```
### Viewing LMCache Metrics
```bash
# View all LMCache metrics
curl -s localhost:8081/metrics | grep "^lmcache:"
```
### Troubleshooting
Troubleshooting LMCache-related metrics and logs (including `PrometheusLogger instance already created with different metadata` and `PROMETHEUS_MULTIPROC_DIR` warnings) is documented in:
- [LMCache Integration Guide](../../integrations/lmcache_integration.md#troubleshooting)
**For complete LMCache configuration and metric details**, see:
- [LMCache Integration Guide](../../integrations/lmcache_integration.md) - Setup and configuration
- [LMCache Observability Documentation](https://docs.lmcache.ai/production/observability/vllm_endpoint.html) - Complete metrics reference
## Implementation Details
- vLLM v1 uses multiprocess metrics collection via `prometheus_client.multiprocess`
- `PROMETHEUS_MULTIPROC_DIR`: (optional). By default, Dynamo automatically manages this environment variable, setting it to a temporary directory where multiprocess metrics are stored as memory-mapped files. Each worker process writes its metrics to separate files in this directory, which are aggregated when `/metrics` is scraped. Users only need to set this explicitly where complete control over the metrics directory is required.
- Dynamo uses `MultiProcessCollector` to aggregate metrics from all worker processes
- Metrics are filtered by the `vllm:` and `lmcache:` prefixes before being exposed (when LMCache is enabled)
- The integration uses Dynamo's `register_engine_metrics_callback()` function with the global `REGISTRY`
- Metrics appear after vLLM engine initialization completes
- vLLM v1 metrics are different from v0 - see the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) for migration details
## Related Documentation
### vLLM Metrics
- [Official vLLM Metrics Design Documentation](https://docs.vllm.ai/en/latest/design/metrics.html)
- [vLLM Production Metrics User Guide](https://docs.vllm.ai/en/latest/usage/metrics.html)
- [vLLM GitHub - Metrics Implementation](https://github.com/vllm-project/vllm/tree/main/vllm/v1/metrics)
### Dynamo Metrics
- [Dynamo Metrics Guide](../../observability/metrics.md) - Complete documentation on Dynamo runtime metrics
- [Prometheus and Grafana Setup](../../observability/prometheus-grafana.md) - Visualization setup instructions
- Dynamo runtime metrics (prefixed with `dynamo_*`) are available at the same `/metrics` endpoint alongside vLLM metrics
- Implementation: `lib/runtime/src/metrics.rs` (Rust runtime metrics)
- Metric names: `lib/runtime/src/metrics/prometheus_names.rs` (metric name constants)
- Integration code: `components/src/dynamo/common/utils/prometheus.py` - Prometheus utilities and callback registration
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Prompt Embeddings
Dynamo supports prompt embeddings (also known as prompt embeds) as a secure alternative input method to traditional text prompts. By allowing applications to use pre-computed embeddings for inference, this feature not only offers greater flexibility in prompt engineering but also significantly enhances privacy and data security. With prompt embeddings, sensitive user data can be transformed into embeddings before ever reaching the inference server, reducing the risk of exposing confidential information during the AI workflow.
## How It Works
| Path | What Happens |
|------|--------------|
| **Text prompt** | Tokenize → Embedding Layer → Transformer |
| **Prompt embeds** | Validate → Bypass Embedding → Transformer |
## Architecture
```mermaid
flowchart LR
subgraph FE["Frontend (Rust)"]
A[Request] --> B{prompt_embeds?}
B -->|No| C[🔴 Tokenize text]
B -->|Yes| D[🟢 Validate base64+size]
C --> E[token_ids, ISL=N]
D --> F[token_ids=empty, skip ISL]
end
subgraph RT["Router (NATS)"]
G[Route PreprocessedRequest]
end
subgraph WK["Worker (Python)"]
H[TokensPrompt#40;token_ids#41;]
I[Decode → EmbedsPrompt#40;tensor#41;]
end
subgraph VLLM["vLLM Engine"]
J[🔴 Embedding Layer]
K[🟢 Bypass Embedding]
L[Transformer Layers]
M[LM Head → Response]
end
E --> G
F --> G
G -->|Normal| H
G -->|Embeds| I
H --> J --> L
I --> K --> L
L --> M
```
| Layer | **Normal Flow** | **Prompt Embeds** |
|---|---|---|
| **Frontend (Rust)** | 🔴 Tokenize text → token_ids, compute ISL | 🟢 Validate base64+size, skip tokenization |
| **Router (NATS)** | Forward token_ids in PreprocessedRequest | Forward prompt_embeds string |
| **Worker (Python)** | `TokensPrompt(token_ids)` | Decode base64 → `EmbedsPrompt(tensor)` |
| **vLLM Engine** | 🔴 Embedding Layer → Transformer | 🟢 Bypass Embedding → Transformer |
## Quick Start
Send pre-computed prompt embeddings directly to vLLM, bypassing tokenization.
### 1. Enable Feature
```bash
python -m dynamo.vllm --model <model-name> --enable-prompt-embeds
```
> **Required:** The `--enable-prompt-embeds` flag must be set or requests will fail.
### 2. Send Request
```python
import torch
import base64
import io
from openai import OpenAI
# Prepare embeddings (sequence_length, hidden_dim)
embeddings = torch.randn(10, 4096, dtype=torch.float32)
# Encode
buffer = io.BytesIO()
torch.save(embeddings, buffer)
buffer.seek(0)
embeddings_base64 = base64.b64encode(buffer.read()).decode()
# Send
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
prompt="", # Can be empty or present; prompt_embeds takes precedence
max_tokens=100,
extra_body={"prompt_embeds": embeddings_base64}
)
```
## Configuration
### Docker Compose
```yaml
vllm-worker:
command:
- python
- -m
- dynamo.vllm
- --model
- meta-llama/Meta-Llama-3.1-8B-Instruct
- --enable-prompt-embeds # Add this
```
### Kubernetes
```yaml
extraPodSpec:
mainContainer:
args:
- "--model"
- "meta-llama/Meta-Llama-3.1-8B-Instruct"
- "--enable-prompt-embeds" # Add this
```
### NATS Configuration
NATS needs 15MB payload limit (already configured in default deployments):
```yaml
# Docker Compose - deploy/docker-compose.yml
nats-server:
command: ["-js", "--trace", "-m", "8222", "--max_payload", "15728640"]
# Kubernetes - deploy/cloud/helm/platform/values.yaml
nats:
config:
merge:
max_payload: 15728640
```
## API Reference
### Request
```json
{
"model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
"prompt": "",
"prompt_embeds": "<base64-encoded-pytorch-tensor>",
"max_tokens": 100
}
```
**Requirements:**
- **Format:** PyTorch tensor serialized with `torch.save()` and base64-encoded
- **Size:** 100 bytes - 10MB (decoded)
- **Shape:** `(seq_len, hidden_dim)` or `(batch, seq_len, hidden_dim)`
- **Dtype:** `torch.float32` (recommended)
**Field Precedence:**
- Both `prompt` and `prompt_embeds` can be provided in the same request
- When both are present, **`prompt_embeds` takes precedence** and `prompt` is ignored
- The `prompt` field can be empty (`""`) when using `prompt_embeds`
### Response
Standard OpenAI format with accurate usage:
```json
{
"usage": {
"prompt_tokens": 10, // Extracted from embedding shape
"completion_tokens": 15,
"total_tokens": 25
}
}
```
## Errors
| Error | Fix |
|-------|-----|
| `ValueError: You must set --enable-prompt-embeds` | Add `--enable-prompt-embeds` to worker |
| `prompt_embeds must be valid base64` | Use `.decode('utf-8')` after `base64.b64encode()` |
| `decoded data must be at least 100 bytes` | Increase sequence length |
| `exceeds maximum size of 10MB` | Reduce sequence length |
| `must be a torch.Tensor` | Use `torch.save()` not NumPy |
| `size of tensor must match` | Use correct hidden dimension for model |
## Examples
### Streaming
```python
stream = client.completions.create(
model="meta-llama/Meta-Llama-3.1-8B-Instruct",
prompt="",
max_tokens=100,
stream=True,
extra_body={"prompt_embeds": embeddings_base64}
)
for chunk in stream:
if chunk.choices:
print(chunk.choices[0].text, end="", flush=True)
```
### Load from File
```python
embeddings = torch.load("embeddings.pt")
buffer = io.BytesIO()
torch.save(embeddings, buffer)
buffer.seek(0)
embeddings_base64 = base64.b64encode(buffer.read()).decode()
# Use in request...
```
## Limitations
- ❌ Requires `--enable-prompt-embeds` flag (disabled by default)
- ❌ PyTorch format only (NumPy not supported)
- ❌ 10MB decoded size limit
- ❌ Cannot mix with multimodal data (images/video)
## Testing
Comprehensive test coverage ensures reliability:
- **Unit Tests:** 31 tests (11 Rust + 20 Python)
- Validation, decoding, format handling, error cases, usage statistics
- **Integration Tests:** 21 end-to-end tests
- Core functionality, performance, formats, concurrency, usage statistics
Run integration tests:
```bash
# Start worker with flag
python -m dynamo.vllm --model Qwen/Qwen3-0.6B --enable-prompt-embeds
# Run tests
pytest tests/integration/test_prompt_embeds_integration.py -v
```
## See Also
- [vLLM Backend](README.md)
- [vLLM Configuration](README.md#configuration)
<!-- # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->
# Dynamo Benchmarking Guide
This benchmarking framework lets you compare performance across any combination of:
- **DynamoGraphDeployments**
- **External HTTP endpoints** (existing services deployed following standard documentation from vLLM, llm-d, AIBrix, etc.)
## Choosing Your Benchmarking Approach
Dynamo provides two benchmarking approaches to suit different use cases: **client-side** and **server-side**. Client-side refers to running benchmarks on your local machine and connecting to Kubernetes deployments via port-forwarding, while server-side refers to running benchmarks directly within the Kubernetes cluster using internal service URLs. Which method to use depends on your use case.
**TLDR:**
Need high performance/load testing? Server-side.
Just quick testing/comparison? Client-side.
### Use Client-Side Benchmarking When:
- You want to quickly test deployments
- You want immediate access to results on your local machine
- You're comparing external services or deployments (not necessarily just Dynamo deployments)
- You need to run benchmarks from your laptop/workstation
**[Go to Client-Side Benchmarking (Local)](#client-side-benchmarking-local)**
### Use Server-Side Benchmarking When:
- You have a development environment with kubectl access
- You're doing performance validation with high load/speed requirements
- You're experiencing timeouts or performance issues with client-side benchmarking
- You want optimal network performance (no port-forwarding overhead)
- You're running automated CI/CD pipelines
- You need isolated execution environments
- You're doing resource-intensive benchmarking
- You want persistent result storage in the cluster
**[Go to Server-Side Benchmarking (In-Cluster)](#server-side-benchmarking-in-cluster)**
### Quick Comparison
| Feature | Client-Side | Server-Side |
|---------|-------------|-------------|
| **Location** | Your local machine | Kubernetes cluster |
| **Network** | Port-forwarding required | Direct service DNS |
| **Setup** | Quick and simple | Requires cluster resources |
| **Performance** | Limited by local resources, may timeout under high load | Optimal cluster performance, handles high load |
| **Isolation** | Shared environment | Isolated job execution |
| **Results** | Local filesystem | Persistent volumes |
| **Best for** | Light load | High load |
## What This Tool Does
The framework is a Python-based wrapper around `aiperf` that:
- Benchmarks any HTTP endpoints
- Runs concurrency sweeps across configurable load levels
- Generates comparison plots with your custom labels
- Works with any HuggingFace-compatible model on NVIDIA GPUs (H200, H100, A100, etc.)
- Provides direct Python script execution for maximum flexibility
**Default sequence lengths**: Input: 2000 tokens, Output: 256 tokens (configurable with `--isl` and `--osl`)
**Important**: The `--model` parameter configures AIPerf for benchmarking and provides logging context. The default `--model` value in the benchmarking script is `Qwen/Qwen3-0.6B`, but it must match the model deployed at the endpoint(s).
---
# Client-Side Benchmarking (Local)
Client-side benchmarking runs on your local machine and connects to Kubernetes deployments via port-forwarding.
## Prerequisites
1. **Dynamo container environment** - You must be running inside a Dynamo container with the benchmarking tools pre-installed.
2. **HTTP endpoints** - Ensure you have HTTP endpoints available for benchmarking. These can be:
- DynamoGraphDeployments exposed via HTTP endpoints
- External services (vLLM, llm-d, AIBrix, etc.)
- Any HTTP endpoint serving HuggingFace-compatible models
3. **Benchmark dependencies** - Since benchmarks run locally, you need to install the required Python dependencies. Install them using:
```bash
pip install -r deploy/utils/requirements.txt
```
## User Workflow
Follow these steps to benchmark Dynamo deployments using client-side benchmarking:
### Step 1: Establish Kubernetes Cluster and Install Dynamo
Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform. First follow the [installation guide](/docs/kubernetes/installation_guide.md) to install Dynamo Kubernetes Platform, then use [deploy/utils/README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md) to set up benchmarking resources.
### Step 2: Deploy DynamoGraphDeployments
Deploy your DynamoGraphDeployments separately using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Each deployment should have a frontend service exposed.
### Step 3: Port-Forward and Benchmark Deployment A
```bash
# Port-forward the frontend service for deployment A
kubectl port-forward -n <namespace> svc/<frontend-service-name> 8000:8000 > /dev/null 2>&1 &
# Note: remember to stop the port-forward process after benchmarking.
# Benchmark deployment A using Python scripts
python3 -m benchmarks.utils.benchmark \
--benchmark-name deployment-a \
--endpoint-url http://localhost:8000 \
--model "your-model-name" \
--output-dir ./benchmarks/results
```
### Step 4: [If Comparative] Teardown Deployment A and Establish Deployment B
If comparing multiple deployments, teardown deployment A and deploy deployment B with a different configuration.
### Step 5: [If Comparative] Port-Forward and Benchmark Deployment B
```bash
# Port-forward the frontend service for deployment B
kubectl port-forward -n <namespace> svc/<frontend-service-name> 8001:8000 > /dev/null 2>&1 &
# Benchmark deployment B using Python scripts
python3 -m benchmarks.utils.benchmark \
--benchmark-name deployment-b \
--endpoint-url http://localhost:8001 \
--model "your-model-name" \
--output-dir ./benchmarks/results
```
### Step 6: Generate Summary and Visualization
```bash
# Generate plots and summary using Python plotting script
python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results
# Or plot only specific benchmark experiments
python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-name experiment-a --benchmark-name experiment-b
```
## Use Cases
The benchmarking framework supports various comparative analysis scenarios:
- **Compare multiple DynamoGraphDeployments of a single backend** (e.g., aggregated vs disaggregated configurations)
- **Compare different backends** (e.g., vLLM vs TensorRT-LLM vs SGLang)
- **Compare Dynamo vs other platforms** (e.g., Dynamo vs llm-d vs AIBrix)
- **Compare different models** (e.g., Llama-3-8B vs Llama-3-70B vs Qwen-3-0.6B)
- **Compare different hardware configurations** (e.g., H100 vs A100 vs H200)
- **Compare different parallelization strategies** (e.g., different GPU counts or memory configurations)
## Configuration and Usage
### Command Line Options
```bash
python3 -m benchmarks.utils.benchmark --benchmark-name <name> --endpoint-url <endpoint_url> [OPTIONS]
REQUIRED:
--benchmark-name NAME Name/label for this benchmark (used in plots and results)
--endpoint-url URL HTTP endpoint URL to benchmark (e.g., http://localhost:8000)
OPTIONS:
-h, --help Show help message and examples
-m, --model MODEL Model name for AIPerf configuration and logging (default: Qwen/Qwen3-0.6B)
NOTE: This must match the model deployed at the endpoint
-i, --isl LENGTH Input sequence length (default: 2000)
-s, --std STDDEV Input sequence standard deviation (default: 10)
-o, --osl LENGTH Output sequence length (default: 256)
-d, --output-dir DIR Output directory (default: ./benchmarks/results)
--verbose Enable verbose output
```
### Important Notes
- **Benchmark Name**: The benchmark name becomes the label in plots and results
- **Name Restrictions**: Names can only contain letters, numbers, hyphens, and underscores. The name `plots` is reserved.
- **Port-Forwarding**: You must have an exposed endpoint before benchmarking
- **Model Parameter**: The `--model` parameter configures AIPerf for testing and logging, and must match the model deployed at the endpoint
- **Sequential Benchmarking**: For comparative benchmarks, deploy and benchmark each configuration separately
### What Happens During Benchmarking
The Python benchmarking module:
1. **Connects** to your port-forwarded endpoint
2. **Benchmarks** using AIPerf at various concurrency levels (default: 1, 2, 5, 10, 50, 100, 250)
3. **Measures** key metrics: latency, throughput, time-to-first-token
4. **Saves** results to an output directory organized by benchmark name
The Python plotting module:
1. **Generates** comparison plots using your benchmark name in `<OUTPUT_DIR>/plots/`
2. **Creates** summary statistics and visualizations
### Plotting Options
The plotting script supports several options for customizing which experiments to visualize:
```bash
# Plot all benchmark experiments in the data directory
python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results
# Plot only specific benchmark experiments
python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --benchmark-name experiment-a --benchmark-name experiment-b
# Specify custom output directory for plots
python3 -m benchmarks.utils.plot --data-dir ./benchmarks/results --output-dir ./custom-plots
```
**Available Options:**
- `--data-dir`: Directory containing benchmark results (required)
- `--benchmark-name`: Specific benchmark experiment name to plot (can be specified multiple times). Names must match subdirectory names under the data dir.
- `--output-dir`: Custom output directory for plots (defaults to data-dir/plots)
**Note**: If `--benchmark-name` is not specified, the script will plot all subdirectories found in the data directory.
### Using Your Own Models and Configuration
The benchmarking framework supports any HuggingFace-compatible LLM model. Specify your model in the benchmark script's `--model` parameter. It must match the model name of the deployment. You can override the default sequence lengths (2000/256 tokens) with `--isl` and `--osl` flags if needed for your specific workload.
The benchmarking framework is built around Python modules that provide direct control over the benchmark workflow. The Python benchmarking module connects to your existing endpoints, runs the benchmarks, and can generate plots. Deployment is user-managed and out of scope for this tool.
### Comparison Limitations
The plotting system supports up to 12 different benchmarks in a single comparison.
### Concurrency Configuration
You can customize the concurrency levels using the CONCURRENCIES environment variable:
```bash
# Custom concurrency levels
CONCURRENCIES="1,5,20,50" python3 -m benchmarks.utils.benchmark \
--benchmark-name my-test \
--endpoint-url http://localhost:8000
# Or set permanently
export CONCURRENCIES="1,2,5,10,25,50,100"
python3 -m benchmarks.utils.benchmark \
--benchmark-name test \
--endpoint-url http://localhost:8000
```
## Understanding Your Results
After benchmarking completes, check `./benchmarks/results/` (or your custom output directory):
### Plot Labels and Organization
The plotting script uses the `--benchmark-name` as the experiment name in all generated plots. For example:
- `--benchmark-name aggregated` → plots will show "aggregated" as the label
- `--benchmark-name vllm-disagg` → plots will show "vllm-disagg" as the label
This allows you to easily identify and compare different configurations in the visualization plots.
### Summary and Plots
```text
benchmarks/results/plots
├── SUMMARY.txt # Quick overview of all results
├── p50_inter_token_latency_vs_concurrency.png # Token generation speed
├── avg_time_to_first_token_vs_concurrency.png # Response time
├── request_throughput_vs_concurrency.png # Requests per second
├── efficiency_tok_s_gpu_vs_user.png # GPU efficiency
└── avg_inter_token_latency_vs_concurrency.png # Average latency
```
### Data Files
Raw data is organized by deployment/benchmark type and concurrency level:
**For Any Benchmarking (uses your custom benchmark name):**
```text
results/ # Client-side: ./benchmarks/results/ or custom dir
├── plots/ # Server-side: /data/results/
│ ├── SUMMARY.txt # Performance visualization plots
│ ├── p50_inter_token_latency_vs_concurrency.png
│ ├── avg_inter_token_latency_vs_concurrency.png
│ ├── request_throughput_vs_concurrency.png
│ ├── efficiency_tok_s_gpu_vs_user.png
│ └── avg_time_to_first_token_vs_concurrency.png
├── <your-benchmark-name>/ # Results for your benchmark (uses your custom name)
│ ├── c1/ # Concurrency level 1
│ │ └── profile_export_aiperf.json
│ ├── c2/ # Concurrency level 2
│ ├── c5/ # Concurrency level 5
│ └── ... # Other concurrency levels (10, 50, 100, 250)
└── <your-benchmark-name-N>/ # Results for additional benchmarking runs
└── c*/ # Same structure as above
```
**Example with actual benchmark names:**
```text
results/
├── plots/
├── experiment-a/ # --benchmark-name experiment-a
├── experiment-b/ # --benchmark-name experiment-b
└── experiment-c/ # --benchmark-name experiment-c
```
Each concurrency directory contains:
- **`profile_export_aiperf.json`** - Structured metrics from AIPerf
- **`profile_export_aiperf.csv`** - CSV format metrics from AIPerf
- **`profile_export.json`** - Raw AIPerf results
- **`inputs.json`** - Generated test inputs
---
# Server-Side Benchmarking (In-Cluster)
Server-side benchmarking runs directly within the Kubernetes cluster, eliminating the need for port forwarding and providing better resource utilization.
## What Server-Side Benchmarking Does
The server-side benchmarking solution:
- Runs benchmarks directly within the Kubernetes cluster using internal service URLs
- Uses Kubernetes service DNS for direct communication (no port forwarding required)
- Leverages the existing benchmarking infrastructure (`benchmarks.utils.benchmark`)
- Stores results persistently using `dynamo-pvc`
- Provides isolated execution environment with configurable resources
- Handles high load/speed requirements without timeout issues
- **Note**: Each benchmark job runs within a single Kubernetes namespace, but can benchmark services across multiple namespaces using the full DNS format `svc_name.namespace.svc.cluster.local`
## Prerequisites
1. **Kubernetes cluster** with NVIDIA GPUs and Dynamo namespace setup (see [Dynamo Kubernetes Platform docs](/docs/kubernetes/README.md))
2. **Storage** PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md))
3. **Docker image** containing the Dynamo benchmarking tools
## Quick Start
### Step 1: Deploy Your DynamoGraphDeployment
Deploy your DynamoGraphDeployment using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Ensure it has a frontend service exposed.
### Step 2: Deploy and Run Benchmark Job
**Note**: The server-side benchmarking job requires a Docker image containing the Dynamo benchmarking tools. Before the 0.5.1 release, you must build your own Docker image using the [container build instructions](https://github.com/ai-dynamo/dynamo/blob/main/container/README.md), push it to your container registry, then update the `image` field in `benchmarks/incluster/benchmark_job.yaml` to use your built image tag.
```bash
export NAMESPACE=benchmarking
# Deploy the benchmark job with default settings
kubectl apply -f benchmarks/incluster/benchmark_job.yaml -n $NAMESPACE
# Monitor the job, wait for it to complete
kubectl logs -f job/dynamo-benchmark -n $NAMESPACE
```
#### Customize the job configuration
To customize the benchmark parameters, edit the `benchmarks/incluster/benchmark_job.yaml` file and modify:
- **Model name**: Change `"Qwen/Qwen3-0.6B"` in the args section
- **Benchmark name**: Change `"qwen3-0p6b-vllm-agg"` to your desired benchmark name
- **Service URL**: Change `"vllm-agg-frontend:8000"` so the service URL matches your deployed service
- **Docker image**: Change the image field if needed
Then deploy:
```bash
kubectl apply -f benchmarks/incluster/benchmark_job.yaml -n $NAMESPACE
```
### Step 3: Retrieve Results
```bash
# Create access pod (skip this step if access pod is already running)
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
# Download the results
kubectl cp $NAMESPACE/pvc-access-pod:/data/results/<benchmark-name> ./benchmarks/results/<benchmark-name>
# Cleanup
kubectl delete pod pvc-access-pod -n $NAMESPACE
```
### Step 4: Generate Plots
```bash
# Generate performance plots from the downloaded results
python3 -m benchmarks.utils.plot \
--data-dir ./benchmarks/results
```
This will create visualization plots. For more details on interpreting these plots, see the [Summary and Plots](#summary-and-plots) section above.
## Cross-Namespace Service Access
Server-side benchmarking can benchmark services across multiple namespaces from a single job using Kubernetes DNS. When referencing services in other namespaces, use the full DNS format:
```bash
# Access service in same namespace
SERVICE_URL=vllm-agg-frontend:8000
# Access service in different namespace
SERVICE_URL=vllm-agg-frontend.production.svc.cluster.local:8000
```
**DNS Format**: `<service-name>.<namespace>.svc.cluster.local:port`
This allows you to:
- Benchmark multiple services across different namespaces in a single job
- Compare services running in different environments (dev, staging, production)
- Test cross-namespace integrations without port-forwarding
- Run comprehensive cross-namespace performance comparisons
## Configuration
The benchmark job is configured directly in the YAML file.
### Default Configuration
- **Model**: `Qwen/Qwen3-0.6B`
- **Benchmark Name**: `qwen3-0p6b-vllm-agg`
- **Service**: `vllm-agg-frontend:8000`
- **Docker Image**: `nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag`
### Customizing the Job
To customize the benchmark, edit `benchmarks/incluster/benchmark_job.yaml`:
1. **Change the model**: Update the `--model` argument
2. **Change the benchmark name**: Update the `--benchmark-name` argument
3. **Change the service URL**: Update the `--endpoint-url` argument (use `<svc_name>.<namespace>.svc.cluster.local:port` for cross-namespace access)
4. **Change Docker image**: Update the image field if needed
### Example: Multi-Namespace Benchmarking
To benchmark services across multiple namespaces, you would need to run separate benchmark jobs for each service since the format supports one benchmark per job. However, the results are stored in the same PVC and may be accessed together.
```yaml
# Job 1: Production service
args:
- --model
- "Qwen/Qwen3-0.6B"
- --benchmark-name
- "prod-vllm"
- --endpoint-url
- "vllm-agg-frontend.production.svc.cluster.local:8000"
- --output-dir
- /data/results
# Job 2: Staging service
args:
- --model
- "Qwen/Qwen3-0.6B"
- --benchmark-name
- "staging-vllm"
- --endpoint-url
- "vllm-agg-frontend.staging.svc.cluster.local:8000"
- --output-dir
- /data/results
```
## Understanding Your Results
Results are stored in `/data/results` and follow the same structure as client-side benchmarking:
```text
/data/results/
└── <benchmark-name>/ # Results for your benchmark name
├── c1/ # Concurrency level 1
│ └── profile_export_aiperf.json
├── c2/ # Concurrency level 2
└── ... # Other concurrency levels
```
## Monitoring and Debugging
### Check Job Status
```bash
kubectl describe job dynamo-benchmark -n $NAMESPACE
```
### View Logs
```bash
# Follow logs in real-time
kubectl logs -f job/dynamo-benchmark -n $NAMESPACE
```
### Debug Failed Jobs
```bash
# Check pod status
kubectl get pods -n $NAMESPACE -l job-name=dynamo-benchmark
# Describe failed pod
kubectl describe pod <pod-name> -n $NAMESPACE
```
## Troubleshooting
### Common Issues
1. **Service not found**: Ensure your DynamoGraphDeployment frontend service is running
3. **PVC access**: Check that `dynamo-pvc` is properly configured and accessible
4. **Image pull issues**: Ensure the Docker image is accessible from the cluster
5. **Resource constraints**: Adjust resource limits if the job is being evicted
### Debug Commands
```bash
# Check PVC status
kubectl get pvc dynamo-pvc -n $NAMESPACE
# Check service endpoints
kubectl get svc -n $NAMESPACE
# Verify your service exists and has endpoints
SVC_NAME="${SERVICE_URL%%:*}"
kubectl get svc "$SVC_NAME" -n "$NAMESPACE"
kubectl get endpoints "$SVC_NAME" -n "$NAMESPACE"
```
---
## Customize Benchmarking Behavior
The built-in Python workflow connects to endpoints, benchmarks with aiperf, and generates plots. If you want to modify the behavior:
1. **Extend the workflow**: Modify `benchmarks/utils/workflow.py` to add custom deployment types or metrics collection
2. **Generate different plots**: Modify `benchmarks/utils/plot.py` to generate a different set of plots for whatever you wish to visualize.
3. **Direct module usage**: Use individual Python modules (`benchmarks.utils.benchmark`, `benchmarks.utils.plot`) for granular control over each step of the benchmarking process.
The Python benchmarking module provides a complete end-to-end benchmarking experience with full control over the workflow.
---
## Testing with Mocker Backend
For development and testing purposes, Dynamo provides a [mocker backend](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) that simulates LLM inference without requiring actual GPU resources. This is useful for:
- **Testing deployments** without expensive GPU infrastructure
- **Developing and debugging** router, planner, or frontend logic
- **CI/CD pipelines** that need to validate infrastructure without model execution
- **Benchmarking framework validation** to ensure your setup works before using real backends
The mocker backend mimics the API and behavior of real backends (vLLM, SGLang, TensorRT-LLM) but generates mock responses instead of running actual inference.
See the [mocker directory](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) for usage examples and configuration options.
# Dynamo KV Smart Router A/B Benchmarking Guide
This guide walks you through setting up and running A/B benchmarks to compare Dynamo's KV Smart Router against standard round-robin routing on a Kubernetes cluster.
## Overview
Dynamo's KV Smart Router intelligently routes requests based on KV cache affinity, improving performance for workloads with shared prompt prefixes. This guide helps you:
1. Deploy two identical Dynamo configurations:
a. A vllm server for Qwen3-32B with 8 workers (aggregated) **WITHOUT** KV Smart Router enabled
b. A vllm server for Qwen3-32B with 8 workers (aggregated) **WITH** KV Smart Router enabled
2. Run controlled benchmarks using AIPerf
3. Compare performance metrics to evaluate KV router effectiveness
**Prerequisites:** Kubernetes cluster with GPUs, kubectl, helm
---
## Prerequisites
### Required Tools
- `kubectl` (configured with cluster access)
- `helm` (v3+)
- HuggingFace account and token (if model downloads are gated)
- Kubernetes cluster with:
- GPU nodes (H100, H200, or similar)
- Sufficient GPU capacity (16+ GPUs recommended for this example)
- Dynamo platform installed globally OR ability to install per-namespace
### Knowledge Requirements
- Basic Kubernetes concepts (namespaces, pods, services)
- Familiarity with LLM inference concepts
- Command-line proficiency
---
## Architecture
This guide sets up two parallel deployments, as well as a benchmarking pod that can test each deployment:
```text
┌─────────────────────────────────────┐
│ Deployment A: Router OFF │
│ Namespace: router-off-test │
│ ├─ Frontend (Standard Routing) │
│ └─ 8x Decode Workers (1 GPU each) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Deployment B: Router ON │
│ Namespace: router-on-test │
│ ├─ Frontend (KV Smart Router) │
│ └─ 8x Decode Workers (1 GPU each) │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Benchmark Pod │
│ Namespace: benchmark │
│ └─ AIPerf + Dataset │
└─────────────────────────────────────┘
```
**Key Difference:** Deployment B sets `DYN_ROUTER_MODE=kv` on the frontend to enable KV cache-aware routing.
---
## Phase 1: Namespace and Infrastructure Setup
### Step 1.1: Create Namespaces
```bash
# Create namespaces for both deployments
kubectl create namespace router-off-test
kubectl create namespace router-on-test
kubectl create namespace benchmark
```
### Step 1.2: Create HuggingFace Token Secret (optional)
If the model you're seeking to deploy requires HF token to download (Llama family models require this), replace `YOUR_HF_TOKEN` with your actual HuggingFace token:
```bash
# Router-OFF namespace
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="YOUR_HF_TOKEN" \
-n router-off-test
# Router-ON namespace
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="YOUR_HF_TOKEN" \
-n router-on-test
```
### Step 1.3: Install Dynamo Platform (Per-Namespace)
If your cluster uses namespace-restricted Dynamo operators, you'll need to install the Dynamo platform in each namespace. Follow the [Dynamo Kubernetes Installation Guide](https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/installation_guide.md) to install the platform in both namespaces:
- `router-off-test`
- `router-on-test`
**Key Configuration Notes:**
- If your cluster uses namespace restrictions, ensure `dynamo-operator.namespaceRestriction.enabled=true` is set during installation
- Adjust version tags to match your cluster's available Dynamo versions
- If you encounter operator compatibility issues (e.g., unsupported MPI arguments), consult your cluster administrator or the Dynamo troubleshooting documentation
### Step 1.4: Verify Infrastructure
Wait for operators and infrastructure to be ready:
```bash
# Check router-off-test
kubectl get pods -n router-off-test
# Check router-on-test
kubectl get pods -n router-on-test
```
You should see:
- `dynamo-platform-dynamo-operator-controller-manager` (2/2 Running)
- `dynamo-platform-etcd-0` (1/1 Running)
- `dynamo-platform-nats-0` (2/2 Running)
---
## Phase 2: Deploy Model Serving
### Step 2.1: Create Deployment YAMLs
Create `router-off-deployment.yaml`:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: vllm-agg-no-router
spec:
services:
Frontend:
dynamoNamespace: vllm-agg-no-router
componentType: frontend
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
VllmDecodeWorker:
envFromSecret: hf-token-secret
dynamoNamespace: vllm-agg-no-router
componentType: worker
replicas: 8
resources:
limits:
gpu: "1"
extraPodSpec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node.kubernetes.io/instance-type
operator: In
values:
- gpu-h200-sxm # Adjust to your GPU node type
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
workingDir: /workspace/examples/backends/vllm
command:
- /bin/sh
- -c
args:
- python3 -m dynamo.vllm --model Qwen/Qwen3-32B --quantization fp8
startupProbe:
httpGet:
path: /health
port: 9090
initialDelaySeconds: 120
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 60 # 32 minutes total (120s + 60*30s)
livenessProbe:
httpGet:
path: /live
port: 9090
initialDelaySeconds: 300
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 10
readinessProbe:
httpGet:
path: /live
port: 9090
initialDelaySeconds: 300
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 10
```
Create `router-on-deployment.yaml`:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: vllm-agg-router
spec:
services:
Frontend:
dynamoNamespace: vllm-agg-router
componentType: frontend
replicas: 1
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
envs:
- name: DYN_ROUTER_MODE
value: kv # KEY DIFFERENCE: Enable KV Smart Router
VllmDecodeWorker:
envFromSecret: hf-token-secret
dynamoNamespace: vllm-agg-router
componentType: worker
replicas: 8
resources:
limits:
gpu: "1"
extraPodSpec:
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: node.kubernetes.io/instance-type
operator: In
values:
- gpu-h200-sxm # Adjust to your GPU node type
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
workingDir: /workspace/examples/backends/vllm
command:
- /bin/sh
- -c
args:
- python3 -m dynamo.vllm --model Qwen/Qwen3-32B --quantization fp8
startupProbe:
httpGet:
path: /health
port: 9090
initialDelaySeconds: 120
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 60 # 32 minutes total (120s + 60*30s)
livenessProbe:
httpGet:
path: /live
port: 9090
initialDelaySeconds: 300
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 10
readinessProbe:
httpGet:
path: /live
port: 9090
initialDelaySeconds: 300
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 10
```
### Step 2.2: Deploy Both Configurations
```bash
# Deploy router-OFF
kubectl apply -f router-off-deployment.yaml -n router-off-test
# Deploy router-ON
kubectl apply -f router-on-deployment.yaml -n router-on-test
```
**💡 Optimization Tip:** Each worker will download the model independently (~20 minutes per pod). For faster initialization, add a shared PVC with `ReadWriteMany` access mode to cache the model.
First, create the PVC separately:
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-cache
spec:
accessModes:
- ReadWriteMany
storageClassName: "your-shared-storage-class" # e.g., nfs, efs, nebius-shared-fs
resources:
requests:
storage: 100Gi
```
Then reference it in your DynamoGraphDeployment:
```yaml
spec:
pvcs:
- create: false
name: model-cache
size: "0"
services:
VllmDecodeWorker:
volumeMounts:
- mountPoint: /root/.cache/huggingface
name: model-cache
useAsCompilationCache: false
```
With this configuration, only the first worker downloads the model; others use the cached version, reducing startup time from 20+ minutes to ~2 minutes per pod.
### Step 2.3: Monitor Deployment Progress
```bash
# Watch router-OFF pods
kubectl get pods -n router-off-test -w
# Watch router-ON pods
kubectl get pods -n router-on-test -w
```
Wait for all pods to reach `Running` status and pass readiness probes.
**Expected Timeline:**
- **With shared PVC** (ReadWriteMany): ~5-10 minutes total (first worker downloads, others reuse cache)
- **Without shared PVC**: 20-30 minutes per worker (workers download independently)
- For 8 workers: Budget **1-2 hours** for full deployment (workers start in parallel but are limited by node scheduling)
The startup probe allows 32 minutes per pod (failureThreshold: 60), which accommodates model download and initialization.
### Step 2.4: Verify All Workers Are Healthy
> ⚠️ **CRITICAL CHECKPOINT**: Before running benchmarks, you **MUST** verify equal worker health in both deployments. Unequal worker counts will invalidate your comparison results.
```bash
# Quick health check - both should show "8/8"
echo "Router OFF: $(kubectl get pods -n router-off-test -l nvidia.com/dynamo-component-type=worker --field-selector=status.phase=Running -o json | jq '[.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True"))] | length')/8 ready"
echo "Router ON: $(kubectl get pods -n router-on-test -l nvidia.com/dynamo-component-type=worker --field-selector=status.phase=Running -o json | jq '[.items[] | select(.status.conditions[] | select(.type=="Ready" and .status=="True"))] | length')/8 ready"
# Detailed view
kubectl get pods -n router-off-test -l nvidia.com/dynamo-component-type=worker
kubectl get pods -n router-on-test -l nvidia.com/dynamo-component-type=worker
```
**Both must show 8/8 workers in Ready state (1/1 Running).** If workers are not ready:
- Check logs: `kubectl logs -n <namespace> <pod-name>`
- Common issues: model download in progress, startup probe timeout, insufficient GPU resources
**Do not proceed with benchmarks until all 16 workers (8 per deployment) are healthy.**
---
## Phase 3: Prepare Benchmark Dataset
### Understanding the Mooncake Trace Dataset
For this A/B comparison, we use the **Mooncake Trace Dataset**, published by [Mooncake AI](https://github.com/kvcache-ai/Mooncake). This is a privacy-preserving dataset of real-world LLM inference traffic from production arxiv workloads.
**What's in the dataset?** Each trace entry contains:
- **Timestamp:** When the request arrived (for realistic request timing)
- **Input/output lengths:** Number of tokens in prompts and responses
- **Block hash IDs:** Cryptographic hashes representing KV cache blocks (explained below)
**Sample trace entry:**
```json
{
"timestamp": 27482,
"input_length": 6955,
"output_length": 52,
"hash_ids": [46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 2353, 2354]
}
```
### Why Mooncake Traces Matter for KV Cache Benchmarking
**The Challenge:** Traditional LLM benchmarks use synthetic or random data, which are often insufficient to capture real-world optimizations like KV Smart Router. To properly evaluate this feature, we need realistic traffic patterns with **prefix repetition** - but this creates a privacy problem: how do we measure realistic KV cache hit patterns without exposing actual user conversations?
**Mooncake's Solution: Privacy-Preserving Block Hashes**
Instead of storing actual prompt text, the Mooncake dataset uses cryptographic hashes to represent KV cache blocks. Each hash ID represents a **512-token block**, and the hash includes both the current block and all preceding blocks. This preserves the **pattern of prefix reuse** while completely protecting user privacy.
### How it works - Multi-turn conversation example
```text
Turn 1 (initial request - long document analysis):
Input: ~8,000 tokens (e.g., research paper + question)
Hash IDs: [46][47][48][49][50][51][52][53][54][55][56][57][58][59][60][61]
└─ 16 blocks × 512 tokens/block = ~8,192 tokens
Turn 2 (follow-up question on same document):
Input: Same document + new question (~8,500 tokens)
Hash IDs: [46][47][48][49][50][51][52][53][54][55][56][57][58][59][60][61][62]
└──────────── Reuses first 16 blocks (~8,192 tokens) ───────────────┘
✅ Cache hit: First 8,192 tokens don't need recomputation!
Turn 3 (another follow-up):
Input: Same document + different question (~9,000 tokens)
Hash IDs: [46][47][48][49][50][51][52][53][54][55][56][57][58][59][60][61][62][63]
└──────────── Reuses first 16 blocks (~8,192 tokens) ───────────────┘
```
When requests share the same hash IDs (e.g., blocks 46-61), it means they share those 512-token blocks - indicating **significant prefix overlap** (in this case, 8,192 tokens). The **KV Smart Router** routes requests with matching hash IDs to the same worker, maximizing cache hits and avoiding redundant computation for those shared prefix tokens.
**Key Dataset Properties:**
-**Realistic timing:** Request arrival patterns from production workloads
-**Real prefix patterns:** Up to 50% cache hit ratio ([Mooncake technical report](https://github.com/kvcache-ai/Mooncake))
-**Privacy-preserving:** No actual text - only hash-based cache block identifiers
-**Reproducible:** Public dataset enables fair comparisons across different systems
**Why this matters:** With random synthetic data, the KV Smart Router would show no benefit because there's no prefix reuse to exploit. Mooncake traces provide realistic workload patterns that demonstrate the router's real-world performance gains while respecting user privacy.
---
### Download and Prepare the Dataset
```bash
# Download the Mooncake arxiv trace dataset
curl -sL https://raw.githubusercontent.com/kvcache-ai/Mooncake/refs/heads/main/FAST25-release/arxiv-trace/mooncake_trace.jsonl -o mooncake_trace.jsonl
# Trim to 1000 requests for faster benchmarking
head -n 1000 mooncake_trace.jsonl > mooncake_trace_small.jsonl
# Speed up timestamps 4x (reduces benchmark time from ~12 min to ~3 min)
python3 - <<'PY'
import json
with open("mooncake_trace_small.jsonl") as src, open("mooncake_trace_4x.jsonl", "w") as dst:
for line in src:
rec = json.loads(line)
rec["timestamp"] = int(rec["timestamp"] / 4)
dst.write(json.dumps(rec) + "\n")
PY
echo "Dataset ready: mooncake_trace_4x.jsonl (1000 requests, 4x speed)"
```
---
## Phase 4: Set Up Benchmark Environment
### Step 4.1: Deploy Benchmark Pod
Create `benchmark-job.yaml`:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
name: aiperf-benchmark
namespace: benchmark
spec:
backoffLimit: 1
template:
spec:
restartPolicy: Never
containers:
- name: benchmark
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.0
command: ["/bin/sh", "-c", "sleep infinity"]
imagePullPolicy: IfNotPresent
resources:
limits:
nvidia.com/gpu: 0
```
Deploy:
```bash
kubectl apply -f benchmark-job.yaml
```
Wait for pod to be ready:
```bash
kubectl get pods -n benchmark
```
### Step 4.2: Copy Dataset to Benchmark Pod
```bash
POD_NAME=$(kubectl get pods -n benchmark -l job-name=aiperf-benchmark -o jsonpath='{.items[0].metadata.name}')
kubectl -n benchmark cp mooncake_trace_4x.jsonl ${POD_NAME}:/tmp/mooncake_trace_4x.jsonl
```
### Step 4.3: Install AIPerf
```bash
kubectl -n benchmark exec ${POD_NAME} -- bash -lc '. /opt/dynamo/venv/bin/activate && pip install -q aiperf'
```
---
## Phase 5: Run Benchmarks
### Step 5.1: Benchmark Router-OFF (Baseline)
```bash
kubectl -n benchmark exec ${POD_NAME} -- bash -lc '
. /opt/dynamo/venv/bin/activate
aiperf profile \
--model "Qwen/Qwen3-32B" \
--url "http://vllm-agg-no-router-frontend.router-off-test.svc.cluster.local:8000" \
--endpoint-type chat \
--input-file /tmp/mooncake_trace_4x.jsonl \
--custom-dataset-type mooncake_trace \
--tokenizer "Qwen/Qwen3-32B" \
--streaming \
--request-count 1000 \
--fixed-schedule \
--output-artifact-dir /tmp/router_off_results
'
```
**Note:** This will take 3-5 minutes. The terminal output includes a summary table.
### Step 5.2: Benchmark Router-ON (KV Smart Router)
```bash
kubectl -n benchmark exec ${POD_NAME} -- bash -lc '
. /opt/dynamo/venv/bin/activate
aiperf profile \
--model "Qwen/Qwen3-32B" \
--url "http://vllm-agg-router-frontend.router-on-test.svc.cluster.local:8000" \
--endpoint-type chat \
--input-file /tmp/mooncake_trace_4x.jsonl \
--custom-dataset-type mooncake_trace \
--tokenizer "Qwen/Qwen3-32B" \
--streaming \
--request-count 1000 \
--fixed-schedule \
--output-artifact-dir /tmp/router_on_results
'
```
### Step 5.3: Collect Results
```bash
# Copy results to local machine
kubectl -n benchmark cp ${POD_NAME}:/tmp/router_off_results/profile_export_aiperf.csv ./router_off_results.csv
kubectl -n benchmark cp ${POD_NAME}:/tmp/router_on_results/profile_export_aiperf.csv ./router_on_results.csv
```
---
## Phase 6: Analyze Results
### Key Metrics to Compare
| Metric | Description | What to Look For |
|--------|-------------|------------------|
| **Time to First Token (TTFT)** | Latency until first token arrives | Lower is better; KV router may reduce with prefix reuse |
| **Inter Token Latency (ITL)** | Average time between tokens | Lower is better; indicates generation speed |
| **Request Latency** | Total end-to-end latency | Lower is better; overall user experience |
| **Output Token Throughput** | Tokens generated per second (system-wide) | Higher is better; system efficiency |
| **Request Throughput** | Requests completed per second | Higher is better; capacity |
### Interpreting Results
**Your Results May Vary**: The improvement from KV Smart Router depends heavily on your workload characteristics:
**Factors that increase KV router benefit:**
- **High prefix overlap** (shared system prompts, templates, document contexts)
- **Long prompts** (>2000 tokens) where caching saves significant compute
- **Multi-turn conversations** with context carryover
- **Batch workloads** with similar queries
**Factors that reduce KV router benefit:**
- **Unique prompts** with no prefix reuse
- **Short prompts** (<1000 tokens) where routing overhead exceeds benefit
- **Evenly distributed load** where round-robin is already optimal
- **Low request rate** where cache eviction negates benefits
**Expected Performance:**
- **High prefix overlap workloads**: 20-50% TTFT improvement
- **Moderate prefix overlap**: 10-20% improvement
- **Low prefix overlap**: <5% improvement (may not be worth enabling)
**KV Smart Router is beneficial when:**
- TTFT improvements > 20%
- No significant degradation in other metrics
- Workload demonstrates measurable prefix reuse patterns
**Standard routing is better when:**
- KV router shows <10% improvement
- Increased latency variance is observed
- Load distribution across workers is more important than cache affinity
### Example Comparison
From the terminal output, compare the summary tables:
```
Router-OFF (Baseline):
TTFT avg: 12,764 ms p99: 45,898 ms
Request Latency avg: 32,978 ms
Output Token Throughput: 1,614 tokens/sec
Request Throughput: 8.61 req/sec
Router-ON (KV Router):
TTFT avg: 8,012 ms p99: 28,644 ms (37% faster ✅)
Request Latency avg: 28,972 ms (12% faster ✅)
Output Token Throughput: 1,746 tokens/sec (8% higher ✅)
Request Throughput: 9.33 req/sec (8% higher ✅)
```
In this example with all 8 workers healthy, the **KV router significantly outperformed** the baseline:
- **37% faster TTFT** - Users see first token much sooner
- **8% higher throughput** - System processes more requests per second
- **12% lower latency** - Faster end-to-end completion
The Mooncake arxiv dataset has sufficient prefix overlap (long input sequences with similar patterns) to benefit from KV cache-aware routing. Workloads with explicit shared prefixes (system prompts, templates) may see even greater improvements.
---
## Phase 7: Cleanup
```bash
# Delete deployments
kubectl delete dynamographdeployment vllm-agg-no-router -n router-off-test
kubectl delete dynamographdeployment vllm-agg-router -n router-on-test
# Delete namespaces (removes all resources)
kubectl delete namespace router-off-test
kubectl delete namespace router-on-test
kubectl delete namespace benchmark
```
---
## Troubleshooting
### Issue: Pods Stuck in Pending
**Cause:** Insufficient GPU resources
**Solution:**
```bash
# Check GPU availability
kubectl describe nodes | grep -A 10 "Allocated resources"
# Reduce worker replicas if needed
kubectl edit dynamographdeployment -n <namespace>
```
### Issue: ImagePullBackOff Errors
**Cause:** Version mismatch or missing credentials
**Solution:**
```bash
# Check available versions
kubectl get pods -n dynamo-system -o yaml | grep image:
# Update deployment YAML to match cluster version
```
### Issue: Operator Not Processing Deployment
**Cause:** Namespace restrictions
**Solution:**
- Ensure Dynamo platform is Helm-installed in the namespace
- Verify operator has `--restrictedNamespace=<your-namespace>` argument
- Check operator logs: `kubectl logs -n <namespace> deployment/dynamo-platform-dynamo-operator-controller-manager`
### Issue: Workers Not Becoming Ready
**Cause:** Model download failures or probe configuration
**Solution:**
```bash
# Check worker logs
kubectl logs -n <namespace> <worker-pod-name>
# Common issues:
# - Invalid HuggingFace token
# - Network connectivity
# - Insufficient disk space for model
```
### Issue: Workers Restarting in CrashLoopBackOff
**Cause:** Startup probe timeout - workers killed before finishing initialization
**Symptoms:**
- Pods show "Container main failed startup probe, will be restarted"
- Logs show model still downloading or loading when pod is killed
- Large models (>30GB) take longer than default 22-minute timeout
**Solution:**
Increase the startup probe `failureThreshold`:
```bash
# Patch the deployment to allow 32 minutes instead of 22
kubectl patch dynamographdeployment <deployment-name> -n <namespace> --type='json' \
-p='[{"op": "replace", "path": "/spec/services/VllmDecodeWorker/extraPodSpec/mainContainer/startupProbe/failureThreshold", "value": 60}]'
```
Or update your YAML before deploying:
```yaml
startupProbe:
httpGet:
path: /health
port: 9090
initialDelaySeconds: 120
periodSeconds: 30
timeoutSeconds: 10
failureThreshold: 60 # 32 minutes total (120s + 60*30s)
```
**Model Loading Times (approximate):**
- Qwen3-32B: ~20-25 minutes (first download)
- Llama-70B: ~25-30 minutes (first download)
- With cached model on node: ~2-5 minutes
### Issue: Unequal Worker Health
**Cause:** Resource constraints, image pull issues, or configuration errors
**Solution:**
```bash
# Check all worker status
kubectl get pods -n <namespace> -l nvidia.com/dynamo-component-type=worker
# Describe problematic pods
kubectl describe pod <pod-name> -n <namespace>
# Fix issues before benchmarking or results will be skewed
```
---
## Advanced Configuration
### Testing Different Models
Replace `Qwen/Qwen3-32B` with your model in:
- Deployment YAML `args` section
- AIPerf `--model` and `--tokenizer` parameters
### Adjusting Worker Count
Change `replicas: 8` in the deployment YAMLs. Ensure both deployments use the same count for fair comparison.
### Using Custom Datasets
Replace mooncake dataset with your own JSONL file:
- Format: One request per line with `timestamp` field
- AIPerf supports various formats via `--custom-dataset-type`
### Disaggregated Prefill/Decode
For advanced testing, add separate prefill workers:
```yaml
VllmPrefillWorker:
componentType: worker
replicas: 2
# ... configuration
```
---
## Best Practices
1. **Equal Conditions:** Ensure both deployments have identical worker counts and health before benchmarking
2. **Warm-Up:** Run a small test (100 requests) before the full benchmark to warm up caches
3. **Multiple Runs:** Run benchmarks 3+ times and average results for statistical significance
4. **Monitor Workers:** Watch for any pod restarts or issues during benchmark runs
5. **Document Conditions:** Record cluster state, worker health, and any anomalies
6. **Test Relevant Workloads:** Use datasets that match your actual use case for meaningful results
---
## Conclusion
This guide provides a complete methodology for A/B testing Dynamo's KV Smart Router. The KV router's effectiveness depends heavily on workload characteristics—datasets with high prefix overlap will show the most benefit.
For questions or issues, consult the [Dynamo documentation](https://github.com/ai-dynamo/dynamo) or open an issue on GitHub.
---
## Appendix: Files Reference
- `router-off-deployment.yaml`: Standard routing deployment
- `router-on-deployment.yaml`: KV router enabled deployment
- `benchmark-job.yaml`: AIPerf benchmark pod
- `prepare-dataset.sh`: Dataset preparation script
- Results CSVs: Detailed metrics from AIPerf
**Repository:** [https://github.com/ai-dynamo/dynamo](https://github.com/ai-dynamo/dynamo)
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0 -->
# Frontend
The Dynamo Frontend is the API gateway for serving LLM inference requests. It provides OpenAI-compatible HTTP endpoints and KServe gRPC endpoints, handling request preprocessing, routing, and response formatting.
## Feature Matrix
| Feature | Status |
|---------|--------|
| OpenAI Chat Completions API | ✅ Supported |
| OpenAI Completions API | ✅ Supported |
| KServe gRPC v2 API | ✅ Supported |
| Streaming responses | ✅ Supported |
| Multi-model serving | ✅ Supported |
| Integrated routing | ✅ Supported |
| Tool calling | ✅ Supported |
## Quick Start
### Prerequisites
- Dynamo platform installed
- `etcd` and `nats-server -js` running
- At least one backend worker registered
### HTTP Frontend
```bash
python -m dynamo.frontend --http-port 8000
```
This starts an OpenAI-compatible HTTP server with integrated preprocessing and routing. Backends are auto-discovered when they call `register_llm`.
### KServe gRPC Frontend
```bash
python -m dynamo.frontend --kserve-grpc-server
```
See the [Frontend Guide](frontend_guide.md) for KServe-specific configuration and message formats.
### Kubernetes
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: frontend-example
spec:
graphs:
- name: frontend
replicas: 1
services:
- name: Frontend
image: nvcr.io/nvidia/dynamo/dynamo-vllm:latest
command:
- python
- -m
- dynamo.frontend
- --http-port
- "8000"
```
## Configuration
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--http-port` | 8000 | HTTP server port |
| `--kserve-grpc-server` | false | Enable KServe gRPC server |
| `--router-mode` | `round_robin` | Routing strategy: `round_robin`, `random`, `kv` |
See the [Frontend Guide](frontend_guide.md) for full configuration options.
## Next Steps
| Document | Description |
|----------|-------------|
| [Frontend Guide](frontend_guide.md) | KServe gRPC configuration and integration |
| [Router Documentation](../router/README.md) | KV-aware routing configuration |
```{toctree}
:hidden:
frontend_guide
```
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0 -->
# Frontend Guide
This guide covers the KServe gRPC frontend configuration and integration for the Dynamo Frontend.
## KServe gRPC Frontend
### Motivation
[KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry-standard protocols for machine learning model inference. Triton inference server is one of the inference solutions that comply with KServe v2 API and it has gained a lot of adoption. To quickly enable Triton users to explore with Dynamo benefits, Dynamo provides a KServe gRPC frontend.
This documentation assumes readers are familiar with the usage of KServe v2 API and focuses on explaining the Dynamo parts that work together to support KServe API and how users may migrate existing KServe deployment to Dynamo.
## Supported Endpoints
* `ModelInfer` endpoint: KServe Standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#inference-1)
* `ModelStreamInfer` endpoint: Triton extension endpoint that provide bi-directional streaming version of the inference RPC to allow a sequence of inference requests/responses to be sent over a GRPC stream, as described [here](https://github.com/triton-inference-server/common/blob/main/protobuf/grpc_service.proto#L84-L92)
* `ModelMetadata` endpoint: KServe standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#model-metadata-1)
* `ModelConfig` endpoint: Triton extension endpoint as described [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_configuration.md)
## Starting the Frontend
To start the KServe frontend, run the below command:
```bash
python -m dynamo.frontend --kserve-grpc-server
```
## gRPC Performance Tuning
The gRPC server supports optional HTTP/2 flow control tuning via environment variables. These can be set before starting the server to optimize for high-throughput streaming workloads.
| Environment Variable | Description | Default |
|---------------------|-------------|---------|
| `DYN_GRPC_INITIAL_CONNECTION_WINDOW_SIZE` | HTTP/2 connection-level flow control window size in bytes | tonic default (64KB) |
| `DYN_GRPC_INITIAL_STREAM_WINDOW_SIZE` | HTTP/2 per-stream flow control window size in bytes | tonic default (64KB) |
### Example: High-ISL/OSL configuration for streaming workloads
```bash
# For 128 concurrent 15k-token requests
export DYN_GRPC_INITIAL_CONNECTION_WINDOW_SIZE=16777216 # 16MB
export DYN_GRPC_INITIAL_STREAM_WINDOW_SIZE=1048576 # 1MB
python -m dynamo.frontend --kserve-grpc-server
```
If these variables are not set, the server uses tonic's default values.
> **Note**: Tune these values based on your workload. Connection window should accommodate `concurrent_requests x request_size`. Memory overhead equals the connection window size (shared across all streams). See [gRPC performance best practices](https://grpc.io/docs/guides/performance/) and [gRPC channel arguments](https://grpc.github.io/grpc/core/group__grpc__arg__keys.html) for more details.
## Registering a Backend
Similar to HTTP frontend, the registered backend will be auto-discovered and added to the frontend list of serving model. To register a backend, the same `register_llm()` API will be used. Currently the frontend support serving of the following model type and model input combination:
* `ModelType::Completions` and `ModelInput::Text`: Combination for LLM backend that uses custom preprocessor
* `ModelType::Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. Dynamo vLLM / SGLang / TRTLLM backend)
* `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor-based inference
The first two combinations are backed by OpenAI Completions API, see [OpenAI Completions section](#openai-completions) for more detail. Whereas the last combination is most aligned with KServe API and the users can replace existing deployment with Dynamo once their backends implements adaptor for `NvCreateTensorRequest/NvCreateTensorResponse`, see [Tensor section](#tensor) for more detail:
### OpenAI Completions
Most of the Dynamo features are tailored for LLM inference and the combinations that are backed by OpenAI API can enable those features and are best suited for exploring those Dynamo features. However, this implies specific conversion between generic tensor-based messages and OpenAI message and imposes specific structure of the KServe request message.
#### Model Metadata / Config
The metadata and config endpoint will report the registered backend to have the below, note that this is not the exact response.
```json
{
"name": "$MODEL_NAME",
"version": 1,
"platform": "dynamo",
"backend": "dynamo",
"inputs": [
{
"name": "text_input",
"datatype": "BYTES",
"shape": [1]
},
{
"name": "streaming",
"datatype": "BOOL",
"shape": [1],
"optional": true
}
],
"outputs": [
{
"name": "text_output",
"datatype": "BYTES",
"shape": [-1]
},
{
"name": "finish_reason",
"datatype": "BYTES",
"shape": [-1],
"optional": true
}
]
}
```
#### Inference
On receiving inference request, the following conversion will be performed:
* `text_input`: the element is expected to contain the user prompt string and will be converted to `prompt` field in OpenAI Completion request
* `streaming`: the element will be converted to `stream` field in OpenAI Completion request
On receiving model response, the following conversion will be performed:
* `text_output`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `text` of the choice.
* `finish_reason`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `finish_reason` of the choice.
### Tensor
This combination is used when the user is migrating an existing KServe-based backend into Dynamo ecosystem.
#### Model Metadata / Config
When registering the backend, the backend must provide the model's metadata as tensor-based deployment is generic and the frontend can't make any assumptions like for OpenAI Completions model. There are two methods to provide model metadata:
* [TensorModelConfig](../../../lib/llm/src/protocols/tensor.rs): This is Dynamo defined structure for model metadata, the backend can provide the model metadata as shown in this [example](../../../lib/bindings/python/tests/test_tensor.py). For metadata provided in such way, the following field will be set to a fixed value: `version: 1`, `platform: "dynamo"`, `backend: "dynamo"`. Note that for model config endpoint, the rest of the fields will be set to their default values.
* [triton_model_config](../../../lib/llm/src/protocols/tensor.rs): For users that already have Triton model config and require the full config to be returned for client side logic, they can set the config in `TensorModelConfig::triton_model_config` which supersedes other fields in `TensorModelConfig` and be used for endpoint responses. `triton_model_config` is expected to be the serialized string of the `ModelConfig` protobuf message, see [echo_tensor_worker.py](../../../tests/frontend/grpc/echo_tensor_worker.py) for example.
#### Inference
When receiving inference request, the backend will receive [NvCreateTensorRequest](../../../lib/llm/src/protocols/tensor.rs) and be expected to return [NvCreateTensorResponse](../../../lib/llm/src/protocols/tensor.rs), which are the mapping of ModelInferRequest / ModelInferResponse protobuf message in Dynamo.
## Python Bindings
The frontend may be started via Python binding, this is useful when integrating Dynamo in existing system that desire the frontend to be run in the same process with other components. See [server.py](../../../lib/bindings/python/examples/kserve_grpc_service/server.py) for example.
## Integration
### With Router
The frontend includes an integrated router for request distribution. Configure routing mode:
```bash
python -m dynamo.frontend --router-mode kv --http-port 8000
```
See [Router Documentation](../router/README.md) for routing configuration details.
### With Backends
Backends auto-register with the frontend when they call `register_llm()`. Supported backends:
- [vLLM Backend](../../backends/vllm/README.md)
- [SGLang Backend](../../backends/sglang/README.md)
- [TensorRT-LLM Backend](../../backends/trtllm/README.md)
## See Also
| Document | Description |
|----------|-------------|
| [Frontend Overview](README.md) | Quick start and feature matrix |
| [Router Documentation](../router/README.md) | KV-aware routing configuration |
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# KV Block Manager (KVBM)
The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM and TensorRT-LLM.
KVBM offers:
- A **unified memory API** spanning GPU memory, pinned host memory, remote RDMA-accessible memory, local/distributed SSDs, and remote file/object/cloud storage systems
- Support for **block lifecycles** (allocate → register → match) with event-based state transitions
- Integration with **[NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)**, a dynamic memory exchange layer for remote registration, sharing, and access of memory blocks
> **Get started:** See the [KVBM Guide](kvbm_guide.md) for installation and deployment instructions.
## When to Use KV Cache Offloading
KV Cache offloading avoids expensive KV Cache recomputation, resulting in faster response times and better user experience. Providers benefit from higher throughput and lower cost per token, making inference services more scalable and efficient.
Offloading KV cache to CPU or storage is most effective when KV Cache exceeds GPU memory and cache reuse outweighs the overhead of transferring data. It is especially valuable in:
| Scenario | Benefit |
|----------|---------|
| **Long sessions and multi-turn conversations** | Preserves large prompt prefixes, avoids recomputation, improves first-token latency and throughput |
| **High concurrency** | Idle or partial conversations can be moved out of GPU memory, allowing active requests to proceed without hitting memory limits |
| **Shared or repeated content** | Reuse across users or sessions (system prompts, templates) increases cache hits, especially with remote or cross-instance sharing |
| **Memory- or cost-constrained deployments** | Offloading to RAM or SSD reduces GPU demand, allowing longer prompts or more users without adding hardware |
## Feature Support Matrix
| | Feature | Support |
|--|---------|---------|
| **Backend** | Local | ✅ |
| | Kubernetes | ✅ |
| **LLM Framework** | vLLM | ✅ |
| | TensorRT-LLM | ✅ |
| | SGLang | ❌ |
| **Serving Type** | Aggregated | ✅ |
| | Disaggregated | ✅ |
## Architecture
![KVBM Architecture](../../images/kvbm-architecture.png)
*High-level layered architecture view of Dynamo KV Block Manager and how it interfaces with different components of the LLM inference ecosystem*
KVBM has three primary logical layers:
**LLM Inference Runtime Layer** — The top layer includes inference runtimes (TensorRT-LLM, vLLM) that integrate through dedicated connector modules to the Dynamo KVBM. These connectors act as translation layers, mapping runtime-specific operations and events into KVBM's block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and memory tiering.
**KVBM Logic Layer** — The middle layer encapsulates core KV block manager logic and serves as the runtime substrate for managing block memory. The KVBM adapter normalizes representations and data layout for incoming requests across runtimes and forwards them to the core memory manager. This layer implements table lookups, memory allocation, block layout management, lifecycle state transitions, and block reuse/eviction policies.
**NIXL Layer** — The bottom layer provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, RDMA and NVLink remote memory sharing, dynamic block registration and metadata exchange, and provides a plugin interface for storage backends including block memory (GPU HBM, Host DRAM, Remote DRAM, Local SSD), local/remote filesystems, object stores, and cloud storage.
> **Learn more:** See the [KVBM Design Document](../../design_docs/kvbm_design.md) for detailed architecture, components, and data flows.
## Next Steps
- **[KVBM Guide](kvbm_guide.md)** — Installation, configuration, and deployment instructions
- **[KVBM Design](../../design_docs/kvbm_design.md)** — Architecture deep dive, components, and data flows
- **[LMCache Integration](../../integrations/lmcache_integration.md)** — Use LMCache with Dynamo vLLM backend
- **[FlexKV Integration](../../integrations/flexkv_integration.md)** — Use FlexKV for KV cache management
- **[SGLang HiCache](../../integrations/sglang_hicache.md)** — Enable SGLang's hierarchical cache with NIXL
- **[NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** — NIXL communication library details
```{toctree}
:hidden:
kvbm_guide
```
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment