SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# SGLang Prometheus Metrics
## Overview
When running SGLang through Dynamo, SGLang engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both SGLang engine metrics (prefixed with `sglang:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
**For the complete and authoritative list of all SGLang metrics**, always refer to the [official SGLang Production Metrics documentation](https://docs.sglang.io/references/production_metrics.html).
**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
**For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana.md).
## Environment Variables
| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_SYSTEM_PORT` | System metrics/health port | `-1` (disabled) | `8081` |
## Getting Started Quickly
This is a single machine example.
### Start Observability Stack
For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README.md#getting-started-quickly) for instructions.
### Launch Dynamo Components
Launch a frontend and SGLang backend to test metrics:
```bash
# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
SGLang exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All SGLang engine metrics use the `sglang:` prefix and include labels (e.g., `model_name`, `engine_type`, `tp_rank`, `pp_rank`) to identify the source.
**Example Prometheus Exposition Format text:**
```
# HELP sglang:prompt_tokens_total Number of prefill tokens processed.
**Note:** The specific metrics shown above are examples and may vary depending on your SGLang version. Always inspect your actual `/metrics` endpoint or refer to the [official documentation](https://docs.sglang.io/references/production_metrics.html) for the current list.
### Metric Categories
SGLang provides metrics in the following categories (all prefixed with `sglang:`):
-**Throughput metrics** - Token processing rates
-**Resource usage** - System resource consumption
-**Latency metrics** - Request and token latency measurements
-**Disaggregation metrics** - Metrics specific to disaggregated deployments (when enabled)
**Note:** Specific metrics are subject to change between SGLang versions. Always refer to the [official documentation](https://docs.sglang.io/references/production_metrics.html) or inspect the `/metrics` endpoint for your SGLang version.
## Available Metrics
The official SGLang documentation includes complete metric definitions with:
For the complete and authoritative list of all SGLang metrics, see the [official SGLang Production Metrics documentation](https://docs.sglang.io/references/production_metrics.html).
## Implementation Details
- SGLang uses multiprocess metrics collection via `prometheus_client.multiprocess.MultiProcessCollector`
- Metrics are filtered by the `sglang:` prefix before being exposed
- The integration uses Dynamo's `register_engine_metrics_callback()` function
- Metrics appear after SGLang engine initialization completes
## Related Documentation
### SGLang Metrics
-[Official SGLang Production Metrics](https://docs.sglang.io/references/production_metrics.html)
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# LLM Deployment using TensorRT-LLM
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
### Start Infrastructure Services (Local Development Only)
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):
```bash
docker compose -f deploy/docker-compose.yml up -d
```
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
### Build container
```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router_guide.md).
### Aggregated
```bash
cd$DYNAMO_HOME/examples/backends/trtllm
./launch/agg.sh
```
### Aggregated with KV Routing
```bash
cd$DYNAMO_HOME/examples/backends/trtllm
./launch/agg_router.sh
```
### Disaggregated
```bash
cd$DYNAMO_HOME/examples/backends/trtllm
./launch/disagg.sh
```
### Disaggregated with KV Routing
> [!IMPORTANT]
> In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
```bash
cd$DYNAMO_HOME/examples/backends/trtllm
./launch/disagg_router.sh
```
### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
## Advanced Examples
Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
### Multinode Deployment
For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [TensorRT-LLM Kubernetes Deployment Guide](../../../examples/backends/trtllm/deploy/README.md).
### Client
See [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
### Benchmarking
To benchmark your deployment with AIPerf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
## KV Cache Transfer in Disaggregated Serving
Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-transfer.md).
## Request Migration
Dynamo supports [request migration](../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../../docs/fault_tolerance/request_migration.md) documentation for configuration details.
## Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
### Cancellation Support Matrix
| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |
For more details, see the [Request Cancellation Architecture](../../fault_tolerance/request_cancellation.md) documentation.
## Client
See [client](../../../docs/backends/sglang/README.md#testing-the-deployment) section to learn how to send request to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
## Benchmarking
To benchmark your deployment with AIPerf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
## Multimodal support
Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal_trtllm.md).
## Video Diffusion Support (Experimental)
Dynamo supports video generation using diffusion models through the `--modality video_diffusion` flag.
### Requirements
-**visual_gen**: Part of TensorRT-LLM, located at `tensorrt_llm/visual_gen/`. Currently available **only** on the [`feat/visual_gen`](https://github.com/NVIDIA/TensorRT-LLM/tree/feat/visual_gen/tensorrt_llm/visual_gen) branch (not yet merged to main or any release). Install from source:
- Video diffusion is experimental and not recommended for production use
- Only text-to-video is supported in this release (image-to-video planned)
- Requires GPU with sufficient VRAM for the diffusion model
## Logits Processing
Logits processors let you modify the next-token logits at every decoding step (e.g., to apply custom constraints or sampling transforms). Dynamo provides a backend-agnostic interface and an adapter for TensorRT-LLM so you can plug in custom processors.
### How it works
-**Interface**: Implement `dynamo.logits_processing.BaseLogitsProcessor` which defines `__call__(input_ids, logits)` and modifies `logits` in-place.
-**TRT-LLM adapter**: Use `dynamo.trtllm.logits_processing.adapter.create_trtllm_adapters(...)` to convert Dynamo processors into TRT-LLM-compatible processors and assign them to `SamplingParams.logits_processor`.
-**Examples**: See example processors in `lib/bindings/python/src/dynamo/logits_processing/examples/` ([temperature](../../../lib/bindings/python/src/dynamo/logits_processing/examples/temperature.py), [hello_world](../../../lib/bindings/python/src/dynamo/logits_processing/examples/hello_world.py)).
### Quick test: HelloWorld processor
You can enable a test-only processor that forces the model to respond with "Hello world!". This is useful to verify the wiring without modifying your model or engine code.
```bash
cd$DYNAMO_HOME/examples/backends/trtllm
export DYNAMO_ENABLE_TEST_LOGITS_PROCESSOR=1
./launch/agg.sh
```
Notes:
- When enabled, Dynamo initializes the tokenizer so the HelloWorld processor can map text to token IDs.
- Expected chat response contains "Hello world".
### Bring your own processor
Implement a processor by conforming to `BaseLogitsProcessor` and modify logits in-place. For example, temperature scaling:
- Per-request processing only (batch size must be 1); beam width > 1 is not supported.
- Processors must modify logits in-place and not return a new tensor.
- If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).
## DP Rank Routing (Attention Data Parallelism)
TensorRT-LLM supports [attention data parallelism](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models)(attention DP) for models like DeepSeek. When enabled, multiple attention DP ranks run within a single worker, each with its own KV cache. Dynamo can route requests to specific DP ranks based on KV cache state.
### Dynamo vs TRT-LLM Internal Routing
-**Dynamo DP Rank Routing**: The router selects the optimal DP rank based on KV cache overlap and instructs TRT-LLM to use that rank with strict routing (`attention_dp_relax=False`). Use this with `--router-mode kv` for cache-aware routing.
-**TRT-LLM Internal Routing**: TRT-LLM's scheduler assigns DP ranks internally. Use this with `--router-mode round-robin` or `random` when KV-aware routing isn't needed.
### Enabling DP Rank Routing
```bash
# Worker with attention DP
# (TP=2 acts as the "world size", in effect creating 2 attention DP ranks)
The `--enable-attention-dp` flag sets `attention_dp_size = tensor_parallel_size` and configures Dynamo to publish KV events per DP rank. The router automatically creates routing targets for each `(worker_id, dp_rank)` combination.
> [!NOTE]
> Attention DP requires TRT-LLM's PyTorch backend. AutoDeploy does not support attention DP.
## Performance Sweep
For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](../../../examples/backends/trtllm/performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
## Dynamo KV Block Manager Integration
Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
Here is the instruction: [Running KVBM in TensorRT-LLM](./../../../docs/components/kvbm/kvbm_guide.md#run-kvbm-in-dynamo-with-tensorrt-llm) .
**Issue:** In disaggregated serving mode, TensorRT-LLM workers can become stuck and unresponsive after sustained high-load traffic. Once in this state, workers require a pod/process restart to recover.
**Symptoms:**
- Workers function normally initially but hang after heavy load testing
- Inference requests get stuck and eventually timeout
- Logs show warnings: `num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache`
- Error logs may contain: `asyncio.exceptions.InvalidStateError: invalid state`
**Root Cause:** When `max_tokens_in_buffer` in the cache transceiver config is smaller than the maximum input sequence length (ISL) being processed, KV cache exhaustion can occur under heavy load. This causes context transfers to timeout, leaving workers stuck waiting for phantom transfers and entering an irrecoverable deadlock state.
**Mitigation:** Ensure `max_tokens_in_buffer` exceeds your maximum expected input sequence length. Update your engine configuration files (e.g., `prefill.yaml` and `decode.yaml`):
```yaml
cache_transceiver_config:
backend:DEFAULT
max_tokens_in_buffer:65536# Must exceed max ISL
```
For example, see `examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml`.
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Gemma 3 with Variable Sliding Window Attention
This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Since google/gemma-3-1b-it is a small model, each aggregated, decode, or prefill worker only requires one H100 GPU or one GB200 GPU.
VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. An example of this is Gemma 3, which incorporates both global attention layers and sliding window layers.
> [!Note]
> - Ensure that required services such as `nats` and `etcd` are running before starting.
> - Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.
> - It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running gpt-oss-120b Disaggregated with TensorRT-LLM
Dynamo supports disaggregated serving of gpt-oss-120b with TensorRT-LLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single B200 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
## Overview
This deployment uses disaggregated serving in TensorRT-LLM where:
-**Prefill Worker**: Processes input prompts efficiently using 4 GPUs with tensor parallelism
-**Decode Worker**: Generates output tokens using 4 GPUs, optimized for token generation throughput
-**Frontend**: Provides OpenAI-compatible API endpoint with round-robin routing
The disaggregated approach optimizes for both low-latency (maximizing tokens per second per user) and high-throughput (maximizing total tokens per GPU per second) use cases by separating the compute-intensive prefill phase from the memory-bound decode phase.
## Prerequisites
- 1x NVIDIA B200 node with 8 GPUs (this guide focuses on single-node B200 deployment)
- CUDA Toolkit 12.8 or later
- Docker with [NVIDIA Container Toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/install-guide.html) installed
- Fast SSD storage for model weights (~240GB required)
- HuggingFace account and [access token](https://huggingface.co/settings/tokens)
-`enable_attention_dp: true` - Attention data parallelism enabled for decode
-`disable_overlap_scheduler: false` - Enables overlapping for decode efficiency
-`moe_config.backend: CUTLASS` - Uses optimized CUTLASS kernels for MoE layers
-`cache_transceiver_config.backend: ucx` - Uses UCX for efficient KV cache transfer
-`cuda_graph_config.max_batch_size: 128` - Maximum batch size for CUDA graphs
#### Command-Line Arguments
Both workers receive these key arguments:
-`--tensor-parallel-size 4` - Uses 4 GPUs for tensor parallelism
-`--expert-parallel-size 4` - Expert parallelism across 4 GPUs
-`--free-gpu-memory-fraction 0.9` - Allocates 90% of GPU memory
Prefill-specific arguments:
-`--max-num-tokens 20000` - Maximum tokens for prefill processing
-`--max-batch-size 32` - Maximum batch size for prefill
Decode-specific arguments:
-`--max-num-tokens 16384` - Maximum tokens for decode processing
-`--max-batch-size 128` - Maximum batch size for decode
### 4. Launch the Deployment
Note that GPT-OSS is a reasoning model with tool calling support. To ensure the response is being processed correctly, the worker should be launched with proper ```--dyn-reasoning-parser``` and ```--dyn-tool-call-parser```.
You can use the provided launch script or run the components manually:
Poll the `/health` endpoint to verify that both the prefill and decode worker endpoints have started:
```
curl http://localhost:8000/health
```
Make sure that both of the endpoints are available before sending an inference request:
```
{
"endpoints": [
"dyn://dynamo.tensorrt_llm.generate",
"dyn://dynamo.prefill.generate"
],
"status": "healthy"
}
```
If only one worker endpoint is listed, the other may still be starting up. Monitor the worker logs to track startup progress.
### 7. Test the Deployment
Send a test request to verify the deployment:
```bash
curl -X POST http://localhost:8000/v1/responses \
-H"Content-Type: application/json"\
-d'{
"model": "openai/gpt-oss-120b",
"input": "Explain the concept of disaggregated serving in LLM inference in 3 sentences.",
"max_output_tokens": 200,
"stream": false
}'
```
The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters like `max_tokens`, `temperature`, and others according to your needs.
### 8. Reasoning and Tool Calling
Dynamo has supported reasoning and tool calling in OpenAI Chat Completion endpoint. A typical workflow for application built on top of Dynamo
is that the application has a set of tools to aid the assistant provide accurate answer, and it is ususally
multi-turn as it involves tool selection and generation based on the tool result.
In addition, the reasoning effort can be configured through ```chat_template_args```. Increasing the reasoning effort makes the model more accurate but also slower. It supports three levels: ```low```, ```medium```, and ```high```.
Below is an example of sending multi-round requests to complete a user query with reasoning and tool calling:
**Application setup (pseudocode)**
```Python
# The tool defined by the application
def get_system_health():
for component in system.components:
if not component.health():
return False
return True
# The JSON representation of the declaration in ChatCompletion tool style
tool_choice = '{
"type": "function",
"function": {
"name": "get_system_health",
"description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
"parameters": {
"type": "object",
"properties": {}
}
}
}'
# On user query, perform below workflow.
def user_query(app_request):
# first round
# create chat completion with prompt and tool choice
request = ...
response = send(request)
if response["finish_reason"] == "tool_calls":
# second round
function, params = parse_tool_call(response)
function_result = function(params)
# create request with prompt, assistant response, and function result
"content": "All systems are green—everything’s up and running smoothly! 🚀 Let me know if you need anything else.",
"role": "assistant",
"reasoning_content": "The user asks: \"Hey, quick check: is everything up and running?\" We have just checked system health, it's ok. Provide friendly response confirming everything's up."
},
"finish_reason": "stop"
}
],
"created": 1758758853,
"model": "openai/gpt-oss-120b",
"object": "chat.completion",
"usage": null
}
```
## Benchmarking
### Performance Testing with AIPerf
The Dynamo container includes [AIPerf](https://github.com/ai-dynamo/aiperf/tree/main?tab=readme-ov-file#aiperf), NVIDIA's tool for benchmarking generative AI models. This tool helps measure throughput, latency, and other performance metrics for your deployment.
**Run the following benchmark from inside the container** (after completing the deployment steps above):
```bash
# Create a directory for benchmark results
mkdir-p /tmp/benchmark-results
# Run the benchmark - this command tests the deployment with high-concurrency synthetic workload
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# KV Cache Transfer in Disaggregated Serving
In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:
## Using NIXL for KV Cache Transfer
Start the disaggregated service: See [Disaggregated Serving](./README.md#disaggregated) to learn how to start the deployment.
## Default Method: NIXL
By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as backend for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.
### Specify Backends for NIXL
TensorRT-LLM supports two NIXL communication backends: UCX and LIBFABRIC. By default, UCX is used if no backend is explicitly specified. Dynamo currently only supports the UCX backend, as LIBFABRIC support is still a work in progress. Please do not change the NIXL backend in the Dynamo runtime image.
## Alternative Method: UCX
TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. To enable UCX as the KV cache transfer backend, set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.
> [!Note]
> The environment variable `TRTLLM_USE_UCX_KVCACHE=1` with `cache_transceiver_config.backend: DEFAULT` does not enable UCX. You must explicitly set `backend: UCX` in the configuration.
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Llama 4 Maverick Instruct with Eagle Speculative Decoding on SLURM
This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](./multinode/multinode-examples.md) to set up the environment for the following scenarios:
-**Aggregated Serving:**
Deploy the entire Llama 4 model on a single GB200x4 node for end-to-end serving.
-**Disaggregated Serving:**
Distribute the workload across two GB200x4 nodes:
- One node runs the decode worker.
- The other node runs the prefill worker.
## Notes
* Make sure the (`eagle3_one_model: true`) is set in the LLM API config inside the `examples/backends/trtllm/engine_configs/llama4/eagle` folder.
## Setup
Assuming you have already allocated your nodes via `salloc`, and are
inside an interactive shell on one of the allocated nodes, set the
"messages": [{"role": "user", "content": "Why is NVIDIA a great company?"}],
"max_tokens": 1024
}' -w "\n"
# output:
{"id":"cmpl-3e87ea5c-010e-4dd2-bcc4-3298ebd845a8","choices":[{"text":"NVIDIA is considered a great company for several reasons:\n\n1. **Technological Innovation**: NVIDIA is a leader in the field of graphics processing units (GPUs) and has been at the forefront of technological innovation.
...
and the broader tech industry.\n\nThese factors combined have contributed to NVIDIA's status as a great company in the technology sector.","index":0,"logprobs":null,"finish_reason":"stop"}],"created":1753329671,"model":"nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8","system_fingerprint":null,"object":"text_completion","usage":{"prompt_tokens":16,"completion_tokens":562,"total_tokens":578,"prompt_tokens_details":null,"completion_tokens_details":null}}
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Example: Multi-node TRTLLM Workers with Dynamo on Slurm
> **Note:** The scripts referenced in this example (such as `srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
To run a single Dynamo+TRTLLM Worker that spans multiple nodes (ex: TP16),
the set of nodes need to be launched together in the same MPI world, such as
via `mpirun` or `srun`. This is true regardless of whether the worker is
aggregated, prefill-only, or decode-only.
In this document we will demonstrate two examples launching multinode workers
on a slurm cluster with `srun`:
1. Deploying an aggregated nvidia/DeepSeek-R1 model as a multi-node TP16/EP16
worker across 4 GB200 nodes
2. Deploying a disaggregated nvidia/DeepSeek-R1 model with a multi-node
TP16/EP16 prefill worker (4 nodes) and a multi-node TP16/EP16 decode
worker (4 nodes) across a total of 8 GB200 nodes.
NOTE: Some of the scripts used in this example like `start_frontend_services.sh` and
`start_trtllm_worker.sh` should be translatable to other environments like Kubernetes, or
using `mpirun` directly, with relative ease.
## Setup
For simplicity of the example, we will make some assumptions about your slurm cluster:
1. First, we assume you have access to a slurm cluster with multiple GPU nodes
available. For functional testing, most setups should be fine. For performance
testing, you should aim to allocate groups of nodes that are performantly
inter-connected, such as those in an NVL72 setup.
2. Second, we assume this slurm cluster has the [Pyxis](https://github.com/NVIDIA/pyxis)
SPANK plugin setup. In particular, the `srun_aggregated.sh` script in this
example will use `srun` arguments like `--container-image`,
`--container-mounts`, and `--container-env` that are added to `srun` by Pyxis.
If your cluster supports similar container based plugins, you may be able to
modify the script to use that instead.
3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as
described [here](https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container).
This is the image that can be set to the `IMAGE` environment variable in later steps.
4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
will allocate 8 nodes below as a reference command to have enough capacity
to run both examples. If you plan to only run the aggregated example, you
will only need 4 nodes. If you customize the configurations to require a
different number of nodes, you can adjust the number of allocated nodes
accordingly. Pre-allocating nodes is technically not a requirement,
but it makes iterations of testing/experimenting easier.
Make sure to set your `PARTITION` and `ACCOUNT` according to your slurm cluster setup:
```bash
# Set partition manually based on your slurm cluster's partition names
PARTITION=""
# Set account manually if this command doesn't work on your cluster
ACCOUNT="$(sacctmgr -nP show assoc where user=$(whoami)format=account)"
salloc \
--partition="${PARTITION}"\
--account="${ACCOUNT}"\
--job-name="${ACCOUNT}-dynamo.trtllm"\
-t 05:00:00 \
--nodes 8
```
5. Lastly, we will assume you are inside an interactive shell on one of your allocated
nodes, which may be the default behavior after executing the `salloc` command above
depending on the cluster setup. If not, then you should SSH into one of the allocated nodes.
### Environment Variable Setup
This example aims to automate as much of the environment setup as possible,
but all slurm clusters and environments are different, and you may need to
dive into the scripts to make modifications based on your specific environment.
Assuming you have already allocated your nodes via `salloc`, and are
inside an interactive shell on one of the allocated nodes, set the
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# LLM Deployment using vLLM
This directory contains reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
## Use the Latest Release
We recommend using the latest stable release of Dynamo to avoid breaking changes:
| **WideEP** | ✅ | Support for PPLX / DeepEP not verified |
| **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Container functional on main |
## vLLM Quick Start
Below we provide a guide that lets you run all of our the common deployment patterns on a single node.
### Start Infrastructure Services (Local Development Only)
For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](../../../deploy/docker-compose.yml):
```bash
docker compose -f deploy/docker-compose.yml up -d
```
> [!NOTE]
> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
### Pull or build container
We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks.
## Run Single Node Examples
> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
### Aggregated Serving
```bash
# requires one gpu
cd examples/backends/vllm
bash launch/agg.sh
```
### Aggregated Serving with KV Routing
```bash
# requires two gpus
cd examples/backends/vllm
bash launch/agg_router.sh
```
### Disaggregated Serving
```bash
# requires two gpus
cd examples/backends/vllm
bash launch/disagg.sh
```
### Disaggregated Serving with KV Routing
```bash
# requires three gpus
cd examples/backends/vllm
bash launch/disagg_router.sh
```
### Single Node Data Parallel Attention / Expert Parallelism
This example is not meant to be performant but showcases Dynamo routing to data parallel workers
```bash
# requires four gpus
cd examples/backends/vllm
bash launch/dep.sh
```
> [!TIP]
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
## Advanced Examples
Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!
### Speculative Decoding with Aggregated Serving (Meta-Llama-3.1-8B-Instruct + Eagle3)
Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model using **aggregated speculative decoding** on a single node.
This setup demonstrates how to use Dynamo to create an instance using Eagle-based speculative decoding under the **VLLM aggregated serving framework** for faster inference while maintaining accuracy.
> **See also:** [Speculative Decoding Feature Overview](../../features/speculative_decoding/README.md) for cross-backend documentation.
### Kubernetes Deployment
For complete Kubernetes deployment instructions, configurations, and troubleshooting, see [vLLM Kubernetes Deployment Guide](../../../examples/backends/vllm/deploy/README.md)
## Configuration
vLLM workers are configured through command-line arguments. Key parameters include:
-`--model`: Model to serve (e.g., `Qwen/Qwen3-0.6B`)
-`--is-prefill-worker`: Enable prefill-only mode for disaggregated serving
-`--metrics-endpoint-port`: Port for publishing KV metrics to Dynamo
-`--connector`: Specify which kv_transfer_config you want vllm to use `[nixl, lmcache, kvbm, none]`. This is a helper flag which overwrites the engines KVTransferConfig.
-**Required for:** Accepting pre-computed prompt embeddings via API
-**Default behavior:** Prompt embeddings DISABLED - requests with `prompt_embeds` will fail
-**Error without flag:**`ValueError: You must set --enable-prompt-embeds to input prompt_embeds`
See `args.py` for the full list of configuration options and their defaults.
The [documentation](https://docs.vllm.ai/en/v0.9.2/configuration/serve_args.html?h=serve+arg) for the vLLM CLI args points to running 'vllm serve --help' to see what CLI args can be added. We use the same argument parser as vLLM.
### Hashing Consistency for KV Events
When using KV-aware routing, ensure deterministic hashing across processes to avoid radix tree mismatches. Choose one of the following:
- Set `PYTHONHASHSEED=0` for all vLLM processes when relying on Python's builtin hashing for prefix caching.
- If your vLLM version supports it, configure a deterministic prefix caching algorithm, for example:
See the high-level notes in [Router Design](../../design_docs/router_design.md#deterministic-event-ids) on deterministic event IDs.
## Request Migration
Dynamo supports [request migration](../../../docs/fault_tolerance/request_migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../../docs/fault_tolerance/request_migration.md) documentation for configuration details.
## Request Cancellation
When a user cancels a request (e.g., by disconnecting from the frontend), the request is automatically cancelled across all workers, freeing compute resources for other requests.
### Cancellation Support Matrix
| | Prefill | Decode |
|-|---------|--------|
| **Aggregated** | ✅ | ✅ |
| **Disaggregated** | ✅ | ✅ |
For more details, see the [Request Cancellation Architecture](../../../docs/fault_tolerance/request_cancellation.md) documentation.
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running Deepseek R1 with Wide EP
Dynamo supports running Deepseek R1 with data parallel attention and wide expert parallelism. Each data parallel attention rank is a separate dynamo component that will emit its own KV Events and Metrics. vLLM controls the expert parallelism using the flag `--enable-expert-parallel`
## Instructions
The following script can be adapted to run Deepseek R1 with a variety of different configuration. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [ReadMe](README.md) Getting Started section on each node, and then run these two commands.
On node 0 (where the frontend was started) send a test request to verify your deployment:
```bash
curl localhost:8000/v1/chat/completions \
-H"Content-Type: application/json"\
-d'{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Running gpt-oss-120b Disaggregated with vLLM
Dynamo supports disaggregated serving of gpt-oss-120b with vLLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single H100 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
## Overview
This deployment uses disaggregated serving in vLLM where:
-**Prefill Worker**: Processes input prompts efficiently using 4 GPUs with tensor parallelism
-**Decode Worker**: Generates output tokens using 4 GPUs, optimized for token generation throughput
-**Frontend**: Provides OpenAI-compatible API endpoint with round-robin routing
## Prerequisites
This guide assumes readers already knows how to deploy Dynamo disaggregated serving with vLLM as illustrated in [README.md](/docs/backends/vllm/README.md)
## Instructions
### 1. Launch the Deployment
Note that GPT-OSS is a reasoning model with tool calling support. To
ensure the response is being processed correctly, the worker should be
launched with proper `--dyn-reasoning-parser` and `--dyn-tool-call-parser`.
If only one worker endpoint is listed, the other may still be starting up. Monitor the worker logs to track startup progress.
### 3. Test the Deployment
Send a test request to verify the deployment:
```bash
curl -X POST http://localhost:8000/v1/responses \
-H"Content-Type: application/json"\
-d'{
"model": "openai/gpt-oss-120b",
"input": "Explain the concept of disaggregated serving in LLM inference in 3 sentences.",
"max_output_tokens": 200,
"stream": false
}'
```
The server exposes a standard OpenAI-compatible API endpoint that accepts JSON requests. You can adjust parameters like `max_tokens`, `temperature`, and others according to your needs.
### 4. Reasoning and Tool Calling
Dynamo has supported reasoning and tool calling in OpenAI Chat Completion endpoint. A typical workflow for application built on top of Dynamo
is that the application has a set of tools to aid the assistant provide accurate answer, and it is ususally
multi-turn as it involves tool selection and generation based on the tool result. Below is an example
of sending multi-round requests to complete a user query with reasoning and tool calling:
**Application setup (pseudocode)**
```Python
# The tool defined by the application
def get_system_health():
for component in system.components:
if not component.health():
return False
return True
# The JSON representation of the declaration in ChatCompletion tool style
tool_choice = '{
"type": "function",
"function": {
"name": "get_system_health",
"description": "Returns the current health status of the LLM runtime—use before critical operations to verify the service is live.",
"parameters": {
"type": "object",
"properties": {}
}
}
}'
# On user query, perform below workflow.
def user_query(app_request):
# first round
# create chat completion with prompt and tool choice
request = ...
response = send(request)
if response["finish_reason"] == "tool_calls":
# second round
function, params = parse_tool_call(response)
function_result = function(params)
# create request with prompt, assistant response, and function result
"content": "All systems are green—everything’s up and running smoothly! 🚀 Let me know if you need anything else.",
"role": "assistant",
"reasoning_content": "The user asks: \"Hey, quick check: is everything up and running?\" We have just checked system health, it's ok. Provide friendly response confirming everything's up."
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# vLLM Prometheus Metrics
## Overview
When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both vLLM engine metrics (prefixed with `vllm:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
**For the complete and authoritative list of all vLLM metrics**, always refer to the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).
**For LMCache metrics and integration**, see the [LMCache Integration Guide](../../integrations/lmcache_integration.md).
**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
**For visualization setup instructions**, see the [Prometheus and Grafana Setup Guide](../../observability/prometheus-grafana.md).
## Environment Variables and Flags
| Variable/Flag | Description | Default | Example |
| `DYN_SYSTEM_PORT` | System metrics/health port. Required to expose `/metrics` endpoint. | `-1` (disabled) | `8081` |
| `--connector` | KV connector to use. Use `lmcache` to enable LMCache metrics. | `nixl` | `--connector lmcache` |
## Getting Started Quickly
This is a single machine example.
### Start Observability Stack
For visualizing metrics with Prometheus and Grafana, start the observability stack. See [Observability Getting Started](../../observability/README.md#getting-started-quickly) for instructions.
### Launch Dynamo Components
Launch a frontend and vLLM backend to test metrics:
```bash
# Start frontend (default port 8000, override with --http-port or DYN_HTTP_PORT env var)
vLLM exposes metrics in Prometheus Exposition Format text at the `/metrics` HTTP endpoint. All vLLM engine metrics use the `vllm:` prefix and include labels (e.g., `model_name`, `finished_reason`, `scheduling_event`) to identify the source.
**Example Prometheus Exposition Format text:**
```
# HELP vllm:request_success_total Number of successfully finished requests.
**Note:** The specific metrics shown above are examples and may vary depending on your vLLM version. Always inspect your actual `/metrics` endpoint or refer to the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) for the current list.
### Metric Categories
vLLM provides metrics in the following categories (all prefixed with `vllm:`):
-**Request metrics** - Request success, failure, and completion tracking
-**Performance metrics** - Latency, throughput, and timing measurements
-**Resource usage** - System resource consumption
-**Scheduler metrics** - Scheduling and queue management
-**Disaggregation metrics** - Metrics specific to disaggregated deployments (when enabled)
**Note:** Specific metrics are subject to change between vLLM versions. Always refer to the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) or inspect the `/metrics` endpoint for your vLLM version.
## Available Metrics
The official vLLM documentation includes complete metric definitions with:
For the complete and authoritative list of all vLLM metrics, see the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).
## LMCache Metrics
When LMCache is enabled with `--connector lmcache` and `DYN_SYSTEM_PORT` is set, LMCache metrics (prefixed with `lmcache:`) are automatically exposed via Dynamo's `/metrics` endpoint alongside vLLM and Dynamo metrics.
### Minimum Requirements
To access LMCache metrics, both of these are required:
Troubleshooting LMCache-related metrics and logs (including `PrometheusLogger instance already created with different metadata` and `PROMETHEUS_MULTIPROC_DIR` warnings) is documented in:
- vLLM v1 uses multiprocess metrics collection via `prometheus_client.multiprocess`
-`PROMETHEUS_MULTIPROC_DIR`: (optional). By default, Dynamo automatically manages this environment variable, setting it to a temporary directory where multiprocess metrics are stored as memory-mapped files. Each worker process writes its metrics to separate files in this directory, which are aggregated when `/metrics` is scraped. Users only need to set this explicitly where complete control over the metrics directory is required.
- Dynamo uses `MultiProcessCollector` to aggregate metrics from all worker processes
- Metrics are filtered by the `vllm:` and `lmcache:` prefixes before being exposed (when LMCache is enabled)
- The integration uses Dynamo's `register_engine_metrics_callback()` function with the global `REGISTRY`
- Metrics appear after vLLM engine initialization completes
- vLLM v1 metrics are different from v0 - see the [official documentation](https://docs.vllm.ai/en/latest/design/metrics.html) for migration details
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Prompt Embeddings
Dynamo supports prompt embeddings (also known as prompt embeds) as a secure alternative input method to traditional text prompts. By allowing applications to use pre-computed embeddings for inference, this feature not only offers greater flexibility in prompt engineering but also significantly enhances privacy and data security. With prompt embeddings, sensitive user data can be transformed into embeddings before ever reaching the inference server, reducing the risk of exposing confidential information during the AI workflow.
<!-- # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->
# Dynamo Benchmarking Guide
This benchmarking framework lets you compare performance across any combination of:
-**DynamoGraphDeployments**
-**External HTTP endpoints** (existing services deployed following standard documentation from vLLM, llm-d, AIBrix, etc.)
## Choosing Your Benchmarking Approach
Dynamo provides two benchmarking approaches to suit different use cases: **client-side** and **server-side**. Client-side refers to running benchmarks on your local machine and connecting to Kubernetes deployments via port-forwarding, while server-side refers to running benchmarks directly within the Kubernetes cluster using internal service URLs. Which method to use depends on your use case.
**TLDR:**
Need high performance/load testing? Server-side.
Just quick testing/comparison? Client-side.
### Use Client-Side Benchmarking When:
- You want to quickly test deployments
- You want immediate access to results on your local machine
- You're comparing external services or deployments (not necessarily just Dynamo deployments)
- You need to run benchmarks from your laptop/workstation
→ **[Go to Client-Side Benchmarking (Local)](#client-side-benchmarking-local)**
### Use Server-Side Benchmarking When:
- You have a development environment with kubectl access
- You're doing performance validation with high load/speed requirements
- You're experiencing timeouts or performance issues with client-side benchmarking
- You want optimal network performance (no port-forwarding overhead)
- You're running automated CI/CD pipelines
- You need isolated execution environments
- You're doing resource-intensive benchmarking
- You want persistent result storage in the cluster
→ **[Go to Server-Side Benchmarking (In-Cluster)](#server-side-benchmarking-in-cluster)**
### Quick Comparison
| Feature | Client-Side | Server-Side |
|---------|-------------|-------------|
| **Location** | Your local machine | Kubernetes cluster |
| **Network** | Port-forwarding required | Direct service DNS |
| **Results** | Local filesystem | Persistent volumes |
| **Best for** | Light load | High load |
## What This Tool Does
The framework is a Python-based wrapper around `aiperf` that:
- Benchmarks any HTTP endpoints
- Runs concurrency sweeps across configurable load levels
- Generates comparison plots with your custom labels
- Works with any HuggingFace-compatible model on NVIDIA GPUs (H200, H100, A100, etc.)
- Provides direct Python script execution for maximum flexibility
**Default sequence lengths**: Input: 2000 tokens, Output: 256 tokens (configurable with `--isl` and `--osl`)
**Important**: The `--model` parameter configures AIPerf for benchmarking and provides logging context. The default `--model` value in the benchmarking script is `Qwen/Qwen3-0.6B`, but it must match the model deployed at the endpoint(s).
---
# Client-Side Benchmarking (Local)
Client-side benchmarking runs on your local machine and connects to Kubernetes deployments via port-forwarding.
## Prerequisites
1.**Dynamo container environment** - You must be running inside a Dynamo container with the benchmarking tools pre-installed.
2.**HTTP endpoints** - Ensure you have HTTP endpoints available for benchmarking. These can be:
- DynamoGraphDeployments exposed via HTTP endpoints
- External services (vLLM, llm-d, AIBrix, etc.)
- Any HTTP endpoint serving HuggingFace-compatible models
3.**Benchmark dependencies** - Since benchmarks run locally, you need to install the required Python dependencies. Install them using:
```bash
pip install-r deploy/utils/requirements.txt
```
## User Workflow
Follow these steps to benchmark Dynamo deployments using client-side benchmarking:
### Step 1: Establish Kubernetes Cluster and Install Dynamo
Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform. First follow the [installation guide](/docs/kubernetes/installation_guide.md) to install Dynamo Kubernetes Platform, then use [deploy/utils/README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md) to set up benchmarking resources.
### Step 2: Deploy DynamoGraphDeployments
Deploy your DynamoGraphDeployments separately using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Each deployment should have a frontend service exposed.
### Step 3: Port-Forward and Benchmark Deployment A
```bash
# Port-forward the frontend service for deployment A
-`--benchmark-name`: Specific benchmark experiment name to plot (can be specified multiple times). Names must match subdirectory names under the data dir.
-`--output-dir`: Custom output directory for plots (defaults to data-dir/plots)
**Note**: If `--benchmark-name` is not specified, the script will plot all subdirectories found in the data directory.
### Using Your Own Models and Configuration
The benchmarking framework supports any HuggingFace-compatible LLM model. Specify your model in the benchmark script's `--model` parameter. It must match the model name of the deployment. You can override the default sequence lengths (2000/256 tokens) with `--isl` and `--osl` flags if needed for your specific workload.
The benchmarking framework is built around Python modules that provide direct control over the benchmark workflow. The Python benchmarking module connects to your existing endpoints, runs the benchmarks, and can generate plots. Deployment is user-managed and out of scope for this tool.
### Comparison Limitations
The plotting system supports up to 12 different benchmarks in a single comparison.
### Concurrency Configuration
You can customize the concurrency levels using the CONCURRENCIES environment variable:
└── <your-benchmark-name-N>/ # Results for additional benchmarking runs
└── c*/ # Same structure as above
```
**Example with actual benchmark names:**
```text
results/
├── plots/
├── experiment-a/ # --benchmark-name experiment-a
├── experiment-b/ # --benchmark-name experiment-b
└── experiment-c/ # --benchmark-name experiment-c
```
Each concurrency directory contains:
-**`profile_export_aiperf.json`** - Structured metrics from AIPerf
-**`profile_export_aiperf.csv`** - CSV format metrics from AIPerf
-**`profile_export.json`** - Raw AIPerf results
-**`inputs.json`** - Generated test inputs
---
# Server-Side Benchmarking (In-Cluster)
Server-side benchmarking runs directly within the Kubernetes cluster, eliminating the need for port forwarding and providing better resource utilization.
## What Server-Side Benchmarking Does
The server-side benchmarking solution:
- Runs benchmarks directly within the Kubernetes cluster using internal service URLs
- Uses Kubernetes service DNS for direct communication (no port forwarding required)
- Leverages the existing benchmarking infrastructure (`benchmarks.utils.benchmark`)
- Stores results persistently using `dynamo-pvc`
- Provides isolated execution environment with configurable resources
- Handles high load/speed requirements without timeout issues
-**Note**: Each benchmark job runs within a single Kubernetes namespace, but can benchmark services across multiple namespaces using the full DNS format `svc_name.namespace.svc.cluster.local`
## Prerequisites
1.**Kubernetes cluster** with NVIDIA GPUs and Dynamo namespace setup (see [Dynamo Kubernetes Platform docs](/docs/kubernetes/README.md))
2.**Storage** PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md))
3.**Docker image** containing the Dynamo benchmarking tools
## Quick Start
### Step 1: Deploy Your DynamoGraphDeployment
Deploy your DynamoGraphDeployment using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Ensure it has a frontend service exposed.
### Step 2: Deploy and Run Benchmark Job
**Note**: The server-side benchmarking job requires a Docker image containing the Dynamo benchmarking tools. Before the 0.5.1 release, you must build your own Docker image using the [container build instructions](https://github.com/ai-dynamo/dynamo/blob/main/container/README.md), push it to your container registry, then update the `image` field in `benchmarks/incluster/benchmark_job.yaml` to use your built image tag.
# Generate performance plots from the downloaded results
python3 -m benchmarks.utils.plot \
--data-dir ./benchmarks/results
```
This will create visualization plots. For more details on interpreting these plots, see the [Summary and Plots](#summary-and-plots) section above.
## Cross-Namespace Service Access
Server-side benchmarking can benchmark services across multiple namespaces from a single job using Kubernetes DNS. When referencing services in other namespaces, use the full DNS format:
To customize the benchmark, edit `benchmarks/incluster/benchmark_job.yaml`:
1.**Change the model**: Update the `--model` argument
2.**Change the benchmark name**: Update the `--benchmark-name` argument
3.**Change the service URL**: Update the `--endpoint-url` argument (use `<svc_name>.<namespace>.svc.cluster.local:port` for cross-namespace access)
4.**Change Docker image**: Update the image field if needed
### Example: Multi-Namespace Benchmarking
To benchmark services across multiple namespaces, you would need to run separate benchmark jobs for each service since the format supports one benchmark per job. However, the results are stored in the same PVC and may be accessed together.
kubectl get pods -n$NAMESPACE-l job-name=dynamo-benchmark
# Describe failed pod
kubectl describe pod <pod-name> -n$NAMESPACE
```
## Troubleshooting
### Common Issues
1.**Service not found**: Ensure your DynamoGraphDeployment frontend service is running
3.**PVC access**: Check that `dynamo-pvc` is properly configured and accessible
4.**Image pull issues**: Ensure the Docker image is accessible from the cluster
5.**Resource constraints**: Adjust resource limits if the job is being evicted
### Debug Commands
```bash
# Check PVC status
kubectl get pvc dynamo-pvc -n$NAMESPACE
# Check service endpoints
kubectl get svc -n$NAMESPACE
# Verify your service exists and has endpoints
SVC_NAME="${SERVICE_URL%%:*}"
kubectl get svc "$SVC_NAME"-n"$NAMESPACE"
kubectl get endpoints "$SVC_NAME"-n"$NAMESPACE"
```
---
## Customize Benchmarking Behavior
The built-in Python workflow connects to endpoints, benchmarks with aiperf, and generates plots. If you want to modify the behavior:
1.**Extend the workflow**: Modify `benchmarks/utils/workflow.py` to add custom deployment types or metrics collection
2.**Generate different plots**: Modify `benchmarks/utils/plot.py` to generate a different set of plots for whatever you wish to visualize.
3.**Direct module usage**: Use individual Python modules (`benchmarks.utils.benchmark`, `benchmarks.utils.plot`) for granular control over each step of the benchmarking process.
The Python benchmarking module provides a complete end-to-end benchmarking experience with full control over the workflow.
---
## Testing with Mocker Backend
For development and testing purposes, Dynamo provides a [mocker backend](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) that simulates LLM inference without requiring actual GPU resources. This is useful for:
-**Testing deployments** without expensive GPU infrastructure
-**Developing and debugging** router, planner, or frontend logic
-**CI/CD pipelines** that need to validate infrastructure without model execution
-**Benchmarking framework validation** to ensure your setup works before using real backends
The mocker backend mimics the API and behavior of real backends (vLLM, SGLang, TensorRT-LLM) but generates mock responses instead of running actual inference.
See the [mocker directory](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) for usage examples and configuration options.
This guide walks you through setting up and running A/B benchmarks to compare Dynamo's KV Smart Router against standard round-robin routing on a Kubernetes cluster.
## Overview
Dynamo's KV Smart Router intelligently routes requests based on KV cache affinity, improving performance for workloads with shared prompt prefixes. This guide helps you:
1. Deploy two identical Dynamo configurations:
a. A vllm server for Qwen3-32B with 8 workers (aggregated) **WITHOUT** KV Smart Router enabled
b. A vllm server for Qwen3-32B with 8 workers (aggregated) **WITH** KV Smart Router enabled
2. Run controlled benchmarks using AIPerf
3. Compare performance metrics to evaluate KV router effectiveness
**Prerequisites:** Kubernetes cluster with GPUs, kubectl, helm
---
## Prerequisites
### Required Tools
-`kubectl` (configured with cluster access)
-`helm` (v3+)
- HuggingFace account and token (if model downloads are gated)
- Kubernetes cluster with:
- GPU nodes (H100, H200, or similar)
- Sufficient GPU capacity (16+ GPUs recommended for this example)
- Dynamo platform installed globally OR ability to install per-namespace
If the model you're seeking to deploy requires HF token to download (Llama family models require this), replace `YOUR_HF_TOKEN` with your actual HuggingFace token:
```bash
# Router-OFF namespace
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="YOUR_HF_TOKEN"\
-n router-off-test
# Router-ON namespace
kubectl create secret generic hf-token-secret \
--from-literal=HF_TOKEN="YOUR_HF_TOKEN"\
-n router-on-test
```
### Step 1.3: Install Dynamo Platform (Per-Namespace)
If your cluster uses namespace-restricted Dynamo operators, you'll need to install the Dynamo platform in each namespace. Follow the [Dynamo Kubernetes Installation Guide](https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/installation_guide.md) to install the platform in both namespaces:
-`router-off-test`
-`router-on-test`
**Key Configuration Notes:**
- If your cluster uses namespace restrictions, ensure `dynamo-operator.namespaceRestriction.enabled=true` is set during installation
- Adjust version tags to match your cluster's available Dynamo versions
- If you encounter operator compatibility issues (e.g., unsupported MPI arguments), consult your cluster administrator or the Dynamo troubleshooting documentation
### Step 1.4: Verify Infrastructure
Wait for operators and infrastructure to be ready:
**💡 Optimization Tip:** Each worker will download the model independently (~20 minutes per pod). For faster initialization, add a shared PVC with `ReadWriteMany` access mode to cache the model.
With this configuration, only the first worker downloads the model; others use the cached version, reducing startup time from 20+ minutes to ~2 minutes per pod.
### Step 2.3: Monitor Deployment Progress
```bash
# Watch router-OFF pods
kubectl get pods -n router-off-test -w
# Watch router-ON pods
kubectl get pods -n router-on-test -w
```
Wait for all pods to reach `Running` status and pass readiness probes.
-**Without shared PVC**: 20-30 minutes per worker (workers download independently)
- For 8 workers: Budget **1-2 hours** for full deployment (workers start in parallel but are limited by node scheduling)
The startup probe allows 32 minutes per pod (failureThreshold: 60), which accommodates model download and initialization.
### Step 2.4: Verify All Workers Are Healthy
> ⚠️ **CRITICAL CHECKPOINT**: Before running benchmarks, you **MUST** verify equal worker health in both deployments. Unequal worker counts will invalidate your comparison results.
- Common issues: model download in progress, startup probe timeout, insufficient GPU resources
**Do not proceed with benchmarks until all 16 workers (8 per deployment) are healthy.**
---
## Phase 3: Prepare Benchmark Dataset
### Understanding the Mooncake Trace Dataset
For this A/B comparison, we use the **Mooncake Trace Dataset**, published by [Mooncake AI](https://github.com/kvcache-ai/Mooncake). This is a privacy-preserving dataset of real-world LLM inference traffic from production arxiv workloads.
**What's in the dataset?** Each trace entry contains:
-**Timestamp:** When the request arrived (for realistic request timing)
-**Input/output lengths:** Number of tokens in prompts and responses
### Why Mooncake Traces Matter for KV Cache Benchmarking
**The Challenge:** Traditional LLM benchmarks use synthetic or random data, which are often insufficient to capture real-world optimizations like KV Smart Router. To properly evaluate this feature, we need realistic traffic patterns with **prefix repetition** - but this creates a privacy problem: how do we measure realistic KV cache hit patterns without exposing actual user conversations?
Instead of storing actual prompt text, the Mooncake dataset uses cryptographic hashes to represent KV cache blocks. Each hash ID represents a **512-token block**, and the hash includes both the current block and all preceding blocks. This preserves the **pattern of prefix reuse** while completely protecting user privacy.
### How it works - Multi-turn conversation example
```text
Turn 1 (initial request - long document analysis):
Input: ~8,000 tokens (e.g., research paper + question)
└──────────── Reuses first 16 blocks (~8,192 tokens) ───────────────┘
```
When requests share the same hash IDs (e.g., blocks 46-61), it means they share those 512-token blocks - indicating **significant prefix overlap** (in this case, 8,192 tokens). The **KV Smart Router** routes requests with matching hash IDs to the same worker, maximizing cache hits and avoiding redundant computation for those shared prefix tokens.
**Key Dataset Properties:**
- ✅ **Realistic timing:** Request arrival patterns from production workloads
- ✅ **Real prefix patterns:** Up to 50% cache hit ratio ([Mooncake technical report](https://github.com/kvcache-ai/Mooncake))
- ✅ **Privacy-preserving:** No actual text - only hash-based cache block identifiers
- ✅ **Reproducible:** Public dataset enables fair comparisons across different systems
**Why this matters:** With random synthetic data, the KV Smart Router would show no benefit because there's no prefix reuse to exploit. Mooncake traces provide realistic workload patterns that demonstrate the router's real-world performance gains while respecting user privacy.
The Mooncake arxiv dataset has sufficient prefix overlap (long input sequences with similar patterns) to benefit from KV cache-aware routing. Workloads with explicit shared prefixes (system prompts, templates) may see even greater improvements.
failureThreshold:60# 32 minutes total (120s + 60*30s)
```
**Model Loading Times (approximate):**
- Qwen3-32B: ~20-25 minutes (first download)
- Llama-70B: ~25-30 minutes (first download)
- With cached model on node: ~2-5 minutes
### Issue: Unequal Worker Health
**Cause:** Resource constraints, image pull issues, or configuration errors
**Solution:**
```bash
# Check all worker status
kubectl get pods -n <namespace> -l nvidia.com/dynamo-component-type=worker
# Describe problematic pods
kubectl describe pod <pod-name> -n <namespace>
# Fix issues before benchmarking or results will be skewed
```
---
## Advanced Configuration
### Testing Different Models
Replace `Qwen/Qwen3-32B` with your model in:
- Deployment YAML `args` section
- AIPerf `--model` and `--tokenizer` parameters
### Adjusting Worker Count
Change `replicas: 8` in the deployment YAMLs. Ensure both deployments use the same count for fair comparison.
### Using Custom Datasets
Replace mooncake dataset with your own JSONL file:
- Format: One request per line with `timestamp` field
- AIPerf supports various formats via `--custom-dataset-type`
### Disaggregated Prefill/Decode
For advanced testing, add separate prefill workers:
```yaml
VllmPrefillWorker:
componentType:worker
replicas:2
# ... configuration
```
---
## Best Practices
1.**Equal Conditions:** Ensure both deployments have identical worker counts and health before benchmarking
2.**Warm-Up:** Run a small test (100 requests) before the full benchmark to warm up caches
3.**Multiple Runs:** Run benchmarks 3+ times and average results for statistical significance
4.**Monitor Workers:** Watch for any pod restarts or issues during benchmark runs
5.**Document Conditions:** Record cluster state, worker health, and any anomalies
6.**Test Relevant Workloads:** Use datasets that match your actual use case for meaningful results
---
## Conclusion
This guide provides a complete methodology for A/B testing Dynamo's KV Smart Router. The KV router's effectiveness depends heavily on workload characteristics—datasets with high prefix overlap will show the most benefit.
For questions or issues, consult the [Dynamo documentation](https://github.com/ai-dynamo/dynamo) or open an issue on GitHub.
---
## Appendix: Files Reference
-`router-off-deployment.yaml`: Standard routing deployment
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0 -->
# Frontend
The Dynamo Frontend is the API gateway for serving LLM inference requests. It provides OpenAI-compatible HTTP endpoints and KServe gRPC endpoints, handling request preprocessing, routing, and response formatting.
## Feature Matrix
| Feature | Status |
|---------|--------|
| OpenAI Chat Completions API | ✅ Supported |
| OpenAI Completions API | ✅ Supported |
| KServe gRPC v2 API | ✅ Supported |
| Streaming responses | ✅ Supported |
| Multi-model serving | ✅ Supported |
| Integrated routing | ✅ Supported |
| Tool calling | ✅ Supported |
## Quick Start
### Prerequisites
- Dynamo platform installed
-`etcd` and `nats-server -js` running
- At least one backend worker registered
### HTTP Frontend
```bash
python -m dynamo.frontend --http-port 8000
```
This starts an OpenAI-compatible HTTP server with integrated preprocessing and routing. Backends are auto-discovered when they call `register_llm`.
### KServe gRPC Frontend
```bash
python -m dynamo.frontend --kserve-grpc-server
```
See the [Frontend Guide](frontend_guide.md) for KServe-specific configuration and message formats.
### Kubernetes
```yaml
apiVersion:nvidia.com/v1alpha1
kind:DynamoGraphDeployment
metadata:
name:frontend-example
spec:
graphs:
-name:frontend
replicas:1
services:
-name:Frontend
image:nvcr.io/nvidia/dynamo/dynamo-vllm:latest
command:
-python
--m
-dynamo.frontend
---http-port
-"8000"
```
## Configuration
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--http-port` | 8000 | HTTP server port |
| `--kserve-grpc-server` | false | Enable KServe gRPC server |
<!-- SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0 -->
# Frontend Guide
This guide covers the KServe gRPC frontend configuration and integration for the Dynamo Frontend.
## KServe gRPC Frontend
### Motivation
[KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry-standard protocols for machine learning model inference. Triton inference server is one of the inference solutions that comply with KServe v2 API and it has gained a lot of adoption. To quickly enable Triton users to explore with Dynamo benefits, Dynamo provides a KServe gRPC frontend.
This documentation assumes readers are familiar with the usage of KServe v2 API and focuses on explaining the Dynamo parts that work together to support KServe API and how users may migrate existing KServe deployment to Dynamo.
## Supported Endpoints
*`ModelInfer` endpoint: KServe Standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#inference-1)
*`ModelStreamInfer` endpoint: Triton extension endpoint that provide bi-directional streaming version of the inference RPC to allow a sequence of inference requests/responses to be sent over a GRPC stream, as described [here](https://github.com/triton-inference-server/common/blob/main/protobuf/grpc_service.proto#L84-L92)
*`ModelMetadata` endpoint: KServe standard endpoint as described [here](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/required_api.md#model-metadata-1)
*`ModelConfig` endpoint: Triton extension endpoint as described [here](https://github.com/triton-inference-server/server/blob/main/docs/protocol/extension_model_configuration.md)
## Starting the Frontend
To start the KServe frontend, run the below command:
```bash
python -m dynamo.frontend --kserve-grpc-server
```
## gRPC Performance Tuning
The gRPC server supports optional HTTP/2 flow control tuning via environment variables. These can be set before starting the server to optimize for high-throughput streaming workloads.
| Environment Variable | Description | Default |
|---------------------|-------------|---------|
| `DYN_GRPC_INITIAL_CONNECTION_WINDOW_SIZE` | HTTP/2 connection-level flow control window size in bytes | tonic default (64KB) |
| `DYN_GRPC_INITIAL_STREAM_WINDOW_SIZE` | HTTP/2 per-stream flow control window size in bytes | tonic default (64KB) |
### Example: High-ISL/OSL configuration for streaming workloads
If these variables are not set, the server uses tonic's default values.
> **Note**: Tune these values based on your workload. Connection window should accommodate `concurrent_requests x request_size`. Memory overhead equals the connection window size (shared across all streams). See [gRPC performance best practices](https://grpc.io/docs/guides/performance/) and [gRPC channel arguments](https://grpc.github.io/grpc/core/group__grpc__arg__keys.html) for more details.
## Registering a Backend
Similar to HTTP frontend, the registered backend will be auto-discovered and added to the frontend list of serving model. To register a backend, the same `register_llm()` API will be used. Currently the frontend support serving of the following model type and model input combination:
*`ModelType::Completions` and `ModelInput::Text`: Combination for LLM backend that uses custom preprocessor
*`ModelType::Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. Dynamo vLLM / SGLang / TRTLLM backend)
*`ModelType::TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor-based inference
The first two combinations are backed by OpenAI Completions API, see [OpenAI Completions section](#openai-completions) for more detail. Whereas the last combination is most aligned with KServe API and the users can replace existing deployment with Dynamo once their backends implements adaptor for `NvCreateTensorRequest/NvCreateTensorResponse`, see [Tensor section](#tensor) for more detail:
### OpenAI Completions
Most of the Dynamo features are tailored for LLM inference and the combinations that are backed by OpenAI API can enable those features and are best suited for exploring those Dynamo features. However, this implies specific conversion between generic tensor-based messages and OpenAI message and imposes specific structure of the KServe request message.
#### Model Metadata / Config
The metadata and config endpoint will report the registered backend to have the below, note that this is not the exact response.
```json
{
"name":"$MODEL_NAME",
"version":1,
"platform":"dynamo",
"backend":"dynamo",
"inputs":[
{
"name":"text_input",
"datatype":"BYTES",
"shape":[1]
},
{
"name":"streaming",
"datatype":"BOOL",
"shape":[1],
"optional":true
}
],
"outputs":[
{
"name":"text_output",
"datatype":"BYTES",
"shape":[-1]
},
{
"name":"finish_reason",
"datatype":"BYTES",
"shape":[-1],
"optional":true
}
]
}
```
#### Inference
On receiving inference request, the following conversion will be performed:
*`text_input`: the element is expected to contain the user prompt string and will be converted to `prompt` field in OpenAI Completion request
*`streaming`: the element will be converted to `stream` field in OpenAI Completion request
On receiving model response, the following conversion will be performed:
*`text_output`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `text` of the choice.
*`finish_reason`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `finish_reason` of the choice.
### Tensor
This combination is used when the user is migrating an existing KServe-based backend into Dynamo ecosystem.
#### Model Metadata / Config
When registering the backend, the backend must provide the model's metadata as tensor-based deployment is generic and the frontend can't make any assumptions like for OpenAI Completions model. There are two methods to provide model metadata:
*[TensorModelConfig](../../../lib/llm/src/protocols/tensor.rs): This is Dynamo defined structure for model metadata, the backend can provide the model metadata as shown in this [example](../../../lib/bindings/python/tests/test_tensor.py). For metadata provided in such way, the following field will be set to a fixed value: `version: 1`, `platform: "dynamo"`, `backend: "dynamo"`. Note that for model config endpoint, the rest of the fields will be set to their default values.
*[triton_model_config](../../../lib/llm/src/protocols/tensor.rs): For users that already have Triton model config and require the full config to be returned for client side logic, they can set the config in `TensorModelConfig::triton_model_config` which supersedes other fields in `TensorModelConfig` and be used for endpoint responses. `triton_model_config` is expected to be the serialized string of the `ModelConfig` protobuf message, see [echo_tensor_worker.py](../../../tests/frontend/grpc/echo_tensor_worker.py) for example.
#### Inference
When receiving inference request, the backend will receive [NvCreateTensorRequest](../../../lib/llm/src/protocols/tensor.rs) and be expected to return [NvCreateTensorResponse](../../../lib/llm/src/protocols/tensor.rs), which are the mapping of ModelInferRequest / ModelInferResponse protobuf message in Dynamo.
## Python Bindings
The frontend may be started via Python binding, this is useful when integrating Dynamo in existing system that desire the frontend to be run in the same process with other components. See [server.py](../../../lib/bindings/python/examples/kserve_grpc_service/server.py) for example.
## Integration
### With Router
The frontend includes an integrated router for request distribution. Configure routing mode:
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# KV Block Manager (KVBM)
The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM and TensorRT-LLM.
KVBM offers:
- A **unified memory API** spanning GPU memory, pinned host memory, remote RDMA-accessible memory, local/distributed SSDs, and remote file/object/cloud storage systems
- Support for **block lifecycles** (allocate → register → match) with event-based state transitions
- Integration with **[NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)**, a dynamic memory exchange layer for remote registration, sharing, and access of memory blocks
> **Get started:** See the [KVBM Guide](kvbm_guide.md) for installation and deployment instructions.
## When to Use KV Cache Offloading
KV Cache offloading avoids expensive KV Cache recomputation, resulting in faster response times and better user experience. Providers benefit from higher throughput and lower cost per token, making inference services more scalable and efficient.
Offloading KV cache to CPU or storage is most effective when KV Cache exceeds GPU memory and cache reuse outweighs the overhead of transferring data. It is especially valuable in:
| Scenario | Benefit |
|----------|---------|
| **Long sessions and multi-turn conversations** | Preserves large prompt prefixes, avoids recomputation, improves first-token latency and throughput |
| **High concurrency** | Idle or partial conversations can be moved out of GPU memory, allowing active requests to proceed without hitting memory limits |
| **Shared or repeated content** | Reuse across users or sessions (system prompts, templates) increases cache hits, especially with remote or cross-instance sharing |
| **Memory- or cost-constrained deployments** | Offloading to RAM or SSD reduces GPU demand, allowing longer prompts or more users without adding hardware |
*High-level layered architecture view of Dynamo KV Block Manager and how it interfaces with different components of the LLM inference ecosystem*
KVBM has three primary logical layers:
**LLM Inference Runtime Layer** — The top layer includes inference runtimes (TensorRT-LLM, vLLM) that integrate through dedicated connector modules to the Dynamo KVBM. These connectors act as translation layers, mapping runtime-specific operations and events into KVBM's block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and memory tiering.
**KVBM Logic Layer** — The middle layer encapsulates core KV block manager logic and serves as the runtime substrate for managing block memory. The KVBM adapter normalizes representations and data layout for incoming requests across runtimes and forwards them to the core memory manager. This layer implements table lookups, memory allocation, block layout management, lifecycle state transitions, and block reuse/eviction policies.
**NIXL Layer** — The bottom layer provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, RDMA and NVLink remote memory sharing, dynamic block registration and metadata exchange, and provides a plugin interface for storage backends including block memory (GPU HBM, Host DRAM, Remote DRAM, Local SSD), local/remote filesystems, object stores, and cloud storage.
> **Learn more:** See the [KVBM Design Document](../../design_docs/kvbm_design.md) for detailed architecture, components, and data flows.
## Next Steps
-**[KVBM Guide](kvbm_guide.md)** — Installation, configuration, and deployment instructions
-**[KVBM Design](../../design_docs/kvbm_design.md)** — Architecture deep dive, components, and data flows
-**[LMCache Integration](../../integrations/lmcache_integration.md)** — Use LMCache with Dynamo vLLM backend
-**[FlexKV Integration](../../integrations/flexkv_integration.md)** — Use FlexKV for KV cache management
-**[SGLang HiCache](../../integrations/sglang_hicache.md)** — Enable SGLang's hierarchical cache with NIXL
-**[NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** — NIXL communication library details