Commit 2c3066bd authored by dagil-nvidia, committed by GitHub

docs: full migration of docs/ to fern format in fern/ (#6050)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent d59b9d72
...@@ -9,7 +9,7 @@ Dynamo supports running Deepseek R1 with data parallel attention and wide expert ...@@ -9,7 +9,7 @@ Dynamo supports running Deepseek R1 with data parallel attention and wide expert
## Instructions ## Instructions
The following script can be adapted to run Deepseek R1 with a variety of different configuration. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [vLLM Backend](README.md) Getting Started section on each node, and then run these two commands. The following script can be adapted to run Deepseek R1 with a variety of different configuration. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [ReadMe](README.md) Getting Started section on each node, and then run these two commands.
node 0 node 0
```bash ```bash
......
...@@ -80,7 +80,7 @@ python -m dynamo.frontend --router-mode kv & ...@@ -80,7 +80,7 @@ python -m dynamo.frontend --router-mode kv &
# Start prefill worker # Start prefill worker
python -m dynamo.vllm \ python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \ --model meta-llama/Llama-3.3-70B-Instruct
--tensor-parallel-size 8 \ --tensor-parallel-size 8 \
--enforce-eager --enforce-eager
``` ```
...@@ -89,7 +89,7 @@ python -m dynamo.vllm \ ...@@ -89,7 +89,7 @@ python -m dynamo.vllm \
```bash ```bash
# Start decode worker # Start decode worker
python -m dynamo.vllm \ python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \ --model meta-llama/Llama-3.3-70B-Instruct
--tensor-parallel-size 8 \ --tensor-parallel-size 8 \
--enforce-eager \ --enforce-eager \
--is-prefill-worker --is-prefill-worker
......
...@@ -11,7 +11,7 @@ When running vLLM through Dynamo, vLLM engine metrics are automatically passed t ...@@ -11,7 +11,7 @@ When running vLLM through Dynamo, vLLM engine metrics are automatically passed t
**For the complete and authoritative list of all vLLM metrics**, always refer to the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html). **For the complete and authoritative list of all vLLM metrics**, always refer to the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).
**For LMCache metrics and integration**, see the [LMCache Integration Guide](LMCache-Integration.md). **For LMCache metrics and integration**, see the [LMCache Integration Guide](../../integrations/lmcache-integration.md).
**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md). **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
...@@ -133,10 +133,10 @@ curl -s localhost:8081/metrics | grep "^lmcache:" ...@@ -133,10 +133,10 @@ curl -s localhost:8081/metrics | grep "^lmcache:"
Troubleshooting LMCache-related metrics and logs (including `PrometheusLogger instance already created with different metadata` and `PROMETHEUS_MULTIPROC_DIR` warnings) is documented in: Troubleshooting LMCache-related metrics and logs (including `PrometheusLogger instance already created with different metadata` and `PROMETHEUS_MULTIPROC_DIR` warnings) is documented in:
- [LMCache Integration Guide](LMCache-Integration.md#troubleshooting) - [LMCache Integration Guide](../../integrations/lmcache-integration.md#troubleshooting)
**For complete LMCache configuration and metric details**, see: **For complete LMCache configuration and metric details**, see:
- [LMCache Integration Guide](LMCache-Integration.md) - Setup and configuration - [LMCache Integration Guide](../../integrations/lmcache-integration.md) - Setup and configuration
- [LMCache Observability Documentation](https://docs.lmcache.ai/production/observability/vllm_endpoint.html) - Complete metrics reference - [LMCache Observability Documentation](https://docs.lmcache.ai/production/observability/vllm_endpoint.html) - Complete metrics reference
## Implementation Details ## Implementation Details
......
...@@ -3,6 +3,7 @@ ...@@ -3,6 +3,7 @@
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
--- ---
# Dynamo Benchmarking Guide # Dynamo Benchmarking Guide
This benchmarking framework lets you compare performance across any combination of: This benchmarking framework lets you compare performance across any combination of:
...@@ -64,7 +65,7 @@ The framework is a Python-based wrapper around `aiperf` that: ...@@ -64,7 +65,7 @@ The framework is a Python-based wrapper around `aiperf` that:
--- ---
## Client-Side Benchmarking (Local) # Client-Side Benchmarking (Local)
Client-side benchmarking runs on your local machine and connects to Kubernetes deployments via port-forwarding. Client-side benchmarking runs on your local machine and connects to Kubernetes deployments via port-forwarding.
...@@ -87,10 +88,10 @@ Client-side benchmarking runs on your local machine and connects to Kubernetes d ...@@ -87,10 +88,10 @@ Client-side benchmarking runs on your local machine and connects to Kubernetes d
Follow these steps to benchmark Dynamo deployments using client-side benchmarking: Follow these steps to benchmark Dynamo deployments using client-side benchmarking:
### Step 1: Establish Kubernetes Cluster and Install Dynamo ### Step 1: Establish Kubernetes Cluster and Install Dynamo
Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform. First follow the [installation guide](../kubernetes/installation-guide.md) to install Dynamo Kubernetes Platform, then use [deploy/utils/README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/utils/README.md) to set up benchmarking resources. Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform. First follow the [installation guide](../kubernetes/installation-guide.md) to install Dynamo Kubernetes Platform, then use [deploy/utils/README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md) to set up benchmarking resources.
### Step 2: Deploy DynamoGraphDeployments ### Step 2: Deploy DynamoGraphDeployments
Deploy your DynamoGraphDeployments separately using the [deployment documentation](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/). Each deployment should have a frontend service exposed. Deploy your DynamoGraphDeployments separately using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Each deployment should have a frontend service exposed.
### Step 3: Port-Forward and Benchmark Deployment A ### Step 3: Port-Forward and Benchmark Deployment A
```bash ```bash
...@@ -298,7 +299,7 @@ Each concurrency directory contains: ...@@ -298,7 +299,7 @@ Each concurrency directory contains:
--- ---
## Server-Side Benchmarking (In-Cluster) # Server-Side Benchmarking (In-Cluster)
Server-side benchmarking runs directly within the Kubernetes cluster, eliminating the need for port forwarding and providing better resource utilization. Server-side benchmarking runs directly within the Kubernetes cluster, eliminating the need for port forwarding and providing better resource utilization.
...@@ -316,17 +317,17 @@ The server-side benchmarking solution: ...@@ -316,17 +317,17 @@ The server-side benchmarking solution:
## Prerequisites ## Prerequisites
1. **Kubernetes cluster** with NVIDIA GPUs and Dynamo namespace setup (see [Dynamo Kubernetes Platform docs](../kubernetes/README.md)) 1. **Kubernetes cluster** with NVIDIA GPUs and Dynamo namespace setup (see [Dynamo Kubernetes Platform docs](../kubernetes/README.md))
2. **Storage** PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/utils/README.md)) 2. **Storage** PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md))
3. **Docker image** containing the Dynamo benchmarking tools 3. **Docker image** containing the Dynamo benchmarking tools
## Quick Start ## Quick Start
### Step 1: Deploy Your DynamoGraphDeployment ### Step 1: Deploy Your DynamoGraphDeployment
Deploy your DynamoGraphDeployment using the [deployment documentation](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/). Ensure it has a frontend service exposed. Deploy your DynamoGraphDeployment using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Ensure it has a frontend service exposed.
### Step 2: Deploy and Run Benchmark Job ### Step 2: Deploy and Run Benchmark Job
**Note**: The server-side benchmarking job requires a Docker image containing the Dynamo benchmarking tools. Before the 0.5.1 release, you must build your own Docker image using the [container build instructions](https://github.com/ai-dynamo/dynamo/tree/main/container/README.md), push it to your container registry, then update the `image` field in `benchmarks/incluster/benchmark_job.yaml` to use your built image tag. **Note**: The server-side benchmarking job requires a Docker image containing the Dynamo benchmarking tools. Before the 0.5.1 release, you must build your own Docker image using the [container build instructions](https://github.com/ai-dynamo/dynamo/blob/main/container/README.md), push it to your container registry, then update the `image` field in `benchmarks/incluster/benchmark_job.yaml` to use your built image tag.
```bash ```bash
export NAMESPACE=benchmarking export NAMESPACE=benchmarking
...@@ -519,7 +520,7 @@ The Python benchmarking module provides a complete end-to-end benchmarking exper ...@@ -519,7 +520,7 @@ The Python benchmarking module provides a complete end-to-end benchmarking exper
## Testing with Mocker Backend ## Testing with Mocker Backend
For development and testing purposes, Dynamo provides a [mocker backend](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/mocker/) that simulates LLM inference without requiring actual GPU resources. This is useful for: For development and testing purposes, Dynamo provides a [mocker backend](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) that simulates LLM inference without requiring actual GPU resources. This is useful for:
- **Testing deployments** without expensive GPU infrastructure - **Testing deployments** without expensive GPU infrastructure
- **Developing and debugging** router, planner, or frontend logic - **Developing and debugging** router, planner, or frontend logic
...@@ -528,4 +529,4 @@ For development and testing purposes, Dynamo provides a [mocker backend](https:/ ...@@ -528,4 +529,4 @@ For development and testing purposes, Dynamo provides a [mocker backend](https:/
The mocker backend mimics the API and behavior of real backends (vLLM, SGLang, TensorRT-LLM) but generates mock responses instead of running actual inference. The mocker backend mimics the API and behavior of real backends (vLLM, SGLang, TensorRT-LLM) but generates mock responses instead of running actual inference.
See the [mocker directory](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/mocker/) for usage examples and configuration options. See the [mocker directory](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) for usage examples and configuration options.
...@@ -3,8 +3,6 @@ ...@@ -3,8 +3,6 @@
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
--- ---
# Dynamo KV Smart Router A/B Benchmarking Guide
This guide walks you through setting up and running A/B benchmarks to compare Dynamo's KV Smart Router against standard round-robin routing on a Kubernetes cluster. This guide walks you through setting up and running A/B benchmarks to compare Dynamo's KV Smart Router against standard round-robin routing on a Kubernetes cluster.
## Overview ## Overview
...@@ -99,7 +97,7 @@ kubectl create secret generic hf-token-secret \ ...@@ -99,7 +97,7 @@ kubectl create secret generic hf-token-secret \
### Step 1.3: Install Dynamo Platform (Per-Namespace) ### Step 1.3: Install Dynamo Platform (Per-Namespace)
If your cluster uses namespace-restricted Dynamo operators, you'll need to install the Dynamo platform in each namespace. Follow the [Dynamo Kubernetes Installation Guide](https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/installation_guide.md) to install the platform in both namespaces: If your cluster uses namespace-restricted Dynamo operators, you'll need to install the Dynamo platform in each namespace. Follow the [Dynamo Kubernetes Installation Guide](https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/installation-guide.md) to install the platform in both namespaces:
- `router-off-test` - `router-off-test`
- `router-on-test` - `router-on-test`
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Frontend
The Dynamo Frontend is the API gateway for serving LLM inference requests. It provides OpenAI-compatible HTTP endpoints and KServe gRPC endpoints, handling request preprocessing, routing, and response formatting.
## Feature Matrix
| Feature | Status |
|---------|--------|
| OpenAI Chat Completions API | ✅ Supported |
| OpenAI Completions API | ✅ Supported |
| KServe gRPC v2 API | ✅ Supported |
| Streaming responses | ✅ Supported |
| Multi-model serving | ✅ Supported |
| Integrated routing | ✅ Supported |
| Tool calling | ✅ Supported |
## Quick Start
### Prerequisites
- Dynamo platform installed
- `etcd` and `nats-server -js` running
- At least one backend worker registered
### HTTP Frontend
```bash
python -m dynamo.frontend --http-port 8000
```
This starts an OpenAI-compatible HTTP server with integrated preprocessing and routing. Backends are auto-discovered when they call `register_llm`.
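Once the server is up, any OpenAI-style client can talk to it. The sketch below builds a Chat Completions request with only the Python standard library. It assumes the frontend listens on `localhost:8000`, serves the standard OpenAI `/v1/chat/completions` path, and has a model registered under the placeholder name `my-model`. The helper names are illustrative and not part of Dynamo.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style chat completion request."""
    payload = {
        "model": model,  # placeholder: use the name your backend registered
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> dict:
    """Send the request to a running frontend and return the parsed JSON body."""
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read())
```

With a frontend and at least one backend running, `send_chat_request(build_chat_request("http://localhost:8000", "my-model", "Hello!"))` should return the response as a dictionary. Building and sending are kept separate so the payload can be inspected without a live server.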
### KServe gRPC Frontend
```bash
python -m dynamo.frontend --kserve-grpc-server
```
See the [Frontend Guide](frontend-guide.md) for KServe-specific configuration and message formats.
### Kubernetes
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: frontend-example
spec:
  graphs:
    - name: frontend
      replicas: 1
      services:
        - name: Frontend
          image: nvcr.io/nvidia/dynamo/dynamo-vllm:latest
          command:
            - python
            - -m
            - dynamo.frontend
            - --http-port
            - "8000"
```
## Configuration
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--http-port` | 8000 | HTTP server port |
| `--kserve-grpc-server` | false | Enable KServe gRPC server |
| `--router-mode` | `round_robin` | Routing strategy: `round_robin`, `random`, `kv` |
See the [Frontend Guide](frontend-guide.md) for full configuration options.
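When the frontend is launched from a script, it can help to assemble the command line in one place. The sketch below is a hypothetical helper, not part of Dynamo. It uses only the three flags and the mode spellings listed in the table above, and it treats `--kserve-grpc-server` as a boolean switch, as its use in the Quick Start suggests. Check the accepted values against your installed version before relying on it.

```python
import sys


def frontend_command(http_port: int = 8000, router_mode: str = "round_robin",
                     kserve_grpc: bool = False) -> list[str]:
    """Assemble an argv list for `python -m dynamo.frontend` from the table above."""
    allowed_modes = {"round_robin", "random", "kv"}  # spellings copied from the table
    if router_mode not in allowed_modes:
        raise ValueError(f"unknown router mode: {router_mode!r}")
    argv = [sys.executable, "-m", "dynamo.frontend",
            "--http-port", str(http_port),
            "--router-mode", router_mode]
    if kserve_grpc:
        argv.append("--kserve-grpc-server")
    return argv
```

The returned list can be passed directly to `subprocess.Popen`. Validating the router mode up front surfaces typos before a process is spawned.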
## Next Steps
| Document | Description |
|----------|-------------|
| [Frontend Guide](frontend-guide.md) | KServe gRPC configuration and integration |
| [Router Documentation](../router/README.md) | KV-aware routing configuration |
...@@ -3,11 +3,15 @@ ...@@ -3,11 +3,15 @@
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
--- ---
# KServe gRPC frontend # Frontend Guide
## Motivation This guide covers the KServe gRPC frontend configuration and integration for the Dynamo Frontend.
[KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry standard protocol for machine learning model inference. Triton inference server is one of the inference solutions that comply with KServe v2 API and it has gained a lot of adoption. To quickly enable Triton users to explore with Dynamo benefits, Dynamo provides a KServe gRPC frontend. ## KServe gRPC Frontend
### Motivation
[KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry-standard protocols for machine learning model inference. Triton inference server is one of the inference solutions that comply with KServe v2 API and it has gained a lot of adoption. To quickly enable Triton users to explore with Dynamo benefits, Dynamo provides a KServe gRPC frontend.
This documentation assumes readers are familiar with the usage of KServe v2 API and focuses on explaining the Dynamo parts that work together to support KServe API and how users may migrate existing KServe deployment to Dynamo. This documentation assumes readers are familiar with the usage of KServe v2 API and focuses on explaining the Dynamo parts that work together to support KServe API and how users may migrate existing KServe deployment to Dynamo.
...@@ -20,8 +24,9 @@ This documentation assumes readers are familiar with the usage of KServe v2 API ...@@ -20,8 +24,9 @@ This documentation assumes readers are familiar with the usage of KServe v2 API
## Starting the Frontend ## Starting the Frontend
To start the KServe frontend, run the below command To start the KServe frontend, run the below command:
```
```bash
python -m dynamo.frontend --kserve-grpc-server python -m dynamo.frontend --kserve-grpc-server
``` ```
...@@ -45,54 +50,58 @@ python -m dynamo.frontend --kserve-grpc-server ...@@ -45,54 +50,58 @@ python -m dynamo.frontend --kserve-grpc-server
If these variables are not set, the server uses tonic's default values. If these variables are not set, the server uses tonic's default values.
> **Note**: Tune these values based on your workload. Connection window should accommodate `concurrent_requests × request_size`. Memory overhead equals the connection window size (shared across all streams). See [gRPC performance best practices](https://grpc.io/docs/guides/performance/) and [gRPC channel arguments](https://grpc.github.io/grpc/core/group__grpc__arg__keys.html) for more details. <Note>
Tune these values based on your workload. Connection window should accommodate `concurrent_requests x request_size`. Memory overhead equals the connection window size (shared across all streams). See [gRPC performance best practices](https://grpc.io/docs/guides/performance/) and [gRPC channel arguments](https://grpc.github.io/grpc/core/group__grpc__arg__keys.html) for more details.
</Note>
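The sizing rule in the note can be checked with a quick calculation. The sketch below is only the arithmetic from the note. The function name is hypothetical and the workload numbers are illustrative, not recommendations.

```python
def min_connection_window(concurrent_requests: int, request_size_bytes: int) -> int:
    """Smallest connection window (bytes) that fits all in-flight requests."""
    if concurrent_requests < 0 or request_size_bytes < 0:
        raise ValueError("inputs must be non-negative")
    return concurrent_requests * request_size_bytes


# Illustrative workload: 64 concurrent requests of 256 KiB each.
window = min_connection_window(64, 256 * 1024)
print(f"{window} bytes = {window // (1024 * 1024)} MiB")
```

Because the memory overhead equals the connection window size, this example workload would reserve roughly 16 MiB per connection.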
## Registering a Backend ## Registering a Backend
Similar to HTTP frontend, the registered backend will be auto-discovered and added to the frontend list of serving model. To register a backend, the same `register_llm()` API will be used. Currently the frontend support serving of the following model type and model input combination: Similar to HTTP frontend, the registered backend will be auto-discovered and added to the frontend list of serving model. To register a backend, the same `register_llm()` API will be used. Currently the frontend support serving of the following model type and model input combination:
* `ModelType::Completions` and `ModelInput::Text`: Combination for LLM backend that uses custom preprocessor * `ModelType::Completions` and `ModelInput::Text`: Combination for LLM backend that uses custom preprocessor
* `ModelType::Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. Dynamo vLLM / SGLang / TRTLLM backend) * `ModelType::Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. Dynamo vLLM / SGLang / TRTLLM backend)
* `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor based inference * `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor-based inference
The first two combinations are backed by OpenAI Completions API, see [OpenAI Completions section](#openai-completions) for more detail. Whereas the last combination is most aligned with KServe API and the users can replace existing deployment with Dynamo once their backends implements adaptor for `NvCreateTensorRequest/NvCreateTensorResponse`, see [Tensor section](#tensor) for more detail: The first two combinations are backed by OpenAI Completions API, see [OpenAI Completions section](#openai-completions) for more detail. Whereas the last combination is most aligned with KServe API and the users can replace existing deployment with Dynamo once their backends implements adaptor for `NvCreateTensorRequest/NvCreateTensorResponse`, see [Tensor section](#tensor) for more detail:
### OpenAI Completions ### OpenAI Completions
Most of the Dynamo features are tailored for LLM inference and the combinations that are backed by OpenAI API can enable those features and are best suited for exploring those Dynamo features. However, this implies specific conversion between generic tensor based messages and OpenAI message and imposes specific structure of the KServe request message. Most of the Dynamo features are tailored for LLM inference and the combinations that are backed by OpenAI API can enable those features and are best suited for exploring those Dynamo features. However, this implies specific conversion between generic tensor-based messages and OpenAI message and imposes specific structure of the KServe request message.
#### Model Metadata / Config #### Model Metadata / Config
The metadata and config endpoint will report the registered backend to have the below, note that this is not the exact response. The metadata and config endpoint will report the registered backend to have the below, note that this is not the exact response.
```
```json
{ {
name: $MODEL_NAME, "name": "$MODEL_NAME",
version: 1, "version": 1,
platform: "dynamo", "platform": "dynamo",
backend: "dynamo", # model config specific "backend": "dynamo",
inputs: [ "inputs": [
{ {
name: "text_input", "name": "text_input",
datatype: "BYTES", "datatype": "BYTES",
shape: [1] "shape": [1]
}, },
{ {
name: "streaming", "name": "streaming",
datatype: "BOOL", "datatype": "BOOL",
shape: [1], "shape": [1],
optional: true "optional": true
} }
] ],
outputs: [ "outputs": [
{ {
name: "text_output", "name": "text_output",
datatype: "BYTES", "datatype": "BYTES",
shape: [-1] "shape": [-1]
}, },
{ {
name: "finish_reason", "name": "finish_reason",
datatype: "BYTES", "datatype": "BYTES",
shape: [-1], "shape": [-1],
optional: true "optional": true
} }
] ]
} }
...@@ -101,26 +110,57 @@ The metadata and config endpoint will report the registered backend to have the ...@@ -101,26 +110,57 @@ The metadata and config endpoint will report the registered backend to have the
#### Inference #### Inference
On receiving inference request, the following conversion will be performed: On receiving inference request, the following conversion will be performed:
* `text_input`: the element is expected to contain the user prompt string and will be converted to `prompt` field in OpenAI Completion request * `text_input`: the element is expected to contain the user prompt string and will be converted to `prompt` field in OpenAI Completion request
* `streaming`: the element will be converted to `stream` field in OpenAI Completion request * `streaming`: the element will be converted to `stream` field in OpenAI Completion request
On receiving model response, the following conversion will be performed: On receiving model response, the following conversion will be performed:
* `text_output`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `text` of the choice. * `text_output`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `text` of the choice.
* `finish_reason`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `finish_reason` of the choice. * `finish_reason`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `finish_reason` of the choice.
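The two conversions above can be sketched in a few lines. This is an illustration of the field mapping only, not the frontend's actual code. Plain dictionaries and lists stand in for the KServe protobuf messages, and the function names are hypothetical.

```python
from typing import Any


def kserve_to_completion(model: str, inputs: dict[str, list[Any]]) -> dict[str, Any]:
    """Map KServe-style input tensors to an OpenAI Completions request body."""
    text_input = inputs["text_input"]
    if len(text_input) != 1:
        raise ValueError("text_input must have shape [1]")
    prompt = text_input[0]
    if isinstance(prompt, bytes):  # BYTES tensors carry raw bytes
        prompt = prompt.decode("utf-8")
    request: dict[str, Any] = {"model": model, "prompt": prompt}
    streaming = inputs.get("streaming")  # optional BOOL tensor
    if streaming is not None:
        request["stream"] = bool(streaming[0])
    return request


def completion_to_kserve(response: dict[str, Any]) -> dict[str, list[Any]]:
    """Map an OpenAI Completions response to KServe-style output tensors."""
    choices = response["choices"]
    outputs: dict[str, list[Any]] = {"text_output": [c["text"] for c in choices]}
    reasons = [c.get("finish_reason") for c in choices]
    if any(r is not None for r in reasons):  # finish_reason is an optional output
        outputs["finish_reason"] = [r or "" for r in reasons]
    return outputs
```

Each output tensor has one element per choice, which is why the metadata reports `shape: [-1]` for `text_output` and `finish_reason`.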
### Tensor ### Tensor
This combination is used when the user is migrating an existing KServe based backend into Dynamo ecosystem. This combination is used when the user is migrating an existing KServe-based backend into Dynamo ecosystem.
#### Model Metadata / Config #### Model Metadata / Config
When registering the backend, the backend must provide the model's metadata as tensor based deployment is generic and the frontend can't make any assumptions like for OpenAI Completions model. There are two methods to provide model metadata: When registering the backend, the backend must provide the model's metadata as tensor-based deployment is generic and the frontend can't make any assumptions like for OpenAI Completions model. There are two methods to provide model metadata:
* [TensorModelConfig](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs): This is Dynamo defined structure for model metadata, the backend can provide the model metadata as shown in this [example](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/tests/test_tensor.py). For metadata provided in such way, the following field will be set to a fixed value: `version: 1`, `platform: "dynamo"`, `backend: "dynamo"`. Note that for model config endpoint, the rest of the fields will be set to their default values.
* [triton_model_config](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs): For users that already have Triton model config and require the full config to be returned for client side logic, they can set the config in `TensorModelConfig::triton_model_config` which will supersedes other fields in `TensorModelConfig` and be used for endpoint responses. `triton_model_config` is expected to be the serialized string of the `ModelConfig` protobuf message, see [echo_tensor_worker.py](https://github.com/ai-dynamo/dynamo/blob/main/tests/frontend/grpc/echo_tensor_worker.py) for example. * [TensorModelConfig](https://github.com/ai-dynamo/dynamo/tree/main/lib/llm/src/protocols/tensor.rs): This is Dynamo defined structure for model metadata, the backend can provide the model metadata as shown in this [example](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/tests/test_tensor.py). For metadata provided in such way, the following field will be set to a fixed value: `version: 1`, `platform: "dynamo"`, `backend: "dynamo"`. Note that for model config endpoint, the rest of the fields will be set to their default values.
* [triton_model_config](https://github.com/ai-dynamo/dynamo/tree/main/lib/llm/src/protocols/tensor.rs): For users that already have Triton model config and require the full config to be returned for client side logic, they can set the config in `TensorModelConfig::triton_model_config` which supersedes other fields in `TensorModelConfig` and be used for endpoint responses. `triton_model_config` is expected to be the serialized string of the `ModelConfig` protobuf message, see [echo_tensor_worker.py](https://github.com/ai-dynamo/dynamo/blob/main/tests/frontend/grpc/echo_tensor_worker.py) for example.
#### Inference #### Inference
When receiving inference request, the backend will receive [NvCreateTensorRequest](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs) and be expected to return [NvCreateTensorResponse](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs), which are the mapping of ModelInferRequest / ModelInferResponse protobuf message in Dynamo. When receiving inference request, the backend will receive [NvCreateTensorRequest](https://github.com/ai-dynamo/dynamo/tree/main/lib/llm/src/protocols/tensor.rs) and be expected to return [NvCreateTensorResponse](https://github.com/ai-dynamo/dynamo/tree/main/lib/llm/src/protocols/tensor.rs), which are the mapping of ModelInferRequest / ModelInferResponse protobuf message in Dynamo.
## Python Bindings ## Python Bindings
The frontend may be started via Python binding, this is useful when integrating Dynamo in existing system that desire the frontend to be run in the same process with other components. See [server.py](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/kserve_grpc_service/server.py) for example. The frontend may be started via Python binding, this is useful when integrating Dynamo in existing system that desire the frontend to be run in the same process with other components. See [server.py](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/examples/kserve_grpc_service/server.py) for example.
## Integration
### With Router
The frontend includes an integrated router for request distribution. Configure routing mode:
```bash
python -m dynamo.frontend --router-mode kv --http-port 8000
```
See [Router Documentation](../router/README.md) for routing configuration details.
### With Backends
Backends auto-register with the frontend when they call `register_llm()`. Supported backends:
- [vLLM Backend](../../backends/vllm/README.md)
- [SGLang Backend](../../backends/sglang/README.md)
- [TensorRT-LLM Backend](../../backends/trtllm/README.md)
## See Also
| Document | Description |
|----------|-------------|
| [Frontend Overview](README.md) | Quick start and feature matrix |
| [Router Documentation](../router/README.md) | KV-aware routing configuration |
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# KV Block Manager (KVBM)
The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM and TensorRT-LLM.
KVBM offers:
- A **unified memory API** spanning GPU memory, pinned host memory, remote RDMA-accessible memory, local/distributed SSDs, and remote file/object/cloud storage systems
- Support for **block lifecycles** (allocate → register → match) with event-based state transitions
- Integration with **[NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)**, a dynamic memory exchange layer for remote registration, sharing, and access of memory blocks
> **Get started:** See the [KVBM Guide](kvbm-guide.md) for installation and deployment instructions.
## When to Use KV Cache Offloading
KV cache offloading avoids expensive KV cache recomputation, resulting in faster response times and a better user experience. Providers benefit from higher throughput and lower cost per token, making inference services more scalable and efficient.
Offloading the KV cache to CPU memory or storage is most effective when the KV cache exceeds GPU memory and the benefit of cache reuse outweighs the overhead of transferring data. It is especially valuable in the following scenarios:
| Scenario | Benefit |
|----------|---------|
| **Long sessions and multi-turn conversations** | Preserves large prompt prefixes, avoids recomputation, improves first-token latency and throughput |
| **High concurrency** | Idle or partial conversations can be moved out of GPU memory, allowing active requests to proceed without hitting memory limits |
| **Shared or repeated content** | Reuse across users or sessions (system prompts, templates) increases cache hits, especially with remote or cross-instance sharing |
| **Memory- or cost-constrained deployments** | Offloading to RAM or SSD reduces GPU demand, allowing longer prompts or more users without adding hardware |
## Feature Support Matrix
| | Feature | Support |
|--|---------|---------|
| **Backend** | Local | ✅ |
| | Kubernetes | ✅ |
| **LLM Framework** | vLLM | ✅ |
| | TensorRT-LLM | ✅ |
| | SGLang | ❌ |
| **Serving Type** | Aggregated | ✅ |
| | Disaggregated | ✅ |
## Architecture
![KVBM Architecture](/assets/img/kvbm-architecture.png)
*High-level layered architecture view of Dynamo KV Block Manager and how it interfaces with different components of the LLM inference ecosystem*
KVBM has three primary logical layers:
**LLM Inference Runtime Layer** — The top layer includes inference runtimes (TensorRT-LLM, vLLM) that integrate through dedicated connector modules to the Dynamo KVBM. These connectors act as translation layers, mapping runtime-specific operations and events into KVBM's block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and memory tiering.
**KVBM Logic Layer** — The middle layer encapsulates core KV block manager logic and serves as the runtime substrate for managing block memory. The KVBM adapter normalizes representations and data layout for incoming requests across runtimes and forwards them to the core memory manager. This layer implements table lookups, memory allocation, block layout management, lifecycle state transitions, and block reuse/eviction policies.
**NIXL Layer** — The bottom layer provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, RDMA and NVLink remote memory sharing, and dynamic block registration and metadata exchange. It also provides a plugin interface for storage backends, including block memory (GPU HBM, Host DRAM, Remote DRAM, Local SSD), local/remote filesystems, object stores, and cloud storage.
> **Learn more:** See the [KVBM Design Document](../../design-docs/kvbm-design.md) for detailed architecture, components, and data flows.
## Next Steps
- **[KVBM Guide](kvbm-guide.md)** — Installation, configuration, and deployment instructions
- **[KVBM Design](../../design-docs/kvbm-design.md)** — Architecture deep dive, components, and data flows
- **[LMCache Integration](../../integrations/lmcache-integration.md)** — Use LMCache with Dynamo vLLM backend
- **[FlexKV Integration](../../integrations/flexkv-integration.md)** — Use FlexKV for KV cache management
- **[SGLang HiCache](../../integrations/sglang-hicache.md)** — Enable SGLang's hierarchical cache with NIXL
- **[NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** — NIXL communication library details
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# KVBM Guide
The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM and TensorRT-LLM.
KVBM is modular and can be used standalone via `pip install kvbm` or as the memory management component in the full Dynamo stack. This guide covers installation, configuration, and deployment of the Dynamo KV Block Manager (KVBM) and other KV cache management systems.
## Table of Contents
- [Quick Start](#quick-start)
- [Run KVBM Standalone](#run-kvbm-standalone)
- [Run KVBM in Dynamo with vLLM](#run-kvbm-in-dynamo-with-vllm)
- [Run KVBM in Dynamo with TensorRT-LLM](#run-kvbm-in-dynamo-with-tensorrt-llm)
- [Run Dynamo with SGLang HiCache](#run-dynamo-with-sglang-hicache)
- [Disaggregated Serving with KVBM](#disaggregated-serving-with-kvbm)
- [Configuration](#configuration)
- [Enable and View KVBM Metrics](#enable-and-view-kvbm-metrics)
- [Benchmarking KVBM](#benchmarking-kvbm)
- [Troubleshooting](#troubleshooting)
- [Developing Locally](#developing-locally)
## Quick Start
## Run KVBM Standalone
KVBM can be used on its own, without the rest of the Dynamo stack:
```bash
pip install kvbm
```
See the [support matrix](../../reference/support-matrix.md) for version compatibility.
### Build from Source
To build KVBM from source, see the detailed instructions in the [KVBM bindings README](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/kvbm/README.md#build-from-source).
## Run KVBM in Dynamo with vLLM
### Docker Setup
```bash
# Start up etcd for KVBM leader/worker registration and discovery
docker compose -f deploy/docker-compose.yml up -d
# Build a dynamo vLLM container (KVBM is built in by default)
./container/build.sh --framework vllm
# Launch the container
./container/run.sh --framework vllm -it --mount-workspace --use-nixl-gds
```
### Aggregated Serving
```bash
cd $DYNAMO_HOME/examples/backends/vllm
./launch/agg_kvbm.sh
```
#### Verify Deployment
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"stream": false,
"max_tokens": 10
}'
```
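The same check can be scripted. The sketch below mirrors the `curl` request above using only the Python standard library; the helper names are illustrative, and the response parsing assumes an OpenAI-compatible response shape.

```python
import json
import urllib.request


def build_chat_request(base_url, model, prompt, max_tokens=10):
    """Build (but do not send) a chat completion request matching the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=30):
    """Send the request and return the first reply text.

    Assumes an OpenAI-compatible response body.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the frontend running, `send_chat_request(build_chat_request("http://localhost:8000", "Qwen/Qwen3-0.6B", "Hello, how are you?"))` should return the generated text.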
#### Alternative: Using `vllm serve` Directly
You can also use `vllm serve` directly with KVBM:
```bash
vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "kvbm.vllm_integration.connector"}' Qwen/Qwen3-0.6B
```
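Hand-writing JSON inside shell quotes is easy to get wrong. As a minimal sketch, the command above can be assembled programmatically; the config keys are copied from the example, while the helper name is an assumption made for illustration.

```python
import json
import shlex

# Keys and values taken from the `vllm serve` example above.
kv_transfer_config = {
    "kv_connector": "DynamoConnector",
    "kv_role": "kv_both",
    "kv_connector_module_path": "kvbm.vllm_integration.connector",
}


def vllm_serve_command(model, config):
    """Return a shell-safe `vllm serve` command line with the config encoded as JSON."""
    args = ["vllm", "serve", "--kv-transfer-config", json.dumps(config), model]
    # shlex.join quotes each argument so the JSON survives the shell intact.
    return shlex.join(args)


print(vllm_serve_command("Qwen/Qwen3-0.6B", kv_transfer_config))
```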
## Run KVBM in Dynamo with TensorRT-LLM
> [!NOTE]
> **Prerequisites:**
> - Ensure `etcd` and `nats` are running before starting
> - KVBM only supports TensorRT-LLM's PyTorch backend
> - Disable partial reuse (`enable_partial_reuse: false`) to increase offloading cache hits
> - KVBM requires TensorRT-LLM v1.2.0rc2 or newer
### Docker Setup
```bash
# Start up etcd for KVBM leader/worker registration and discovery
docker compose -f deploy/docker-compose.yml up -d
# Build a dynamo TRTLLM container (KVBM is built in by default)
./container/build.sh --framework trtllm
# Launch the container
./container/run.sh --framework trtllm -it --mount-workspace --use-nixl-gds
```
### Aggregated Serving
```bash
# Write the LLM API config
cat > "/tmp/kvbm_llm_api_config.yaml" <<EOF
backend: pytorch
cuda_graph_config: null
kv_cache_config:
  enable_partial_reuse: false
  free_gpu_memory_fraction: 0.80
kv_connector_config:
  connector_module: kvbm.trtllm_integration.connector
  connector_scheduler_class: DynamoKVBMConnectorLeader
  connector_worker_class: DynamoKVBMConnectorWorker
EOF
# Start dynamo frontend
python3 -m dynamo.frontend --http-port 8000 &
# Serve the model with KVBM
python3 -m dynamo.trtllm \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--extra-engine-args /tmp/kvbm_llm_api_config.yaml &
```
#### Verify Deployment
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"stream": false,
"max_tokens": 30
}'
```
#### Alternative: Using trtllm-serve
```bash
trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml
```
## Run Dynamo with SGLang HiCache
SGLang's Hierarchical Cache (HiCache) extends KV cache storage beyond GPU memory to include host CPU memory. When using NIXL as the storage backend, HiCache integrates with Dynamo's memory infrastructure.
### Quick Start
```bash
# Start SGLang worker with HiCache enabled
python -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--host 0.0.0.0 --port 8000 \
--enable-hierarchical-cache \
--hicache-ratio 2 \
--hicache-write-policy write_through \
--hicache-storage-backend nixl
# In a separate terminal, start the frontend
python -m dynamo.frontend --http-port 8000
# Send a test request
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false,
"max_tokens": 30
}'
```
> **Learn more:** See the [SGLang HiCache Integration Guide](../../integrations/sglang-hicache.md) for detailed configuration, deployment examples, and troubleshooting.
## Disaggregated Serving with KVBM
KVBM supports disaggregated serving where prefill and decode operations run on separate workers. KVBM is enabled on the prefill worker to offload KV cache.
### Disaggregated Serving with vLLM
```bash
# 1P1D - one prefill worker and one decode worker
# NOTE: requires at least 2 GPUs
cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm.sh
# 2P2D - two prefill workers and two decode workers
# NOTE: requires at least 4 GPUs
cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm_2p2d.sh
```
### Disaggregated Serving with TensorRT-LLM
```bash
# Launch prefill worker with KVBM
python3 -m dynamo.trtllm \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--extra-engine-args /tmp/kvbm_llm_api_config.yaml \
--disaggregation-mode prefill &
```
## Configuration
### Cache Tier Configuration
Configure KVBM cache tiers using environment variables:
```bash
# Option 1: CPU cache only (GPU -> CPU offloading)
export DYN_KVBM_CPU_CACHE_GB=4 # 4GB of pinned CPU memory
# Option 2: Both CPU and Disk cache (GPU -> CPU -> Disk tiered offloading)
export DYN_KVBM_CPU_CACHE_GB=4
export DYN_KVBM_DISK_CACHE_GB=8 # 8GB of disk
# [Experimental] Option 3: Disk cache only (GPU -> Disk direct offloading)
# NOTE: Experimental, may not provide optimal performance
# NOTE: Disk offload filtering not supported with this option
export DYN_KVBM_DISK_CACHE_GB=8
```
You can also specify exact block counts instead of GB:
- `DYN_KVBM_CPU_CACHE_OVERRIDE_NUM_BLOCKS`
- `DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS`
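
To relate a size in GB to a block count, the standard KV cache sizing formula can be used as a rough estimate. This is a sketch under stated assumptions: the model shape and block size below are illustrative, 1 GB is taken as 1024³ bytes, and KVBM's own accounting may differ.

```python
def kv_bytes_per_block(num_layers, num_kv_heads, head_dim, block_size, dtype_bytes=2):
    """Bytes needed to store one KV block of `block_size` tokens."""
    # The factor of 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * block_size * dtype_bytes


def blocks_for_cache(cache_gb, bytes_per_block):
    """Whole blocks that fit in `cache_gb`, assuming 1 GB = 1024**3 bytes."""
    return int(cache_gb * 1024**3) // bytes_per_block


# Illustrative shape: 28 layers, 8 KV heads, head dim 128, 16-token blocks, 2-byte dtype.
per_block = kv_bytes_per_block(28, 8, 128, 16)
print(per_block, blocks_for_cache(4, per_block))
```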
### SSD Lifespan Protection
When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy only offloads KV blocks from CPU to disk if the blocks have frequency ≥ 2. Frequency doubles on cache hit (initialized at 1) and decrements by 1 on each time decay step.
To disable disk offload filtering:
```bash
export DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER=true
```
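The filtering policy described in this section can be sketched as follows. This is an illustrative model, not KVBM's implementation: the class and method names are hypothetical, and flooring the frequency at zero during decay is an assumption.

```python
class OffloadFrequencyFilter:
    """Toy model of the CPU-to-disk offload filter."""

    def __init__(self, threshold=2):
        self.threshold = threshold
        self.freq = {}

    def register(self, block_hash):
        # New blocks start with a frequency of 1.
        self.freq.setdefault(block_hash, 1)

    def on_cache_hit(self, block_hash):
        # Frequency doubles on every cache hit.
        self.freq[block_hash] = self.freq.get(block_hash, 1) * 2

    def decay(self):
        # Each time decay step decrements by 1 (floor at 0 is an assumption).
        for block_hash in self.freq:
            self.freq[block_hash] = max(0, self.freq[block_hash] - 1)

    def should_offload(self, block_hash):
        # Only blocks with frequency >= threshold are written to disk.
        return self.freq.get(block_hash, 0) >= self.threshold
```

Under this model a block that is never reused stays in CPU memory only, while a single hit is enough to qualify it for disk until the next decay step.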
## Enable and View KVBM Metrics
### Setup Monitoring Stack
```bash
# Start basic services (etcd & natsd), along with Prometheus and Grafana
docker compose -f deploy/docker-observability.yml up -d
```
### Enable Metrics for vLLM
```bash
DYN_KVBM_METRICS=true \
DYN_KVBM_CPU_CACHE_GB=20 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--enforce-eager \
--connector kvbm
```
### Enable Metrics for TensorRT-LLM
```bash
DYN_KVBM_METRICS=true \
DYN_KVBM_CPU_CACHE_GB=20 \
python3 -m dynamo.trtllm \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--extra-engine-args /tmp/kvbm_llm_api_config.yaml &
```
### Firewall Configuration (Optional)
```bash
# If firewall blocks KVBM metrics ports
sudo ufw allow 6880/tcp
```
### View Metrics
Access Grafana at http://localhost:3000 (default login: `dynamo`/`dynamo`) and look for the **KVBM Dashboard**.
### Available Metrics
| Metric | Description |
|--------|-------------|
| `kvbm_matched_tokens` | Number of matched tokens |
| `kvbm_offload_blocks_d2h` | Offload blocks from device to host |
| `kvbm_offload_blocks_h2d` | Offload blocks from host to disk |
| `kvbm_offload_blocks_d2d` | Offload blocks from device to disk (bypassing host) |
| `kvbm_onboard_blocks_d2d` | Onboard blocks from disk to device |
| `kvbm_onboard_blocks_h2d` | Onboard blocks from host to device |
| `kvbm_host_cache_hit_rate` | Host cache hit rate (0.0-1.0) |
| `kvbm_disk_cache_hit_rate` | Disk cache hit rate (0.0-1.0) |
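
Outside Grafana, these metrics can be inspected with a few lines of Python. The sketch below parses Prometheus text exposition output and keeps the `kvbm_*` samples; it assumes one series per metric and no trailing timestamps, and the sample values are made up.

```python
def parse_kvbm_metrics(text):
    """Extract kvbm_* samples from Prometheus text exposition format."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name.startswith("kvbm_"):
            metrics[name] = float(value)
    return metrics


# Made-up sample in Prometheus text format.
sample = """\
# sample values are illustrative only
kvbm_matched_tokens 1024
kvbm_host_cache_hit_rate 0.75
other_metric{a="b"} 3
"""
print(parse_kvbm_metrics(sample))
```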
## Benchmarking KVBM
Use [LMBenchmark](https://github.com/LMCache/LMBenchmark) to evaluate KVBM performance.
### Setup
```bash
git clone https://github.com/LMCache/LMBenchmark.git
cd LMBenchmark/synthetic-multi-round-qa
```
### Run Benchmark
```bash
# Synthetic multi-turn chat dataset
# Arguments: model, endpoint, output prefix, qps
./long_input_short_output_run.sh \
"Qwen/Qwen3-0.6B" \
"http://localhost:8000" \
"benchmark_kvbm" \
1
```
The script reports average TTFT and other performance numbers in its output.
> **TIP:** If metrics are enabled, observe KV offloading and onboarding in the Grafana dashboard.
### Baseline Comparison
#### vLLM Baseline (without KVBM)
```bash
vllm serve Qwen/Qwen3-0.6B
```
#### TensorRT-LLM Baseline (without KVBM)
```bash
# Create config without kv_connector_config
cat > "/tmp/llm_api_config.yaml" <<EOF
backend: pytorch
cuda_graph_config: null
kv_cache_config:
  enable_partial_reuse: false
  free_gpu_memory_fraction: 0.80
EOF
trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/llm_api_config.yaml
```
## Troubleshooting
### No TTFT Performance Gain
**Symptom:** Enabling KVBM does not show TTFT improvement or causes performance degradation.
**Cause:** Not enough prefix cache hits on KVBM to reuse offloaded KV blocks.
**Solution:** Enable KVBM metrics and check the Grafana dashboard for `Onboard Blocks - Host to Device` and `Onboard Blocks - Disk to Device`. Large numbers of onboarded KV blocks indicate good cache reuse:
![Grafana Example](/assets/img/kvbm-metrics-grafana.png)
### KVBM Worker Initialization Timeout
**Symptom:** KVBM fails to start when allocating large memory or disk storage.
**Solution:** Increase the leader-worker initialization timeout (default: 1800 seconds):
```bash
export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600 # 1 hour
```
### Disk Offload Fails to Start
**Symptom:** KVBM fails to start when disk offloading is enabled.
**Cause:** `fallocate()` is not supported on the filesystem (e.g., Lustre, certain network filesystems).
**Solution:** Enable disk zerofill fallback:
```bash
export DYN_KVBM_DISK_ZEROFILL_FALLBACK=true
```
If you encounter "write all error" or EINVAL (errno 22), also try:
```bash
export DYN_KVBM_DISK_DISABLE_O_DIRECT=true
```
## Developing Locally
Inside the Dynamo container, after changing KVBM-related code (Rust and/or Python):
```bash
cd /workspace/lib/bindings/kvbm
# Quote the extra so shells such as zsh do not treat the brackets as a glob
uv pip install "maturin[patchelf]"
maturin build --release --out /workspace/dist
uv pip install --upgrade --force-reinstall --no-deps /workspace/dist/kvbm*.whl
```
## See Also
- [KVBM Overview](README.md) for a quick overview of KV Caching, KVBM and its architecture
- [KVBM Design](../../design-docs/kvbm-design.md) for a deep dive into KVBM architecture
- [LMCache Integration](../../integrations/lmcache-integration.md)
- [FlexKV Integration](../../integrations/flexkv-integration.md)
- [SGLang HiCache](../../integrations/sglang-hicache.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Planner
The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](planner-guide.md) for a complete workflow including profiling and deployment.
## Feature Matrix
| Category | Feature | Status |
|----------|---------|--------|
| **Backend** | Local (bare metal) | Deprecated |
| | Kubernetes | Supported |
| **LLM Framework** | vLLM | Supported |
| | TensorRT-LLM | Supported |
| | SGLang | Supported |
| **Serving Type** | Aggregated | Unsupported |
| | Disaggregated | Supported |
| **Scaling Mode** | SLA-based (TTFT/ITL targets) | Supported (primary) |
| | Load-based (KV cache/queue thresholds) | Deprecated |
| **Load Predictors** | ARIMA | Supported |
| | Prophet | Supported |
| | Kalman filter | Supported |
| | Constant (current = next) | Supported |
| **Connectors** | KubernetesConnector (native DGD scaling) | Supported |
| | VirtualConnector (external environments) | Supported |
## Quick Start
### Prerequisites
- Dynamo platform installed on Kubernetes ([Installation Guide](../../kubernetes/installation-guide.md))
- kube-prometheus-stack installed ([Metrics Setup](../../kubernetes/observability/metrics.md))
- Pre-deployment profiling completed ([Profiling Guide](../profiler/profiler-guide.md))
### Deploy with DGDR (Recommended)
The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest:
```bash
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```
This automatically profiles your model and deploys with the SLA planner. See [SLA Planner Guide](planner-guide.md) for the full workflow.
### Deploy with DGD (Manual)
For manual control, use the disaggregated planner templates:
```bash
# After profiling is complete
kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
```
## Documentation
| Document | Description |
|----------|-------------|
| [Planner Guide](planner-guide.md) | Deployment, configuration, integration, troubleshooting |
| [Planner Examples](planner-examples.md) | DGDR YAML examples, sample configurations, advanced patterns |
| [SLA Planner Guide](planner-guide.md) | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
| [SLA-based Planner](planner-guide.md) | Scaling algorithm, correction factors, load prediction details |
| [Load-based Planner](README.md) | Legacy load-based scaling (deprecated) |
| [SLA-Driven Profiling](../profiler/profiler-guide.md) | Pre-deployment profiling process and configuration |
| [Planner Design](../../design-docs/planner-design.md) | Architecture deep-dive for contributors |
## Configuration Reference
### Key Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
| `--backend` | `vllm` | Backend framework (`vllm`, `sglang`, `trtllm`) |
| `--environment` | `kubernetes` | Deployment environment |
| `--adjustment-interval` | `180` | Seconds between scaling decisions |
| `--ttft` | `500.0` | Target Time To First Token (ms) |
| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
| `--isl` | `3000` | Expected average input sequence length |
| `--osl` | `150` | Expected average output sequence length |
| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
| `--min-endpoint` | `1` | Minimum replicas per worker type |
| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
| `--no-operation` | `false` | Observation mode (no actual scaling) |
| `--no-correction` | `false` | Disable correction factors |
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
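
To see how `--max-gpu-budget`, `--min-endpoint`, and the per-engine GPU counts interact, a candidate replica plan can be checked with a small helper. This is a hypothetical sketch of the arithmetic implied by the table, not the Planner's actual scaling logic; the function names are illustrative.

```python
def total_gpus(prefill_replicas, decode_replicas,
               prefill_engine_num_gpu=1, decode_engine_num_gpu=1):
    """GPUs consumed by a plan with the given replica counts."""
    return (prefill_replicas * prefill_engine_num_gpu
            + decode_replicas * decode_engine_num_gpu)


def fits_budget(prefill_replicas, decode_replicas,
                max_gpu_budget=8, min_endpoint=1, **engine_gpus):
    """Check a plan against the minimum replica count and the GPU budget."""
    if min(prefill_replicas, decode_replicas) < min_endpoint:
        return False
    return total_gpus(prefill_replicas, decode_replicas, **engine_gpus) <= max_gpu_budget


# With the defaults, 2 prefill + 6 decode workers use exactly 8 GPUs.
print(fits_budget(2, 6))
```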
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD K8s resource name |
| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for planner's own Prometheus metrics |
## Monitoring
### Grafana Dashboard
Deploy the planner dashboard:
```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```
The dashboard shows:
- Worker counts and GPU usage over time
- Observed TTFT, ITL, request rate, sequence lengths
- Predicted load and recommended replica counts
- Correction factors (actual vs. expected performance)
### Prometheus Metrics
The planner queries the frontend's `/metrics` endpoint via Prometheus. Required metrics:
- Request count and duration
- TTFT and ITL distributions
- Input/output sequence lengths
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Profiler
The Dynamo Profiler is an automated performance analysis tool that measures model inference characteristics to optimize deployment configurations. It determines optimal tensor parallelism (TP) settings for prefill and decode phases, generates performance interpolation data, and enables SLA-driven autoscaling through the Planner.
## Feature Matrix
| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|------|--------|--------------|
| Dense Model Profiling | ✅ | ✅ | ✅ |
| MoE Model Profiling | 🚧 | ✅ | 🚧 |
| AI Configurator (Offline) | ❌ | ❌ | ✅ |
| Online Profiling (AIPerf) | ✅ | ✅ | ✅ |
| Interactive WebUI | ✅ | ✅ | ✅ |
| Runtime Profiling Endpoints | ❌ | ✅ | ❌ |
## Quick Start
### Prerequisites
- Dynamo platform installed (see [Installation Guide](../../kubernetes/installation-guide.md))
- Kubernetes cluster with GPU nodes (for DGDR-based profiling)
- kube-prometheus-stack installed (required for SLA planner)
### Using DynamoGraphDeploymentRequest (Recommended)
The recommended way to profile models is through DGDRs, which automate the entire profiling and deployment workflow.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-model-profiling
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
    config:
      sla:
        isl: 3000      # Average input sequence length
        osl: 150       # Average output sequence length
        ttft: 200.0    # Target Time To First Token (ms)
        itl: 20.0      # Target Inter-Token Latency (ms)
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
  autoApply: true
```
```bash
kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE
```
### Using AI Configurator (Fast Offline Profiling)
For TensorRT-LLM, use AI Configurator for rapid profiling (~30 seconds):
```yaml
profilingConfig:
  config:
    sweep:
      useAiConfigurator: true
      aicSystem: h200_sxm
      aicHfId: Qwen/Qwen3-32B
      aicBackendVersion: "0.20.0"
```
### Direct Script Usage (Advanced)
For advanced scenarios, run the profiler directly:
```bash
python -m benchmarks.profiler.profile_sla \
--backend vllm \
--config path/to/disagg.yaml \
--model meta-llama/Llama-3-8B \
--ttft 200 --itl 15 \
--isl 3000 --osl 150
```
## Configuration
| Parameter | Default | Description |
|-----------|---------|-------------|
| `sla.isl` | - | Average input sequence length (tokens) |
| `sla.osl` | - | Average output sequence length (tokens) |
| `sla.ttft` | - | Target Time To First Token (milliseconds) |
| `sla.itl` | - | Target Inter-Token Latency (milliseconds) |
| `sweep.useAiConfigurator` | `false` | Use offline simulation instead of real profiling |
| `hardware.minNumGpusPerEngine` | auto | Minimum GPUs per engine (auto-detected from model size) |
| `hardware.maxNumGpusPerEngine` | 8 | Maximum GPUs per engine |
## Profiling Methods
| Method | Duration | Accuracy | GPU Required | Backends |
|--------|----------|----------|--------------|----------|
| Online (AIPerf) | 2-4 hours | Highest | Yes | All |
| Offline (AI Configurator) | 20-30 seconds | Estimated | No | TensorRT-LLM |
## Output
The profiler generates:
1. **Optimal Configuration**: Recommended TP sizes for prefill and decode engines
2. **Performance Data**: Interpolation models for the SLA Planner
3. **Generated DGD**: Complete deployment manifest with optimized settings
Example recommendations:
```text
Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
```
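If you want to consume these recommendations in a script, a regular expression is enough. The sketch below assumes the output lines follow exactly the format of the example above; the function name and the dictionary keys are illustrative.

```python
import re

# Matches lines such as:
#   Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
_RECOMMENDATION = re.compile(
    r"Suggested (?P<phase>prefill|decode) TP:(?P<tp>\d+) "
    r"\((?P<metric>TTFT|ITL) (?P<latency_ms>[\d.]+) ms, "
    r"throughput (?P<throughput>[\d.]+) tokens/s/GPU\)"
)


def parse_recommendations(text):
    """Return a dict keyed by phase ('prefill' / 'decode') with the parsed numbers."""
    results = {}
    for match in _RECOMMENDATION.finditer(text):
        results[match["phase"]] = {
            "tp": int(match["tp"]),
            "metric": match["metric"],
            "latency_ms": float(match["latency_ms"]),
            "throughput_tokens_per_s_per_gpu": float(match["throughput"]),
        }
    return results


example = (
    "Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)\n"
    "Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)\n"
)
print(parse_recommendations(example))
```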
## Next Steps
| Document | Description |
|----------|-------------|
| [Profiler Guide](profiler-guide.md) | Configuration, methods, and troubleshooting |
| [Profiler Examples](profiler-examples.md) | Complete DGDR YAMLs, WebUI, script examples |
| [SLA Planner Guide](../planner/planner-guide.md) | End-to-end deployment workflow |
| [SLA Planner Architecture](../planner/planner-guide.md) | How the Planner uses profiling data |
...@@ -40,14 +40,14 @@ To address the growing demands of distributed inference serving, NVIDIA introduc ...@@ -40,14 +40,14 @@ To address the growing demands of distributed inference serving, NVIDIA introduc
The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features: The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
- [Dynamo Disaggregated Serving](disagg-serving.md) - [Dynamo Disaggregated Serving](disagg-serving.md)
- [Dynamo Smart Router](../router/kv-cache-routing.md) - [Dynamo Smart Router](../components/router/README.md)
- [Dynamo KV Cache Block Manager](../kvbm/kvbm-intro.md) - [Dynamo KV Cache Block Manager](../components/kvbm/README.md)
- [Planner](../planner/planner-intro.md) - [Planner](../components/planner/README.md)
- [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md) - [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths. Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.
![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](../../assets/img/architecture.png "Dynamo Architecture") ![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](/assets/img/architecture.png "Dynamo Architecture")
Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if Dynamo detects an increase in requests with long input sequences, the Planner automatically scales up prefill workers to meet the heightened demand. Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if Dynamo detects an increase in requests with long input sequences, the Planner automatically scales up prefill workers to meet the heightened demand.
...@@ -61,7 +61,7 @@ Dynamo prioritizes seamless integration. Its modular design enables it to work h ...@@ -61,7 +61,7 @@ Dynamo prioritizes seamless integration. Its modular design enables it to work h
Disaggregating prefill and decode boosts performance, gaining efficiency when more GPUs are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization. Disaggregating prefill and decode boosts performance, gaining efficiency when more GPUs are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization.
![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](../../assets/img/disagg-perf-benefit.png) ![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](/assets/img/disagg-perf-benefit.png)
* Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL * Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL
...@@ -70,7 +70,7 @@ The disaggregation of prefill and decode phases offers valuable flexibility. Sin ...@@ -70,7 +70,7 @@ The disaggregation of prefill and decode phases offers valuable flexibility. Sin
### KV aware routing ### KV aware routing
![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](../../assets/img/kv-routing.png) ![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](/assets/img/kv-routing.png)
* Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL * Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL
...@@ -80,7 +80,7 @@ Existing routing methods, including load-based routing, overlook the specific pr ...@@ -80,7 +80,7 @@ Existing routing methods, including load-based routing, overlook the specific pr
### KV cache manager ### KV cache manager
The Dynamo KV Block Manager (KVBM) enables KV cache offloading to system CPU memory, local SSDs, and network-attached storage, allowing more KV blocks to be reused instead of recomputed. In many cases, KV transfer is faster than recomputation, so KVBM helps reduce time-to-first-token (TTFT). The following plot highlights the performance gains achieved through CPU memory offloading. In a scenario involving 20 multi-turn conversations with 15 users, KVBM with CPU memory offloading achieved a 2.2×–12× improvement in TTFT (depending on QPS), demonstrating benefits that extend beyond basic prefix caching. The Dynamo KV Block Manager (KVBM) enables KV cache offloading to system CPU memory, local SSDs, and network-attached storage, allowing more KV blocks to be reused instead of recomputed. In many cases, KV transfer is faster than recomputation, so KVBM helps reduce time-to-first-token (TTFT). The following plot highlights the performance gains achieved through CPU memory offloading. In a scenario involving 20 multi-turn conversations with 15 users, KVBM with CPU memory offloading achieved a 2.2×–12× improvement in TTFT (depending on QPS), demonstrating benefits that extend beyond basic prefix caching.
![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](../../assets/img/kvbm-agg-performance.png) ![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](/assets/img/kvbm-agg-performance.png)
* Tested with different QPS using Qwen3-8B on H100. Avg 20K ISL / 100 OSL. * Tested with different QPS using Qwen3-8B on H100. Avg 20K ISL / 100 OSL.