Commit 2c3066bd authored by dagil-nvidia, committed by GitHub

docs: full migration of docs/ to fern format in fern/ (#6050)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent d59b9d72
......@@ -9,7 +9,7 @@ Dynamo supports running Deepseek R1 with data parallel attention and wide expert
## Instructions
The following script can be adapted to run Deepseek R1 with a variety of different configuration. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [vLLM Backend](README.md) Getting Started section on each node, and then run these two commands.
The following script can be adapted to run Deepseek R1 with a variety of different configuration. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [ReadMe](README.md) Getting Started section on each node, and then run these two commands.
node 0
```bash
......
......@@ -80,7 +80,7 @@ python -m dynamo.frontend --router-mode kv &
# Start prefill worker
python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \
--model meta-llama/Llama-3.3-70B-Instruct
--tensor-parallel-size 8 \
--enforce-eager
```
......@@ -89,7 +89,7 @@ python -m dynamo.vllm \
```bash
# Start decode worker
python -m dynamo.vllm \
--model meta-llama/Llama-3.3-70B-Instruct \
--model meta-llama/Llama-3.3-70B-Instruct
--tensor-parallel-size 8 \
--enforce-eager \
--is-prefill-worker
......
......@@ -11,7 +11,7 @@ When running vLLM through Dynamo, vLLM engine metrics are automatically passed t
**For the complete and authoritative list of all vLLM metrics**, always refer to the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).
**For LMCache metrics and integration**, see the [LMCache Integration Guide](LMCache-Integration.md).
**For LMCache metrics and integration**, see the [LMCache Integration Guide](../../integrations/lmcache-integration.md).
**For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).
......@@ -133,10 +133,10 @@ curl -s localhost:8081/metrics | grep "^lmcache:"
Troubleshooting LMCache-related metrics and logs (including `PrometheusLogger instance already created with different metadata` and `PROMETHEUS_MULTIPROC_DIR` warnings) is documented in:
- [LMCache Integration Guide](LMCache-Integration.md#troubleshooting)
- [LMCache Integration Guide](../../integrations/lmcache-integration.md#troubleshooting)
**For complete LMCache configuration and metric details**, see:
- [LMCache Integration Guide](LMCache-Integration.md) - Setup and configuration
- [LMCache Integration Guide](../../integrations/lmcache-integration.md) - Setup and configuration
- [LMCache Observability Documentation](https://docs.lmcache.ai/production/observability/vllm_endpoint.html) - Complete metrics reference
## Implementation Details
......
......@@ -3,6 +3,7 @@
# SPDX-License-Identifier: Apache-2.0
---
# Dynamo Benchmarking Guide
This benchmarking framework lets you compare performance across any combination of:
......@@ -64,7 +65,7 @@ The framework is a Python-based wrapper around `aiperf` that:
---
## Client-Side Benchmarking (Local)
# Client-Side Benchmarking (Local)
Client-side benchmarking runs on your local machine and connects to Kubernetes deployments via port-forwarding.
......@@ -87,10 +88,10 @@ Client-side benchmarking runs on your local machine and connects to Kubernetes d
Follow these steps to benchmark Dynamo deployments using client-side benchmarking:
### Step 1: Establish Kubernetes Cluster and Install Dynamo
Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform. First follow the [installation guide](../kubernetes/installation-guide.md) to install Dynamo Kubernetes Platform, then use [deploy/utils/README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/utils/README.md) to set up benchmarking resources.
Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform. First follow the [installation guide](../kubernetes/installation-guide.md) to install Dynamo Kubernetes Platform, then use [deploy/utils/README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md) to set up benchmarking resources.
### Step 2: Deploy DynamoGraphDeployments
Deploy your DynamoGraphDeployments separately using the [deployment documentation](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/). Each deployment should have a frontend service exposed.
Deploy your DynamoGraphDeployments separately using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Each deployment should have a frontend service exposed.
### Step 3: Port-Forward and Benchmark Deployment A
```bash
......@@ -298,7 +299,7 @@ Each concurrency directory contains:
---
## Server-Side Benchmarking (In-Cluster)
# Server-Side Benchmarking (In-Cluster)
Server-side benchmarking runs directly within the Kubernetes cluster, eliminating the need for port forwarding and providing better resource utilization.
......@@ -316,17 +317,17 @@ The server-side benchmarking solution:
## Prerequisites
1. **Kubernetes cluster** with NVIDIA GPUs and Dynamo namespace setup (see [Dynamo Kubernetes Platform docs](../kubernetes/README.md))
2. **Storage** PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/utils/README.md))
2. **Storage** PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md))
3. **Docker image** containing the Dynamo benchmarking tools
## Quick Start
### Step 1: Deploy Your DynamoGraphDeployment
Deploy your DynamoGraphDeployment using the [deployment documentation](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/). Ensure it has a frontend service exposed.
Deploy your DynamoGraphDeployment using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Ensure it has a frontend service exposed.
### Step 2: Deploy and Run Benchmark Job
**Note**: The server-side benchmarking job requires a Docker image containing the Dynamo benchmarking tools. Before the 0.5.1 release, you must build your own Docker image using the [container build instructions](https://github.com/ai-dynamo/dynamo/tree/main/container/README.md), push it to your container registry, then update the `image` field in `benchmarks/incluster/benchmark_job.yaml` to use your built image tag.
**Note**: The server-side benchmarking job requires a Docker image containing the Dynamo benchmarking tools. Before the 0.5.1 release, you must build your own Docker image using the [container build instructions](https://github.com/ai-dynamo/dynamo/blob/main/container/README.md), push it to your container registry, then update the `image` field in `benchmarks/incluster/benchmark_job.yaml` to use your built image tag.
```bash
export NAMESPACE=benchmarking
......@@ -519,7 +520,7 @@ The Python benchmarking module provides a complete end-to-end benchmarking exper
## Testing with Mocker Backend
For development and testing purposes, Dynamo provides a [mocker backend](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/mocker/) that simulates LLM inference without requiring actual GPU resources. This is useful for:
For development and testing purposes, Dynamo provides a [mocker backend](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) that simulates LLM inference without requiring actual GPU resources. This is useful for:
- **Testing deployments** without expensive GPU infrastructure
- **Developing and debugging** router, planner, or frontend logic
......@@ -528,4 +529,4 @@ For development and testing purposes, Dynamo provides a [mocker backend](https:/
The mocker backend mimics the API and behavior of real backends (vLLM, SGLang, TensorRT-LLM) but generates mock responses instead of running actual inference.
See the [mocker directory](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/mocker/) for usage examples and configuration options.
See the [mocker directory](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) for usage examples and configuration options.
......@@ -3,8 +3,6 @@
# SPDX-License-Identifier: Apache-2.0
---
# Dynamo KV Smart Router A/B Benchmarking Guide
This guide walks you through setting up and running A/B benchmarks to compare Dynamo's KV Smart Router against standard round-robin routing on a Kubernetes cluster.
## Overview
......@@ -99,7 +97,7 @@ kubectl create secret generic hf-token-secret \
### Step 1.3: Install Dynamo Platform (Per-Namespace)
If your cluster uses namespace-restricted Dynamo operators, you'll need to install the Dynamo platform in each namespace. Follow the [Dynamo Kubernetes Installation Guide](https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/installation_guide.md) to install the platform in both namespaces:
If your cluster uses namespace-restricted Dynamo operators, you'll need to install the Dynamo platform in each namespace. Follow the [Dynamo Kubernetes Installation Guide](https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/installation-guide.md) to install the platform in both namespaces:
- `router-off-test`
- `router-on-test`
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# SLA-Driven Profiling with DynamoGraphDeploymentRequest
> [!TIP]
> **New to DGDR and SLA-Driven Profiling?** Start with the [SLA-Driven Profiling and Planner Deployment Quick Start Guide](../planner/sla-planner-quickstart.md) for step-by-step instructions. This document provides deeper technical details about the profiling process.
## Overview
Dynamo provides automated SLA-driven profiling through **DynamoGraphDeploymentRequests (DGDR)**. Instead of manually running profiling scripts, you declare your performance requirements and let the Dynamo Operator handle profiling and deployment automatically.
**Key Benefits:**
- **Declarative**: Specify SLAs, not implementation details
- **Automated**: No manual job setup or result processing
- **Integrated**: Seamlessly works with Dynamo Operator
- **Production-Ready**: Generates optimized configurations with SLA planner
This document covers:
- Technical details of online vs offline profiling
- Profiling process internals (GPU usage, measurements, interpolation)
- Direct script usage for advanced scenarios
- Comprehensive troubleshooting
## Support Matrix
| Backend | Dense Models | MoE Models |
|---------|-------------|------------|
| vLLM | ✅ | 🚧 |
| SGLang | ✅ | ✅ |
| TensorRT-LLM | ✅ | 🚧 |
Specifically, the profiler sweeps over the following parallelization mapping for prefill and decode:
| Model Architecture | Prefill Parallelization Mapping | Decode Parallelization Mapping |
|---------|-------------|------------|
| MLA+MoE (DeepseekV3ForCausalLM, DeepseekV32ForCausalLM) | TEP, DEP | TEP, DEP |
| GQA+MoE (Qwen3MoeForCausalLM) | TP, TEP, DEP | TP, TEP, DEP |
| Other Models | TP | TP |
> [!NOTE]
> - Exact support for each model × parallelization mapping combination depends on the backend. The profiler does not guarantee that the backend supports the recommended P/D engine configuration or runs it without bugs.
## Using DGDR for Profiling (Recommended)
The recommended way to profile models is through DGDRs. Sample configurations are provided in `deploy/`:
**Available Samples:**
- **`profile_sla_dgdr.yaml`**: Standard profiling with AIPerf on real engines
- **`profile_sla_aic_dgdr.yaml`**: Fast profiling with AI Configurator simulation
- **`profile_sla_moe_dgdr.yaml`**: MoE model profiling
The Dynamo Operator automatically:
1. Discovers GPU resources (cluster-scoped operators only)
2. Runs profiling (AIPerf on real engines or AI Configurator simulation)
3. Generates optimal DGD configuration with SLA planner
4. Deploys the DGD to your cluster
See the [Quick Start Guide](../planner/sla-planner-quickstart.md) for prerequisites and detailed instructions.
## Hardware Configuration
Hardware parameters have sensible defaults and are **optional** - you can override them if needed:
```yaml
profilingConfig:
  config:
    # Override hardware defaults if needed
    hardware:
      min_num_gpus_per_engine: 1
      max_num_gpus_per_engine: 8
      num_gpus_per_node: 8
    # Only needed when using AI Configurator (sweep.use_ai_configurator: true)
    sweep:
      aic_system: h200_sxm # GPU type for AI Configurator (h100_sxm, h200_sxm, etc.)
```
### Automatic GPU Discovery (Optional Feature)
Cluster-scoped operators can optionally enable automatic GPU discovery to detect hardware from cluster nodes. When enabled, hardware config is auto-detected and overrides any manually specified values.
```yaml
spec:
  enableGpuDiscovery: true
```
This feature is only available with cluster-scoped operators (`namespaceRestriction.enabled=false`) as it requires cluster-wide node access permissions. It is not available for namespace-restricted operators.
## Profiling Method
1. **Hardware Setup**: Uses defaults or user-specified hardware configuration. Optionally, cluster-scoped operators can enable automatic GPU discovery to detect specifications from cluster nodes.
2. **Identify Sweep Ranges**: Automatically determines the minimum and maximum number of GPUs per engine. The minimum is determined by the model size and GPU VRAM. The maximum is one node for dense models and 4 nodes for MoE models.
3. **Parallelization Mapping Sweep**: Using the input ISL and OSL, tests the performance of the engines with different parallelization mappings.
- For dense models, we test different TP sizes for both prefill and decode.
- For MoE models (SGLang), we evaluate both TEP and DEP as candidates for prefill and decode.
- **Prefill**:
- TP/TEP: We measure TTFT with batch size = 1 (assuming ISL is long enough to saturate compute) without KV reuse.
- DEP: Attention uses data parallelism. We send a single burst with total concurrency `attention_dp_size × attn_dp_num_req_ratio` (defaults to 4) and compute the reported TTFT as `time_to_first_token.max / attn_dp_num_req_ratio` from the AIPerf summary of that burst. This stabilizes measurements when the first batch may launch before all requests arrive.
![Prefill Performance](../../assets/img/h100-prefill-performance.png)
- **Decode**: Because the ITL (or iteration time) depends on how many requests are in flight, we measure the ITL under different numbers of in-flight requests, from 1 up to the maximum number of requests that the engine's KV cache can hold. To measure the ITL without interference from piggy-backed prefill requests, the script enables KV reuse and warms up the engine by issuing the same prompts before measuring. Because the KV cache is large enough for all the requests, it retains the KV cache of the pre-computed prompts, so the prefill phase is skipped during the ITL measurement. For MoE models, however, this is not guaranteed, because the KV cache differs across attention DP ranks. We are working on a framework-side change to fix this issue. For example, the plot below shows the decode parallelization mapping sweep results on H100 for deepseek-ai/DeepSeek-R1-Distill-Llama-8B.
![Decode Performance](../../assets/img/h100-decode-performance.png)
4. **Recommendation**: Selects the parallelization mapping for prefill and for decode that achieves the highest per-GPU throughput while meeting the SLA on TTFT and ITL. Specifically, the profiler chooses the point (or, for decode, a point on the curve) that lies to the left of the vertical red dashed line representing the SLA and has the highest y coordinate (throughput per GPU).
5. **In-Depth Profiling on the Recommended P/D Engine**: After finding the best TP size for prefill and decode, the script interpolates TTFT as a function of ISL, and ITL as a function of active KV cache usage and decode context length. This gives a more accurate performance estimate when ISL and OSL change, and the results are used by the SLA planner.
![ITL Interpolation](../../assets/img/pd-interpolation.png)
- **Prefill**: Measures TTFT and throughput per GPU across different input lengths with batch size=1.
- **Decode**: Measures ITL and throughput per GPU under various KV cache loads and decode context lengths. The active KV usage determines the complexity of the memory-bound attention kernel, while the active KV usage divided by the average context length determines the complexity of the compute-bound MLP kernel. For example, the ITL interpolation figure shows the ITL of the DS-Distilled Llama 8B model on H100 TP4. The ITL grows near-linearly with active KV usage under a fixed context length, and the slope increases as the context length decreases.
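The DEP prefill measurement and the recommendation rule described in the steps above can be sketched in a few lines of Python. This is a minimal illustration and not the profiler's actual code: the `Candidate` container, the function names, and the treatment of "left of the SLA line" as `latency <= SLA` are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """One measured sweep point (hypothetical container, not a profiler type)."""
    mapping: str          # e.g. "TP4" or "DEP8"
    latency_ms: float     # TTFT for prefill, ITL for decode
    thpt_per_gpu: float   # tokens/s/GPU


def dep_burst_concurrency(attention_dp_size: int, attn_dp_num_req_ratio: int = 4) -> int:
    """Total concurrency of the single burst sent for a DEP prefill measurement."""
    return attention_dp_size * attn_dp_num_req_ratio


def dep_prefill_ttft(max_ttft_ms: float, attn_dp_num_req_ratio: int = 4) -> float:
    """Reported TTFT: the slowest TTFT in the burst divided by the request ratio."""
    return max_ttft_ms / attn_dp_num_req_ratio


def pick_best(candidates: Sequence[Candidate], sla_ms: float) -> Optional[Candidate]:
    """Highest throughput per GPU among candidates that meet the latency SLA."""
    feasible = [c for c in candidates if c.latency_ms <= sla_ms]
    if not feasible:
        return None  # corresponds to "no configuration meets targets"
    return max(feasible, key=lambda c: c.thpt_per_gpu)
```

For decode, each mapping contributes a whole curve of points (one per in-flight request count), so every point on every curve would be passed in as a separate `Candidate`.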
To run the parallelization mapping sweep and the in-depth profiling on the recommended P/D engine, the profiler needs to know the engine's forward-pass time under different loads. There are two ways to obtain this: run AIPerf on real engines or use AI Configurator to run simulations.
### AIPerf on Real Engines
Profiles your model by creating real test deployments in Kubernetes and measuring their performance.
**Characteristics:**
- **Duration**: 2-4 hours
- **Accuracy**: Highest (real measurements)
- **GPU Requirements**: Full access to test different parallelization mappings
- **Backends**: vLLM, SGLang, TensorRT-LLM
**DGDR Configuration:**
```yaml
profilingConfig:
  config:
    sweep:
      use_ai_configurator: false # Default
```
### AI Configurator Simulation
Uses performance simulation to rapidly estimate optimal configurations without running real deployments.
**Characteristics:**
- **Duration**: 20-30 seconds
- **Accuracy**: Estimated (may have errors for unusual configurations)
- **GPU Requirements**: None
- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon)
**DGDR Configuration:**
```yaml
profilingConfig:
  config:
    sweep:
      use_ai_configurator: true
    aic:
      system: h200_sxm # GPU system type
      model_name: QWEN3_32B # AIC model identifier
      backend_version: "0.20.0"
```
**Supported Configurations:**
For the current list of supported models, systems, and backend versions, see the [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features).
To check from the command line: `aiconfigurator cli --help`
**Currently supports:**
- **Backends**: TensorRT-LLM (versions 0.20.0, 1.0.0rc3, 1.0.0rc6)
- **Systems**: H100 SXM, H200 SXM, B200 SXM, GB200 SXM, A100 SXM
- **Models**: Wide range including GPT, Llama, Mixtral, DeepSeek, Qwen, and more
### Output Format
After profiling, the DGDR status contains:
1. **Recommended Configuration**: Optimal TP for prefill and decode
2. **Performance Data**: Interpolation models for SLA planner
3. **Generated DGD**: Complete deployment manifest
**Example Recommendations:**
```
Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
```
#### Interactive Configuration Selection WebUI
When running the profiler with `--pick-with-webui`, an interactive web interface is launched that allows you to visually explore profiling results and manually select configurations.
**Features:**
- **Interactive Charts**: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables
- **Pareto-Optimal Analysis**: The GPU Hours table shows pareto-optimal configurations balancing latency and throughput
- **DGD Config Preview**: Click "Show Config" on any row to view the corresponding DynamoGraphDeployment YAML
- **GPU Cost Estimation**: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests)
- **SLA Visualization**: Red dashed lines indicate your TTFT and ITL targets
**Selection Methods:**
1. **GPU Hours Table** (recommended): Click any row to select both prefill and decode configurations at once based on the pareto-optimal combination
2. **Individual Selection**: Click one row in the Prefill table AND one row in the Decode table to manually choose each
**Example DGD Config Output:**
When you click "Show Config", you'll see a DynamoGraphDeployment configuration like:
```yaml
# DynamoGraphDeployment Configuration
# Prefill: 1 GPU(s), TP=1
# Decode: 4 GPU(s), TP=4
# Model: Qwen/Qwen3-32B-FP8
# Backend: trtllm
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
  services:
    PrefillWorker:
      subComponentType: prefill
      replicas: 1
      extraPodSpec:
        mainContainer:
          args:
            - --tensor-parallel-size=1
    DecodeWorker:
      subComponentType: decode
      replicas: 1
      extraPodSpec:
        mainContainer:
          args:
            - --tensor-parallel-size=4
```
**Usage:**
```bash
python -m benchmarks.profiler.profile_sla \
--backend trtllm \
--config path/to/disagg.yaml \
--pick-with-webui \
--use-ai-configurator \
--model Qwen/Qwen3-32B-FP8 \
--aic-system h200_sxm \
--ttft 200 --itl 15
```
Once you have selected a configuration, the full DynamoGraphDeployment CRD will be saved in your output folder as `config_with_planner.yaml`.
The WebUI launches on port 8000 by default (configurable with `--webui-port`).
#### Output Performance Plots
The profiler will generate the following plots to better visualize the performance data:
**Parallelization Mapping Sweep Plots:**
- `prefill_performance.png`: TTFT vs Parallelization Mapping size
- `decode_performance.png`: ITL vs Parallelization Mapping size and in-flight requests
Note these two plots are based on the input ISL and OSL.
**In-Depth Profiling for the Recommended P/D Engine Plots:**
- `selected_prefill_interpolation/prefill_ttft_interpolation.png`: TTFT vs ISL for the recommended prefill engine
- `selected_prefill_interpolation/prefill_throughput_interpolation.png`: Throughput vs ISL for the recommended prefill engine
- `selected_decode_interpolation/decode_itl_interplation.png`: ITL vs KV usage and context length for the recommended decode engine
- `selected_decode_interpolation/decode_throughput_interpolation.png`: Throughput vs KV usage and context length for the recommended decode engine
### Output Interpolation Data
The profiler generates `.npz` files to store the performance data for the recommended P/D engine:
**Prefill Interpolation** (`selected_prefill_interpolation/raw_data.npz`):
- `prefill_isl`: 1D array of input sequence lengths tested
- `prefill_ttft`: 1D array of TTFTs (ms) at each ISL
- `prefill_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each ISL
**Decode Interpolation** (`selected_decode_interpolation/raw_data.npz`):
- `max_kv_tokens`: Total KV tokens capacity in decode engine
- `x_kv_usage`: 1D array of active KV usage fractions in [0, 1]
- `y_context_length`: 1D array of average context lengths tested
- `z_itl`: 1D array of ITLs (ms) at each (KV usage, context length) point
- `z_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each point
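As a rough illustration of how the prefill arrays could be consumed, the sketch below loads `prefill_isl` and `prefill_ttft` from the `.npz` file and estimates TTFT at an unmeasured ISL with piecewise-linear interpolation. The helper names are hypothetical, and linear interpolation is an assumption chosen for simplicity; the SLA planner may fit the data differently.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple


def interp_linear(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Piecewise-linear interpolation over strictly increasing xs.

    Values outside the measured range are clamped to the nearest end point.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("xs and ys must be non-empty and the same length")
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    hi = bisect_right(xs, x)  # first index whose x value is greater than x
    lo = hi - 1
    frac = (x - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + frac * (ys[hi] - ys[lo])


def load_prefill_profile(path: str) -> Tuple[List[float], List[float]]:
    """Read (ISL, TTFT) pairs from raw_data.npz, sorted by ISL."""
    import numpy as np  # third-party; imported lazily so interp_linear stays stdlib-only

    with np.load(path) as data:
        pairs = sorted(zip(data["prefill_isl"].tolist(), data["prefill_ttft"].tolist()))
    return [p[0] for p in pairs], [p[1] for p in pairs]
```

A typical call would be `isl, ttft = load_prefill_profile("selected_prefill_interpolation/raw_data.npz")` followed by `interp_linear(isl, ttft, 3000)`.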
## DGDR Configuration Reference
This section provides detailed explanations of all DGDR `profilingConfig` options. The DGDR controller passes this configuration to the profiler script, which is defined in `benchmarks/profiler/utils/profiler_argparse.py`.
### Configuration Structure
All profiler configuration goes under `spec.profilingConfig.config`:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-deployment
spec:
  model: "Qwen/Qwen3-0.6B" # High-level: model to deploy
  backend: vllm # High-level: inference backend
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" # Required
    configMapRef: # Optional: base DGD config
      name: my-config
      key: disagg.yaml
    config: # Profiler configuration
      sla: { ... }
      hardware: { ... }
      sweep: { ... }
      aic: { ... }
      planner: { ... }
  deploymentOverrides: # Optional
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
```
### SLA Configuration (Required)
Define your performance requirements and workload characteristics:
```yaml
profilingConfig:
  config:
    sla:
      isl: 3000 # Average input sequence length (tokens)
      osl: 150 # Average output sequence length (tokens)
      ttft: 200.0 # Target Time To First Token (milliseconds)
      itl: 20.0 # Target Inter-Token Latency (milliseconds)
```
**What these control:**
- **ISL/OSL**: Based on your expected traffic patterns
- **TTFT**: First token latency target (lower = more GPUs needed, affects prefill engine)
- **ITL**: Token generation latency target (lower = more GPUs needed, affects decode engine)
- **Trade-offs**: Tighter SLAs require more GPU resources
### Hardware Configuration (Optional)
Control GPU search space and constraints:
```yaml
profilingConfig:
  config:
    hardware:
      min_num_gpus_per_engine: 2 # if not provided, will automatically determine based on model and VRAM size
      max_num_gpus_per_engine: 8 # Maximum GPUs to test
      num_gpus_per_node: 8 # GPUs per node (for multi-node MoE)
      gpu_type: h200_sxm # GPU type hint
```
**When to use:**
- **min_num_gpus_per_engine**: Skip small TP sizes if your model is large
- **max_num_gpus_per_engine**: Limit search space or work around constraints (e.g., [AIC attention heads](#ai-configurator-attention-head-constraint-error))
- **num_gpus_per_node**: Determine the upper bound of number of GPUs per node for dense models and configure Grove for multi-node MoE engines.
- **gpu_type**: Informational, auto-detected by controller
> [!TIP]
> If you don't specify hardware constraints, the controller auto-detects based on your model size and available cluster resources.
### Sweep Configuration (Optional)
Control profiling behavior:
```yaml
profilingConfig:
  config:
    sweep:
      use_ai_configurator: false # Use offline profiling (default: false)
      prefill_interpolation_granularity: 16 # Samples for prefill TTFT curve
      decode_interpolation_granularity: 6 # Samples for decode ITL curve
```
**Use cases:**
- **use_ai_configurator**: Set to `true` for 20-30 second profiling (TensorRT-LLM only)
- **prefill_interpolation_granularity**: How many samples to benchmark for prefill TTFT curve (lower = faster but may be less accurate)
- **decode_interpolation_granularity**: How many samples to benchmark for decode ITL curve (lower = faster but may be less accurate). Since ITL interpolation is a 3d plot and takes longer to run, we default to a smaller number of samples. Increasing this value might quadratically increase the profiling time.
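The quadratic growth mentioned above follows from the decode interpolation covering two axes (KV usage and context length). The sketch below counts benchmark points under the assumption that decode samples a full grid of `granularity × granularity` points; that grid shape is an assumption for illustration, and the real profiler may skip some combinations, so treat the decode number as an upper bound.

```python
def estimated_benchmark_points(prefill_granularity: int = 16,
                               decode_granularity: int = 6) -> dict:
    """Rough count of benchmark points for the in-depth profiling step.

    Prefill is sampled along one axis (ISL). Decode is assumed to be sampled
    on a 2D grid (KV usage x context length), hence the square.
    """
    return {
        "prefill": prefill_granularity,
        "decode_upper_bound": decode_granularity ** 2,
    }
```

Under this assumption, doubling `decode_interpolation_granularity` from 6 to 12 raises the decode upper bound from 36 to 144 points.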
### AI Configurator Configuration (Required if `use_ai_configurator: true`)
Configure AI Configurator profiling mode:
```yaml
profilingConfig:
  config:
    sweep:
      use_ai_configurator: true
      aic_system: h200_sxm # GPU system: h100_sxm, h200_sxm, b200_sxm, gb200_sxm, a100_sxm
      aic_hf_id: Qwen/Qwen3-32B # Huggingface model id
      aic_backend_version: "0.20.0" # TensorRT-LLM version: 0.20.0, 1.0.0rc3
```
**Supported configurations:** See [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features)
### Planner Configuration (Optional)
Pass arguments to the SLA planner:
```yaml
profilingConfig:
  config:
    planner:
      planner_min_endpoint: 2 # Minimum endpoints to maintain
      planner_adjustment_interval: 60 # Adjustment interval (seconds)
      planner_load_predictor: linear # Load prediction method
```
> [!NOTE]
> Planner arguments use `planner_` prefix. See planner documentation for full list.
### Engine Configuration (Auto-configured)
The controller automatically sets these from high-level fields:
```yaml
# You specify:
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
# Controller auto-injects into config:
profilingConfig:
  config:
    deployment:
      model: "Qwen/Qwen3-0.6B" # From spec.model
    engine:
      backend: vllm # From spec.backend
      config: /path/to/configmap # From spec.profilingConfig.configMapRef (if provided)
```
**You should not manually set** `deployment.model` or `engine.backend` in `profilingConfig.config` - they are automatically injected from the high-level fields.
### Complete Example: AIPerf on Real Engines
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: vllm-dense-online
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200.0
        itl: 20.0
      hardware:
        min_num_gpus_per_engine: 1
        max_num_gpus_per_engine: 8
      sweep:
        use_ai_configurator: false
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
  autoApply: true
```
### Complete Example: AI Configurator Simulation
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: trtllm-aic-offline
spec:
  model: "Qwen/Qwen3-32B"
  backend: trtllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1"
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300.0
        itl: 10.0
      sweep:
        use_ai_configurator: true
      aic:
        system: h200_sxm
        model_name: QWEN3_32B
        backend_version: "0.20.0"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1"
  autoApply: true
```
### Complete Example: MoE Model
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sglang-moe
spec:
  model: "deepseek-ai/DeepSeek-R1"
  backend: sglang
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
    config:
      sla:
        isl: 2048
        osl: 512
        ttft: 300.0
        itl: 25.0
      hardware:
        num_gpus_per_node: 8
        max_num_gpus_per_engine: 32
      engine:
        is_moe_model: true # Enable MoE profiling mode
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
  autoApply: true
```
## Troubleshooting
### Profiling Takes Too Long
**Solution 1**: Use AI Configurator for rapid profiling (TensorRT-LLM only):
```yaml
sweep:
  use_ai_configurator: true
```
**Solution 2**: Reduce search space:
```yaml
config:
  sweep:
    min_num_gpus: 4 # Skip TP1, TP2
    max_num_gpus: 8 # Don't test beyond TP8
```
### SLA Cannot Be Met
**Symptoms**: Profiler reports no configuration meets targets
**Solutions:**
1. Relax SLA targets (increase TTFT/ITL)
2. Add more GPU resources
3. Try a different backend
4. Use a smaller model
### AI Configurator: Attention Head Constraint Error
**Symptoms**: Profiling fails with error:
```
AssertionError: num_heads <N> should be divisible by tp_size <M> and the division result should be >= 4
```
**Cause**: AI Configurator requires **≥4 attention heads per GPU**. Small models with few heads cannot use high TP sizes.
**Affected Models:**
- **Qwen3-0.6B** (16 heads): Max TP = 4 ❌ Fails at TP=8
- **GPT-2** (12 heads): Max TP = 3
- Most models **\<1B parameters**: May hit this constraint
**Solution**: Limit `max_num_gpus_per_engine` in your DGDR:
```yaml
profilingConfig:
  profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1"
  config:
    hardware:
      max_num_gpus_per_engine: 4 # For Qwen3-0.6B (16 heads / 4 = max TP of 4)
    sweep:
      use_ai_configurator: true
    aic:
      system: h200_sxm
      model_name: QWEN3_0_6B
```
**Calculate Max TP**: `max_tp = num_attention_heads / 4`
> **Note**: This is an AI Configurator limitation. Online profiling doesn't have this constraint.
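The constraint in the assertion message (heads divisible by the TP size, and at least 4 heads per GPU) can be checked before submitting a DGDR. The helper below is a hypothetical sketch, not part of AI Configurator or the profiler; it enumerates every integer TP size up to `max_gpus`, whereas the profiler may only sweep a subset of sizes.

```python
from typing import List


def valid_tp_sizes(num_attention_heads: int, max_gpus: int,
                   min_heads_per_gpu: int = 4) -> List[int]:
    """TP sizes that divide the head count and leave enough heads per GPU."""
    return [
        tp
        for tp in range(1, max_gpus + 1)
        if num_attention_heads % tp == 0
        and num_attention_heads // tp >= min_heads_per_gpu
    ]


# Head counts taken from the "Affected Models" list above.
print(valid_tp_sizes(16, 8))  # Qwen3-0.6B
print(valid_tp_sizes(12, 8))  # GPT-2
```

The largest value in the returned list is a safe setting for `max_num_gpus_per_engine` when `use_ai_configurator` is enabled.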
### Image Pull Errors
**Symptoms**: `ErrImagePull` or `ImagePullBackOff`
**Solution**: Ensure image pull secrets are configured:
```bash
kubectl create secret docker-registry nvcr-imagepullsecret \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password=<NGC_API_KEY> \
--namespace <your-namespace>
```
### Out of Memory During Profiling
**Symptoms**: OOM errors in profiling jobs
**Solutions:**
1. Reduce `gpu_memory_utilization` in engine config
2. Reduce `--max-context-length`
3. Skip larger TP configurations
4. Use fewer GPUs per test
### Unsupported Parallelization Mapping in Backend
**Symptoms**: Startup or runtime error in the backend. For example, a prime number of attention heads restricts the TP size to 1 (e.g., falcon-7b with 71 attention heads), or a backend does not support different TP sizes for prefill and decode.
**Solutions:**
1. Ask the backend maintainers to add support for the use case, then bump the backend version in Dynamo.
2. Restrict the minimum and maximum number of GPUs per engine to the supported range.
## Next Steps
- **Deploy with DGDR**: See [Quick Start Guide](../planner/sla-planner-quickstart.md)
- **Understand SLA Planner**: Read [SLA Planner Deep Dive](../planner/sla-planner.md)
- **Monitor Deployments**: Set up [Observability](../kubernetes/observability/metrics.md)
- **Optimize Performance**: See [Performance Tuning](../performance/tuning.md)
## Related Documentation
- [DGDR API Reference](../kubernetes/api-reference.md)
- [SLA Planner Quick Start](../planner/sla-planner-quickstart.md)
- [SLA Planner Architecture](../planner/sla-planner.md)
- [Profiler Arguments Reference](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/profiler/utils/profiler_argparse.py)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Frontend
The Dynamo Frontend is the API gateway for serving LLM inference requests. It provides OpenAI-compatible HTTP endpoints and KServe gRPC endpoints, handling request preprocessing, routing, and response formatting.
## Feature Matrix
| Feature | Status |
|---------|--------|
| OpenAI Chat Completions API | ✅ Supported |
| OpenAI Completions API | ✅ Supported |
| KServe gRPC v2 API | ✅ Supported |
| Streaming responses | ✅ Supported |
| Multi-model serving | ✅ Supported |
| Integrated routing | ✅ Supported |
| Tool calling | ✅ Supported |
## Quick Start
### Prerequisites
- Dynamo platform installed
- `etcd` and `nats-server -js` running
- At least one backend worker registered
### HTTP Frontend
```bash
python -m dynamo.frontend --http-port 8000
```
This starts an OpenAI-compatible HTTP server with integrated preprocessing and routing. Backends are auto-discovered when they call `register_llm`.
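Once the frontend is up and a backend has registered, any OpenAI-style client can talk to it. Below is a minimal standard-library sketch. It assumes the `/v1/chat/completions` path and the `Qwen/Qwen3-0.6B` model name used in other examples in these docs; the helper names `build_chat_request` and `send_chat_request` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 30) -> urllib.request.Request:
    """Build (but do not send) a non-streaming chat completion request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request, timeout: float = 60.0) -> dict:
    """Send the request and return the decoded JSON body.

    Needs a running frontend with at least one registered backend.
    """
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Assumed address: the frontend started with --http-port 8000 on this host.
request = build_chat_request("http://localhost:8000", "Qwen/Qwen3-0.6B", "Hello!")
```

Passing `request` to `send_chat_request` returns the parsed response. Building the request separately makes the payload easy to inspect without a live server.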
### KServe gRPC Frontend
```bash
python -m dynamo.frontend --kserve-grpc-server
```
See the [Frontend Guide](frontend-guide.md) for KServe-specific configuration and message formats.
### Kubernetes
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: frontend-example
spec:
  graphs:
    - name: frontend
      replicas: 1
      services:
        - name: Frontend
          image: nvcr.io/nvidia/dynamo/dynamo-vllm:latest
          command:
            - python
            - -m
            - dynamo.frontend
            - --http-port
            - "8000"
```
## Configuration
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--http-port` | 8000 | HTTP server port |
| `--kserve-grpc-server` | false | Enable KServe gRPC server |
| `--router-mode` | `round_robin` | Routing strategy: `round_robin`, `random`, `kv` |
See the [Frontend Guide](frontend-guide.md) for full configuration options.
## Next Steps
| Document | Description |
|----------|-------------|
| [Frontend Guide](frontend-guide.md) | KServe gRPC configuration and integration |
| [Router Documentation](../router/README.md) | KV-aware routing configuration |
......@@ -3,11 +3,15 @@
# SPDX-License-Identifier: Apache-2.0
---
# KServe gRPC frontend
# Frontend Guide
## Motivation
This guide covers the KServe gRPC frontend configuration and integration for the Dynamo Frontend.
[KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry standard protocol for machine learning model inference. Triton inference server is one of the inference solutions that comply with KServe v2 API and it has gained a lot of adoption. To quickly enable Triton users to explore with Dynamo benefits, Dynamo provides a KServe gRPC frontend.
## KServe gRPC Frontend
### Motivation
[KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry-standard protocols for machine learning model inference. Triton Inference Server is one of the inference solutions that comply with the KServe v2 API, and it has gained wide adoption. To let Triton users quickly explore the benefits of Dynamo, Dynamo provides a KServe gRPC frontend.
This documentation assumes readers are familiar with the KServe v2 API. It focuses on the Dynamo components that work together to support the KServe API and on how users can migrate an existing KServe deployment to Dynamo.
......@@ -20,8 +24,9 @@ This documentation assumes readers are familiar with the usage of KServe v2 API
## Starting the Frontend
To start the KServe frontend, run the below command
```
To start the KServe frontend, run the below command:
```bash
python -m dynamo.frontend --kserve-grpc-server
```
......@@ -45,54 +50,58 @@ python -m dynamo.frontend --kserve-grpc-server
If these variables are not set, the server uses tonic's default values.
> **Note**: Tune these values based on your workload. Connection window should accommodate `concurrent_requests × request_size`. Memory overhead equals the connection window size (shared across all streams). See [gRPC performance best practices](https://grpc.io/docs/guides/performance/) and [gRPC channel arguments](https://grpc.github.io/grpc/core/group__grpc__arg__keys.html) for more details.
<Note>
Tune these values based on your workload. Connection window should accommodate `concurrent_requests x request_size`. Memory overhead equals the connection window size (shared across all streams). See [gRPC performance best practices](https://grpc.io/docs/guides/performance/) and [gRPC channel arguments](https://grpc.github.io/grpc/core/group__grpc__arg__keys.html) for more details.
</Note>
## Registering a Backend
As with the HTTP frontend, a registered backend is auto-discovered and added to the frontend's list of served models. To register a backend, use the same `register_llm()` API. Currently the frontend supports serving the following combinations of model type and model input:
* `ModelType::Completions` and `ModelInput::Text`: Combination for LLM backend that uses custom preprocessor
* `ModelType::Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. Dynamo vLLM / SGLang / TRTLLM backend)
* `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor based inference
* `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor-based inference
The first two combinations are backed by the OpenAI Completions API; see the [OpenAI Completions section](#openai-completions) for more detail. The last combination is the most closely aligned with the KServe API, and users can replace an existing deployment with Dynamo once their backends implement an adaptor for `NvCreateTensorRequest/NvCreateTensorResponse`; see the [Tensor section](#tensor) for more detail.
### OpenAI Completions
Most of the Dynamo features are tailored for LLM inference and the combinations that are backed by OpenAI API can enable those features and are best suited for exploring those Dynamo features. However, this implies specific conversion between generic tensor based messages and OpenAI message and imposes specific structure of the KServe request message.
Most Dynamo features are tailored for LLM inference. The combinations backed by the OpenAI API enable those features and are best suited for exploring them. However, this requires a specific conversion between generic tensor-based messages and OpenAI messages, and it imposes a specific structure on the KServe request message.
#### Model Metadata / Config
The metadata and config endpoints report the registered backend with the fields below. Note that this is not the exact response.
```
```json
{
name: $MODEL_NAME,
version: 1,
platform: "dynamo",
backend: "dynamo", # model config specific
inputs: [
"name": "$MODEL_NAME",
"version": 1,
"platform": "dynamo",
"backend": "dynamo",
"inputs": [
{
name: "text_input",
datatype: "BYTES",
shape: [1]
"name": "text_input",
"datatype": "BYTES",
"shape": [1]
},
{
name: "streaming",
datatype: "BOOL",
shape: [1],
optional: true
"name": "streaming",
"datatype": "BOOL",
"shape": [1],
"optional": true
}
]
outputs: [
],
"outputs": [
{
name: "text_output",
datatype: "BYTES",
shape: [-1]
"name": "text_output",
"datatype": "BYTES",
"shape": [-1]
},
{
name: "finish_reason",
datatype: "BYTES",
shape: [-1],
optional: true
"name": "finish_reason",
"datatype": "BYTES",
"shape": [-1],
"optional": true
}
]
}
......@@ -101,26 +110,57 @@ The metadata and config endpoint will report the registered backend to have the
#### Inference
On receiving an inference request, the following conversion is performed:
* `text_input`: the element is expected to contain the user prompt string and will be converted to `prompt` field in OpenAI Completion request
* `streaming`: the element will be converted to `stream` field in OpenAI Completion request
On receiving the model response, the following conversion is performed:
* `text_output`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `text` of the choice.
* `finish_reason`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `finish_reason` of the choice.
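The two conversions described above can be sketched with plain dictionaries. This is an illustration only, not the frontend's implementation: the real code works on KServe protobuf messages, and the function names and the dict layout (tensor name mapped to a list of elements) are assumptions made for this example.

```python
def kserve_to_completion_request(model: str, inputs: dict) -> dict:
    """Map KServe input tensors to an OpenAI Completions request body.

    `inputs` maps a tensor name to its list of elements (hypothetical layout).
    """
    prompt = inputs["text_input"][0]
    if isinstance(prompt, bytes):  # BYTES tensors carry raw bytes
        prompt = prompt.decode("utf-8")
    request = {"model": model, "prompt": prompt}
    if "streaming" in inputs:  # optional BOOL tensor
        request["stream"] = bool(inputs["streaming"][0])
    return request


def completion_response_to_kserve(response: dict) -> dict:
    """Map an OpenAI Completions response to KServe output tensors.

    Each choice contributes one element to `text_output` and, when at least
    one choice reports it, one element to `finish_reason`.
    """
    choices = response["choices"]
    outputs = {"text_output": [choice["text"] for choice in choices]}
    reasons = [choice.get("finish_reason") for choice in choices]
    if any(reason is not None for reason in reasons):
        outputs["finish_reason"] = [reason or "" for reason in reasons]
    return outputs


request = kserve_to_completion_request(
    "my-model", {"text_input": [b"Hello"], "streaming": [False]}
)
outputs = completion_response_to_kserve(
    {"choices": [{"text": " world", "finish_reason": "stop"}]}
)
```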
### Tensor
This combination is used when the user is migrating an existing KServe based backend into Dynamo ecosystem.
This combination is used when the user is migrating an existing KServe-based backend into Dynamo ecosystem.
#### Model Metadata / Config
When registering the backend, the backend must provide the model's metadata as tensor based deployment is generic and the frontend can't make any assumptions like for OpenAI Completions model. There are two methods to provide model metadata:
* [TensorModelConfig](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs): This is Dynamo defined structure for model metadata, the backend can provide the model metadata as shown in this [example](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/tests/test_tensor.py). For metadata provided in such way, the following field will be set to a fixed value: `version: 1`, `platform: "dynamo"`, `backend: "dynamo"`. Note that for model config endpoint, the rest of the fields will be set to their default values.
* [triton_model_config](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs): For users that already have Triton model config and require the full config to be returned for client side logic, they can set the config in `TensorModelConfig::triton_model_config` which will supersedes other fields in `TensorModelConfig` and be used for endpoint responses. `triton_model_config` is expected to be the serialized string of the `ModelConfig` protobuf message, see [echo_tensor_worker.py](https://github.com/ai-dynamo/dynamo/blob/main/tests/frontend/grpc/echo_tensor_worker.py) for example.
When registering the backend, the backend must provide the model's metadata, because a tensor-based deployment is generic and the frontend cannot make the assumptions it makes for an OpenAI Completions model. There are two methods to provide model metadata:
* [TensorModelConfig](https://github.com/ai-dynamo/dynamo/tree/main/lib/llm/src/protocols/tensor.rs): This is a Dynamo-defined structure for model metadata. The backend can provide the model metadata as shown in this [example](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/tests/test_tensor.py). For metadata provided this way, the following fields are set to fixed values: `version: 1`, `platform: "dynamo"`, `backend: "dynamo"`. Note that for the model config endpoint, the rest of the fields are set to their default values.
* [triton_model_config](https://github.com/ai-dynamo/dynamo/tree/main/lib/llm/src/protocols/tensor.rs): Users who already have a Triton model config and need the full config returned for client-side logic can set it in `TensorModelConfig::triton_model_config`, which supersedes the other fields in `TensorModelConfig` and is used for endpoint responses. `triton_model_config` is expected to be the serialized string of the `ModelConfig` protobuf message; see [echo_tensor_worker.py](https://github.com/ai-dynamo/dynamo/blob/main/tests/frontend/grpc/echo_tensor_worker.py) for an example.
#### Inference
When receiving inference request, the backend will receive [NvCreateTensorRequest](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs) and be expected to return [NvCreateTensorResponse](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs), which are the mapping of ModelInferRequest / ModelInferResponse protobuf message in Dynamo.
When receiving an inference request, the backend receives an [NvCreateTensorRequest](https://github.com/ai-dynamo/dynamo/tree/main/lib/llm/src/protocols/tensor.rs) and is expected to return an [NvCreateTensorResponse](https://github.com/ai-dynamo/dynamo/tree/main/lib/llm/src/protocols/tensor.rs). These are the Dynamo mappings of the `ModelInferRequest` / `ModelInferResponse` protobuf messages.
## Python Bindings
The frontend may be started via Python binding, this is useful when integrating Dynamo in existing system that desire the frontend to be run in the same process with other components. See [server.py](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/kserve_grpc_service/server.py) for example.
The frontend can also be started via the Python bindings. This is useful when integrating Dynamo into an existing system that needs the frontend to run in the same process as other components. See [server.py](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/examples/kserve_grpc_service/server.py) for an example.
## Integration
### With Router
The frontend includes an integrated router for request distribution. Configure routing mode:
```bash
python -m dynamo.frontend --router-mode kv --http-port 8000
```
See [Router Documentation](../router/README.md) for routing configuration details.
### With Backends
Backends auto-register with the frontend when they call `register_llm()`. Supported backends:
- [vLLM Backend](../../backends/vllm/README.md)
- [SGLang Backend](../../backends/sglang/README.md)
- [TensorRT-LLM Backend](../../backends/trtllm/README.md)
## See Also
| Document | Description |
|----------|-------------|
| [Frontend Overview](README.md) | Quick start and feature matrix |
| [Router Documentation](../router/README.md) | KV-aware routing configuration |
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# KV Block Manager (KVBM)
The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM and TensorRT-LLM.
KVBM offers:
- A **unified memory API** spanning GPU memory, pinned host memory, remote RDMA-accessible memory, local/distributed SSDs, and remote file/object/cloud storage systems
- Support for **block lifecycles** (allocate → register → match) with event-based state transitions
- Integration with **[NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)**, a dynamic memory exchange layer for remote registration, sharing, and access of memory blocks
> **Get started:** See the [KVBM Guide](kvbm-guide.md) for installation and deployment instructions.
## When to Use KV Cache Offloading
KV Cache offloading avoids expensive KV Cache recomputation, resulting in faster response times and better user experience. Providers benefit from higher throughput and lower cost per token, making inference services more scalable and efficient.
Offloading KV cache to CPU or storage is most effective when KV Cache exceeds GPU memory and cache reuse outweighs the overhead of transferring data. It is especially valuable in:
| Scenario | Benefit |
|----------|---------|
| **Long sessions and multi-turn conversations** | Preserves large prompt prefixes, avoids recomputation, improves first-token latency and throughput |
| **High concurrency** | Idle or partial conversations can be moved out of GPU memory, allowing active requests to proceed without hitting memory limits |
| **Shared or repeated content** | Reuse across users or sessions (system prompts, templates) increases cache hits, especially with remote or cross-instance sharing |
| **Memory- or cost-constrained deployments** | Offloading to RAM or SSD reduces GPU demand, allowing longer prompts or more users without adding hardware |
## Feature Support Matrix
| | Feature | Support |
|--|---------|---------|
| **Backend** | Local | ✅ |
| | Kubernetes | ✅ |
| **LLM Framework** | vLLM | ✅ |
| | TensorRT-LLM | ✅ |
| | SGLang | ❌ |
| **Serving Type** | Aggregated | ✅ |
| | Disaggregated | ✅ |
## Architecture
![KVBM Architecture](/assets/img/kvbm-architecture.png)
*High-level layered architecture view of Dynamo KV Block Manager and how it interfaces with different components of the LLM inference ecosystem*
KVBM has three primary logical layers:
**LLM Inference Runtime Layer** — The top layer includes inference runtimes (TensorRT-LLM, vLLM) that integrate through dedicated connector modules to the Dynamo KVBM. These connectors act as translation layers, mapping runtime-specific operations and events into KVBM's block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and memory tiering.
**KVBM Logic Layer** — The middle layer encapsulates core KV block manager logic and serves as the runtime substrate for managing block memory. The KVBM adapter normalizes representations and data layout for incoming requests across runtimes and forwards them to the core memory manager. This layer implements table lookups, memory allocation, block layout management, lifecycle state transitions, and block reuse/eviction policies.
**NIXL Layer** — The bottom layer provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, RDMA and NVLink remote memory sharing, dynamic block registration and metadata exchange, and provides a plugin interface for storage backends including block memory (GPU HBM, Host DRAM, Remote DRAM, Local SSD), local/remote filesystems, object stores, and cloud storage.
> **Learn more:** See the [KVBM Design Document](../../design-docs/kvbm-design.md) for detailed architecture, components, and data flows.
## Next Steps
- **[KVBM Guide](kvbm-guide.md)** — Installation, configuration, and deployment instructions
- **[KVBM Design](../../design-docs/kvbm-design.md)** — Architecture deep dive, components, and data flows
- **[LMCache Integration](../../integrations/lmcache-integration.md)** — Use LMCache with Dynamo vLLM backend
- **[FlexKV Integration](../../integrations/flexkv-integration.md)** — Use FlexKV for KV cache management
- **[SGLang HiCache](../../integrations/sglang-hicache.md)** — Enable SGLang's hierarchical cache with NIXL
- **[NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** — NIXL communication library details
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# KVBM Guide
The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM and TensorRT-LLM.
KVBM is modular and can be used standalone via `pip install kvbm` or as the memory management component in the full Dynamo stack. This guide covers installation, configuration, and deployment of the Dynamo KV Block Manager (KVBM) and other KV cache management systems.
## Table of Contents
- [Quick Start](#quick-start)
- [Run KVBM Standalone](#run-kvbm-standalone)
- [Run KVBM in Dynamo with vLLM](#run-kvbm-in-dynamo-with-vllm)
- [Run KVBM in Dynamo with TensorRT-LLM](#run-kvbm-in-dynamo-with-tensorrt-llm)
- [Run Dynamo with SGLang HiCache](#run-dynamo-with-sglang-hicache)
- [Disaggregated Serving with KVBM](#disaggregated-serving-with-kvbm)
- [Configuration](#configuration)
- [Enable and View KVBM Metrics](#enable-and-view-kvbm-metrics)
- [Benchmarking KVBM](#benchmarking-kvbm)
- [Troubleshooting](#troubleshooting)
- [Developing Locally](#developing-locally)
## Quick Start
## Run KVBM Standalone
KVBM can be used independently without using the rest of the Dynamo stack:
```bash
pip install kvbm
```
See the [support matrix](../../reference/support-matrix.md) for version compatibility.
### Build from Source
To build KVBM from source, see the detailed instructions in the [KVBM bindings README](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/kvbm/README.md#build-from-source).
## Run KVBM in Dynamo with vLLM
### Docker Setup
```bash
# Start up etcd for KVBM leader/worker registration and discovery
docker compose -f deploy/docker-compose.yml up -d
# Build a dynamo vLLM container (KVBM is built in by default)
./container/build.sh --framework vllm
# Launch the container
./container/run.sh --framework vllm -it --mount-workspace --use-nixl-gds
```
### Aggregated Serving
```bash
cd $DYNAMO_HOME/examples/backends/vllm
./launch/agg_kvbm.sh
```
#### Verify Deployment
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"stream": false,
"max_tokens": 10
}'
```
#### Alternative: Using Direct vllm serve
You can also use `vllm serve` directly with KVBM:
```bash
vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "kvbm.vllm_integration.connector"}' Qwen/Qwen3-0.6B
```
## Run KVBM in Dynamo with TensorRT-LLM
> [!NOTE]
> **Prerequisites:**
> - Ensure `etcd` and `nats` are running before starting
> - KVBM only supports TensorRT-LLM's PyTorch backend
> - Disable partial reuse (`enable_partial_reuse: false`) to increase offloading cache hits
> - KVBM requires TensorRT-LLM v1.2.0rc2 or newer
### Docker Setup
```bash
# Start up etcd for KVBM leader/worker registration and discovery
docker compose -f deploy/docker-compose.yml up -d
# Build a dynamo TRTLLM container (KVBM is built in by default)
./container/build.sh --framework trtllm
# Launch the container
./container/run.sh --framework trtllm -it --mount-workspace --use-nixl-gds
```
### Aggregated Serving
```bash
# Write the LLM API config
cat > "/tmp/kvbm_llm_api_config.yaml" <<EOF
backend: pytorch
cuda_graph_config: null
kv_cache_config:
  enable_partial_reuse: false
  free_gpu_memory_fraction: 0.80
kv_connector_config:
  connector_module: kvbm.trtllm_integration.connector
  connector_scheduler_class: DynamoKVBMConnectorLeader
  connector_worker_class: DynamoKVBMConnectorWorker
EOF
# Start dynamo frontend
python3 -m dynamo.frontend --http-port 8000 &
# Serve the model with KVBM
python3 -m dynamo.trtllm \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--extra-engine-args /tmp/kvbm_llm_api_config.yaml &
```
#### Verify Deployment
```bash
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello, how are you?"}],
"stream": false,
"max_tokens": 30
}'
```
#### Alternative: Using trtllm-serve
```bash
trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml
```
## Run Dynamo with SGLang HiCache
SGLang's Hierarchical Cache (HiCache) extends KV cache storage beyond GPU memory to include host CPU memory. When using NIXL as the storage backend, HiCache integrates with Dynamo's memory infrastructure.
### Quick Start
```bash
# Start SGLang worker with HiCache enabled
python -m dynamo.sglang \
--model-path Qwen/Qwen3-0.6B \
--host 0.0.0.0 --port 8000 \
--enable-hierarchical-cache \
--hicache-ratio 2 \
--hicache-write-policy write_through \
--hicache-storage-backend nixl
# In a separate terminal, start the frontend
python -m dynamo.frontend --http-port 8000
# Send a test request
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false,
"max_tokens": 30
}'
```
> **Learn more:** See the [SGLang HiCache Integration Guide](../../integrations/sglang-hicache.md) for detailed configuration, deployment examples, and troubleshooting.
## Disaggregated Serving with KVBM
KVBM supports disaggregated serving where prefill and decode operations run on separate workers. KVBM is enabled on the prefill worker to offload KV cache.
### Disaggregated Serving with vLLM
```bash
# 1P1D - one prefill worker and one decode worker
# NOTE: requires at least 2 GPUs
cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm.sh
# 2P2D - two prefill workers and two decode workers
# NOTE: requires at least 4 GPUs
cd $DYNAMO_HOME/examples/backends/vllm
./launch/disagg_kvbm_2p2d.sh
```
### Disaggregated Serving with TRT-LLM
```bash
# Launch prefill worker with KVBM
python3 -m dynamo.trtllm \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--extra-engine-args /tmp/kvbm_llm_api_config.yaml \
--disaggregation-mode prefill &
```
## Configuration
### Cache Tier Configuration
Configure KVBM cache tiers using environment variables:
```bash
# Option 1: CPU cache only (GPU -> CPU offloading)
export DYN_KVBM_CPU_CACHE_GB=4 # 4GB of pinned CPU memory
# Option 2: Both CPU and Disk cache (GPU -> CPU -> Disk tiered offloading)
export DYN_KVBM_CPU_CACHE_GB=4
export DYN_KVBM_DISK_CACHE_GB=8 # 8GB of disk
# [Experimental] Option 3: Disk cache only (GPU -> Disk direct offloading)
# NOTE: Experimental, may not provide optimal performance
# NOTE: Disk offload filtering not supported with this option
export DYN_KVBM_DISK_CACHE_GB=8
```
You can also specify exact block counts instead of GB:
- `DYN_KVBM_CPU_CACHE_OVERRIDE_NUM_BLOCKS`
- `DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS`
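To sanity-check which offload path a given environment selects, the three options above can be expressed as a small lookup. This is a sketch under stated assumptions: it treats an unset, empty, or zero value as "tier disabled", it ignores the `*_OVERRIDE_NUM_BLOCKS` variables, and the function name is hypothetical. How KVBM itself parses these variables is not shown here.

```python
import os
from typing import Mapping, Optional


def describe_kvbm_tiers(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the offload path implied by the KVBM cache-size variables.

    Assumption for illustration: unset, empty, or 0 means the tier is off.
    """
    env = os.environ if env is None else env
    cpu_gb = float(env.get("DYN_KVBM_CPU_CACHE_GB") or 0)
    disk_gb = float(env.get("DYN_KVBM_DISK_CACHE_GB") or 0)

    if cpu_gb > 0 and disk_gb > 0:
        return "GPU -> CPU -> Disk"          # Option 2
    if cpu_gb > 0:
        return "GPU -> CPU"                  # Option 1
    if disk_gb > 0:
        return "GPU -> Disk (experimental)"  # Option 3
    return "no offload tier configured"


print(describe_kvbm_tiers({"DYN_KVBM_CPU_CACHE_GB": "4"}))  # GPU -> CPU
```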
### SSD Lifespan Protection
When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy only offloads KV blocks from CPU to disk if the blocks have frequency ≥ 2. Frequency doubles on cache hit (initialized at 1) and decrements by 1 on each time decay step.
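The frequency policy described above can be modeled in a few lines. The class below is a sketch, not KVBM's implementation: the class and method names are hypothetical, and two details are assumptions made here (the first observation of a block sets its frequency to 1, and entries that decay to 0 are dropped).

```python
class OffloadFrequencyFilter:
    """Toy model of the CPU-to-disk offload filter (frequency >= 2)."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self._frequency: dict[str, int] = {}

    def record_use(self, block_hash: str) -> None:
        """First sighting initializes to 1; every later cache hit doubles."""
        if block_hash in self._frequency:
            self._frequency[block_hash] *= 2
        else:
            self._frequency[block_hash] = 1

    def decay(self) -> None:
        """One time-decay step: subtract 1 from every tracked block."""
        for block_hash in list(self._frequency):
            self._frequency[block_hash] -= 1
            if self._frequency[block_hash] <= 0:
                del self._frequency[block_hash]  # assumption: forget cold blocks

    def should_offload_to_disk(self, block_hash: str) -> bool:
        return self._frequency.get(block_hash, 0) >= self.threshold


block_filter = OffloadFrequencyFilter()
block_filter.record_use("block-a")                    # frequency 1: stays in CPU only
first_seen = block_filter.should_offload_to_disk("block-a")
block_filter.record_use("block-a")                    # cache hit: frequency 2
after_hit = block_filter.should_offload_to_disk("block-a")
```

Under this model a block that is written once and never reused is never written to disk, which is the point of the filter: SSD writes are spent only on blocks that have already shown reuse.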
To disable disk offload filtering:
```bash
export DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER=true
```
## Enable and View KVBM Metrics
### Setup Monitoring Stack
```bash
# Start basic services (etcd & natsd), along with Prometheus and Grafana
docker compose -f deploy/docker-observability.yml up -d
```
### Enable Metrics for vLLM
```bash
DYN_KVBM_METRICS=true \
DYN_KVBM_CPU_CACHE_GB=20 \
python -m dynamo.vllm \
--model Qwen/Qwen3-0.6B \
--enforce-eager \
--connector kvbm
```
### Enable Metrics for TensorRT-LLM
```bash
DYN_KVBM_METRICS=true \
DYN_KVBM_CPU_CACHE_GB=20 \
python3 -m dynamo.trtllm \
--model-path Qwen/Qwen3-0.6B \
--served-model-name Qwen/Qwen3-0.6B \
--extra-engine-args /tmp/kvbm_llm_api_config.yaml &
```
### Firewall Configuration (Optional)
```bash
# If firewall blocks KVBM metrics ports
sudo ufw allow 6880/tcp
```
### View Metrics
Access Grafana at http://localhost:3000 (default login: `dynamo`/`dynamo`) and look for the **KVBM Dashboard**.
### Available Metrics
| Metric | Description |
|--------|-------------|
| `kvbm_matched_tokens` | Number of matched tokens |
| `kvbm_offload_blocks_d2h` | Offload blocks from device to host |
| `kvbm_offload_blocks_h2d` | Offload blocks from host to disk |
| `kvbm_offload_blocks_d2d` | Offload blocks from device to disk (bypassing host) |
| `kvbm_onboard_blocks_d2d` | Onboard blocks from disk to device |
| `kvbm_onboard_blocks_h2d` | Onboard blocks from host to device |
| `kvbm_host_cache_hit_rate` | Host cache hit rate (0.0-1.0) |
| `kvbm_disk_cache_hit_rate` | Disk cache hit rate (0.0-1.0) |
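Outside Grafana, the same metrics can be read from Prometheus text output. The parser below is a minimal sketch: it assumes one `name value` sample per line without trailing timestamps, and the sample text and its numbers are made up for illustration, not real measurements.

```python
def parse_kvbm_metrics(exposition_text: str) -> dict[str, float]:
    """Return {series: value} for every sample whose name starts with kvbm_.

    Assumes Prometheus text format without trailing timestamps.
    """
    samples: dict[str, float] = {}
    for raw_line in exposition_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):  # skip blanks and HELP/TYPE comments
            continue
        series, _, value = line.rpartition(" ")
        if series.startswith("kvbm_"):
            samples[series] = float(value)
    return samples


# Hypothetical scrape output; values are illustrative only.
SAMPLE = """\
# HELP kvbm_matched_tokens Number of matched tokens
kvbm_matched_tokens 1024
kvbm_onboard_blocks_h2d 12
kvbm_host_cache_hit_rate 0.75
some_other_metric 3
"""

metrics = parse_kvbm_metrics(SAMPLE)
```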
## Benchmarking KVBM
Use [LMBenchmark](https://github.com/LMCache/LMBenchmark) to evaluate KVBM performance.
### Setup
```bash
git clone https://github.com/LMCache/LMBenchmark.git
cd LMBenchmark/synthetic-multi-round-qa
```
### Run Benchmark
```bash
# Synthetic multi-turn chat dataset
# Arguments: model, endpoint, output prefix, qps
./long_input_short_output_run.sh \
"Qwen/Qwen3-0.6B" \
"http://localhost:8000" \
"benchmark_kvbm" \
1
```
Average TTFT and other performance numbers will be in the output.
> **TIP:** If metrics are enabled, observe KV offloading and onboarding in the Grafana dashboard.
### Baseline Comparison
#### vLLM Baseline (without KVBM)
```bash
vllm serve Qwen/Qwen3-0.6B
```
#### TensorRT-LLM Baseline (without KVBM)
```bash
# Create config without kv_connector_config
cat > "/tmp/llm_api_config.yaml" <<EOF
backend: pytorch
cuda_graph_config: null
kv_cache_config:
  enable_partial_reuse: false
  free_gpu_memory_fraction: 0.80
EOF
trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/llm_api_config.yaml
```
## Troubleshooting
### No TTFT Performance Gain
**Symptom:** Enabling KVBM does not show TTFT improvement or causes performance degradation.
**Cause:** Not enough prefix cache hits on KVBM to reuse offloaded KV blocks.
**Solution:** Enable KVBM metrics and check the Grafana dashboard for `Onboard Blocks - Host to Device` and `Onboard Blocks - Disk to Device`. Large numbers of onboarded KV blocks indicate good cache reuse:
![Grafana Example](/assets/img/kvbm-metrics-grafana.png)
### KVBM Worker Initialization Timeout
**Symptom:** KVBM fails to start when allocating large memory or disk storage.
**Solution:** Increase the leader-worker initialization timeout (default: 1800 seconds):
```bash
export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600 # 1 hour
```
### Disk Offload Fails to Start
**Symptom:** KVBM fails to start when disk offloading is enabled.
**Cause:** `fallocate()` is not supported on the filesystem (e.g., Lustre, certain network filesystems).
**Solution:** Enable disk zerofill fallback:
```bash
export DYN_KVBM_DISK_ZEROFILL_FALLBACK=true
```
If you encounter "write all error" or EINVAL (errno 22), also try:
```bash
export DYN_KVBM_DISK_DISABLE_O_DIRECT=true
```
## Developing Locally
Inside the Dynamo container, after changing KVBM-related code (Rust and/or Python):
```bash
cd /workspace/lib/bindings/kvbm
uv pip install maturin[patchelf]
maturin build --release --out /workspace/dist
uv pip install --upgrade --force-reinstall --no-deps /workspace/dist/kvbm*.whl
```
## See Also
- [KVBM Overview](README.md) for a quick overview of KV Caching, KVBM and its architecture
- [KVBM Design](../../design-docs/kvbm-design.md) for a deep dive into KVBM architecture
- [LMCache Integration](../../integrations/lmcache-integration.md)
- [FlexKV Integration](../../integrations/flexkv-integration.md)
- [SGLang HiCache](../../integrations/sglang-hicache.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Planner
The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](planner-guide.md) for a complete workflow including profiling and deployment.
## Feature Matrix
| Category | Feature | Status |
|----------|---------|--------|
| **Backend** | Local (bare metal) | Deprecated |
| | Kubernetes | Supported |
| **LLM Framework** | vLLM | Supported |
| | TensorRT-LLM | Supported |
| | SGLang | Supported |
| **Serving Type** | Aggregated | Unsupported |
| | Disaggregated | Supported |
| **Scaling Mode** | SLA-based (TTFT/ITL targets) | Supported (primary) |
| | Load-based (KV cache/queue thresholds) | Deprecated |
| **Load Predictors** | ARIMA | Supported |
| | Prophet | Supported |
| | Kalman filter | Supported |
| | Constant (current = next) | Supported |
| **Connectors** | KubernetesConnector (native DGD scaling) | Supported |
| | VirtualConnector (external environments) | Supported |
## Quick Start
### Prerequisites
- Dynamo platform installed on Kubernetes ([Installation Guide](../../kubernetes/installation-guide.md))
- kube-prometheus-stack installed ([Metrics Setup](../../kubernetes/observability/metrics.md))
- Pre-deployment profiling completed ([Profiling Guide](../profiler/profiler-guide.md))
### Deploy with DGDR (Recommended)
The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest:
```bash
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```
This automatically profiles your model and deploys with the SLA planner. See [SLA Planner Guide](planner-guide.md) for the full workflow.
### Deploy with DGD (Manual)
For manual control, use the disaggregated planner templates:
```bash
# After profiling is complete
kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
```
## Documentation
| Document | Description |
|----------|-------------|
| [Planner Guide](planner-guide.md) | Deployment, configuration, integration, troubleshooting |
| [Planner Examples](planner-examples.md) | DGDR YAML examples, sample configurations, advanced patterns |
| [SLA Planner Guide](planner-guide.md) | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
| [SLA-based Planner](planner-guide.md) | Scaling algorithm, correction factors, load prediction details |
| [Load-based Planner](README.md) | Legacy load-based scaling (deprecated) |
| [SLA-Driven Profiling](../profiler/profiler-guide.md) | Pre-deployment profiling process and configuration |
| [Planner Design](../../design-docs/planner-design.md) | Architecture deep-dive for contributors |
## Configuration Reference
### Key Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
| `--backend` | `vllm` | Backend framework (`vllm`, `sglang`, `trtllm`) |
| `--environment` | `kubernetes` | Deployment environment |
| `--adjustment-interval` | `180` | Seconds between scaling decisions |
| `--ttft` | `500.0` | Target Time To First Token (ms) |
| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
| `--isl` | `3000` | Expected average input sequence length |
| `--osl` | `150` | Expected average output sequence length |
| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
| `--min-endpoint` | `1` | Minimum replicas per worker type |
| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
| `--no-operation` | `false` | Observation mode (no actual scaling) |
| `--no-correction` | `false` | Disable correction factors |
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
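
The GPU-related arguments interact: a replica layout consumes the replica count of each worker type multiplied by its per-engine GPU count. The following is a minimal sketch of that arithmetic, not the planner's actual implementation. The function names are hypothetical, and how the planner reacts to an over-budget proposal is not covered here.

```python
def total_gpus(prefill_replicas: int, decode_replicas: int,
               prefill_engine_num_gpu: int = 1,
               decode_engine_num_gpu: int = 1) -> int:
    """GPUs consumed by a given replica layout."""
    return (prefill_replicas * prefill_engine_num_gpu
            + decode_replicas * decode_engine_num_gpu)


def fits_budget(prefill_replicas: int, decode_replicas: int,
                prefill_engine_num_gpu: int = 1,
                decode_engine_num_gpu: int = 1,
                max_gpu_budget: int = 8,
                min_endpoint: int = 1) -> bool:
    """Check a proposed layout against --min-endpoint and --max-gpu-budget."""
    if prefill_replicas < min_endpoint or decode_replicas < min_endpoint:
        return False
    used = total_gpus(prefill_replicas, decode_replicas,
                      prefill_engine_num_gpu, decode_engine_num_gpu)
    return used <= max_gpu_budget


# Example: 2 prefill engines on 2 GPUs each plus 4 single-GPU decode engines
print(total_gpus(2, 4, prefill_engine_num_gpu=2))   # 8
print(fits_budget(2, 4, prefill_engine_num_gpu=2))  # True
print(fits_budget(3, 4, prefill_engine_num_gpu=2))  # False (10 GPUs > 8)
```

With the defaults in the table above (one GPU per engine, a budget of 8), any layout with at most 8 replicas in total and at least one replica of each type passes this check.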
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD K8s resource name |
| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for planner's own Prometheus metrics |
## Monitoring
### Grafana Dashboard
Deploy the planner dashboard:
```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```
The dashboard shows:
- Worker counts and GPU usage over time
- Observed TTFT, ITL, request rate, sequence lengths
- Predicted load and recommended replica counts
- Correction factors (actual vs. expected performance)
### Prometheus Metrics
The planner queries the frontend's `/metrics` endpoint via Prometheus. Required metrics:
- Request count and duration
- TTFT and ITL distributions
- Input/output sequence lengths
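
To check that the Prometheus endpoint is reachable and returns data, an instant query can be issued against the standard Prometheus HTTP API (`/api/v1/query`). The sketch below is illustrative only: the helper names are hypothetical, and `up` (Prometheus's built-in scrape-health series) is used as a placeholder because the exact Dynamo metric names are not listed in this section.

```python
import json
import os
import urllib.parse
import urllib.request
from typing import Optional

# Default taken from the Environment Variables table above.
DEFAULT_ENDPOINT = (
    "http://prometheus-kube-prometheus-prometheus"
    ".monitoring.svc.cluster.local:9090"
)


def build_query_url(promql: str, endpoint: Optional[str] = None) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    base = endpoint or os.environ.get("PROMETHEUS_ENDPOINT", DEFAULT_ENDPOINT)
    params = urllib.parse.urlencode({"query": promql})
    return f"{base.rstrip('/')}/api/v1/query?{params}"


def run_query(promql: str, endpoint: Optional[str] = None,
              timeout: float = 5.0) -> list:
    """Run the query and return the raw result list (needs network access)."""
    url = build_query_url(promql, endpoint)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload}")
    return payload["data"]["result"]


# Placeholder query; replace "up" with the metric you want to inspect.
print(build_query_url("up", "http://localhost:9090"))
# http://localhost:9090/api/v1/query?query=up
```

An empty result list from `run_query` for a frontend metric would suggest that Prometheus is reachable but is not scraping the frontend.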
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Planner Examples
Practical examples for deploying the SLA Planner with different configurations. For deployment concepts, see the [Planner Guide](planner-guide.md). For a quick overview, see the [Planner README](README.md).
## Basic Examples
### Minimal DGDR with AIC (Fastest)
The simplest way to deploy with the SLA planner. Uses AI Configurator for offline profiling (20-30 seconds instead of hours):
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-aic
spec:
  model: Qwen/Qwen3-32B
  backend: vllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm
        aicHfId: Qwen/Qwen3-32B
        aicBackendVersion: "0.20.0"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
  autoApply: true
```
Deploy:
```bash
export NAMESPACE=your-namespace
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
```
### Online Profiling (Real Measurements)
Standard online profiling runs real GPU measurements for more accurate results. Takes 2-4 hours:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-online
spec:
  model: meta-llama/Llama-3.3-70B-Instruct
  backend: vllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
      sweep:
        useAiConfigurator: false
        prefillInterpolationGranularity: 16
        decodeInterpolationGranularity: 6
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
  autoApply: true
```
Deploy:
```bash
kubectl apply -f benchmarks/profiler/deploy/profile_sla_dgdr.yaml -n $NAMESPACE
```
Available sample DGDRs in `benchmarks/profiler/deploy/`:
- **`profile_sla_dgdr.yaml`**: Standard online profiling for dense models
- **`profile_sla_aic_dgdr.yaml`**: Fast offline profiling using AI Configurator
- **`profile_sla_moe_dgdr.yaml`**: Online profiling for MoE models (SGLang)
> **Profiling Config Cases**: Before 0.8.1, fields under `profilingConfig.config` use snake_case. From 0.8.1 onward, they use camelCase. snake_case keys are still accepted for backwards compatibility, but the example DGDRs use camelCase.
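
When migrating an older DGDR by hand, the key renaming is mechanical. The following is a hypothetical helper, not part of Dynamo, that sketches the snake_case to camelCase conversion for a nested config. It only changes key casing and assumes the structure of the config is otherwise unchanged.

```python
import re


def snake_to_camel(key: str) -> str:
    """Convert a snake_case key to camelCase; camelCase keys pass through."""
    return re.sub(r"_([a-z0-9])", lambda m: m.group(1).upper(), key)


def normalize_keys(value):
    """Recursively rename dict keys from snake_case to camelCase."""
    if isinstance(value, dict):
        return {snake_to_camel(k): normalize_keys(v) for k, v in value.items()}
    if isinstance(value, list):
        return [normalize_keys(v) for v in value]
    return value


legacy = {"sweep": {"use_ai_configurator": True, "aic_system": "h200_sxm"}}
print(normalize_keys(legacy))
# {'sweep': {'useAiConfigurator': True, 'aicSystem': 'h200_sxm'}}
```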
## Kubernetes Examples
### MoE Models (SGLang)
For Mixture-of-Experts models like DeepSeek-R1, use the SGLang backend:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: sla-moe
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
        useAiConfigurator: false
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
  autoApply: true
```
Deploy:
```bash
kubectl apply -f benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml -n $NAMESPACE
```
### Using Existing DGD Configs (Custom Setups)
Reference an existing DynamoGraphDeployment config via ConfigMap:
**Step 1: Create ConfigMap from your DGD config:**
```bash
kubectl create configmap deepseek-r1-config \
--from-file=disagg.yaml=/path/to/your/disagg.yaml \
--namespace $NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -
```
**Step 2: Reference it in your DGDR:**
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: deepseek-r1
spec:
  model: deepseek-ai/DeepSeek-R1
  backend: sglang
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
    configMapRef:
      name: deepseek-r1-config
      key: disagg.yaml # Must match the key used in --from-file
    config:
      sla:
        isl: 4000
        osl: 500
        ttft: 300
        itl: 10
      sweep:
        useAiConfigurator: true
        aicSystem: h200_sxm
        aicHfId: deepseek-ai/DeepSeek-V3
        aicBackendVersion: "0.20.0"
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
  autoApply: true
```
The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` and `spec.backend` into the final configuration.
### Inline Configuration (Simple Use Cases)
For simple use cases without a custom DGD config, provide profiler configuration directly. The profiler auto-generates a basic DGD configuration:
```yaml
profilingConfig:
  config:
    sla:
      isl: 8000
      osl: 200
      ttft: 200.0
      itl: 10.0
    hardware:
      minNumGpusPerEngine: 2
      maxNumGpusPerEngine: 8
      gpuType: h200_sxm
    sweep:
      prefillInterpolationGranularity: 16
      decodeInterpolationGranularity: 6
```
### Mocker Deployment (Testing)
Deploy a mocker backend that simulates GPU timing behavior without real GPUs. Useful for:
- Large-scale experiments without GPU resources
- Testing planner behavior and infrastructure
- Validating deployment configurations
```yaml
spec:
  model: <model-name>
  backend: trtllm # Real backend for profiling
  useMocker: true # Deploy mocker instead of real backend
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/dynamo/trtllm-runtime:<image-tag>"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
      sweep:
        useAiConfigurator: true
        aicSystem: h100_sxm
  autoApply: true
```
Profiling runs against the real backend (via GPUs or AIC). The mocker deployment then uses profiling data to simulate realistic timing.
### Model Cache PVC (0.8.1+)
For large models, use a pre-populated PVC instead of downloading from HuggingFace.
See [SLA-Driven Profiling](../profiler/profiler-guide.md) for configuration details.
## Advanced Examples
### Custom Load Predictors
#### Warm-starting with Trace Data
Pre-load predictors with historical request patterns before live traffic:
```yaml
# In planner arguments (each list item is passed as a single argv entry,
# so flag and value are joined with "=")
args:
  - --load-predictor=arima
  - --load-predictor-warmup-trace=/data/trace.jsonl
  - --load-predictor-log1p
```
The trace file should be in mooncake-style JSONL format with request-count, ISL, and OSL samples.
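
To sanity-check a trace before using it for warm-up, it can help to bucket the records by adjustment interval and look at the per-interval request count, mean ISL, and mean OSL. The sketch below is not the planner's loader. It assumes the commonly used mooncake field names `timestamp` (milliseconds), `input_length`, and `output_length`, so adjust these if your trace differs.

```python
import json
from collections import defaultdict


def summarize_trace(lines, interval_s: float = 180.0):
    """Bucket trace records by interval and return a time-ordered list of
    (request_count, mean_isl, mean_osl) tuples."""
    buckets = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        # Assumed field names: timestamp (ms), input_length, output_length
        idx = int(rec["timestamp"] / 1000.0 // interval_s)
        buckets[idx].append((rec["input_length"], rec["output_length"]))
    summary = []
    for idx in sorted(buckets):
        reqs = buckets[idx]
        n = len(reqs)
        mean_isl = sum(r[0] for r in reqs) / n
        mean_osl = sum(r[1] for r in reqs) / n
        summary.append((n, mean_isl, mean_osl))
    return summary


sample = [
    '{"timestamp": 0, "input_length": 3000, "output_length": 100}',
    '{"timestamp": 60000, "input_length": 1000, "output_length": 200}',
    '{"timestamp": 200000, "input_length": 2000, "output_length": 150}',
]
print(summarize_trace(sample))
# [(2, 2000.0, 150.0), (1, 2000.0, 150.0)]
```

For a real file, pass an open file handle instead of the `sample` list, for example `summarize_trace(open("/data/trace.jsonl"))`.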
#### Kalman Filter Tuning
For workloads with rapid changes, tune the Kalman filter:
```yaml
args:
  - --load-predictor=kalman
  - --kalman-q-level=2.0 # Higher = more responsive to level changes
  - --kalman-q-trend=0.5 # Higher = trend changes faster
  - --kalman-r=5.0 # Lower = trusts new measurements more
  - --kalman-min-points=3 # Fewer points before forecasting starts
  - --load-predictor-log1p # Often helps with request-rate series
```
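
To build intuition for what these knobs do, here is a minimal local linear trend Kalman filter in plain Python. It is a sketch for illustration, not the planner's implementation: the class name, the default values, the initial covariance, and the fallback to the last sample before `min_points` is reached are all assumptions.

```python
class LocalLinearTrendKalman:
    """Minimal local linear trend Kalman filter (state = [level, trend])."""

    def __init__(self, q_level=1.0, q_trend=0.1, r=10.0, min_points=5):
        self.q_level = q_level    # process noise on the level
        self.q_trend = q_trend    # process noise on the trend
        self.r = r                # measurement noise
        self.min_points = min_points
        self.n = 0
        self.last = None
        self.level = 0.0
        self.trend = 0.0
        # State covariance; the trend starts out highly uncertain.
        self.p = [[r, 0.0], [0.0, 1e3]]

    def observe(self, z):
        self.n += 1
        self.last = z
        if self.n == 1:
            self.level = z  # initialise the level at the first sample
            return
        # Predict: level advances by trend, covariance grows by Q.
        level = self.level + self.trend
        (p00, p01), (p10, p11) = self.p
        a00 = p00 + p01 + p10 + p11 + self.q_level
        a01 = p01 + p11
        a10 = p10 + p11
        a11 = p11 + self.q_trend
        # Update: blend the prediction with the new measurement.
        s = a00 + self.r
        k0, k1 = a00 / s, a10 / s
        resid = z - level
        self.level = level + k0 * resid
        self.trend = self.trend + k1 * resid
        self.p = [[(1 - k0) * a00, (1 - k0) * a01],
                  [a10 - k1 * a00, a11 - k1 * a01]]

    def forecast(self):
        """One-step-ahead forecast; falls back to the last sample early on."""
        if self.n < self.min_points:
            return self.last
        return self.level + self.trend


kf = LocalLinearTrendKalman(q_level=2.0, q_trend=0.5, r=5.0, min_points=3)
for t in range(50):
    kf.observe(10 + 2 * t)  # noiseless ramp: 10, 12, 14, ...
print(round(kf.forecast(), 2))  # close to 110, the next value on the ramp
```

Larger `q_level` and `q_trend` values inflate the predicted covariance, which raises the gains `k0` and `k1` so that new samples move the state more. A larger `r` has the opposite effect.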
#### Prophet for Seasonal Workloads
For workloads with daily/weekly patterns:
```yaml
args:
  - --load-predictor=prophet
  - --prophet-window-size=100 # Larger window for seasonal detection
  - --load-predictor-log1p
```
### Virtual Connector
For non-Kubernetes environments, use the VirtualConnector to communicate scaling decisions:
```python
from dynamo._core import DistributedRuntime, VirtualConnectorClient


# Wrapper coroutine (name is illustrative) so that `await` is valid
async def run_scaling_loop(distributed_runtime: DistributedRuntime, namespace: str) -> None:
    # Initialize client
    client = VirtualConnectorClient(distributed_runtime, namespace)

    # Main loop: watch for planner decisions and execute them
    while True:
        # Block until the planner makes a new scaling decision
        await client.wait()

        # Read the decision
        decision = await client.get()
        print(f"Scale to: prefill={decision.num_prefill_workers}, "
              f"decode={decision.num_decode_workers}, "
              f"id={decision.decision_id}")

        # Execute scaling in your environment
        # (scale_prefill_workers / scale_decode_workers are placeholders you implement)
        scale_prefill_workers(decision.num_prefill_workers)
        scale_decode_workers(decision.num_decode_workers)

        # Report completion
        await client.complete(decision)
```
See `components/planner/test/test_virtual_connector.py` for a full working example.
### Planner Configuration Passthrough
Pass planner-specific settings through the DGDR:
```yaml
profilingConfig:
  config:
    planner:
      plannerMinEndpoint: 2
```
### Review Before Deploy (autoApply: false)
Disable auto-deployment to inspect the generated DGD:
```yaml
spec:
  autoApply: false
```
After profiling completes:
```bash
# Extract and review generated DGD
kubectl get dgdr sla-aic -n $NAMESPACE \
-o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
# Review and modify as needed
vi my-dgd.yaml
# Deploy manually
kubectl apply -f my-dgd.yaml -n $NAMESPACE
```
### Profiling Artifacts with PVC
Save detailed profiling artifacts (plots, logs, raw data) to a PVC:
```yaml
spec:
  profilingConfig:
    outputPVC: "dynamo-pvc"
    config:
      sla:
        isl: 3000
        osl: 150
        ttft: 200
        itl: 20
```
Setup:
```bash
export NAMESPACE=your-namespace
deploy/utils/setup_benchmarking_resources.sh
```
Access results:
```bash
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
kubectl delete pod pvc-access-pod -n $NAMESPACE
```
## Related Documentation
- [Planner README](README.md) -- Overview and quick start
- [Planner Guide](planner-guide.md) -- Deployment, configuration, integration
- [Planner Design](../../design-docs/planner-design.md) -- Architecture deep-dive
- [DGDR Configuration Reference](../profiler/profiler-guide.md#dgdr-configuration-structure)
- [SLA-Driven Profiling](../profiler/profiler-guide.md)
......@@ -3,83 +3,35 @@
# SPDX-License-Identifier: Apache-2.0
---
# SLA-Driven Profiling and Planner Deployment Quick Start Guide
# Planner Guide
Complete workflow to deploy SLA-optimized Dynamo models using DynamoGraphDeploymentRequests (DGDR). This guide shows how to automatically profile models and deploy them with optimal configurations that meet your Service Level Agreements (SLAs).
Deployment, configuration, and integration guide for the Dynamo SLA Planner. For a quick overview, see the [Planner README](README.md). For architecture internals, see [Planner Design](../../design-docs/planner-design.md).
> [!WARNING]
> **Prerequisites**: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the [Dynamo Platform installation](../kubernetes/installation-guide.md).
## Deployment
## Overview
### Prerequisites
The DGDR workflow automates the entire process from SLA specification to deployment:
1. **Define SLAs**: Specify performance requirements (TTFT, ITL) and model information in a DGDR Custom Resource
2. **Automatic Profiling**: The Dynamo Operator automatically profiles your model to find optimal configurations
3. **Auto-Deploy**: The system automatically deploys the optimal configuration that meets your SLAs
Before deploying the planner, ensure:
```mermaid
flowchart TD
A[Create DGDR] --> B[DGDR Controller]
B --> C{Profiling Method}
C -->|Online| D[Run Profiling Job<br/>2-4 hours]
C -->|Offline/AIC| E[AI Configurator<br/>20-30 seconds]
D --> F[Generate DGD Config]
E --> F
F --> G[Auto-Deploy DGD]
G --> H[Monitor & Scale]
style A fill:#e1f5fe
style D fill:#fff3e0
style E fill:#e8f5e8
style G fill:#f3e5f5
style H fill:#fff8e1
```
## What is a DynamoGraphDeploymentRequest (DGDR)?
A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that serves as the primary interface for users to request model deployments with specific performance and resource constraints. Think of it as a "deployment order" where you specify:
- **What** model you want to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
The Dynamo Operator watches for DGDRs and automatically:
1. Discovers available GPU resources in your cluster
2. Runs profiling (online or offline) to find optimal configurations
3. Generates an optimized DynamoGraphDeployment (DGD) configuration
4. Deploys the DGD to your cluster
**Key Benefits:**
- **Declarative**: Specify what you want, not how to achieve it
- **Automated**: No manual profiling job setup or result processing
- **SLA-Driven**: Ensures deployments meet your performance requirements
- **Integrated**: Works seamlessly with the Dynamo Operator
## Prerequisites
Before creating a DGDR, ensure:
- **Dynamo platform installed** with the operator running (see [Installation Guide](../kubernetes/installation-guide.md))
- **[kube-prometheus-stack](../kubernetes/observability/metrics.md) installed and running** (required for SLA planner)
- **Dynamo platform installed** with the operator running (see [Installation Guide](../../kubernetes/installation-guide.md))
- **[kube-prometheus-stack](../../kubernetes/observability/metrics.md) installed and running** (required for SLA planner metric collection)
- **Image pull secrets configured** if using private registries (typically `nvcr-imagepullsecret` for NVIDIA images)
- **Sufficient GPU resources** available in your cluster for profiling
- **Runtime images available** that contain both profiler and runtime components
### Container Images
Each DGDR requires you to specify container images for the profiling and deployment process:
Each DGDR requires container images for the profiling and deployment process:
**profilingConfig.profilerImage** (Required):
Specifies the container image used for the profiling job itself. This image must contain the profiler code and dependencies needed for SLA-based profiling.
The container image used for the profiling job. Must contain the profiler code and dependencies for SLA-based profiling.
**deploymentOverrides.workersImage** (Optional):
Specifies the container image used for DynamoGraphDeployment worker components (frontend, workers, planner). This image is used for:
The container image used for DGD worker components (frontend, workers, planner). Used for:
- Temporary DGDs created during online profiling (for performance measurements)
- The final DGD deployed after profiling completes
If `workersImage` is omitted, the image from the base config file (e.g., `disagg.yaml`) is used. You may use our public images (0.6.1 and later) or build and push your own.
If `workersImage` is omitted, the image from the base config file (e.g., `disagg.yaml`) is used. Public images are available from 0.6.1 onward.
```yaml
spec:
......@@ -89,64 +41,57 @@ spec:
workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" # Optional
```
## Quick Start: Deploy with DGDR
### What is a DynamoGraphDeploymentRequest (DGDR)?
### Step 1: Create Your DGDR
Dynamo provides sample DGDR configurations in `benchmarks/profiler/deploy/`. You can use these as starting points:
**Available Sample DGDRs:**
- **`profile_sla_dgdr.yaml`**: Standard online profiling for dense models
- **`profile_sla_aic_dgdr.yaml`**: Fast offline profiling using AI Configurator (TensorRT-LLM)
- **`profile_sla_moe_dgdr.yaml`**: Online profiling for MoE models (SGLang)
Or, you can create your own DGDR for your own needs:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: my-model-deployment # Change the name
namespace: default # Change the namespace
spec:
model: "Qwen/Qwen3-0.6B" # Update to your model
backend: vllm # Backend: vllm, sglang, or trtllm
A **DGDR** is a Kubernetes Custom Resource that serves as the primary interface for deploying models with specific performance and resource constraints. It specifies:
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" # Required
config:
sla:
isl: 3000 # Adjust to your workload
osl: 150 # Adjust to your workload
ttft: 200 # Your target (ms)
itl: 20 # Your target (ms)
- **What** model to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
sweep:
use_ai_configurator: false # Set to true for fast profiling (TensorRT-LLM only)
The Dynamo Operator watches for DGDRs and automatically:
1. Discovers available GPU resources in your cluster
2. Runs profiling (online or offline) to find optimal configurations
3. Generates an optimized DynamoGraphDeployment (DGD) configuration
4. Deploys the DGD to your cluster
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1" # Optional
**Key Benefits:**
- **Declarative**: Specify what you want, not how to achieve it
- **Automated**: No manual profiling job setup or result processing
- **SLA-Driven**: Ensures deployments meet your performance requirements
- **Integrated**: Works seamlessly with the Dynamo Operator
autoApply: true # Auto-deploy after profiling
```
### DGDR Workflow
> [!TIP]
> For detailed explanations of all configuration options (SLA, hardware, sweep, AIC, planner), see the [DGDR Configuration Reference](../benchmarks/sla-driven-profiling.md#dgdr-configuration-reference).
The DGDR workflow automates the entire process from SLA specification to deployment:
### Step 2: Apply the DGDR
1. **Define SLAs**: Specify performance requirements (TTFT, ITL) and model information
2. **Automatic Profiling**: The operator profiles your model to find optimal configurations
3. **Auto-Deploy**: The system deploys the optimal configuration that meets your SLAs
The rest of this quickstart will use the DGDR sample that uses AIC profiling. If you use a different DGDR file and/or name, be sure to adjust the commands accordingly.
```mermaid
flowchart TD
A[Create DGDR] --> B[DGDR Controller]
B --> C{Profiling Method}
C -->|Online| D[Run Profiling Job<br/>2-4 hours]
C -->|Offline/AIC| E[AI Configurator<br/>20-30 seconds]
D --> F[Generate DGD Config]
E --> F
F --> G[Auto-Deploy DGD]
G --> H[Monitor & Scale]
```bash
export NAMESPACE=your-namespace
kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
style A fill:#e1f5fe
style D fill:#fff3e0
style E fill:#e8f5e8
style G fill:#f3e5f5
style H fill:#fff8e1
```
The Dynamo Operator will immediately begin processing your request.
### Monitoring Progress
### Step 3: Monitor Progress
Watch the DGDR status:
Watch DGDR status:
```bash
# View status
......@@ -166,65 +111,47 @@ kubectl logs -f job/profile-sla-aic -n $NAMESPACE
- `Ready`: DGD successfully deployed and running
- `Failed`: Error occurred (check events for details)
> [!NOTE]
> With AI Configurator, profiling completes in **20-30 seconds**! This is much faster than online profiling which takes 2-4 hours.
### Step 4: Access Your Deployment
Once the DGDR reaches `Ready` state, your model is deployed and ready to serve:
```bash
# Find the frontend service
kubectl get svc -n $NAMESPACE | grep trtllm-disagg
# Port-forward to access locally
kubectl port-forward svc/trtllm-disagg-frontend 8000:8000 -n $NAMESPACE
# Test the endpoint
curl http://localhost:8000/v1/models
```
### Relationship to DGD
### Step 5 (Optional): Access the Planner Grafana Dashboard
- **DGDR**: High-level "intent" -- what you want deployed
- **DGD**: Low-level "implementation" -- how it's deployed
If you want to monitor the SLA Planner's decision-making in real-time, you can deploy the Planner Grafana dashboard.
The DGDR controller generates a DGD that:
- Uses optimal TP configurations from profiling
- Includes the SLA planner for autoscaling
- Has deployment and engine settings tuned for your SLAs
```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
The generated DGD is tracked via labels:
```yaml
metadata:
  labels:
    dgdr.nvidia.com/name: sla-aic
    dgdr.nvidia.com/namespace: your-namespace
```
Follow the instructions in [Dynamo Metrics Collection on Kubernetes](../kubernetes/observability/metrics.md) to access the Grafana UI and select the **Dynamo Planner Dashboard**.
## Configuration
The dashboard displays:
- **Worker Counts & GPU Usage**: Current prefill/decode worker counts and cumulative GPU hours
- **Observed Metrics**: Real-time TTFT, ITL, request rate, and sequence lengths from Prometheus
- **Predicted Metrics**: Planner's load predictions and recommended replica counts
- **Correction Factors**: How the planner adjusts predictions based on observed vs expected performance
> [!TIP]
> Use the **Namespace** dropdown at the top of the dashboard to filter metrics for your specific deployment namespace.
## DGDR Configuration Details
### DGDR Configuration
### Required Fields
#### Required Fields
| Field | Type | Description |
|-------|------|-------------|
| `spec.model` | string | Model identifier (e.g., "meta-llama/Llama-3-70b") |
| `spec.model` | string | Model identifier (e.g., `meta-llama/Llama-3-70b`) |
| `spec.backend` | enum | Inference backend: `vllm`, `sglang`, or `trtllm` |
| `spec.profilingConfig.profilerImage` | string | Container image for profiling job |
| `spec.profilingConfig.config.sla` | object | SLA targets (isl, osl, ttft, itl) |
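
Before applying a DGDR, a quick local check of the required fields can catch typos early. The sketch below is a hypothetical pre-flight helper, not the operator's validation logic. It assumes the manifest has already been loaded into a Python dict, for example with a YAML parser.

```python
VALID_BACKENDS = {"vllm", "sglang", "trtllm"}


def missing_required_fields(dgdr: dict) -> list:
    """Return the dotted paths of required DGDR fields that are absent or invalid."""
    spec = dgdr.get("spec") or {}
    profiling = spec.get("profilingConfig") or {}
    problems = []
    if not spec.get("model"):
        problems.append("spec.model")
    if spec.get("backend") not in VALID_BACKENDS:
        problems.append("spec.backend")
    if not profiling.get("profilerImage"):
        problems.append("spec.profilingConfig.profilerImage")
    sla = (profiling.get("config") or {}).get("sla")
    if not isinstance(sla, dict) or not sla:
        problems.append("spec.profilingConfig.config.sla")
    return problems


example = {
    "spec": {
        "model": "Qwen/Qwen3-32B",
        "backend": "vllm",
        "profilingConfig": {
            "profilerImage": "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1",
            "config": {"sla": {"isl": 3000, "osl": 150, "ttft": 200, "itl": 20}},
        },
    }
}
print(missing_required_fields(example))                      # []
print(missing_required_fields({"spec": {"backend": "tgi"}}))
# ['spec.model', 'spec.backend', 'spec.profilingConfig.profilerImage', 'spec.profilingConfig.config.sla']
```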
### Optional Fields
#### Optional Fields
| Field | Type | Description |
|-------|------|-------------|
| `spec.deploymentOverrides.workersImage` | string | Container image for DGD worker components. If omitted, uses image from base config file. |
| `spec.deploymentOverrides.workersImage` | string | Container image for DGD workers. If omitted, uses image from base config. |
| `spec.autoApply` | boolean | Automatically deploy DGD after profiling (default: false) |
| `spec.deploymentOverrides` | object | Customize metadata (name, namespace, labels, annotations) and image for auto-created DGD |
| `spec.useMocker` | boolean | Deploy mocker instead of real backend (default: false) |
| `spec.deploymentOverrides` | object | Customize metadata and image for auto-created DGD |
### SLA Configuration
The `sla` section defines performance requirements and workload characteristics:
#### SLA Configuration
```yaml
sla:
......@@ -240,6 +167,8 @@ sla:
- **ITL**: Token generation latency target (lower = more GPUs needed)
- **Trade-offs**: Tighter SLAs require more GPU resources
For comprehensive documentation of all configuration options, see the [DGDR Configuration Reference](../profiler/profiler-guide.md#dgdr-configuration-structure).
### Profiling Methods
Choose between **online profiling** (real measurements, 2-4 hours) or **offline profiling** with AI Configurator (estimated, 20-30 seconds):
......@@ -247,154 +176,190 @@ Choose between **online profiling** (real measurements, 2-4 hours) or **offline
```yaml
# Online Profiling (Default)
sweep:
use_ai_configurator: false
useAiConfigurator: false
# Offline Profiling (AI Configurator - TensorRT-LLM only)
# Offline Profiling (AI Configurator)
sweep:
use_ai_configurator: true
aic_system: h200_sxm
aic_hf_id: Qwen/Qwen3-32B
aic_backend_version: "0.20.0"
useAiConfigurator: true
aicSystem: h200_sxm
aicHfId: Qwen/Qwen3-32B
aicBackendVersion: "0.20.0"
```
> [!NOTE]
> For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](../benchmarks/sla-driven-profiling.md#profiling-method).
For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](../profiler/profiler-guide.md#profiling-method).
### Load Predictors
The SLA planner forecasts the number of requests, ISL, and OSL in the next adjustment interval. Four prediction models are supported:
#### Constant Predictor
- **Use case**: Stable workloads with long prediction intervals
- **Behavior**: Assumes next load equals current load
- **Configuration**: `load-predictor: "constant"`
#### ARIMA Predictor
- **Use case**: Time-series data with trends and seasonality
- **Behavior**: Uses auto-ARIMA to fit optimal model parameters
- **Configuration**: `load-predictor: "arima"`
- **Tunable parameters**:
- `--load-predictor-log1p`: model `log1p(y)` instead of `y`. If not set, ARIMA starts in raw space, and if it collapses to `(0,d,0)`, it falls back to `log1p` automatically.
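
The `log1p` option amounts to forecasting in a transformed space and mapping the result back. As a minimal sketch (the helper names and the stand-in predictors are hypothetical, and clamping at zero is an assumption), the round trip looks like this:

```python
import math


def forecast_in_log1p_space(history, predict):
    """Transform with log1p, forecast, then invert with expm1."""
    transformed = [math.log1p(max(y, 0.0)) for y in history]
    # Clamp at zero so a count or rate forecast is never negative.
    return max(math.expm1(predict(transformed)), 0.0)


def last_value(series):
    """Stand-in for the constant predictor (next = current)."""
    return series[-1]


def mean_value(series):
    """Stand-in for a mean-only model."""
    return sum(series) / len(series)


history = [0, 9, 99]  # bursty request counts
print(round(forecast_in_log1p_space(history, last_value), 6))  # 99.0
print(round(forecast_in_log1p_space(history, mean_value), 6))  # 9.0
```

In raw space the mean of this history is 36, dominated by the spike. In `log1p` space the mean maps back to 9, which illustrates how the transform dampens the influence of bursts on the fitted model.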
#### Kalman Predictor
- **Use case**: Low-latency online forecasting (observe 1 -> predict 1) with smooth adaptation
- **Behavior**: Local linear trend Kalman filter (fast online updates; good default when ARIMA collapses to mean-only)
- **Configuration**: `load-predictor: "kalman"`
- **Tunable parameters**:
- `--kalman-q-level`: process noise for level (higher = more responsive)
- `--kalman-q-trend`: process noise for trend (higher = trend changes faster)
- `--kalman-r`: measurement noise (lower = trusts new measurements more)
- `--kalman-min-points`: minimum points before forecasting
- `--load-predictor-log1p`: model `log1p(y)` instead of `y` (often helps request-rate/count series)
#### Prophet Predictor
- **Use case**: Complex seasonal patterns and trend changes
- **Behavior**: Facebook's [Prophet](https://facebook.github.io/prophet/) model for time-series forecasting
- **Configuration**: `load-predictor: "prophet"`
- **Tunable parameters**:
- `--prophet-window-size`: bounds internal history to control refit cost
- `--load-predictor-log1p`: model `log1p(y)` instead of `y`
#### Warm-starting Load Predictors (Optional)
You can warm-start load predictors with a mooncake-style JSONL trace file:
- **CLI argument**: `--load-predictor-warmup-trace <path/to/trace.jsonl>`
- **Effect**: preloads predictors with historical request-count / ISL / OSL samples extracted from the trace
### Planner Scaling Parameters
| Argument | Default | Description |
|----------|---------|-------------|
| `--adjustment-interval` | `180` | Seconds between scaling decisions |
| `--ttft` | `500.0` | Target Time To First Token (ms) |
| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
| `--isl` | `3000` | Expected average input sequence length |
| `--osl` | `150` | Expected average output sequence length |
| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
| `--min-endpoint` | `1` | Minimum replicas per worker type |
| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
| `--no-operation` | `false` | Observation mode (no actual scaling) |
| `--no-correction` | `false` | Disable correction factors |
### Hardware Configuration
#### Planner Configuration Passthrough
For details on hardware configuration and GPU discovery options, see [Hardware Configuration in SLA-Driven Profiling](../benchmarks/sla-driven-profiling.md#hardware-configuration).
Add planner-specific settings in the DGDR:
### Advanced Configuration
```yaml
profilingConfig:
  config:
    planner:
      plannerMinEndpoint: 2
```
#### Using Existing DGD Configs (Recommended for Custom Setups)
## Integration
If you have an existing DynamoGraphDeployment config (e.g., from `examples/backends/*/deploy/disagg.yaml` or custom recipes), you can reference it via ConfigMap:
### Prometheus Setup
**Step 1: Create ConfigMap from your DGD config file:**
The planner queries Prometheus to collect frontend request metrics. The architecture:
```bash
kubectl create configmap deepseek-r1-config \
--from-file=disagg.yaml=/path/to/your/disagg.yaml \
--namespace $NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -
```mermaid
flowchart LR
Frontend --"/metrics"--> Prometheus
Planner --"query API"--> Prometheus
Planner --"scaling decisions"--> Workers
Frontend -.->|"requests"| Workers
```
**Step 2: Reference the ConfigMap in your DGDR:**
**Components:**
- **Frontend**: Serves requests and exposes `/metrics`
- **Prometheus**: Scrapes frontend metrics every 5s (configurable in podmonitor manifest)
- **Planner**: Queries Prometheus and adjusts worker scaling every adjustment interval
- **Workers**: Prefill and backend workers handle inference
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: deepseek-r1
spec:
model: deepseek-ai/DeepSeek-R1
backend: sglang
The planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with request count, ISL, OSL, TTFT, and ITL in the correct format. The Dynamo frontend provides these metrics automatically.
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
configMapRef:
name: deepseek-r1-config
key: disagg.yaml # Must match the key used in --from-file
config:
sla:
isl: 4000
osl: 500
ttft: 300
itl: 10
sweep:
use_ai_configurator: true
aic:
system: h200_sxm
model_name: DEEPSEEK_V3
backend_version: "0.20.0"
**Prometheus endpoint configuration:**
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
| Variable | Default |
|----------|---------|
| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` |
autoApply: true
```
If you see errors like "Failed to resolve prometheus service", ensure `PROMETHEUS_ENDPOINT` points to your Prometheus service.
> **What's happening**: The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` into `deployment.model` and `spec.backend` into `engine.backend` in the final configuration.
### Virtual Deployment
#### Inline Configuration (Simple Use Cases)
The SLA planner supports virtual deployment mode for customized environments (e.g., custom orchestrators) through the `VirtualConnector`. This connector enables the planner to communicate scaling decisions without directly managing Kubernetes resources.
For simple use cases without a custom DGD config, provide profiler configuration directly. The profiler will auto-generate a basic DGD configuration from your `model` and `backend`:
The `VirtualConnector` acts as a bridge between the SLA planner and external deployment environments. Instead of PATCHing DGD resources, it writes scaling decisions and waits for the external environment to acknowledge completion.
```yaml
profilingConfig:
config:
# SLA targets (required for profiling)
sla:
isl: 8000 # Input sequence length
osl: 200 # Output sequence length
ttft: 200.0 # Time To First Token (ms)
itl: 10.0 # Inter-Token Latency (ms)
# Hardware constraints (optional)
hardware:
min_num_gpus_per_engine: 2
max_num_gpus_per_engine: 8
gpu_type: h200_sxm
# Profiling sweep settings (optional)
sweep:
prefill_interpolation_granularity: 16 # Number of samples for prefill ISL sweep
decode_interpolation_granularity: 6 # Number of samples for decode sweep
```
#### Scaling Decision Flow
> **Note**: `engine.config` is a **file path** to a DGD YAML file, not inline configuration. Use ConfigMapRef (recommended) or leave it unset to auto-generate.
1. **Decision Generation**: The planner calculates optimal worker counts
2. **Change Detection**: Skips scaling if target counts match current counts, logging: `"No scaling needed (prefill=X, decode=Y)"`
3. **Readiness Check**: Verifies previous scaling operations completed by checking `scaled_decision_id >= decision_id`
4. **Timeout Handling**: If not acknowledged within 30 minutes (1800 seconds), proceeds with new decisions
5. **Completion Tracking**: Optionally waits for scaling completion confirmation (blocking mode)
#### Planner Configuration Passthrough
Add planner-specific settings. Planner arguments use a `planner_` prefix:
#### Configuration
To use virtual deployment mode:
```yaml
profilingConfig:
config:
planner:
planner_min_endpoint: 2
environment: "virtual"
backend: "vllm" # or "sglang"
```
## Understanding Profiling Results
#### Deployment Environment Requirements
For details about the profiling process, performance plots, and interpolation data, see [SLA-Driven Profiling Documentation](../benchmarks/sla-driven-profiling.md).
The external deployment environment must use `VirtualConnectorClient`:
## Advanced Topics
```python
from dynamo._core import DistributedRuntime, VirtualConnectorClient
### Mocker Deployment
client = VirtualConnectorClient(distributed_runtime, namespace)
```
Instead of a real DGD that uses GPU resources, you can deploy a mocker deployment that uses simulated engines rather than GPUs. Mocker is available in all backend images and uses profiling data to simulate realistic GPU timing behavior. It is useful for:
- Large-scale experiments without GPU resources
- Testing Planner behavior and infrastructure
- Validating deployment configurations
1. **Monitor Planner**: Continuously watch for scaling decisions: `await client.wait()` (blocks until change)
2. **Parse Decisions**: Read values: `decision = await client.get()`
3. **Execute Scaling**: Apply the scaling decisions to your infrastructure
4. **Acknowledge Completion**: Mark done: `await client.complete(decision)`
To deploy mocker instead of the real backend, set `useMocker: true`:
A scaling decision (returned by `client.get()`) contains:
- `num_prefill_workers`: Target number of prefill workers (-1 if not set)
- `num_decode_workers`: Target number of decode workers (-1 if not set)
- `decision_id`: Incremental ID for each scaling decision
```yaml
spec:
model: <model-name>
backend: trtllm # Real backend for profiling (vllm, sglang, or trtllm)
useMocker: true # Deploy mocker instead of real backend
See `components/planner/test/test_virtual_connector.py` for a full example.
profilingConfig:
profilerImage: "nvcr.io/nvidia/dynamo/trtllm-runtime:<image-tag>"
...
autoApply: true
### Grafana Dashboard
Deploy the planner Grafana dashboard:
```bash
kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
```
Profiling still runs against the real backend (via GPUs or AIC) to collect performance data. The mocker deployment then uses this data to simulate realistic timing behavior.
Follow [Dynamo Metrics Collection on Kubernetes](../../kubernetes/observability/metrics.md) to access the Grafana UI and select the **Dynamo Planner Dashboard**.
The dashboard displays:
- **Worker Counts & GPU Usage**: Current prefill/decode worker counts and cumulative GPU hours
- **Observed Metrics**: Real-time TTFT, ITL, request rate, and sequence lengths from Prometheus
- **Predicted Metrics**: Planner's load predictions and recommended replica counts
- **Correction Factors**: How the planner adjusts predictions based on observed vs expected performance
### DGDR Immutability
> Use the **Namespace** dropdown at the top of the dashboard to filter metrics for your deployment namespace.
DGDRs are **immutable** - if you need to update SLAs or configuration:
## DGDR Immutability
DGDRs are **immutable**. To update SLAs or configuration:
1. Delete the existing DGDR: `kubectl delete dgdr sla-aic`
2. Create a new DGDR with updated specifications
### Manual Deployment Control
There are two ways to manually control deployment after profiling:
## Manual Deployment Control
#### Option 1: Use DGDR-Generated Configuration (Recommended)
### Option 1: Use DGDR-Generated Configuration (Recommended)
Disable auto-deployment to review the generated DGD before applying:
......@@ -403,7 +368,7 @@ spec:
autoApply: false
```
Then manually extract and apply the generated DGD:
Then manually extract and apply:
```bash
# Extract generated DGD from DGDR status
......@@ -411,90 +376,43 @@ kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment
# Or save to file first for review/modification
kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
vi my-dgd.yaml
kubectl apply -f my-dgd.yaml -n $NAMESPACE
```
The generated DGD includes optimized configurations and the SLA planner component. The required `planner-profile-data` ConfigMap is automatically created when profiling completes, so the DGD will deploy successfully.
### Option 2: Use Standalone Planner Templates (Advanced)
#### Option 2: Use Standalone Planner Templates (Advanced)
For advanced use cases, you can manually deploy using the standalone planner templates in `examples/backends/*/deploy/disagg_planner.yaml`:
For advanced use cases, use the standalone planner templates in `examples/backends/*/deploy/disagg_planner.yaml`:
```bash
# After profiling completes, profiling data is automatically stored in ConfigMaps
# OPTIONAL: Inspect profiling results stored in ConfigMaps
# View the generated DGD configuration
# After profiling completes, profiling data is stored in ConfigMaps
kubectl get configmap dgdr-output-<dgdr-name> -n $NAMESPACE -o yaml
# View the planner profiling data (JSON format)
kubectl get configmap planner-profile-data -n $NAMESPACE -o yaml
# Update the PROMETHEUS_ENDPOINT environment variable in the planner template
# to match your cluster's Prometheus service location (see comments in the template)
# Update backend planner manifest as needed, then deploy
# Update PROMETHEUS_ENDPOINT in the template, then deploy
kubectl apply -f examples/backends/<backend>/deploy/disagg_planner.yaml -n $NAMESPACE
```
> **Note**: The standalone templates are provided as examples and may need customization for your model and requirements. The DGDR-generated configuration (Option 1) is recommended as it's automatically tuned to your profiling results and SLA targets.
>
> **Important - Prometheus Configuration**: The planner queries Prometheus to get frontend request metrics for scaling decisions. If you see errors like "Failed to resolve prometheus service", ensure the `PROMETHEUS_ENDPOINT` environment variable in your planner configuration correctly points to your Prometheus service. See the comments in the example templates for details.
### Relationship to DynamoGraphDeployment (DGD)
## Accessing Profiling Artifacts
- **DGDR**: High-level "intent" - what you want deployed
- **DGD**: Low-level "implementation" - how it's deployed
By default, profiling jobs save essential data to ConfigMaps. For detailed artifacts, configure the DGDR to use `dynamo-pvc`:
The DGDR controller generates a DGD that:
- Uses optimal TP configurations from profiling
- Includes SLA planner for autoscaling
- Has deployment and engine settings tuned for your SLAs
The generated DGD is tracked via labels:
```yaml
metadata:
labels:
dgdr.nvidia.com/name: sla-aic
dgdr.nvidia.com/namespace: your-namespace
```
### Accessing Detailed Profiling Artifacts
By default, profiling jobs save essential data to ConfigMaps for planner integration. For advanced users who need access to detailed artifacts (logs, performance plots, AIPerf results, etc), configure the DGDR to use `dynamo-pvc`. This is optional and will not affect the functionality of profiler or Planner.
**What's available in ConfigMaps (always created):**
**ConfigMaps (always created):**
- Generated DGD configuration
- Profiling data for Planner (`.json` files)
**What's available in PVC if attached to DGDR (optional):**
**PVC (optional):**
- Performance plots (PNGs)
- DGD configuration and logs of all services for each profiled deployment
- AIPerf profiling artifacts for each AIPerf run
- DGD configuration and logs for each profiled deployment
- AIPerf profiling artifacts
- Raw profiling data (`.npz` files)
- Profiler log
**Setup:**
1. Set up the benchmarking PVC:
```bash
export NAMESPACE=your-namespace
# Setup PVC
deploy/utils/setup_benchmarking_resources.sh
```
2. Add `outputPVC` to your DGDR's `profilingConfig`:
```yaml
spec:
profilingConfig:
outputPVC: "dynamo-pvc"
config:
# ... rest of config
```
3. After profiling completes, access results:
```bash
# Access results after profiling
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
......@@ -525,25 +443,15 @@ kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE
| **Profiling fails** | Check job logs: `kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE` |
| **SLA cannot be met** | Relax TTFT/ITL targets or add more GPUs |
| **DGD not deployed** | Verify `autoApply: true` in DGDR spec |
| **Prometheus errors** | Ensure `PROMETHEUS_ENDPOINT` env var points to your Prometheus service |
> [!NOTE]
> For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](../benchmarks/sla-driven-profiling.md#troubleshooting).
## Configuration Reference
For comprehensive documentation of all DGDR configuration options, see the [DGDR Configuration Reference](../benchmarks/sla-driven-profiling.md#dgdr-configuration-reference).
This includes detailed explanations of:
- **SLA Configuration**: ISL, OSL, TTFT, ITL with use cases and trade-offs
- **Hardware Configuration**: GPU constraints and search space control
- **Sweep Configuration**: Profiling behavior and interpolation settings
- **AI Configurator Configuration**: System types, model mappings, backend versions
- **Planner Configuration**: Autoscaling and adjustment parameters
- **Complete Examples**: Full DGDRs for online, offline (AIC), and MoE profiling
For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](../profiler/profiler-guide.md#troubleshooting).
## Related Documentation
- [DGDR API Reference](../kubernetes/api-reference.md)
- [Pre-Deployment Profiling Details](../benchmarks/sla-driven-profiling.md)
- [SLA Planner Architecture](sla-planner.md)
- [Dynamo Operator Guide](../kubernetes/dynamo-operator.md)
- [Planner README](README.md) -- Overview and quick start
- [Planner Examples](planner-examples.md) -- DGDR YAML examples and sample configurations
- [Planner Design](../../design-docs/planner-design.md) -- Architecture deep-dive for contributors
- [DGDR API Reference](../../kubernetes/api-reference.md)
- [Pre-Deployment Profiling](../profiler/profiler-guide.md)
- [Dynamo Operator Guide](../../kubernetes/dynamo-operator.md)
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Profiler
The Dynamo Profiler is an automated performance analysis tool that measures model inference characteristics to optimize deployment configurations. It determines optimal tensor parallelism (TP) settings for prefill and decode phases, generates performance interpolation data, and enables SLA-driven autoscaling through the Planner.
## Feature Matrix
| Feature | vLLM | SGLang | TensorRT-LLM |
|---------|------|--------|--------------|
| Dense Model Profiling | ✅ | ✅ | ✅ |
| MoE Model Profiling | 🚧 | ✅ | 🚧 |
| AI Configurator (Offline) | ❌ | ❌ | ✅ |
| Online Profiling (AIPerf) | ✅ | ✅ | ✅ |
| Interactive WebUI | ✅ | ✅ | ✅ |
| Runtime Profiling Endpoints | ❌ | ✅ | ❌ |
## Quick Start
### Prerequisites
- Dynamo platform installed (see [Installation Guide](../../kubernetes/installation-guide.md))
- Kubernetes cluster with GPU nodes (for DGDR-based profiling)
- kube-prometheus-stack installed (required for SLA planner)
### Using DynamoGraphDeploymentRequest (Recommended)
The recommended way to profile models is through DGDRs, which automate the entire profiling and deployment workflow.
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: my-model-profiling
spec:
model: "Qwen/Qwen3-0.6B"
backend: vllm
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
config:
sla:
isl: 3000 # Average input sequence length
osl: 150 # Average output sequence length
ttft: 200.0 # Target Time To First Token (ms)
itl: 20.0 # Target Inter-Token Latency (ms)
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
autoApply: true
```
```bash
kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE
```
### Using AI Configurator (Fast Offline Profiling)
For TensorRT-LLM, use AI Configurator for rapid profiling (~30 seconds):
```yaml
profilingConfig:
config:
sweep:
useAiConfigurator: true
aicSystem: h200_sxm
aicHfId: Qwen/Qwen3-32B
aicBackendVersion: "0.20.0"
```
### Direct Script Usage (Advanced)
For advanced scenarios, run the profiler directly:
```bash
python -m benchmarks.profiler.profile_sla \
--backend vllm \
--config path/to/disagg.yaml \
--model meta-llama/Llama-3-8B \
--ttft 200 --itl 15 \
--isl 3000 --osl 150
```
## Configuration
| Parameter | Default | Description |
|-----------|---------|-------------|
| `sla.isl` | - | Average input sequence length (tokens) |
| `sla.osl` | - | Average output sequence length (tokens) |
| `sla.ttft` | - | Target Time To First Token (milliseconds) |
| `sla.itl` | - | Target Inter-Token Latency (milliseconds) |
| `sweep.useAiConfigurator` | `false` | Use offline simulation instead of real profiling |
| `hardware.minNumGpusPerEngine` | auto | Minimum GPUs per engine (auto-detected from model size) |
| `hardware.maxNumGpusPerEngine` | 8 | Maximum GPUs per engine |
## Profiling Methods
| Method | Duration | Accuracy | GPU Required | Backends |
|--------|----------|----------|--------------|----------|
| Online (AIPerf) | 2-4 hours | Highest | Yes | All |
| Offline (AI Configurator) | 20-30 seconds | Estimated | No | TensorRT-LLM |
## Output
The profiler generates:
1. **Optimal Configuration**: Recommended TP sizes for prefill and decode engines
2. **Performance Data**: Interpolation models for the SLA Planner
3. **Generated DGD**: Complete deployment manifest with optimized settings
Example recommendations:
```text
Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
```
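When these recommendations need to be consumed by another tool, the two lines can be parsed into structured values. The following is a minimal sketch, assuming the output keeps exactly the shape of the sample lines above; the regular expression and the `parse_recommendations` helper are illustrative and not part of the profiler.

```python
import re

# Pattern inferred from the two sample lines above; the real output may differ.
_LINE = re.compile(
    r"Suggested (?P<phase>prefill|decode) TP:(?P<tp>\d+) "
    r"\((?P<metric>TTFT|ITL) (?P<latency_ms>[\d.]+) ms, "
    r"throughput (?P<throughput>[\d.]+) tokens/s/GPU\)"
)


def parse_recommendations(text):
    """Return a dict keyed by phase ("prefill" / "decode") with parsed numbers."""
    result = {}
    for match in _LINE.finditer(text):
        result[match["phase"]] = {
            "tp": int(match["tp"]),
            "metric": match["metric"],
            "latency_ms": float(match["latency_ms"]),
            "throughput_per_gpu": float(match["throughput"]),
        }
    return result


sample = """\
Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
"""
recs = parse_recommendations(sample)
print(recs["prefill"]["tp"], recs["decode"]["latency_ms"])
```

Lines that do not match the assumed pattern are skipped, so a missing phase in the returned dict signals that the format has changed.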
## Next Steps
| Document | Description |
|----------|-------------|
| [Profiler Guide](profiler-guide.md) | Configuration, methods, and troubleshooting |
| [Profiler Examples](profiler-examples.md) | Complete DGDR YAMLs, WebUI, script examples |
| [SLA Planner Guide](../planner/planner-guide.md) | End-to-end deployment workflow |
| [SLA Planner Architecture](../planner/planner-guide.md) | How the Planner uses profiling data |
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Profiler Examples
Complete examples for profiling with DGDRs, the interactive WebUI, and direct script usage.
## DGDR Examples
### Dense Model: AIPerf on Real Engines
Standard online profiling with real GPU measurements:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: vllm-dense-online
spec:
model: "Qwen/Qwen3-0.6B"
backend: vllm
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
config:
sla:
isl: 3000
osl: 150
ttft: 200.0
itl: 20.0
hardware:
minNumGpusPerEngine: 1
maxNumGpusPerEngine: 8
sweep:
useAiConfigurator: false
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
autoApply: true
```
### Dense Model: AI Configurator Simulation
Fast offline profiling (~30 seconds, TensorRT-LLM only):
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: trtllm-aic-offline
spec:
model: "Qwen/Qwen3-32B"
backend: trtllm
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
config:
sla:
isl: 4000
osl: 500
ttft: 300.0
itl: 10.0
sweep:
useAiConfigurator: true
aicSystem: h200_sxm # Also supports h100_sxm, b200_sxm, gb200_sxm, a100_sxm
aicHfId: Qwen/Qwen3-32B
aicBackendVersion: "0.20.0"
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
autoApply: true
```
### MoE Model
Multi-node MoE profiling with SGLang:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: sglang-moe
spec:
model: "deepseek-ai/DeepSeek-R1"
backend: sglang
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
config:
sla:
isl: 2048
osl: 512
ttft: 300.0
itl: 25.0
hardware:
numGpusPerNode: 8
maxNumGpusPerEngine: 32
engine:
isMoeModel: true
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
autoApply: true
```
### Using Existing DGD Config (ConfigMap)
Reference a custom DGD configuration via ConfigMap:
```bash
# Create ConfigMap from your DGD config file
kubectl create configmap deepseek-r1-config \
--from-file=/path/to/your/disagg.yaml \
--namespace $NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -
```
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: deepseek-r1
spec:
model: deepseek-ai/DeepSeek-R1
backend: sglang
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
configMapRef:
name: deepseek-r1-config
key: disagg.yaml
config:
sla:
isl: 4000
osl: 500
ttft: 300
itl: 10
sweep:
useAiConfigurator: true
aicSystem: h200_sxm
aicHfId: deepseek-ai/DeepSeek-V3
aicBackendVersion: "0.20.0"
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
autoApply: true
```
## Interactive WebUI
Launch an interactive configuration selection interface:
```bash
python -m benchmarks.profiler.profile_sla \
--backend trtllm \
--config path/to/disagg.yaml \
--pick-with-webui \
--use-ai-configurator \
--model Qwen/Qwen3-32B-FP8 \
--aic-system h200_sxm \
--ttft 200 --itl 15
```
The WebUI launches on port 8000 by default (configurable with `--webui-port`).
### Features
- **Interactive Charts**: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables
- **Pareto-Optimal Analysis**: The GPU Hours table shows pareto-optimal configurations balancing latency and throughput
- **DGD Config Preview**: Click "Show Config" on any row to view the corresponding DynamoGraphDeployment YAML
- **GPU Cost Estimation**: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests)
- **SLA Visualization**: Red dashed lines indicate your TTFT and ITL targets
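To make the idea of a pareto-optimal configuration concrete, the sketch below filters a list of candidates down to those that no other candidate beats on both axes. It is an illustration only: the `Candidate` type, the choice of latency and GPU hours as the two objectives, and the numbers are assumptions, not the WebUI's actual implementation.

```python
from typing import NamedTuple


class Candidate(NamedTuple):
    name: str
    latency_ms: float
    gpu_hours: float


def pareto_front(candidates):
    """Keep candidates that no other candidate beats on both latency and GPU hours."""
    front = []
    for c in candidates:
        dominated = any(
            o.latency_ms <= c.latency_ms
            and o.gpu_hours <= c.gpu_hours
            and (o.latency_ms < c.latency_ms or o.gpu_hours < c.gpu_hours)
            for o in candidates
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.gpu_hours)


# Made-up numbers for illustration only.
candidates = [
    Candidate("P1/D4", 40.0, 3.0),
    Candidate("P2/D4", 30.0, 4.0),
    Candidate("P4/D8", 35.0, 6.0),  # dominated by P2/D4 (slower and more expensive)
    Candidate("P4/D4", 20.0, 5.0),
]
front = pareto_front(candidates)
print([c.name for c in front])
```

Every row that survives the filter represents a different trade-off, which is why the table can list several configurations rather than a single winner.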
### Selection Methods
1. **GPU Hours Table** (recommended): Click any row to select both prefill and decode configurations at once based on the pareto-optimal combination
2. **Individual Selection**: Click one row in the Prefill table AND one row in the Decode table to manually choose each
### Example DGD Config Output
When you click "Show Config", you see a DynamoGraphDeployment configuration:
```yaml
# DynamoGraphDeployment Configuration
# Prefill: 1 GPU(s), TP=1
# Decode: 4 GPU(s), TP=4
# Model: Qwen/Qwen3-32B-FP8
# Backend: trtllm
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
spec:
services:
PrefillWorker:
subComponentType: prefill
replicas: 1
extraPodSpec:
mainContainer:
args:
- --tensor-parallel-size=1
DecodeWorker:
subComponentType: decode
replicas: 1
extraPodSpec:
mainContainer:
args:
- --tensor-parallel-size=4
```
Once you select a configuration, the full DGD CRD is saved as `config_with_planner.yaml`.
## Direct Script Examples
### Basic Profiling
```bash
python -m benchmarks.profiler.profile_sla \
--backend vllm \
--config path/to/disagg.yaml \
--model meta-llama/Llama-3-8B \
--ttft 200 --itl 15 \
--isl 3000 --osl 150
```
### With GPU Constraints
```bash
python -m benchmarks.profiler.profile_sla \
--backend sglang \
--config examples/backends/sglang/deploy/disagg.yaml \
--model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--ttft 200 --itl 15 \
--isl 3000 --osl 150 \
--min-num-gpus 2 \
--max-num-gpus 8
```
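When the profiler is launched from automation rather than a shell, the same command line can be assembled programmatically. This is a small sketch that uses only the flags shown in the examples above; the `profile_sla_argv` helper is hypothetical and does not exist in the repository.

```python
import shlex


def profile_sla_argv(backend, config, model, ttft, itl, isl, osl,
                     min_num_gpus=None, max_num_gpus=None):
    """Build the argv list for benchmarks.profiler.profile_sla."""
    argv = [
        "python", "-m", "benchmarks.profiler.profile_sla",
        "--backend", backend,
        "--config", config,
        "--model", model,
        "--ttft", str(ttft), "--itl", str(itl),
        "--isl", str(isl), "--osl", str(osl),
    ]
    # GPU bounds are optional; omit the flags to let the profiler choose.
    if min_num_gpus is not None:
        argv += ["--min-num-gpus", str(min_num_gpus)]
    if max_num_gpus is not None:
        argv += ["--max-num-gpus", str(max_num_gpus)]
    return argv


argv = profile_sla_argv(
    backend="sglang",
    config="examples/backends/sglang/deploy/disagg.yaml",
    model="deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
    ttft=200, itl=15, isl=3000, osl=150,
    min_num_gpus=2, max_num_gpus=8,
)
print(shlex.join(argv))
# To execute it: subprocess.run(argv, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with model names or paths that contain special characters.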
### AI Configurator (Offline)
```bash
python -m benchmarks.profiler.profile_sla \
--backend trtllm \
--config path/to/disagg.yaml \
--use-ai-configurator \
--model Qwen/Qwen3-32B-FP8 \
--aic-system h200_sxm \
--ttft 200 --itl 15 \
--isl 4000 --osl 500
```
## SGLang Runtime Profiling
Profile SGLang workers at runtime via HTTP endpoints:
```bash
# Start profiling
curl -X POST http://localhost:9090/engine/start_profile \
-H "Content-Type: application/json" \
-d '{"output_dir": "/tmp/profiler_output"}'
# Run inference requests to generate profiling data...
# Stop profiling
curl -X POST http://localhost:9090/engine/stop_profile
```
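The same two calls can be issued from Python using only the standard library. The sketch below builds the requests without sending them, so it runs without a live worker; the `build_profile_requests` helper is illustrative, and the host, port, and output directory are simply the values from the `curl` example.

```python
import json
import urllib.request


def build_profile_requests(base_url, output_dir):
    """Build (but do not send) the start and stop profiling requests."""
    start = urllib.request.Request(
        f"{base_url}/engine/start_profile",
        data=json.dumps({"output_dir": output_dir}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    stop = urllib.request.Request(f"{base_url}/engine/stop_profile", method="POST")
    return start, stop


start_req, stop_req = build_profile_requests(
    "http://localhost:9090", "/tmp/profiler_output"
)
print(start_req.get_method(), start_req.full_url)
# To send a request: urllib.request.urlopen(start_req, timeout=30)
```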
A test script is provided at `examples/backends/sglang/test_sglang_profile.py`:
```bash
python examples/backends/sglang/test_sglang_profile.py
```
View traces using Chrome's `chrome://tracing`, [Perfetto UI](https://ui.perfetto.dev/), or TensorBoard.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Profiler Guide
This guide covers deployment, configuration, integration, and troubleshooting for the Dynamo Profiler.
## What is a DynamoGraphDeploymentRequest (DGDR)?
A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that serves as the primary interface for users to request model deployments with specific performance and resource constraints. You specify:
- **What** model you want to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
The Dynamo Operator watches for DGDRs and automatically:
1. Discovers available GPU resources in your cluster
2. Runs profiling (online or offline) to find optimal configurations
3. Generates an optimized DynamoGraphDeployment (DGD) configuration
4. Deploys the DGD to your cluster
**Relationship to DGD:**
- **DGDR**: High-level "intent" - what you want deployed
- **DGD**: Low-level "implementation" - how it's deployed
## Support Matrix
| Backend | Dense Models | MoE Models |
|---------|-------------|------------|
| vLLM | ✅ | 🚧 |
| SGLang | ✅ | ✅ |
| TensorRT-LLM | ✅ | 🚧 |
The profiler sweeps over the following parallelization mappings for prefill and decode:
| Model Architecture | Prefill Parallelization Mapping | Decode Parallelization Mapping |
|---------|-------------|------------|
| MLA+MoE (DeepseekV3ForCausalLM, DeepseekV32ForCausalLM) | TEP, DEP | TEP, DEP |
| GQA+MoE (Qwen3MoeForCausalLM) | TP, TEP, DEP | TP, TEP, DEP |
| Other Models | TP | TP |
> [!NOTE]
> Support for a given combination of model and parallelization mapping depends on the backend. The profiler does not guarantee that the backend supports the recommended P/D engine configuration or runs it without bugs.
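
The sweep table can be read as a simple lookup from model architecture to candidate mappings. The following sketch encodes it that way; the `sweep_mappings` helper and the dictionary are illustrative, and the profiler's own data structures may look different.

```python
# Candidate mappings per architecture, copied from the table above.
# The table lists the same candidates for prefill and decode.
SWEEP_MAPPINGS = {
    "DeepseekV3ForCausalLM": ("TEP", "DEP"),       # MLA+MoE
    "DeepseekV32ForCausalLM": ("TEP", "DEP"),      # MLA+MoE
    "Qwen3MoeForCausalLM": ("TP", "TEP", "DEP"),   # GQA+MoE
}
DEFAULT_MAPPINGS = ("TP",)  # "Other Models" row


def sweep_mappings(architecture):
    """Return the parallelization mappings swept for a model architecture."""
    return SWEEP_MAPPINGS.get(architecture, DEFAULT_MAPPINGS)


print(sweep_mappings("Qwen3MoeForCausalLM"))
print(sweep_mappings("LlamaForCausalLM"))
```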
## Deployment
### Kubernetes Deployment (DGDR)
The recommended deployment method is through DGDRs. Sample configurations are provided in `benchmarks/profiler/deploy/`:
| Sample | Description |
|--------|-------------|
| `profile_sla_dgdr.yaml` | Standard online profiling with AIPerf |
| `profile_sla_aic_dgdr.yaml` | Fast offline profiling with AI Configurator |
| `profile_sla_moe_dgdr.yaml` | MoE model profiling (SGLang) |
#### Container Images
Each DGDR requires container images for profiling and deployment:
- **`profilingConfig.profilerImage`** (Required): Container image for the profiling job. Must contain the profiler code and dependencies.
- **`deploymentOverrides.workersImage`** (Optional): Container image for DGD worker components (frontend, workers, planner). If omitted, the image from the base config file is used.
```yaml
spec:
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
```
#### Quick Start: Deploy with DGDR
**Step 1: Create Your DGDR**
Use a sample configuration or create your own:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
name: my-model-profiling
spec:
model: "Qwen/Qwen3-0.6B"
backend: vllm
profilingConfig:
profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
config:
sla:
isl: 3000
osl: 150
ttft: 200.0
itl: 20.0
deploymentOverrides:
workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
autoApply: true
```
**Step 2: Apply the DGDR**
```bash
export NAMESPACE=your-namespace
kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE
```
**Step 3: Monitor Progress**
```bash
# View status
kubectl get dgdr -n $NAMESPACE
# Detailed status
kubectl describe dgdr my-model-profiling -n $NAMESPACE
# Watch profiling job logs
kubectl logs -f job/profile-my-model-profiling -n $NAMESPACE
```
**DGDR Status States:**
- `Pending`: Initial state, preparing to profile
- `Profiling`: Running profiling job (20-30 seconds for AIC, 2-4 hours for online)
- `Deploying`: Generating and applying DGD configuration
- `Ready`: DGD successfully deployed and running
- `Failed`: Error occurred (check events for details)
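The status states above form a small state machine with two terminal states, `Ready` and `Failed`. A polling loop over them might look like the sketch below. The `get_state` callable is a placeholder for however you read the DGDR status (for example by wrapping `kubectl get dgdr`), and the loop itself is an assumption about usage, not operator behavior.

```python
import time

IN_PROGRESS = {"Pending", "Profiling", "Deploying"}
TERMINAL = {"Ready", "Failed"}


def wait_until_terminal(get_state, poll_seconds=30.0, max_polls=1000, sleep=time.sleep):
    """Poll get_state() until it returns Ready or Failed; return the distinct states seen."""
    seen = []
    for _ in range(max_polls):
        state = get_state()
        if state not in IN_PROGRESS | TERMINAL:
            raise ValueError(f"unexpected DGDR state: {state!r}")
        if not seen or seen[-1] != state:
            seen.append(state)
        if state in TERMINAL:
            return seen
        sleep(poll_seconds)
    raise TimeoutError("DGDR did not reach a terminal state")


# Simulated status sequence; a real get_state would query the cluster.
fake_states = iter(["Pending", "Profiling", "Profiling", "Deploying", "Ready"])
history = wait_until_terminal(lambda: next(fake_states), sleep=lambda _seconds: None)
print(history)
```

Because online profiling can take hours, a generous `max_polls` or a longer `poll_seconds` is usually appropriate for that mode.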
**Step 4: Access Your Deployment**
```bash
# Find the frontend service
kubectl get svc -n $NAMESPACE | grep frontend
# Port-forward to access locally
kubectl port-forward svc/<deployment>-frontend 8000:8000 -n $NAMESPACE
# Test the endpoint
curl http://localhost:8000/v1/models
```
> [!NOTE]
> DGDRs are **immutable**. To update SLAs or configuration, delete the existing DGDR and create a new one.
### Direct Script Execution
For advanced use cases or local development:
```bash
python -m benchmarks.profiler.profile_sla \
--backend vllm \
--config path/to/disagg.yaml \
--model meta-llama/Llama-3-8B \
--ttft 200 --itl 15 \
--isl 3000 --osl 150 \
--min-num-gpus 1 \
--max-num-gpus 8
```
## Profiling Method
The profiler follows a 5-step process:
1. **Hardware Setup**: Uses defaults or user-specified hardware configuration. Optionally, cluster-scoped operators can enable automatic GPU discovery to detect specifications from cluster nodes.
2. **Identify Sweep Ranges**: Automatically determine the minimum and maximum number of GPUs per engine. The minimum is determined by the model size and GPU VRAM. The maximum is set to one node for dense models and four nodes for MoE models.
3. **Parallelization Mapping Sweep**: Test performance of engines with different parallelization mappings using the input ISL and OSL.
- For dense models, test different TP sizes for both prefill and decode.
- For MoE models (SGLang), evaluate both TEP and DEP as candidates for prefill and decode.
- **Prefill**:
- TP/TEP: Measure TTFT with batch size = 1 (assuming ISL is long enough to saturate compute) without KV reuse.
- DEP: Attention uses data parallelism. Send a single burst with total concurrency `attention_dp_size × attn_dp_num_req_ratio` (`attn_dp_num_req_ratio` defaults to 4), then compute the reported TTFT as `time_to_first_token.max / attn_dp_num_req_ratio` from the AIPerf summary of that burst.
![Prefill Performance](/assets/img/h100-prefill-performance.png)
- **Decode**: Measure the ITL under different numbers of in-flight requests, from 1 to the maximum the KV cache can hold. To measure ITL without being affected by piggy-backed prefill requests, the script enables KV-reuse and warms up the engine by issuing the same prompts before measuring.
![Decode Performance](/assets/img/h100-decode-performance.png)
4. **Recommendation**: Select the parallelization mapping for prefill and for decode that achieves the highest per-GPU throughput while meeting the TTFT and ITL SLA targets.
5. **In-Depth Profiling on the Recommended P/D Engine**: Interpolate TTFT with ISL and ITL with active KV cache and decode context length for more accurate performance estimation.
![ITL Interpolation](/assets/img/pd-interpolation.png)
- **Prefill**: Measures TTFT and throughput per GPU across different input lengths with batch size=1.
- **Decode**: Measures ITL and throughput per GPU under various KV cache loads and decode context lengths.
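Two pieces of arithmetic in the process above are easy to illustrate: the DEP prefill burst from step 3 and the recommendation rule from step 4. The sketch below is a simplified model under stated assumptions: the `Measurement` type, both helper functions, and all numbers are hypothetical, and the real profiler works from AIPerf results rather than hand-written lists.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    mapping: str               # e.g. "TP4"
    latency_ms: float          # TTFT for prefill, ITL for decode
    throughput_per_gpu: float  # tokens/s/GPU


def dep_burst(attention_dp_size, ttft_max_ms, attn_dp_num_req_ratio=4):
    """Return (burst concurrency, reported TTFT in ms) for a DEP prefill burst."""
    concurrency = attention_dp_size * attn_dp_num_req_ratio
    return concurrency, ttft_max_ms / attn_dp_num_req_ratio


def recommend(measurements, sla_ms):
    """Pick the highest per-GPU throughput among mappings that meet the SLA."""
    feasible = [m for m in measurements if m.latency_ms <= sla_ms]
    if not feasible:
        return None  # no mapping meets the SLA
    return max(feasible, key=lambda m: m.throughput_per_gpu)


# Illustrative numbers only.
prefill = [
    Measurement("TP1", 310.0, 18000.0),  # misses a 200 ms TTFT target
    Measurement("TP2", 160.0, 17000.0),
    Measurement("TP4", 90.0, 15500.0),
]
best = recommend(prefill, sla_ms=200.0)
print(best.mapping)
print(dep_burst(attention_dp_size=8, ttft_max_ms=800.0))
```

In this example TP1 has the best per-GPU throughput but violates the SLA, so the rule falls back to TP2, the most efficient mapping that still meets the target.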
### AIPerf on Real Engines
Profiles your model by creating real test deployments in Kubernetes and measuring their performance.
- **Duration**: 2-4 hours
- **Accuracy**: Highest (real measurements)
- **GPU Requirements**: Full access to test different parallelization mappings
- **Backends**: vLLM, SGLang, TensorRT-LLM
```yaml
profilingConfig:
config:
sweep:
useAiConfigurator: false # Default
```
### AI Configurator Simulation
Uses performance simulation to rapidly estimate optimal configurations without running real deployments.
- **Duration**: 20-30 seconds
- **Accuracy**: Estimated (may have errors for unusual configurations)
- **GPU Requirements**: None
- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon)
```yaml
profilingConfig:
config:
sweep:
useAiConfigurator: true
aicSystem: h200_sxm
aicHfId: Qwen/Qwen3-32B
aicBackendVersion: "0.20.0" # TRT-LLM version simulated by AIC
```
> [!NOTE]
> `aicBackendVersion` specifies the TensorRT-LLM version that AI Configurator simulates. See the [AI Configurator supported features](https://github.com/ai-dynamo/aiconfigurator#supported-features) for available versions.
**Currently supports:**
- **Backends**: TensorRT-LLM (versions 0.20.0, 1.0.0rc3, 1.0.0rc6)
- **Systems**: H100 SXM, H200 SXM, B200 SXM, GB200 SXM, A100 SXM
- **Models**: Wide range including GPT, Llama, Mixtral, DeepSeek, Qwen, and more
See [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features) for the full list.
### Automatic GPU Discovery
Cluster-scoped operators can optionally enable automatic GPU discovery:
```yaml
spec:
  enableGpuDiscovery: true
```
This is only available with cluster-scoped operators (`namespaceRestriction.enabled=false`) as it requires cluster-wide node access permissions.
## Configuration
### DGDR Configuration Structure
All profiler configuration goes under `spec.profilingConfig.config`:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeploymentRequest
metadata:
  name: my-deployment
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
  profilingConfig:
    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
    configMapRef: # Optional: base DGD config
      name: my-config
      key: disagg.yaml
    config:
      sla: { ... }
      hardware: { ... }
      sweep: { ... }
      planner: { ... }
  deploymentOverrides:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
```
### SLA Configuration (Required)
```yaml
sla:
  isl: 3000 # Average input sequence length (tokens)
  osl: 150 # Average output sequence length (tokens)
  ttft: 200.0 # Target Time To First Token (milliseconds)
  itl: 20.0 # Target Inter-Token Latency (milliseconds)
```
- **ISL/OSL**: Based on your expected traffic patterns
- **TTFT**: First token latency target (lower = more GPUs needed, affects prefill engine)
- **ITL**: Token generation latency target (lower = more GPUs needed, affects decode engine)
- **Trade-offs**: Tighter SLAs require more GPU resources
### Hardware Configuration (Optional)
```yaml
hardware:
  minNumGpusPerEngine: 2 # Auto-determined from model size and VRAM if not provided
  maxNumGpusPerEngine: 8 # Maximum GPUs to test
  numGpusPerNode: 8 # GPUs per node (for multi-node MoE)
  gpuType: h200_sxm # GPU type hint (informational, auto-detected)
```
- **minNumGpusPerEngine**: Skip small TP sizes if your model is large
- **maxNumGpusPerEngine**: Limit search space or work around constraints (e.g., [AIC attention heads](#ai-configurator-attention-head-constraint-error))
- **numGpusPerNode**: Determine the upper bound of GPUs per node for dense models and configure Grove for multi-node MoE engines
- **gpuType**: Informational only, auto-detected by the controller. For AI Configurator, use `aicSystem` in the [sweep configuration](#ai-configurator-configuration) instead
> [!TIP]
> If you don't specify hardware constraints, the controller auto-detects based on your model size and available cluster resources.
### Sweep Configuration (Optional)
```yaml
sweep:
  useAiConfigurator: false # Use real profiling (default)
  prefillInterpolationGranularity: 16 # Samples for prefill TTFT curve
  decodeInterpolationGranularity: 6 # Samples for decode ITL curve
```
- **useAiConfigurator**: Set to `true` for 20-30 second profiling (TensorRT-LLM only)
- **prefillInterpolationGranularity**: Samples for prefill TTFT curve (lower = faster but less accurate)
- **decodeInterpolationGranularity**: Samples for decode ITL curve. Since ITL interpolation is 3D and takes longer, we default to fewer samples. Increasing this value may quadratically increase profiling time.
### AI Configurator Configuration
Required if `useAiConfigurator: true`:
```yaml
sweep:
  useAiConfigurator: true
  aicSystem: h200_sxm # h100_sxm, h200_sxm, b200_sxm, gb200_sxm, a100_sxm
  aicHfId: Qwen/Qwen3-32B # HuggingFace model ID
  aicBackendVersion: "0.20.0" # TensorRT-LLM version
```
### Planner Configuration (Optional)
Pass arguments to the SLA planner:
```yaml
planner:
  planner_min_endpoint: 2 # Minimum endpoints to maintain
  planner_adjustment_interval: 60 # Adjustment interval (seconds)
  planner_load_predictor: linear # Load prediction method
```
> [!NOTE]
> Planner arguments use `planner_` prefix. See [SLA Planner documentation](../planner/planner-guide.md) for full list.
### Model Cache PVC (Advanced)
For large models, use a pre-populated PVC containing model weights instead of downloading from HuggingFace:
```yaml
deployment:
  modelCache:
    pvcName: "model-cache"
    pvcPath: "hub/models--deepseek-ai--DeepSeek-R1"
    mountPath: "/opt/model-cache"
```
Requirements:
- The PVC must exist in the same namespace as the DGDR
- The model weights must be accessible at `{mountPath}/{pvcPath}`
### Engine Configuration (Auto-configured)
The controller automatically injects these from high-level fields:
```yaml
# You specify:
spec:
  model: "Qwen/Qwen3-0.6B"
  backend: vllm
# Controller auto-injects:
profilingConfig:
  config:
    deployment:
      model: "Qwen/Qwen3-0.6B"
    engine:
      backend: vllm
      config: /path/to/configmap
```
You should **not** manually set `deployment.model` or `engine.backend` in `profilingConfig.config`.
### Using Existing DGD Configs (ConfigMap)
Reference an existing DGD config via ConfigMap:
```bash
kubectl create configmap my-config \
--from-file=disagg.yaml=/path/to/your/disagg.yaml \
--namespace $NAMESPACE \
--dry-run=client -o yaml | kubectl apply -f -
```
```yaml
profilingConfig:
  configMapRef:
    name: my-config
    key: disagg.yaml
```
The profiler uses the DGD config as a **base template**, then optimizes it based on your SLA targets.
### CLI Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| `--backend` | string | - | Inference backend: vllm, sglang, trtllm |
| `--config` | string | - | Path to DGD YAML config file |
| `--model` | string | - | HuggingFace model ID |
| `--ttft` | float | - | Target TTFT in milliseconds |
| `--itl` | float | - | Target ITL in milliseconds |
| `--isl` | int | - | Average input sequence length |
| `--osl` | int | - | Average output sequence length |
| `--min-num-gpus` | int | auto | Minimum GPUs per engine |
| `--max-num-gpus` | int | 8 | Maximum GPUs per engine |
| `--use-ai-configurator` | flag | false | Use offline AI Configurator |
| `--pick-with-webui` | flag | false | Launch interactive WebUI |
| `--webui-port` | int | 8000 | Port for WebUI |
> [!NOTE]
> CLI arguments map to DGDR config fields: `--min-num-gpus` = `hardware.minNumGpusPerEngine`, `--max-num-gpus` = `hardware.maxNumGpusPerEngine`, `--use-ai-configurator` = `sweep.useAiConfigurator`. See [DGDR Configuration Structure](#dgdr-configuration-structure) for all field mappings.
## Integration
### With SLA Planner
The Profiler generates interpolation data that the SLA Planner uses for autoscaling decisions.
**Prefill Interpolation** (`selected_prefill_interpolation/raw_data.npz`):
- `prefill_isl`: 1D array of input sequence lengths tested
- `prefill_ttft`: 1D array of TTFTs (ms) at each ISL
- `prefill_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each ISL
**Decode Interpolation** (`selected_decode_interpolation/raw_data.npz`):
- `max_kv_tokens`: Total KV tokens capacity in decode engine
- `x_kv_usage`: 1D array of active KV usage percentages [0, 1]
- `y_context_length`: 1D array of average context lengths tested
- `z_itl`: 1D array of ITLs (ms) at each (KV usage, context length) point
- `z_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each point
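
As a rough sketch of how this data can be consumed, the snippet below reads the prefill arrays with NumPy and linearly interpolates TTFT for an arbitrary ISL. The array names come from the list above; the temporary file, its values, and the `estimate_ttft` helper are illustrative assumptions, and the SLA Planner's own interpolation may differ.

```python
import os
import tempfile

import numpy as np

# Stand-in for the profiler output so the sketch runs anywhere; values are made up.
out_dir = tempfile.mkdtemp()
path = os.path.join(out_dir, "raw_data.npz")
np.savez(
    path,
    prefill_isl=np.array([1000, 2000, 4000, 8000]),
    prefill_ttft=np.array([40.0, 85.0, 190.0, 420.0]),
    prefill_thpt_per_gpu=np.array([25000.0, 23500.0, 21000.0, 19000.0]),
)

# Load the arrays by the key names documented above.
with np.load(path) as data:
    isl = data["prefill_isl"]
    ttft = data["prefill_ttft"]


def estimate_ttft(target_isl: float) -> float:
    """Linearly interpolate TTFT (ms) between the profiled ISL points."""
    return float(np.interp(target_isl, isl, ttft))


print(estimate_ttft(3000))
```

In a real run, `path` would point at `selected_prefill_interpolation/raw_data.npz` in the profiling output instead of a temporary file.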
### With Dynamo Operator
When using DGDR, the Dynamo Operator:
1. Creates profiling jobs automatically
2. Stores profiling data in ConfigMaps (`planner-profile-data`)
3. Generates optimized DGD configurations
4. Deploys the DGD with SLA Planner integration
The generated DGD is tracked via labels:
```yaml
metadata:
  labels:
    dgdr.nvidia.com/name: my-deployment
    dgdr.nvidia.com/namespace: your-namespace
```
### With Observability
Monitor profiling jobs:
```bash
kubectl logs -f job/profile-<dgdr-name> -n $NAMESPACE
kubectl describe dgdr <name> -n $NAMESPACE
```
## Advanced Topics
### Manual Deployment Control
Disable auto-deployment to review the generated DGD before applying:
```yaml
spec:
  autoApply: false
```
Then manually extract and apply:
```bash
# Extract generated DGD from DGDR status
kubectl get dgdr my-deployment -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' | kubectl apply -f -
# Or save to file for review
kubectl get dgdr my-deployment -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
```
### Mocker Deployment
Deploy a mocker deployment that simulates engines without GPUs:
```yaml
spec:
  model: <model-name>
  backend: trtllm
  useMocker: true # Deploy mocker instead of real backend
  autoApply: true
```
Profiling still runs against the real backend to collect performance data. The mocker uses this data to simulate realistic timing behavior. Useful for large-scale experiments, testing Planner behavior, and validating configurations.
### Accessing Profiling Artifacts
By default, profiling data is stored in ConfigMaps. For detailed artifacts (plots, logs, raw data), attach a PVC:
```yaml
profilingConfig:
  outputPVC: "dynamo-pvc"
```
**ConfigMaps (always created):**
- `dgdr-output-<name>`: Generated DGD configuration
- `planner-profile-data`: Profiling data for Planner (JSON)
**PVC artifacts (optional):**
- Performance plots (PNGs)
- DGD configurations for each profiled deployment
- AIPerf profiling artifacts
- Raw profiling data (`.npz` files)
- Profiler logs
Access PVC results:
```bash
kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
kubectl delete pod pvc-access-pod -n $NAMESPACE
```
### Output Performance Plots
The profiler generates plots to visualize performance data:
**Parallelization Mapping Sweep Plots:**
- `prefill_performance.png`: TTFT vs Parallelization Mapping size
- `decode_performance.png`: ITL vs Parallelization Mapping size and in-flight requests
**In-Depth Profiling Plots:**
- `selected_prefill_interpolation/prefill_ttft_interpolation.png`: TTFT vs ISL
- `selected_prefill_interpolation/prefill_throughput_interpolation.png`: Throughput vs ISL
- `selected_decode_interpolation/decode_itl_interplation.png`: ITL vs KV usage and context length
- `selected_decode_interpolation/decode_throughput_interpolation.png`: Throughput vs KV usage and context length
## Runtime Profiling (SGLang)
SGLang workers expose profiling endpoints for runtime performance analysis:
```bash
# Start profiling
curl -X POST http://localhost:9090/engine/start_profile \
-H "Content-Type: application/json" \
-d '{"output_dir": "/tmp/profiler_output"}'
# Run inference requests...
# Stop profiling
curl -X POST http://localhost:9090/engine/stop_profile
```
View traces using Chrome's `chrome://tracing`, [Perfetto UI](https://ui.perfetto.dev/), or TensorBoard.
## Troubleshooting
### Profiling Takes Too Long
**Solution 1**: Use AI Configurator for rapid profiling (TensorRT-LLM only):
```yaml
sweep:
  useAiConfigurator: true
```
**Solution 2**: Reduce search space:
```yaml
hardware:
  minNumGpusPerEngine: 4 # Skip TP1, TP2
  maxNumGpusPerEngine: 8 # Don't test beyond TP8
```
### SLA Cannot Be Met
**Symptoms**: Profiler reports no configuration meets targets
**Solutions:**
1. Relax SLA targets (increase TTFT/ITL)
2. Add more GPU resources
3. Try a different backend
4. Use a smaller model
### AI Configurator: Attention Head Constraint Error
**Symptoms**: Profiling fails with error:
```text
AssertionError: num_heads <N> should be divisible by tp_size <M> and the division result should be >= 4
```
**Cause**: AI Configurator requires **≥4 attention heads per GPU**. Small models with few heads cannot use high TP sizes.
**Affected Models:**
- **Qwen3-0.6B** (16 heads): Max TP = 4
- **GPT-2** (12 heads): Max TP = 3
- Most models **\<1B parameters**: May hit this constraint
**Solution**: Limit `maxNumGpusPerEngine`:
```yaml
hardware:
  maxNumGpusPerEngine: 4 # For Qwen3-0.6B (16 heads / 4 = max TP of 4)
```
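**Calculate Max TP**: `max_tp = num_attention_heads / 4` when the head count is a multiple of 4. In general, the maximum is the largest TP size that divides the head count evenly and leaves at least 4 heads per GPU.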
> [!NOTE]
> This is an AI Configurator limitation. Online profiling doesn't have this constraint.
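
The constraint in the assertion message can be checked ahead of time with a few lines of Python. This is a minimal sketch derived only from the error text above; the helper name `max_tp_for_aic` is hypothetical and not part of AI Configurator.

```python
def max_tp_for_aic(num_attention_heads: int, min_heads_per_gpu: int = 4) -> int:
    """Largest TP size that divides the head count and keeps >= min_heads_per_gpu heads per GPU.

    Returns 0 if no TP size satisfies the constraint.
    """
    best = 0
    for tp in range(1, num_attention_heads + 1):
        divisible = num_attention_heads % tp == 0
        enough_heads = num_attention_heads // tp >= min_heads_per_gpu
        if divisible and enough_heads:
            best = tp
    return best


print(max_tp_for_aic(16))  # Qwen3-0.6B head count from the list above
print(max_tp_for_aic(12))  # GPT-2 head count from the list above
```

The result is an upper bound for `maxNumGpusPerEngine`. For head counts that are not a multiple of 4, the bound can be lower than `num_attention_heads / 4` because of the divisibility requirement.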
### Image Pull Errors
**Symptoms**: `ErrImagePull` or `ImagePullBackOff`
**Solution**: Ensure image pull secrets are configured:
```bash
kubectl create secret docker-registry nvcr-imagepullsecret \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password=<NGC_API_KEY> \
--namespace <your-namespace>
```
### Out of Memory During Profiling
**Symptoms**: OOM errors in profiling jobs
**Solutions:**
1. Reduce `gpu_memory_utilization` in engine config
2. Reduce `--max-context-length`
3. Skip larger TP configurations
4. Use fewer GPUs per test
### Unsupported Parallelization Mapping in Backend
**Symptoms**: Startup/runtime error in the backend (e.g., prime number of attention heads constraining TP to 1, or backend not supporting different TP sizes for prefill and decode).
**Solutions:**
1. Contact the backend maintainers to add support, then bump the backend version in Dynamo
2. Constrain the max and min number of GPUs per engine to the supported range
## See Also
- [Profiler Examples](profiler-examples.md) - Complete DGDR YAML examples
- [SLA Planner Guide](../planner/planner-guide.md) - End-to-end deployment workflow
- [SLA Planner Architecture](../planner/planner-guide.md) - How the Planner uses profiling data
- [DGDR API Reference](../../kubernetes/api-reference.md) - DGDR specification
- [Profiler Arguments Reference](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/profiler/utils/profiler_argparse.py) - Full CLI reference
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Router
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
## Quick Start
### Python / CLI Deployment
To launch the Dynamo frontend with the KV Router:
```bash
python -m dynamo.frontend --router-mode kv --http-port 8000
```
This command:
- Launches the Dynamo frontend service with KV routing enabled
- Exposes the service on port 8000 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint
Backend workers register themselves using the `register_llm` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap.
#### CLI Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--router-mode kv` | `round-robin` | Enable KV cache-aware routing |
| `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
| `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
| `--kv-events` / `--no-kv-events` | `--kv-events` | Enable/disable real-time KV event tracking |
| `--kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) |
For all available options: `python -m dynamo.frontend --help`
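
To build intuition for `--router-temperature`, the sketch below shows one common way a temperature knob can be realized: softmax sampling over negated worker costs. It is an illustrative assumption rather than a description of the router's internals; `pick_worker` and the cost values are hypothetical.

```python
import math
import random


def pick_worker(costs: dict[int, float], temperature: float, rng: random.Random) -> int:
    """Pick a worker id from {worker_id: cost}; lower cost is better."""
    if temperature <= 0.0:
        # Deterministic: always choose the cheapest worker.
        return min(costs, key=costs.get)
    # Softmax over negated costs, shifted by the minimum for numerical stability.
    lowest = min(costs.values())
    weights = [math.exp(-(cost - lowest) / temperature) for cost in costs.values()]
    return rng.choices(list(costs), weights=weights, k=1)[0]


costs = {1: 5.0, 2: 3.0, 3: 9.0}  # made-up per-worker costs
print(pick_worker(costs, temperature=0.0, rng=random.Random(0)))
print(pick_worker(costs, temperature=2.0, rng=random.Random(0)))
```

At temperature `0.0` the cheapest worker always wins. Higher temperatures flatten the distribution, so more expensive workers are chosen more often.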
### Kubernetes Deployment
To enable the KV Router in Kubernetes, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      dynamoNamespace: my-namespace
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv # Enable KV Smart Router
```
**Key Points:**
- Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only
- Workers automatically report KV cache events to the router
- No worker-side configuration changes needed
#### Environment Variables
All CLI arguments can be configured via environment variables using the `DYN_` prefix:
| CLI Argument | Environment Variable | Default |
|--------------|---------------------|---------|
| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round-robin` |
| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` |
| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific |
| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` |
| `--kv-overlap-score-weight` | `DYN_KV_OVERLAP_SCORE_WEIGHT` | `1.0` |
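
The naming pattern in this table can be expressed as a small helper. This sketch only reproduces the convention visible in the rows above; `flag_to_env` is a hypothetical name, and flags not listed in the table may not follow the same pattern.

```python
def flag_to_env(flag: str) -> str:
    """Map a frontend CLI flag to the DYN_-prefixed variable name used in the table."""
    name = flag.lstrip("-")
    # Negated boolean flags share the variable of their positive form,
    # e.g. --no-kv-events corresponds to DYN_KV_EVENTS=false.
    if name.startswith("no-"):
        name = name[len("no-"):]
    return "DYN_" + name.replace("-", "_").upper()


for flag in ("--router-mode", "--router-temperature", "--no-kv-events"):
    print(flag, "->", flag_to_env(flag))
```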
For complete K8s examples and advanced configuration, see [K8s Examples](router-examples.md#k8s-examples).
For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
For more configuration options and tuning guidelines, see the [Router Guide](router-guide.md).
## Prerequisites and Limitations
**Requirements:**
- **Dynamic endpoints only**: Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../../development/backend-guide.md)). Your backend handler then receives pre-tokenized requests with `token_ids` instead of raw text.
- You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead)
**Multimodal Support:**
- **vLLM and TRT-LLM**: Multimodal routing supported for images via multimodal hashes
- **SGLang**: Image routing not yet supported
- **Other modalities** (audio, video, etc.): Not yet supported
**Limitations:**
- Static endpoints are not supported: the KV router requires dynamic model discovery via etcd to track worker instances and their KV cache states
For basic model registration without KV routing, use `--router-mode round-robin` or `--router-mode random` with both static and dynamic endpoints.
## Next Steps
- **[Router Guide](router-guide.md)**: Deep dive into KV cache routing, configuration, disaggregated serving, and tuning
- **[Router Examples](router-examples.md)**: Python API usage, K8s examples, and custom routing patterns
- **[Router Design](../../design-docs/router-design.md)**: Architecture details, algorithms, and event transport modes
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Router Examples
For quick start instructions, see the [Router README](README.md). This document provides further examples for using the Dynamo Router, including Python API usage, Kubernetes deployments, and custom routing patterns.
## Table of Contents
- [Using KvPushRouter Python API](#using-kvpushrouter-python-api)
- [K8s Examples](#k8s-examples)
- [Routing Patterns](#routing-patterns)
- [Custom Routing Example: Minimizing TTFT](#custom-routing-example-minimizing-ttft)
- [KV Event Publishing for Custom Engines](#kv-event-publishing-for-custom-engines)
## Using KvPushRouter Python API
Instead of launching the KV Router via command line, you can create a `KvPushRouter` object directly in Python. This allows per-request routing configuration overrides.
> [!WARNING]
> **Multiple Routers in Same Process**: If you need to run multiple `KvPushRouter` instances for fault tolerance or load distribution, you must launch them in **separate processes** (e.g., using `python -m dynamo.frontend` with different ports). Creating multiple `KvPushRouter` objects in the same Python process is not supported because they share the same cancellation token from the component's primary lease, so dropping one router cancels all routers in that process. For in-process routing, use a single `KvPushRouter` instance.
### Methods
The `KvPushRouter` provides the following methods:
- **`generate(token_ids, model, ...)`**: Route and execute a request, returning an async stream of responses. Automatically handles worker selection, state tracking, and lifecycle management.
- **`best_worker(token_ids, router_config_override=None, request_id=None)`**: Query which worker would be selected for given tokens. Returns `(worker_id, dp_rank, overlap_blocks)`.
  - Without `request_id`: Query-only, doesn't update router state
  - With `request_id`: Updates router state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking
- **`get_potential_loads(token_ids)`**: Get detailed load information for all workers, including potential prefill tokens and active decode blocks. Returns a list of load dictionaries.
- **`mark_prefill_complete(request_id)`**: Signal that a request has completed its prefill phase. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`.
- **`free(request_id)`**: Signal that a request has completed and its resources should be released. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`.
- **`dump_events()`**: Dump all KV cache events from the router's indexer as a JSON string. Useful for debugging and analysis.
### Setup
First, launch your backend engines:
```bash
python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf
```
### Example Script
```python
import asyncio
from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig


async def main():
    # Get runtime and create endpoint
    runtime = DistributedRuntime.detached()
    namespace = runtime.namespace("dynamo")
    component = namespace.component("backend")
    endpoint = component.endpoint("generate")

    # Create KV router
    kv_router_config = KvRouterConfig()
    router = KvPushRouter(
        endpoint=endpoint,
        block_size=16,
        kv_router_config=kv_router_config
    )

    # Your input tokens
    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    # Generate with per-request routing override
    stream = await router.generate(
        token_ids=token_ids,
        model="meta-llama/Llama-2-7b-hf",
        stop_conditions={
            "max_tokens": 20,  # Generate exactly 20 tokens
            "ignore_eos": True,  # Don't stop at EOS token
        },
        sampling_options={
            "temperature": 0.7,
            "top_p": 0.9,
        },
        router_config_override={
            "overlap_score_weight": 2.0,  # Prioritize cache hits for this request
            "router_temperature": 0.5,  # Add routing randomness
        }
    )

    # Collect generated tokens
    generated_tokens = []
    async for response in stream:
        if isinstance(response, dict) and "token_ids" in response:
            generated_tokens.extend(response["token_ids"])
    print(f"Generated {len(generated_tokens)} tokens: {generated_tokens}")


if __name__ == "__main__":
    asyncio.run(main())
```
## K8s Examples
For basic Kubernetes deployment with the KV Router, see the [Kubernetes Deployment section](README.md#kubernetes-deployment) in the Quick Start guide.
### Complete K8s Examples
- [TRT-LLM aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/agg_router.yaml)
- [vLLM aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg_router.yaml)
- [SGLang aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/agg_router.yaml)
- [Distributed inference tutorial](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/kubernetes/Distributed_Inference/agg_router.yaml)
**For A/B Testing and Advanced K8s Setup:**
See the comprehensive [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
### Example with Advanced Configuration
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      dynamoNamespace: my-namespace
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv
        - name: DYN_ROUTER_TEMPERATURE
          value: "0.5" # Add some randomness to prevent worker saturation
        - name: DYN_KV_OVERLAP_SCORE_WEIGHT
          value: "1.5" # Prioritize TTFT over ITL
        - name: DYN_KV_CACHE_BLOCK_SIZE
          value: "16"
      extraPodSpec:
        mainContainer:
          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
```
### Alternative: Using Command Args in K8s
You can also pass CLI arguments directly in the container command:
```yaml
extraPodSpec:
  mainContainer:
    image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
    command:
      - /bin/sh
      - -c
    args:
      - "python3 -m dynamo.frontend --router-mode kv --router-temperature 0.5 --http-port 8000"
```
**Recommendation:** Use environment variables for easier configuration management and consistency with Dynamo's K8s patterns.
## Routing Patterns
The `KvPushRouter` supports multiple usage patterns depending on your control requirements:
### 1. Automatic Routing (Recommended)
Call `generate()` directly and let the router handle everything:
```python
stream = await router.generate(token_ids=tokens, model="model-name")
```
- **Best for**: Most use cases
- **Router automatically**: Selects best worker, updates state, routes request, tracks lifecycle
### 2. Manual State Management (Advanced)
Use `best_worker(request_id=...)` to select and track, then manage the request yourself:
```python
worker_id, _dp_rank, overlap = await router.best_worker(tokens, request_id="req-123")
response = await client.generate(tokens, request_id="req-123")
# await anext(response) # Get first token
await router.mark_prefill_complete("req-123") # After first token
# async for _ in response: # Continue generating
# ...
await router.free("req-123") # After completion
```
- **Best for**: Custom request handling with router state tracking
- **Requires**: Calling `mark_prefill_complete()` and `free()` at correct lifecycle points
- **Caution**: Incorrect lifecycle management degrades load balancing accuracy
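
One way to keep the lifecycle calls correct is to wrap the request in `try`/`finally`, so `free()` runs even if generation fails. The sketch below uses a hypothetical `StubRouter` and `fake_engine_stream` in place of a real `KvPushRouter` and engine client so that it runs standalone; only the method names and the return shape of `best_worker()` are taken from this document.

```python
import asyncio


class StubRouter:
    """Hypothetical stand-in that records calls; use a real KvPushRouter in practice."""

    def __init__(self):
        self.calls = []

    async def best_worker(self, token_ids, request_id=None):
        self.calls.append(("best_worker", request_id))
        return 0, 0, 0  # (worker_id, dp_rank, overlap_blocks)

    async def mark_prefill_complete(self, request_id):
        self.calls.append(("mark_prefill_complete", request_id))

    async def free(self, request_id):
        self.calls.append(("free", request_id))


async def fake_engine_stream(token_ids):
    """Hypothetical engine call that yields one generated token per step."""
    for token in (101, 102, 103):
        yield token


async def tracked_request(router, token_ids, request_id):
    # worker_id would be used to address the selected worker in a real client.
    worker_id, _dp_rank, _overlap = await router.best_worker(token_ids, request_id=request_id)
    generated = []
    prefill_marked = False
    try:
        async for token in fake_engine_stream(token_ids):
            if not prefill_marked:
                # The first token marks the end of the prefill phase.
                await router.mark_prefill_complete(request_id)
                prefill_marked = True
            generated.append(token)
    finally:
        # Always release the request, even if the stream raises or is cancelled.
        await router.free(request_id)
    return generated


router = StubRouter()
print(asyncio.run(tracked_request(router, [1, 2, 3], "req-123")))
print(router.calls)
```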
### 3. Hierarchical Router Probing
Query without state updates, then route through a chosen router:
```python
# Probe multiple routers without updating state
worker_id_1, _dp_rank_1, overlap_1 = await router_1.best_worker(tokens)  # No request_id
worker_id_2, _dp_rank_2, overlap_2 = await router_2.best_worker(tokens)
# Pick the router (and its selected worker) with the larger cache overlap
if overlap_1 > overlap_2:
    chosen_router, worker_id = router_1, worker_id_1
else:
    chosen_router, worker_id = router_2, worker_id_2
stream = await chosen_router.generate(tokens, model="model-name", worker_id=worker_id)
```
- **Best for**: Multi-tier deployments (e.g., Envoy Gateway routing to multiple router groups)
- **Advantage**: Query multiple routers before committing to one
### 4. Custom Load-Based Routing
Use `get_potential_loads()` to implement custom routing logic:
```python
loads = await router.get_potential_loads(tokens)
# Apply custom logic (e.g., weighted scoring, constraints)
best_worker = min(loads, key=lambda x: custom_cost_fn(x))
stream = await router.generate(tokens, model="model-name", worker_id=best_worker['worker_id'])
```
- **Best for**: Custom optimization strategies beyond the built-in cost function
- **Advantage**: Full control over worker selection logic
- **See also**: [Custom Routing Example: Minimizing TTFT](#custom-routing-example-minimizing-ttft)
All patterns support `router_config_override` to adjust routing behavior per-request without recreating the router.
## Custom Routing Example: Minimizing TTFT
Here's an example of using `get_potential_loads()` to implement custom routing that minimizes Time To First Token (TTFT) by selecting the worker with the least prefill work:
```python
import asyncio
from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig


async def minimize_ttft_routing():
    # Setup router
    runtime = DistributedRuntime.detached()
    namespace = runtime.namespace("dynamo")
    component = namespace.component("backend")
    endpoint = component.endpoint("generate")
    router = KvPushRouter(
        endpoint=endpoint,
        block_size=16,
        kv_router_config=KvRouterConfig()
    )

    # Your input tokens
    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

    # Get potential loads for all workers
    potential_loads = await router.get_potential_loads(token_ids)

    # Find worker with minimum prefill tokens (best for TTFT)
    best_worker = min(potential_loads, key=lambda x: x['potential_prefill_tokens'])
    print(f"Worker loads: {potential_loads}")
    print(f"Selected worker {best_worker['worker_id']} with {best_worker['potential_prefill_tokens']} prefill tokens")

    # Route directly to the selected worker
    stream = await router.generate(
        token_ids=token_ids,
        model="meta-llama/Llama-2-7b-hf",
        worker_id=best_worker['worker_id'],  # Force routing to optimal worker
        stop_conditions={"max_tokens": 20}
    )

    # Process response
    async for response in stream:
        if isinstance(response, dict) and "token_ids" in response:
            print(f"Generated tokens: {response['token_ids']}")


if __name__ == "__main__":
    asyncio.run(minimize_ttft_routing())
```
This approach gives you complete control over routing decisions, allowing you to optimize for different metrics based on your specific requirements. For example:
- **Minimize TTFT**: Select worker with lowest `potential_prefill_tokens`
- **Maximize cache reuse**: Use `best_worker()`, which considers both prefill and decode loads
- **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together
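
The load-balancing option can be sketched as a cost function over the load dictionaries. The field names match those used in this section, but the weighting scheme, the `balanced_cost` helper, and the sample loads are illustrative assumptions, not the router's built-in cost function.

```python
def balanced_cost(load: dict, block_size: int = 16, prefill_weight: float = 1.0) -> float:
    """Combine prefill and decode load into one score; lower is better."""
    # Convert prefill tokens to blocks so both terms use the same unit.
    prefill_blocks = load["potential_prefill_tokens"] / block_size
    return prefill_weight * prefill_blocks + load["potential_decode_blocks"]


# Made-up loads shaped like the entries returned by get_potential_loads().
loads = [
    {"worker_id": 0, "potential_prefill_tokens": 64, "potential_decode_blocks": 120},
    {"worker_id": 1, "potential_prefill_tokens": 512, "potential_decode_blocks": 40},
    {"worker_id": 2, "potential_prefill_tokens": 1024, "potential_decode_blocks": 10},
]

best = min(loads, key=balanced_cost)
print(best["worker_id"])
```

Raising `prefill_weight` shifts the choice toward workers with more cached prefix (better TTFT); lowering it favors workers with fewer active decode blocks.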
See [Router Design](../../design-docs/router-design.md) for architecture details and the cost function algorithm.
## KV Event Publishing for Custom Engines
The KV Router relies on real-time events from backend workers to track which KV cache blocks are stored on each worker. When your custom engine allocates or evicts KV cache blocks, it should publish these events so the router can make optimal routing decisions. There are two main publishing pathways:
- **Direct NATS publishing** (`KvEventPublisher`): publishes events directly to NATS and is the simplest approach for custom engines.
- **ZMQ-based publishing**: for engines with ZMQ event output (like vLLM). A ZMQ publisher in the engine emits events, and `ZmqKvEventPublisher` forwards them to NATS.
### Event Types
The KV cache supports three event types:
| Event Type | Description | When to Publish |
|------------|-------------|-----------------|
| `BlockStored` | New blocks added to cache | After KV cache allocation succeeds |
| `BlockRemoved` | Blocks evicted from cache | When blocks are evicted or freed |
| `AllBlocksCleared` | All blocks removed | On cache reset or worker restart |
### Event Structure
Each event contains:
- **`event_id`**: Monotonically increasing identifier per worker
- **`dp_rank`**: Data parallel rank (0 if DP not enabled)
- **`data`**: One of `Stored`, `Removed`, or `Cleared`
For `BlockStored` events:
- **`token_ids`**: List of token IDs for the stored blocks
- **`block_hashes`**: List of **sequence block hashes** from the engine's block manager. These are cumulative hashes that incorporate all tokens from the start of the sequence up to and including the current block (not just the tokens within that block). This enables prefix matching across requests.
- **`num_block_tokens`**: Number of tokens per block (should all equal `kv_block_size`)
- **`parent_hash`**: Hash of the parent block. Required for all blocks except the first block in a sequence (which has no parent).
- **`lora_id`**: LoRA adapter ID (0 if not using LoRA)
For `BlockRemoved` events:
- **`block_hashes`**: List of sequence block hashes being evicted
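
To illustrate what "cumulative" means for sequence block hashes, the sketch below chains each block's hash with its parent's. The hashing scheme (SHA-256 truncated to 64 bits) and the `sequence_block_hashes` helper are assumptions for illustration only; a real engine's block manager defines its own hash function, and those hashes are what should be published.

```python
import hashlib


def sequence_block_hashes(token_ids: list[int], block_size: int) -> list[int]:
    """Chain-hash full blocks so each hash covers the whole prefix up to that block."""
    hashes = []
    parent = b""  # the first block has no parent
    full_len = len(token_ids) - len(token_ids) % block_size  # ignore a partial tail block
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        digest = hashlib.sha256(payload).digest()[:8]
        hashes.append(int.from_bytes(digest, "big"))
        parent = digest  # the next block's hash depends on this one
    return hashes


a = sequence_block_hashes([1, 2, 3, 4, 5, 6, 7, 8], block_size=4)
b = sequence_block_hashes([1, 2, 3, 4, 9, 9, 9, 9], block_size=4)  # same first block
c = sequence_block_hashes([0, 2, 3, 4, 5, 6, 7, 8], block_size=4)  # same second block tokens
print(a[0] == b[0], a[1] == b[1], a[1] == c[1])
```

Sequences `a` and `b` share a prefix, so their first hashes match and the router can detect the overlap. Sequences `a` and `c` have identical tokens in the second block but different prefixes, so their second hashes differ.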
### Option 1: Direct NATS Publishing (Recommended)
The `KvEventPublisher` class publishes events directly to NATS. This is the simplest approach for custom engines.
```mermaid
flowchart LR
subgraph Engine["Custom Engine"]
cache["KV Cache Manager"]
end
subgraph Worker["Dynamo Worker Process"]
pub["KvEventPublisher"]
end
subgraph NATS["NATS"]
subject["kv-events subject"]
end
subgraph Router["KV Router"]
indexer["KvIndexer"]
end
cache -->|"on_blocks_stored()<br/>on_blocks_removed()"| pub
pub -->|"publish to NATS"| subject
subject --> indexer
```
**When to use:**
- Building a custom inference engine from scratch
- Your engine doesn't have a ZMQ-based event system
- You want the simplest integration path
#### Basic Setup
```python
from dynamo.llm import KvEventPublisher


class CustomEnginePublisher:
    def __init__(self, component, worker_id: int, block_size: int, dp_rank: int = 0):
        self.block_size = block_size
        self.event_id = 0
        self.kv_publisher = KvEventPublisher(
            component=component,
            worker_id=worker_id,
            kv_block_size=block_size,
            dp_rank=dp_rank,
            enable_local_indexer=False,
        )

    def on_blocks_stored(self, token_ids: list[int], block_hashes: list[int],
                         lora_id: int = 0, parent_hash: int | None = None):
        """Call after KV cache blocks are allocated."""
        self.event_id += 1
        num_block_tokens = [self.block_size] * len(block_hashes)
        self.kv_publisher.publish_stored(
            event_id=self.event_id,
            token_ids=token_ids,
            num_block_tokens=num_block_tokens,
            block_hashes=block_hashes,
            lora_id=lora_id,
            parent_hash=parent_hash,
        )

    def on_blocks_removed(self, block_hashes: list[int]):
        """Call when KV cache blocks are evicted."""
        self.event_id += 1
        self.kv_publisher.publish_removed(event_id=self.event_id, block_hashes=block_hashes)
```
#### Integration with Your Engine
```python
from dynamo.llm import register_llm


async def main():
    # Register your engine with Dynamo
    component, endpoint = await register_llm(
        model="my-model",
        generator=my_generate_fn,
    )

    # Initialize publisher
    publisher = CustomEnginePublisher(
        component=component,
        worker_id=endpoint.connection_id(),
        block_size=16,  # Match your engine's block size
    )

    # Hook into your engine's cache events
    def on_prefill_complete(request_id, token_ids, blocks):
        block_hashes = [block.hash for block in blocks]
        publisher.on_blocks_stored(token_ids=token_ids, block_hashes=block_hashes)

    def on_cache_eviction(evicted_blocks):
        block_hashes = [block.hash for block in evicted_blocks]
        publisher.on_blocks_removed(block_hashes=block_hashes)
```
### Option 2: ZMQ-based Publishing
For engines that publish events via ZMQ (like vLLM), this option uses two components that work together:
1. **ZMQ Publisher** (in your engine) - Publishes events to a ZMQ socket
2. **ZmqKvEventPublisher** (Dynamo binding) - Subscribes to ZMQ and forwards to NATS
```mermaid
flowchart LR
subgraph Engine["Custom Engine / vLLM"]
cache["KV Cache Manager"]
zmq_pub["ZMQ Publisher<br/>(Pure Python)"]
end
subgraph ZMQ["ZMQ Socket"]
socket["tcp://127.0.0.1:5557"]
end
subgraph Worker["Dynamo Worker Process"]
zmq_sub["ZmqKvEventPublisher<br/>(Rust bindings)"]
end
subgraph NATS["NATS"]
subject["kv-events subject"]
end
subgraph Router["KV Router"]
indexer["KvIndexer"]
end
cache --> zmq_pub
zmq_pub -->|"PUB"| socket
socket -->|"SUB"| zmq_sub
zmq_sub --> subject
subject --> indexer
```
**When to use:**
- Your engine already has a ZMQ-based event system (like vLLM)
- You're integrating with a consolidator (like KVBM)
- You want to decouple event publishing from your engine's main loop
#### Part 1: ZMQ Subscriber (Dynamo Bindings)
If your engine already publishes to ZMQ, use `ZmqKvEventPublisher` to subscribe and forward to NATS:
```python
from dynamo.llm import ZmqKvEventPublisher, ZmqKvEventPublisherConfig

# Configure the ZMQ subscriber
config = ZmqKvEventPublisherConfig(
    worker_id=endpoint.connection_id(),
    kv_block_size=block_size,
    zmq_endpoint="tcp://127.0.0.1:5557",  # Where your engine publishes
    zmq_topic="",  # Subscribe to all topics
    enable_local_indexer=False,
)

# Create publisher - it automatically subscribes to ZMQ and forwards to NATS
kv_publisher = ZmqKvEventPublisher(
    component=component,
    config=config,
)
```
#### Part 2: ZMQ Publisher (Pure Python)
If your engine needs to publish to ZMQ (e.g., for consolidator integration), implement the ZMQ protocol:
```python
import zmq
import msgpack
import time


class ZmqKvEventPublisher:
    """Pure Python ZMQ publisher for KV events (vLLM-compatible format)."""

    def __init__(self, zmq_endpoint: str, kv_block_size: int, topic: str = ""):
        self.kv_block_size = kv_block_size
        self.topic = topic
        self.ctx = zmq.Context()
        self.socket = self.ctx.socket(zmq.PUB)
        self.socket.bind(zmq_endpoint)
        self.sequence = 0
        self.data_parallel_rank = 0

    def _to_signed_i64(self, value: int | None) -> int | None:
        if value is None:
            return None
        return value - 0x10000000000000000 if value > 0x7FFFFFFFFFFFFFFF else value

    def publish_stored(self, event_id: int, token_ids: list[int], num_block_tokens: list[int],
                       block_hashes: list[int], lora_id: int = 0, parent_hash: int | None = None):
        event = {
            "type": "BlockStored",
            "block_hashes": [self._to_signed_i64(h) for h in block_hashes],
            "parent_block_hash": self._to_signed_i64(parent_hash),
            "token_ids": token_ids,
            "block_size": self.kv_block_size,
            "lora_id": lora_id if lora_id != 0 else None,
        }
        self._publish_event(event)

    def publish_removed(self, event_id: int, block_hashes: list[int]):
        event = {"type": "BlockRemoved", "block_hashes": [self._to_signed_i64(h) for h in block_hashes]}
        self._publish_event(event)

    def publish_all_cleared(self):
        self._publish_event({"type": "AllBlocksCleared"})

    def _publish_event(self, event: dict):
        batch = [time.time(), [event], self.data_parallel_rank]
        payload = msgpack.packb(batch, use_bin_type=True)
        sequence_bytes = self.sequence.to_bytes(8, byteorder="big")
        self.sequence += 1
        self.socket.send_multipart([self.topic.encode(), sequence_bytes, payload])

    def shutdown(self):
        self.socket.close()
        self.ctx.term()
```
### ZMQ Wire Format
The ZMQ message format (compatible with vLLM):
| Frame | Description |
|-------|-------------|
| 1 | Topic (empty string for all topics) |
| 2 | Sequence number (8 bytes, big-endian) |
| 3 | Msgpack payload: `[timestamp, [events], dp_rank]` |
Each event in the payload is a dictionary with a `type` field (`BlockStored`, `BlockRemoved`, or `AllBlocksCleared`).
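Frames 1 and 2, together with the signed 64-bit hash convention used by `_to_signed_i64` in the publisher above, can be exercised with the standard library alone. This sketch deliberately skips the msgpack payload (frame 3); `encode_header`, `decode_header`, and `to_unsigned_u64` are hypothetical helper names, not part of Dynamo or vLLM.

```python
def to_signed_i64(value: int) -> int:
    """Reinterpret an unsigned 64-bit hash as a signed 64-bit integer."""
    return value - (1 << 64) if value > 0x7FFFFFFFFFFFFFFF else value


def to_unsigned_u64(value: int) -> int:
    """Inverse of to_signed_i64: recover the original unsigned hash."""
    return value + (1 << 64) if value < 0 else value


def encode_header(topic: str, sequence: int) -> list[bytes]:
    """Build frame 1 (topic bytes) and frame 2 (8-byte big-endian sequence)."""
    return [topic.encode(), sequence.to_bytes(8, byteorder="big")]


def decode_header(frames: list[bytes]) -> tuple[str, int]:
    """Parse frames 1 and 2 back into (topic, sequence)."""
    return frames[0].decode(), int.from_bytes(frames[1], byteorder="big")


frames = encode_header("", 258)
topic, sequence = decode_header(frames)
```

A subscriber that tracks `sequence` this way can notice dropped messages, since a PUB socket does not guarantee delivery to slow or late subscribers.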
### Best Practices
1. **Event IDs must be monotonically increasing** per worker (use a thread-safe counter)
2. **Block size must match** your engine's actual `kv_block_size`
3. **`parent_hash` is required** for all blocks except the first in a sequence - it links blocks to enable prefix matching
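The thread-safe counter mentioned in the first practice could look like the following minimal sketch. `EventIdCounter` is a hypothetical name, and starting at 1 is an assumption that simply mirrors the `self.event_id += 1` pattern in the Basic Setup example.

```python
import itertools
import threading


class EventIdCounter:
    """Hands out strictly increasing event IDs, safe to call from many threads."""

    def __init__(self, start: int = 1):
        self._lock = threading.Lock()
        self._ids = itertools.count(start)

    def next_id(self) -> int:
        with self._lock:
            return next(self._ids)


counter = EventIdCounter()
ids: list[int] = []
ids_lock = threading.Lock()


def worker() -> None:
    # Simulates several engine threads requesting IDs concurrently.
    for _ in range(1000):
        event_id = counter.next_id()
        with ids_lock:
            ids.append(event_id)


threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The counter only guarantees unique, increasing allocation. To keep events in ID order on the wire as well, allocate the ID and call the publish method while holding the same lock.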
## See Also
- **[Router README](README.md)**: Quick start guide for the KV Router
- **[Router Guide](router-guide.md)**: Configuration, tuning, and production setup
- **[Router Design](../../design-docs/router-design.md)**: Architecture details and event transport modes
......@@ -3,12 +3,135 @@
# SPDX-License-Identifier: Apache-2.0
---
# KV Cache Routing
# Router Guide
This document explains how Dynamo's Key-Value (KV) cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data, while maintaining load balance through worker utilization metrics.
## Overview
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
This guide helps you get started with using the Dynamo router, with further details on configuration, disaggregated serving setup, and parameter tuning.
## Quick start
### Python / CLI Deployment
To launch the Dynamo frontend with the KV Router:
```bash
python -m dynamo.frontend --router-mode kv --http-port 8000
```
This command:
- Launches the Dynamo frontend service with KV routing enabled
- Exposes the service on port 8000 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint
Backend workers register themselves using the `register_llm` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap.
#### CLI Arguments
| Argument | Default | Description |
|----------|---------|-------------|
| `--router-mode kv` | `round_robin` | Enable KV cache-aware routing |
| `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
| `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
| `--kv-events` / `--no-kv-events` | `--kv-events` | Enable/disable real-time KV event tracking |
| `--kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) |
For all available options: `python -m dynamo.frontend --help`
For detailed configuration options and tuning parameters, see [Using the KV Cache Router](#using-the-kv-cache-router).
### Kubernetes Deployment
To enable the KV Router in Kubernetes, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
  name: my-deployment
spec:
  services:
    Frontend:
      dynamoNamespace: my-namespace
      componentType: frontend
      replicas: 1
      envs:
        - name: DYN_ROUTER_MODE
          value: kv  # Enable KV Smart Router
```
**Key Points:**
- Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only
- Workers automatically report KV cache events to the router
- No worker-side configuration changes needed
#### Environment Variables
All CLI arguments can be configured via environment variables using the `DYN_` prefix:
| CLI Argument | Environment Variable | Default |
|--------------|---------------------|---------|
| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round_robin` |
| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` |
| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific |
| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` |
| `--kv-overlap-score-weight` | `DYN_KV_OVERLAP_SCORE_WEIGHT` | `1.0` |
For complete K8s examples and advanced configuration, see [K8s Examples](router-examples.md#k8s-examples).
For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
## KV Cache Routing
To enable KV cache aware routing start the frontend node like this:
KV cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data. By maximizing cache reuse, it reduces redundant computation and improves both throughput and latency.
```mermaid
graph TD
T[Tokens] --> R[KV Aware Router]
R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]
style T fill:#fff3e0,stroke:#333,color:#333
style R fill:#2e8b57,stroke:#333,color:#fff
style W1 fill:#f3e5f5,stroke:#333,color:#333
style W2 fill:#c8e6c9,stroke:#333,color:#333
style W3 fill:#f3e5f5,stroke:#333,color:#333
linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
```
KV Cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:
- Missed cache reuse opportunities due to suboptimal worker selection
- System throughput degradation from uneven request distribution across workers
The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions:
### Cost Calculation
1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks`
- Lower costs indicate better routing choices
- `overlap_score_weight` balances cache hit optimization against load distribution
- Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL)
### Worker Selection
The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.
Example calculation with `overlap_score_weight = 1.0`:
- Worker 1: cost = 1.0 * 8 + 10 = 18
- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost)
- Worker 3: cost = 1.0 * 2 + 9 = 11
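The calculation above can be reproduced in a few lines. The cost formula comes straight from this section; the temperature branch is only a sketch, because the exact normalization the router applies before softmax sampling is not specified here (min-max scaling is an assumption). `routing_cost` and `select_worker` are hypothetical names.

```python
import math
import random
from typing import Optional


def routing_cost(prefill_blocks: float, decode_blocks: float,
                 overlap_score_weight: float = 1.0) -> float:
    """cost = overlap_score_weight * prefill_blocks + decode_blocks"""
    return overlap_score_weight * prefill_blocks + decode_blocks


def select_worker(costs: dict[str, float], temperature: float = 0.0,
                  rng: Optional[random.Random] = None) -> str:
    """Pick the cheapest worker, or sample via softmax when temperature > 0."""
    if temperature <= 0.0:
        return min(costs, key=costs.get)
    # Assumed normalization: min-max scale costs to [0, 1], then softmax over
    # the negated values so that cheaper workers are more likely.
    lo, hi = min(costs.values()), max(costs.values())
    span = (hi - lo) or 1.0
    weights = [math.exp(-((c - lo) / span) / temperature) for c in costs.values()]
    rng = rng or random.Random()
    return rng.choices(list(costs), weights=weights, k=1)[0]


workers = {
    "worker_1": routing_cost(prefill_blocks=8, decode_blocks=10),
    "worker_2": routing_cost(prefill_blocks=5, decode_blocks=5),
    "worker_3": routing_cost(prefill_blocks=2, decode_blocks=9),
}
best = select_worker(workers)
```

With the default temperature of 0.0, `best` is `"worker_2"`, matching the example; raising the temperature spreads some traffic to `worker_3`, whose cost is close behind.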
### Using the KV Cache Router
To enable KV cache-aware routing, start the frontend node like this:
```bash
python -m dynamo.frontend --router-mode kv
```
......@@ -28,11 +151,13 @@ The main KV-aware routing arguments:
- `--router-reset-states`: When specified, resets the router state on startup by clearing both the JetStream event stream and NATS object store, starting with a fresh state. By default (when this flag is not provided), the router persists state across restarts, downloading any available snapshot from NATS object store and continuing to consume events from where it left off. This enables routers to maintain KV cache awareness across restarts. **Warning**: Using `--router-reset-states` can bring existing router replicas into an inconsistent state. Only use this flag when launching the first router replica in a component, or consider using a different namespace/component for a clean slate.
- `--router-snapshot-threshold`: Sets the number of messages in the JetStream before triggering a snapshot. When the message count exceeds this threshold, a router will attempt to purge acknowledged messages from the stream and create a snapshot of the current radix tree state in NATs object store. Defaults to 1000000. This helps manage stream size and provides faster initialization for routers that restart.
- `--router-snapshot-threshold`: Sets the number of messages in the JetStream before triggering a snapshot. When the message count exceeds this threshold, a router will attempt to purge acknowledged messages from the stream and create a snapshot of the current radix tree state in NATS object store. Defaults to 1000000. This helps manage stream size and provides faster initialization for routers that restart.
- `--no-track-active-blocks`: Disables tracking of active blocks (blocks being used for ongoing generation/decode phases). By default, the router tracks active blocks for load balancing. Disable this when routing to workers that only perform prefill (no decode phase), as tracking decode load is not relevant. This reduces router overhead and simplifies state management.
- `--active-decode-blocks-threshold`: Initial threshold (0.0-1.0) for determining when a worker is considered busy based on KV cache block utilization. When a worker's KV cache active blocks exceed this percentage of total blocks, it will be marked as busy and excluded from routing. If not set, blocks-based busy detection is disabled. This feature works with all routing modes (`--router-mode kv|round-robin|random`) as long as backend engines emit `ForwardPassMetrics`. The threshold can be dynamically updated at runtime via the `/busy_threshold` HTTP endpoint (see [Dynamic Threshold Configuration](#dynamic-threshold-configuration)).
- `--no-assume-kv-reuse`: When tracking active blocks, disables the assumption of KV cache reuse. By default (`router_assume_kv_reuse=true`), the router computes actual block hashes for sequence tracking to deduplicate blocks and optimize load balancing. When disabled via this flag, the router generates random hashes for sequence blocks, treating each request's blocks as unique. This is useful in disaggregated setups where prefill transfers blocks to decode workers that may already have those blocks cached, but the engine cannot coordinate transfers to avoid duplication. Without this flag, the router's load balancing heuristics would undercount decode blocks when duplicates exist.
- `--active-decode-blocks-threshold`: Initial threshold (0.0-1.0) for determining when a worker is considered busy based on KV cache block utilization. When a worker's KV cache active blocks exceed this percentage of total blocks, it will be marked as busy and excluded from routing. If not set, blocks-based busy detection is disabled. This feature works with all routing modes (`--router-mode kv|round-robin|random`) as long as backend engines publish load metrics. The threshold can be dynamically updated at runtime via the `/busy_threshold` HTTP endpoint (see [Dynamic Threshold Configuration](#dynamic-threshold-configuration)).
- `--active-prefill-tokens-threshold`: Literal token count threshold for determining when a worker is considered busy based on prefill token utilization. When active prefill tokens exceed this threshold, the worker is marked as busy. If not set, tokens-based busy detection is disabled.
......@@ -50,32 +175,81 @@ The main KV-aware routing arguments:
>
> **Request plane is independent of KV event transport.**
> `DYN_REQUEST_PLANE` controls how **requests** are sent (TCP/HTTP/NATS), but KV-aware routing still uses **NATS** for KV events in both JetStream and NATS Core + Local Indexer modes.
> If you run with `DYN_REQUEST_PLANE=tcp` (or `http`) and KV events enabled (default), you must also configure NATS, e.g. `NATS_SERVER=nats://...`.
> Only `--no-kv-events` removes the NATS requirement.
> When KV events are enabled (default), NATS is automatically initialized. You can optionally set `NATS_SERVER=nats://...` to specify a custom NATS server; otherwise, it defaults to `localhost:4222`.
> Use `--no-kv-events` to disable KV events and remove the NATS requirement entirely (with request plane being `tcp` or `http`).
>
> When `--kv-overlap-score-weight` is set to 0, no KVIndexer is created and prefix matching is disabled (pure load balancing). When `--no-kv-events` is set, a KVIndexer is still created but no event subscriber is launched to consume KV events from workers. Instead, the router predicts cache state based on its own routing decisions with TTL-based expiration and pruning.
>
> When `--kv-overlap-score-weight` is set to 0, no KvIndexer is created and prefix matching is disabled (pure load balancing). When `--no-kv-events` is set, a KvIndexer is still created but no event subscriber is launched to consume KV events from workers. Instead, the router predicts cache state based on its own routing decisions with TTL-based expiration and pruning. In both cases, it's recommended to disable your backend workers from publishing events through `KvEventPublisher` to avoid event accumulation in JetStream. WIP to enable disabling publishing of KV events completely in these cases.
> **Backend Configuration:** When using `--no-kv-events`, configure your backend workers to disable KV event publishing:
> - **vLLM**: Use `--kv-events-config '{"enable_kv_cache_events": false}'`
> - **SGLang**: Do not use `--kv-events-config`
> - **TRT-LLM**: Do not use `--publish-events-and-metrics`
>
> The CLI arguments `--router-ttl`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When KV events are enabled (default), the router relies on worker-side eviction events and these parameters are ignored.
## Prerequisites and Limitations
To implement KV event publishing for custom inference engines, enabling them to participate in Dynamo's KV cache-aware routing, see [KV Event Publishing for Custom Engines](../../integrations/kv-events-custom-engines.md).
>[!Note]
> **KV Router Requirements**: The KV router currently works only with **dynamic endpoints** that are registered via [`register_llm()`](../development/backend-guide.md) with `model_input=ModelInput.Tokens`. Your backend handler receives pre-tokenized requests with `token_ids` instead of raw text.
## Basic Routing
Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.
First, we must create a client tied to a component's endpoint; we can do this using the labels defined above. Here we are getting a client tied to the `generate` endpoint of the `VllmWorker` component.
```python
client = namespace('dynamo').component('VllmWorker').endpoint('generate').client()
```
We can then use the default routing methods exposed by the client class to send requests to the `VllmWorker` component.
**Current Limitations (WIP):**
- **Static endpoints**: Not yet supported. The KV router requires dynamic model discovery via etcd to track worker instances and their KV cache states.
- **Multimodal models**: Not yet supported. The KV router currently tracks token-based blocks only.
- **Random routing**: Default strategy, available via `client.generate()` or `client.random()`
- **Round-robin routing**: Cycles through available workers via `client.round_robin()`
- **Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`
**What this means for your setup:**
1. Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../development/backend-guide.md) or [example implementations](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/examples/hello_world))
2. Your handler receives requests with pre-tokenized `token_ids`, not raw text or multimodal inputs
3. You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead)
KV Cache routing uses direct routing with a special worker selection algorithm.
For basic model registration without KV routing, you can use `--router-mode round-robin` or `--router-mode random` with both static and dynamic endpoints.
For benchmarking KV router performance, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
## Disaggregated Serving (Prefill and Decode)
For custom routing logic and advanced patterns, see [Routing Patterns](router-examples.md#routing-patterns) in the examples documentation.
Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../development/backend-guide.md)), the frontend automatically detects them and activates an internal prefill router.
## Tuning Guidelines
### 1. Understand Your Workload Characteristics
- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`
### 2. Monitor Key Metrics
The router logs the cost calculation for each worker:
```text
Formula for worker_1: 125.5 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
```
This shows:
- Total cost (125.5)
- Overlap weight × prefill blocks (1.0 × 100.5)
- Active blocks (25.0)
- Cached blocks that contribute to overlap (15)
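When monitoring many requests, it can help to parse these lines into structured records. The sketch below is a hedged example: the regular expression is inferred only from the single example line above, so the real log format may carry extra fields or differ between versions. `CostLog` and `parse_cost_line` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class CostLog(NamedTuple):
    worker: str
    cost: float
    weight: float
    prefill_blocks: float
    active_blocks: float
    cached_blocks: int


# Pattern inferred from the example log line (an assumption, not a spec).
_NUM = r"[-+]?\d+(?:\.\d+)?"
COST_LINE = re.compile(
    rf"Formula for (?P<worker>\S+): (?P<cost>{_NUM}) = (?P<weight>{_NUM}) \* "
    rf"(?P<prefill>{_NUM}) \+ (?P<active>{_NUM}) \(cached_blocks: (?P<cached>\d+)\)"
)


def parse_cost_line(line: str) -> Optional[CostLog]:
    """Return the parsed record, or None if the line is not a cost line."""
    m = COST_LINE.search(line)
    if m is None:
        return None
    return CostLog(m["worker"], float(m["cost"]), float(m["weight"]),
                   float(m["prefill"]), float(m["active"]), int(m["cached"]))


entry = parse_cost_line(
    "Formula for worker_2: 10.0 = 1.0 * 5.0 + 5.0 (cached_blocks: 5)"
)
```

Aggregating `prefill_blocks` and `active_blocks` per worker over time shows whether the overlap weight is pulling traffic toward cache hits or toward idle workers.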
### 3. Temperature-Based Routing
The `router_temperature` parameter controls routing randomness:
- **0.0 (default)**: Deterministic selection of the best worker
- **> 0.0**: Probabilistic selection, higher values increase randomness
- Useful for preventing worker saturation and improving load distribution
### 4. Iterative Optimization
1. Begin with default settings
2. Monitor TTFT and ITL metrics
3. Adjust `kv-overlap-score-weight` to meet your performance goals:
- To reduce TTFT: Increase the weight
- To reduce ITL: Decrease the weight
4. If you observe severe load imbalance, increase the temperature setting
## Disaggregated Serving
Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../../development/backend-guide.md)), the frontend automatically detects them and activates an internal prefill router.
### Automatic Prefill Router Activation
......@@ -120,7 +294,7 @@ await register_llm(
await prefill_endpoint.serve_endpoint(prefill_handler.generate)
```
> [!NOTE]
> [!Note]
> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch/disagg_router.sh).
### Request Flow
......@@ -152,190 +326,13 @@ graph TD
linkStyle 5 stroke:#2196f3,stroke-width:2px
```
## Overview
The KV-aware router operates on two key principles to optimize request routing:
### Global KV Cache State Synchronization
KV events from engines are collected by the router to maintain a global view of cached blocks across all workers. The router supports two event transport modes:
#### Mode 1: JetStream (Default)
KV events are sent to a persistent NATS JetStream. Each KV router/indexer replica acts as a durable consumer, pulling messages from this shared stream. This architecture ensures consistency across router replicas and persistence across restarts.
- **Best for**: Production deployments requiring durability and multi-replica router consistency
- **Tradeoffs**: Requires JetStream setup; slightly higher latency due to persistence guarantees
```mermaid
graph TD
subgraph Engines
E1[Engine 1<br/>KVPublisher]
E2[Engine 2<br/>KVPublisher]
E3[Engine 3<br/>KVPublisher]
end
subgraph "NATS JetStream"
JS[(Persistent KV Events Stream<br/>- Block created<br/>- Block removed)]
end
subgraph "NATS Object Store"
OS[(Radix Tree<br/>State Snapshot)]
end
subgraph "Router Replicas"
R1[Router 1<br/>KVIndexer]
R2[Router 2<br/>KVIndexer]
end
E1 -->|Publish Events| JS
E2 -->|Publish Events| JS
E3 -->|Publish Events| JS
JS -->|Consume as Durable Consumer| R1
JS -->|Consume as Durable Consumer| R2
JS -->|Periodic Snapshot| OS
style JS fill:#e1f5fe,stroke:#333,color:#333
style OS fill:#e1f5fe,stroke:#333,color:#333
style E1 fill:#f3e5f5,stroke:#333,color:#333
style E2 fill:#f3e5f5,stroke:#333,color:#333
style E3 fill:#f3e5f5,stroke:#333,color:#333
style R1 fill:#2e8b57,stroke:#333,color:#fff
style R2 fill:#2e8b57,stroke:#333,color:#fff
linkStyle 0,1,2,3,4,5 stroke:#2196f3,stroke-width:2px
```
#### Mode 2: NATS Core with Local Indexer
When workers are started with `--enable-local-indexer`, each worker maintains its own local radix tree (local indexer) and publishes events over NATS Core (fire-and-forget pub/sub) instead of JetStream. Each worker assigns monotonically increasing event IDs to its events. The router detects gaps in event sequences and recovers missed events by querying the worker's local indexer directly.
- **Best for**: Lower-latency setups; simpler deployments without JetStream; single-router scenarios
- **Tradeoffs**: State persists on workers (not centralized); recovery depends on workers being available
- **Enable with**: `--enable-local-indexer` flag on workers (vLLM, mocker)
```mermaid
graph TD
subgraph Engines
E1[Engine 1<br/>LocalKvIndexer]
E2[Engine 2<br/>LocalKvIndexer]
E3[Engine 3<br/>LocalKvIndexer]
end
subgraph "NATS Core"
NC[KV Events Pub/Sub<br/>- Block created<br/>- Block removed]
end
subgraph "Router Replicas"
R1[Router 1<br/>KVIndexer]
R2[Router 2<br/>KVIndexer]
end
E1 -->|Publish Events| NC
E2 -->|Publish Events| NC
E3 -->|Publish Events| NC
NC -->|Subscribe| R1
NC -->|Subscribe| R2
style NC fill:#e1f5fe,stroke:#333,color:#333
style E1 fill:#f3e5f5,stroke:#333,color:#333
style E2 fill:#f3e5f5,stroke:#333,color:#333
style E3 fill:#f3e5f5,stroke:#333,color:#333
style R1 fill:#2e8b57,stroke:#333,color:#fff
style R2 fill:#2e8b57,stroke:#333,color:#fff
linkStyle 0,1,2,3,4 stroke:#2196f3,stroke-width:2px
```
**How gap detection works:**
1. Each worker assigns monotonically increasing event IDs starting from 0
2. The router tracks the last received event ID per worker
3. If an event arrives with `event_id > last_id + 1`, the router detects a gap
4. The router queries the worker's local indexer for the missing event range `[last_id+1, event_id-1]`
5. On worker discovery (Added event), the router dumps the worker's entire local indexer state
**Startup behavior:**
- When a worker is discovered, the router queries and ingests its full local indexer state
- When a worker is removed, the router removes all its blocks from the global radix tree
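The gap-detection steps can be sketched as a small per-worker tracker. This is an illustration of the logic described above, not the router's actual implementation; `GapDetector` and its methods are hypothetical names, and treating the first observed ID as the baseline is an assumption (the router is described as loading the worker's full state on discovery).

```python
from typing import Optional


class GapDetector:
    """Tracks the last event ID seen per worker and reports missing ranges."""

    def __init__(self) -> None:
        self._last_id: dict[int, int] = {}

    def observe(self, worker_id: int, event_id: int) -> Optional[tuple[int, int]]:
        """Record an event; return the inclusive missing range, if any."""
        last = self._last_id.get(worker_id)
        gap = None
        if last is not None and event_id > last + 1:
            # Events last+1 .. event_id-1 were never received.
            gap = (last + 1, event_id - 1)
        if last is None or event_id > last:
            self._last_id[worker_id] = event_id
        return gap

    def forget(self, worker_id: int) -> None:
        """Drop tracking state when a worker is removed."""
        self._last_id.pop(worker_id, None)


detector = GapDetector()
results = [detector.observe(7, event_id) for event_id in (0, 1, 4, 5)]
```

In this run the third call reports the range `(2, 3)`, which is what the router would request from worker 7's local indexer.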
>[!Note]
> The router automatically selects the transport mode based on worker configuration. If all connected workers have `enable_local_indexer=true`, the router uses NATS Core mode. Otherwise, it uses JetStream mode.
### Local Active Block Management with Replica Sync
Second, in addition to cached blocks, each router replica needs to track active blocks (blocks being used for ongoing generation) as load metrics. Since this information is highly time-sensitive, it should be predicted immediately when:
- The router receives and routes a request
- The first token is generated (prefill complete)
- The response ends (request freed)
This is managed locally in each router via a "slot manager". To maintain consistency across the system, router replicas synchronize these local predictions with each other through NATS core messaging.
```mermaid
sequenceDiagram
participant C1 as Client 1
participant R1 as Router 1<br/>(Slot Manager)
participant R2 as Router 2<br/>(Slot Manager)
participant C2 as Client 2
Note over R1,R2: Router Replica Sync Enabled
C1->>R1: Request A
activate R1
R1->>R1: Predict blocks & route to worker
R1-->>R2: Sync: AddRequest(A)
C2->>R2: Request B
activate R2
R2->>R2: Predict blocks & route to worker
R2-->>R1: Sync: AddRequest(B)
R1->>R1: First token received<br/>(prefill complete)
R1-->>R2: Sync: MarkPrefillCompleted(A)
R1->>C1: Stream response
R2->>R2: First token received<br/>(prefill complete)
R2-->>R1: Sync: MarkPrefillCompleted(B)
R2->>C2: Stream response
R1->>R1: Response complete<br/>(free blocks)
R1-->>R2: Sync: Free(A)
deactivate R1
R2->>R2: Response complete<br/>(free blocks)
R2-->>R1: Sync: Free(B)
deactivate R2
Note over R1,R2: Both routers have consistent<br/>view of active blocks
```
This dual-layer approach—persistent global KV cache state via JetStream and ephemeral active block synchronization via router replicas—enables the system to make optimal routing decisions that balance cache reuse with load distribution.
## Basic Routing
Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.
First, we must create a client tied to a component's endpoint; we can do this using the labels defined above. Here we are getting a client tied to the `generate` endpoint of the `VllmWorker` component.
```python
client = namespace('dynamo').component('VllmWorker').endpoint('generate').client()
```
We can then use the default routing methods exposed by the client class to send requests to the `VllmWorker` component.
- **Random routing**: Default strategy, available via `client.generate()` or `client.random()`
- **Round-robin routing**: Cycles through available workers via `client.round_robin()`
- **Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`
KV Cache routing uses direct routing with a special worker selection algorithm.
## Serving Multiple Router Replicas
For improved fault tolerance, you can launch multiple frontend + router replicas. Since the frontend and router are currently tied together, you'll need to use different HTTP ports for each instance. (The separation of the frontend and Router is WIP.)
### Router State Management
The KV Router tracks two types of state (see [KV Router Architecture](README.md) for details):
The KV Router tracks two types of state (see [Router Design](../../design-docs/router-design.md) for details):
1. **Prefix blocks (cached KV blocks)**: Maintained in a radix tree, tracking which blocks are cached on each worker. This state is **persistent** - backed by NATS JetStream events and object store snapshots. New router replicas automatically sync this state on startup, ensuring consistent cache awareness across restarts.
......@@ -345,16 +342,16 @@ The KV Router tracks two types of state (see [KV Router Architecture](README.md)
```bash
# Router replica 1
python -m dynamo.frontend --router-mode kv --port 8000 --router-replica-sync
python -m dynamo.frontend --router-mode kv --http-port 8000 --router-replica-sync
# Router replica 2 (can be started later)
python -m dynamo.frontend --router-mode kv --port 8001 --router-replica-sync
python -m dynamo.frontend --router-mode kv --http-port 8001 --router-replica-sync
```
The `--router-replica-sync` flag enables active block synchronization between replicas:
- Active blocks are shared via NATS core messaging (fire-and-forget)
- Replicas exchange routing decisions to maintain consistent load estimates
- A new replica start with zero active blocks but quickly converge through request handling, by itself and active syncing with other replicas
- A new replica starts with zero active blocks but quickly converges through request handling, by itself and active syncing with other replicas
Without this flag, each replica maintains its own isolated view of active blocks, potentially leading to suboptimal routing.
......@@ -369,7 +366,7 @@ Persistence behavior depends on which event transport mode is active:
- You can launch a third Router replica even if the first two are down, and it will recover the full prefix state
```bash
python -m dynamo.frontend --router-mode kv --port 8002 --router-replica-sync
python -m dynamo.frontend --router-mode kv --http-port 8002 --router-replica-sync
```
**NATS Core with Local Indexer Mode:**
......@@ -380,325 +377,13 @@ python -m dynamo.frontend --router-mode kv --port 8002 --router-replica-sync
>[!Note]
> If you need to start with a fresh state in JetStream mode, you have two options:
> 1. **Recommended**: Use a different namespace/component (see [Distributed Runtime](../design-docs/distributed-runtime.md)) which will start a new stream and NATS object store path
> 1. **Recommended**: Use a different namespace/component (see [Distributed Runtime](../../design-docs/distributed-runtime.md)) which will start a new stream and NATS object store path
> 2. **Use with caution**: Launch a router with the `--router-reset-states` flag, which will purge the entire stream and radix snapshot. This should only be done when launching the first router replica in a component, as it can bring existing router replicas into an inconsistent state.
## Understanding KV Cache
The leading Large Language Models (LLMs) today are auto-regressive and based on the [transformer architecture](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). One key inference optimization is to cache the keys and values that have already been computed and reuse them for future tokens. This is called the [KV Cache](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/#key-value_caching).
### KV Cache Optimizations
Every inference framework maintains a KV Cache on each worker. [vLLM](https://github.com/vllm-project/vllm), a popular inference framework, contributed [PagedAttention](https://arxiv.org/abs/2309.06180), which manages the KV Cache efficiently by chunking requests into blocks.
Another popular inference framework, [SGLang](https://github.com/sgl-project/sglang), contributed [RadixAttention](https://arxiv.org/abs/2312.07104), which introduced a prefix tree that allows efficient matching, insertion, and eviction of KV Cache blocks. The prefix tree structure popularized KV Cache reuse.
In Dynamo, we introduce a KVPublisher which emits KV Cache events that occur at each worker and a KVIndexer which keeps track of these events globally.
To get a feel for how KV Cache management works on a single worker with KV Cache reuse turned on and where the KVPublisher gets plugged in, we can walk through the KV Block management flow:
1. Request tokenization: The incoming prompt is converted into tokens
2. Block partitioning: The token sequence is divided into fixed-size blocks (e.g., 16 or 64 tokens per block)
3. Block hashing: Each block of tokens is hashed to create a unique identifier
4. Cache lookup:
- For each block, the system checks if a matching block already exists in the KV cache
- If a match is found, the existing KV cache block is reused
- If no match is found, the system proceeds to the next step
5. Resource allocation:
- For blocks without matches, the system attempts to allocate new memory space
- If sufficient memory is available, allocate memory space and proceed to step 7
- If memory is constrained, proceed to step 6
6. Cache eviction (when necessary):
- The system applies an eviction policy (e.g., LRU, LFU) to identify blocks for removal
- Selected blocks are evicted from the cache
- **KVPublisher emits a KV removed event notifying KVIndexer about the removed block.**
- Alternatively, some systems may offload less-frequently used blocks to CPU memory.
7. KV computation:
- For new blocks, the model computes key and value tensors
- These tensors are stored in the newly allocated cache blocks
- **KVPublisher emits a KV stored event notifying KVIndexer about the newly stored blocks.**
Further details can be found for: [TRT-LLM](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/), [vLLM](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching) and [SGLang](https://lmsys.org/blog/2024-01-17-sglang/).
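Steps 2 to 4 of the flow above can be sketched in a few lines. This is an illustration under stated assumptions, not any engine's actual scheme: the helper names are hypothetical, the trailing partial block is dropped, and each block hash is chained with its parent's hash so that an equal hash implies an equal prefix (a common design, but the hashing details differ per engine).

```python
import hashlib


def partition_blocks(token_ids, block_size):
    """Split tokens into full fixed-size blocks (assumption: partial tail is dropped)."""
    n_full = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_full)]


def hash_blocks(blocks):
    """Hash each block together with its parent's hash (assumed chaining scheme)."""
    hashes = []
    parent = b""
    for block in blocks:
        payload = parent + ",".join(map(str, block)).encode()
        digest = hashlib.sha256(payload).digest()
        hashes.append(digest.hex()[:16])
        parent = digest
    return hashes


def count_cached_prefix(block_hashes, cache):
    """Count leading blocks already present in the cache (step 4: cache lookup)."""
    matched = 0
    for h in block_hashes:
        if h not in cache:
            break
        matched += 1
    return matched


first = hash_blocks(partition_blocks(list(range(40)), block_size=16))
cache = set(first)  # pretend the first request's blocks are now cached

# Second request shares only the first 16 tokens with the first one.
second = hash_blocks(partition_blocks(list(range(16)) + [999] * 16, block_size=16))
print(count_cached_prefix(second, cache))
```

Blocks past the matched prefix are the ones that would go through allocation, possible eviction, and KV computation in steps 5 to 7.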
## KV Cache Routing and Load Balancing
```mermaid
graph TD
T[Tokens] --> R[KV Aware Router]
R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]
style T fill:#fff3e0,stroke:#333,color:#333
style R fill:#2e8b57,stroke:#333,color:#fff
style W1 fill:#f3e5f5,stroke:#333,color:#333
style W2 fill:#c8e6c9,stroke:#333,color:#333
style W3 fill:#f3e5f5,stroke:#333,color:#333
linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
```
KV Cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:
- Missed cache reuse opportunities due to suboptimal worker selection
- System throughput degradation from uneven request distribution across workers
The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions:
### Cost Calculation
1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks`
- Lower costs indicate better routing choices
- `overlap_score_weight` balances cache hit optimization against load distribution
- Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL)
### Worker Selection
The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.
Example calculation with `overlap_score_weight = 1.0`:
- Worker 1: cost = 1.0 * 8 + 10 = 18
- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost)
- Worker 3: cost = 1.0 * 2 + 9 = 11
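The calculation above can be reproduced with a short sketch. The function names are hypothetical, and the temperature branch is only one plausible reading of "softmax sampling on the normalized cost logits": it assumes min-max normalization and negated costs as logits, which may differ from the router's actual implementation.

```python
import math
import random


def worker_cost(prefill_blocks, decode_blocks, overlap_score_weight=1.0):
    # cost = overlap_score_weight * prefill_blocks + decode_blocks
    return overlap_score_weight * prefill_blocks + decode_blocks


def select_worker(costs, temperature=0.0, rng=random):
    """Pick the lowest-cost worker, or sample via softmax when temperature > 0."""
    if temperature == 0.0:
        return min(costs, key=costs.get)
    lo, hi = min(costs.values()), max(costs.values())
    span = (hi - lo) or 1.0  # avoid division by zero when all costs are equal
    # Assumption: min-max normalize, then use negative cost as the logit.
    logits = {w: -((c - lo) / span) / temperature for w, c in costs.items()}
    peak = max(logits.values())
    workers = list(logits)
    weights = [math.exp(logits[w] - peak) for w in workers]
    return rng.choices(workers, weights=weights, k=1)[0]


costs = {
    "worker-1": worker_cost(prefill_blocks=8, decode_blocks=10),
    "worker-2": worker_cost(prefill_blocks=5, decode_blocks=5),
    "worker-3": worker_cost(prefill_blocks=2, decode_blocks=9),
}
print(costs)
print(select_worker(costs))  # deterministic: lowest cost wins
print(select_worker(costs, temperature=0.5, rng=random.Random(0)))
```

Raising `overlap_score_weight` in this sketch shifts the choice toward Worker 3, which has the fewest prefill blocks, mirroring the TTFT versus ITL trade-off described above.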
## Events
### KVPublisher
The KVPublisher can be initialized and then called in the inference framework where blocks are allocated and removed.
The two types of events are:
- KV stored event
- KV removed event
The publisher can be initialized and used through C bindings or Python bindings.
### Deterministic Event IDs
Engines do not need to emit deterministic block identifiers in KV events, as the router uses local block hashes (computed from token content) for tracking and matching blocks across workers. However, it is strongly preferred that engines do emit deterministic block identifiers, as this keeps the KvIndexer's internal lookup table smaller and more efficient. To ensure deterministic behavior, all workers should use identical engine versions/configuration. If your engine relies on Python's builtin `hash()` for any event IDs, set `PYTHONHASHSEED=0`; otherwise this setting has no effect.
### KVIndexer
The KVIndexer builds and maintains a global view of cached blocks in a prefix tree. We modify the original prefix tree by also storing the worker id on each node. This is so we can return the number of matched blocks for each worker.
The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks.
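To make the data structure concrete, here is a toy prefix tree that stores worker IDs on each node and returns per-worker match counts. It is a sketch only: `ToyIndexer`, `store`, and `find_matches` are hypothetical names, block hashes are plain strings, and removal events are not modeled.

```python
class ToyIndexer:
    """Toy prefix tree keyed by block hash; each node records which workers hold it."""

    def __init__(self):
        self.root = {"children": {}, "workers": set()}

    def store(self, worker_id, block_hashes):
        # Apply a "KV stored" event: walk or create the path and tag each node.
        node = self.root
        for h in block_hashes:
            node = node["children"].setdefault(h, {"children": {}, "workers": set()})
            node["workers"].add(worker_id)

    def find_matches(self, block_hashes):
        # Walk the request's blocks and count matched blocks per worker.
        matches = {}
        node = self.root
        for h in block_hashes:
            node = node["children"].get(h)
            if node is None:
                break
            for worker_id in node["workers"]:
                matches[worker_id] = matches.get(worker_id, 0) + 1
        return matches


indexer = ToyIndexer()
indexer.store("worker-1", ["a", "b", "c"])
indexer.store("worker-2", ["a", "x"])
print(indexer.find_matches(["a", "b", "z"]))
```

The returned dictionary has the same shape as the one described for `find_matches_for_request`: worker ID mapped to the number of matched blocks.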
### Inter-Router Communication
In distributed deployments with multiple routers, each router maintains visibility over only a portion of the total requests. To ensure consistent routing decisions, routers synchronize their states through three event types:
1. **AddRequest**: Notifies other routers when a request is assigned to a worker. Includes request ID, worker ID, token sequence blocks, and overlap score to track block usage across the system.
2. **MarkPrefillCompleted**: Signals when a request moves from prefill to decode phase, allowing routers to update their worker load calculations by excluding completed prefill tokens.
3. **Free**: Indicates request completion and resource release, enabling accurate block reference counting across all routers.
Each event carries a unique router ID to prevent self-event processing. This asynchronous communication system ensures optimal routing decisions by maintaining consistent KV cache state across all routers, even as they handle different request streams.
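A minimal sketch of how a replica might apply these three events follows. The `Event` fields, the `ToyRouterState` class, and the block counts are assumptions made for illustration; the real event payloads and transport are not shown in this document.

```python
from dataclasses import dataclass


@dataclass
class Event:
    kind: str            # "AddRequest", "MarkPrefillCompleted", or "Free"
    router_id: str       # originating router, used to skip self-events
    request_id: str
    worker_id: str = ""
    prefill_blocks: int = 0
    decode_blocks: int = 0


class ToyRouterState:
    """Hypothetical per-replica view of active blocks on each worker."""

    def __init__(self, router_id):
        self.router_id = router_id
        self.requests = {}  # request_id -> [worker_id, prefill_blocks, decode_blocks]

    def apply(self, event):
        if event.router_id == self.router_id:
            return False  # ignore events this router published itself
        if event.kind == "AddRequest":
            self.requests[event.request_id] = [
                event.worker_id, event.prefill_blocks, event.decode_blocks]
        elif event.kind == "MarkPrefillCompleted":
            if event.request_id in self.requests:
                self.requests[event.request_id][1] = 0  # prefill work is done
        elif event.kind == "Free":
            self.requests.pop(event.request_id, None)   # release all blocks
        return True

    def load(self, worker_id):
        rows = [r for r in self.requests.values() if r[0] == worker_id]
        return sum(r[1] for r in rows), sum(r[2] for r in rows)


state = ToyRouterState("router-a")
state.apply(Event("AddRequest", "router-b", "req-1", "worker-1", 3, 4))
print(state.load("worker-1"))
state.apply(Event("MarkPrefillCompleted", "router-b", "req-1"))
print(state.load("worker-1"))
state.apply(Event("Free", "router-b", "req-1"))
print(state.load("worker-1"))
```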
## Using KvPushRouter Python API
Instead of launching the KV Router via command line, you can create a `KvPushRouter` object directly in Python. This allows per-request routing configuration overrides.
>[!Warning]
> **Multiple Routers in Same Process**: If you need to run multiple `KvPushRouter` instances for fault tolerance or load distribution, you must launch them in **separate processes** (e.g., using `python -m dynamo.frontend` with different ports). Creating multiple `KvPushRouter` objects in the same Python process is not supported - they share the same cancellation token from the component's primary lease, so dropping one router will cancel all routers in that process. For in-process routing, use a single `KvPushRouter` instance.
### Methods
The `KvPushRouter` provides the following methods:
- **`generate(token_ids, model, ...)`**: Route and execute a request, returning an async stream of responses. Automatically handles worker selection, state tracking, and lifecycle management.
- **`best_worker(token_ids, router_config_override=None, request_id=None)`**: Query which worker would be selected for given tokens. Returns `(worker_id, dp_rank, overlap_blocks)`.
- Without `request_id`: Query-only, doesn't update router state
- With `request_id`: Updates router state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking
- **`best_worker_id(token_ids, router_config_override=None, request_id=None)`**: **[DEPRECATED - use `best_worker()` instead]** Query which worker would be selected for given tokens. Returns `(worker_id, overlap_blocks)`.
- Without `request_id`: Query-only, doesn't update router state
- With `request_id`: Updates router state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking
- **`get_potential_loads(token_ids)`**: Get detailed load information for all workers, including potential prefill tokens and active decode blocks. Returns a list of load dictionaries.
- **`mark_prefill_complete(request_id)`**: Signal that a request has completed its prefill phase. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker_id()` for manual routing instead of `generate()`.
- **`free(request_id)`**: Signal that a request has completed and its resources should be released. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker_id()` for manual routing instead of `generate()`.
- **`dump_events()`**: Dump all KV cache events from the router's indexer as a JSON string. Useful for debugging and analysis.
### Setup
First, launch your backend engines:
```bash
python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf
```
### Example Script
```python
import asyncio
from dynamo._core import DistributedRuntime, KvPushRouter, KvRouterConfig

async def main():
    # Get runtime and create endpoint
    runtime = DistributedRuntime.detached()
    namespace = runtime.namespace("dynamo")
    component = namespace.component("backend")
    endpoint = component.endpoint("generate")
    # Create KV router
    kv_router_config = KvRouterConfig()
    router = KvPushRouter(
        endpoint=endpoint,
        block_size=16,
        kv_router_config=kv_router_config
    )
    # Your input tokens
    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    # Generate with per-request routing override
    stream = await router.generate(
        token_ids=token_ids,
        model="meta-llama/Llama-2-7b-hf",
        stop_conditions={
            "max_tokens": 20,  # Generate exactly 20 tokens
            "ignore_eos": True,  # Don't stop at EOS token
        },
        sampling_options={
            "temperature": 0.7,
            "top_p": 0.9,
        },
        router_config_override={
            "overlap_score_weight": 2.0,  # Prioritize cache hits for this request
            "router_temperature": 0.5,  # Add routing randomness
        }
    )
    # Collect generated tokens
    generated_tokens = []
    async for response in stream:
        if isinstance(response, dict) and "token_ids" in response:
            generated_tokens.extend(response["token_ids"])
    print(f"Generated {len(generated_tokens)} tokens: {generated_tokens}")

if __name__ == "__main__":
    asyncio.run(main())
```
### Routing Patterns
The `KvPushRouter` supports multiple usage patterns depending on your control requirements:
#### 1. Automatic Routing (Recommended)
Call `generate()` directly and let the router handle everything:
```python
stream = await router.generate(token_ids=tokens, model="model-name")
```
- **Best for**: Most use cases
- **Router automatically**: Selects best worker, updates state, routes request, tracks lifecycle
#### 2. Manual State Management (Advanced)
Use `best_worker_id(request_id=...)` to select and track, then manage the request yourself:
```python
worker_id, overlap = await router.best_worker_id(tokens, request_id="req-123")
response = await client.generate(tokens, request_id="req-123")
# await anext(response) # Get first token
await router.mark_prefill_complete("req-123") # After first token
# async for _ in response: # Continue generating
# ...
await router.free("req-123") # After completion
```
- **Best for**: Custom request handling with router state tracking
- **Requires**: Calling `mark_prefill_complete()` and `free()` at correct lifecycle points
- **Caution**: Incorrect lifecycle management degrades load balancing accuracy
#### 3. Hierarchical Router Probing
Query without state updates, then route through a chosen router:
```python
# Probe multiple routers without updating state
worker_id_1, overlap_1 = await router_1.best_worker_id(tokens)  # No request_id
worker_id_2, overlap_2 = await router_2.best_worker_id(tokens)
# Pick the router (and its chosen worker) with the larger overlap
if overlap_1 > overlap_2:
    chosen_router, worker_id = router_1, worker_id_1
else:
    chosen_router, worker_id = router_2, worker_id_2
stream = await chosen_router.generate(tokens, model="model-name", worker_id=worker_id)
```
- **Best for**: Multi-tier deployments (e.g., Envoy Gateway routing to multiple router groups)
- **Advantage**: Query multiple routers before committing to one
#### 4. Custom Load-Based Routing
Use `get_potential_loads()` to implement custom routing logic:
```python
loads = await router.get_potential_loads(tokens)
# Apply custom logic (e.g., weighted scoring, constraints)
best_worker = min(loads, key=lambda x: custom_cost_fn(x))
stream = await router.generate(tokens, model="model-name", worker_id=best_worker['worker_id'])
```
- **Best for**: Custom optimization strategies beyond the built-in cost function
- **Advantage**: Full control over worker selection logic
- **See also**: Detailed example below in "Custom Routing Example: Minimizing TTFT"
All patterns support `router_config_override` to adjust routing behavior per-request without recreating the router.
### Custom Routing Example: Minimizing TTFT
Here's an example of using `get_potential_loads()` to implement custom routing that minimizes Time To First Token (TTFT) by selecting the worker with the least prefill work:
```python
import asyncio
from dynamo._core import DistributedRuntime, KvPushRouter, KvRouterConfig

async def minimize_ttft_routing():
    # Setup router
    runtime = DistributedRuntime.detached()
    namespace = runtime.namespace("dynamo")
    component = namespace.component("backend")
    endpoint = component.endpoint("generate")
    router = KvPushRouter(
        endpoint=endpoint,
        block_size=16,
        kv_router_config=KvRouterConfig()
    )
    # Your input tokens
    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    # Get potential loads for all workers
    potential_loads = await router.get_potential_loads(token_ids)
    # Find worker with minimum prefill tokens (best for TTFT)
    best_worker = min(potential_loads, key=lambda x: x['potential_prefill_tokens'])
    print(f"Worker loads: {potential_loads}")
    print(f"Selected worker {best_worker['worker_id']} with {best_worker['potential_prefill_tokens']} prefill tokens")
    # Route directly to the selected worker
    stream = await router.generate(
        token_ids=token_ids,
        model="meta-llama/Llama-2-7b-hf",
        worker_id=best_worker['worker_id'],  # Force routing to optimal worker
        stop_conditions={"max_tokens": 20}
    )
    # Process response
    async for response in stream:
        if isinstance(response, dict) and "token_ids" in response:
            print(f"Generated tokens: {response['token_ids']}")

if __name__ == "__main__":
    asyncio.run(minimize_ttft_routing())
```
This approach gives you complete control over routing decisions, allowing you to optimize for different metrics based on your specific requirements. As some examples:
- **Minimize TTFT**: Select worker with lowest `potential_prefill_tokens`
- **Maximize cache reuse**: Use `best_worker_id()` which considers both prefill and decode loads
- **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together
See [KV Router Architecture](README.md) for performance tuning details.
## Dynamic Threshold Configuration
Dynamic threshold configuration allows you to adjust worker busy thresholds at runtime without restarting the frontend, enabling real-time tuning of load balancing behavior based on observed system performance.
The busy thresholds can be updated at runtime without restarting the frontend. The frontend exposes HTTP endpoints at `/busy_threshold`:
**Get or set a model's thresholds (POST):**
......@@ -728,3 +413,10 @@ curl -X POST http://localhost:8000/busy_threshold \
curl http://localhost:8000/busy_threshold
# Response: {"thresholds": [{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}]}
```
## See Also
- **[Router README](README.md)**: Quick start guide for the KV Router
- **[Router Examples](router-examples.md)**: Python API usage, K8s examples, and custom routing patterns
- **[Router Design](../../design-docs/router-design.md)**: Architecture details and event transport modes
- **[KV Event Publishing for Custom Engines](../../integrations/kv-events-custom-engines.md)**: Integrate custom inference engines with KV-aware routing
......@@ -40,14 +40,14 @@ To address the growing demands of distributed inference serving, NVIDIA introduc
The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
- [Dynamo Disaggregated Serving](disagg-serving.md)
- [Dynamo Smart Router](../router/kv-cache-routing.md)
- [Dynamo KV Cache Block Manager](../kvbm/kvbm-intro.md)
- [Planner](../planner/planner-intro.md)
- [Dynamo Smart Router](../components/router/README.md)
- [Dynamo KV Cache Block Manager](../components/kvbm/README.md)
- [Planner](../components/planner/README.md)
- [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.
![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](../../assets/img/architecture.png "Dynamo Architecture")
![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](/assets/img/architecture.png "Dynamo Architecture")
Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if Dynamo detects an increase in requests with long input sequences, the Planner automatically scales up prefill workers to meet the heightened demand.
......@@ -61,7 +61,7 @@ Dynamo prioritizes seamless integration. Its modular design enables it to work h
Disaggregating prefill and decode boosts performance, gaining efficiency when more GPUs are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization.
![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](../../assets/img/disagg-perf-benefit.png)
![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](/assets/img/disagg-perf-benefit.png)
* Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL
......@@ -70,7 +70,7 @@ The disaggregation of prefill and decode phases offers valuable flexibility. Sin
### KV aware routing
![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](../../assets/img/kv-routing.png)
![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](/assets/img/kv-routing.png)
* Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL
......@@ -80,7 +80,7 @@ Existing routing methods, including load-based routing, overlook the specific pr
### KV cache manager
The Dynamo KV Block Manager (KVBM) enables KV cache offloading to system CPU memory, local SSDs, and network-attached storage, allowing more KV blocks to be reused instead of recomputed. In many cases, KV transfer is faster than recomputation, so KVBM helps reduce time-to-first-token (TTFT). The following plot highlights the performance gains achieved through CPU memory offloading. In a scenario involving 20 multi-turn conversations with 15 users, KVBM with CPU memory offloading achieved a 2.2×–12× improvement in TTFT (depending on QPS), demonstrating benefits that extend beyond basic prefix caching.
![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](../../assets/img/kvbm-agg-performance.png)
![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](/assets/img/kvbm-agg-performance.png)
* Tested with different QPS using Qwen3-8B on H100. Avg 20K ISL / 100 OSL.
......