Commit 2c3066bd authored by dagil-nvidia, committed by GitHub

docs: full migration of docs/ to fern format in fern/ (#6050)


Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
parent d59b9d72
......@@ -8,15 +8,16 @@
## Overview
Dynamo provides structured logging in both text and JSONL formats. When
JSONL is enabled logs additionally contain `span` creation and exit
events as well as support for `trace_id` and `span_id` fields for
distributed tracing.
JSONL is enabled, logs support `trace_id` and `span_id` fields for
distributed tracing. Span creation and exit events can be optionally
enabled via the `DYN_LOGGING_SPAN_EVENTS` environment variable.
## Environment Variables
| Variable | Description | Default | Example |
|----------|-------------|---------|---------|
| `DYN_LOGGING_JSONL` | Enable JSONL logging format | `false` | `true` |
| `DYN_LOGGING_SPAN_EVENTS` | Enable span entry/close event logging (`SPAN_FIRST_ENTRY`, `SPAN_CLOSED` messages) | `false` | `true` |
| `DYN_LOG` | Log levels per target `<default_level>,<module_path>=<level>,<module_path>=<level>` | `info` | `DYN_LOG=info,dynamo_runtime::system_status_server:trace` |
| `DYN_LOG_USE_LOCAL_TZ` | Use local timezone for timestamps (default is UTC) | `false` | `true` |
| `DYN_LOGGING_CONFIG_PATH` | Path to custom TOML logging configuration | none | `/path/to/config.toml` |
......@@ -143,78 +144,84 @@ This section shows how trace and span information appears in JSONL logs. These l
When viewing the corresponding trace in Grafana, you should be able to see something like the following:
![Disaggregated Trace Example](../../assets/img/grafana-disagg-trace.png)
![Disaggregated Trace Example](/assets/img/grafana-disagg-trace.png)
### Trace Overview
| Attribute | Value |
|-----------|-------|
| **Trace ID** | b672ccf48683b392891c5cb4163d4b51 |
| **Start Time** | 2025-10-31 13:52:10.706 |
| **Duration** | 4.04s |
| **Request** | `POST /v1/chat/completions` |
### Root Span (Frontend): `http-request`
| Attribute | Value |
|-----------|-------|
| **Service** | frontend |
| **Span ID** | 5c20cc08e6afb2b7 |
| **Duration** | 4.04s |
| **Start Time** | 13:52:10.706 |
| **Status** | unset |
| **Method** | POST |
| **URI** | `/v1/chat/completions` |
| **HTTP Version** | HTTP/1.1 |
| **Parent ID** | (none) |
| **Child Count** | 2 |
| **Busy Time** | 18,101,350 ns (18.10ms) |
| **Idle Time** | 4,022,100,356 ns (4.02s) |
### Child Span (Prefill): `handle_payload`
| Attribute | Value |
|-----------|-------|
| **Service** | prefill |
| **Duration** | 39.65ms |
| **Start Time** | 13:52:10.707 |
| **Status** | unset |
| **Component** | prefill |
| **Endpoint** | generate |
| **Namespace** | vllm-disagg |
| **Instance ID** | 3866790875219207267 |
| **Trace ID** | b672ccf48683b392891c5cb4163d4b51 |
| **Parent ID** | 5c20cc08e6afb2b7 |
| **Busy Time** | 613,633 ns (0.61ms) |
| **Idle Time** | 36,340,242 ns (36.34ms) |
### Child Span (Decode): `handle_payload`
| Attribute | Value |
|-----------|-------|
| **Service** | decode |
| **Duration** | 4s |
| **Start Time** | 13:52:10.745 |
| **Status** | unset |
| **Component** | backend |
| **Endpoint** | generate |
| **Namespace** | vllm-disagg |
| **Instance ID** | 3866790875219207263 |
| **Trace ID** | b672ccf48683b392891c5cb4163d4b51 |
| **Parent ID** | 5c20cc08e6afb2b7 |
| **Busy Time** | 3,795,258 ns (3.79ms) |
| **Idle Time** | 3,996,532,471 ns (3.99s) |
### Frontend Logs with Trace Context
The following shows the JSONL logs from the frontend service for the same request. Note the `trace_id` field (`b672ccf48683b392891c5cb4163d4b51`) that correlates all logs for this request, and the `span_id` field that identifies individual operations:
Dynamo creates distributed traces that span multiple services in a disaggregated serving setup. The following sections describe the key spans you'll see in Grafana when viewing traces for chat completion requests.
```
{"time":"2025-10-31T20:52:07.707164Z","level":"INFO","file":"/opt/dynamo/lib/runtime/src/logging.rs","line":806,"target":"dynamo_runtime::logging","message":"OTLP export enabled","endpoint":"http://tempo.tm.svc.cluster.local:4317","service":"frontend"}
{"time":"2025-10-31T20:52:10.707164Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"5c20cc08e6afb2b7","span_name":"http-request","trace_id":"b672ccf48683b392891c5cb4163d4b51","uri":"/v1/chat/completions","version":"HTTP/1.1"}
{"time":"2025-10-31T20:52:10.745264Z","level":"DEBUG","file":"/opt/dynamo/lib/llm/src/kv_router/prefill_router.rs","line":232,"target":"dynamo_llm::kv_router::prefill_router","message":"Prefill succeeded, using disaggregated params for decode","method":"POST","span_id":"5c20cc08e6afb2b7","span_name":"http-request","trace_id":"b672ccf48683b392891c5cb4163d4b51","uri":"/v1/chat/completions","version":"HTTP/1.1"}
{"time":"2025-10-31T20:52:10.745545Z","level":"DEBUG","file":"/opt/dynamo/lib/runtime/src/pipeline/network/tcp/server.rs","line":230,"target":"dynamo_runtime::pipeline::network::tcp::server","message":"Registering new TcpStream on 10.0.4.65:41959","method":"POST","span_id":"5c20cc08e6afb2b7","span_name":"http-request","trace_id":"b672ccf48683b392891c5cb4163d4b51","uri":"/v1/chat/completions","version":"HTTP/1.1"}
```
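To follow one request across log lines, the records can be grouped by `trace_id`. The sketch below is a minimal illustration in Python, assuming only the field names visible in the sample logs above (`trace_id`, `span_name`, `message`); the helper name `group_by_trace` and the shortened sample records are made up for this example.

```python
import json
from collections import defaultdict


def group_by_trace(lines):
    """Group parsed JSONL log records by their trace_id field."""
    traces = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        trace_id = record.get("trace_id")
        if trace_id is not None:  # records without trace context are skipped
            traces[trace_id].append(record)
    return dict(traces)


# Shortened versions of the sample records above (illustrative only).
sample = [
    '{"time":"2025-10-31T20:52:07.707164Z","level":"INFO","message":"OTLP export enabled"}',
    '{"time":"2025-10-31T20:52:10.745264Z","level":"DEBUG",'
    '"message":"Prefill succeeded, using disaggregated params for decode",'
    '"span_id":"5c20cc08e6afb2b7","span_name":"http-request",'
    '"trace_id":"b672ccf48683b392891c5cb4163d4b51"}',
]

traces = group_by_trace(sample)
for trace_id, records in traces.items():
    for record in records:
        print(trace_id, record["span_name"], record["message"])
```

In practice the `lines` argument could be an open log file, since iterating a file yields one JSONL record per line.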
#### Available Spans in Disaggregated Mode
When running Dynamo in disaggregated mode, a typical request creates the following spans:
##### 1. `http-request` (Frontend - Root Span)
The root span for the entire request lifecycle, created in the **dynamo-frontend** service.
**Key Attributes:**
- **Service**: `dynamo-frontend`
- **Operation**: Handles the HTTP request from client to completion
- **Duration**: Total end-to-end request time (includes prefill + decode)
- **Method**: HTTP method (typically `POST`)
- **URI**: Request endpoint (e.g., `/v1/chat/completions`)
- **Status**: Request completion status
- **Children**: Typically 2-3 child spans (routing span + worker spans)
This span represents the complete request flow from when the frontend receives the HTTP request until the final response is sent back to the client.
##### 2. `prefill_routing` (Frontend - Routing Span)
A child span of `http-request`, created in the **dynamo-frontend** service during the routing phase.
**Key Attributes:**
- **Service**: `dynamo-frontend`
- **Operation**: Routes the prefill request to an appropriate prefill worker
- **Duration**: Time spent selecting a prefill worker plus the duration of the prefill request
- **Parent**: `http-request` span
This span covers both the routing decision and the request sent to the prefill worker.
##### 3. `handle_payload` (Prefill Worker Span)
A child span of `http-request`, created in the **dynamo-worker-vllm-prefill** service.
**Key Attributes:**
- **Service**: `dynamo-worker-vllm-prefill` (or `dynamo-worker-sglang-prefill` for SGLang)
- **Operation**: Processes the prefill phase of generation
- **Duration**: Time to compute prefill (typically milliseconds to seconds)
- **Component**: `prefill`
- **Endpoint**: `generate`
- **Parent**: `http-request` span
This span represents the actual prefill computation on a prefill-specialized worker, including prompt processing and initial KV cache generation.
##### 4. `handle_payload` (Decode Worker Span)
A child span of `http-request`, created in the **dynamo-worker-vllm-decode** service.
**Key Attributes:**
- **Service**: `dynamo-worker-vllm-decode` (or `dynamo-worker-sglang-decode` for SGLang)
- **Operation**: Processes the decode phase of generation
- **Duration**: Time to generate all output tokens (typically seconds)
- **Component**: `decode` or `backend`
- **Endpoint**: `generate`
- **Parent**: `http-request` span
This span represents the iterative token generation phase on a decode-specialized worker, which consumes the KV cache from prefill and produces output tokens.
#### Understanding Span Metrics
Each span provides several useful metrics:
| Metric | Description |
|--------|-------------|
| **Duration** | Total time from span start to end |
| **Busy Time** | Time actively processing (excluding waiting) |
| **Idle Time** | Time spent waiting (e.g., for network, other services) |
| **Start Time** | When the span began |
| **Child Count** | Number of direct child spans |
The relationship **Duration = Busy Time + Idle Time** helps identify where time is spent and potential bottlenecks.
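As a quick check of this relationship, the busy and idle times reported for the root `http-request` span in the example trace (18,101,350 ns and 4,022,100,356 ns) can be summed and compared with its reported duration of 4.04s. The helper name `span_breakdown` is hypothetical and exists only for this sketch.

```python
def span_breakdown(busy_ns, idle_ns):
    """Return total duration in seconds and the fraction of time spent busy."""
    total_ns = busy_ns + idle_ns
    return {
        "duration_s": total_ns / 1e9,
        "busy_fraction": busy_ns / total_ns,
    }


# Values taken from the root span in the example trace.
root = span_breakdown(busy_ns=18_101_350, idle_ns=4_022_100_356)
print(f"duration: {root['duration_s']:.2f}s, busy: {root['busy_fraction']:.2%}")
```

A busy fraction well below 1% on the frontend span suggests the frontend spends almost all of the request waiting on downstream workers, which is expected in a disaggregated setup.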
## Custom Request IDs in Logs
......
......@@ -9,7 +9,7 @@
This guide shows how to set up Prometheus and Grafana for visualizing Dynamo metrics on a single machine for demo purposes.
![Grafana Dynamo Dashboard](../../assets/img/grafana-dynamo-composite.png)
![Grafana Dynamo Dashboard](/assets/img/grafana-dynamo-composite.png)
**Components:**
- **Prometheus Server** - Collects and stores metrics from Dynamo services
......
......@@ -144,7 +144,7 @@ http://localhost:8000/v1/chat/completions
Below is an example of what a trace looks like in Grafana Tempo:
![Trace Example](../../assets/img/trace.png)
![Trace Example](/assets/img/trace.png)
### 6. Stop Services
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Finding Best Initial Configs using AIConfigurator
[AIConfigurator](https://github.com/ai-dynamo/aiconfigurator/tree/main) is a performance optimization tool that helps you find the optimal configuration for deploying LLMs with Dynamo. It automatically determines the best number of prefill and decode workers, parallelism settings, and deployment parameters to meet your SLA targets while maximizing throughput.
## Why Use AIConfigurator?
When deploying LLMs with Dynamo, you need to make several critical decisions:
- **Aggregated vs Disaggregated**: Which architecture gives better performance for your workload?
- **Worker Configuration**: How many prefill and decode workers to deploy?
- **Parallelism Settings**: What tensor/pipeline parallel configuration to use?
- **SLA Compliance**: How to meet your TTFT and TPOT targets?
AIConfigurator answers these questions in seconds, providing:
- Optimal configurations that meet your SLA requirements
- Ready-to-deploy Dynamo configuration files
- Performance comparisons between different deployment strategies
- Up to 1.7x better throughput compared to manual configuration
## Quick Start
```bash
# Install
pip3 install aiconfigurator
# Find optimal configuration.
#   --model       Model name (QWEN3_32B, LLAMA3.1_70B, etc.)
#   --total_gpus  Number of available GPUs
#   --system      GPU type (h100_sxm, h200_sxm, a100_sxm)
#   --isl         Input sequence length (tokens)
#   --osl         Output sequence length (tokens)
#   --ttft        Target Time To First Token (ms)
#   --tpot        Target Time Per Output Token (ms)
# Comments are kept above the command: a comment after a trailing
# backslash would break the line continuation.
aiconfigurator cli default \
  --model QWEN3_32B \
  --total_gpus 32 \
  --system h200_sxm \
  --isl 4000 \
  --osl 500 \
  --ttft 300 \
  --tpot 10 \
  --save_dir ./dynamo-configs
# Deploy
kubectl apply -f ./dynamo-configs/disagg/top1/disagg/k8s_deploy.yaml
```
## Example Output
```text
********************************************************************************
* Dynamo aiconfigurator Final Results *
********************************************************************************
----------------------------------------------------------------------------
Input Configuration & SLA Target:
Model: QWEN3_32B (is_moe: False)
Total GPUs: 32
Best Experiment Chosen: disagg at 812.92 tokens/s/gpu (1.70x better)
----------------------------------------------------------------------------
Overall Best Configuration:
- Best Throughput: 812.92 tokens/s/gpu
- User Throughput: 120.23 tokens/s/user
- TTFT: 276.76ms
- TPOT: 8.32ms
----------------------------------------------------------------------------
Pareto Frontier:
QWEN3_32B Pareto Frontier: tokens/s/gpu vs tokens/s/user
┌────────────────────────────────────────────────────────────────────────┐
1600.0┤ •• disagg │
│ ff agg │
│ xx disagg best │
│ │
1333.3┤ f │
│ ff │
│ ff • │
│ f •••••••• │
1066.7┤ f •• │
│ fff •••••••• │
│ f •• │
│ f •••• │
800.0┤ fffff •••x │
│ fff •• │
│ fff • │
│ fffff •• │
533.3┤ ffff •• │
│ ffff •• │
│ fffffff ••••• │
│ ffffff •• │
266.7┤ fffff ••••••••• │
│ ffffffffff │
│ f │
│ │
0.0┤ │
└┬─────────────────┬─────────────────┬────────────────┬─────────────────┬┘
0 60 120 180 240
tokens/s/gpu tokens/s/user
```
1. **Performance Comparison**: Shows disaggregated vs aggregated serving performance
2. **Optimal Configuration**: The best configuration that meets your SLA targets
3. **Deployment Files**: Ready-to-use Dynamo configuration files
## Key Features
### Fast Profiling Integration
```bash
# Use with Dynamo's SLA planner (20-30 seconds vs hours)
python3 -m benchmarks.profiler.profile_sla \
--config ./examples/backends/trtllm/deploy/disagg.yaml \
--backend trtllm \
--use-ai-configurator \
--aic-system h200_sxm \
--aic-model-name QWEN3_32B
```
### Custom Configuration
```bash
# For advanced users: define custom search space
aiconfigurator cli exp --yaml_path custom_config.yaml
```
## Common Use Cases
```bash
# Strict SLAs (low latency)
aiconfigurator cli default --model QWEN2.5_7B --total_gpus 8 --system h200_sxm --ttft 100 --tpot 5
# High throughput (relaxed latency)
aiconfigurator cli default --model QWEN3_32B --total_gpus 32 --system h200_sxm --ttft 1000 --tpot 50
```
## Supported Configurations
**Models**: GPT, LLAMA2/3, QWEN2.5/3, Mixtral, DEEPSEEK_V3
**GPUs**: H100, H200, A100, B200 (preview), GB200 (preview)
**Backend**: TensorRT-LLM (vLLM and SGLang coming soon)
## Additional Options
```bash
# Web interface
aiconfigurator webapp # Visit http://127.0.0.1:7860
# Docker
docker run -it --rm nvcr.io/nvidia/aiconfigurator:latest \
aiconfigurator cli default --model LLAMA3.1_70B --total_gpus 16 --system h100_sxm
```
## Troubleshooting
**Model name mismatch**: Use exact model name that matches your deployment
**GPU allocation**: Verify available GPUs match `--total_gpus`
**Performance variance**: Results are estimates - benchmark actual deployment
## Learn More
- [Dynamo Installation Guide](../kubernetes/installation-guide.md)
- [SLA Planner Quick Start Guide](../planner/sla-planner-quickstart.md)
- [Benchmarking Guide](../benchmarks/benchmarking.md)
\ No newline at end of file
......@@ -33,8 +33,9 @@ Typically, the number of GPUs vs the performance follows the following pattern:
| Maximum number limited by communication scalability | Worst overall throughput/GPU, best latency/user |
| More than maximum | Communication overhead dominates, poor performance |
> [!NOTE]
> [!Note]
> For decode-only engines, a larger number of GPUs sometimes leads to a larger KV cache per GPU and more decoding requests running in parallel, which improves both throughput/GPU and latency/user.
>
> For example, for Llama-3.3-70b NVFP4 quantization on B200 in vLLM with 0.9 free GPU memory fraction:
| TP Size | KV Cache Size (GB) | KV Cache per GPU (GB) | Per GPU Improvement over TP1 |
......@@ -46,7 +47,7 @@ Typically, the number of GPUs vs the performance follows the following pattern:
The best number of GPUs to use in the prefill and decode engines can be determined by running a few fixed ISL/OSL/concurrency tests using [AIPerf](https://github.com/ai-dynamo/aiperf/tree/main) and comparing the results with the SLA.
AIPerf is pre-installed in the dynamo container.
> [!TIP]
> [!Tip]
> If you are unfamiliar with AIPerf, please see this helpful [tutorial](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorial.md) to get you started.
Besides the parallelization mapping, other common knobs to tune are maximum batch size, maximum number of tokens, and block size.
......@@ -76,7 +77,7 @@ For most frameworks, when chunked prefill is enabled and one forward iteration g
In the prefill engine, the best strategy is to operate at the smallest batch size that saturates the GPUs so that the average time to first token (TTFT) is minimized.
For example, for Llama3.3-70b NVFP4 quantization on B200 TP1 in vLLM, the below figure shows the prefill time with different isl (prefix caching is turned off):
![Combined bar and line chart showing "Prefill Time". Bar chart represents TTFT (Time To First Token) in milliseconds against ISL (Input Sequence Length). The line chart shows TTFT/ISL (milliseconds per token) against ISL.](../../assets/img/prefill-time.png)
![Combined bar and line chart showing "Prefill Time". Bar chart represents TTFT (Time To First Token) in milliseconds against ISL (Input Sequence Length). The line chart shows TTFT/ISL (milliseconds per token) against ISL.](/assets/img/prefill-time.png)
For isl less than 1000, the prefill efficiency is low because the GPU is not fully saturated.
For isl larger than 4000, the prefill time per token increases because the attention takes longer to compute with a longer history.
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Load-based Planner
This document covers the load-based planner in `examples/llm/components/planner.py`.
> [!WARNING]
> The load-based planner is currently inoperable because none of the vllm, sglang, or trtllm examples use prefill queues. Please use the SLA planner for now.
> [!WARNING]
> Bare metal deployment with local connector is deprecated. The only option to deploy load-based planner is via k8s. We will update the examples in this document soon.
## Load-based Scaling Up/Down Prefill/Decode Workers
To adjust the number of prefill/decode workers, planner monitors the following metrics:
* Prefill worker: planner monitors the number of requests pending in the prefill queue to estimate the prefill workload.
* Decode/aggregated worker: planner monitors the average KV cache utilization rate to estimate the decode/aggregated workload.
Every `metric-pulling-interval`, the planner gathers these metrics. Every `adjustment-interval`, the planner compares the aggregated metrics from that interval with preset thresholds and decides whether to scale prefill/decode workers up or down. To avoid over-compensation, the planner changes the number of workers by at most 1 per adjustment interval. In addition, while the number of workers is being adjusted, the planner blocks metric pulling and adjustment.
To scale up a prefill/decode worker, the planner only needs to launch the worker in the correct namespace. The auto-discovery mechanism picks up the new workers and adds them to the routers. To scale down a prefill worker, the planner sends a SIGTERM signal to the prefill worker. The prefill worker records the signal and exits once it finishes the current request pulled from the prefill queue. This ensures that no remote prefill request is dropped. To scale down a decode worker, the planner revokes the etcd lease of the decode worker. When the etcd lease is revoked, the corresponding decode worker is immediately removed from the router and no longer receives new requests. The decode worker then finishes all current requests in their original streams and exits gracefully.
There are two additional rules set by planner to prevent over-compensation:
1. After a new decode worker is added, since it needs time to populate the kv cache, planner doesn't scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals.
1. The planner does not scale up prefill workers if the prefill queue size is estimated to drop below the `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals, following the trend observed in the current adjustment interval.
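The decode-side rules above can be sketched roughly as follows. This is an illustrative Python sketch, not the planner's actual code: the function name, the `intervals_since_scale_up` counter, and `min_workers` are assumptions, the thresholds mirror the documented defaults for `decode-kv-scale-up-threshold` (0.9) and `decode-kv-scale-down-threshold` (0.5), and the GPU budget check is omitted.

```python
NEW_DECODE_WORKER_GRACE_PERIOD = 3  # adjustment intervals, per the rule above


def decide_decode_workers(current, avg_kv_util, intervals_since_scale_up,
                          scale_up_threshold=0.9, scale_down_threshold=0.5,
                          min_workers=1):
    """Return the next decode worker count, changing it by at most one."""
    if avg_kv_util > scale_up_threshold:
        return current + 1
    # A freshly added worker needs time to populate its KV cache, so
    # scale-down is suppressed during the grace period (assumed semantics).
    in_grace = intervals_since_scale_up < NEW_DECODE_WORKER_GRACE_PERIOD
    if avg_kv_util < scale_down_threshold and not in_grace and current > min_workers:
        return current - 1
    return current


print(decide_decode_workers(2, 0.95, 10))  # high KV utilization: scale up
print(decide_decode_workers(3, 0.30, 1))   # low utilization, inside grace period
print(decide_decode_workers(3, 0.30, 3))   # low utilization, grace period over
```

Returning a worker count rather than a delta keeps the "change by at most 1" rule easy to verify in one place.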
## SLA-based Scaling Up/Down Prefill/Decode Workers
See [SLA-Driven Profiling](../benchmarks/sla-driven-profiling.md) for more details.
## Usage
The planner integration with the new frontend + worker architecture is currently a work in progress. This documentation will be updated with the new deployment patterns and code examples once the planner component has been fully adapted to the new workflow.
Configuration options:
* `namespace` (str, default: "dynamo"): Target namespace for planner operations
* `environment` (str, default: "local"): Target environment (local, kubernetes)
* `no-operation` (bool, default: false): Run in observation mode only
* `log-dir` (str, default: None): Tensorboard log directory
* `adjustment-interval` (int, default: 30): Seconds between adjustments
* `metric-pulling-interval` (int, default: 1): Seconds between metric pulls
* `max-gpu-budget` (int, default: 8): Maximum GPUs for all workers
* `min-gpu-budget` (int, default: 1): Minimum GPUs per worker type
* `decode-kv-scale-up-threshold` (float, default: 0.9): KV cache threshold for scale-up
* `decode-kv-scale-down-threshold` (float, default: 0.5): KV cache threshold for scale-down
* `prefill-queue-scale-up-threshold` (float, default: 0.5): Queue threshold for scale-up
* `prefill-queue-scale-down-threshold` (float, default: 0.2): Queue threshold for scale-down
* `decode-engine-num-gpu` (int, default: 1): GPUs per decode engine
* `prefill-engine-num-gpu` (int, default: 1): GPUs per prefill engine
Run as standalone process:
```bash
PYTHONPATH=/workspace/examples/llm python components/planner.py --namespace=dynamo --served-model-name=vllm --no-operation --log-dir=log/planner
```
Monitor metrics with Tensorboard:
```bash
tensorboard --logdir=<path-to-tensorboard-log-dir>
```
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# Planner
The planner monitors the state of the system and adjusts workers to
ensure that the system runs efficiently.
Currently, the planner can scale the number of vllm workers up and down
based on the kv cache load and prefill queue size.
Key features include:
- **SLA-based scaling** that uses predictive modeling and performance
interpolation to proactively meet TTFT and ITL targets
- **Graceful scaling** that ensures no requests are dropped during
scale-down operations
> [!TIP]
> **New to SLA Planner?** Start with the [SLA Planner Quick Start Guide](sla-planner-quickstart.md) for a complete, step-by-step workflow.
>
> **Prerequisites**: SLA-based planner requires pre-deployment profiling (2-4 hours on real silicon or a few minutes using a simulator) before deployment. The Quick Start guide includes everything you need.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# SLA-based Planner
> [!TIP]
> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Profiling + Planner Quick Start Guide](sla-planner-quickstart.md).
This document covers information regarding the SLA-based planner in `examples/common/utils/planner_core.py`.
The SLA (Service Level Agreement)-based planner is an intelligent autoscaling system that monitors system performance and adjusts the number of prefill and decode workers to meet specified TTFT and ITL targets. Unlike the load-based planner that scales based on resource utilization thresholds, the SLA planner uses predictive modeling and performance interpolation to proactively scale the workers.
> [!NOTE]
> Currently, SLA-based planner only supports disaggregated setup.
> [!WARNING]
> Bare metal deployment with local connector is deprecated. Please deploy the SLA planner in k8s.
## Architecture Overview
**Components:**
- **Frontend**: Serves requests and exposes `/metrics`
- **Prometheus**: Scrapes frontend metrics every 5s (by default, can be updated in the podmonitor manifest)
- **Planner**: Queries Prometheus and adjusts worker scaling every adjustment interval
- **Workers**: prefill and backend workers handle inference
The adjustment interval can be defined in the planner manifest as an argument. The default interval value can be found in this [file](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/planner/defaults.py).
```mermaid
flowchart LR
Frontend --"/metrics"--> Prometheus
Planner --"query API"--> Prometheus
Planner --"scaling decisions"--> Workers
Frontend -.->|"requests"| Workers
```
## Features
* **SLA-driven scaling**: Automatically scales prefill/decode workers to meet TTFT and ITL targets
* **Predictive load forecasting**: Uses ARIMA, Prophet, or constant predictors to forecast future load
* **Performance interpolation**: Leverages results from pre-deployment profiling for accurate scaling decisions
* **Correction factors**: Adapts to real-world performance deviations from profiled data
## Design
The SLA planner consists of several key components:
1. **Load Predictors**: Forecast future request patterns (number of requests, input/output sequence lengths)
2. **Performance Interpolators**: Estimate TTFT and ITL based on profiled performance data
3. **Correction Factors**: Adjust predictions based on observed vs. expected performance
4. **Scaling Logic**: Calculate optimal number of prefill/decode replicas to meet SLA targets
## SLA-Driven Pre-Deployment Profiling
**Prerequisite**: SLA-based planner requires pre-deployment profiling to be completed before deployment. The profiling process analyzes your model's performance characteristics to determine optimal tensor parallelism configurations and scaling parameters that the planner will use during operation.
See [Pre-Deployment Profiling](../benchmarks/sla-driven-profiling.md) for detailed instructions on running the profiling process.
## Load Prediction
The SLA planner uses a load predictor to predict the number of requests, ISL, and OSL in the next adjustment interval. Currently, three load prediction models are supported:
### Constant Predictor
- **Use case**: Stable and long prediction interval
- **Behavior**: Assumes next load equals current load
- **Configuration**: `load-predictor: "constant"`
### ARIMA Predictor
- **Use case**: Time-series data with trends and seasonality
- **Behavior**: Uses auto-ARIMA to fit optimal model parameters
- **Configuration**: `load-predictor: "arima"`
### Prophet Predictor
- **Use case**: Complex seasonal patterns and trend changes
- **Behavior**: Facebook's [Prophet](https://facebook.github.io/prophet/) model for time-series forecasting
- **Configuration**: `load-predictor: "prophet"`
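To make the simplest of the three concrete, a constant predictor can be sketched in a few lines of Python. The class and method names here are hypothetical and are not taken from the planner source; the sketch only illustrates the "next load equals current load" behavior described above.

```python
class ConstantPredictor:
    """Predicts that the next value equals the most recent observation."""

    def __init__(self):
        self._last = None

    def add_data_point(self, value):
        # Only the latest observation matters for a constant predictor.
        self._last = value

    def predict_next(self):
        if self._last is None:
            raise ValueError("no observations yet")
        return self._last


predictor = ConstantPredictor()
for num_requests in (100, 120, 90):
    predictor.add_data_point(num_requests)
print(predictor.predict_next())  # prints 90
```

The same interface could be used once per forecast quantity (request count, ISL, OSL), so the planner logic does not depend on which predictor is configured.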
## Scaling Algorithm
At each adjustment interval, the SLA planner performs the following operations:
### 1. Metric Collection
Every adjustment interval, collect:
- Average Time to First Token (TTFT)
- Average Inter-Token Latency (ITL)
- Request count and duration
- Input/Output sequence lengths
### 2. Correction Factor Calculation
Using the collected metrics, the SLA planner applies the interpolator to compute the expected TTFT/ITL and calibrates the interpolation model. This step is important because the actual TTFT/ITL often differs from the profiled ideal:
- **TTFT**: the actual TTFT depends heavily on request queueing and the prefix cache hit rate (if KV reuse is enabled). For example, if all requests arrive at the beginning of the adjustment interval, they queue heavily and TTFT is significantly higher. If the prefix cache hit rate is very high, the actual number of tokens in the prefill is very low and TTFT is significantly lower.
- **ITL**: the actual ITL may be affected by small chunked prefill requests in the decode engine.
- **Metric variances**: large variances in request rate, ISL, and OSL may lead to inaccurate TTFT/ITL estimates, since the SLA planner only considers the average when interpolating.
The SLA planner calculates the correction factors as:
- **Prefill correction**: `actual_ttft / expected_ttft`
- **Decode correction**: `actual_itl / expected_itl`
### 3. Load Prediction
The SLA planner forecasts the following metrics for the next interval using the load predictor:
- Number of requests
- Input sequence length
- Output sequence length
### 4. Calculating Number of Replicas
**Prefill replicas**: The SLA planner assumes the prefill correction factor has a linear effect on the prefill throughput per GPU, as prefill is single-batched.
```
predicted_load = next_requests * next_isl / interval * min(1, prefill_correction)
prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
```
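As a worked example of the prefill formula above, the following Python sketch plugs in made-up numbers. All input values are hypothetical, and `interpolated_throughput` is assumed to be in tokens/s per GPU; the arithmetic itself mirrors the two lines of pseudocode.

```python
import math


def prefill_replicas(next_requests, next_isl, interval_s, prefill_correction,
                     interpolated_throughput, gpus_per_engine):
    """Apply the prefill replica formula from the pseudocode above."""
    predicted_load = next_requests * next_isl / interval_s * min(1, prefill_correction)
    return math.ceil(predicted_load / interpolated_throughput / gpus_per_engine)


# Hypothetical inputs: 600 requests of 4000 input tokens in a 60 s interval.
prefill_correction = 450.0 / 300.0  # actual_ttft / expected_ttft = 1.5
replicas = prefill_replicas(
    next_requests=600,
    next_isl=4000,
    interval_s=60,
    prefill_correction=prefill_correction,
    interpolated_throughput=15000,  # tokens/s/GPU (assumed unit)
    gpus_per_engine=1,
)
print(replicas)  # 40000 tokens/s / 15000 tokens/s/GPU -> ceil(2.67) = 3
```

Because of the `min(1, ...)` clamp, a correction factor above 1 leaves the predicted load unchanged, while a factor below 1 scales it down.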
**Decode replicas**:
```
# 1. apply d_correction_factor to the ITL SLA
corrected_itl = self.args.itl / self.d_correction_factor
# 2. reversely find out what is best throughput/gpu that can achieve corrected_itl under the predicted context length
pred_decode_thpt_per_gpu = self.decode_interpolator.find_best_throughput_per_gpu(
itl=corrected_itl,
context_length=next_isl + next_osl / 2
)
# 3. compute number of decode replicas needed
next_num_d = math.ceil(next_num_req * next_osl / self.args.adjustment_interval / pred_decode_thpt_per_gpu / self.args.decode_engine_num_gpu)
```
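The decode calculation can be exercised the same way. In this Python sketch the decode interpolator is replaced by a hypothetical stand-in function that returns a fixed throughput, and every number is invented for illustration; only the three steps follow the pseudocode above.

```python
import math


def decode_replicas(itl_target_ms, d_correction_factor, next_num_req, next_isl,
                    next_osl, adjustment_interval_s, decode_engine_num_gpu,
                    find_best_throughput_per_gpu):
    """Follow the three decode steps from the pseudocode above."""
    # 1. Apply the decode correction factor to the ITL SLA.
    corrected_itl = itl_target_ms / d_correction_factor
    # 2. Look up the best throughput/GPU that still meets the corrected ITL.
    thpt_per_gpu = find_best_throughput_per_gpu(
        itl=corrected_itl, context_length=next_isl + next_osl / 2
    )
    # 3. Compute the number of decode replicas needed.
    return math.ceil(
        next_num_req * next_osl / adjustment_interval_s
        / thpt_per_gpu / decode_engine_num_gpu
    )


def fake_interpolator(itl, context_length):
    """Hypothetical stand-in for the profiled decode interpolator."""
    return 800.0  # tokens/s/GPU (assumed unit)


num_decode = decode_replicas(
    itl_target_ms=10, d_correction_factor=1.25, next_num_req=600,
    next_isl=4000, next_osl=500, adjustment_interval_s=60,
    decode_engine_num_gpu=2, find_best_throughput_per_gpu=fake_interpolator,
)
print(num_decode)  # 5000 tokens/s / 800 / 2 GPUs -> ceil(3.125) = 4
```

Passing the interpolator in as a callable keeps the arithmetic testable without profiling data.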
### 5. Scaling
Finally, the SLA planner applies the change by scaling the number of prefill and decode workers up or down to the calculated number of replicas for the next interval.
> [!NOTE]
> The SLA planner scales the P/D engines up and down in a non-blocking manner. If `adjustment-interval` is too short, the previous scaling operations may not finish before new scaling operations are issued. Make sure to set a large enough `adjustment-interval`.
## Deploying
For complete deployment instructions, see the [SLA Planner Quick Start Guide](sla-planner-quickstart.md).
> [!NOTE]
> The SLA planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with the number of requests, ISL, OSL, TTFT, and ITL in the correct format. The dynamo frontend provides these metrics automatically.
### Virtual Deployment
The SLA planner supports virtual deployment mode for customized environments (e.g., customized cluster) through the `VirtualConnector`. This connector enables the planner to communicate scaling decisions without directly managing the deployment infrastructure.
The `VirtualConnector` acts as a bridge between the SLA planner and external deployment environments. Instead of directly scaling Kubernetes resources, it writes scaling decisions and waits for the deployment environment to acknowledge completion.
#### Scaling Decision Flow
1. **Decision Generation**: The planner calculates optimal worker counts
2. **Change Detection**: The planner skips scaling if the target counts match current counts, logging: `"No scaling needed (prefill=X, decode=Y)"`
3. **Readiness Check**: Before making new decisions, the planner verifies that previous scaling operations have completed by checking if `scaled_decision_id >= decision_id`
4. **Timeout Handling**: If a scaling decision isn't acknowledged within 30 minutes (1800 seconds), the planner proceeds with new decisions anyway
5. **Completion Tracking**: The planner can optionally wait for scaling completion confirmation (blocking mode)
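The readiness check and timeout in steps 3 and 4 can be sketched as a single predicate. This Python sketch is an assumption-laden illustration: the function name, the `issued_at` timestamp, and the use of a monotonic clock are not taken from the planner source; only the `scaled_decision_id >= decision_id` comparison and the 1800-second timeout come from the text above.

```python
import time

SCALING_TIMEOUT_S = 1800  # 30 minutes, as described above


def ready_for_new_decision(decision_id, scaled_decision_id, issued_at, now=None):
    """Return True if the planner may issue a new scaling decision."""
    now = time.monotonic() if now is None else now
    if scaled_decision_id >= decision_id:
        return True  # the previous decision has been acknowledged
    # Not acknowledged yet: proceed anyway once the timeout has elapsed.
    return (now - issued_at) >= SCALING_TIMEOUT_S


print(ready_for_new_decision(5, 5, issued_at=0.0, now=10.0))    # acknowledged
print(ready_for_new_decision(5, 4, issued_at=0.0, now=10.0))    # still pending
print(ready_for_new_decision(5, 4, issued_at=0.0, now=1800.0))  # timed out
```

The optional `now` argument exists only so the predicate can be tested without waiting on a real clock.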
#### Configuration
To use virtual deployment mode:
```yaml
environment: "virtual"
backend: "vllm" # or "sglang"
```
#### Deployment Environment Requirements
The external deployment environment must use `VirtualConnectorClient`:
```python
from dynamo._core import DistributedRuntime, VirtualConnectorClient
client = VirtualConnectorClient(distributed_runtime, namespace)
```
1. **Monitor Planner**: Continuously watch for scaling decisions: `await client.wait()`. This blocks until there is a change.
2. **Parse Decisions**: Read `num_prefill_workers` and `num_decode_workers` values: `decision = await client.get()`
3. **Execute Scaling**: Apply the scaling decisions to the actual deployment infrastructure
4. **Acknowledge Completion**: Mark the decision completed when scaling is finished: `await client.complete(decision)`
A scaling decision (returned by `client.get()`) contains the following fields, which are -1 if not set yet:
- `num_prefill_workers`: Integer specifying the target number of prefill workers
- `num_decode_workers`: Integer specifying the target number of decode workers
- `decision_id`: Integer with incremental ID for each scaling decision
See `components/planner/test/test_virtual_connector.py` for a full example.
......@@ -3,8 +3,6 @@
# SPDX-License-Identifier: Apache-2.0
---
# Dynamo Run
`dynamo-run` is a Rust binary that lets you easily run a model and explore the Dynamo components. It also demonstrates the Rust API. It supports the `mistral.rs` engine, as well as the testing engines `echo` and `mocker`.
It is primarily for development and rapid prototyping. For production use, we recommend the Python-wrapped components; see the main project README.
......
......@@ -17,24 +17,24 @@ This document provides a comprehensive compatibility matrix for key Dynamo featu
| Feature | vLLM | TensorRT-LLM | SGLang | Source |
| :--- | :---: | :---: | :---: | :--- |
| **Disaggregated Serving** | ✅ | ✅ | ✅ | [Design Doc](../design-docs/disagg-serving.md) |
| **KV-Aware Routing** | ✅ | ✅ | ✅ | [Router Doc](../router/kv-cache-routing.md) |
| **SLA-Based Planner** | ✅ | ✅ | ✅ | [Planner Doc](../planner/planner-intro.md) |
| **KV Block Manager** | ✅ | ✅ | 🚧 | [KVBM Doc](../kvbm/kvbm-intro.md) |
| **Multimodal (Image)** | ✅ | ✅ | ✅ | [Multimodal Doc](../multimodal/index.md) |
| **Multimodal (Video)** | ✅ | | | [Multimodal Doc](../multimodal/index.md) |
| **Multimodal (Audio)** | 🚧 | | | [Multimodal Doc](../multimodal/index.md) |
| **Request Migration** | ✅ | 🚧 | ✅ | [Migration Doc](../fault-tolerance/request-migration.md) |
| **Disaggregated Serving** | ✅ | ✅ | ✅ | [Design Doc][disagg] |
| **KV-Aware Routing** | ✅ | ✅ | ✅ | [Router Doc][kv-routing] |
| **SLA-Based Planner** | ✅ | ✅ | ✅ | [Planner Doc][planner] |
| **KV Block Manager** | ✅ | ✅ | 🚧 | [KVBM Doc][kvbm] |
| **Multimodal (Image)** | ✅ | ✅ | ✅ | [Multimodal Doc][mm] |
| **Multimodal (Video)** | ✅ | | | [Multimodal Doc][mm] |
| **Multimodal (Audio)** | 🚧 | | | [Multimodal Doc][mm] |
| **Request Migration** | ✅ | 🚧 | ✅ | [Migration Doc][migration] |
| **Request Cancellation** | ✅ | ✅ | 🚧 | Backend READMEs |
| **LoRA** | ✅ | | | [K8s Guide](../kubernetes/deployment/dynamomodel-guide.md) |
| **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc](../agents/tool-calling.md) |
| **LoRA** | ✅ | | | [K8s Guide][lora] |
| **Tool Calling** | ✅ | ✅ | ✅ | [Tool Calling Doc][tools] |
| **Speculative Decoding** | ✅ | ✅ | 🚧 | Backend READMEs |
## 1. vLLM Backend
vLLM offers the broadest feature coverage in Dynamo, with full support for disaggregated serving, KV-aware routing, KV block management, LoRA adapters, and multimodal inference including video and audio.
*Source: [docs/backends/vllm/README.md](../backends/vllm/README.md)*
*Source: [docs/backends/vllm/README.md][vllm-readme]*
| Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
......@@ -50,17 +50,17 @@ vLLM offers the broadest feature coverage in Dynamo, with full support for disag
| **Speculative Decoding** | ✅ | ✅ | — | ✅ | — | ✅ | ✅ | — | ✅ | — |
> **Notes:**
> 1. **Multimodal + KV-Aware Routing**: The KV router uses token-based hashing and does not yet support image/video hashes, so it falls back to random/round-robin routing. ([Source](../router/kv-cache-routing.md))
> 1. **Multimodal + KV-Aware Routing**: The KV router uses token-based hashing and does not yet support image/video hashes, so it falls back to random/round-robin routing. ([Source][kv-routing])
> 2. **KV-Aware LoRA Routing**: vLLM supports routing requests based on LoRA adapter affinity.
> 3. **Audio Support**: vLLM supports audio models like Qwen2-Audio (experimental). ([Source](../multimodal/vllm.md))
> 4. **Video Support**: vLLM supports video input with frame sampling. ([Source](../multimodal/vllm.md))
> 5. **Speculative Decoding**: Eagle3 support documented. ([Source](../backends/vllm/speculative-decoding.md))
> 3. **Audio Support**: vLLM supports audio models like Qwen2-Audio (experimental). ([Source][mm-vllm])
> 4. **Video Support**: vLLM supports video input with frame sampling. ([Source][mm-vllm])
> 5. **Speculative Decoding**: Eagle3 support documented. ([Source][vllm-spec])
## 2. SGLang Backend
SGLang is optimized for high-throughput serving with fast primitives, providing robust support for disaggregated serving, KV-aware routing, and request migration.
*Source: [docs/backends/sglang/README.md](../backends/sglang/README.md)*
*Source: [docs/backends/sglang/README.md][sglang-readme]*
| Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
......@@ -76,16 +76,16 @@ SGLang is optimized for high-throughput serving with fast primitives, providing
| **Speculative Decoding** | 🚧 | 🚧 | — | 🚧 | — | 🚧 | — | | 🚧 | — |
> **Notes:**
> 1. **Multimodal + KV-Aware Routing**: Not supported. ([Source](../router/kv-cache-routing.md))
> 2. **Multimodal Patterns**: Supports **E/PD** and **E/P/D** only (requires separate vision encoder). Does **not** support simple Aggregated (EPD) or Traditional Disagg (EP/D). ([Source](../multimodal/sglang.md))
> 3. **Request Cancellation**: Cancellation during the remote prefill phase is not supported in disaggregated mode. ([Source](../backends/sglang/README.md))
> 1. **Multimodal + KV-Aware Routing**: Not supported. ([Source][kv-routing])
> 2. **Multimodal Patterns**: Supports **E/PD** and **E/P/D** only (requires separate vision encoder). Does **not** support simple Aggregated (EPD) or Traditional Disagg (EP/D). ([Source][mm-sglang])
> 3. **Request Cancellation**: Cancellation during the remote prefill phase is not supported in disaggregated mode. ([Source][sglang-readme])
> 4. **Speculative Decoding**: Code hooks exist (`spec_decode_stats` in publisher), but no examples or documentation yet.
## 3. TensorRT-LLM Backend
TensorRT-LLM delivers maximum inference performance and optimization, with full KVBM integration and robust disaggregated serving support.
*Source: [docs/backends/trtllm/README.md](../backends/trtllm/README.md)*
*Source: [docs/backends/trtllm/README.md][trtllm-readme]*
| Feature | Disaggregated Serving | KV-Aware Routing | SLA-Based Planner | KV Block Manager | Multimodal | Request Migration | Request Cancellation | LoRA | Tool Calling | Speculative Decoding |
| :--- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
......@@ -94,14 +94,41 @@ TensorRT-LLM delivers maximum inference performance and optimization, with full
| **SLA-Based Planner** | ✅ | ✅ | — | | | | | | | |
| **KV Block Manager** | ✅ | ✅ | ✅ | — | | | | | | |
| **Multimodal** | ✅<sup>1</sup> | <sup>2</sup> | — | ✅ | — | | | | | |
| **Request Migration** | 🚧<sup>3</sup> | ✅ | ✅ | ✅ | 🚧 | — | | | | |
| **Request Cancellation** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | — | | | |
| **Request Migration** | | ✅ | ✅ | ✅ | 🚧 | — | | | | |
| **Request Cancellation** | ✅<sup>3</sup> | ✅<sup>3</sup> | ✅<sup>3</sup> | ✅<sup>3</sup> | ✅<sup>3</sup> | ✅<sup>3</sup> | — | | | |
| **LoRA** | | | | | | | | — | | |
| **Tool Calling** | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | | — | |
| **Speculative Decoding** | ✅ | ✅ | — | ✅ | — | ✅ | ✅ | | ✅ | — |
> **Notes:**
> 1. **Multimodal Disaggregation**: Fully supports **EP/D** (Traditional) pattern. **E/P/D** (Full Disaggregation) is WIP and currently supports pre-computed embeddings only. ([Source](../multimodal/trtllm.md))
> 2. **Multimodal + KV-Aware Routing**: Not supported. The KV router currently tracks token-based blocks only. ([Source](../router/kv-cache-routing.md))
> 3. **Request Migration**: Supported on **Decode/Aggregated** workers only. **Prefill** workers do not support migration. ([Source](../backends/trtllm/README.md))
> 4. **Speculative Decoding**: Llama 4 + Eagle support documented. ([Source](../backends/trtllm/llama4-plus-eagle.md))
> 1. **Multimodal Disaggregation**: Fully supports **EP/D** (Traditional) pattern. **E/P/D** (Full Disaggregation) is WIP and currently supports pre-computed embeddings only. ([Source][mm-trtllm])
> 2. **Multimodal + KV-Aware Routing**: Not supported. The KV router currently tracks token-based blocks only. ([Source][kv-routing])
> 3. **Request Cancellation**: Due to known issues, the TensorRT-LLM engine is temporarily not notified of request cancellations, meaning allocated resources for cancelled requests are not freed.
---
## Source References
{/* Backend READMEs */}
[vllm-readme]: docs/backends/vllm/README.md
[sglang-readme]: docs/backends/sglang/README.md
[trtllm-readme]: docs/backends/trtllm/README.md
{/* Design Docs */}
[disagg]: docs/design_docs/disagg_serving.md
[kv-routing]: docs/components/router/router_guide.md
[planner]: docs/components/planner/README.md
[kvbm]: docs/components/kvbm/README.md
[migration]: docs/fault_tolerance/request_migration.md
[tools]: docs/agents/tool-calling.md
{/* Multimodal */}
[mm]: docs/features/multimodal/README.md
[mm-vllm]: docs/features/multimodal/multimodal_vllm.md
[mm-trtllm]: docs/features/multimodal/multimodal_trtllm.md
[mm-sglang]: docs/features/multimodal/multimodal_sglang.md
{/* Feature-specific */}
[lora]: docs/kubernetes/deployment/dynamomodel-guide.md
[vllm-spec]: docs/features/speculative_decoding/speculative_decoding_vllm.md
[trtllm-eagle]: docs/backends/trtllm/llama4_plus_eagle.md
......@@ -3,8 +3,6 @@
# SPDX-License-Identifier: Apache-2.0
---
# NVIDIA Dynamo Glossary
## B
**Block** - A fixed-size chunk of tokens (typically 16 or 64 tokens) used for efficient KV cache management and memory allocation, serving as the fundamental unit for techniques like PagedAttention.
......
......@@ -7,7 +7,7 @@
This document provides a comprehensive inventory of all Dynamo release artifacts including container images, Python wheels, Helm charts, and Rust crates.
**See also:** [Support Matrix](support-matrix.md) for hardware and platform compatibility | [Feature Matrix](feature-matrix.md) for backend feature support
> **See also:** [Support Matrix](support-matrix.md) for hardware and platform compatibility | [Feature Matrix](feature-matrix.md) for backend feature support
Release history in this document begins at v0.6.0.
......@@ -74,7 +74,7 @@ We recommend using the TensorRT-LLM NGC container instead of the `ai-dynamo[trtl
### Container Images (NGC)
For detailed run instructions, see the [Container README](https://github.com/ai-dynamo/dynamo/blob/main/container/README.md) or backend-specific guides: [vLLM](../backends/vllm/README.md) | [SGLang](../backends/sglang/README.md) | [TensorRT-LLM](../backends/trtllm/README.md)
> For detailed run instructions, see the [Container README](https://github.com/ai-dynamo/dynamo/tree/main/container/README.md) or backend-specific guides: [vLLM](../backends/vllm/README.md) | [SGLang](../backends/sglang/README.md) | [TensorRT-LLM](../backends/trtllm/README.md)
```bash
# Runtime containers
......@@ -94,7 +94,7 @@ docker pull nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.8.1
### Python Wheels (PyPI)
For detailed installation instructions, see the [Local Quick Start](https://github.com/ai-dynamo/dynamo#local-quick-start) in the README.
> For detailed installation instructions, see the [Local Quick Start](https://github.com/ai-dynamo/dynamo#local-quick-start) in the README.
```bash
# Install Dynamo with a specific backend (Recommended)
......@@ -112,7 +112,7 @@ uv pip install kvbm==0.8.1
### Helm Charts (NGC)
For Kubernetes deployment instructions, see the [Kubernetes Installation Guide](../kubernetes/installation-guide.md).
> For Kubernetes deployment instructions, see the [Kubernetes Installation Guide](../kubernetes/installation-guide.md).
```bash
helm install dynamo-crds oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dynamo-crds --version 0.8.1
......@@ -122,7 +122,7 @@ helm install dynamo-graph oci://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/dyna
### Rust Crates (crates.io)
For API documentation, see each crate on [docs.rs](https://docs.rs/). To build Dynamo from source, see [Building from Source](https://github.com/ai-dynamo/dynamo#building-from-source).
> For API documentation, see each crate on [docs.rs](https://docs.rs/). To build Dynamo from source, see [Building from Source](https://github.com/ai-dynamo/dynamo#building-from-source).
```bash
cargo add dynamo-runtime@0.8.1
......@@ -166,17 +166,17 @@ For a complete list of known issues, refer to the release notes for each patch:
|---------|--------------|--------|------|
| `v0.8.1` | Jan 23, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.8.1) | [Docs](https://docs.nvidia.com/dynamo/archive/0.8.1/index.html) |
| `v0.8.0` | Jan 15, 2026 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.8.0) | [Docs](https://docs.nvidia.com/dynamo/archive/0.8.0/index.html) |
| `v0.7.1` | Dec 15, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.7.1) | |
| `v0.7.1` | Dec 15, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.7.1) | [Docs](https://docs.nvidia.com/dynamo/archive/0.7.1/index.html) |
| `v0.7.0` | Nov 26, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.7.0) | [Docs](https://docs.nvidia.com/dynamo/archive/0.7.0/index.html) |
| `v0.6.1` | Nov 6, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.6.1) | [Docs](https://docs.nvidia.com/dynamo/archive/0.6.1/index.html) |
| `v0.6.0` | Oct 28, 2025 | [Release](https://github.com/ai-dynamo/dynamo/releases/tag/v0.6.0) | [Docs](https://docs.nvidia.com/dynamo/archive/0.6.0/index.html) |
### Container Images
**NGC Collection:** [ai-dynamo](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo)
To access a specific version, append `?version=TAG` to the container URL:
`https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/{container}?version={tag}`
> **NGC Collection:** [ai-dynamo](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo)
>
> To access a specific version, append `?version=TAG` to the container URL:
> `https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/{container}?version={tag}`
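As a small illustration of the URL pattern above, a helper could fill in the two placeholders. The function name `ngc_container_url` is hypothetical; the template string is the one quoted above.

```python
NGC_CONTAINER_URL = (
    "https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/"
    "containers/{container}?version={tag}"
)


def ngc_container_url(container: str, tag: str) -> str:
    """Build the catalog URL for one container at one version tag."""
    return NGC_CONTAINER_URL.format(container=container, tag=tag)
```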
#### vllm-runtime
......@@ -245,9 +245,9 @@ To access a specific version, append `?version=TAG` to the container URL:
### Python Wheels
**PyPI:** [ai-dynamo](https://pypi.org/project/ai-dynamo/) | [ai-dynamo-runtime](https://pypi.org/project/ai-dynamo-runtime/) | [kvbm](https://pypi.org/project/kvbm/)
To access a specific version: `https://pypi.org/project/{package}/{version}/`
> **PyPI:** [ai-dynamo](https://pypi.org/project/ai-dynamo/) | [ai-dynamo-runtime](https://pypi.org/project/ai-dynamo-runtime/) | [kvbm](https://pypi.org/project/kvbm/)
>
> To access a specific version: `https://pypi.org/project/{package}/{version}/`
#### ai-dynamo (wheel)
......@@ -284,9 +284,9 @@ To access a specific version: `https://pypi.org/project/{package}/{version}/`
### Helm Charts
**NGC Helm Registry:** [ai-dynamo](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo)
Direct download: `https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/{chart}-{version}.tgz`
> **NGC Helm Registry:** [ai-dynamo](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo)
>
> Direct download: `https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/{chart}-{version}.tgz`
#### dynamo-crds (Helm chart)
......@@ -323,9 +323,9 @@ Direct download: `https://helm.ngc.nvidia.com/nvidia/ai-dynamo/charts/{chart}-{v
### Rust Crates
**crates.io:** [dynamo-runtime](https://crates.io/crates/dynamo-runtime) | [dynamo-llm](https://crates.io/crates/dynamo-llm) | [dynamo-async-openai](https://crates.io/crates/dynamo-async-openai) | [dynamo-parsers](https://crates.io/crates/dynamo-parsers) | [dynamo-memory](https://crates.io/crates/dynamo-memory) | [dynamo-config](https://crates.io/crates/dynamo-config)
To access a specific version: `https://crates.io/crates/{crate}/{version}`
> **crates.io:** [dynamo-runtime](https://crates.io/crates/dynamo-runtime) | [dynamo-llm](https://crates.io/crates/dynamo-llm) | [dynamo-async-openai](https://crates.io/crates/dynamo-async-openai) | [dynamo-parsers](https://crates.io/crates/dynamo-parsers) | [dynamo-memory](https://crates.io/crates/dynamo-memory) | [dynamo-config](https://crates.io/crates/dynamo-config)
>
> To access a specific version: `https://crates.io/crates/{crate}/{version}`
#### dynamo-runtime (crate)
......
......@@ -13,24 +13,43 @@ This document provides the support matrix for Dynamo, including hardware, softwa
The following table shows the backend framework versions included with each Dynamo release:
| **Dependency** | **main (ToT)** | **v0.8.1.post1** | **v0.8.1 (latest)** | **v0.8.0** | **v0.7.1** | **v0.7.0.post1** | **v0.7.0** |
| :------------- | :------------- | :--------------- | :------------------ | :--------- | :--------- | :--------------- | :--------- |
| vLLM | `0.14.0` | `0.12.0` | `0.12.0` | `0.12.0` | `0.11.0` | `0.11.0` | `0.11.0` |
| SGLang | `0.5.8` | `0.5.6.post2` | `0.5.6.post2` | `0.5.6.post2` | `0.5.3.post4` | `0.5.3.post4` | `0.5.3.post4` |
| TensorRT-LLM | `1.2.0rc6.post2` | `1.2.0rc6.post2` | `1.2.0rc6.post1` | `1.2.0rc6.post1` | `1.2.0rc3` | `1.2.0rc3` | `1.2.0rc2` |
| NIXL | `0.9.0` | `0.8.0` | `0.8.0` | `0.8.0` | `0.8.0` | `0.8.0` | `0.8.0` |
**main (ToT)** reflects the current development branch. **v0.8.1.post1** is a patch release for PyPI wheels and TRT-LLM container only (no GitHub release).
> [!WARNING]
> Currently TensorRT-LLM does not support Python 3.11 so installation of the ai-dynamo[trtllm] Python wheel will fail.
| **Dynamo Version** | **SGLang** | **TensorRT-LLM** | **vLLM** |
| :----------------- | :------------------------ | :--------------- | :----------------------- |
| **Dynamo 0.8.1** | CUDA 12.9, CUDA 13.0 (Experimental) | CUDA 13.0 | CUDA 12.9, CUDA 13.0 (Experimental) |
| **Dynamo 0.8.0** | CUDA 12.9, CUDA 13.0 (Experimental) | CUDA 13.0 | CUDA 12.9, CUDA 13.0 (Experimental) |
| **Dynamo 0.7.1** | CUDA 12.8 | CUDA 13.0 | CUDA 12.9 |
| **Dynamo 0.7.0** | CUDA 12.9 | CUDA 13.0 | CUDA 12.8 |
| **Dynamo** | **vLLM** | **SGLang** | **TensorRT-LLM** | **NIXL** |
| :--- | :--- | :--- | :--- | :--- |
| **main (ToT)** | `0.14.1` | `0.5.8` | `1.3.0rc1` | `0.9.0` |
| **v1.0.0** *(planned)* | `0.15.0` | *Latest as of 2/17* | *Latest as of 2/17* | `0.10.0` |
| **v0.9.0** *(in progress)* | `0.14.1` | `0.5.8` | `1.3.0rc1` | `0.9.0` |
| **v0.8.1.post3** *(in progress)* | `0.12.0` | `0.5.6.post2` | `1.2.0rc6.post3` | `0.8.0` |
| **v0.8.1.post2** | `0.12.0` | `0.5.6.post2` | `1.2.0rc6.post2` | `0.8.0` |
| **v0.8.1.post1** | `0.12.0` | `0.5.6.post2` | `1.2.0rc6.post1` | `0.8.0` |
| **v0.8.1** | `0.12.0` | `0.5.6.post2` | `1.2.0rc6.post1` | `0.8.0` |
| **v0.8.0** | `0.12.0` | `0.5.6.post2` | `1.2.0rc6.post1` | `0.8.0` |
| **v0.7.1** | `0.11.0` | `0.5.4.post3` | `1.2.0rc3` | `0.8.0` |
| **v0.7.0.post1** | `0.11.0` | `0.5.4.post3` | `1.2.0rc3` | `0.8.0` |
| **v0.7.0** | `0.11.0` | `0.5.4.post3` | `1.2.0rc2` | `0.8.0` |
| **v0.6.1.post1** | `0.11.0` | `0.5.3.post2` | `1.1.0rc5` | `0.6.0` |
| **v0.6.1** | `0.11.0` | `0.5.3.post2` | `1.1.0rc5` | `0.6.0` |
| **v0.6.0** | `0.11.0` | `0.5.3.post2` | `1.1.0rc5` | `0.6.0` |
### Version Labels
- **main (ToT)** reflects the current development branch.
- Releases marked *(in progress)* or *(planned)* show target versions that may change before final release.
### Version Compatibility
- Backend versions listed are the only versions tested and supported for each release.
- TensorRT-LLM does not support Python 3.11; installation of the `ai-dynamo[trtllm]` wheel will fail on Python 3.11.
### CUDA Versions by Backend
| **Dynamo** | **vLLM** | **SGLang** | **TensorRT-LLM** | **Notes** |
| :--- | :--- | :--- | :--- | :--- |
| **v0.8.1** | `12.9`, `13.0` | `12.9`, `13.0` | `13.0` | Experimental vLLM/SGLang CUDA 13 support |
| **v0.8.0** | `12.9`, `13.0` | `12.9`, `13.0` | `13.0` | Experimental vLLM/SGLang CUDA 13 support |
| **v0.7.1** | `12.9` | `12.8` | `13.0` | |
| **v0.7.0** | `12.8` | `12.9` | `13.0` | TensorRT-LLM CUDA 13 support - CUDA 12.9 deprecated |
| **v0.6.1** | `12.8` | `12.9` | `12.9` | |
| **v0.6.0** | `12.8` | `12.8` | `12.9` | |
Patch versions (e.g., v0.8.1.post1, v0.7.0.post1) have the same CUDA support as their base version.
......@@ -69,7 +88,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
Wheels are built using a manylinux_2_28-compatible environment and validated on CentOS Stream 9 and Ubuntu (22.04, 24.04). Compatibility with other Linux distributions is expected but not officially verified.
> [!CAUTION]
> [!Caution]
> KV Block Manager is supported only with Python 3.12. Python 3.12 support is currently limited to Ubuntu 24.04.
## Software Compatibility
......@@ -119,7 +138,7 @@ For extended driver compatibility beyond the minimum versions listed above, cons
| :------------------------ | :---------- | :--------------- | :--------- |
| **Amazon Linux** | 2023 | x86_64 | Supported |
> [!CAUTION]
> [!Caution]
> **AL2023 TensorRT-LLM Limitation:** There is a known issue with the TensorRT-LLM framework when running the AL2023 container locally with `docker run --network host ...` due to a [bug](https://github.com/mpi4py/mpi4py/discussions/491#discussioncomment-12660609) in mpi4py. To avoid this issue, replace the `--network host` flag with more precise networking configuration by mapping only the necessary ports (e.g., 4222 for nats, 2379/2380 for etcd, 8000 for frontend).
## Build Support
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
# KV Router
## Overview
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks). Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
## Quick Start
### Python / CLI Deployment
To launch the Dynamo frontend with the KV Router:
```bash
python -m dynamo.frontend --router-mode kv --http-port 8000
```
This command:
- Launches the Dynamo frontend service with KV routing enabled
- Exposes the service on port 8000 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint
Backend workers register themselves using the `register_llm` API, after which the KV Router automatically:
- Tracks the state of all registered workers
- Makes routing decisions based on KV cache overlap
- Balances load across available workers
### Kubernetes Deployment
To enable the KV Router in a Kubernetes deployment, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: my-deployment
spec:
services:
Frontend:
dynamoNamespace: my-namespace
componentType: frontend
replicas: 1
envs:
- name: DYN_ROUTER_MODE
value: kv # Enable KV Smart Router
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
Worker:
# ... worker configuration ...
```
**Key Points:**
- Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only
- Workers automatically report KV cache events to the router
- No worker-side configuration changes needed
**Complete K8s Examples:**
- [TRT-LLM aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/agg_router.yaml)
- [vLLM aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg_router.yaml)
- [SGLang aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/agg_router.yaml)
- [Distributed inference tutorial](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/kubernetes/Distributed_Inference/agg_router.yaml)
**For A/B Testing and Advanced K8s Setup:**
See the comprehensive [KV Router A/B Benchmarking Guide](../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
## Configuration Options
### CLI Arguments (Python Deployment)
The KV Router supports several key configuration options:
- **`--router-mode kv`**: Enable KV cache-aware routing (required)
- **`--kv-cache-block-size <size>`**: Sets the KV cache block size (default: backend-specific). Larger blocks reduce overlap detection granularity but improve memory efficiency. This should match your backend configuration.
- **`--router-temperature <float>`**: Controls routing randomness (default: 0.0)
- `0.0`: Deterministic selection of the best worker
- `> 0.0`: Probabilistic selection using softmax sampling
- Higher values increase randomness, helping prevent worker saturation
- **`--kv-events` / `--no-kv-events`**: Controls how the router tracks cached blocks (default: `--kv-events`)
- `--kv-events`: Uses real-time events from workers for accurate cache tracking
- `--no-kv-events`: Uses approximation based on routing decisions (lower overhead, less accurate)
- **`--kv-overlap-score-weight <float>`**: Balance between prefill and decode optimization (default: 1.0)
- Higher values (> 1.0): Prioritize reducing prefill cost (better TTFT)
- Lower values (< 1.0): Prioritize decode performance (better ITL)
For a complete list of available options:
```bash
python -m dynamo.frontend --help
```
### Kubernetes Environment Variables
All CLI arguments can be configured via environment variables in Kubernetes deployments. Use the `DYN_` prefix followed by the parameter name in uppercase, with hyphens replaced by underscores:
| CLI Argument | K8s Environment Variable | Default | Description |
|--------------|-------------------------|---------|-------------|
| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round_robin` | Enable KV router |
| `--router-temperature <float>` | `DYN_ROUTER_TEMPERATURE=<float>` | `0.0` | Routing randomness |
| `--kv-cache-block-size <size>` | `DYN_KV_CACHE_BLOCK_SIZE=<size>` | Backend-specific | KV cache block size |
| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` | Disable KV event tracking |
| `--kv-overlap-score-weight <float>` | `DYN_KV_OVERLAP_SCORE_WEIGHT=<float>` | `1.0` | Prefill vs decode weight |
| `--http-port <port>` | `DYN_HTTP_PORT=<port>` | `8000` | HTTP server port |
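The naming rule in the table can be expressed as a short helper. This is only a sketch of the convention shown above, and `cli_flag_to_env_var` is a hypothetical name. Negated flags are an exception: per the table, `--no-kv-events` maps to `DYN_KV_EVENTS=false`, which this helper does not handle.

```python
def cli_flag_to_env_var(flag: str) -> str:
    """Map a frontend CLI flag such as `--router-temperature` to its env var.

    Covers only the plain `--name` form; negated `--no-*` flags follow a
    different rule (see the table) and are out of scope for this sketch.
    """
    name = flag.lstrip("-").replace("-", "_").upper()
    return f"DYN_{name}"
```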
### Example with Advanced Configuration
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: my-deployment
spec:
services:
Frontend:
dynamoNamespace: my-namespace
componentType: frontend
replicas: 1
envs:
- name: DYN_ROUTER_MODE
value: kv
- name: DYN_ROUTER_TEMPERATURE
value: "0.5" # Add some randomness to prevent worker saturation
- name: DYN_KV_OVERLAP_SCORE_WEIGHT
value: "1.5" # Prioritize TTFT over ITL
- name: DYN_KV_CACHE_BLOCK_SIZE
value: "16"
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
```
### Alternative: Using Command Args in K8s
You can also pass CLI arguments directly in the container command:
```yaml
extraPodSpec:
mainContainer:
image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
command:
- /bin/sh
- -c
args:
- "python3 -m dynamo.frontend --router-mode kv --router-temperature 0.5 --http-port 8000"
```
**Recommendation:** Use environment variables for easier configuration management and consistency with Dynamo's K8s patterns.
## KV Router Architecture
The KV Router tracks two key metrics for each worker:
1. **Potential Active Blocks**: The number of blocks that would be used for decoding if a request is routed to a worker. This includes both existing active blocks and new blocks from the incoming request.
2. **Potential New Prefill Blocks**: The amount of prefill work, measured in blocks, that a worker would have to compute from scratch, calculated as:
- New prefill tokens = Total input tokens - (Overlap blocks × Block size)
- Potential prefill blocks = New prefill tokens / Block size
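As a worked example of the two formulas above, the following sketch computes the potential prefill blocks for one worker. The function name and the sample numbers are illustrative assumptions, not values taken from the router.

```python
def potential_prefill_blocks(
    total_input_tokens: int, overlap_blocks: int, block_size: int
) -> float:
    """Blocks of prefill work left after subtracting the cached overlap."""
    new_prefill_tokens = total_input_tokens - overlap_blocks * block_size
    return new_prefill_tokens / block_size


# 1000 input tokens, 15 cached blocks of 16 tokens each:
# 1000 - 15 * 16 = 760 new tokens, and 760 / 16 = 47.5 blocks.
example = potential_prefill_blocks(1000, overlap_blocks=15, block_size=16)
```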
### Block Tracking Mechanisms
The router maintains block information through two complementary systems:
- **Active Decoding Blocks**: Tracked locally by the router throughout the request lifecycle:
- Incremented when adding a new request
- Updated during token generation
- Decremented upon request completion
- **Cached Blocks**: Maintained globally by the KvIndexer using a prefix tree built from worker-reported KV events. This provides accurate overlap information for routing decisions.
## Cost Function
The KV Router's routing decision is based on a simple cost function:
```
logit = kv_overlap_score_weight × potential_prefill_blocks + potential_active_blocks
```
Where:
- Lower logit values are better (less computational cost)
- The router uses softmax sampling with optional temperature to select workers
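The cost function and worker selection can be sketched as follows. This is an approximation for illustration, not the router's implementation: the function names are hypothetical, and the exact softmax form is an assumption (probabilities proportional to `exp(-logit / temperature)`, with no further normalization of the logits).

```python
import math
import random


def worker_logit(
    prefill_blocks: float, active_blocks: float, overlap_weight: float = 1.0
) -> float:
    """Cost of routing to a worker; lower is better."""
    return overlap_weight * prefill_blocks + active_blocks


def select_worker(logits: dict, temperature: float = 0.0, rng=random):
    """Pick a worker id from a mapping of worker id to logit."""
    if temperature <= 0.0:
        # Deterministic: take the cheapest worker.
        return min(logits, key=logits.get)
    # Softmax over negated logits so that lower cost gets higher probability.
    # Subtracting the minimum keeps the exponentials numerically stable.
    best = min(logits.values())
    workers = list(logits)
    weights = [math.exp(-(logits[w] - best) / temperature) for w in workers]
    return rng.choices(workers, weights=weights, k=1)[0]
```

With `temperature=0.0` the choice is always the lowest-cost worker. Raising the temperature flattens the distribution, so more expensive workers are picked more often.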
### Key Parameter: kv-overlap-score-weight
The `kv-overlap-score-weight` parameter (default: 1.0) controls the balance between prefill and decode optimization:
- **Higher values (> 1.0)**: Emphasize reducing prefill cost
- Prioritizes routing to workers with better cache hits
- Optimizes for Time To First Token (TTFT)
- Best for workloads where initial response latency is critical
- **Lower values (< 1.0)**: Emphasize decode performance
- Distributes active decoding blocks more evenly
- Optimizes for Inter-Token Latency (ITL)
- Best for workloads with long generation sequences
## KV Events vs. Approximation Mode
The router uses KV events from workers by default to maintain an accurate global view of cached blocks. You can disable this with the `--no-kv-events` flag:
- **With KV Events (default)**:
- Calculates overlap accurately using actual cached blocks
- Provides higher accuracy at the cost of event-processing overhead
- Recommended for production deployments
- **Without KV Events (--no-kv-events)**:
- Router predicts cache state based on routing decisions with TTL-based expiration and pruning
- Tracks blocks from recent requests with configurable time-to-live
- Reduces overhead at the cost of routing accuracy
- Suitable for testing or when event processing becomes a bottleneck
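The approximation mode can be pictured as a per-worker record of recently routed blocks that expire after a time-to-live. The class below is a hypothetical sketch of that idea only. Its name, its methods, and the prefix-style overlap count are assumptions and do not reflect the router's actual data structures.

```python
class ApproxBlockTracker:
    """Hypothetical TTL-based record of blocks one worker probably has cached."""

    def __init__(self, ttl_seconds: float):
        self.ttl_seconds = ttl_seconds
        self._expires_at = {}  # block hash -> expiry timestamp

    def record_routed(self, block_hashes, now: float) -> None:
        """Assume the worker now caches every block of the routed request."""
        for block_hash in block_hashes:
            self._expires_at[block_hash] = now + self.ttl_seconds

    def prune(self, now: float) -> None:
        """Drop entries whose time-to-live has elapsed."""
        self._expires_at = {h: t for h, t in self._expires_at.items() if t > now}

    def overlap_blocks(self, block_hashes, now: float) -> int:
        """Count leading blocks of a request predicted to be cached."""
        self.prune(now)
        count = 0
        for block_hash in block_hashes:
            if block_hash not in self._expires_at:
                break
            count += 1
        return count
```

Because nothing is confirmed by the worker, the predicted overlap can drift from the real cache state, which is the accuracy trade-off described above.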
## Tuning Guidelines
### 1. Understand Your Workload Characteristics
- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`
### 2. Monitor Key Metrics
The router logs the cost calculation for each worker:
```
Formula for worker_1: 125.5 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
```
This shows:
- Total cost (125.5)
- Overlap weight × prefill blocks (1.0 × 100.5)
- Active blocks (25.0)
- Cached blocks that contribute to overlap (15)
### 3. Temperature-Based Routing
The `--router-temperature` option controls routing randomness:
- **0.0 (default)**: Deterministic selection of the best worker
- **> 0.0**: Probabilistic selection, higher values increase randomness
- Useful for preventing worker saturation and improving load distribution
### 4. Iterative Optimization
1. Begin with default settings
2. Monitor TTFT and ITL metrics
3. Adjust `kv-overlap-score-weight` to meet your performance goals:
- To reduce TTFT: Increase the weight
- To reduce ITL: Decrease the weight
4. If you observe severe load imbalance, increase the temperature setting
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
Templates for creating consistent Dynamo documentation.
## Directory Hierarchy
### Components (Router, Planner, KVBM, Frontend, Profiler)
```
┌──────────────────────────────────────────────────────────────┐
│ Tier 1: components/src/dynamo/<component>/README.md │ ← Redirect stub
│ Content: 1-5 lines pointing to docs/components/<component>/│
│ Template: incode_readme.md │
└─────────────────────┬────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ Tier 2: docs/components/<component>/ │ ← User docs
│ • README.md ← component_readme.md │
│ • <component>_guide.md ← component_guide.md │
│ • <component>_examples.md ← component_examples.md │
└─────────────────────┬────────────────────────────────────────┘
┌──────────────────────────────────────────────────────────────┐
│ Tier 3: docs/design_docs/<component>_design.md │ ← Contributor docs
│ Template: component_design.md │
└──────────────────────────────────────────────────────────────┘
```
### Backends (vLLM, SGLang, TRT-LLM)
```
┌─────────────────────────────────────────────────────┐
│ Tier 1: components/src/dynamo/<backend>/README.md │ ← Redirect stub
│ Content: 1-5 lines pointing to docs/backends/ │
│ Template: incode_readme.md │
└─────────────────────┬───────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 2: docs/backends/<backend>/ │ ← User docs
│ • README.md ← backend_readme.md │
│ • <backend>_guide.md ← backend_guide.md │
│ │
│ Tier 2.5: docs/backends/README.md (exists) │
│ • Backend comparison table │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 3: External │
│ Backend internals documented in upstream repos │
└─────────────────────────────────────────────────────┘
```
### Features (Multimodal, LoRA, Speculative Decoding)
```
┌─────────────────────────────────────────────────────┐
│ Tier 1: N/A │
│ No in-code README (features are not components) │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 2: docs/features/<feature>/ │ ← User docs
│ • README.md ← feature_readme.md │
│ • <feature>_vllm.md ← feature_backend.md │
│ • <feature>_sglang.md ← feature_backend.md │
│ • <feature>_trtllm.md ← feature_backend.md │
└─────────────────────┬───────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 3: docs/design_docs/<feature>_design.md │ ← Optional
│ Only if significant architecture │
└─────────────────────────────────────────────────────┘
```
### Integrations (LMCache, HiCache, NIXL)
```
┌─────────────────────────────────────────────────────┐
│ Tier 1: N/A │
│ No in-code README (external tools) │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 2: docs/integrations/<integration>/ │ ← User docs
│ • README.md ← integration_readme.md │
│ • <integration>_setup.md (custom) │
│ • <integration>_<backend>.md (custom) │
└─────────────────────────────────────────────────────┘
```
### Deploy (Kubernetes, Helm, Operator)
```
┌─────────────────────────────────────────────────────┐
│ Tier 1: N/A │
│ No in-code README (deployment topics) │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 2: docs/deploy/ │ ← User docs
│ • README.md (deployment overview) │
│ • installation_guide.md, dynamo_operator.md │
│ • helm.md, examples/ │
└─────────────────────────────────────────────────────┘
```
### Performance (Tuning, Benchmarks)
```
┌─────────────────────────────────────────────────────┐
│ Tier 1: N/A │
│ No in-code README (performance topics) │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 2: docs/performance/ │ ← User docs
│ • README.md (performance overview) │
│ • tuning.md, benchmarking.md, etc. │
└─────────────────────────────────────────────────────┘
```
### Infrastructure (Observability, Fault Tolerance, Development)
```
┌─────────────────────────────────────────────────────┐
│ Tier 1: N/A │
│ No in-code README (operations topics) │
└─────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────┐
│ Tier 2: docs/infrastructure/<topic>/ │ ← User docs
│ • README.md ← infrastructure_readme.md │
│ • <subtopic>.md (detailed guides) │
└─────────────────────────────────────────────────────┘
```
## Three-Tier Pattern
| Tier | Purpose | Audience | Location |
|------|---------|----------|----------|
| **Tier 1** | Redirect stub (1-5 lines) | Developers browsing code | `components/src/dynamo/<name>/README.md` |
| **Tier 2** | User documentation | Users, operators | `docs/<category>/<name>/` (e.g., `docs/components/router/`) |
| **Tier 3** | Design documentation | Contributors | `docs/design_docs/<name>_design.md` |
## Template Selection
| What you're documenting | Templates to use |
|------------------------|------------------|
| New component | `incode_readme.md` + `component_*.md` (all 4) |
| New backend | `incode_readme.md` + `backend_*.md` (both) |
| New feature | `feature_readme.md` + `feature_backend.md` (per backend) |
| New integration | `integration_readme.md` |
| New deploy topic | Custom (follows `docs/deploy/` structure) |
| New performance topic | Custom (follows `docs/performance/` structure) |
| New infrastructure topic | `infrastructure_readme.md` |
| Migrating existing docs | Use the template matching your target file |
## Usage
1. Identify which category your documentation belongs to (component, backend, feature, integration, deploy, performance, or infrastructure)
2. Create the directory structure shown above
3. Copy templates to the correct locations with correct filenames
4. Replace all `<placeholders>` with actual values
5. Replace `{/* comments */}` with actual content
6. Remove sections that don't apply
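
Steps 4 and 5 can be partly automated. The helper below is a minimal sketch, not part of the repository: `fill_template` is a hypothetical name, and it assumes placeholders are written as `<name>` and unfilled sections as `{/* ... */}` comments, as in the templates.

```python
import re


def fill_template(template_text, values):
    """Replace `<placeholder>` tokens and list the comments that still need content.

    values maps placeholder names (without angle brackets) to replacement text.
    Returns (filled_text, remaining_comments).
    """
    filled = template_text
    for name, value in values.items():
        filled = filled.replace(f"<{name}>", value)
    # Any `{/* ... */}` comment left over still has to be replaced by hand.
    remaining = re.findall(r"\{/\*.*?\*/\}", filled, flags=re.DOTALL)
    return filled, remaining


template = "Usage examples for the `<Component>`.\n{/* Description */}\n"
filled, remaining = fill_template(template, {"Component": "Router"})
```

Here `remaining` works as a checklist of sections that still need real content or removal before the page is published.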
## Updating Navigation
After adding new documentation:
1. **Sphinx (current):** Update `docs/index.rst` or the appropriate `_sections/*.rst` file to include your new docs in the navigation
2. **Fern (future):** Update `fern/docs.yml` with your new pages
See [docs/README.md](../README.md) for documentation build instructions.
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
Advanced deployment and configuration for the `<Backend>` backend.
## Deployment
### Single-Node Setup
{/* Local deployment instructions */}
### Multi-Node Setup
{/* Distributed deployment with TP/PP */}
### Kubernetes Deployment
```yaml
# Full DGDR example
```
## Configuration
### CLI Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| {/* arg */} | {/* type */} | {/* default */} | {/* description */} |
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| {/* var */} | {/* default */} | {/* description */} |
### Model Configuration
{/* Model-specific settings, quantization */}
## Performance Tuning
### Memory Optimization
{/* KV cache sizing, batch limits */}
### Throughput Optimization
{/* Concurrency, prefill/decode settings */}
## Troubleshooting
### Common Issues
| Issue | Cause | Solution |
|-------|-------|----------|
| {/* issue */} | {/* cause */} | {/* solution */} |
### Debug Mode
```bash
# Add debug command from existing docs
```
## See Also
| Document | Path |
|----------|------|
| `<Backend> Overview` | `./README.md` |
| Backend Comparison | `../README.md` |
{/* Convert to links when using template */}
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
{/* 2-3 sentence overview of this backend integration */}
## Feature Matrix
{/* Copy actual feature matrix from existing backend docs */}
{/* Example pattern (from vLLM README): */}
| Feature | Status | Notes |
|---------|--------|-------|
| Disaggregated Serving | ✅ | |
| KV-Aware Routing | ✅ | |
| SLA-Based Planner | ✅ | |
| Multimodal | ✅ | Vision models |
| LoRA | 🚧 | Experimental |
## Quick Start
### Prerequisites
- {/* List prerequisites */}
### Usage
```bash
# Add minimal usage example from existing backend docs
# Example pattern (vLLM):
# python -m dynamo.vllm --model <model-name>
# Example pattern (SGLang):
# python -m dynamo.sglang --model <model-name>
```
### Kubernetes
```yaml
# Add DGDR example - use apiVersion: nvidia.com/v1alpha1
# See recipes/ folder for production examples
```
## Configuration
| Parameter | Default | Description |
|-----------|---------|-------------|
| {/* param */} | {/* default */} | {/* description */} |
{/* EXAMPLE: Filled-in Configuration for vLLM would look like:
| Parameter | Default | Description |
|-----------|---------|-------------|
| `--model` | required | Model path or HuggingFace ID |
| `--tensor-parallel-size` | `1` | Number of GPUs for tensor parallelism |
| `--max-model-len` | auto | Maximum sequence length | */}
## Next Steps
| Document | Path | Description |
|----------|------|-------------|
| `<Backend> Guide` | `<backend>_guide.md` | Advanced configuration |
| Backend Comparison | `../README.md` | Compare backends |
{/* Convert table rows to markdown links */}
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
Architecture and design decisions for the `<Component>`.
## Overview
{/* High-level architecture description */}
## Design Goals
1. **Goal 1** - Description
2. **Goal 2** - Description
3. **Goal 3** - Description
## Architecture
### Components
{/* Description of internal components */}
### Data Flow
```
┌─────────┐    ┌─────────┐    ┌─────────┐
│  Input  │───▶│ Process │───▶│ Output  │
└─────────┘    └─────────┘    └─────────┘
```
## Design Decisions
### Decision 1: {/* Title */}
**Context:** {/* What problem were we solving? */}
**Options Considered:**
1. Option A - Pros/Cons
2. Option B - Pros/Cons
**Decision:** {/* What we chose and why */}
**Consequences:** {/* Trade-offs accepted */}
## Algorithms
### {/* Algorithm Name */}
{/* Algorithm description */}
```
Pseudocode or formula
```
## Performance Considerations
{/* Performance characteristics, bottlenecks, optimization opportunities */}
## Future Work
- {/* Planned improvement 1 */}
- {/* Planned improvement 2 */}
## References
- {/* Related design docs */}
- {/* External papers or resources */}
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
Usage examples for the `<Component>`.
## Basic Examples
### Example 1: {/* Title */}
```bash
# Add example from existing docs
```
### Example 2: {/* Title */}
```python
# Add example from existing docs
```
## Kubernetes Examples
### Minimal Deployment
```yaml
# Add minimal DGDR from existing docs
```
### Production Deployment
```yaml
# Add production DGDR from existing docs
```
## Advanced Examples
### {/* Advanced Use Case Title */}
{/* Description */}
```bash
# Add example
```
## Sample Configurations
### config-minimal.yaml
```yaml
# Add from existing docs
```
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
---
This guide covers deployment, configuration, and integration for the `<Component>`.
## Deployment
### Single-Node Setup
{/* Instructions for local/single-node deployment */}
### Multi-Node Setup
{/* Instructions for distributed deployment */}
### Kubernetes Deployment
```yaml
# Full DGDR example
```
## Configuration
### CLI Arguments
| Argument | Type | Default | Description |
|----------|------|---------|-------------|
| {/* arg */} | {/* type */} | {/* default */} | {/* description */} |
### Environment Variables
| Variable | Default | Description |
|----------|---------|-------------|
| {/* var */} | {/* default */} | {/* description */} |
### Configuration File
```yaml
# Add config file example if applicable
```
## Integration
### With Router
{/* How to integrate with Router */}
### With Planner
{/* How to integrate with Planner */}
### With Observability
{/* Metrics, logging, tracing integration */}
## Troubleshooting
### Common Issues
| Issue | Cause | Solution |
|-------|-------|----------|
| Error message | Root cause | Fix |
### Debug Mode
```bash
# Add debug command from existing docs
```
## See Also
| Document | Path |
|----------|------|
| `<Component> Examples` | `<component>_examples.md` |
| `<Component> Design` | `/docs/design_docs/<component>_design.md` |
{/* Convert table rows to markdown links */}