Unverified Commit 96f3bdcc authored by dagil-nvidia's avatar dagil-nvidia Committed by GitHub
Browse files

docs: 1.0 documentation improvements (#7168)


Signed-off-by: default avatarDan Gil <dagil@nvidia.com>
Co-authored-by: default avatarathreesh <anish.maddipoti@utexas.edu>
Co-authored-by: default avatarClaude Opus 4.6 <noreply@anthropic.com>
parent 5e51d6dd
...@@ -6,16 +6,6 @@ title: Examples ...@@ -6,16 +6,6 @@ title: Examples
For quick start instructions, see the [SGLang README](README.md). This document provides all deployment patterns for running SGLang with Dynamo, including LLMs, multimodal, and diffusion models, and Kubernetes deployment. For quick start instructions, see the [SGLang README](README.md). This document provides all deployment patterns for running SGLang with Dynamo, including LLMs, multimodal, and diffusion models, and Kubernetes deployment.
## Table of Contents
- [Infrastructure Setup](#infrastructure-setup)
- [LLM Serving](#llm-serving)
- [Embedding Models](#embedding-models)
- [Vision Models](#vision-models)
- [Diffusion Models](#diffusion-models)
- [Kubernetes Deployment](#kubernetes-deployment)
- [Testing](#testing)
## Infrastructure Setup ## Infrastructure Setup
For local/bare-metal development, start etcd and optionally NATS using Docker Compose: For local/bare-metal development, start etcd and optionally NATS using Docker Compose:
......
...@@ -10,13 +10,22 @@ The Dynamo Frontend is the API gateway for serving LLM inference requests. It pr ...@@ -10,13 +10,22 @@ The Dynamo Frontend is the API gateway for serving LLM inference requests. It pr
| Feature | Status | | Feature | Status |
|---------|--------| |---------|--------|
| OpenAI Chat Completions API | ✅ Supported | | OpenAI Chat Completions API (`/v1/chat/completions`) | ✅ Supported |
| OpenAI Completions API | ✅ Supported | | OpenAI Completions API (`/v1/completions`) | ✅ Supported |
| OpenAI Embeddings API (`/v1/embeddings`) | ✅ Supported |
| OpenAI Responses API (`/v1/responses`) | ✅ Supported |
| OpenAI Models API (`/v1/models`) | ✅ Supported |
| Image Generation (`/v1/images/generations`) | ✅ Supported |
| Video Generation (`/v1/videos/generations`) | ✅ Supported |
| Anthropic Messages API (`/v1/messages`) | 🧪 Experimental |
| KServe gRPC v2 API | ✅ Supported | | KServe gRPC v2 API | ✅ Supported |
| Streaming responses | ✅ Supported | | Streaming responses (SSE) | ✅ Supported |
| Multi-model serving | ✅ Supported | | Multi-model serving | ✅ Supported |
| Integrated routing | ✅ Supported | | Integrated KV-aware routing | ✅ Supported |
| Tool calling | ✅ Supported | | Tool calling | ✅ Supported |
| TLS (HTTPS) | ✅ Supported |
| Swagger UI (`/docs`) | ✅ Supported |
| NVIDIA request extensions (`nvext`) | ✅ Supported |
## Quick Start ## Quick Start
...@@ -76,7 +85,7 @@ spec: ...@@ -76,7 +85,7 @@ spec:
|-----------|---------|-------------| |-----------|---------|-------------|
| `--http-port` | 8000 | HTTP server port | | `--http-port` | 8000 | HTTP server port |
| `--kserve-grpc-server` | false | Enable KServe gRPC server | | `--kserve-grpc-server` | false | Enable KServe gRPC server |
| `--router-mode` | `round_robin` | Routing strategy: `round_robin`, `random`, `kv` | | `--router-mode` | `round-robin` | Routing strategy: `round-robin`, `random`, `kv`, `direct` |
See the [Frontend Guide](frontend-guide.md) for full configuration options. See the [Frontend Guide](frontend-guide.md) for full configuration options.
...@@ -84,5 +93,7 @@ See the [Frontend Guide](frontend-guide.md) for full configuration options. ...@@ -84,5 +93,7 @@ See the [Frontend Guide](frontend-guide.md) for full configuration options.
| Document | Description | | Document | Description |
|----------|-------------| |----------|-------------|
| [Configuration Reference](configuration.md) | All CLI arguments, env vars, and HTTP endpoints |
| [Frontend Guide](frontend-guide.md) | KServe gRPC configuration and integration | | [Frontend Guide](frontend-guide.md) | KServe gRPC configuration and integration |
| [NVIDIA Request Extensions (nvext)](nvext.md) | Custom request fields for routing hints and cache control |
| [Router Documentation](../router/README.md) | KV-aware routing configuration | | [Router Documentation](../router/README.md) | KV-aware routing configuration |
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Frontend Configuration Reference
subtitle: Complete reference for all frontend CLI arguments, environment variables, and HTTP endpoints
---
This page documents all configuration options for the Dynamo Frontend (`python -m dynamo.frontend`).
Every CLI argument has a corresponding environment variable. CLI arguments take precedence over environment variables.
## HTTP & Networking
| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--http-host` | `DYN_HTTP_HOST` | `0.0.0.0` | HTTP listen address |
| `--http-port` | `DYN_HTTP_PORT` | `8000` | HTTP listen port |
| `--tls-cert-path` | `DYN_TLS_CERT_PATH` | — | TLS certificate path (PEM). Must be paired with `--tls-key-path` |
| `--tls-key-path` | `DYN_TLS_KEY_PATH` | — | TLS private key path (PEM). Must be paired with `--tls-cert-path` |
The Rust HTTP server also reads these environment variables (not exposed as CLI args):
| Env Var | Default | Description |
|---------|---------|-------------|
| `DYN_HTTP_BODY_LIMIT_MB` | `192` | Maximum request body size in MB |
| `DYN_HTTP_GRACEFUL_SHUTDOWN_TIMEOUT_SECS` | `5` | Graceful shutdown timeout in seconds |
## Router
| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--router-mode` | `DYN_ROUTER_MODE` | `round-robin` | Routing strategy: `round-robin`, `random`, `kv`, `direct` |
| `--router-kv-overlap-score-weight` | `DYN_ROUTER_KV_OVERLAP_SCORE_WEIGHT` | `1.0` | Weight for KV cache overlap in worker scoring. Higher = prefer cache reuse |
| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` | Softmax temperature for worker sampling. 0 = deterministic |
| `--router-kv-events` / `--no-router-kv-events` | `DYN_ROUTER_USE_KV_EVENTS` | `true` | Enable KV cache state events from workers. Disable for prediction-based routing |
| `--router-ttl-secs` | `DYN_ROUTER_TTL_SECS` | `120.0` | Block TTL when KV events are disabled |
| `--router-max-tree-size` | `DYN_ROUTER_MAX_TREE_SIZE` | `1048576` | Max radix tree size before pruning (no-events mode) |
| `--router-prune-target-ratio` | `DYN_ROUTER_PRUNE_TARGET_RATIO` | `0.8` | Target size ratio after pruning (no-events mode) |
| `--router-replica-sync` / `--no-router-replica-sync` | `DYN_ROUTER_REPLICA_SYNC` | `false` | Sync state across multiple router instances |
| `--router-snapshot-threshold` | `DYN_ROUTER_SNAPSHOT_THRESHOLD` | `1000000` | Messages before triggering a snapshot |
| `--router-reset-states` / `--no-router-reset-states` | `DYN_ROUTER_RESET_STATES` | `false` | Reset router state on startup. **Warning:** affects existing replicas |
| `--router-track-active-blocks` / `--no-router-track-active-blocks` | `DYN_ROUTER_TRACK_ACTIVE_BLOCKS` | `true` | Track blocks used by in-progress requests for load balancing |
| `--router-assume-kv-reuse` / `--no-router-assume-kv-reuse` | `DYN_ROUTER_ASSUME_KV_REUSE` | `true` | Assume KV cache reuse when tracking active blocks |
| `--router-track-output-blocks` / `--no-router-track-output-blocks` | `DYN_ROUTER_TRACK_OUTPUT_BLOCKS` | `false` | Track output blocks with fractional decay during generation |
| `--router-event-threads` | `DYN_ROUTER_EVENT_THREADS` | `4` | Event processing threads. >1 enables concurrent radix tree |
| `--router-queue-threshold` | `DYN_ROUTER_QUEUE_THRESHOLD` | — | Queue threshold fraction of prefill capacity. Enables priority scheduling |
| `--enable-cache-control` / `--no-enable-cache-control` | `DYN_ENABLE_CACHE_CONTROL` | `false` | Enable TTL-based cache pinning (requires `--router-mode=kv`) |
| `--decode-fallback` / `--no-decode-fallback` | `DYN_DECODE_FALLBACK` | `false` | Fall back to aggregated mode when prefill workers unavailable |
## Fault Tolerance
| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--migration-limit` | `DYN_MIGRATION_LIMIT` | `0` | Max request migrations per worker disconnect. 0 = disabled |
| `--active-decode-blocks-threshold` | `DYN_ACTIVE_DECODE_BLOCKS_THRESHOLD` | — | KV cache utilization fraction (0.0–1.0) for busy detection |
| `--active-prefill-tokens-threshold` | `DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD` | — | Absolute token count for prefill busy detection |
| `--active-prefill-tokens-threshold-frac` | `DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD_FRAC` | — | Fraction of `max_num_batched_tokens` for prefill busy detection. OR logic with absolute threshold |
## Model Discovery
| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--namespace` | `DYN_NAMESPACE` | — | Exact namespace for model discovery scoping |
| `--namespace-prefix` | `DYN_NAMESPACE_PREFIX` | — | Namespace prefix for discovery (e.g., `ns` matches `ns`, `ns-abc123`). Takes precedence over `--namespace` |
| `--model-name` | `DYN_MODEL_NAME` | — | Override model name string |
| `--model-path` | `DYN_MODEL_PATH` | — | Path to local model directory (for private/custom models) |
| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | — | KV cache block size override |
## Infrastructure
| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--discovery-backend` | `DYN_DISCOVERY_BACKEND` | `etcd` | Service discovery: `kubernetes`, `etcd`, `file`, `mem` |
| `--request-plane` | `DYN_REQUEST_PLANE` | `tcp` | Request distribution: `tcp` (fastest), `nats`, `http` |
| `--event-plane` | `DYN_EVENT_PLANE` | `nats` | Event publishing: `nats`, `zmq` |
## KServe gRPC
| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--kserve-grpc-server` / `--no-kserve-grpc-server` | `DYN_KSERVE_GRPC_SERVER` | `false` | Start KServe gRPC v2 server |
| `--grpc-metrics-port` | `DYN_GRPC_METRICS_PORT` | `8788` | HTTP metrics port for gRPC service |
See the [Frontend Guide](frontend-guide.md) for KServe message formats and integration details.
## Monitoring
| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--metrics-prefix` | `DYN_METRICS_PREFIX` | `dynamo_frontend` | Prefix for frontend Prometheus metrics |
| `--dump-config-to` | `DYN_DUMP_CONFIG_TO` | — | Dump resolved config to file path |
## Experimental
| CLI Argument | Env Var | Default | Description |
|-------------|---------|---------|-------------|
| `--enable-anthropic-api` | `DYN_ENABLE_ANTHROPIC_API` | `false` | Enable `/v1/messages` (Anthropic Messages API) |
| `--dyn-chat-processor` | `DYN_CHAT_PROCESSOR` | `dynamo` | Chat processor: `dynamo` or `vllm` |
| `--dyn-debug-perf` | `DYN_DEBUG_PERF` | `false` | Log per-function timing for preprocessing (vllm processor only) |
| `--dyn-preprocess-workers` | `DYN_PREPROCESS_WORKERS` | `0` | Worker processes for CPU-bound preprocessing. 0 = main event loop (vllm processor only) |
| `-i` / `--interactive` | `DYN_INTERACTIVE` | `false` | Interactive text chat mode |
## HTTP Endpoints
The frontend exposes the following HTTP endpoints:
### OpenAI-Compatible
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/v1/chat/completions` | Chat completions (streaming and non-streaming) |
| `POST` | `/v1/completions` | Text completions |
| `POST` | `/v1/embeddings` | Text embeddings |
| `POST` | `/v1/responses` | Responses API |
| `POST` | `/v1/images/generations` | Image generation |
| `POST` | `/v1/videos/generations` | Video generation |
| `POST` | `/v1/videos/generations/stream` | Video generation (streaming) |
| `GET` | `/v1/models` | List available models |
### Anthropic (Experimental)
| Method | Path | Description |
|--------|------|-------------|
| `POST` | `/v1/messages` | Anthropic Messages API (requires `--enable-anthropic-api`) |
| `POST` | `/v1/messages/count_tokens` | Token counting for Anthropic API |
### Infrastructure
| Method | Path | Description |
|--------|------|-------------|
| `GET` | `/health` | Health check |
| `GET` | `/live` | Liveness check |
| `GET` | `/metrics` | Prometheus metrics |
| `GET` | `/openapi.json` | OpenAPI specification |
| `GET` | `/docs` | Swagger UI |
| `POST` | `/busy_threshold` | Set busy thresholds |
| `GET` | `/busy_threshold` | Get current busy thresholds |
### Endpoint Path Customization
All endpoint paths can be overridden via environment variables:
| Env Var | Default Path |
|---------|-------------|
| `DYN_HTTP_SVC_CHAT_PATH_ENV` | `/v1/chat/completions` |
| `DYN_HTTP_SVC_CMP_PATH_ENV` | `/v1/completions` |
| `DYN_HTTP_SVC_EMB_PATH_ENV` | `/v1/embeddings` |
| `DYN_HTTP_SVC_RESPONSES_PATH_ENV` | `/v1/responses` |
| `DYN_HTTP_SVC_MODELS_PATH_ENV` | `/v1/models` |
| `DYN_HTTP_SVC_ANTHROPIC_PATH_ENV` | `/v1/messages` |
| `DYN_HTTP_SVC_HEALTH_PATH_ENV` | `/health` |
| `DYN_HTTP_SVC_LIVE_PATH_ENV` | `/live` |
| `DYN_HTTP_SVC_METRICS_PATH_ENV` | `/metrics` |
## Deprecated
| CLI Argument | Env Var | Description |
|-------------|---------|-------------|
| `--router-durable-kv-events` | `DYN_ROUTER_DURABLE_KV_EVENTS` | Use event-plane local indexer instead |
## See Also
- [Frontend Overview](README.md) — quick start and feature matrix
- [Frontend Guide](frontend-guide.md) — KServe gRPC configuration
- [NVIDIA Request Extensions (nvext)](nvext.md) — custom request fields
- [Router Guide](../router/router-guide.md) — detailed routing configuration
- [Metrics](../../observability/metrics.md) — available Prometheus metrics
- [Fault Tolerance](../../fault-tolerance/README.md) — request migration and rejection
...@@ -9,20 +9,6 @@ The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to h ...@@ -9,20 +9,6 @@ The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to h
KVBM is modular and can be used standalone via `pip install kvbm` or as the memory management component in the full Dynamo stack. This guide covers installation, configuration, and deployment of the Dynamo KV Block Manager (KVBM) and other KV cache management systems. KVBM is modular and can be used standalone via `pip install kvbm` or as the memory management component in the full Dynamo stack. This guide covers installation, configuration, and deployment of the Dynamo KV Block Manager (KVBM) and other KV cache management systems.
## Table of Contents
- [Quick Start](#quick-start)
- [Run KVBM Standalone](#run-kvbm-standalone)
- [Run KVBM in Dynamo with vLLM](#run-kvbm-in-dynamo-with-vllm)
- [Run KVBM in Dynamo with TensorRT-LLM](#run-kvbm-in-dynamo-with-tensorrt-llm)
- [Run Dynamo with SGLang HiCache](#run-dynamo-with-sglang-hicache)
- [Disaggregated Serving with KVBM](#disaggregated-serving-with-kvbm)
- [Configuration](#configuration)
- [Enable and View KVBM Metrics](#enable-and-view-kvbm-metrics)
- [Benchmarking KVBM](#benchmarking-kvbm)
- [Troubleshooting](#troubleshooting)
- [Developing Locally](#developing-locally)
## Quick Start ## Quick Start
## Run KVBM Standalone ## Run KVBM Standalone
......
...@@ -6,15 +6,6 @@ title: Router Examples ...@@ -6,15 +6,6 @@ title: Router Examples
For quick start instructions, see the [Router README](README.md). This document provides further examples for using the Dynamo Router, including Python API usage, Kubernetes deployments, and custom routing patterns. For quick start instructions, see the [Router README](README.md). This document provides further examples for using the Dynamo Router, including Python API usage, Kubernetes deployments, and custom routing patterns.
## Table of Contents
- [Using KvRouter Python API](#using-kvrouter-python-api)
- [K8s Examples](#k8s-examples)
- [Routing Patterns](#routing-patterns)
- [Custom Routing Example: Minimizing TTFT](#custom-routing-example-minimizing-ttft)
- [KV Event Publishing for Custom Engines](#kv-event-publishing-for-custom-engines)
- [Global Router (Hierarchical Routing)](#global-router-hierarchical-routing)
## Using KvRouter Python API ## Using KvRouter Python API
Instead of launching the KV Router via command line, you can create a `KvRouter` object directly in Python. This allows per-request routing configuration overrides. Instead of launching the KV Router via command line, you can create a `KvRouter` object directly in Python. This allows per-request routing configuration overrides.
......
...@@ -10,7 +10,7 @@ subtitle: Enable KV-aware routing using Router for Dynamo deployments ...@@ -10,7 +10,7 @@ subtitle: Enable KV-aware routing using Router for Dynamo deployments
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups. The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
This guide helps you get started with using the Dynamo router, with further details on configuration, disaggregated serving setup, and parameter tuning. This guide helps you get started with using the Dynamo router, with further details on configuration, disaggregated serving setup, and parameter tuning.
## Quick start ## Quick Start
### Python / CLI Deployment ### Python / CLI Deployment
...@@ -31,7 +31,7 @@ Backend workers register themselves using the `register_model` API, after which ...@@ -31,7 +31,7 @@ Backend workers register themselves using the `register_model` API, after which
| Argument | Default | Description | | Argument | Default | Description |
|----------|---------|-------------| |----------|---------|-------------|
| `--router-mode kv` | `round_robin` | Enable KV cache-aware routing | | `--router-mode kv` | `round-robin` | Enable KV cache-aware routing |
| `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) | | `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
| `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) | | `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
| `--router-kv-events` / `--no-router-kv-events` | `--router-kv-events` | Enable/disable real-time KV event tracking | | `--router-kv-events` / `--no-router-kv-events` | `--router-kv-events` | Enable/disable real-time KV event tracking |
...@@ -73,7 +73,7 @@ All CLI arguments can be configured via environment variables using the `DYN_` p ...@@ -73,7 +73,7 @@ All CLI arguments can be configured via environment variables using the `DYN_` p
| CLI Argument | Environment Variable | Default | | CLI Argument | Environment Variable | Default |
|--------------|---------------------|---------| |--------------|---------------------|---------|
| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round_robin` | | `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round-robin` |
| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` | | `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` |
| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific | | `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific |
| `--no-router-kv-events` | `DYN_ROUTER_USE_KV_EVENTS=false` | `true` | | `--no-router-kv-events` | `DYN_ROUTER_USE_KV_EVENTS=false` | `true` |
...@@ -311,7 +311,7 @@ graph TD ...@@ -311,7 +311,7 @@ graph TD
## Serving Multiple Router Replicas ## Serving Multiple Router Replicas
For improved fault tolerance, you can launch multiple frontend + router replicas. Since the frontend and router are currently tied together, you'll need to use different HTTP ports for each instance. (The separation of the frontend and Router is WIP.) For improved fault tolerance, you can launch multiple frontend + router replicas. If multiple `dynamo.frontend` processes share the same host or network namespace, give each instance a different HTTP port. In Kubernetes or on separate hosts, replicas can usually reuse the same container port. Alternatively, you can deploy the router separately as the standalone `python -m dynamo.router` service; see the [Standalone Router README](../../../components/src/dynamo/router/README.md).
### Router State Management ### Router State Management
......
...@@ -147,7 +147,7 @@ The interpolators use the profiling sweep granularity to determine precision. Fi ...@@ -147,7 +147,7 @@ The interpolators use the profiling sweep granularity to determine precision. Fi
## Initialization ## Initialization
The planner starts with a 30-second delay (`INIT_PLANNER_START_DELAY`) to allow other components (frontend, workers) to register and stabilize. This is a known workaround (marked TODO in code) that should be replaced with a proper readiness check. The planner currently waits 30 seconds (`INIT_PLANNER_START_DELAY` in `components/src/dynamo/planner/__main__.py`) as a temporary workaround while other components (frontend, workers) register and stabilize; see [Known Limitations](#known-limitations) for the planned readiness-probing replacement.
After the delay: After the delay:
...@@ -232,4 +232,3 @@ In aggregated mode (`--mode agg`), engines handle both prefill and decode via ch ...@@ -232,4 +232,3 @@ In aggregated mode (`--mode agg`), engines handle both prefill and decode via ch
| `defaults.py` | Default configs, backend name mappings | | `defaults.py` | Default configs, backend name mappings |
| `planner_argparse.py` | CLI argument definitions | | `planner_argparse.py` | CLI argument definitions |
...@@ -2,9 +2,12 @@ ...@@ -2,9 +2,12 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: Writing Python Workers in Dynamo title: Writing Python Workers in Dynamo
sidebar-title: Writing Python Workers
subtitle: Create custom Python workers and engines for Dynamo subtitle: Create custom Python workers and engines for Dynamo
--- ---
# Writing Python Workers in Dynamo
This guide explains how to create your own Python worker in Dynamo. This guide explains how to create your own Python worker in Dynamo.
The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo. The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo.
...@@ -75,7 +78,7 @@ The `model_type` can be: ...@@ -75,7 +78,7 @@ The `model_type` can be:
See `examples/backends` for full code examples. See `examples/backends` for full code examples.
## Component names ## Component Names
A worker needs three names to register itself: namespace.component.endpoint A worker needs three names to register itself: namespace.component.endpoint
......
...@@ -75,7 +75,7 @@ See [Health Checks](../observability/health-checks.md) for details. ...@@ -75,7 +75,7 @@ See [Health Checks](../observability/health-checks.md) for details.
| Feature | Environment Variable | Default | | Feature | Environment Variable | Default |
|---------|---------------------|---------| |---------|---------------------|---------|
| Worker health port | `DYN_SYSTEM_PORT` | `9090` | | Worker health port | `DYN_SYSTEM_PORT` | `9090` |
| Canary health checks | `DYN_HEALTH_CHECK_ENABLED` | `false` (K8s: `true`) | | Canary health checks | `DYN_HEALTH_CHECK_ENABLED` | `false` |
| Canary wait time | `DYN_CANARY_WAIT_TIME` | `10` seconds | | Canary wait time | `DYN_CANARY_WAIT_TIME` | `10` seconds |
| Health check timeout | `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` | `3` seconds | | Health check timeout | `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` | `3` seconds |
| Decode blocks threshold | `--active-decode-blocks-threshold` | None (disabled) | | Decode blocks threshold | `--active-decode-blocks-threshold` | None (disabled) |
......
...@@ -30,31 +30,20 @@ Dynamo's LoRA implementation provides: ...@@ -30,31 +30,20 @@ Dynamo's LoRA implementation provides:
### Architecture ### Architecture
```text ```mermaid
┌─────────────────────────────────────────────────────────────────┐ flowchart TD
│ LoRA Architecture │ Frontend["Frontend"] --> Router["Router<br/>(LoRA-aware)"]
├─────────────────────────────────────────────────────────────────┤ Router --> Workers["Workers<br/>(LoRA-loaded)"]
│ │ Workers --> ManagerNode["LoRA Manager"]
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Frontend │────▶│ Router │────▶│ Workers │ │ subgraph ManagerGroup["LoRA Manager"]
│ │ /v1/models │ │ LoRA-aware │ │ LoRA-loaded │ │ Downloader
│ └──────────────┘ └──────────────┘ └──────────────┘ │ Cache
│ │ │ end
│ ▼ │
│ ┌─────────────────────────────────┐ │ ManagerNode --> Local["file://<br/>Local"]
│ │ LoRA Manager │ │ ManagerNode --> S3["s3://<br/>S3/MinIO"]
│ │ ┌───────────┐ ┌─────────────┐ │ │ ManagerNode --> HF["hf://<br/>(custom)"]
│ │ │ Downloader│ │ Cache │ │ │
│ │ └───────────┘ └─────────────┘ │ │
│ └─────────────────────────────────┘ │
│ │ │
│ ┌───────────────────┼───────────────────┐ │
│ ▼ ▼ ▼ │
│ ┌────────────┐ ┌────────────┐ ┌─────────┐│
│ │ file:// │ │ s3:// │ │ hf:// ││
│ │ Local │ │ S3/MinIO │ │(custom) ││
│ └────────────┘ └────────────┘ └─────────┘│
└─────────────────────────────────────────────────────────────────┘
``` ```
The LoRA system consists of: The LoRA system consists of:
......
...@@ -138,7 +138,7 @@ navigation: ...@@ -138,7 +138,7 @@ navigation:
path: fault-tolerance/request-rejection.md path: fault-tolerance/request-rejection.md
- page: Testing - page: Testing
path: fault-tolerance/testing.md path: fault-tolerance/testing.md
- page: Writing Python Workers in Dynamo - page: Writing Python Workers
path: development/backend-guide.md path: development/backend-guide.md
# ==================== Backends ==================== # ==================== Backends ====================
...@@ -237,18 +237,22 @@ navigation: ...@@ -237,18 +237,22 @@ navigation:
path: design-docs/disagg-serving.md path: design-docs/disagg-serving.md
- page: Distributed Runtime - page: Distributed Runtime
path: design-docs/distributed-runtime.md path: design-docs/distributed-runtime.md
- page: Discovery Plane - section: Communication Planes
path: design-docs/discovery-plane.md contents:
- page: Request Plane - page: Discovery Plane
path: design-docs/request-plane.md path: design-docs/discovery-plane.md
- page: Event Plane - page: Request Plane
path: design-docs/event-plane.md path: design-docs/request-plane.md
- page: Router Design - page: Event Plane
path: design-docs/router-design.md path: design-docs/event-plane.md
- page: KVBM Design - section: Component Design
path: design-docs/kvbm-design.md contents:
- page: Planner Design - page: Router Design
path: design-docs/planner-design.md path: design-docs/router-design.md
- page: KVBM Design
path: design-docs/kvbm-design.md
- page: Planner Design
path: design-docs/planner-design.md
# ==================== Blog ==================== # ==================== Blog ====================
- section: Blog - section: Blog
......
...@@ -17,23 +17,6 @@ If you want to use LoRA deploy Dynamo without the Inference Gateway. ...@@ -17,23 +17,6 @@ If you want to use LoRA deploy Dynamo without the Inference Gateway.
Currently, these setups are only supported with the kGateway based Inference Gateway. Currently, these setups are only supported with the kGateway based Inference Gateway.
## Table of Contents
- [Prerequisites](#prerequisites)
- [Installation Steps](#installation-steps)
- [1. Install Dynamo Platform](#1-install-dynamo-platform)
- [2. Deploy Inference Gateway](#2-deploy-inference-gateway)
- [3. Deploy Your Model](#3-deploy-your-model)
- [4. Build EPP image (Optional)](#4-build-epp-image-optional)
- [5. Deploy](#5-deploy)
- [6. Verify Installation](#6-verify-installation)
- [7. Usage](#7-usage)
- [8. Deleting the installation](#8-deleting-the-installation)
- [Gateway API Inference Extension Details](#gateway-api-inference-extension-integration)
- [Router bookkeeping operations](#router-bookkeeping-operations)
- [Header Routing Hints](#header-routing-hints)
## Prerequisites ## Prerequisites
- Kubernetes cluster with kubectl configured - Kubernetes cluster with kubectl configured
......
---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Model Caching
subtitle: Download models once and share across all pods in a Kubernetes cluster
---
Large language models can take minutes to download. Without caching, every pod downloads the full model independently, wasting bandwidth and delaying startup. Dynamo supports two approaches to ensure models are downloaded once and shared across the cluster.
## Option 1: PVC + Download Job (Recommended)
The simplest approach: create a shared PVC, run a one-time Job to download the model, then mount the PVC in your DynamoGraphDeployment.
This is the pattern used by all Dynamo recipes today.
### Step 1: Create a Shared PVC
```yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: model-cache
spec:
accessModes:
- ReadWriteMany
resources:
requests:
storage: 100Gi
```
<Note>
`ReadWriteMany` access mode is required so multiple pods can mount the PVC simultaneously. Ensure your storage class supports RWX (e.g., NFS, CephFS, or cloud-provider shared file systems).
</Note>
### Step 2: Download the model
```yaml
apiVersion: batch/v1
kind: Job
metadata:
name: model-download
spec:
template:
spec:
restartPolicy: Never
containers:
- name: downloader
image: python:3.12-slim
command: ["sh", "-c"]
args:
- |
pip install huggingface_hub hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
$MODEL_NAME --revision $MODEL_REVISION
env:
- name: MODEL_NAME
value: "Qwen/Qwen3-0.6B"
- name: MODEL_REVISION
value: "main"
- name: HF_HOME
value: /cache/huggingface
envFrom:
- secretRef:
name: hf-token-secret
volumeMounts:
- name: model-cache
mountPath: /cache/huggingface
volumes:
- name: model-cache
persistentVolumeClaim:
claimName: model-cache
```
### Step 3: Mount in DynamoGraphDeployment
```yaml
apiVersion: nvidia.com/v1alpha1
kind: DynamoGraphDeployment
metadata:
name: my-deployment
spec:
pvcs:
- create: false
name: model-cache
services:
VllmWorker:
volumeMounts:
- name: model-cache
mountPoint: /home/dynamo/.cache/huggingface
```
All `VllmWorker` pods that mount `model-cache` now read from the shared cache, avoiding per-pod worker downloads. If you also want the frontend to reuse tokenizer and config files, mount the same PVC there too.
### Compilation Cache
For vLLM, you can also cache compiled artifacts (CUDA graphs, etc.) with a second PVC:
```yaml
spec:
pvcs:
- create: false
name: model-cache
- create: false
name: compilation-cache
services:
VllmWorker:
volumeMounts:
- name: model-cache
mountPoint: /home/dynamo/.cache/huggingface
- name: compilation-cache
mountPoint: /home/dynamo/.cache/vllm
```
## Option 2: Model Express (P2P Distribution)
[Model Express](https://github.com/ai-dynamo/modelexpress) is a P2P model distribution server that downloads a model once and serves it to all pods over the network. It integrates directly with vLLM's weight loading pipeline via custom load formats.
### How It Works
1. A Model Express server runs in the cluster and caches model weights
2. Workers use `--load-format=mx-source` or `--load-format=mx-target` to load from the server
3. The K8s operator injects `MODEL_EXPRESS_URL` into all pods automatically
### Setup
**Install with Dynamo Platform:**
```bash
helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
--namespace ${NAMESPACE} \
--set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
```
**Configure workers to use Model Express:**
```yaml
services:
VllmWorker:
envs:
- name: VLLM_LOAD_FORMAT
value: mx-target
```
When `MODEL_EXPRESS_URL` is configured in the operator, it is automatically injected as an environment variable into all component pods. Workers using `mx-source` or `mx-target` load formats will connect to the server for model weight distribution.
### When to Use Model Express
| Scenario | Recommended Approach |
|----------|---------------------|
| Small cluster, simple setup | PVC + Download Job |
| Large cluster, many nodes | Model Express |
| Models already on shared storage (NFS) | PVC |
| Frequent model updates across fleet | Model Express |
## See Also
- [Managing Models with DynamoModel](deployment/dynamomodel-guide.md) — declarative model management CRD
- [Detailed Installation Guide](installation-guide.md) — Helm chart configuration including Model Express
- [LoRA Adapters](../features/lora/README.md) — dynamic adapter loading (separate from base model caching)
...@@ -6,22 +6,6 @@ title: Webhooks ...@@ -6,22 +6,6 @@ title: Webhooks
This document describes the webhook functionality in the Dynamo Operator, including validation webhooks, certificate management, and troubleshooting. This document describes the webhook functionality in the Dynamo Operator, including validation webhooks, certificate management, and troubleshooting.
## Table of Contents
- [Overview](#overview)
- [Architecture](#architecture)
- [Configuration](#configuration)
- [Certificate Management Options](#certificate-management-options)
- [Advanced Configuration](#advanced-configuration)
- [Certificate Management](#certificate-management)
- [Automatic Certificates (Default)](#automatic-certificates-default)
- [cert-manager Integration](#cert-manager-integration)
- [External Certificates](#external-certificates)
- [Multi-Operator Deployments](#multi-operator-deployments)
- [Troubleshooting](#troubleshooting)
---
## Overview ## Overview
The Dynamo Operator uses **Kubernetes admission webhooks** to provide real-time validation and mutation of custom resources. Currently, the operator implements **validation webhooks** that ensure invalid configurations are rejected immediately at the API server level, providing faster feedback to users compared to controller-based validation. The Dynamo Operator uses **Kubernetes admission webhooks** to provide real-time validation and mutation of custom resources. Currently, the operator implements **validation webhooks** that ensure invalid configurations are rejected immediately at the API server level, providing faster feedback to users compared to controller-based validation.
......
...@@ -120,21 +120,21 @@ When a sequence needs blocks, the manager first checks if they already exist (ca ...@@ -120,21 +120,21 @@ When a sequence needs blocks, the manager first checks if they already exist (ca
The following diagram illustrates the block lifecycle, based on vLLM's block manager design: The following diagram illustrates the block lifecycle, based on vLLM's block manager design:
``` ```mermaid
┌───── Cache hit (Use) ────┐ stateDiagram-v2
│ │ [*] --> Active : alloc
▼ │ Active --> Inactive : deref
┌───────────┐ ┌───────────┐ ┌──────────┴──────┐ ┌───────────┐ Inactive --> Active : cache hit (reuse)
│ New Block │──────►│ Active │──────►│ Inactive │──────►│ Freed │ Inactive --> Freed : evict
└───────────┘ alloc │ Pool │ deref │ Pool │ evict └───────────┘ Active --> Freed : destroy (preemption)
│(ref_count)│ │ (LRU order) │ Freed --> [*]
└─────┬─────┘ └─────────────────┘
state Active {
│ destroy (preemption) [*] --> Tracked : ref_count tracked
}
┌───────────┐ state Inactive {
│ Freed │ [*] --> Ordered : LRU order
└───────────┘ }
``` ```
### Evictor ### Evictor
......
...@@ -188,18 +188,24 @@ curl -s localhost:8000/v1/completions -H "Content-Type: application/json" -d '{ ...@@ -188,18 +188,24 @@ curl -s localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
``` ```
**Timeline:** **Timeline:**
``` ```mermaid
Timeline: 0, 1, ... sequenceDiagram
Client ────> Frontend:8000 ────────────────────> Dynamo component/backend (SGLang, TRT, vLLM) participant Client
│request start │received │ participant Frontend as Frontend:8000
| | | participant Backend as Backend (SGLang/TRT/vLLM)
│ ├──> start prefill ──> first token ──> |last token
│ │ (not impl) | | Client->>Frontend: Request start
├─────actual HTTP queue¹ ──────────┘ │ | Note over Frontend,Backend: HTTP queue begins
│ │ │ Frontend->>Backend: Forward request
├─────implemented HTTP queue ─────────────────────────────┘ | Note over Backend: Start prefill
│ │ Backend-->>Frontend: First token
└─────────────────────────────────── Inflight ────────────────────────────┘ Note over Frontend,Backend: HTTP queue ends
loop Token generation
Backend-->>Frontend: Tokens
end
Backend-->>Frontend: Last token
Frontend-->>Client: Complete response
Note over Frontend: Inflight ends
``` ```
**Concurrency Example:** **Concurrency Example:**
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment