docs: 1.0 documentation improvements (#7168)

Signed-off-by: Dan Gil <dagil@nvidia.com>
Co-authored-by: athreesh <anish.maddipoti@utexas.edu>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

Commit 96f3bdcc, authored Mar 10, 2026 by dagil-nvidia; committed by GitHub Mar 10, 2026
16 changed files
--- a/docs/backends/sglang/sglang-examples.md
+++ b/docs/backends/sglang/sglang-examples.md
@@ -6,16 +6,6 @@ title: Examples
 For quick start instructions, see the [SGLang README](README.md). This document provides all deployment patterns for running SGLang with Dynamo, including LLMs, multimodal, and diffusion models, and Kubernetes deployment.
-## Table of Contents
- [Infrastructure Setup](#infrastructure-setup)
- [LLM Serving](#llm-serving)
- [Embedding Models](#embedding-models)
- [Vision Models](#vision-models)
- [Diffusion Models](#diffusion-models)
- [Kubernetes Deployment](#kubernetes-deployment)
- [Testing](#testing)
 ## Infrastructure Setup
 For local/bare-metal development, start etcd and optionally NATS using Docker Compose:

--- a/docs/components/frontend/README.md
+++ b/docs/components/frontend/README.md
@@ -10,13 +10,22 @@ The Dynamo Frontend is the API gateway for serving LLM inference requests. It pr
 | Feature | Status |
 |---------|--------|
-| OpenAI Chat Completions API | ✅ Supported |
+| OpenAI Chat Completions API (`/v1/chat/completions`) | ✅ Supported |
-| OpenAI Completions API | ✅ Supported |
+| OpenAI Completions API (`/v1/completions`) | ✅ Supported |
+| OpenAI Embeddings API (`/v1/embeddings`) | ✅ Supported |
+| OpenAI Responses API (`/v1/responses`) | ✅ Supported |
+| OpenAI Models API (`/v1/models`) | ✅ Supported |
+| Image Generation (`/v1/images/generations`) | ✅ Supported |
+| Video Generation (`/v1/videos/generations`) | ✅ Supported |
+| Anthropic Messages API (`/v1/messages`) | 🧪 Experimental |
 | KServe gRPC v2 API | ✅ Supported |
-| Streaming responses | ✅ Supported |
+| Streaming responses (SSE) | ✅ Supported |
 | Multi-model serving | ✅ Supported |
-| Integrated routing | ✅ Supported |
+| Integrated KV-aware routing | ✅ Supported |
 | Tool calling | ✅ Supported |
+| TLS (HTTPS) | ✅ Supported |
+| Swagger UI (`/docs`) | ✅ Supported |
+| NVIDIA request extensions (`nvext`) | ✅ Supported |
 ## Quick Start
@@ -76,7 +85,7 @@ spec:
 |-----------|---------|-------------|
 | `--http-port` | 8000 | HTTP server port |
 | `--kserve-grpc-server` | false | Enable KServe gRPC server |
-| `--router-mode` | `round_robin` | Routing strategy: `round_robin`, `random`, `kv` |
+| `--router-mode` | `round-robin` | Routing strategy: `round-robin`, `random`, `kv`, `direct` |
 See the [Frontend Guide](frontend-guide.md) for full configuration options.
@@ -84,5 +93,7 @@ See the [Frontend Guide](frontend-guide.md) for full configuration options.
 | Document | Description |
 |----------|-------------|
+| [Configuration Reference](configuration.md) | All CLI arguments, env vars, and HTTP endpoints |
 | [Frontend Guide](frontend-guide.md) | KServe gRPC configuration and integration |
+| [NVIDIA Request Extensions (nvext)](nvext.md) | Custom request fields for routing hints and cache control |
 | [Router Documentation](../router/README.md) | KV-aware routing configuration |
--- a/docs/components/frontend/configuration.md
+++ b/docs/components/frontend/configuration.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Frontend Configuration Reference
+subtitle: Complete reference for all frontend CLI arguments, environment variables, and HTTP endpoints
+---
+This page documents all configuration options for the Dynamo Frontend (`python -m dynamo.frontend`).
+Every CLI argument has a corresponding environment variable. CLI arguments take precedence over environment variables.
+## HTTP & Networking
+| CLI Argument | Env Var | Default | Description |
+|-------------|---------|---------|-------------|
+| `--http-host` | `DYN_HTTP_HOST` | `0.0.0.0` | HTTP listen address |
+| `--http-port` | `DYN_HTTP_PORT` | `8000` | HTTP listen port |
+| `--tls-cert-path` | `DYN_TLS_CERT_PATH` | — | TLS certificate path (PEM). Must be paired with `--tls-key-path` |
+| `--tls-key-path` | `DYN_TLS_KEY_PATH` | — | TLS private key path (PEM). Must be paired with `--tls-cert-path` |
+The Rust HTTP server also reads these environment variables (not exposed as CLI args):
+| Env Var | Default | Description |
+|---------|---------|-------------|
+| `DYN_HTTP_BODY_LIMIT_MB` | `192` | Maximum request body size in MB |
+| `DYN_HTTP_GRACEFUL_SHUTDOWN_TIMEOUT_SECS` | `5` | Graceful shutdown timeout in seconds |
+## Router
+| CLI Argument | Env Var | Default | Description |
+|-------------|---------|---------|-------------|
+| `--router-mode` | `DYN_ROUTER_MODE` | `round-robin` | Routing strategy: `round-robin`, `random`, `kv`, `direct` |
+| `--router-kv-overlap-score-weight` | `DYN_ROUTER_KV_OVERLAP_SCORE_WEIGHT` | `1.0` | Weight for KV cache overlap in worker scoring. Higher = prefer cache reuse |
+| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` | Softmax temperature for worker sampling. 0 = deterministic |
+| `--router-kv-events` / `--no-router-kv-events` | `DYN_ROUTER_USE_KV_EVENTS` | `true` | Enable KV cache state events from workers. Disable for prediction-based routing |
+| `--router-ttl-secs` | `DYN_ROUTER_TTL_SECS` | `120.0` | Block TTL when KV events are disabled |
+| `--router-max-tree-size` | `DYN_ROUTER_MAX_TREE_SIZE` | `1048576` | Max radix tree size before pruning (no-events mode) |
+| `--router-prune-target-ratio` | `DYN_ROUTER_PRUNE_TARGET_RATIO` | `0.8` | Target size ratio after pruning (no-events mode) |
+| `--router-replica-sync` / `--no-router-replica-sync` | `DYN_ROUTER_REPLICA_SYNC` | `false` | Sync state across multiple router instances |
+| `--router-snapshot-threshold` | `DYN_ROUTER_SNAPSHOT_THRESHOLD` | `1000000` | Messages before triggering a snapshot |
+| `--router-reset-states` / `--no-router-reset-states` | `DYN_ROUTER_RESET_STATES` | `false` | Reset router state on startup. **Warning:** affects existing replicas |
+| `--router-track-active-blocks` / `--no-router-track-active-blocks` | `DYN_ROUTER_TRACK_ACTIVE_BLOCKS` | `true` | Track blocks used by in-progress requests for load balancing |
+| `--router-assume-kv-reuse` / `--no-router-assume-kv-reuse` | `DYN_ROUTER_ASSUME_KV_REUSE` | `true` | Assume KV cache reuse when tracking active blocks |
+| `--router-track-output-blocks` / `--no-router-track-output-blocks` | `DYN_ROUTER_TRACK_OUTPUT_BLOCKS` | `false` | Track output blocks with fractional decay during generation |
+| `--router-event-threads` | `DYN_ROUTER_EVENT_THREADS` | `4` | Event processing threads. >1 enables concurrent radix tree |
+| `--router-queue-threshold` | `DYN_ROUTER_QUEUE_THRESHOLD` | — | Queue threshold fraction of prefill capacity. Enables priority scheduling |
+| `--enable-cache-control` / `--no-enable-cache-control` | `DYN_ENABLE_CACHE_CONTROL` | `false` | Enable TTL-based cache pinning (requires `--router-mode=kv`) |
+| `--decode-fallback` / `--no-decode-fallback` | `DYN_DECODE_FALLBACK` | `false` | Fall back to aggregated mode when prefill workers unavailable |
+## Fault Tolerance
+| CLI Argument | Env Var | Default | Description |
+|-------------|---------|---------|-------------|
+| `--migration-limit` | `DYN_MIGRATION_LIMIT` | `0` | Max request migrations per worker disconnect. 0 = disabled |
+| `--active-decode-blocks-threshold` | `DYN_ACTIVE_DECODE_BLOCKS_THRESHOLD` | — | KV cache utilization fraction (0.0–1.0) for busy detection |
+| `--active-prefill-tokens-threshold` | `DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD` | — | Absolute token count for prefill busy detection |
+| `--active-prefill-tokens-threshold-frac` | `DYN_ACTIVE_PREFILL_TOKENS_THRESHOLD_FRAC` | — | Fraction of `max_num_batched_tokens` for prefill busy detection. OR logic with absolute threshold |
+## Model Discovery
+| CLI Argument | Env Var | Default | Description |
+|-------------|---------|---------|-------------|
+| `--namespace` | `DYN_NAMESPACE` | — | Exact namespace for model discovery scoping |
+| `--namespace-prefix` | `DYN_NAMESPACE_PREFIX` | — | Namespace prefix for discovery (e.g., `ns` matches `ns`, `ns-abc123`). Takes precedence over `--namespace` |
+| `--model-name` | `DYN_MODEL_NAME` | — | Override model name string |
+| `--model-path` | `DYN_MODEL_PATH` | — | Path to local model directory (for private/custom models) |
+| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | — | KV cache block size override |
+## Infrastructure
+| CLI Argument | Env Var | Default | Description |
+|-------------|---------|---------|-------------|
+| `--discovery-backend` | `DYN_DISCOVERY_BACKEND` | `etcd` | Service discovery: `kubernetes`, `etcd`, `file`, `mem` |
+| `--request-plane` | `DYN_REQUEST_PLANE` | `tcp` | Request distribution: `tcp` (fastest), `nats`, `http` |
+| `--event-plane` | `DYN_EVENT_PLANE` | `nats` | Event publishing: `nats`, `zmq` |
+## KServe gRPC
+| CLI Argument | Env Var | Default | Description |
+|-------------|---------|---------|-------------|
+| `--kserve-grpc-server` / `--no-kserve-grpc-server` | `DYN_KSERVE_GRPC_SERVER` | `false` | Start KServe gRPC v2 server |
+| `--grpc-metrics-port` | `DYN_GRPC_METRICS_PORT` | `8788` | HTTP metrics port for gRPC service |
+See the [Frontend Guide](frontend-guide.md) for KServe message formats and integration details.
+## Monitoring
+| CLI Argument | Env Var | Default | Description |
+|-------------|---------|---------|-------------|
+| `--metrics-prefix` | `DYN_METRICS_PREFIX` | `dynamo_frontend` | Prefix for frontend Prometheus metrics |
+| `--dump-config-to` | `DYN_DUMP_CONFIG_TO` | — | Dump resolved config to file path |
+## Experimental
+| CLI Argument | Env Var | Default | Description |
+|-------------|---------|---------|-------------|
+| `--enable-anthropic-api` | `DYN_ENABLE_ANTHROPIC_API` | `false` | Enable `/v1/messages` (Anthropic Messages API) |
+| `--dyn-chat-processor` | `DYN_CHAT_PROCESSOR` | `dynamo` | Chat processor: `dynamo` or `vllm` |
+| `--dyn-debug-perf` | `DYN_DEBUG_PERF` | `false` | Log per-function timing for preprocessing (vllm processor only) |
+| `--dyn-preprocess-workers` | `DYN_PREPROCESS_WORKERS` | `0` | Worker processes for CPU-bound preprocessing. 0 = main event loop (vllm processor only) |
+| `-i` / `--interactive` | `DYN_INTERACTIVE` | `false` | Interactive text chat mode |
+## HTTP Endpoints
+The frontend exposes the following HTTP endpoints:
+### OpenAI-Compatible
+| Method | Path | Description |
+|--------|------|-------------|
+| `POST` | `/v1/chat/completions` | Chat completions (streaming and non-streaming) |
+| `POST` | `/v1/completions` | Text completions |
+| `POST` | `/v1/embeddings` | Text embeddings |
+| `POST` | `/v1/responses` | Responses API |
+| `POST` | `/v1/images/generations` | Image generation |
+| `POST` | `/v1/videos/generations` | Video generation |
+| `POST` | `/v1/videos/generations/stream` | Video generation (streaming) |
+| `GET` | `/v1/models` | List available models |
+### Anthropic (Experimental)
+| Method | Path | Description |
+|--------|------|-------------|
+| `POST` | `/v1/messages` | Anthropic Messages API (requires `--enable-anthropic-api`) |
+| `POST` | `/v1/messages/count_tokens` | Token counting for Anthropic API |
+### Infrastructure
+| Method | Path | Description |
+|--------|------|-------------|
+| `GET` | `/health` | Health check |
+| `GET` | `/live` | Liveness check |
+| `GET` | `/metrics` | Prometheus metrics |
+| `GET` | `/openapi.json` | OpenAPI specification |
+| `GET` | `/docs` | Swagger UI |
+| `POST` | `/busy_threshold` | Set busy thresholds |
+| `GET` | `/busy_threshold` | Get current busy thresholds |
+### Endpoint Path Customization
+All endpoint paths can be overridden via environment variables:
+| Env Var | Default Path |
+|---------|-------------|
+| `DYN_HTTP_SVC_CHAT_PATH_ENV` | `/v1/chat/completions` |
+| `DYN_HTTP_SVC_CMP_PATH_ENV` | `/v1/completions` |
+| `DYN_HTTP_SVC_EMB_PATH_ENV` | `/v1/embeddings` |
+| `DYN_HTTP_SVC_RESPONSES_PATH_ENV` | `/v1/responses` |
+| `DYN_HTTP_SVC_MODELS_PATH_ENV` | `/v1/models` |
+| `DYN_HTTP_SVC_ANTHROPIC_PATH_ENV` | `/v1/messages` |
+| `DYN_HTTP_SVC_HEALTH_PATH_ENV` | `/health` |
+| `DYN_HTTP_SVC_LIVE_PATH_ENV` | `/live` |
+| `DYN_HTTP_SVC_METRICS_PATH_ENV` | `/metrics` |
+## Deprecated
+| CLI Argument | Env Var | Description |
+|-------------|---------|-------------|
+| `--router-durable-kv-events` | `DYN_ROUTER_DURABLE_KV_EVENTS` | Use event-plane local indexer instead |
+## See Also
+- [Frontend Overview](README.md) — quick start and feature matrix
+- [Frontend Guide](frontend-guide.md) — KServe gRPC configuration
+- [NVIDIA Request Extensions (nvext)](nvext.md) — custom request fields
+- [Router Guide](../router/router-guide.md) — detailed routing configuration
+- [Metrics](../../observability/metrics.md) — available Prometheus metrics
+- [Fault Tolerance](../../fault-tolerance/README.md) — request migration and rejection
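
The precedence rule stated at the top of `configuration.md` (CLI argument first, then environment variable, then the documented default) can be sketched as follows. This is a minimal illustration only: `resolve_http_port` is a hypothetical helper, not part of `dynamo.frontend`, and the real frontend may resolve its settings differently. Only the flag name `--http-port`, the variable `DYN_HTTP_PORT`, and the default `8000` come from the table above.

```python
import argparse
import os


def resolve_http_port(argv, environ=None):
    """Hypothetical helper: CLI flag wins, then env var, then the default."""
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser(add_help=False)
    # default=None lets us tell "flag not given" apart from an explicit value.
    parser.add_argument("--http-port", type=int, default=None)
    args, _unknown = parser.parse_known_args(argv)

    if args.http_port is not None:
        return args.http_port            # 1. CLI argument takes precedence
    env_value = environ.get("DYN_HTTP_PORT")
    if env_value is not None:
        return int(env_value)            # 2. environment variable
    return 8000                          # 3. documented default
```

Passing the environment as a parameter keeps the sketch easy to exercise without touching the real process environment.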
--- a/docs/components/kvbm/kvbm-guide.md
+++ b/docs/components/kvbm/kvbm-guide.md
@@ -9,20 +9,6 @@ The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to h
 KVBM is modular and can be used standalone via `pip install kvbm` or as the memory management component in the full Dynamo stack. This guide covers installation, configuration, and deployment of the Dynamo KV Block Manager (KVBM) and other KV cache management systems.
-## Table of Contents
- [Quick Start](#quick-start)
- [Run KVBM Standalone](#run-kvbm-standalone)
- [Run KVBM in Dynamo with vLLM](#run-kvbm-in-dynamo-with-vllm)
- [Run KVBM in Dynamo with TensorRT-LLM](#run-kvbm-in-dynamo-with-tensorrt-llm)
- [Run Dynamo with SGLang HiCache](#run-dynamo-with-sglang-hicache)
- [Disaggregated Serving with KVBM](#disaggregated-serving-with-kvbm)
- [Configuration](#configuration)
- [Enable and View KVBM Metrics](#enable-and-view-kvbm-metrics)
- [Benchmarking KVBM](#benchmarking-kvbm)
- [Troubleshooting](#troubleshooting)
- [Developing Locally](#developing-locally)
 ## Quick Start
 ## Run KVBM Standalone

--- a/docs/components/router/router-examples.md
+++ b/docs/components/router/router-examples.md
@@ -6,15 +6,6 @@ title: Router Examples
 For quick start instructions, see the [Router README](README.md). This document provides further examples for using the Dynamo Router, including Python API usage, Kubernetes deployments, and custom routing patterns.
-## Table of Contents
- [Using KvRouter Python API](#using-kvrouter-python-api)
- [K8s Examples](#k8s-examples)
- [Routing Patterns](#routing-patterns)
- [Custom Routing Example: Minimizing TTFT](#custom-routing-example-minimizing-ttft)
- [KV Event Publishing for Custom Engines](#kv-event-publishing-for-custom-engines)
- [Global Router (Hierarchical Routing)](#global-router-hierarchical-routing)
 ## Using KvRouter Python API
 Instead of launching the KV Router via command line, you can create a `KvRouter` object directly in Python. This allows per-request routing configuration overrides.

--- a/docs/components/router/router-guide.md
+++ b/docs/components/router/router-guide.md
@@ -10,7 +10,7 @@ subtitle: Enable KV-aware routing using Router for Dynamo deployments
 The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
 This guide helps you get started with using the Dynamo router, with further details on configuration, disaggregated serving setup, and parameter tuning.
-## Quick start
+## Quick Start
 ### Python / CLI Deployment
@@ -31,7 +31,7 @@ Backend workers register themselves using the `register_model` API, after which
 | Argument | Default | Description |
 |----------|---------|-------------|
-| `--router-mode kv` | `round_robin` | Enable KV cache-aware routing |
+| `--router-mode kv` | `round-robin` | Enable KV cache-aware routing |
 | `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
 | `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
 | `--router-kv-events` / `--no-router-kv-events` | `--router-kv-events` | Enable/disable real-time KV event tracking |
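
To make the `--router-temperature` description above concrete, here is a rough sketch of temperature-controlled worker selection. It assumes each worker already has a scalar cost where lower is better, and that sampling uses a softmax over negated costs. Both the `pick_worker` function and this cost convention are illustrative assumptions, not Dynamo's actual scoring code.

```python
import math
import random


def pick_worker(costs, temperature=0.0, rng=None):
    """Pick a worker id from {worker_id: cost}; lower cost is better (assumed)."""
    if not costs:
        raise ValueError("no workers available")
    if temperature <= 0.0:
        # Temperature 0.0: deterministic, always take the cheapest worker.
        return min(costs, key=costs.get)

    rng = rng or random.Random()
    lowest = min(costs.values())
    # Softmax over negated costs; subtracting the minimum keeps exp() stable.
    weights = {w: math.exp(-(c - lowest) / temperature) for w, c in costs.items()}
    workers = list(weights)
    return rng.choices(workers, weights=[weights[w] for w in workers], k=1)[0]
```

With a temperature of `0.0` the cheapest worker is always chosen. As the temperature grows, the weights flatten and the choice approaches uniform random selection, which matches the "higher = more random" wording in the table.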
@@ -73,7 +73,7 @@ All CLI arguments can be configured via environment variables using the `DYN_` p
 | CLI Argument | Environment Variable | Default |
 |--------------|---------------------|---------|
-| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round_robin` |
+| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round-robin` |
 | `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` |
 | `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific |
 | `--no-router-kv-events` | `DYN_ROUTER_USE_KV_EVENTS=false` | `true` |
@@ -311,7 +311,7 @@ graph TD
 ## Serving Multiple Router Replicas
-For improved fault tolerance, you can launch multiple frontend + router replicas. Since the frontend and router are currently tied together, you'll need to use different HTTP ports for each instance. (The separation of the frontend and Router is WIP.)
+For improved fault tolerance, you can launch multiple frontend + router replicas. If multiple `dynamo.frontend` processes share the same host or network namespace, give each instance a different HTTP port. In Kubernetes or on separate hosts, replicas can usually reuse the same container port. Alternatively, you can deploy the router separately as the standalone `python -m dynamo.router` service; see the [Standalone Router README](../../../components/src/dynamo/router/README.md).
 ### Router State Management

--- a/docs/design-docs/planner-design.md
+++ b/docs/design-docs/planner-design.md
@@ -147,7 +147,7 @@ The interpolators use the profiling sweep granularity to determine precision. Fi
 ## Initialization
-The planner starts with a 30-second delay (`INIT_PLANNER_START_DELAY`) to allow other components (frontend, workers) to register and stabilize. This is a known workaround (marked TODO in code) that should be replaced with a proper readiness check.
+The planner currently waits 30 seconds (`INIT_PLANNER_START_DELAY` in `components/src/dynamo/planner/__main__.py`) as a temporary workaround while other components (frontend, workers) register and stabilize; see [Known Limitations](#known-limitations) for the planned readiness-probing replacement.
 After the delay:
@@ -232,4 +232,3 @@ In aggregated mode (`--mode agg`), engines handle both prefill and decode via ch
 | `defaults.py`                | Default configs, backend name mappings                |
 | `planner_argparse.py`        | CLI argument definitions                              |
--- a/docs/development/backend-guide.md
+++ b/docs/development/backend-guide.md
@@ -2,9 +2,12 @@
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
 title: Writing Python Workers in Dynamo
+sidebar-title: Writing Python Workers
 subtitle: Create custom Python workers and engines for Dynamo
 ---
+# Writing Python Workers in Dynamo
 This guide explains how to create your own Python worker in Dynamo.
 The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo.
@@ -75,7 +78,7 @@ The `model_type` can be:
 See `examples/backends` for full code examples.
-## Component names
+## Component Names
 A worker needs three names to register itself: namespace.component.endpoint

--- a/docs/fault-tolerance/README.md
+++ b/docs/fault-tolerance/README.md
@@ -75,7 +75,7 @@ See [Health Checks](../observability/health-checks.md) for details.
 | Feature | Environment Variable | Default |
 |---------|---------------------|---------|
 | Worker health port | `DYN_SYSTEM_PORT` | `9090` |
-| Canary health checks | `DYN_HEALTH_CHECK_ENABLED` | `false` (K8s: `true`) |
+| Canary health checks | `DYN_HEALTH_CHECK_ENABLED` | `false` |
 | Canary wait time | `DYN_CANARY_WAIT_TIME` | `10` seconds |
 | Health check timeout | `DYN_HEALTH_CHECK_REQUEST_TIMEOUT` | `3` seconds |
 | Decode blocks threshold | `--active-decode-blocks-threshold` | None (disabled) |

--- a/docs/features/lora/README.md
+++ b/docs/features/lora/README.md
@@ -30,31 +30,20 @@ Dynamo's LoRA implementation provides:
 ### Architecture
-```text
+```mermaid
-┌─────────────────────────────────────────────────────────────────┐
+flowchart TD
-│                        LoRA Architecture                         │
+    Frontend["Frontend"] --> Router["Router<br/>(LoRA-aware)"]
-├─────────────────────────────────────────────────────────────────┤
+    Router --> Workers["Workers<br/>(LoRA-loaded)"]
-│                                                                  │
+    Workers --> ManagerNode["LoRA Manager"]
-│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐     │
-│  │   Frontend   │────▶│    Router    │────▶│   Workers    │     │
+    subgraph ManagerGroup["LoRA Manager"]
-│  │  /v1/models  │     │  LoRA-aware  │     │  LoRA-loaded │     │
+        Downloader
-│  └──────────────┘     └──────────────┘     └──────────────┘     │
+        Cache
-│                                                   │              │
+    end
-│                                                   ▼              │
-│                              ┌─────────────────────────────────┐ │
+    ManagerNode --> Local["file://<br/>Local"]
-│                              │         LoRA Manager            │ │
+    ManagerNode --> S3["s3://<br/>S3/MinIO"]
-│                              │  ┌───────────┐ ┌─────────────┐  │ │
+    ManagerNode --> HF["hf://<br/>(custom)"]
-│                              │  │ Downloader│ │    Cache    │  │ │
-│                              │  └───────────┘ └─────────────┘  │ │
-│                              └─────────────────────────────────┘ │
-│                                         │                        │
-│                     ┌───────────────────┼───────────────────┐   │
-│                     ▼                   ▼                   ▼   │
-│              ┌────────────┐      ┌────────────┐      ┌─────────┐│
-│              │  file://   │      │   s3://    │      │  hf://  ││
-│              │   Local    │      │  S3/MinIO  │      │(custom) ││
-│              └────────────┘      └────────────┘      └─────────┘│
-└─────────────────────────────────────────────────────────────────┘
 ```
 The LoRA system consists of:

--- a/docs/index.yml
+++ b/docs/index.yml
@@ -138,7 +138,7 @@ navigation:
            path: fault-tolerance/request-rejection.md
          - page: Testing
            path: fault-tolerance/testing.md
-      - page: Writing Python Workers in Dynamo
+      - page: Writing Python Workers
        path: development/backend-guide.md
  # ==================== Backends ====================
@@ -237,18 +237,22 @@ navigation:
        path: design-docs/disagg-serving.md
      - page: Distributed Runtime
        path: design-docs/distributed-runtime.md
-      - page: Discovery Plane
+      - section: Communication Planes
-        path: design-docs/discovery-plane.md
+        contents:
-      - page: Request Plane
+          - page: Discovery Plane
-        path: design-docs/request-plane.md
+            path: design-docs/discovery-plane.md
-      - page: Event Plane
+          - page: Request Plane
-        path: design-docs/event-plane.md
+            path: design-docs/request-plane.md
-      - page: Router Design
+          - page: Event Plane
-        path: design-docs/router-design.md
+            path: design-docs/event-plane.md
-      - page: KVBM Design
+      - section: Component Design
-        path: design-docs/kvbm-design.md
+        contents:
-      - page: Planner Design
+          - page: Router Design
-        path: design-docs/planner-design.md
+            path: design-docs/router-design.md
+          - page: KVBM Design
+            path: design-docs/kvbm-design.md
+          - page: Planner Design
+            path: design-docs/planner-design.md
  # ==================== Blog ====================
  - section: Blog

--- a/docs/kubernetes/inference-gateway.md
+++ b/docs/kubernetes/inference-gateway.md
@@ -17,23 +17,6 @@ If you want to use LoRA deploy Dynamo without the Inference Gateway.
 Currently, these setups are only supported with the kGateway based Inference Gateway.
-## Table of Contents
- [Prerequisites](#prerequisites)
- [Installation Steps](#installation-steps)
-  - [1. Install Dynamo Platform](#1-install-dynamo-platform)
-  - [2. Deploy Inference Gateway](#2-deploy-inference-gateway)
-  - [3. Deploy Your Model](#3-deploy-your-model)
-  - [4. Build EPP image (Optional)](#4-build-epp-image-optional)
-  - [5. Deploy](#5-deploy)
-  - [6. Verify Installation](#6-verify-installation)
-  - [7. Usage](#7-usage)
-  - [8. Deleting the installation](#8-deleting-the-installation)
- [Gateway API Inference Extension Details](#gateway-api-inference-extension-integration)
-  - [Router bookkeeping operations](#router-bookkeeping-operations)
-  - [Header Routing Hints](#header-routing-hints)
 ## Prerequisites
 - Kubernetes cluster with kubectl configured

--- a/docs/kubernetes/model-caching.md
+++ b/docs/kubernetes/model-caching.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+title: Model Caching
+subtitle: Download models once and share across all pods in a Kubernetes cluster
+---
+Large language models can take minutes to download. Without caching, every pod downloads the full model independently, wasting bandwidth and delaying startup. Dynamo supports two approaches to ensure models are downloaded once and shared across the cluster.
+## Option 1: PVC + Download Job (Recommended)
+The simplest approach: create a shared PVC, run a one-time Job to download the model, then mount the PVC in your DynamoGraphDeployment.
+This is the pattern used by all Dynamo recipes today.
+### Step 1: Create a Shared PVC
+```yaml
+apiVersion: v1
+kind: PersistentVolumeClaim
+metadata:
+  name: model-cache
+spec:
+  accessModes:
+    - ReadWriteMany
+  resources:
+    requests:
+      storage: 100Gi
+```
+<Note>
+`ReadWriteMany` access mode is required so multiple pods can mount the PVC simultaneously. Ensure your storage class supports RWX (e.g., NFS, CephFS, or cloud-provider shared file systems).
+</Note>
+### Step 2: Download the Model
+```yaml
+apiVersion: batch/v1
+kind: Job
+metadata:
+  name: model-download
+spec:
+  template:
+    spec:
+      restartPolicy: Never
+      containers:
+        - name: downloader
+          image: python:3.12-slim
+          command: ["sh", "-c"]
+          args:
+            - |
+              pip install huggingface_hub hf_transfer
+              HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
+                $MODEL_NAME --revision $MODEL_REVISION
+          env:
+            - name: MODEL_NAME
+              value: "Qwen/Qwen3-0.6B"
+            - name: MODEL_REVISION
+              value: "main"
+            - name: HF_HOME
+              value: /cache/huggingface
+          envFrom:
+            - secretRef:
+                name: hf-token-secret
+          volumeMounts:
+            - name: model-cache
+              mountPath: /cache/huggingface
+      volumes:
+        - name: model-cache
+          persistentVolumeClaim:
+            claimName: model-cache
+```
+### Step 3: Mount in DynamoGraphDeployment
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-deployment
+spec:
+  pvcs:
+    - create: false
+      name: model-cache
+  services:
+    VllmWorker:
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /home/dynamo/.cache/huggingface
+```
+All `VllmWorker` pods that mount `model-cache` now read from the shared cache, avoiding per-pod worker downloads. If you also want the frontend to reuse tokenizer and config files, mount the same PVC there too.
+### Compilation Cache
+For vLLM, you can also cache compiled artifacts (CUDA graphs, etc.) with a second PVC:
+```yaml
+spec:
+  pvcs:
+    - create: false
+      name: model-cache
+    - create: false
+      name: compilation-cache
+  services:
+    VllmWorker:
+      volumeMounts:
+        - name: model-cache
+          mountPoint: /home/dynamo/.cache/huggingface
+        - name: compilation-cache
+          mountPoint: /home/dynamo/.cache/vllm
+```
+## Option 2: Model Express (P2P Distribution)
+[Model Express](https://github.com/ai-dynamo/modelexpress) is a P2P model distribution server that downloads a model once and serves it to all pods over the network. It integrates directly with vLLM's weight loading pipeline via custom load formats.
+### How It Works
+1. A Model Express server runs in the cluster and caches model weights
+2. Workers use `--load-format=mx-source` or `--load-format=mx-target` to load from the server
+3. The K8s operator injects `MODEL_EXPRESS_URL` into all pods automatically
+### Setup
+**Install with Dynamo Platform:**
+```bash
+helm install dynamo-platform dynamo-platform-${RELEASE_VERSION}.tgz \
+  --namespace ${NAMESPACE} \
+  --set "dynamo-operator.modelExpressURL=http://model-express-server.model-express.svc.cluster.local:8080"
+```
+**Configure workers to use Model Express:**
+```yaml
+services:
+  VllmWorker:
+    envs:
+      - name: VLLM_LOAD_FORMAT
+        value: mx-target
+```
+When `MODEL_EXPRESS_URL` is configured in the operator, it is automatically injected as an environment variable into all component pods. Workers using `mx-source` or `mx-target` load formats will connect to the server for model weight distribution.
+### When to Use Model Express
+| Scenario | Recommended Approach |
+|----------|---------------------|
+| Small cluster, simple setup | PVC + Download Job |
+| Large cluster, many nodes | Model Express |
+| Models already on shared storage (NFS) | PVC |
+| Frequent model updates across fleet | Model Express |
+## See Also
+- [Managing Models with DynamoModel](deployment/dynamomodel-guide.md) — declarative model management CRD
+- [Detailed Installation Guide](installation-guide.md) — Helm chart configuration including Model Express
+- [LoRA Adapters](../features/lora/README.md) — dynamic adapter loading (separate from base model caching)
--- a/docs/kubernetes/webhooks.md
+++ b/docs/kubernetes/webhooks.md
@@ -6,22 +6,6 @@ title: Webhooks
 This document describes the webhook functionality in the Dynamo Operator, including validation webhooks, certificate management, and troubleshooting.
-## Table of Contents
- [Overview](#overview)
- [Architecture](#architecture)
- [Configuration](#configuration)
-  - [Certificate Management Options](#certificate-management-options)
-  - [Advanced Configuration](#advanced-configuration)
- [Certificate Management](#certificate-management)
-  - [Automatic Certificates (Default)](#automatic-certificates-default)
-  - [cert-manager Integration](#cert-manager-integration)
-  - [External Certificates](#external-certificates)
- [Multi-Operator Deployments](#multi-operator-deployments)
- [Troubleshooting](#troubleshooting)
---
 ## Overview
 The Dynamo Operator uses **Kubernetes admission webhooks** to provide real-time validation and mutation of custom resources. Currently, the operator implements **validation webhooks** that ensure invalid configurations are rejected immediately at the API server level, providing faster feedback to users compared to controller-based validation.

--- a/docs/mocker/mocker.md
+++ b/docs/mocker/mocker.md
@@ -120,21 +120,21 @@ When a sequence needs blocks, the manager first checks if they already exist (ca
 The following diagram illustrates the block lifecycle, based on vLLM's block manager design:
-```
+```mermaid
-                        ┌───── Cache hit (Use) ────┐
+stateDiagram-v2
-                        │                          │
+    [*] --> Active : alloc
-                        ▼                          │
+    Active --> Inactive : deref
-┌───────────┐       ┌───────────┐       ┌──────────┴──────┐       ┌───────────┐
+    Inactive --> Active : cache hit (reuse)
-│ New Block │──────►│  Active   │──────►│    Inactive     │──────►│   Freed   │
+    Inactive --> Freed : evict
-└───────────┘ alloc │   Pool    │ deref │      Pool       │ evict └───────────┘
+    Active --> Freed : destroy (preemption)
-                    │(ref_count)│       │   (LRU order)   │
+    Freed --> [*]
-                    └─────┬─────┘       └─────────────────┘
-                          │
+    state Active {
-                          │ destroy (preemption)
+        [*] --> Tracked : ref_count tracked
-                          ▼
+    }
-                    ┌───────────┐
+    state Inactive {
-                    │   Freed   │
+        [*] --> Ordered : LRU order
-                    └───────────┘
+    }
 ```
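
The block lifecycle in the diagram above can be approximated with a small pool that keeps reference counts for active blocks and an LRU-ordered set of inactive blocks. This is a simplified sketch, not the mocker's implementation: the `BlockPool` class and its method names are hypothetical, and it assumes a block becomes inactive only when its reference count drops to zero.

```python
from collections import OrderedDict


class BlockPool:
    """Hypothetical pool mirroring the Active / Inactive / Freed states."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.active = {}               # block_id -> ref_count
        self.inactive = OrderedDict()  # block_id -> None, least recently used first

    def use(self, block_id):
        """Allocate or reuse a block; return False when nothing can be evicted."""
        if block_id in self.active:
            self.active[block_id] += 1
            return True
        if block_id in self.inactive:
            del self.inactive[block_id]         # cache hit: Inactive -> Active
            self.active[block_id] = 1
            return True
        if len(self.active) + len(self.inactive) >= self.capacity:
            if not self.inactive:
                return False                    # every block is still referenced
            self.inactive.popitem(last=False)   # evict LRU: Inactive -> Freed
        self.active[block_id] = 1               # alloc: new block -> Active
        return True

    def deref(self, block_id):
        """Drop one reference; at zero the block moves Active -> Inactive."""
        self.active[block_id] -= 1
        if self.active[block_id] == 0:
            del self.active[block_id]
            self.inactive[block_id] = None      # appended as most recently used

    def destroy(self, block_id):
        """Preemption: Active -> Freed, skipping the inactive pool."""
        self.active.pop(block_id, None)
```

Keeping inactive blocks in insertion order means the oldest dereferenced block is evicted first, while a later cache hit can still revive any block that has not yet been evicted.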
 ### Evictor

--- a/docs/observability/metrics.md
+++ b/docs/observability/metrics.md
@@ -188,18 +188,24 @@ curl -s localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
 ```
 **Timeline:**
-```
+```mermaid
-Timeline:    0, 1, ...
+sequenceDiagram
-Client ────> Frontend:8000 ────────────────────> Dynamo component/backend (SGLang, TRT, vLLM)
+    participant Client
-             │request start                     │received                              │
+    participant Frontend as Frontend:8000
-             |                                  |                                      |
+    participant Backend as Backend (SGLang/TRT/vLLM)
-             │                                  ├──> start prefill ──> first token ──> |last token
-             │                                  │     (not impl)       |               |
+    Client->>Frontend: Request start
-             ├─────actual HTTP queue¹ ──────────┘                      │               |
+    Note over Frontend,Backend: HTTP queue begins
-             │                                                         │               │
+    Frontend->>Backend: Forward request
-             ├─────implemented HTTP queue ─────────────────────────────┘               |
+    Note over Backend: Start prefill
-             │                                                                         │
+    Backend-->>Frontend: First token
-             └─────────────────────────────────── Inflight ────────────────────────────┘
+    Note over Frontend,Backend: HTTP queue ends
+    loop Token generation
+        Backend-->>Frontend: Tokens
+    end
+    Backend-->>Frontend: Last token
+    Frontend-->>Client: Complete response
+    Note over Frontend: Inflight ends
 ```
 **Concurrency Example:**