docs: migrate Fern docs from fern/ into docs/ (#6206)

Signed-off-by: Jont828 <jt572@cornell.edu>

39d645e5 · Jonathan Tong · GitHub · d381e6ff
Commit 39d645e5 (unverified), authored Feb 11, 2026 by Jonathan Tong; committed by GitHub Feb 11, 2026
20 changed files
--- a/docs/components/kvbm/kvbm_guide.md
+++ b/docs/components/kvbm/kvbm_guide.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
--->
-
-# KVBM Guide
-The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM and TensorRT-LLM.
-
-KVBM is modular: it can run standalone via `pip install kvbm` or serve as the memory management component of the full Dynamo stack. This guide covers installing, configuring, and deploying KVBM, along with other KV cache management systems.
-
-## Table of Contents
-
-- [Quick Start](#quick-start)
-- [Run KVBM Standalone](#run-kvbm-standalone)
-- [Run KVBM in Dynamo with vLLM](#run-kvbm-in-dynamo-with-vllm)
-- [Run KVBM in Dynamo with TensorRT-LLM](#run-kvbm-in-dynamo-with-tensorrt-llm)
-- [Run Dynamo with SGLang HiCache](#run-dynamo-with-sglang-hicache)
-- [Disaggregated Serving with KVBM](#disaggregated-serving-with-kvbm)
-- [Configuration](#configuration)
-- [Enable and View KVBM Metrics](#enable-and-view-kvbm-metrics)
-- [Benchmarking KVBM](#benchmarking-kvbm)
-- [Troubleshooting](#troubleshooting)
-- [Developing Locally](#developing-locally)
-
-## Quick Start
-
-## Run KVBM Standalone
-
-KVBM can be used on its own, without the rest of the Dynamo stack:
-
-```bash
-pip install kvbm
-```
-
-See the [support matrix](../../reference/support-matrix.md) for version compatibility.
-
-### Build from Source
-
-To build KVBM from source, see the detailed instructions in the [KVBM bindings README](../../../lib/bindings/kvbm/README.md#build-from-source).
-
-## Run KVBM in Dynamo with vLLM
-
-### Docker Setup
-
-```bash
-# Start up etcd for KVBM leader/worker registration and discovery
-docker compose -f deploy/docker-compose.yml up -d
-
-# Build a dynamo vLLM container (KVBM is built in by default)
-./container/build.sh --framework vllm
-
-# Launch the container
-./container/run.sh --framework vllm -it --mount-workspace --use-nixl-gds
-```
-
-### Aggregated Serving
-
-```bash
-cd $DYNAMO_HOME/examples/backends/vllm
-./launch/agg_kvbm.sh
-```
-
-#### Verify Deployment
-
-```bash
-curl localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [{"role": "user", "content": "Hello, how are you?"}],
-    "stream": false,
-    "max_tokens": 10
-  }'
-```
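The same check can be scripted. The sketch below is a hedged Python equivalent of the `curl` call above, using only the standard library. The helper name `build_chat_request` is illustrative and not part of Dynamo; the endpoint path and payload fields are copied from the example.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str,
                       max_tokens: int = 10) -> urllib.request.Request:
    """Build (but do not send) a chat completion request like the curl example."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Once the frontend is listening on port 8000, pass the result of `build_chat_request("http://localhost:8000", "Qwen/Qwen3-0.6B", "Hello, how are you?")` to `urllib.request.urlopen` and decode the JSON response.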
-
-#### Alternative: Using Direct vllm serve
-
-You can also use `vllm serve` directly with KVBM:
-
-```bash
-vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "kvbm.vllm_integration.connector"}' Qwen/Qwen3-0.6B
-```
-
-## Run KVBM in Dynamo with TensorRT-LLM
-
-> [!NOTE]
-> **Prerequisites:**
-> - Ensure `etcd` and `nats` are running before starting
-> - KVBM only supports TensorRT-LLM's PyTorch backend
-> - Disable partial reuse (`enable_partial_reuse: false`) to increase offloading cache hits
-> - KVBM requires TensorRT-LLM v1.2.0rc2 or newer
-
-### Docker Setup
-
-```bash
-# Start up etcd for KVBM leader/worker registration and discovery
-docker compose -f deploy/docker-compose.yml up -d
-
-# Build a dynamo TRTLLM container (KVBM is built in by default)
-./container/build.sh --framework trtllm
-
-# Launch the container
-./container/run.sh --framework trtllm -it --mount-workspace --use-nixl-gds
-```
-
-### Aggregated Serving
-
-```bash
-# Write the LLM API config
-cat > "/tmp/kvbm_llm_api_config.yaml" <<EOF
-backend: pytorch
-cuda_graph_config: null
-kv_cache_config:
-  enable_partial_reuse: false
-  free_gpu_memory_fraction: 0.80
-kv_connector_config:
-  connector_module: kvbm.trtllm_integration.connector
-  connector_scheduler_class: DynamoKVBMConnectorLeader
-  connector_worker_class: DynamoKVBMConnectorWorker
-EOF
-
-# Start dynamo frontend
-python3 -m dynamo.frontend --http-port 8000 &
-
-# Serve the model with KVBM
-python3 -m dynamo.trtllm \
-  --model-path Qwen/Qwen3-0.6B \
-  --served-model-name Qwen/Qwen3-0.6B \
-  --extra-engine-args /tmp/kvbm_llm_api_config.yaml &
-```
-
-#### Verify Deployment
-
-```bash
-curl localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [{"role": "user", "content": "Hello, how are you?"}],
-    "stream": false,
-    "max_tokens": 30
-  }'
-```
-
-#### Alternative: Using trtllm-serve
-
-```bash
-trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml
-```
-
-## Run Dynamo with SGLang HiCache
-
-SGLang's Hierarchical Cache (HiCache) extends KV cache storage beyond GPU memory to include host CPU memory. When using NIXL as the storage backend, HiCache integrates with Dynamo's memory infrastructure.
-
-### Quick Start
-
-```bash
-# Start SGLang worker with HiCache enabled
-python -m dynamo.sglang \
-  --model-path Qwen/Qwen3-0.6B \
-  --host 0.0.0.0 --port 8000 \
-  --enable-hierarchical-cache \
-  --hicache-ratio 2 \
-  --hicache-write-policy write_through \
-  --hicache-storage-backend nixl
-
-# In a separate terminal, start the frontend
-python -m dynamo.frontend --http-port 8000
-
-# Send a test request
-curl localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [{"role": "user", "content": "Hello!"}],
-    "stream": false,
-    "max_tokens": 30
-  }'
-```
-
-> **Learn more:** See the [SGLang HiCache Integration Guide](../../integrations/sglang_hicache.md) for detailed configuration, deployment examples, and troubleshooting.
-
-## Disaggregated Serving with KVBM
-
-KVBM supports disaggregated serving, where prefill and decode run on separate workers. In this mode, KVBM is enabled on the prefill worker to offload KV cache.
-
-### Disaggregated Serving with vLLM
-
-```bash
-# 1P1D - one prefill worker and one decode worker
-# NOTE: requires at least 2 GPUs
-cd $DYNAMO_HOME/examples/backends/vllm
-./launch/disagg_kvbm.sh
-
-# 2P2D - two prefill workers and two decode workers
-# NOTE: requires at least 4 GPUs
-cd $DYNAMO_HOME/examples/backends/vllm
-./launch/disagg_kvbm_2p2d.sh
-```
-
-### Disaggregated Serving with TRT-LLM
-
-> [!NOTE]
-> The latest TensorRT-LLM release (1.3.0rc1) currently hangs on requests when running disaggregated serving with KVBM.
-> To test KVBM with disaggregated serving, build the Dynamo TensorRT-LLM runtime image from TensorRT-LLM commit `18e611da773026a55d187870ebcfa95ff00c8482`.
-
-```bash
-# Build the Dynamo TensorRT-LLM container using commit ID 18e611da773026a55d187870ebcfa95ff00c8482. Note: This build can take a long time.
-./container/build.sh --framework trtllm --tensorrtllm-commit 18e611da773026a55d187870ebcfa95ff00c8482 --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git
-
-# Launch the container
-./container/run.sh --framework trtllm -it --mount-workspace --use-nixl-gds
-```
-> [!NOTE]
-> Important: After logging into the Dynamo TensorRT-LLM runtime container, copy the Triton kernels into the container’s virtual environment as a separate Python module.
-
-```bash
-# Clone the TensorRT-LLM repo and copy the triton_kernels folder into the container as a Python module.
-git clone https://github.com/NVIDIA/TensorRT-LLM.git /tmp/TensorRT-LLM && \
-cd /tmp/TensorRT-LLM && \
-git checkout 18e611da773026a55d187870ebcfa95ff00c8482 && \
-cp -r triton_kernels /opt/dynamo/venv/lib/python3.12/site-packages/ && \
-cd /workspace && \
-rm -rf /tmp/TensorRT-LLM
-```
-
-```bash
-# Launch prefill worker with KVBM
-python3 -m dynamo.trtllm \
-  --model-path Qwen/Qwen3-0.6B \
-  --served-model-name Qwen/Qwen3-0.6B \
-  --extra-engine-args /tmp/kvbm_llm_api_config.yaml \
-  --disaggregation-mode prefill &
-```
-
-## Configuration
-
-### Cache Tier Configuration
-
-Configure KVBM cache tiers using environment variables:
-
-```bash
-# Option 1: CPU cache only (GPU -> CPU offloading)
-export DYN_KVBM_CPU_CACHE_GB=4  # 4GB of pinned CPU memory
-
-# Option 2: Both CPU and Disk cache (GPU -> CPU -> Disk tiered offloading)
-export DYN_KVBM_CPU_CACHE_GB=4
-export DYN_KVBM_DISK_CACHE_GB=8  # 8GB of disk
-
-# [Experimental] Option 3: Disk cache only (GPU -> Disk direct offloading)
-# NOTE: Experimental, may not provide optimal performance
-# NOTE: Disk offload filtering not supported with this option
-export DYN_KVBM_DISK_CACHE_GB=8
-```
-
-You can also specify exact block counts instead of GB:
-- `DYN_KVBM_CPU_CACHE_OVERRIDE_NUM_BLOCKS`
-- `DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS`
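To relate a GB budget to a block count, the usual KV cache arithmetic can serve as a rough guide. The sketch below is only an estimate under stated assumptions: it treats "GB" as GiB, assumes one block stores keys and values for every layer, and uses illustrative model dimensions. KVBM's actual block layout and sizing may differ, and both function names are hypothetical.

```python
def kv_block_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   tokens_per_block: int, dtype_bytes: int = 2) -> int:
    """Approximate size of one KV block in bytes."""
    # The factor of 2 covers the separate key and value tensors.
    return 2 * num_layers * num_kv_heads * head_dim * tokens_per_block * dtype_bytes


def blocks_for_budget(cache_gib: float, block_bytes: int) -> int:
    """Number of whole blocks that fit in a cache of cache_gib GiB (assumed unit)."""
    return int(cache_gib * 1024**3) // block_bytes


# Illustrative dimensions only: 28 layers, 8 KV heads, head_dim 128,
# 16 tokens per block, 2-byte (bf16/fp16) elements.
block_bytes = kv_block_bytes(28, 8, 128, 16)
print(block_bytes, blocks_for_budget(4, block_bytes))
```

With these example dimensions a block is 1.75 MiB, so a 4 GiB budget holds 2340 blocks.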
-
-### SSD Lifespan Protection
-
-When disk offloading is enabled, disk offload filtering is on by default to extend SSD lifespan. The current policy offloads a KV block from CPU to disk only if its frequency is ≥ 2. A block's frequency starts at 1, doubles on each cache hit, and decreases by 1 at each time-decay step.
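The frequency rule can be modeled in a few lines. This is a minimal sketch of the policy as described, not KVBM's implementation: the class and method names are hypothetical, and flooring the frequency at 1 during decay is an assumption the text does not state.

```python
from dataclasses import dataclass

OFFLOAD_THRESHOLD = 2  # a block needs frequency >= 2 to be written to disk


@dataclass
class BlockFrequency:
    value: int = 1  # every block starts at frequency 1

    def on_cache_hit(self) -> None:
        # Frequency doubles on each cache hit.
        self.value *= 2

    def on_decay_step(self) -> None:
        # Frequency drops by 1 per decay step; the floor of 1 is an assumption.
        self.value = max(1, self.value - 1)

    def eligible_for_disk(self) -> bool:
        return self.value >= OFFLOAD_THRESHOLD
```

Under this model a block that is never reused stays at frequency 1 and is never written to disk, which is how the filter saves SSD writes.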
-
-To disable disk offload filtering:
-
-```bash
-export DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER=true
-```
-
-## Enable and View KVBM Metrics
-
-### Setup Monitoring Stack
-
-```bash
-# Start basic services (etcd and NATS), along with Prometheus and Grafana
-docker compose -f deploy/docker-observability.yml up -d
-```
-
-### Enable Metrics for vLLM
-
-```bash
-DYN_KVBM_METRICS=true \
-DYN_KVBM_CPU_CACHE_GB=20 \
-python -m dynamo.vllm \
-    --model Qwen/Qwen3-0.6B \
-    --enforce-eager \
-    --connector kvbm
-```
-
-### Enable Metrics for TensorRT-LLM
-
-```bash
-DYN_KVBM_METRICS=true \
-DYN_KVBM_CPU_CACHE_GB=20 \
-python3 -m dynamo.trtllm \
-  --model-path Qwen/Qwen3-0.6B \
-  --served-model-name Qwen/Qwen3-0.6B \
-  --extra-engine-args /tmp/kvbm_llm_api_config.yaml &
-```
-
-### Firewall Configuration (Optional)
-
-```bash
-# If firewall blocks KVBM metrics ports
-sudo ufw allow 6880/tcp
-```
-
-### View Metrics
-
-Access Grafana at http://localhost:3000 (default login: `dynamo`/`dynamo`) and look for the **KVBM Dashboard**.
-
-### Available Metrics
-
-| Metric | Description |
-|--------|-------------|
-| `kvbm_matched_tokens` | Number of matched tokens |
-| `kvbm_offload_blocks_d2h` | Offload blocks from device to host |
-| `kvbm_offload_blocks_h2d` | Offload blocks from host to disk |
-| `kvbm_offload_blocks_d2d` | Offload blocks from device to disk (bypassing host) |
-| `kvbm_onboard_blocks_d2d` | Onboard blocks from disk to device |
-| `kvbm_onboard_blocks_h2d` | Onboard blocks from host to device |
-| `kvbm_host_cache_hit_rate` | Host cache hit rate (0.0-1.0) |
-| `kvbm_disk_cache_hit_rate` | Disk cache hit rate (0.0-1.0) |
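If you want to inspect these metrics outside Grafana, a scrape in Prometheus text format can be filtered with a short script. The sketch below assumes plain `name value` or `name{labels} value` lines without trailing timestamps, and keeps only the last sample per metric name. The function name and the sample values are made up for illustration.

```python
def parse_kvbm_metrics(exposition_text: str) -> dict[str, float]:
    """Extract kvbm_* samples from Prometheus text-format output."""
    metrics: dict[str, float] = {}
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name.startswith("kvbm_"):
            metrics[name] = float(value)
    return metrics


# Made-up sample scrape, for illustration only.
SAMPLE = """\
# HELP kvbm_matched_tokens Number of matched tokens
kvbm_matched_tokens 1024
kvbm_host_cache_hit_rate 0.75
process_cpu_seconds_total 12.5
"""

print(parse_kvbm_metrics(SAMPLE))
```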
-
-## Benchmarking KVBM
-
-Use [LMBenchmark](https://github.com/LMCache/LMBenchmark) to evaluate KVBM performance.
-
-### Setup
-
-```bash
-git clone https://github.com/LMCache/LMBenchmark.git
-cd LMBenchmark/synthetic-multi-round-qa
-```
-
-### Run Benchmark
-
-```bash
-# Synthetic multi-turn chat dataset
-# Arguments: model, endpoint, output prefix, qps
-./long_input_short_output_run.sh \
-    "Qwen/Qwen3-0.6B" \
-    "http://localhost:8000" \
-    "benchmark_kvbm" \
-    1
-```
-
-The script's output reports average TTFT and other performance numbers.
-
-> **TIP:** If metrics are enabled, observe KV offloading and onboarding in the Grafana dashboard.
-
-### Baseline Comparison
-
-#### vLLM Baseline (without KVBM)
-
-```bash
-vllm serve Qwen/Qwen3-0.6B
-```
-
-#### TensorRT-LLM Baseline (without KVBM)
-
-```bash
-# Create config without kv_connector_config
-cat > "/tmp/llm_api_config.yaml" <<EOF
-backend: pytorch
-cuda_graph_config: null
-kv_cache_config:
-  enable_partial_reuse: false
-  free_gpu_memory_fraction: 0.80
-EOF
-
-trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/llm_api_config.yaml
-```
-
-## Troubleshooting
-
-### No TTFT Performance Gain
-
-**Symptom:** Enabling KVBM does not show TTFT improvement or causes performance degradation.
-
-**Cause:** Not enough prefix cache hits on KVBM to reuse offloaded KV blocks.
-
-**Solution:** Enable KVBM metrics and check the Grafana dashboard for `Onboard Blocks - Host to Device` and `Onboard Blocks - Disk to Device`. Large numbers of onboarded KV blocks indicate good cache reuse:
-
-![Grafana Example](../../images/kvbm_metrics_grafana.png)
-
-### KVBM Worker Initialization Timeout
-
-**Symptom:** KVBM fails to start when allocating large memory or disk storage.
-
-**Solution:** Increase the leader-worker initialization timeout (default: 1800 seconds):
-
-```bash
-export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600  # 1 hour
-```
-
-### Disk Offload Fails to Start
-
-**Symptom:** KVBM fails to start when disk offloading is enabled.
-
-**Cause:** `fallocate()` is not supported on the filesystem (e.g., Lustre, certain network filesystems).
-
-**Solution:** Enable disk zerofill fallback:
-
-```bash
-export DYN_KVBM_DISK_ZEROFILL_FALLBACK=true
-```
-
-If you encounter "write all error" or EINVAL (errno 22), also try:
-
-```bash
-export DYN_KVBM_DISK_DISABLE_O_DIRECT=true
-```
-
-## Developing Locally
-
-Inside the Dynamo container, after changing KVBM-related code (Rust and/or Python):
-
-```bash
-cd /workspace/lib/bindings/kvbm
-uv pip install "maturin[patchelf]"
-maturin build --release --out /workspace/dist
-uv pip install --upgrade --force-reinstall --no-deps /workspace/dist/kvbm*.whl
-```
-
-## See Also
-
-- [KVBM Overview](README.md) for a quick overview of KV Caching, KVBM and its architecture
-- [KVBM Design](../../design_docs/kvbm_design.md) for a deep dive into KVBM architecture
-- [LMCache Integration](../../integrations/lmcache_integration.md)
-- [FlexKV Integration](../../integrations/flexkv_integration.md)
-- [SGLang HiCache](../../integrations/sglang_hicache.md)
--- a/docs/components/planner/README.md
+++ b/docs/components/planner/README.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
--->
-
-# Planner
-
-The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
-
-> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](planner_guide.md) for a complete workflow including profiling and deployment.
-
-## Feature Matrix
-
-| Category | Feature | Status |
-|----------|---------|--------|
-| **Backend** | Local (bare metal) | Deprecated |
-| | Kubernetes | Supported |
-| **LLM Framework** | vLLM | Supported |
-| | TensorRT-LLM | Supported |
-| | SGLang | Supported |
-| **Serving Type** | Aggregated | Unsupported |
-| | Disaggregated | Supported |
-| **Scaling Mode** | SLA-based (TTFT/ITL targets) | Supported (primary) |
-| | Load-based (KV cache/queue thresholds) | Deprecated |
-| **Load Predictors** | ARIMA | Supported |
-| | Prophet | Supported |
-| | Kalman filter | Supported |
-| | Constant (current = next) | Supported |
-| **Connectors** | KubernetesConnector (native DGD scaling) | Supported |
-| | VirtualConnector (external environments) | Supported |
-
-## Quick Start
-
-### Prerequisites
-
-- Dynamo platform installed on Kubernetes ([Installation Guide](/docs/kubernetes/installation_guide.md))
-- kube-prometheus-stack installed ([Metrics Setup](/docs/kubernetes/observability/metrics.md))
-- Pre-deployment profiling completed ([Profiling Guide](/docs/components/profiler/profiler_guide.md))
-
-### Deploy with DGDR (Recommended)
-
-The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest:
-
-```bash
-kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
-```
-
-This automatically profiles your model and deploys with the SLA planner. See [SLA Planner Guide](planner_guide.md) for the full workflow.
-
-### Deploy with DGD (Manual)
-
-For manual control, use the disaggregated planner templates:
-
-```bash
-# After profiling is complete
-kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
-```
-
-## Documentation
-
-| Document | Description |
-|----------|-------------|
-| [Planner Guide](planner_guide.md) | Deployment, configuration, integration, troubleshooting |
-| [Planner Examples](planner_examples.md) | DGDR YAML examples, sample configurations, advanced patterns |
-| [SLA Planner Guide](planner_guide.md) | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
-| [SLA-based Planner](planner_guide.md) | Scaling algorithm, correction factors, load prediction details |
-| [Load-based Planner](README.md) | Legacy load-based scaling (deprecated) |
-| [SLA-Driven Profiling](/docs/components/profiler/profiler_guide.md) | Pre-deployment profiling process and configuration |
-| [Planner Design](/docs/design_docs/planner_design.md) | Architecture deep-dive for contributors |
-
-## Configuration Reference
-
-### Key Arguments
-
-| Argument | Default | Description |
-|----------|---------|-------------|
-| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
-| `--backend` | `vllm` | Backend framework (`vllm`, `sglang`, `trtllm`) |
-| `--environment` | `kubernetes` | Deployment environment |
-| `--adjustment-interval` | `180` | Seconds between scaling decisions |
-| `--ttft` | `500.0` | Target Time To First Token (ms) |
-| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
-| `--isl` | `3000` | Expected average input sequence length |
-| `--osl` | `150` | Expected average output sequence length |
-| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
-| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
-| `--min-endpoint` | `1` | Minimum replicas per worker type |
-| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
-| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
-| `--no-operation` | `false` | Observation mode (no actual scaling) |
-| `--no-correction` | `false` | Disable correction factors |
-| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
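The GPU-related arguments interact: replica counts multiplied by GPUs per engine must stay within `--max-gpu-budget`, and each worker type keeps at least `--min-endpoint` replicas. The sketch below illustrates that arithmetic as inferred from the argument descriptions. The function name is hypothetical, and the planner's actual enforcement logic is not shown in this document.

```python
def fits_gpu_budget(prefill_replicas: int, decode_replicas: int,
                    prefill_engine_num_gpu: int = 1,
                    decode_engine_num_gpu: int = 1,
                    max_gpu_budget: int = 8,
                    min_endpoint: int = 1) -> bool:
    """Check a candidate replica plan against the documented defaults."""
    if prefill_replicas < min_endpoint or decode_replicas < min_endpoint:
        return False  # each worker type needs at least min_endpoint replicas
    total_gpus = (prefill_replicas * prefill_engine_num_gpu
                  + decode_replicas * decode_engine_num_gpu)
    return total_gpus <= max_gpu_budget


print(fits_gpu_budget(2, 3))  # 5 GPUs within the default budget of 8
print(fits_gpu_budget(4, 5))  # 9 GPUs exceeds the default budget of 8
```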
-
-### Environment Variables
-
-| Variable | Default | Description |
-|----------|---------|-------------|
-| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
-| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD K8s resource name |
-| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
-| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for planner's own Prometheus metrics |
-
-## Monitoring
-
-### Grafana Dashboard
-
-Deploy the planner dashboard:
-
-```bash
-kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
-```
-
-The dashboard shows:
-- Worker counts and GPU usage over time
-- Observed TTFT, ITL, request rate, sequence lengths
-- Predicted load and recommended replica counts
-- Correction factors (actual vs. expected performance)
-
-### Prometheus Metrics
-
-The planner queries the frontend's `/metrics` endpoint via Prometheus. Required metrics:
-- Request count and duration
-- TTFT and ITL distributions
-- Input/output sequence lengths
-
-```{toctree}
-:hidden:
-
-planner_guide
-planner_examples
-```
--- a/docs/components/planner/planner_examples.md
+++ b/docs/components/planner/planner_examples.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
-All rights reserved.
-SPDX-License-Identifier: Apache-2.0
--->
-
-# Planner Examples
-
-Practical examples for deploying the SLA Planner with different configurations. For deployment concepts, see the [Planner Guide](planner_guide.md). For a quick overview, see the [Planner README](README.md).
-
-## Basic Examples
-
-### Minimal DGDR with AIC (Fastest)
-
-The simplest way to deploy with the SLA planner. Uses AI Configurator for offline profiling (20-30 seconds instead of hours):
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: sla-aic
-spec:
-  model: Qwen/Qwen3-32B
-  backend: vllm
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
-    config:
-      sla:
-        isl: 3000
-        osl: 150
-        ttft: 200
-        itl: 20
-      sweep:
-        useAiConfigurator: true
-        aicSystem: h200_sxm
-        aicHfId: Qwen/Qwen3-32B
-        aicBackendVersion: "0.20.0"
-
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
-
-  autoApply: true
-```
-
-Deploy:
-```bash
-export NAMESPACE=your-namespace
-kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
-```
-
-### Online Profiling (Real Measurements)
-
-Standard online profiling runs real GPU measurements for more accurate results. Takes 2-4 hours:
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: sla-online
-spec:
-  model: meta-llama/Llama-3.3-70B-Instruct
-  backend: vllm
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
-    config:
-      sla:
-        isl: 3000
-        osl: 150
-        ttft: 200
-        itl: 20
-      sweep:
-        useAiConfigurator: false
-        prefillInterpolationGranularity: 16
-        decodeInterpolationGranularity: 6
-
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
-
-  autoApply: true
-```
-
-Deploy:
-```bash
-kubectl apply -f benchmarks/profiler/deploy/profile_sla_dgdr.yaml -n $NAMESPACE
-```
-
-Available sample DGDRs in `benchmarks/profiler/deploy/`:
-- **`profile_sla_dgdr.yaml`**: Standard online profiling for dense models
-- **`profile_sla_aic_dgdr.yaml`**: Fast offline profiling using AI Configurator
-- **`profile_sla_moe_dgdr.yaml`**: Online profiling for MoE models (SGLang)
-
-> **Profiling Config Cases**: Before 0.8.1, fields under `profilingConfig.config` used snake_case. Starting with 0.8.1, they use camelCase. snake_case is still accepted for backward compatibility, but the example DGDRs use camelCase.
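If you maintain older DGDRs, the key names can be converted mechanically. The sketch below is a generic snake_case to camelCase converter, not a Dynamo utility. The snake_case spellings in the example are inferred by reversing the camelCase names shown in this document, so treat them as assumptions.

```python
def snake_to_camel(name: str) -> str:
    """Convert one snake_case key to camelCase; camelCase input is unchanged."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)


def camelize_keys(config):
    """Recursively convert dictionary keys, leaving values untouched."""
    if isinstance(config, dict):
        return {snake_to_camel(key): camelize_keys(value) for key, value in config.items()}
    if isinstance(config, list):
        return [camelize_keys(item) for item in config]
    return config


# Assumed pre-0.8.1 spellings, for illustration.
old_sweep = {"sweep": {"use_ai_configurator": True, "aic_hf_id": "Qwen/Qwen3-32B"}}
print(camelize_keys(old_sweep))
```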
-
-## Kubernetes Examples
-
-### MoE Models (SGLang)
-
-For Mixture-of-Experts models like DeepSeek-R1, use SGLang backend:
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: sla-moe
-spec:
-  model: deepseek-ai/DeepSeek-R1
-  backend: sglang
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
-    config:
-      sla:
-        isl: 4000
-        osl: 500
-        ttft: 300
-        itl: 10
-      sweep:
-        useAiConfigurator: false
-
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
-
-  autoApply: true
-```
-
-Deploy:
-```bash
-kubectl apply -f benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml -n $NAMESPACE
-```
-
-### Using Existing DGD Configs (Custom Setups)
-
-Reference an existing DynamoGraphDeployment config via ConfigMap:
-
-**Step 1: Create ConfigMap from your DGD config:**
-
-```bash
-kubectl create configmap deepseek-r1-config \
-  --from-file=disagg.yaml=/path/to/your/disagg.yaml \
-  --namespace $NAMESPACE \
-  --dry-run=client -o yaml | kubectl apply -f -
-```
-
-**Step 2: Reference it in your DGDR:**
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: deepseek-r1
-spec:
-  model: deepseek-ai/DeepSeek-R1
-  backend: sglang
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
-    configMapRef:
-      name: deepseek-r1-config
-      key: disagg.yaml  # Must match the key used in --from-file
-    config:
-      sla:
-        isl: 4000
-        osl: 500
-        ttft: 300
-        itl: 10
-      sweep:
-        useAiConfigurator: true
-        aicSystem: h200_sxm
-        aicHfId: deepseek-ai/DeepSeek-V3
-        aicBackendVersion: "0.20.0"
-
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
-
-  autoApply: true
-```
-
-The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` and `spec.backend` into the final configuration.
-
-### Inline Configuration (Simple Use Cases)
-
-For simple use cases without a custom DGD config, provide profiler configuration directly. The profiler auto-generates a basic DGD configuration:
-
-```yaml
-profilingConfig:
-  config:
-    sla:
-      isl: 8000
-      osl: 200
-      ttft: 200.0
-      itl: 10.0
-
-    hardware:
-      minNumGpusPerEngine: 2
-      maxNumGpusPerEngine: 8
-      gpuType: h200_sxm
-
-    sweep:
-      prefillInterpolationGranularity: 16
-      decodeInterpolationGranularity: 6
-```
-
-### Mocker Deployment (Testing)
-
-Deploy a mocker backend that simulates GPU timing behavior without real GPUs. Useful for:
-- Large-scale experiments without GPU resources
-- Testing planner behavior and infrastructure
-- Validating deployment configurations
-
-```yaml
-spec:
-  model: <model-name>
-  backend: trtllm  # Real backend for profiling
-  useMocker: true  # Deploy mocker instead of real backend
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/dynamo/trtllm-runtime:<image-tag>"
-    config:
-      sla:
-        isl: 3000
-        osl: 150
-        ttft: 200
-        itl: 20
-      sweep:
-        useAiConfigurator: true
-        aicSystem: h100_sxm
-  autoApply: true
-```
-
-Profiling runs against the real backend (via GPUs or AIC). The mocker deployment then uses profiling data to simulate realistic timing.
-
-### Model Cache PVC (0.8.1+)
-
-For large models, use a pre-populated PVC instead of downloading from HuggingFace.
-
-See [SLA-Driven Profiling](/docs/components/profiler/profiler_guide.md) for configuration details.
-
-## Advanced Examples
-
-### Custom Load Predictors
-
-#### Warm-starting with Trace Data
-
-Pre-load predictors with historical request patterns before live traffic:
-
-```yaml
-# In planner arguments
-args:
-  - --load-predictor arima
-  - --load-predictor-warmup-trace /data/trace.jsonl
-  - --load-predictor-log1p
-```
-
-The trace file should be in mooncake-style JSONL format with request-count, ISL, and OSL samples.
-
-#### Kalman Filter Tuning
-
-For workloads with rapid changes, tune the Kalman filter:
-
-```yaml
-args:
-  - --load-predictor kalman
-  - --kalman-q-level 2.0      # Higher = more responsive to level changes
-  - --kalman-q-trend 0.5      # Higher = trend changes faster
-  - --kalman-r 5.0            # Lower = trusts new measurements more
-  - --kalman-min-points 3     # Fewer points before forecasting starts
-  - --load-predictor-log1p    # Often helps with request-rate series
-```
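To build intuition for these knobs, the sketch below implements a textbook local-linear-trend Kalman filter in plain Python. It is not the planner's code: the mapping of `q_level`, `q_trend`, and `r` to `--kalman-q-level`, `--kalman-q-trend`, and `--kalman-r` is an assumption based on the flag names, and the initial covariance is chosen arbitrarily.

```python
def kalman_forecast(series, q_level, q_trend, r):
    """One-step-ahead forecast from a local-linear-trend Kalman filter."""
    level, trend = float(series[0]), 0.0
    p00, p01, p10, p11 = 1.0, 0.0, 0.0, 1.0  # state covariance, assumed identity
    for z in series[1:]:
        # Predict: level advances by trend; P = F P F^T + Q with F = [[1, 1], [0, 1]].
        level += trend
        p00, p01, p10, p11 = (p00 + p01 + p10 + p11 + q_level,
                              p01 + p11,
                              p10 + p11,
                              p11 + q_trend)
        # Update: a smaller r gives a larger gain, so new data is trusted more.
        s = p00 + r
        k0, k1 = p00 / s, p10 / s
        residual = z - level
        level += k0 * residual
        trend += k1 * residual
        # P = (I - K H) P with H = [1, 0].
        p00, p01, p10, p11 = ((1 - k0) * p00,
                              (1 - k0) * p01,
                              p10 - k1 * p00,
                              p11 - k1 * p01)
    return level + trend


step = [10.0] * 5 + [20.0]  # request rate jumps from 10 to 20
print(kalman_forecast(step, q_level=2.0, q_trend=0.5, r=0.01))
print(kalman_forecast(step, q_level=2.0, q_trend=0.5, r=100.0))
```

With a small `r` the forecast jumps toward the new rate right away, while a large `r` keeps it near the old level, which matches the "lower = trusts new measurements more" comment above.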
-
-#### Prophet for Seasonal Workloads
-
-For workloads with daily/weekly patterns:
-
-```yaml
-args:
-  - --load-predictor prophet
-  - --prophet-window-size 100   # Larger window for seasonal detection
-  - --load-predictor-log1p
-```
-
-### Virtual Connector
-
-For non-Kubernetes environments, use the VirtualConnector to communicate scaling decisions:
-
-```python
-from dynamo._core import DistributedRuntime, VirtualConnectorClient
-
-
-async def run_scaling_loop(distributed_runtime: DistributedRuntime, namespace: str) -> None:
-    # Initialize client
-    client = VirtualConnectorClient(distributed_runtime, namespace)
-
-    # Main loop: watch for planner decisions and execute them
-    while True:
-        # Block until the planner makes a new scaling decision
-        await client.wait()
-
-        # Read the decision
-        decision = await client.get()
-        print(f"Scale to: prefill={decision.num_prefill_workers}, "
-              f"decode={decision.num_decode_workers}, "
-              f"id={decision.decision_id}")
-
-        # Execute scaling in your environment; scale_prefill_workers and
-        # scale_decode_workers are placeholders you implement yourself
-        scale_prefill_workers(decision.num_prefill_workers)
-        scale_decode_workers(decision.num_decode_workers)
-
-        # Report completion
-        await client.complete(decision)
-```
-
-See `components/planner/test/test_virtual_connector.py` for a full working example.
-
-### Planner Configuration Passthrough
-
-Pass planner-specific settings through the DGDR:
-
-```yaml
-profilingConfig:
-  config:
-    planner:
-      plannerMinEndpoint: 2
-```
-
-### Review Before Deploy (autoApply: false)
-
-Disable auto-deployment to inspect the generated DGD:
-
-```yaml
-spec:
-  autoApply: false
-```
-
-After profiling completes:
-
-```bash
-# Extract and review generated DGD
-kubectl get dgdr sla-aic -n $NAMESPACE \
-  -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
-
-# Review and modify as needed
-vi my-dgd.yaml
-
-# Deploy manually
-kubectl apply -f my-dgd.yaml -n $NAMESPACE
-```
-
-### Profiling Artifacts with PVC
-
-Save detailed profiling artifacts (plots, logs, raw data) to a PVC:
-
-```yaml
-spec:
-  profilingConfig:
-    outputPVC: "dynamo-pvc"
-    config:
-      sla:
-        isl: 3000
-        osl: 150
-        ttft: 200
-        itl: 20
-```
-
-Setup:
-```bash
-export NAMESPACE=your-namespace
-deploy/utils/setup_benchmarking_resources.sh
-```
-
-Access results:
-```bash
-kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
-kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
-kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
-kubectl delete pod pvc-access-pod -n $NAMESPACE
-```
-
-## Related Documentation
-
-- [Planner README](README.md) -- Overview and quick start
-- [Planner Guide](planner_guide.md) -- Deployment, configuration, integration
-- [Planner Design](/docs/design_docs/planner_design.md) -- Architecture deep-dive
-- [DGDR Configuration Reference](/docs/components/profiler/profiler_guide.md#dgdr-configuration-reference)
-- [SLA-Driven Profiling](/docs/components/profiler/profiler_guide.md)
--- a/docs/components/planner/planner_guide.md
+++ b/docs/components/planner/planner_guide.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
-All rights reserved.
-SPDX-License-Identifier: Apache-2.0
--->
-
-# Planner Guide
-
-Deployment, configuration, and integration guide for the Dynamo SLA Planner. For a quick overview, see the [Planner README](README.md). For architecture internals, see [Planner Design](/docs/design_docs/planner_design.md).
-
-## Deployment
-
-### Prerequisites
-
-Before deploying the planner, ensure:
-
-- **Dynamo platform installed** with the operator running (see [Installation Guide](/docs/kubernetes/installation_guide.md))
-- **[kube-prometheus-stack](/docs/kubernetes/observability/metrics.md) installed and running** (required for SLA planner metric collection)
-- **Image pull secrets configured** if using private registries (typically `nvcr-imagepullsecret` for NVIDIA images)
-- **Sufficient GPU resources** available in your cluster for profiling
-- **Runtime images available** that contain both profiler and runtime components
-
-### Container Images
-
-Each DGDR requires container images for the profiling and deployment process:
-
-**profilingConfig.profilerImage** (Required):
-The container image used for the profiling job. Must contain the profiler code and dependencies for SLA-based profiling.
-
-**deploymentOverrides.workersImage** (Optional):
-The container image used for DGD worker components (frontend, workers, planner). Used for:
- Temporary DGDs created during online profiling (for performance measurements)
- The final DGD deployed after profiling completes
-
-If `workersImage` is omitted, the image from the base config file (e.g., `disagg.yaml`) is used. Public images are available from 0.6.1 onward.
-
-```yaml
-spec:
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"  # Optional
-```
-
-### What is a DynamoGraphDeploymentRequest (DGDR)?
-
-A **DGDR** is a Kubernetes Custom Resource that serves as the primary interface for deploying models with specific performance and resource constraints. It specifies:
-
- **What** model to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
-
-The Dynamo Operator watches for DGDRs and automatically:
-1. Discovers available GPU resources in your cluster
-2. Runs profiling (online or offline) to find optimal configurations
-3. Generates an optimized DynamoGraphDeployment (DGD) configuration
-4. Deploys the DGD to your cluster
-
-**Key Benefits:**
- **Declarative**: Specify what you want, not how to achieve it
- **Automated**: No manual profiling job setup or result processing
- **SLA-Driven**: Ensures deployments meet your performance requirements
- **Integrated**: Works seamlessly with the Dynamo Operator
-
-### DGDR Workflow
-
-The DGDR workflow automates the entire process from SLA specification to deployment:
-
-1. **Define SLAs**: Specify performance requirements (TTFT, ITL) and model information
-2. **Automatic Profiling**: The operator profiles your model to find optimal configurations
-3. **Auto-Deploy**: The system deploys the optimal configuration that meets your SLAs
-
-```mermaid
-flowchart TD
-    A[Create DGDR] --> B[DGDR Controller]
-    B --> C{Profiling Method}
-    C -->|Online| D[Run Profiling Job<br/>2-4 hours]
-    C -->|Offline/AIC| E[AI Configurator<br/>20-30 seconds]
-    D --> F[Generate DGD Config]
-    E --> F
-    F --> G[Auto-Deploy DGD]
-    G --> H[Monitor & Scale]
-
-    style A fill:#e1f5fe
-    style D fill:#fff3e0
-    style E fill:#e8f5e8
-    style G fill:#f3e5f5
-    style H fill:#fff8e1
-```
-
-### Monitoring Progress
-
-Watch DGDR status:
-
-```bash
-# View status
-kubectl get dgdr -n $NAMESPACE
-
-# Detailed status
-kubectl describe dgdr sla-aic -n $NAMESPACE
-
-# Watch profiling job logs
-kubectl logs -f job/profile-sla-aic -n $NAMESPACE
-```
-
-**DGDR Status States:**
- `Pending`: Initial state, preparing to profile
- `Profiling`: Running profiling job (20-30 seconds for AIC, 2-4 hours for online)
- `Deploying`: Generating and applying DGD configuration
- `Ready`: DGD successfully deployed and running
- `Failed`: Error occurred (check events for details)
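
A script that polls the DGDR status needs to know when to stop. The following is a minimal sketch, assuming the five state names listed above are the complete set; `is_terminal` is a hypothetical helper name, not part of Dynamo.

```python
# State names are taken from the "DGDR Status States" list above.
# The helper itself is illustrative and not part of the Dynamo API.
IN_PROGRESS_STATES = {"Pending", "Profiling", "Deploying"}
TERMINAL_STATES = {"Ready", "Failed"}


def is_terminal(state: str) -> bool:
    """Return True when polling can stop, False while work is still in flight."""
    if state in TERMINAL_STATES:
        return True
    if state in IN_PROGRESS_STATES:
        return False
    # Fail loudly on anything unexpected instead of polling forever.
    raise ValueError(f"unknown DGDR state: {state!r}")
```

A polling loop would call `is_terminal` on each observed state and, on `Failed`, fall back to `kubectl describe` to read the events.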
-
-### Relationship to DGD
-
- **DGDR**: High-level "intent" -- what you want deployed
- **DGD**: Low-level "implementation" -- how it's deployed
-
-The DGDR controller generates a DGD that:
- Uses optimal TP configurations from profiling
- Includes the SLA planner for autoscaling
- Has deployment and engine settings tuned for your SLAs
-
-The generated DGD is tracked via labels:
-```yaml
-metadata:
-  labels:
-    dgdr.nvidia.com/name: sla-aic
-    dgdr.nvidia.com/namespace: your-namespace
-```
-
-## Configuration
-
-### DGDR Configuration
-
-#### Required Fields
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `spec.model` | string | Model identifier (e.g., `meta-llama/Llama-3-70b`) |
-| `spec.backend` | enum | Inference backend: `vllm`, `sglang`, or `trtllm` |
-| `spec.profilingConfig.profilerImage` | string | Container image for profiling job |
-| `spec.profilingConfig.config.sla` | object | SLA targets (isl, osl, ttft, itl) |
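
The required fields above can be assembled programmatically. This is a sketch only: `build_dgdr` is a hypothetical helper, and the `apiVersion` and `kind` values are copied from the DGDR examples elsewhere in these docs. The checks cover only what the table states and are not the operator's actual validation.

```python
import json

ALLOWED_BACKENDS = {"vllm", "sglang", "trtllm"}
REQUIRED_SLA_KEYS = {"isl", "osl", "ttft", "itl"}


def build_dgdr(name, model, backend, profiler_image, sla):
    """Build a DGDR manifest (as a dict) containing only the required fields."""
    if backend not in ALLOWED_BACKENDS:
        raise ValueError(f"backend must be one of {sorted(ALLOWED_BACKENDS)}")
    missing = REQUIRED_SLA_KEYS - set(sla)
    if missing:
        raise ValueError(f"sla is missing keys: {sorted(missing)}")
    return {
        "apiVersion": "nvidia.com/v1alpha1",
        "kind": "DynamoGraphDeploymentRequest",
        "metadata": {"name": name},
        "spec": {
            "model": model,
            "backend": backend,
            "profilingConfig": {
                "profilerImage": profiler_image,
                "config": {"sla": dict(sla)},
            },
        },
    }


if __name__ == "__main__":
    manifest = build_dgdr(
        name="sla-aic",
        model="Qwen/Qwen3-32B",
        backend="trtllm",
        profiler_image="nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0",
        sla={"isl": 3000, "osl": 150, "ttft": 200.0, "itl": 20.0},
    )
    print(json.dumps(manifest, indent=2))
```

Because `kubectl apply -f -` accepts JSON as well as YAML, the printed output can be piped to it directly.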
-
-#### Optional Fields
-
-| Field | Type | Description |
-|-------|------|-------------|
-| `spec.deploymentOverrides.workersImage` | string | Container image for DGD workers. If omitted, uses image from base config. |
-| `spec.autoApply` | boolean | Automatically deploy DGD after profiling (default: false) |
-| `spec.useMocker` | boolean | Deploy mocker instead of real backend (default: false) |
-| `spec.deploymentOverrides` | object | Customize metadata and image for auto-created DGD |
-
-#### SLA Configuration
-
-```yaml
-sla:
-  isl: 3000      # Average input sequence length (tokens)
-  osl: 150       # Average output sequence length (tokens)
-  ttft: 200      # Target Time To First Token (milliseconds, float)
-  itl: 20        # Target Inter-Token Latency (milliseconds, float)
-```
-
-**Choosing SLA Values:**
- **ISL/OSL**: Based on your expected traffic patterns
- **TTFT**: First token latency target (lower = more GPUs needed)
- **ITL**: Token generation latency target (lower = more GPUs needed)
- **Trade-offs**: Tighter SLAs require more GPU resources
-
-For comprehensive documentation of all configuration options, see the [DGDR Configuration Reference](/docs/components/profiler/profiler_guide.md#dgdr-configuration-reference).
-
-### Profiling Methods
-
-Choose between **online profiling** (real measurements, 2-4 hours) and **offline profiling** with AI Configurator (estimated, 20-30 seconds):
-
-```yaml
-# Online Profiling (Default)
-sweep:
-  useAiConfigurator: false
-
-# Offline Profiling (AI Configurator)
-sweep:
-  useAiConfigurator: true
-  aicSystem: h200_sxm
-  aicHfId: Qwen/Qwen3-32B
-  aicBackendVersion: "0.20.0"
-```
-
-For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](/docs/components/profiler/profiler_guide.md#profiling-methods).
-
-### Load Predictors
-
-The SLA planner forecasts the number of requests, ISL, and OSL in the next adjustment interval. Four prediction models are supported:
-
-#### Constant Predictor
- **Use case**: Stable workloads with long prediction intervals
- **Behavior**: Assumes next load equals current load
- **Configuration**: `load-predictor: "constant"`
-
-#### ARIMA Predictor
- **Use case**: Time-series data with trends and seasonality
- **Behavior**: Uses auto-ARIMA to fit optimal model parameters
- **Configuration**: `load-predictor: "arima"`
- **Tunable parameters**:
-  - `--load-predictor-log1p`: model `log1p(y)` instead of `y`. If unset, ARIMA first fits in raw space and automatically falls back to `log1p` if the fit collapses to `(0,d,0)`.
-
-#### Kalman Predictor
- **Use case**: Low-latency online forecasting (observe 1 -> predict 1) with smooth adaptation
- **Behavior**: Local linear trend Kalman filter (fast online updates; good default when ARIMA collapses to mean-only)
- **Configuration**: `load-predictor: "kalman"`
- **Tunable parameters**:
-  - `--kalman-q-level`: process noise for level (higher = more responsive)
-  - `--kalman-q-trend`: process noise for trend (higher = trend changes faster)
-  - `--kalman-r`: measurement noise (lower = trusts new measurements more)
-  - `--kalman-min-points`: minimum points before forecasting
-  - `--load-predictor-log1p`: model `log1p(y)` instead of `y` (often helps request-rate/count series)
-
-#### Prophet Predictor
- **Use case**: Complex seasonal patterns and trend changes
- **Behavior**: Facebook's [Prophet](https://facebook.github.io/prophet/) model for time-series forecasting
- **Configuration**: `load-predictor: "prophet"`
- **Tunable parameters**:
-  - `--prophet-window-size`: bounds internal history to control refit cost
-  - `--load-predictor-log1p`: model `log1p(y)` instead of `y`
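
The `log1p` option shared by the predictors above amounts to forecasting in a transformed space and mapping the result back. The sketch below shows only that round trip; `forecast_with_log1p` and the two toy predictors are hypothetical and are not the planner's implementation.

```python
import math


def forecast_with_log1p(history, predict):
    """Apply `predict` to log1p(y) and map the forecast back with expm1."""
    if not history:
        raise ValueError("history must not be empty")
    transformed = [math.log1p(y) for y in history]
    return math.expm1(predict(transformed))


def last_value(xs):
    """Toy constant predictor: the next value equals the most recent one."""
    return xs[-1]


def mean_value(xs):
    """Toy mean predictor, used to show how log space dampens spikes."""
    return sum(xs) / len(xs)


if __name__ == "__main__":
    counts = [0, 99]
    # The constant predictor is unaffected by the transform (up to rounding).
    print(forecast_with_log1p(counts, last_value))
    # The mean in log space is far below the raw mean of 49.5.
    print(forecast_with_log1p(counts, mean_value))
```

The transform is defined at zero, which is why it suits request-count series, and it reduces the influence of short bursts on the forecast.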
-
-#### Warm-starting Load Predictors (Optional)
-
-You can warm-start load predictors with a mooncake-style JSONL trace file:
-
- **CLI argument**: `--load-predictor-warmup-trace <path/to/trace.jsonl>`
- **Effect**: preloads predictors with historical request-count / ISL / OSL samples extracted from the trace
-
-### Planner Scaling Parameters
-
-| Argument | Default | Description |
-|----------|---------|-------------|
-| `--adjustment-interval` | `180` | Seconds between scaling decisions |
-| `--ttft` | `500.0` | Target Time To First Token (ms) |
-| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
-| `--isl` | `3000` | Expected average input sequence length |
-| `--osl` | `150` | Expected average output sequence length |
-| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
-| `--min-endpoint` | `1` | Minimum replicas per worker type |
-| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
-| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
-| `--no-operation` | `false` | Observation mode (no actual scaling) |
-| `--no-correction` | `false` | Disable correction factors |
-
-#### Planner Configuration Passthrough
-
-Add planner-specific settings in the DGDR:
-
-```yaml
-profilingConfig:
-  config:
-    planner:
-      plannerMinEndpoint: 2
-```
-
-## Integration
-
-### Prometheus Setup
-
-The planner queries Prometheus to collect frontend request metrics. The architecture:
-
-```mermaid
-flowchart LR
-  Frontend --"/metrics"--> Prometheus
-  Planner --"query API"--> Prometheus
-  Planner --"scaling decisions"--> Workers
-  Frontend -.->|"requests"| Workers
-```
-
-**Components:**
- **Frontend**: Serves requests and exposes `/metrics`
- **Prometheus**: Scrapes frontend metrics every 5s (configurable in podmonitor manifest)
- **Planner**: Queries Prometheus and adjusts worker scaling every adjustment interval
- **Workers**: Prefill and backend workers handle inference
-
-The planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with request count, ISL, OSL, TTFT, and ITL in the correct format. The Dynamo frontend provides these metrics automatically.
-
-**Prometheus endpoint configuration:**
-
-| Variable | Default |
-|----------|---------|
-| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` |
-
-If you see errors like "Failed to resolve prometheus service", ensure `PROMETHEUS_ENDPOINT` points to your Prometheus service.
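
To check the endpoint by hand, it can help to build the same kind of URL a Prometheus client would request. This sketch uses the standard Prometheus instant-query path `/api/v1/query`; `instant_query_url` is a hypothetical helper, and the exact queries the planner issues are not shown in this guide.

```python
import os
from urllib.parse import urlencode

# Default taken from the table above.
DEFAULT_ENDPOINT = (
    "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
)


def instant_query_url(promql, endpoint=None):
    """Return the instant-query URL for a PromQL expression."""
    base = endpoint or os.environ.get("PROMETHEUS_ENDPOINT", DEFAULT_ENDPOINT)
    return f"{base.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


if __name__ == "__main__":
    # "up" is a built-in Prometheus series, so it works on any server.
    print(instant_query_url("up"))
```

Fetching the printed URL from inside the cluster (for example with `curl`) confirms whether the service name resolves.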
-
-### Virtual Deployment
-
-The SLA planner supports virtual deployment mode for customized environments (e.g., custom orchestrators) through the `VirtualConnector`. This connector enables the planner to communicate scaling decisions without directly managing Kubernetes resources.
-
-The `VirtualConnector` acts as a bridge between the SLA planner and external deployment environments. Instead of PATCHing DGD resources, it writes scaling decisions and waits for the external environment to acknowledge completion.
-
-#### Scaling Decision Flow
-
-1. **Decision Generation**: The planner calculates optimal worker counts
-2. **Change Detection**: Skips scaling if target counts match current counts, logging: `"No scaling needed (prefill=X, decode=Y)"`
-3. **Readiness Check**: Verifies previous scaling operations completed by checking `scaled_decision_id >= decision_id`
-4. **Timeout Handling**: If not acknowledged within 30 minutes (1800 seconds), proceeds with new decisions
-5. **Completion Tracking**: Optionally waits for scaling completion confirmation (blocking mode)
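
Steps 2 through 4 above reduce to a small amount of logic. The following is a sketch under stated assumptions: the function names and arguments are hypothetical, and only the comparison `scaled_decision_id >= decision_id` and the 1800-second timeout come from the list above.

```python
ACK_TIMEOUT_SECONDS = 1800  # 30 minutes, as stated in step 4


def needs_scaling(current, target):
    """Step 2: skip scaling when target counts already match current counts."""
    return current != target


def can_issue_new_decision(decision_id, scaled_decision_id, seconds_since_decision):
    """Steps 3 and 4: proceed once the previous decision is acknowledged or times out."""
    if scaled_decision_id >= decision_id:
        return True  # the external environment has acknowledged the last decision
    return seconds_since_decision >= ACK_TIMEOUT_SECONDS


if __name__ == "__main__":
    current = {"prefill": 1, "decode": 2}
    target = {"prefill": 2, "decode": 2}
    if needs_scaling(current, target) and can_issue_new_decision(5, 5, 12.0):
        print("issue new decision", target)
```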
-
-#### Configuration
-
-To use virtual deployment mode:
-
-```yaml
-environment: "virtual"
-backend: "vllm"  # or "sglang"
-```
-
-#### Deployment Environment Requirements
-
-The external deployment environment must use `VirtualConnectorClient`:
-
-```python
-from dynamo._core import DistributedRuntime, VirtualConnectorClient
-
-client = VirtualConnectorClient(distributed_runtime, namespace)
-```
-
-1. **Monitor Planner**: Continuously watch for scaling decisions: `await client.wait()` (blocks until change)
-2. **Parse Decisions**: Read values: `decision = await client.get()`
-3. **Execute Scaling**: Apply the scaling decisions to your infrastructure
-4. **Acknowledge Completion**: Mark done: `await client.complete(decision)`
-
-A scaling decision (returned by `client.get()`) contains:
- `num_prefill_workers`: Target number of prefill workers (-1 if not set)
- `num_decode_workers`: Target number of decode workers (-1 if not set)
- `decision_id`: Incremental ID for each scaling decision
-
-See `components/planner/test/test_virtual_connector.py` for a full example.
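
The four steps above can be sketched as a single coroutine. To keep the example runnable without a Dynamo runtime, it uses `FakeClient`, a hypothetical stand-in that mimics only the `wait`, `get`, and `complete` calls named above. Exposing the decision fields as attributes is also an assumption.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Decision:
    """Assumed shape of a scaling decision, using the field names listed above."""
    num_prefill_workers: int
    num_decode_workers: int
    decision_id: int


class FakeClient:
    """Hypothetical stand-in for VirtualConnectorClient (not the real class)."""

    def __init__(self, decisions):
        self._pending = list(decisions)
        self.completed = []

    async def wait(self):
        await asyncio.sleep(0)  # the real client blocks until a change occurs

    async def get(self):
        return self._pending.pop(0)

    async def complete(self, decision):
        self.completed.append(decision.decision_id)


async def handle_one(client, apply_scaling):
    """Wait for a decision, apply it, then acknowledge completion."""
    await client.wait()
    decision = await client.get()
    targets = {}
    if decision.num_prefill_workers != -1:  # -1 means "not set"
        targets["prefill"] = decision.num_prefill_workers
    if decision.num_decode_workers != -1:
        targets["decode"] = decision.num_decode_workers
    apply_scaling(targets)  # your orchestrator-specific scaling goes here
    await client.complete(decision)
    return targets


if __name__ == "__main__":
    client = FakeClient([Decision(2, -1, 7)])
    print(asyncio.run(handle_one(client, print)))
```

In a real environment, `handle_one` would run in a loop against the actual client, and `apply_scaling` would call into the custom orchestrator.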
-
-### Grafana Dashboard
-
-Deploy the planner Grafana dashboard:
-
-```bash
-kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
-```
-
-Follow [Dynamo Metrics Collection on Kubernetes](/docs/kubernetes/observability/metrics.md) to access the Grafana UI and select the **Dynamo Planner Dashboard**.
-
-The dashboard displays:
- **Worker Counts & GPU Usage**: Current prefill/decode worker counts and cumulative GPU hours
- **Observed Metrics**: Real-time TTFT, ITL, request rate, and sequence lengths from Prometheus
- **Predicted Metrics**: Planner's load predictions and recommended replica counts
- **Correction Factors**: How the planner adjusts predictions based on observed vs expected performance
-
-> Use the **Namespace** dropdown at the top of the dashboard to filter metrics for your deployment namespace.
-
-## DGDR Immutability
-
-DGDRs are **immutable**. To update SLAs or configuration:
-
-1. Delete the existing DGDR: `kubectl delete dgdr sla-aic -n $NAMESPACE`
-2. Create a new DGDR with updated specifications
-
-## Manual Deployment Control
-
-### Option 1: Use DGDR-Generated Configuration (Recommended)
-
-Disable auto-deployment to review the generated DGD before applying:
-
-```yaml
-spec:
-  autoApply: false
-```
-
-Then manually extract and apply:
-
-```bash
-# Extract generated DGD from DGDR status
-kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' | kubectl apply -n $NAMESPACE -f -
-
-# Or save to file first for review/modification
-kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
-vi my-dgd.yaml
-kubectl apply -f my-dgd.yaml -n $NAMESPACE
-```
-
-### Option 2: Use Standalone Planner Templates (Advanced)
-
-For advanced use cases, use the standalone planner templates in `examples/backends/*/deploy/disagg_planner.yaml`:
-
-```bash
-# After profiling completes, profiling data is stored in ConfigMaps
-kubectl get configmap dgdr-output-<dgdr-name> -n $NAMESPACE -o yaml
-kubectl get configmap planner-profile-data -n $NAMESPACE -o yaml
-
-# Update PROMETHEUS_ENDPOINT in the template, then deploy
-kubectl apply -f examples/backends/<backend>/deploy/disagg_planner.yaml -n $NAMESPACE
-```
-
-## Accessing Profiling Artifacts
-
-By default, profiling jobs save essential data to ConfigMaps. For detailed artifacts, configure the DGDR to use `dynamo-pvc`:
-
-**ConfigMaps (always created):**
- Generated DGD configuration
- Profiling data for Planner (`.json` files)
-
-**PVC (optional):**
- Performance plots (PNGs)
- DGD configuration and logs for each profiled deployment
- AIPerf profiling artifacts
- Raw profiling data (`.npz` files)
- Profiler log
-
-```bash
-# Setup PVC
-deploy/utils/setup_benchmarking_resources.sh
-
-# Access results after profiling
-kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
-kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
-kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
-kubectl delete pod pvc-access-pod -n $NAMESPACE
-```
-
-## Troubleshooting
-
-### Quick Diagnostics
-
-```bash
-# Check DGDR status and events
-kubectl describe dgdr sla-aic -n $NAMESPACE
-
-# Check operator logs
-kubectl logs -n $NAMESPACE -l app.kubernetes.io/name=dynamo-operator --tail=100
-
-# Check profiling job logs
-kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE
-```
-
-### Common Issues
-
-| Issue | Quick Fix |
-|-------|-----------|
-| **DGDR stuck in Pending** | Check GPU availability: `kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'` |
-| **Image pull errors** | Verify secret exists: `kubectl get secret nvcr-imagepullsecret -n $NAMESPACE` |
-| **Profiling fails** | Check job logs: `kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE` |
-| **SLA cannot be met** | Relax TTFT/ITL targets or add more GPUs |
-| **DGD not deployed** | Verify `autoApply: true` in DGDR spec |
-| **Prometheus errors** | Ensure `PROMETHEUS_ENDPOINT` env var points to your Prometheus service |
-
-For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](/docs/components/profiler/profiler_guide.md#troubleshooting).
-
-## Related Documentation
-
- [Planner README](README.md) -- Overview and quick start
- [Planner Examples](planner_examples.md) -- DGDR YAML examples and sample configurations
- [Planner Design](/docs/design_docs/planner_design.md) -- Architecture deep-dive for contributors
- [DGDR API Reference](/docs/kubernetes/api_reference.md)
- [Pre-Deployment Profiling](/docs/components/profiler/profiler_guide.md)
- [Dynamo Operator Guide](/docs/kubernetes/dynamo_operator.md)
--- a/docs/components/profiler/README.md
+++ b/docs/components/profiler/README.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-
-# Profiler
-
-The Dynamo Profiler is an automated performance analysis tool that measures model inference characteristics to optimize deployment configurations. It determines optimal tensor parallelism (TP) settings for prefill and decode phases, generates performance interpolation data, and enables SLA-driven autoscaling through the Planner.
-
-## Feature Matrix
-
-| Feature | vLLM | SGLang | TensorRT-LLM |
-|---------|------|--------|--------------|
-| Dense Model Profiling | ✅ | ✅ | ✅ |
-| MoE Model Profiling | 🚧 | ✅ | 🚧 |
-| AI Configurator (Offline) | ❌ | ❌ | ✅ |
-| Online Profiling (AIPerf) | ✅ | ✅ | ✅ |
-| Interactive WebUI | ✅ | ✅ | ✅ |
-| Runtime Profiling Endpoints | ❌ | ✅ | ❌ |
-
-## Quick Start
-
-### Prerequisites
-
- Dynamo platform installed (see [Installation Guide](/docs/kubernetes/installation_guide.md))
- Kubernetes cluster with GPU nodes (for DGDR-based profiling)
- kube-prometheus-stack installed (required for SLA planner)
-
-### Using DynamoGraphDeploymentRequest (Recommended)
-
-The recommended way to profile models is through DGDRs, which automate the entire profiling and deployment workflow.
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: my-model-profiling
-spec:
-  model: "Qwen/Qwen3-0.6B"
-  backend: vllm
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
-    config:
-      sla:
-        isl: 3000      # Average input sequence length
-        osl: 150       # Average output sequence length
-        ttft: 200.0    # Target Time To First Token (ms)
-        itl: 20.0      # Target Inter-Token Latency (ms)
-
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
-
-  autoApply: true
-```
-
-```bash
-kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE
-```
-
-### Using AI Configurator (Fast Offline Profiling)
-
-For TensorRT-LLM, use AI Configurator for rapid profiling (~30 seconds):
-
-```yaml
-profilingConfig:
-  config:
-    sweep:
-      useAiConfigurator: true
-      aicSystem: h200_sxm
-      aicHfId: Qwen/Qwen3-32B
-      aicBackendVersion: "0.20.0"
-```
-
-### Direct Script Usage (Advanced)
-
-For advanced scenarios, run the profiler directly:
-
-```bash
-python -m benchmarks.profiler.profile_sla \
-  --backend vllm \
-  --config path/to/disagg.yaml \
-  --model meta-llama/Llama-3-8B \
-  --ttft 200 --itl 15 \
-  --isl 3000 --osl 150
-```
-
-## Configuration
-
-| Parameter | Default | Description |
-|-----------|---------|-------------|
-| `sla.isl` | - | Average input sequence length (tokens) |
-| `sla.osl` | - | Average output sequence length (tokens) |
-| `sla.ttft` | - | Target Time To First Token (milliseconds) |
-| `sla.itl` | - | Target Inter-Token Latency (milliseconds) |
-| `sweep.useAiConfigurator` | `false` | Use offline simulation instead of real profiling |
-| `hardware.minNumGpusPerEngine` | auto | Minimum GPUs per engine (auto-detected from model size) |
-| `hardware.maxNumGpusPerEngine` | 8 | Maximum GPUs per engine |
-
-## Profiling Methods
-
-| Method | Duration | Accuracy | GPU Required | Backends |
-|--------|----------|----------|--------------|----------|
-| Online (AIPerf) | 2-4 hours | Highest | Yes | All |
-| Offline (AI Configurator) | 20-30 seconds | Estimated | No | TensorRT-LLM |
-
-## Output
-
-The profiler generates:
-
-1. **Optimal Configuration**: Recommended TP sizes for prefill and decode engines
-2. **Performance Data**: Interpolation models for the SLA Planner
-3. **Generated DGD**: Complete deployment manifest with optimized settings
-
-Example recommendations:
-```text
-Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
-Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
-```
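
If the recommendations need to feed another tool, lines in the format shown above can be parsed with a regular expression. This is a sketch that assumes the log format matches the two example lines exactly; `parse_recommendations` is a hypothetical helper, not part of the profiler.

```python
import re

# Pattern derived from the two example lines above.
RECOMMENDATION = re.compile(
    r"Suggested (?P<phase>prefill|decode) TP:(?P<tp>\d+) "
    r"\((?P<metric>TTFT|ITL) (?P<latency_ms>[\d.]+) ms, "
    r"throughput (?P<throughput>[\d.]+) tokens/s/GPU\)"
)


def parse_recommendations(text):
    """Return {phase: {...}} for every recommendation line found in text."""
    parsed = {}
    for match in RECOMMENDATION.finditer(text):
        parsed[match["phase"]] = {
            "tp": int(match["tp"]),
            "metric": match["metric"],
            "latency_ms": float(match["latency_ms"]),
            "throughput_per_gpu": float(match["throughput"]),
        }
    return parsed


if __name__ == "__main__":
    sample = (
        "Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)\n"
        "Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)\n"
    )
    print(parse_recommendations(sample))
```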
-
-## Next Steps
-
-| Document | Description |
-|----------|-------------|
-| [Profiler Guide](profiler_guide.md) | Configuration, methods, and troubleshooting |
-| [Profiler Examples](profiler_examples.md) | Complete DGDR YAMLs, WebUI, script examples |
-| [SLA Planner Guide](/docs/components/planner/planner_guide.md) | End-to-end deployment workflow |
-| [SLA Planner Architecture](/docs/components/planner/planner_guide.md) | How the Planner uses profiling data |
-
-```{toctree}
-:hidden:
-
-profiler_guide
-profiler_examples
-```
--- a/docs/components/profiler/profiler_examples.md
+++ b/docs/components/profiler/profiler_examples.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-
-# Profiler Examples
-
-Complete examples for profiling with DGDRs, the interactive WebUI, and direct script usage.
-
-## DGDR Examples
-
-### Dense Model: AIPerf on Real Engines
-
-Standard online profiling with real GPU measurements:
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: vllm-dense-online
-spec:
-  model: "Qwen/Qwen3-0.6B"
-  backend: vllm
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
-    config:
-      sla:
-        isl: 3000
-        osl: 150
-        ttft: 200.0
-        itl: 20.0
-
-      hardware:
-        minNumGpusPerEngine: 1
-        maxNumGpusPerEngine: 8
-
-      sweep:
-        useAiConfigurator: false
-
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
-
-  autoApply: true
-```
-
-### Dense Model: AI Configurator Simulation
-
-Fast offline profiling (~30 seconds, TensorRT-LLM only):
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: trtllm-aic-offline
-spec:
-  model: "Qwen/Qwen3-32B"
-  backend: trtllm
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
-    config:
-      sla:
-        isl: 4000
-        osl: 500
-        ttft: 300.0
-        itl: 10.0
-
-      sweep:
-        useAiConfigurator: true
-        aicSystem: h200_sxm  # Also supports h100_sxm, b200_sxm, gb200_sxm, a100_sxm
-        aicHfId: Qwen/Qwen3-32B
-        aicBackendVersion: "0.20.0"
-
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
-
-  autoApply: true
-```
-
-### MoE Model
-
-Multi-node MoE profiling with SGLang:
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: sglang-moe
-spec:
-  model: "deepseek-ai/DeepSeek-R1"
-  backend: sglang
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
-    config:
-      sla:
-        isl: 2048
-        osl: 512
-        ttft: 300.0
-        itl: 25.0
-
-      hardware:
-        numGpusPerNode: 8
-        maxNumGpusPerEngine: 32
-
-      engine:
-        isMoeModel: true
-
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
-
-  autoApply: true
-```
-
-### Using Existing DGD Config (ConfigMap)
-
-Reference a custom DGD configuration via ConfigMap:
-
-```bash
-# Create ConfigMap from your DGD config file
-kubectl create configmap deepseek-r1-config \
-  --from-file=/path/to/your/disagg.yaml \
-  --namespace $NAMESPACE \
-  --dry-run=client -o yaml | kubectl apply -f -
-```
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: deepseek-r1
-spec:
-  model: deepseek-ai/DeepSeek-R1
-  backend: sglang
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
-    configMapRef:
-      name: deepseek-r1-config
-      key: disagg.yaml
-    config:
-      sla:
-        isl: 4000
-        osl: 500
-        ttft: 300
-        itl: 10
-      sweep:
-        useAiConfigurator: true
-        aicSystem: h200_sxm
-        aicHfId: deepseek-ai/DeepSeek-V3
-        aicBackendVersion: "0.20.0"
-
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
-
-  autoApply: true
-```
-
-## Interactive WebUI
-
-Launch an interactive configuration selection interface:
-
-```bash
-python -m benchmarks.profiler.profile_sla \
-  --backend trtllm \
-  --config path/to/disagg.yaml \
-  --pick-with-webui \
-  --use-ai-configurator \
-  --model Qwen/Qwen3-32B-FP8 \
-  --aic-system h200_sxm \
-  --ttft 200 --itl 15
-```
-
-The WebUI launches on port 8000 by default (configurable with `--webui-port`).
-
-### Features
-
- **Interactive Charts**: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables
- **Pareto-Optimal Analysis**: The GPU Hours table shows pareto-optimal configurations balancing latency and throughput
- **DGD Config Preview**: Click "Show Config" on any row to view the corresponding DynamoGraphDeployment YAML
- **GPU Cost Estimation**: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests)
- **SLA Visualization**: Red dashed lines indicate your TTFT and ITL targets
-
-### Selection Methods
-
-1. **GPU Hours Table** (recommended): Click any row to select both prefill and decode configurations at once based on the pareto-optimal combination
-2. **Individual Selection**: Click one row in the Prefill table AND one row in the Decode table to manually choose each
-
-### Example DGD Config Output
-
-When you click "Show Config", you see a DynamoGraphDeployment configuration:
-
-```yaml
-# DynamoGraphDeployment Configuration
-# Prefill: 1 GPU(s), TP=1
-# Decode: 4 GPU(s), TP=4
-# Model: Qwen/Qwen3-32B-FP8
-# Backend: trtllm
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-spec:
-  services:
-    PrefillWorker:
-      subComponentType: prefill
-      replicas: 1
-      extraPodSpec:
-        mainContainer:
-          args:
-          - --tensor-parallel-size=1
-    DecodeWorker:
-      subComponentType: decode
-      replicas: 1
-      extraPodSpec:
-        mainContainer:
-          args:
-          - --tensor-parallel-size=4
-```
-
-Once you select a configuration, the full DGD CRD is saved as `config_with_planner.yaml`.
-
-## Direct Script Examples
-
-### Basic Profiling
-
-```bash
-python -m benchmarks.profiler.profile_sla \
-  --backend vllm \
-  --config path/to/disagg.yaml \
-  --model meta-llama/Llama-3-8B \
-  --ttft 200 --itl 15 \
-  --isl 3000 --osl 150
-```
-
-### With GPU Constraints
-
-```bash
-python -m benchmarks.profiler.profile_sla \
-  --backend sglang \
-  --config examples/backends/sglang/deploy/disagg.yaml \
-  --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
-  --ttft 200 --itl 15 \
-  --isl 3000 --osl 150 \
-  --min-num-gpus 2 \
-  --max-num-gpus 8
-```
-
-### AI Configurator (Offline)
-
-```bash
-python -m benchmarks.profiler.profile_sla \
-  --backend trtllm \
-  --config path/to/disagg.yaml \
-  --use-ai-configurator \
-  --model Qwen/Qwen3-32B-FP8 \
-  --aic-system h200_sxm \
-  --ttft 200 --itl 15 \
-  --isl 4000 --osl 500
-```
-
-## SGLang Runtime Profiling
-
-Profile SGLang workers at runtime via HTTP endpoints:
-
-```bash
-# Start profiling
-curl -X POST http://localhost:9090/engine/start_profile \
-  -H "Content-Type: application/json" \
-  -d '{"output_dir": "/tmp/profiler_output"}'
-
-# Run inference requests to generate profiling data...
-
-# Stop profiling
-curl -X POST http://localhost:9090/engine/stop_profile
-```
-
-A test script is provided at `examples/backends/sglang/test_sglang_profile.py`:
-
-```bash
-python examples/backends/sglang/test_sglang_profile.py
-```
-
-View traces using Chrome's `chrome://tracing`, [Perfetto UI](https://ui.perfetto.dev/), or TensorBoard.
--- a/docs/components/profiler/profiler_guide.md
+++ b/docs/components/profiler/profiler_guide.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-
-# Profiler Guide
-
-This guide covers deployment, configuration, integration, and troubleshooting for the Dynamo Profiler.
-
-## What is a DynamoGraphDeploymentRequest (DGDR)?
-
-A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that serves as the primary interface for users to request model deployments with specific performance and resource constraints. You specify:
-
- **What** model you want to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
-
-The Dynamo Operator watches for DGDRs and automatically:
-1. Discovers available GPU resources in your cluster
-2. Runs profiling (online or offline) to find optimal configurations
-3. Generates an optimized DynamoGraphDeployment (DGD) configuration
-4. Deploys the DGD to your cluster
-
-**Relationship to DGD:**
- **DGDR**: High-level "intent" - what you want deployed
- **DGD**: Low-level "implementation" - how it's deployed
-
-## Support Matrix
-
-| Backend | Dense Models | MoE Models |
-|---------|-------------|------------|
-| vLLM | ✅ | 🚧 |
-| SGLang | ✅ | ✅ |
-| TensorRT-LLM | ✅ | 🚧 |
-
-The profiler sweeps over the following parallelization mappings for prefill and decode:
-
-| Model Architecture | Prefill Parallelization Mapping | Decode Parallelization Mapping |
-|---------|-------------|------------|
-| MLA+MoE (DeepseekV3ForCausalLM, DeepseekV32ForCausalLM) | TEP, DEP | TEP, DEP |
-| GQA+MoE (Qwen3MoeForCausalLM) | TP, TEP, DEP | TP, TEP, DEP |
-| Other Models | TP | TP |
-
-> [!NOTE]
-> Support for a given combination of model and parallelization mapping depends on the backend. The profiler does not guarantee that the backend supports the recommended P/D engine configuration or runs it without bugs.
-
-## Deployment
-
-### Kubernetes Deployment (DGDR)
-
-The recommended deployment method is through DGDRs. Sample configurations are provided in `benchmarks/profiler/deploy/`:
-
-| Sample | Description |
-|--------|-------------|
-| `profile_sla_dgdr.yaml` | Standard online profiling with AIPerf |
-| `profile_sla_aic_dgdr.yaml` | Fast offline profiling with AI Configurator |
-| `profile_sla_moe_dgdr.yaml` | MoE model profiling (SGLang) |
-
-#### Container Images
-
-Each DGDR requires container images for profiling and deployment:
-
- **`profilingConfig.profilerImage`** (Required): Container image for the profiling job. Must contain the profiler code and dependencies.
- **`deploymentOverrides.workersImage`** (Optional): Container image for DGD worker components (frontend, workers, planner). If omitted, uses image from the base config file.
-
-```yaml
-spec:
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
-```
-
-#### Quick Start: Deploy with DGDR
-
-**Step 1: Create Your DGDR**
-
-Use a sample configuration or create your own:
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: my-model-profiling
-spec:
-  model: "Qwen/Qwen3-0.6B"
-  backend: vllm
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
-    config:
-      sla:
-        isl: 3000
-        osl: 150
-        ttft: 200.0
-        itl: 20.0
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
-  autoApply: true
-```
-
-**Step 2: Apply the DGDR**
-
-```bash
-export NAMESPACE=your-namespace
-kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE
-```
-
-**Step 3: Monitor Progress**
-
-```bash
-# View status
-kubectl get dgdr -n $NAMESPACE
-
-# Detailed status
-kubectl describe dgdr my-model-profiling -n $NAMESPACE
-
-# Watch profiling job logs
-kubectl logs -f job/profile-my-model-profiling -n $NAMESPACE
-```
-
-**DGDR Status States:**
- `Pending`: Initial state, preparing to profile
- `Profiling`: Running profiling job (20-30 seconds for AIC, 2-4 hours for online)
- `Deploying`: Generating and applying DGD configuration
- `Ready`: DGD successfully deployed and running
- `Failed`: Error occurred (check events for details)
-
-**Step 4: Access Your Deployment**
-
-```bash
-# Find the frontend service
-kubectl get svc -n $NAMESPACE | grep frontend
-
-# Port-forward to access locally
-kubectl port-forward svc/<deployment>-frontend 8000:8000 -n $NAMESPACE
-
-# Test the endpoint
-curl http://localhost:8000/v1/models
-```
-
-> [!NOTE]
-> DGDRs are **immutable**. To update SLAs or configuration, delete the existing DGDR and create a new one.
-
-### Direct Script Execution
-
-For advanced use cases or local development:
-
-```bash
-python -m benchmarks.profiler.profile_sla \
-  --backend vllm \
-  --config path/to/disagg.yaml \
-  --model meta-llama/Llama-3-8B \
-  --ttft 200 --itl 15 \
-  --isl 3000 --osl 150 \
-  --min-num-gpus 1 \
-  --max-num-gpus 8
-```
-
-## Profiling Method
-
-The profiler follows a 5-step process:
-
-1. **Hardware Setup**: Uses defaults or user-specified hardware configuration. Optionally, cluster-scoped operators can enable automatic GPU discovery to detect specifications from cluster nodes.
-2. **Identify Sweep Ranges**: Automatically determine minimum and maximum number of GPUs per engine. Minimum is determined by the model size and GPU VRAM. Maximum is set to one node for dense models and 4 nodes for MoE models.
-3. **Parallelization Mapping Sweep**: Test performance of engines with different parallelization mappings using the input ISL and OSL.
-   - For dense models, test different TP sizes for both prefill and decode.
-   - For MoE models (SGLang), evaluate both TEP and DEP as candidates for prefill and decode.
-   - **Prefill**:
-     - TP/TEP: Measure TTFT with batch size = 1 (assuming ISL is long enough to saturate compute) without KV reuse.
-     - DEP: Attention uses data parallelism. Send a single burst with total concurrency `attention_dp_size × attn_dp_num_req_ratio` (defaults to 4) and compute the reported TTFT as `time_to_first_token.max / attn_dp_num_req_ratio` from the AIPerf summary of that burst.
-   ![Prefill Performance](../../images/h100_prefill_performance.png)
-   - **Decode**: Measure the ITL under different numbers of in-flight requests, from 1 to the maximum the KV cache can hold. To measure ITL without being affected by piggy-backed prefill requests, the script enables KV-reuse and warms up the engine by issuing the same prompts before measuring.
-   ![Decode Performance](../../images/h100_decode_performance.png)
-4. **Recommendation**: Select optimal parallelization mapping for prefill and decode that achieves the highest per-GPU throughput while adhering to the SLA on TTFT and ITL.
-5. **In-Depth Profiling on the Recommended P/D Engine**: Interpolate TTFT with ISL and ITL with active KV cache and decode context length for more accurate performance estimation.
-![ITL Interpolation](../../images/pd_interpolation.png)
-   - **Prefill**: Measures TTFT and throughput per GPU across different input lengths with batch size=1.
-   - **Decode**: Measures ITL and throughput per GPU under various KV cache loads and decode context lengths.
-
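The recommendation step (step 4) can be sketched as a filter followed by an argmax. This is a minimal illustration, not the profiler's actual code: the `Candidate` record, the `pick_best` helper, and all numbers are hypothetical and only mirror the rule stated above (highest per-GPU throughput among configurations that meet the latency SLA).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """One profiled engine configuration (hypothetical record layout)."""
    mapping: str         # e.g. "TP4" or "DEP8"
    latency_ms: float    # measured TTFT (prefill) or ITL (decode)
    thpt_per_gpu: float  # measured tokens/s/GPU


def pick_best(candidates: List[Candidate], sla_ms: float) -> Optional[Candidate]:
    """Return the SLA-compliant candidate with the highest per-GPU throughput."""
    feasible = [c for c in candidates if c.latency_ms <= sla_ms]
    if not feasible:
        return None  # no configuration meets the SLA target
    return max(feasible, key=lambda c: c.thpt_per_gpu)


# Illustrative prefill measurements (made-up values, not real benchmark data).
prefill = [
    Candidate("TP1", latency_ms=410.0, thpt_per_gpu=7300.0),
    Candidate("TP2", latency_ms=190.0, thpt_per_gpu=7900.0),
    Candidate("TP4", latency_ms=105.0, thpt_per_gpu=7100.0),
]

best = pick_best(prefill, sla_ms=200.0)
print(best.mapping if best else "no configuration meets the SLA")
```

The same helper would be applied twice, once to prefill candidates with the TTFT target and once to decode candidates with the ITL target.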
-### AIPerf on Real Engines
-
-Profiles your model by creating real test deployments in Kubernetes and measuring their performance.
-
- **Duration**: 2-4 hours
- **Accuracy**: Highest (real measurements)
- **GPU Requirements**: Full access to test different parallelization mappings
- **Backends**: vLLM, SGLang, TensorRT-LLM
-
-```yaml
-profilingConfig:
-  config:
-    sweep:
-      useAiConfigurator: false  # Default
-```
-
-### AI Configurator Simulation
-
-Uses performance simulation to rapidly estimate optimal configurations without running real deployments.
-
- **Duration**: 20-30 seconds
- **Accuracy**: Estimated (may have errors for unusual configurations)
- **GPU Requirements**: None
- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon)
-
-```yaml
-profilingConfig:
-  config:
-    sweep:
-      useAiConfigurator: true
-      aicSystem: h200_sxm
-      aicHfId: Qwen/Qwen3-32B
-      aicBackendVersion: "0.20.0"      # TRT-LLM version simulated by AIC
-```
-
-> [!NOTE]
-> `aicBackendVersion` specifies the TensorRT-LLM version that AI Configurator simulates. See the [AI Configurator supported features](https://github.com/ai-dynamo/aiconfigurator#supported-features) for available versions.
-
-**Currently supports:**
- **Backends**: TensorRT-LLM (versions 0.20.0, 1.0.0rc3, 1.0.0rc6)
- **Systems**: H100 SXM, H200 SXM, B200 SXM, GB200 SXM, A100 SXM
- **Models**: Wide range including GPT, Llama, Mixtral, DeepSeek, Qwen, and more
-
-See [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features) for the full list.
-
-### Automatic GPU Discovery
-
-Cluster-scoped operators can optionally enable automatic GPU discovery:
-
-```yaml
-spec:
-  enableGpuDiscovery: true
-```
-
-This is only available with cluster-scoped operators (`namespaceRestriction.enabled=false`) as it requires cluster-wide node access permissions.
-
-## Configuration
-
-### DGDR Configuration Structure
-
-All profiler configuration goes under `spec.profilingConfig.config`:
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: my-deployment
-spec:
-  model: "Qwen/Qwen3-0.6B"
-  backend: vllm
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
-    configMapRef:                  # Optional: base DGD config
-      name: my-config
-      key: disagg.yaml
-
-    config:
-      sla: { ... }
-      hardware: { ... }
-      sweep: { ... }
-      planner: { ... }
-
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
-```
-
-### SLA Configuration (Required)
-
-```yaml
-sla:
-  isl: 3000      # Average input sequence length (tokens)
-  osl: 150       # Average output sequence length (tokens)
-  ttft: 200.0    # Target Time To First Token (milliseconds)
-  itl: 20.0      # Target Inter-Token Latency (milliseconds)
-```
-
- **ISL/OSL**: Based on your expected traffic patterns
- **TTFT**: First token latency target (lower = more GPUs needed, affects prefill engine)
- **ITL**: Token generation latency target (lower = more GPUs needed, affects decode engine)
- **Trade-offs**: Tighter SLAs require more GPU resources
-
-### Hardware Configuration (Optional)
-
-```yaml
-hardware:
-  minNumGpusPerEngine: 2      # Auto-determined from model size and VRAM if not provided
-  maxNumGpusPerEngine: 8      # Maximum GPUs to test
-  numGpusPerNode: 8           # GPUs per node (for multi-node MoE)
-  gpuType: h200_sxm           # GPU type hint (informational, auto-detected)
-```
-
- **minNumGpusPerEngine**: Skip small TP sizes if your model is large
- **maxNumGpusPerEngine**: Limit search space or work around constraints (e.g., [AIC attention heads](#ai-configurator-attention-head-constraint-error))
- **numGpusPerNode**: Determine the upper bound of GPUs per node for dense models and configure Grove for multi-node MoE engines
- **gpuType**: Informational only, auto-detected by the controller. For AI Configurator, use `aicSystem` in the [sweep configuration](#ai-configurator-configuration) instead
-
-> [!TIP]
-> If you don't specify hardware constraints, the controller auto-detects them based on your model size and available cluster resources.
-
-### Sweep Configuration (Optional)
-
-```yaml
-sweep:
-  useAiConfigurator: false              # Use real profiling (default)
-  prefillInterpolationGranularity: 16   # Samples for prefill TTFT curve
-  decodeInterpolationGranularity: 6     # Samples for decode ITL curve
-```
-
- **useAiConfigurator**: Set to `true` for 20-30 second profiling (TensorRT-LLM only)
- **prefillInterpolationGranularity**: Samples for prefill TTFT curve (lower = faster but less accurate)
- **decodeInterpolationGranularity**: Samples for decode ITL curve. Since ITL interpolation is 3D and takes longer, we default to fewer samples. Increasing this value may quadratically increase profiling time.
-
-### AI Configurator Configuration
-
-Required if `useAiConfigurator: true`:
-
-```yaml
-sweep:
-  useAiConfigurator: true
-  aicSystem: h200_sxm              # h100_sxm, h200_sxm, b200_sxm, gb200_sxm, a100_sxm
-  aicHfId: Qwen/Qwen3-32B         # HuggingFace model ID
-  aicBackendVersion: "0.20.0"      # TensorRT-LLM version
-```
-
-### Planner Configuration (Optional)
-
-Pass arguments to the SLA planner:
-
-```yaml
-planner:
-  planner_min_endpoint: 2                    # Minimum endpoints to maintain
-  planner_adjustment_interval: 60            # Adjustment interval (seconds)
-  planner_load_predictor: linear             # Load prediction method
-```
-
-> [!NOTE]
-> Planner arguments use the `planner_` prefix. See the [SLA Planner documentation](/docs/components/planner/planner_guide.md) for the full list.
-
-### Model Cache PVC (Advanced)
-
-For large models, use a pre-populated PVC containing model weights instead of downloading from HuggingFace:
-
-```yaml
-deployment:
-  modelCache:
-    pvcName: "model-cache"
-    pvcPath: "hub/models--deepseek-ai--DeepSeek-R1"
-    mountPath: "/opt/model-cache"
-```
-
-Requirements:
- The PVC must exist in the same namespace as the DGDR
- The model weights must be accessible at `{mountPath}/{pvcPath}`
-
-### Engine Configuration (Auto-configured)
-
-The controller automatically injects these from high-level fields:
-
-```yaml
-# You specify:
-spec:
-  model: "Qwen/Qwen3-0.6B"
-  backend: vllm
-
-# Controller auto-injects:
-profilingConfig:
-  config:
-    deployment:
-      model: "Qwen/Qwen3-0.6B"
-    engine:
-      backend: vllm
-      config: /path/to/configmap
-```
-
-You should **not** manually set `deployment.model` or `engine.backend` in `profilingConfig.config`.
-
-### Using Existing DGD Configs (ConfigMap)
-
-Reference an existing DGD config via ConfigMap:
-
-```bash
-kubectl create configmap my-config \
-  --from-file=disagg.yaml=/path/to/your/disagg.yaml \
-  --namespace $NAMESPACE \
-  --dry-run=client -o yaml | kubectl apply -f -
-```
-
-```yaml
-profilingConfig:
-  configMapRef:
-    name: my-config
-    key: disagg.yaml
-```
-
-The profiler uses the DGD config as a **base template**, then optimizes it based on your SLA targets.
-
-### CLI Arguments
-
-| Argument | Type | Default | Description |
-|----------|------|---------|-------------|
-| `--backend` | string | - | Inference backend: vllm, sglang, trtllm |
-| `--config` | string | - | Path to DGD YAML config file |
-| `--model` | string | - | HuggingFace model ID |
-| `--ttft` | float | - | Target TTFT in milliseconds |
-| `--itl` | float | - | Target ITL in milliseconds |
-| `--isl` | int | - | Average input sequence length |
-| `--osl` | int | - | Average output sequence length |
-| `--min-num-gpus` | int | auto | Minimum GPUs per engine |
-| `--max-num-gpus` | int | 8 | Maximum GPUs per engine |
-| `--use-ai-configurator` | flag | false | Use offline AI Configurator |
-| `--pick-with-webui` | flag | false | Launch interactive WebUI |
-| `--webui-port` | int | 8000 | Port for WebUI |
-
-> [!NOTE]
-> CLI arguments map to DGDR config fields: `--min-num-gpus` = `hardware.minNumGpusPerEngine`, `--max-num-gpus` = `hardware.maxNumGpusPerEngine`, `--use-ai-configurator` = `sweep.useAiConfigurator`. See [DGDR Configuration Structure](#dgdr-configuration-structure) for all field mappings.
-
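The flag-to-field mapping in the note above can be written down as a small lookup table. The sketch below covers only the three mappings listed in the note; the `cli_to_dgdr_config` helper is hypothetical and is not part of the profiler.

```python
# Mappings taken from the note above: CLI flag -> (config section, field name).
CLI_TO_DGDR = {
    "--min-num-gpus": ("hardware", "minNumGpusPerEngine"),
    "--max-num-gpus": ("hardware", "maxNumGpusPerEngine"),
    "--use-ai-configurator": ("sweep", "useAiConfigurator"),
}


def cli_to_dgdr_config(cli_args: dict) -> dict:
    """Build the nested `profilingConfig.config` fragment for the given flags."""
    config: dict = {}
    for flag, value in cli_args.items():
        if flag not in CLI_TO_DGDR:
            raise KeyError(f"no known DGDR mapping for {flag}")
        section, field = CLI_TO_DGDR[flag]
        config.setdefault(section, {})[field] = value
    return config


fragment = cli_to_dgdr_config(
    {"--min-num-gpus": 2, "--max-num-gpus": 8, "--use-ai-configurator": True}
)
print(fragment)
```

The resulting dictionary has the same shape as the `hardware` and `sweep` sections shown in the YAML examples earlier in this guide.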
-## Integration
-
-### With SLA Planner
-
-The Profiler generates interpolation data that the SLA Planner uses for autoscaling decisions.
-
-**Prefill Interpolation** (`selected_prefill_interpolation/raw_data.npz`):
- `prefill_isl`: 1D array of input sequence lengths tested
- `prefill_ttft`: 1D array of TTFTs (ms) at each ISL
- `prefill_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each ISL
-
-**Decode Interpolation** (`selected_decode_interpolation/raw_data.npz`):
- `max_kv_tokens`: Total KV tokens capacity in decode engine
- `x_kv_usage`: 1D array of active KV usage fractions in the range [0, 1]
- `y_context_length`: 1D array of average context lengths tested
- `z_itl`: 1D array of ITLs (ms) at each (KV usage, context length) point
- `z_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each point
-
-### With Dynamo Operator
-
-When using DGDR, the Dynamo Operator:
-
-1. Creates profiling jobs automatically
-2. Stores profiling data in ConfigMaps (`planner-profile-data`)
-3. Generates optimized DGD configurations
-4. Deploys the DGD with SLA Planner integration
-
-The generated DGD is tracked via labels:
-```yaml
-metadata:
-  labels:
-    dgdr.nvidia.com/name: my-deployment
-    dgdr.nvidia.com/namespace: your-namespace
-```
-
-### With Observability
-
-Monitor profiling jobs:
-
-```bash
-kubectl logs -f job/profile-<dgdr-name> -n $NAMESPACE
-kubectl describe dgdr <name> -n $NAMESPACE
-```
-
-## Advanced Topics
-
-### Manual Deployment Control
-
-Disable auto-deployment to review the generated DGD before applying:
-
-```yaml
-spec:
-  autoApply: false
-```
-
-Then manually extract and apply:
-
-```bash
-# Extract generated DGD from DGDR status
-kubectl get dgdr my-deployment -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' | kubectl apply -f -
-
-# Or save to file for review
-kubectl get dgdr my-deployment -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
-```
-
-### Mocker Deployment
-
-Deploy a mocker deployment that simulates engines without GPUs:
-
-```yaml
-spec:
-  model: <model-name>
-  backend: trtllm
-  useMocker: true    # Deploy mocker instead of real backend
-  autoApply: true
-```
-
-Profiling still runs against the real backend to collect performance data. The mocker uses this data to simulate realistic timing behavior. Useful for large-scale experiments, testing Planner behavior, and validating configurations.
-
-### Accessing Profiling Artifacts
-
-By default, profiling data is stored in ConfigMaps. For detailed artifacts (plots, logs, raw data), attach a PVC:
-
-```yaml
-profilingConfig:
-  outputPVC: "dynamo-pvc"
-```
-
-**ConfigMaps (always created):**
- `dgdr-output-<name>`: Generated DGD configuration
- `planner-profile-data`: Profiling data for Planner (JSON)
-
-**PVC artifacts (optional):**
- Performance plots (PNGs)
- DGD configurations for each profiled deployment
- AIPerf profiling artifacts
- Raw profiling data (`.npz` files)
- Profiler logs
-
-Access PVC results:
-```bash
-kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
-kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
-kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
-kubectl delete pod pvc-access-pod -n $NAMESPACE
-```
-
-### Output Performance Plots
-
-The profiler generates plots to visualize performance data:
-
-**Parallelization Mapping Sweep Plots:**
- `prefill_performance.png`: TTFT vs Parallelization Mapping size
- `decode_performance.png`: ITL vs Parallelization Mapping size and in-flight requests
-
-**In-Depth Profiling Plots:**
- `selected_prefill_interpolation/prefill_ttft_interpolation.png`: TTFT vs ISL
- `selected_prefill_interpolation/prefill_throughput_interpolation.png`: Throughput vs ISL
- `selected_decode_interpolation/decode_itl_interplation.png`: ITL vs KV usage and context length
- `selected_decode_interpolation/decode_throughput_interpolation.png`: Throughput vs KV usage and context length
-
-## Runtime Profiling (SGLang)
-
-SGLang workers expose profiling endpoints for runtime performance analysis:
-
-```bash
-# Start profiling
-curl -X POST http://localhost:9090/engine/start_profile \
-  -H "Content-Type: application/json" \
-  -d '{"output_dir": "/tmp/profiler_output"}'
-
-# Run inference requests...
-
-# Stop profiling
-curl -X POST http://localhost:9090/engine/stop_profile
-```
-
-View traces using Chrome's `chrome://tracing`, [Perfetto UI](https://ui.perfetto.dev/), or TensorBoard.
-
-## Troubleshooting
-
-### Profiling Takes Too Long
-
-**Solution 1**: Use AI Configurator for rapid profiling (TensorRT-LLM only):
-```yaml
-sweep:
-  useAiConfigurator: true
-```
-
-**Solution 2**: Reduce search space:
-```yaml
-hardware:
-  minNumGpusPerEngine: 4  # Skip TP1, TP2
-  maxNumGpusPerEngine: 8  # Don't test beyond TP8
-```
-
-### SLA Cannot Be Met
-
-**Symptoms**: Profiler reports no configuration meets targets
-
-**Solutions:**
-1. Relax SLA targets (increase TTFT/ITL)
-2. Add more GPU resources
-3. Try a different backend
-4. Use a smaller model
-
-### AI Configurator: Attention Head Constraint Error
-
-**Symptoms**: Profiling fails with error:
-```text
-AssertionError: num_heads <N> should be divisible by tp_size <M> and the division result should be >= 4
-```
-
-**Cause**: AI Configurator requires **≥4 attention heads per GPU**. Small models with few heads cannot use high TP sizes.
-
-**Affected Models:**
- **Qwen3-0.6B** (16 heads): Max TP = 4
- **GPT-2** (12 heads): Max TP = 3
- Most models **<1B parameters**: May hit this constraint
-
-**Solution**: Limit `maxNumGpusPerEngine`:
-```yaml
-hardware:
-  maxNumGpusPerEngine: 4  # For Qwen3-0.6B (16 heads / 4 = max TP of 4)
-```
-
-**Calculate Max TP**: `max_tp = num_attention_heads / 4`
-
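As a quick check before setting `maxNumGpusPerEngine`, the constraint from the error message can be evaluated directly. This is a sketch under the two conditions quoted above (the head count must be divisible by `tp_size`, and the quotient must be at least 4); the function name `max_tp_for_aic` is hypothetical.

```python
def max_tp_for_aic(num_attention_heads: int, min_heads_per_gpu: int = 4) -> int:
    """Largest tp_size that divides the head count and keeps
    at least `min_heads_per_gpu` heads on every GPU.

    Returns 0 when no tp_size satisfies the constraint.
    """
    best = 0
    # tp_size can be at most heads // min_heads_per_gpu, otherwise heads/tp < minimum.
    for tp in range(1, num_attention_heads // min_heads_per_gpu + 1):
        if num_attention_heads % tp == 0:
            best = tp
    return best


for heads in (16, 12, 14):
    print(heads, "heads -> max TP", max_tp_for_aic(heads))
```

For head counts that are multiples of 4 the result equals `num_attention_heads / 4`; for other head counts the divisibility requirement can push the limit lower.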
-> [!NOTE]
-> This is an AI Configurator limitation. Online profiling doesn't have this constraint.
-
-### Image Pull Errors
-
-**Symptoms**: `ErrImagePull` or `ImagePullBackOff`
-
-**Solution**: Ensure image pull secrets are configured:
-```bash
-kubectl create secret docker-registry nvcr-imagepullsecret \
-  --docker-server=nvcr.io \
-  --docker-username='$oauthtoken' \
-  --docker-password=<NGC_API_KEY> \
-  --namespace <your-namespace>
-```
-
-### Out of Memory During Profiling
-
-**Symptoms**: OOM errors in profiling jobs
-
-**Solutions:**
-1. Reduce `gpu_memory_utilization` in engine config
-2. Reduce `--max-context-length`
-3. Skip larger TP configurations
-4. Use fewer GPUs per test
-
-### Unsupported Parallelization Mapping in Backend
-
-**Symptoms**: Startup/runtime error in the backend (e.g., prime number of attention heads constraining TP to 1, or backend not supporting different TP sizes for prefill and decode).
-
-**Solutions:**
-1. Ask the backend maintainers to add support, then bump the backend version in Dynamo
-2. Constrain the max and min number of GPUs per engine to the supported range
-
-## See Also
-
- [Profiler Examples](profiler_examples.md) - Complete DGDR YAML examples
- [SLA Planner Guide](/docs/components/planner/planner_guide.md) - End-to-end deployment workflow
- [SLA Planner Architecture](/docs/components/planner/planner_guide.md) - How the Planner uses profiling data
- [DGDR API Reference](/docs/kubernetes/api_reference.md) - DGDR specification
- [Profiler Arguments Reference](/benchmarks/profiler/utils/profiler_argparse.py) - Full CLI reference
--- a/docs/components/router/README.md
+++ b/docs/components/router/README.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-
-# Router
-
-The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
-
-## Quick Start
-
-### Python / CLI Deployment
-
-To launch the Dynamo frontend with the KV Router:
-
-```bash
-python -m dynamo.frontend --router-mode kv --http-port 8000
-```
-
-This command:
- Launches the Dynamo frontend service with KV routing enabled
- Exposes the service on port 8000 (configurable)
- Automatically handles all backend workers registered to the Dynamo endpoint
-
-Backend workers register themselves using the `register_llm` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap.
-
-#### CLI Arguments
-
-| Argument | Default | Description |
-|----------|---------|-------------|
-| `--router-mode kv` | `round_robin` | Enable KV cache-aware routing |
-| `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
-| `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
-| `--kv-events` / `--no-kv-events` | `--kv-events` | Enable/disable real-time KV event tracking |
-| `--kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) |
-
-For all available options: `python -m dynamo.frontend --help`
-
-### Kubernetes Deployment
-
-To enable the KV Router in Kubernetes, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-deployment
-spec:
-  services:
-    Frontend:
-      dynamoNamespace: my-namespace
-      componentType: frontend
-      replicas: 1
-      envs:
-        - name: DYN_ROUTER_MODE
-          value: kv  # Enable KV Smart Router
-```
-
-**Key Points:**
- Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only
- Workers automatically report KV cache events to the router
- No worker-side configuration changes needed
-
-#### Environment Variables
-
-All CLI arguments can be configured via environment variables using the `DYN_` prefix:
-
-| CLI Argument | Environment Variable | Default |
-|--------------|---------------------|---------|
-| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round_robin` |
-| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` |
-| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific |
-| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` |
-| `--kv-overlap-score-weight` | `DYN_KV_OVERLAP_SCORE_WEIGHT` | `1.0` |
-
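The naming pattern in the table above can be expressed as a small helper, which may be handy when translating a CLI invocation into a Kubernetes `envs` list. This is a sketch inferred from the table rows only; `flag_to_env` is a hypothetical helper and not a Dynamo API, and flags outside the table may not follow the same pattern.

```python
from typing import Optional, Tuple


def flag_to_env(flag: str, value: Optional[object] = None) -> Tuple[str, str]:
    """Translate a frontend CLI flag into a (name, value) environment pair."""
    name = flag.lstrip("-")
    if name.startswith("no-"):
        # Negated boolean flag, e.g. --no-kv-events -> DYN_KV_EVENTS=false
        return "DYN_" + name[3:].replace("-", "_").upper(), "false"
    env_name = "DYN_" + name.replace("-", "_").upper()
    # A flag passed without a value is treated as a boolean switch.
    return env_name, "true" if value is None else str(value)


print(flag_to_env("--router-mode", "kv"))
print(flag_to_env("--no-kv-events"))
print(flag_to_env("--kv-overlap-score-weight", 1.5))
```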
-For complete K8s examples and advanced configuration, see [K8s Examples](router_examples.md#k8s-examples).
-
-For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
-
-For more configuration options and tuning guidelines, see the [Router Guide](router_guide.md).
-
-## Prerequisites and Limitations
-
-**Requirements:**
- **Dynamic endpoints only**: Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../../development/backend-guide.md)). Your backend handler then receives pre-tokenized requests with `token_ids` instead of raw text.
- You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead)
-
-**Multimodal Support:**
- **vLLM and TRT-LLM**: Multimodal routing supported for images via multimodal hashes
- **SGLang**: Image routing not yet supported
- **Other modalities** (audio, video, etc.): Not yet supported
-
-**Limitations:**
- Static endpoints not supported—KV router requires dynamic model discovery via etcd to track worker instances and their KV cache states
-
-For basic model registration without KV routing, use `--router-mode round-robin` or `--router-mode random`; both modes work with static and dynamic endpoints.
-
-## Next Steps
-
- **[Router Guide](router_guide.md)**: Deep dive into KV cache routing, configuration, disaggregated serving, and tuning
- **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns
- **[Router Design](../../design_docs/router_design.md)**: Architecture details, algorithms, and event transport modes
-
-```{toctree}
-:hidden:
-
-router_guide
-router_examples
-```
--- a/docs/components/router/router_examples.md
+++ b/docs/components/router/router_examples.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-
-# Router Examples
-
-For quick start instructions, see the [Router README](README.md). This document provides further examples for using the Dynamo Router, including Python API usage, Kubernetes deployments, and custom routing patterns.
-
-## Table of Contents
-
- [Using KvPushRouter Python API](#using-kvpushrouter-python-api)
- [K8s Examples](#k8s-examples)
- [Routing Patterns](#routing-patterns)
- [Custom Routing Example: Minimizing TTFT](#custom-routing-example-minimizing-ttft)
- [KV Event Publishing for Custom Engines](#kv-event-publishing-for-custom-engines)
-
-## Using KvPushRouter Python API
-
-Instead of launching the KV Router via command line, you can create a `KvPushRouter` object directly in Python. This allows per-request routing configuration overrides.
-
-> [!WARNING]
-> **Multiple Routers in Same Process**: If you need to run multiple `KvPushRouter` instances for fault tolerance or load distribution, you must launch them in **separate processes** (e.g., using `python -m dynamo.frontend` with different ports). Creating multiple `KvPushRouter` objects in the same Python process is not supported - they share the same cancellation token from the component's primary lease, so dropping one router will cancel all routers in that process. For in-process routing, use a single `KvPushRouter` instance.
-
-### Methods
-
-The `KvPushRouter` provides the following methods:
-
- **`generate(token_ids, model, ...)`**: Route and execute a request, returning an async stream of responses. Automatically handles worker selection, state tracking, and lifecycle management.
-
- **`best_worker(token_ids, router_config_override=None, request_id=None)`**: Query which worker would be selected for given tokens. Returns `(worker_id, dp_rank, overlap_blocks)`.
-  - Without `request_id`: Query-only, doesn't update router state
-  - With `request_id`: Updates router state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking
-
- **`get_potential_loads(token_ids)`**: Get detailed load information for all workers, including potential prefill tokens and active decode blocks. Returns a list of load dictionaries.
-
- **`mark_prefill_complete(request_id)`**: Signal that a request has completed its prefill phase. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`.
-
- **`free(request_id)`**: Signal that a request has completed and its resources should be released. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`.
-
- **`dump_events()`**: Dump all KV cache events from the router's indexer as a JSON string. Useful for debugging and analysis.
-
-### Setup
-
-First, launch your backend engines:
-```bash
-python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf
-```
-
-### Example Script
-
-```python
-import asyncio
-from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig
-
-async def main():
-    # Get runtime and create endpoint
-    runtime = DistributedRuntime.detached()
-    namespace = runtime.namespace("dynamo")
-    component = namespace.component("backend")
-    endpoint = component.endpoint("generate")
-
-    # Create KV router
-    kv_router_config = KvRouterConfig()
-    router = KvPushRouter(
-        endpoint=endpoint,
-        block_size=16,
-        kv_router_config=kv_router_config
-    )
-
-    # Your input tokens
-    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
-
-    # Generate with per-request routing override
-    stream = await router.generate(
-        token_ids=token_ids,
-        model="meta-llama/Llama-2-7b-hf",
-        stop_conditions={
-            "max_tokens": 20,        # Generate exactly 20 tokens
-            "ignore_eos": True,      # Don't stop at EOS token
-        },
-        sampling_options={
-            "temperature": 0.7,
-            "top_p": 0.9,
-        },
-        router_config_override={
-            "overlap_score_weight": 2.0,    # Prioritize cache hits for this request
-            "router_temperature": 0.5,       # Add routing randomness
-        }
-    )
-
-    # Collect generated tokens
-    generated_tokens = []
-    async for response in stream:
-        if isinstance(response, dict) and "token_ids" in response:
-            generated_tokens.extend(response["token_ids"])
-
-    print(f"Generated {len(generated_tokens)} tokens: {generated_tokens}")
-
-if __name__ == "__main__":
-    asyncio.run(main())
-```
-
-## K8s Examples
-
-For basic Kubernetes deployment with the KV Router, see the [Kubernetes Deployment section](README.md#kubernetes-deployment) in the Quick Start guide.
-
-### Complete K8s Examples
-
- [TRT-LLM aggregated router example](../../examples/backends/trtllm/deploy/agg_router.yaml)
- [vLLM aggregated router example](../../examples/backends/vllm/deploy/agg_router.yaml)
- [SGLang aggregated router example](../../examples/backends/sglang/deploy/agg_router.yaml)
- [Distributed inference tutorial](../../examples/basics/kubernetes/Distributed_Inference/agg_router.yaml)
-
-**For A/B Testing and Advanced K8s Setup:**
-See the comprehensive [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
-
-### Example with Advanced Configuration
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-deployment
-spec:
-  services:
-    Frontend:
-      dynamoNamespace: my-namespace
-      componentType: frontend
-      replicas: 1
-      envs:
-        - name: DYN_ROUTER_MODE
-          value: kv
-        - name: DYN_ROUTER_TEMPERATURE
-          value: "0.5"  # Add some randomness to prevent worker saturation
-        - name: DYN_KV_OVERLAP_SCORE_WEIGHT
-          value: "1.5"  # Prioritize TTFT over ITL
-        - name: DYN_KV_CACHE_BLOCK_SIZE
-          value: "16"
-      extraPodSpec:
-        mainContainer:
-          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
-```
-
-### Alternative: Using Command Args in K8s
-
-You can also pass CLI arguments directly in the container command:
-
-```yaml
-extraPodSpec:
-  mainContainer:
-    image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
-    command:
-      - /bin/sh
-      - -c
-    args:
-      - "python3 -m dynamo.frontend --router-mode kv --router-temperature 0.5 --http-port 8000"
-```
-
-**Recommendation:** Use environment variables for easier configuration management and consistency with Dynamo's K8s patterns.
-
-## Routing Patterns
-
-The `KvPushRouter` supports multiple usage patterns depending on your control requirements:
-
-### 1. Automatic Routing (Recommended)
-Call `generate()` directly and let the router handle everything:
-```python
-stream = await router.generate(token_ids=tokens, model="model-name")
-```
-- **Best for**: Most use cases
-- **Router automatically**: Selects best worker, updates state, routes request, tracks lifecycle
-
-### 2. Manual State Management (Advanced)
-Use `best_worker(request_id=...)` to select and track, then manage the request yourself:
-```python
-worker_id, _dp_rank, overlap = await router.best_worker(tokens, request_id="req-123")
-response = await client.generate(tokens, request_id="req-123")
-# await anext(response)  # Get first token
-await router.mark_prefill_complete("req-123")  # After first token
-# async for _ in response:  # Continue generating
-#     ...
-await router.free("req-123")  # After completion
-```
-- **Best for**: Custom request handling with router state tracking
-- **Requires**: Calling `mark_prefill_complete()` and `free()` at correct lifecycle points
-- **Caution**: Incorrect lifecycle management degrades load balancing accuracy
-
-### 3. Hierarchical Router Probing
-Query without state updates, then route through a chosen router:
-```python
-# Probe multiple routers without updating state
-worker_id_1, _dp_rank_1, overlap_1 = await router_1.best_worker(tokens)  # No request_id
-worker_id_2, _dp_rank_2, overlap_2 = await router_2.best_worker(tokens)
-
-# Pick the router (and the worker it proposed) with the larger overlap
-if overlap_1 > overlap_2:
-    chosen_router, worker_id = router_1, worker_id_1
-else:
-    chosen_router, worker_id = router_2, worker_id_2
-stream = await chosen_router.generate(tokens, model="model-name", worker_id=worker_id)
-```
-- **Best for**: Multi-tier deployments (e.g., Envoy Gateway routing to multiple router groups)
-- **Advantage**: Query multiple routers before committing to one
-
-### 4. Custom Load-Based Routing
-Use `get_potential_loads()` to implement custom routing logic:
-```python
-loads = await router.get_potential_loads(tokens)
-# Apply custom logic (e.g., weighted scoring, constraints)
-best_worker = min(loads, key=lambda x: custom_cost_fn(x))
-stream = await router.generate(tokens, model="model-name", worker_id=best_worker['worker_id'])
-```
-- **Best for**: Custom optimization strategies beyond the built-in cost function
-- **Advantage**: Full control over worker selection logic
-- **See also**: Detailed example below in "Custom Routing Example: Minimizing TTFT"
-
-All patterns support `router_config_override` to adjust routing behavior per-request without recreating the router.
-
-## Custom Routing Example: Minimizing TTFT
-
-Here's an example of using `get_potential_loads()` to implement custom routing that minimizes Time To First Token (TTFT) by selecting the worker with the least prefill work:
-
-```python
-import asyncio
-from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig
-
-async def minimize_ttft_routing():
-    # Setup router
-    runtime = DistributedRuntime.detached()
-    namespace = runtime.namespace("dynamo")
-    component = namespace.component("backend")
-    endpoint = component.endpoint("generate")
-
-    router = KvPushRouter(
-        endpoint=endpoint,
-        block_size=16,
-        kv_router_config=KvRouterConfig()
-    )
-
-    # Your input tokens
-    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
-
-    # Get potential loads for all workers
-    potential_loads = await router.get_potential_loads(token_ids)
-
-    # Find worker with minimum prefill tokens (best for TTFT)
-    best_worker = min(potential_loads, key=lambda x: x['potential_prefill_tokens'])
-
-    print(f"Worker loads: {potential_loads}")
-    print(f"Selected worker {best_worker['worker_id']} with {best_worker['potential_prefill_tokens']} prefill tokens")
-
-    # Route directly to the selected worker
-    stream = await router.generate(
-        token_ids=token_ids,
-        model="meta-llama/Llama-2-7b-hf",
-        worker_id=best_worker['worker_id'],  # Force routing to optimal worker
-        stop_conditions={"max_tokens": 20}
-    )
-
-    # Process response
-    async for response in stream:
-        if isinstance(response, dict) and "token_ids" in response:
-            print(f"Generated tokens: {response['token_ids']}")
-
-if __name__ == "__main__":
-    asyncio.run(minimize_ttft_routing())
-```
-
-This approach gives you complete control over routing decisions, so you can optimize for different metrics based on your requirements. For example:
-
-- **Minimize TTFT**: Select worker with lowest `potential_prefill_tokens`
-- **Maximize cache reuse**: Use `best_worker()` which considers both prefill and decode loads
-- **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together
-
-See [Router Design](../../design_docs/router_design.md) for architecture details and the cost function algorithm.
-
-## KV Event Publishing for Custom Engines
-
-The KV Router relies on real-time events from backend workers to track which KV cache blocks are stored on each worker. When your custom engine allocates or evicts KV cache blocks, it should publish these events so the router can make optimal routing decisions. There are two publishing pathways. Direct NATS publishing (`KvEventPublisher`) sends events straight to NATS and is the simplest approach for custom engines. ZMQ-based publishing is for engines that already emit events over ZMQ (like vLLM): the engine runs a ZMQ publisher, and `ZmqKvEventPublisher` forwards those events to NATS.
-
-### Event Types
-
-The KV cache supports three event types:
-
-| Event Type | Description | When to Publish |
-|------------|-------------|-----------------|
-| `BlockStored` | New blocks added to cache | After KV cache allocation succeeds |
-| `BlockRemoved` | Blocks evicted from cache | When blocks are evicted or freed |
-| `AllBlocksCleared` | All blocks removed | On cache reset or worker restart |
-
-### Event Structure
-
-Each event contains:
-- **`event_id`**: Monotonically increasing identifier per worker
-- **`dp_rank`**: Data parallel rank (0 if DP not enabled)
-- **`data`**: One of `Stored`, `Removed`, or `Cleared`
-
-For `BlockStored` events:
-- **`token_ids`**: List of token IDs for the stored blocks
-- **`block_hashes`**: List of **sequence block hashes** from the engine's block manager. These are cumulative hashes that incorporate all tokens from the start of the sequence up to and including the current block (not just the tokens within that block). This enables prefix matching across requests.
-- **`num_block_tokens`**: Number of tokens per block (should all equal `kv_block_size`)
-- **`parent_hash`**: Hash of the parent block. Required for all blocks except the first block in a sequence (which has no parent).
-- **`lora_id`**: LoRA adapter ID (0 if not using LoRA)
-
-For `BlockRemoved` events:
-- **`block_hashes`**: List of sequence block hashes being evicted
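
To make the relationship between cumulative `block_hashes` and `parent_hash` concrete, here is a minimal sketch of chained block hashing. The hash function (SHA-256 truncated to 64 bits), the seed value, and the helper name `sequence_block_hashes` are assumptions for illustration only; a real engine should publish whatever hashes its own block manager produces.

```python
import hashlib
import struct


def sequence_block_hashes(token_ids: list[int], block_size: int) -> list[int]:
    """Return one cumulative hash per *full* block of token_ids.

    Illustrative only: each hash covers the parent hash plus the block's
    tokens, so identical prefixes always produce identical hash chains.
    """
    hashes: list[int] = []
    parent = 0  # hypothetical seed for the first block, which has no parent
    num_full_blocks = len(token_ids) // block_size
    for i in range(num_full_blocks):
        block = token_ids[i * block_size:(i + 1) * block_size]
        # Pack the parent hash (u64) followed by the block's tokens (u32 each).
        payload = struct.pack(f"<Q{len(block)}I", parent, *block)
        parent = int.from_bytes(hashlib.sha256(payload).digest()[:8], "little")
        hashes.append(parent)
    return hashes


# Same first block, different second block.
a = sequence_block_hashes([1, 2, 3, 4, 5, 6, 7, 8], block_size=4)
b = sequence_block_hashes([1, 2, 3, 4, 9, 9, 9, 9], block_size=4)
# Same second-block tokens as `a`, but a different prefix.
c = sequence_block_hashes([0, 0, 0, 0, 5, 6, 7, 8], block_size=4)
```

In this sketch `a[0] == b[0]` because the first blocks match, while `a[1] != c[1]` even though the second blocks hold the same tokens: the differing prefix changes the chained hash, which is the property prefix matching depends on.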
-
-### Option 1: Direct NATS Publishing (Recommended)
-
-The `KvEventPublisher` class publishes events directly to NATS. This is the simplest approach for custom engines.
-
-```mermaid
-flowchart LR
-    subgraph Engine["Custom Engine"]
-        cache["KV Cache Manager"]
-    end
-
-    subgraph Worker["Dynamo Worker Process"]
-        pub["KvEventPublisher"]
-    end
-
-    subgraph NATS["NATS"]
-        subject["kv-events subject"]
-    end
-
-    subgraph Router["KV Router"]
-        indexer["KvIndexer"]
-    end
-
-    cache -->|"on_blocks_stored()<br/>on_blocks_removed()"| pub
-    pub -->|"publish to NATS"| subject
-    subject --> indexer
-```
-
-**When to use:**
-- Building a custom inference engine from scratch
-- Your engine doesn't have a ZMQ-based event system
-- You want the simplest integration path
-
-#### Basic Setup
-
-```python
-from dynamo.llm import KvEventPublisher
-
-class CustomEnginePublisher:
-    def __init__(self, component, worker_id: int, block_size: int, dp_rank: int = 0):
-        self.block_size = block_size
-        self.event_id = 0
-        self.kv_publisher = KvEventPublisher(
-            component=component,
-            worker_id=worker_id,
-            kv_block_size=block_size,
-            dp_rank=dp_rank,
-            enable_local_indexer=False,
-        )
-
-    def on_blocks_stored(self, token_ids: list[int], block_hashes: list[int],
-                         lora_id: int = 0, parent_hash: int | None = None):
-        """Call after KV cache blocks are allocated."""
-        self.event_id += 1
-        num_block_tokens = [self.block_size] * len(block_hashes)
-        self.kv_publisher.publish_stored(
-            event_id=self.event_id,
-            token_ids=token_ids,
-            num_block_tokens=num_block_tokens,
-            block_hashes=block_hashes,
-            lora_id=lora_id,
-            parent_hash=parent_hash,
-        )
-
-    def on_blocks_removed(self, block_hashes: list[int]):
-        """Call when KV cache blocks are evicted."""
-        self.event_id += 1
-        self.kv_publisher.publish_removed(event_id=self.event_id, block_hashes=block_hashes)
-```
-
-#### Integration with Your Engine
-
-```python
-from dynamo.llm import register_llm
-
-async def main():
-    # Register your engine with Dynamo
-    component, endpoint = await register_llm(
-        model="my-model",
-        generator=my_generate_fn,
-    )
-
-    # Initialize publisher
-    publisher = CustomEnginePublisher(
-        component=component,
-        worker_id=endpoint.connection_id(),
-        block_size=16,  # Match your engine's block size
-    )
-
-    # Hook into your engine's cache events
-    def on_prefill_complete(request_id, token_ids, blocks):
-        block_hashes = [block.hash for block in blocks]
-        publisher.on_blocks_stored(token_ids=token_ids, block_hashes=block_hashes)
-
-    def on_cache_eviction(evicted_blocks):
-        block_hashes = [block.hash for block in evicted_blocks]
-        publisher.on_blocks_removed(block_hashes=block_hashes)
-```
-
-### Option 2: ZMQ-based Publishing
-
-For engines that publish events via ZMQ (like vLLM), this option uses two components that work together:
-
-1. **ZMQ Publisher** (in your engine) - Publishes events to a ZMQ socket
-2. **ZmqKvEventPublisher** (Dynamo binding) - Subscribes to ZMQ and forwards to NATS
-
-```mermaid
-flowchart LR
-    subgraph Engine["Custom Engine / vLLM"]
-        cache["KV Cache Manager"]
-        zmq_pub["ZMQ Publisher<br/>(Pure Python)"]
-    end
-
-    subgraph ZMQ["ZMQ Socket"]
-        socket["tcp://127.0.0.1:5557"]
-    end
-
-    subgraph Worker["Dynamo Worker Process"]
-        zmq_sub["ZmqKvEventPublisher<br/>(Rust bindings)"]
-    end
-
-    subgraph NATS["NATS"]
-        subject["kv-events subject"]
-    end
-
-    subgraph Router["KV Router"]
-        indexer["KvIndexer"]
-    end
-
-    cache --> zmq_pub
-    zmq_pub -->|"PUB"| socket
-    socket -->|"SUB"| zmq_sub
-    zmq_sub --> subject
-    subject --> indexer
-```
-
-**When to use:**
-- Your engine already has a ZMQ-based event system (like vLLM)
-- You're integrating with a consolidator (like KVBM)
-- You want to decouple event publishing from your engine's main loop
-
-#### Part 1: ZMQ Subscriber (Dynamo Bindings)
-
-If your engine already publishes to ZMQ, use `ZmqKvEventPublisher` to subscribe and forward to NATS:
-
-```python
-from dynamo.llm import ZmqKvEventPublisher, ZmqKvEventPublisherConfig
-
-# Configure the ZMQ subscriber
-config = ZmqKvEventPublisherConfig(
-    worker_id=endpoint.connection_id(),
-    kv_block_size=block_size,
-    zmq_endpoint="tcp://127.0.0.1:5557",  # Where your engine publishes
-    zmq_topic="",                          # Subscribe to all topics
-    enable_local_indexer=False,
-)
-
-# Create publisher - it automatically subscribes to ZMQ and forwards to NATS
-kv_publisher = ZmqKvEventPublisher(
-    component=component,
-    config=config,
-)
-```
-
-#### Part 2: ZMQ Publisher (Pure Python)
-
-If your engine needs to publish to ZMQ (e.g., for consolidator integration), implement the ZMQ protocol:
-
-```python
-import zmq
-import msgpack
-import time
-
-class ZmqKvEventPublisher:
-    """Pure Python ZMQ publisher for KV events (vLLM-compatible format)."""
-
-    def __init__(self, zmq_endpoint: str, kv_block_size: int, topic: str = ""):
-        self.kv_block_size = kv_block_size
-        self.topic = topic
-        self.ctx = zmq.Context()
-        self.socket = self.ctx.socket(zmq.PUB)
-        self.socket.bind(zmq_endpoint)
-        self.sequence = 0
-        self.data_parallel_rank = 0
-
-    def _to_signed_i64(self, value: int | None) -> int | None:
-        if value is None:
-            return None
-        return value - 0x10000000000000000 if value > 0x7FFFFFFFFFFFFFFF else value
-
-    def publish_stored(self, event_id: int, token_ids: list[int], num_block_tokens: list[int],
-                       block_hashes: list[int], lora_id: int = 0, parent_hash: int | None = None):
-        event = {
-            "type": "BlockStored",
-            "block_hashes": [self._to_signed_i64(h) for h in block_hashes],
-            "parent_block_hash": self._to_signed_i64(parent_hash),
-            "token_ids": token_ids,
-            "block_size": self.kv_block_size,
-            "lora_id": lora_id if lora_id != 0 else None,
-        }
-        self._publish_event(event)
-
-    def publish_removed(self, event_id: int, block_hashes: list[int]):
-        event = {"type": "BlockRemoved", "block_hashes": [self._to_signed_i64(h) for h in block_hashes]}
-        self._publish_event(event)
-
-    def publish_all_cleared(self):
-        self._publish_event({"type": "AllBlocksCleared"})
-
-    def _publish_event(self, event: dict):
-        batch = [time.time(), [event], self.data_parallel_rank]
-        payload = msgpack.packb(batch, use_bin_type=True)
-        sequence_bytes = self.sequence.to_bytes(8, byteorder="big")
-        self.sequence += 1
-        self.socket.send_multipart([self.topic.encode(), sequence_bytes, payload])
-
-    def shutdown(self):
-        self.socket.close()
-        self.ctx.term()
-```
-
-### ZMQ Wire Format
-
-The ZMQ message format (compatible with vLLM):
-
-| Frame | Description |
-|-------|-------------|
-| 1 | Topic (empty string for all topics) |
-| 2 | Sequence number (8 bytes, big-endian) |
-| 3 | Msgpack payload: `[timestamp, [events], dp_rank]` |
-
-Each event in the payload is a dictionary with `type` field (`BlockStored`, `BlockRemoved`, or `AllBlocksCleared`).
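
As a rough illustration of the receiving side, the sketch below splits a three-frame message and decodes the sequence number using only the standard library. The helper names `split_kv_event_frames` and `to_unsigned_u64` are hypothetical, and the msgpack payload is left as opaque bytes here (it could be decoded with `msgpack.unpackb`).

```python
def split_kv_event_frames(frames: list[bytes]) -> tuple[str, int, bytes]:
    """Split a multipart message into (topic, sequence number, raw payload)."""
    if len(frames) != 3:
        raise ValueError(f"expected 3 frames, got {len(frames)}")
    topic_frame, sequence_frame, payload = frames
    if len(sequence_frame) != 8:
        raise ValueError("sequence frame must be exactly 8 bytes")
    # Frame 2 is an 8-byte big-endian unsigned integer.
    sequence = int.from_bytes(sequence_frame, byteorder="big")
    return topic_frame.decode(), sequence, payload


def to_unsigned_u64(value: int) -> int:
    """Map a signed 64-bit hash back to its unsigned representation."""
    return value & 0xFFFFFFFFFFFFFFFF


# Build a message the same way the publisher does, minus the real payload.
example_frames = [b"", (7).to_bytes(8, byteorder="big"), b"\x93"]
topic, sequence, raw_payload = split_kv_event_frames(example_frames)
```

A subscriber can compare consecutive `sequence` values to detect dropped messages, since the publisher increments the counter once per message.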
-
-### Best Practices
-
-1. **Event IDs must be monotonically increasing** per worker (use a thread-safe counter)
-
-2. **Block size must match** your engine's actual `kv_block_size`
-
-3. **`parent_hash` is required** for all blocks except the first in a sequence - it links blocks to enable prefix matching
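
For the first practice, one way to get monotonically increasing IDs when several threads publish events is a small lock-protected counter. This is a sketch, not part of the Dynamo API; the class name `EventIdCounter` is hypothetical.

```python
import threading


class EventIdCounter:
    """Hands out strictly increasing event IDs, safe to call from many threads."""

    def __init__(self, start: int = 0):
        self._value = start
        self._lock = threading.Lock()

    def next(self) -> int:
        # The lock makes increment-and-read a single atomic step.
        with self._lock:
            self._value += 1
            return self._value


counter = EventIdCounter()
issued: list[int] = []
issued_lock = threading.Lock()


def worker(n: int) -> None:
    for _ in range(n):
        event_id = counter.next()
        with issued_lock:
            issued.append(event_id)


threads = [threading.Thread(target=worker, args=(1000,)) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In the `CustomEnginePublisher` example above, `self.event_id += 1` could be replaced by a call to such a counter if cache callbacks may fire from more than one thread.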
-
-## See Also
-
-- **[Router README](README.md)**: Quick start guide for the KV Router
-- **[Router Guide](router_guide.md)**: Configuration, tuning, and production setup
-- **[Router Design](../../design_docs/router_design.md)**: Architecture details and event transport modes
--- a/docs/components/router/router_guide.md
+++ b/docs/components/router/router_guide.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-
-# Router Guide
-
-## Overview
-
-The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
-This guide helps you get started with using the Dynamo router, with further details on configuration, disaggregated serving setup, and parameter tuning.
-
-## Quick Start
-
-### Python / CLI Deployment
-
-To launch the Dynamo frontend with the KV Router:
-
-```bash
-python -m dynamo.frontend --router-mode kv --http-port 8000
-```
-
-This command:
-- Launches the Dynamo frontend service with KV routing enabled
-- Exposes the service on port 8000 (configurable)
-- Automatically handles all backend workers registered to the Dynamo endpoint
-
-Backend workers register themselves using the `register_llm` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap.
-
-#### CLI Arguments
-
-| Argument | Default | Description |
-|----------|---------|-------------|
-| `--router-mode kv` | `round-robin` | Enable KV cache-aware routing |
-| `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
-| `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
-| `--kv-events` / `--no-kv-events` | `--kv-events` | Enable/disable real-time KV event tracking |
-| `--kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) |
-
-For all available options: `python -m dynamo.frontend --help`
-
-For detailed configuration options and tuning parameters, see [Using the KV Cache Router](#using-the-kv-cache-router).
-
-### Kubernetes Deployment
-
-To enable the KV Router in Kubernetes, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-metadata:
-  name: my-deployment
-spec:
-  services:
-    Frontend:
-      dynamoNamespace: my-namespace
-      componentType: frontend
-      replicas: 1
-      envs:
-        - name: DYN_ROUTER_MODE
-          value: kv  # Enable KV Smart Router
-```
-
-**Key Points:**
-- Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only
-- Workers automatically report KV cache events to the router
-- No worker-side configuration changes needed
-
-#### Environment Variables
-
-All CLI arguments can be configured via environment variables using the `DYN_` prefix:
-
-| CLI Argument | Environment Variable | Default |
-|--------------|---------------------|---------|
-| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round-robin` |
-| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` |
-| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific |
-| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` |
-| `--kv-overlap-score-weight` | `DYN_KV_OVERLAP_SCORE_WEIGHT` | `1.0` |
-
-For complete K8s examples and advanced configuration, see [K8s Examples](router_examples.md#k8s-examples).
-For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
-
-## KV Cache Routing
-
-KV cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data. By maximizing cache reuse, it reduces redundant computation and improves both throughput and latency.
-
-```mermaid
-graph TD
-    T[Tokens] --> R[KV Aware Router]
-
-    R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
-    R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
-    R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]
-
-    style T fill:#fff3e0,stroke:#333,color:#333
-    style R fill:#2e8b57,stroke:#333,color:#fff
-    style W1 fill:#f3e5f5,stroke:#333,color:#333
-    style W2 fill:#c8e6c9,stroke:#333,color:#333
-    style W3 fill:#f3e5f5,stroke:#333,color:#333
-
-    linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
-```
-
-KV Cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:
-- Missed cache reuse opportunities due to suboptimal worker selection
-- System throughput degradation from uneven request distribution across workers
-
-The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions:
-
-### Cost Calculation
-
-1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
-
-2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
-
-3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks`
-   - Lower costs indicate better routing choices
-   - `overlap_score_weight` balances cache hit optimization against load distribution
-   - Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL)
-
-### Worker Selection
-
-The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.
-
-Example calculation with `overlap_score_weight = 1.0`:
-- Worker 1: cost = 1.0 * 8 + 10 = 18
-- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost)
-- Worker 3: cost = 1.0 * 2 + 9 = 11
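
The same calculation can be written as a short sketch. The `WorkerLoad` dataclass and `routing_cost` helper are hypothetical names used only to reproduce the arithmetic above; they are not the router's internal types.

```python
from dataclasses import dataclass


@dataclass
class WorkerLoad:
    name: str
    prefill_blocks: float  # blocks that still need prefill on this worker
    decode_blocks: float   # active decode blocks on this worker


def routing_cost(load: WorkerLoad, overlap_score_weight: float = 1.0) -> float:
    """cost = overlap_score_weight * prefill_blocks + decode_blocks"""
    return overlap_score_weight * load.prefill_blocks + load.decode_blocks


workers = [
    WorkerLoad("worker-1", prefill_blocks=8, decode_blocks=10),
    WorkerLoad("worker-2", prefill_blocks=5, decode_blocks=5),
    WorkerLoad("worker-3", prefill_blocks=2, decode_blocks=9),
]

costs = {w.name: routing_cost(w) for w in workers}
selected = min(costs, key=costs.get)  # lowest cost wins at temperature 0

# Raising the weight shifts the choice toward the worker with the most cached blocks.
weighted_costs = {w.name: routing_cost(w, overlap_score_weight=3.0) for w in workers}
selected_weighted = min(weighted_costs, key=weighted_costs.get)
```

With the default weight, `selected` is `worker-2` (cost 10). With a weight of 3.0 the costs become 34, 20, and 15, so `selected_weighted` is `worker-3`, the worker with the fewest prefill blocks.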
-
-### Using the KV Cache Router
-
-To enable KV cache-aware routing, start the frontend node like this:
-```bash
-python -m dynamo.frontend --router-mode kv
-```
-
-When KV blocks are created or removed, the engine notifies the Dynamo router, which then identifies the worker with the best matching blocks and routes traffic accordingly.
-
-To evaluate the benefits of KV-aware routing, compare your workload's performance using `--router-mode random|round-robin` against KV-aware routing.
-
-The main KV-aware routing arguments:
-
- `--kv-overlap-score-weight`: Controls the importance of prefix cache overlaps in prefill cost calculations. Higher values improve Time To First Token (TTFT) at the cost of Inter-Token Latency (ITL). When set to 0, the router ignores prefix caches and uses pure load balancing. Defaults to 1.
-
- `--router-temperature`: Controls worker selection randomness through softmax sampling of router cost logits. A value of 0 (default) ensures deterministic selection of the lowest-cost worker, while higher values introduce more randomness.
-
- `--no-kv-events`: Disables KV event tracking. By default (when this flag is not provided), the router uses KV events to monitor block creation and deletion from workers. When disabled with this flag, the router predicts cache state based on routing decisions with TTL-based expiration (default 120s) and pruning. Use this flag if your backend doesn't support KV events (or you are not confident in the accuracy or responsiveness of the events).
-
- `--durable-kv-events`: Enables JetStream mode for KV event transport. Must be specified on **both** the frontend **and** all workers. When enabled, workers publish to JetStream instead of the local indexer, and the frontend consumes from JetStream as a durable consumer. Without this flag (default), workers use the local indexer with NATS Core or ZMQ event plane.
-
- `--router-replica-sync`:  Disabled by default. Enables NATS-based synchronization of local routing decisions between router replicas. When enabled, routers share their active sequence information and local predictions of block usage, improving routing consistency across instances. Note that this does not sync the radix tree or cached KV block states themselves - in JetStream mode those are synchronized through JetStream events; in local indexer mode (default) each router queries workers directly.
-
- `--router-reset-states`: Only applies in JetStream mode (`--durable-kv-events`). When specified, resets the router state on startup by clearing both the JetStream event stream and NATS object store, starting with a fresh state. **Warning**: Using `--router-reset-states` can bring existing router replicas into an inconsistent state. Only use this flag when launching the first router replica in a component, or consider using a different namespace/component for a clean slate.
-
- `--router-snapshot-threshold`: Only applies in JetStream mode (`--durable-kv-events`). Sets the number of messages in the JetStream before triggering a snapshot. When the message count exceeds this threshold, a router will attempt to purge acknowledged messages from the stream and create a snapshot of the current radix tree state in NATS object store. Defaults to 1000000. This helps manage stream size and provides faster initialization for routers that restart.
-
- `--no-track-active-blocks`: Disables tracking of active blocks (blocks being used for ongoing generation/decode phases). By default, the router tracks active blocks for load balancing. Disable this when routing to workers that only perform prefill (no decode phase), as tracking decode load is not relevant. This reduces router overhead and simplifies state management.
-
- `--track-output-blocks`: Enables tracking of output blocks during generation (default: disabled). When enabled, the router adds placeholder blocks as tokens are generated and applies fractional decay based on progress toward `expected_output_tokens`. This improves load balancing accuracy for long-running generation requests by accounting for output-side KV cache growth.
-
- `--no-assume-kv-reuse`: When tracking active blocks, disables the assumption of KV cache reuse. By default (`router_assume_kv_reuse=true`), the router computes actual block hashes for sequence tracking to deduplicate blocks and optimize load balancing. When disabled via this flag, the router generates random hashes for sequence blocks, treating each request's blocks as unique. This is useful in disaggregated setups where prefill transfers blocks to decode workers that may already have those blocks cached, but the engine cannot coordinate transfers to avoid duplication. Without this flag, the router's load balancing heuristics would undercount decode blocks when duplicates exist.
-
- `--active-decode-blocks-threshold`: Initial threshold (0.0-1.0) for determining when a worker is considered busy based on KV cache block utilization. When a worker's KV cache active blocks exceed this percentage of total blocks, it will be marked as busy and excluded from routing. If not set, blocks-based busy detection is disabled. This feature works with all routing modes (`--router-mode kv|round-robin|random`) as long as backend engines publish load metrics. The threshold can be dynamically updated at runtime via the `/busy_threshold` HTTP endpoint (see [Dynamic Threshold Configuration](#dynamic-threshold-configuration)).
-
- `--active-prefill-tokens-threshold`: Literal token count threshold for determining when a worker is considered busy based on prefill token utilization. When active prefill tokens exceed this threshold, the worker is marked as busy. If not set, tokens-based busy detection is disabled.
-
- `--active-prefill-tokens-threshold-frac`: Fraction of `max_num_batched_tokens` for busy detection. A worker is marked busy when `active_prefill_tokens > frac * max_num_batched_tokens`. Uses OR logic with `--active-prefill-tokens-threshold` (worker is busy if either threshold is exceeded). If not set, fractional busy detection is disabled.
-
- `--router-ttl`: Time-to-live in seconds for blocks in the router's local cache predictions. Blocks older than this duration will be automatically expired and removed from the router's radix tree. Defaults to 120.0 seconds when `--no-kv-events` is used. This helps manage memory usage by removing stale cache predictions that are unlikely to be accurate.
-
- `--router-max-tree-size`: Maximum tree size (number of blocks) before pruning is triggered. When the total number of blocks in the radix tree exceeds this threshold, the router will prune the least recently used blocks. Defaults to 1048576 (2^20 blocks) when `--no-kv-events` is used. This prevents unbounded memory growth in long-running deployments.
-
- `--router-prune-target-ratio`: Target size ratio to prune down to when `--router-max-tree-size` is exceeded. For example, with a value of 0.8 (default) and max tree size of 1048576, the router will prune down to approximately 838860 blocks when the threshold is exceeded. Defaults to 0.8 when `--no-kv-events` is used. This creates headroom before the next pruning cycle.
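
The interaction between `--router-max-tree-size` and `--router-prune-target-ratio` can be pictured with a toy least-recently-used store. This is only a sketch of the size-based pruning rule described above: the class `PredictedBlockCache` is hypothetical, it uses a flat `OrderedDict` rather than a radix tree, and it leaves out TTL expiration.

```python
from collections import OrderedDict


class PredictedBlockCache:
    """Toy LRU store standing in for the router's predicted block state."""

    def __init__(self, max_tree_size: int, prune_target_ratio: float = 0.8):
        self.max_tree_size = max_tree_size
        # Number of blocks to keep once pruning has been triggered.
        self.prune_target = int(max_tree_size * prune_target_ratio)
        self._blocks: OrderedDict[int, None] = OrderedDict()

    def touch(self, block_hash: int) -> None:
        """Record a block as most recently used, pruning if the cap is exceeded."""
        self._blocks[block_hash] = None
        self._blocks.move_to_end(block_hash)
        if len(self._blocks) > self.max_tree_size:
            self._prune()

    def _prune(self) -> None:
        # Drop least recently used entries until the target size is reached.
        while len(self._blocks) > self.prune_target:
            self._blocks.popitem(last=False)

    def __len__(self) -> int:
        return len(self._blocks)

    def __contains__(self, block_hash: int) -> bool:
        return block_hash in self._blocks


cache = PredictedBlockCache(max_tree_size=10, prune_target_ratio=0.8)
for block_hash in range(11):  # the 11th insert exceeds the cap of 10
    cache.touch(block_hash)

default_target = PredictedBlockCache(max_tree_size=1048576).prune_target
```

After the eleventh insert the toy cache holds 8 blocks, having dropped the three oldest. With the documented defaults, `default_target` works out to 838860 blocks.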
-
->[!Note]
-> **State persistence** depends on the event transport mode:
-> - **NATS Core / Event Plane mode** (default): State persists on workers—router rebuilds state by querying workers on startup. This is the default when workers have `local_indexer` enabled (which is the default). Works with both NATS Core and ZMQ event planes.
-> - **JetStream mode** (`--durable-kv-events` on **both** frontend **and** workers): State persists across router restarts via JetStream and NATS object store snapshots.
-> - **No KV events** (`--no-kv-events`): State persistence is not supported.
->
-> **Request plane is independent of KV event transport.**
-> The router can run without etcd or NATS when using ZMQ event plane (`--event-plane zmq`) and file/mem store (`--store-kv file` or `--store-kv mem`); in this case, KV events use ZMQ transport instead of NATS.
-> `DYN_REQUEST_PLANE` controls how **requests** are sent (TCP/HTTP/NATS), but KV-aware routing uses **NATS** for KV events only in JetStream or NATS Core modes (not ZMQ mode).
-> When KV events are enabled (default) with NATS-based event plane, NATS is automatically initialized. You can optionally set `NATS_SERVER=nats://...` to specify a custom NATS server; otherwise, it defaults to `localhost:4222`.
-> `--no-kv-events` disables KV event transport entirely.
->
-> When `--kv-overlap-score-weight` is set to 0, no KVIndexer is created and prefix matching is disabled (pure load balancing). When `--no-kv-events` is set, a KVIndexer is still created but no event subscriber is launched to consume KV events from workers. Instead, the router predicts cache state based on its own routing decisions with TTL-based expiration and pruning.
->
-> **Backend Configuration:** When using `--no-kv-events`, configure your backend workers to disable KV event publishing:
-> - **vLLM**: Use `--kv-events-config '{"enable_kv_cache_events": false}'`
-> - **SGLang**: Do not use `--kv-events-config`
-> - **TRT-LLM**: Do not use `--publish-events-and-metrics`
->
-> The cli args `--router-ttl`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When KV events are enabled (default), the router relies on worker-side eviction events and these parameters are ignored.
-
-To implement KV event publishing for custom inference engines, enabling them to participate in Dynamo's KV cache-aware routing, see [KV Event Publishing for Custom Engines](../../integrations/kv_events_custom_engines.md).
-
-## Basic Routing
-
-Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.
-
-First, create a client tied to a component's endpoint using the labels defined above. The following example gets a client tied to the `generate` endpoint of the `VllmWorker` component.
-
-```python
-# `runtime` is a DistributedRuntime instance; run this inside an async function
-client = await runtime.namespace('dynamo').component('VllmWorker').endpoint('generate').client()
-```
-
-We can then use the default routing methods exposed by the client class to send requests to the `VllmWorker` component.
-
- **Random routing**: Default strategy, available via `client.generate()` or `client.random()`
- **Round-robin routing**: Cycles through available workers via `client.round_robin()`
- **Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`
-
-KV Cache routing uses direct routing with a special worker selection algorithm.
-
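To make the difference between the three strategies concrete, here is a minimal sketch that models only the worker-selection step. `ToyClient` is a hypothetical stand-in and does not use the Dynamo client API; it returns the chosen worker ID rather than sending a request.

```python
import itertools
import random


class ToyClient:
    """Hypothetical stand-in for an endpoint client; it only picks a worker ID."""

    def __init__(self, worker_ids, seed=None):
        self.worker_ids = list(worker_ids)
        self._rng = random.Random(seed)
        self._cycle = itertools.cycle(self.worker_ids)

    def random(self):
        # Random routing: any available worker, chosen uniformly.
        return self._rng.choice(self.worker_ids)

    def round_robin(self):
        # Round-robin routing: cycle through workers in a fixed order.
        return next(self._cycle)

    def direct(self, worker_id):
        # Direct routing: the caller names the worker explicitly.
        if worker_id not in self.worker_ids:
            raise ValueError(f"unknown worker: {worker_id}")
        return worker_id


client = ToyClient(["worker_a", "worker_b", "worker_c"], seed=0)
print([client.round_robin() for _ in range(4)])
print(client.direct("worker_b"))
```

In this framing, KV cache routing is a `direct` call whose worker ID comes from a cost-based selection step instead of from the caller.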
-For benchmarking KV router performance, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
-
-For custom routing logic and advanced patterns, see [Routing Patterns](router_examples.md#routing-patterns) in the examples documentation.
-
-## Tuning Guidelines
-
-### 1. Understand Your Workload Characteristics
-
- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`
-
-### 2. Monitor Key Metrics
-
-The router logs the cost calculation for each worker:
-```text
-Formula for worker_1: 125.5 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
-```
-
-This shows:
- Total cost (125.5)
- Overlap weight × prefill blocks (1.0 × 100.5)
- Active blocks (25.0)
- Cached blocks that contribute to overlap (15)
-
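The cost terms listed above can be sketched as a small function. This is an illustration only: the names `worker_cost` and `pick_worker` and the sample numbers are hypothetical, and the sketch assumes the router picks the worker with the lowest cost.

```python
def worker_cost(overlap_weight, prefill_blocks, active_blocks):
    """Cost = overlap weight x blocks still to prefill + blocks already active."""
    return overlap_weight * prefill_blocks + active_blocks


def pick_worker(overlap_weight, workers):
    """Return the worker ID with the lowest cost.

    `workers` maps a worker ID to a (prefill_blocks, active_blocks) pair.
    """
    return min(workers, key=lambda w: worker_cost(overlap_weight, *workers[w]))


# Hypothetical numbers: worker_a has less cache overlap (more blocks to prefill)
# but is less busy; worker_b has more overlap but more active blocks.
workers = {"worker_a": (80.0, 30.0), "worker_b": (40.0, 60.0)}

print(pick_worker(1.0, workers))  # weight 1.0: costs are 110.0 vs 100.0
print(pick_worker(0.0, workers))  # weight 0.0: only active blocks count, 30.0 vs 60.0
```

Raising the weight makes cache overlap dominate the decision, and lowering it makes current load dominate, which is the trade-off the workload guidance in this section relies on.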
-### 3. Temperature-Based Routing
-
-The `router_temperature` parameter controls routing randomness:
- **0.0 (default)**: Deterministic selection of the best worker
- **> 0.0**: Probabilistic selection, higher values increase randomness
- Useful for preventing worker saturation and improving load distribution
-
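One common way to get this behavior is softmax sampling over negated costs. The sketch below assumes that approach purely for illustration; the text above only states that selection becomes probabilistic and more random as the temperature rises, so the exact formula and the name `select_worker` are assumptions.

```python
import math
import random


def select_worker(costs, temperature=0.0, rng=random):
    """Pick a worker from a {worker_id: cost} mapping.

    A temperature of 0 picks the lowest-cost worker deterministically; higher
    temperatures flatten the distribution (softmax over negated costs, which
    is an assumption of this sketch).
    """
    if temperature <= 0.0:
        return min(costs, key=costs.get)
    lowest = min(costs.values())
    worker_ids = list(costs)
    # Subtract the lowest cost before exponentiating for numerical stability.
    weights = [math.exp(-(costs[w] - lowest) / temperature) for w in worker_ids]
    return rng.choices(worker_ids, weights=weights, k=1)[0]


costs = {"worker_a": 100.0, "worker_b": 110.0}
print(select_worker(costs))  # deterministic: the lowest-cost worker
print(select_worker(costs, temperature=50.0, rng=random.Random(0)))
```

With a small positive temperature the lowest-cost worker still wins almost every time; only as the temperature approaches the scale of the cost differences do other workers receive a meaningful share of traffic.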
-### 4. Iterative Optimization
-
-1. Begin with default settings
-2. Monitor TTFT and ITL metrics
-3. Adjust `kv-overlap-score-weight` to meet your performance goals:
-   - To reduce TTFT: Increase the weight
-   - To reduce ITL: Decrease the weight
-4. If you observe severe load imbalance, increase the temperature setting
-
-## Disaggregated Serving
-
-Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../../development/backend-guide.md)), the frontend automatically detects them and activates an internal prefill router.
-
-### Automatic Prefill Router Activation
-
-The prefill router is automatically created when:
-1. A decode model is registered (e.g., via `register_llm()` with `ModelType.Chat | ModelType.Completions`)
-2. A prefill worker is detected with the same model name and `ModelType.Prefill`
-
-**Key characteristics of the prefill router:**
- **Always disables active block tracking** (`track_active_blocks=false`) since prefill workers don't perform decode
- **Seamlessly integrated** into the request pipeline between preprocessing and decode routing
- **Falls back gracefully** to decode-only mode if prefill fails or no prefill workers are available
-
-### Setup Example
-
-When both workers are registered, requests are automatically routed.
-
-```python
-# Decode worker registration (in your decode worker)
-decode_endpoint = runtime.namespace("dynamo").component("decode").endpoint("generate")
-
-await register_llm(
-    model_input=ModelInput.Tokens,
-    model_type=ModelType.Chat | ModelType.Completions,
-    endpoint=decode_endpoint,
-    model_name="meta-llama/Llama-2-7b-hf",
-    # ... other parameters
-)
-
-await decode_endpoint.serve_endpoint(decode_handler.generate)
-
-# Prefill worker registration (in your prefill worker)
-prefill_endpoint = runtime.namespace("dynamo").component("prefill").endpoint("generate")
-
-await register_llm(
-    model_input=ModelInput.Tokens,
-    model_type=ModelType.Prefill,  # <-- Mark as prefill worker
-    endpoint=prefill_endpoint,
-    model_name="meta-llama/Llama-2-7b-hf",  # Must match decode model name
-    # ... other parameters
-)
-
-await prefill_endpoint.serve_endpoint(prefill_handler.generate)
-```
-
-> [!Note]
-> The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](../../examples/backends/sglang/launch/disagg_router.sh).
-
-### Request Flow
-
-The following diagram shows an overview of the major components in disaggregated serving:
-
-```mermaid
-graph TD
-    HTTP[HTTP]
-    ROUTER[Router]
-    PREFILL[Prefill Worker]
-    DECODE[Decode Worker]
-
-    classDef worker_style fill:#f3e5f5,stroke:#333,stroke-width:2px,color:#333;
-    classDef router_style fill:#2e8b57,stroke:#333,stroke-width:2px,color:#fff;
-
-    class PREFILL,DECODE worker_style
-    class ROUTER router_style
-
-    HTTP <--> |"request/response"| ROUTER
-    ROUTER --> |"1. send to prefill"| PREFILL
-    PREFILL --> |"2. return NIXL metadata"| ROUTER
-    ROUTER --> |"3. send with metadata"| DECODE
-    DECODE --> |"4. stream response"| ROUTER
-
-    PREFILL -.-> |"publish kv events"| ROUTER
-
-    linkStyle 0,1,2,3,4 stroke:#8b4513,stroke-width:2px
-    linkStyle 5 stroke:#2196f3,stroke-width:2px
-```
-
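The numbered steps in the diagram, together with the decode-only fallback described earlier, can be sketched as a toy control flow. Everything here is hypothetical: `route_request`, the stub workers, and the metadata dictionary are placeholders, and real worker selection and NIXL transfers are not modeled.

```python
def route_request(prompt_tokens, prefill_workers, decode_worker):
    """Toy flow: prefill (if available) -> transfer metadata -> decode."""
    transfer_metadata = None
    if prefill_workers:
        try:
            # Steps 1-2: send to a prefill worker and receive transfer metadata.
            # Taking the first worker is a simplification of real worker selection.
            transfer_metadata = prefill_workers[0](prompt_tokens)
        except RuntimeError:
            # Prefill failed: fall back to decode-only mode.
            transfer_metadata = None
    # Steps 3-4: send to decode (with metadata if present) and stream the response.
    yield from decode_worker(prompt_tokens, transfer_metadata)


def stub_prefill(tokens):
    # Placeholder for the metadata a prefill worker would return.
    return {"kv_blocks": len(tokens)}


def stub_decode(tokens, metadata):
    mode = "disagg" if metadata is not None else "decode-only"
    for i in range(2):
        yield f"{mode}-token-{i}"


print(list(route_request([1, 2, 3], [stub_prefill], stub_decode)))
print(list(route_request([1, 2, 3], [], stub_decode)))
```

The decode worker receives the same prompt either way; what changes is whether it can pull existing KV blocks using the metadata or has to compute the prefill itself.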
-## Serving Multiple Router Replicas
-
-For improved fault tolerance, you can launch multiple frontend + router replicas. Since the frontend and router are currently tied together, you'll need to use a different HTTP port for each instance. (Separating the frontend from the router is a work in progress.)
-
-### Router State Management
-
-The KV Router tracks two types of state (see [Router Design](../../design_docs/router_design.md) for details):
-
-1. **Prefix blocks (cached KV blocks)**: Maintained in a radix tree, tracking which blocks are cached on each worker. This state is **persistent** - in local indexer mode (default) state is rebuilt from workers on startup; in JetStream mode (`--durable-kv-events`) it is backed by JetStream events and object store snapshots.
-
-2. **Active blocks (decoding blocks)**: Tracks blocks currently being used for active generation requests. This state is **ephemeral** - when a new router replica starts, it begins with zero active block knowledge but becomes eventually consistent as it handles requests.
-
-### Enabling Router Replica Synchronization
-
-```bash
-# Router replica 1
-python -m dynamo.frontend --router-mode kv --http-port 8000 --router-replica-sync
-
-# Router replica 2 (can be started later)
-python -m dynamo.frontend --router-mode kv --http-port 8001 --router-replica-sync
-```
-
-The `--router-replica-sync` flag enables active block synchronization between replicas:
- Active blocks are shared via NATS core messaging (fire-and-forget)
- Replicas exchange routing decisions to maintain consistent load estimates
- A new replica starts with zero active blocks but quickly converges, both by handling requests itself and by syncing with other replicas
-
-Without this flag, each replica maintains its own isolated view of active blocks, potentially leading to suboptimal routing.
-
-### Persistence and Recovery
-
-Persistence behavior depends on which event transport mode is active:
-
-**NATS Core / Event Plane with Local Indexer Mode (default):**
- State persists on workers—events are fire-and-forget but workers retain their local indexer state
- On startup, the router queries each worker's local indexer to rebuild state
- Recovery depends on workers being available; if a worker is down, its blocks cannot be recovered
- Simpler infrastructure (no JetStream required)
-
-**JetStream Mode** (`--durable-kv-events` on **both** frontend **and** workers)**:**
- Prefix blocks are stored in NATS JetStream with 1-hour retention
- Snapshots saved to NATS object store at configurable thresholds
- New replicas automatically restore this state on startup
- You can launch a third Router replica even if the first two are down, and it will recover the full prefix state
-
-```bash
-python -m dynamo.frontend --router-mode kv --http-port 8002 --router-replica-sync
-```
-
-> [!Note]
-> If you need to start with a fresh state in JetStream mode, you have two options:
-> 1. **Recommended**: Use a different namespace/component (see [Distributed Runtime](/docs/design_docs/distributed_runtime.md)) which will start a new stream and NATS object store path
-> 2. **Use with caution**: Launch a router with the `--router-reset-states` flag, which will purge the entire stream and radix snapshot. This should only be done when launching the first router replica in a component, as it can bring existing router replicas into an inconsistent state.
-
-## Dynamic Threshold Configuration
-
-Dynamic threshold configuration lets you adjust worker busy thresholds at runtime without restarting the frontend, so you can tune load balancing based on observed system performance.
-
-The frontend exposes HTTP endpoints at `/busy_threshold`:
-
-**Get or set a model's thresholds (POST):**
-```bash
-# Set both thresholds for a model
-curl -X POST http://localhost:8000/busy_threshold \
-  -H "Content-Type: application/json" \
-  -d '{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}'
-# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}
-
-# Set only active decode blocks threshold
-curl -X POST http://localhost:8000/busy_threshold \
-  -H "Content-Type: application/json" \
-  -d '{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85}'
-# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": <current_value>}
-
-# Get current thresholds (omit threshold fields)
-curl -X POST http://localhost:8000/busy_threshold \
-  -H "Content-Type: application/json" \
-  -d '{"model": "meta-llama/Llama-2-7b-hf"}'
-# Response: {"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}
-# Or if not configured: {"model": "...", "active_decode_blocks_threshold": null, "active_prefill_tokens_threshold": null}
-```
-
-**List all configured thresholds (GET):**
-```bash
-curl http://localhost:8000/busy_threshold
-# Response: {"thresholds": [{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}]}
-```
-
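The same requests can be issued from Python using only the standard library. This is a minimal sketch based on the `curl` examples above: the helper names `build_threshold_payload` and `post_busy_threshold` are hypothetical, and the base URL assumes a frontend listening on `localhost:8000`.

```python
import json
import urllib.request


def build_threshold_payload(model, decode_blocks=None, prefill_tokens=None):
    """Build the JSON body; omitted thresholds are left out of the request."""
    payload = {"model": model}
    if decode_blocks is not None:
        payload["active_decode_blocks_threshold"] = decode_blocks
    if prefill_tokens is not None:
        payload["active_prefill_tokens_threshold"] = prefill_tokens
    return payload


def post_busy_threshold(base_url, payload, timeout=5.0):
    """POST the payload to /busy_threshold and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/busy_threshold",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Requires a running frontend; the URL is an assumption for this example.
    body = build_threshold_payload("meta-llama/Llama-2-7b-hf", decode_blocks=0.85)
    print(post_busy_threshold("http://localhost:8000", body))
```

Passing only the model name produces the read-only form of the request shown in the third `curl` example.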
-## See Also
-
- **[Router README](README.md)**: Quick start guide for the KV Router
- **[Router Examples](router_examples.md)**: Python API usage, K8s examples, and custom routing patterns
- **[Router Design](../../design_docs/router_design.md)**: Architecture details and event transport modes
- **[KV Event Publishing for Custom Engines](../../integrations/kv_events_custom_engines.md)**: Integrate custom inference engines with KV-aware routing
--- a/docs/conf.py
+++ b/docs/conf.py
-# SPDX-FileCopyrightText: Copyright (c) 2023-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
-
-# Configuration file for the Sphinx documentation builder.
-import os
-import sys
-
-# -- Project information -----------------------------------------------------
-project = "NVIDIA Dynamo"
-copyright = "2024-2026, NVIDIA CORPORATION & AFFILIATES"
-author = "NVIDIA"
-
-# Version is set via DYNAMO_DOCS_VERSION env var during build (e.g., "0.3.0")
-# Defaults to "dev" for main branch and PR builds
-release = os.environ.get("DYNAMO_DOCS_VERSION", "dev")
-
-# -- General configuration ---------------------------------------------------
-
-# Standard extensions
-extensions = [
-    "ablog",
-    "myst_parser",
-    "sphinx_copybutton",
-    "sphinx_design",
-    "sphinx_prompt",
-    # "sphinxcontrib.bibtex",
-    "sphinx_tabs.tabs",
-    "sphinx_sitemap",
-    "sphinx.ext.autodoc",
-    "sphinx.ext.autosummary",
-    "sphinx.ext.mathjax",
-    "sphinx.ext.napoleon",
-    "sphinx.ext.ifconfig",
-    "sphinx.ext.extlinks",
-    "sphinxcontrib.mermaid",
-    "sphinx_reredirects",
-]
-
-# Redirects configuration
-redirects = {
-    # Frontend migration
-    "frontends/kserve": "../components/frontend/frontend_guide.html",
-    # PR  #3802
-    "guides/tool-calling": "../agents/tool-calling.html",  # key format corrected
-    "architecture/architecture": "../design_docs/architecture.html",
-    "architecture/disagg_serving": "../design_docs/disagg_serving.html",
-    "architecture/distributed_runtime": "../design_docs/distributed_runtime.html",
-    "architecture/dynamo_flow": "../design_docs/dynamo_flow.html",
-    "architecture/request_cancellation": "../fault_tolerance/request_cancellation.html",
-    "architecture/request_migration": "../fault_tolerance/request_migration.html",
-    "kubernetes/create_deployment": "../kubernetes/deployment/create_deployment.html",
-    "kubernetes/minikube": "../kubernetes/deployment/minikube.html",
-    "kubernetes/multinode-deployment": "../kubernetes/deployment/multinode-deployment.html",
-    "kubernetes/logging": "../kubernetes/observability/logging.html",
-    "kubernetes/metrics": "../kubernetes/observability/metrics.html",
-    "architecture/kv_cache_routing": "../components/router/router_guide.html",
-    # PR #3658
-    "API/nixl_connect/README": "../../api/nixl_connect/README.html",
-    "API/nixl_connect/connector": "../../api/nixl_connect/connector.html",
-    "API/nixl_connect/descriptor": "../../api/nixl_connect/descriptor.html",
-    "API/nixl_connect/device": "../../api/nixl_connect/device.html",
-    "API/nixl_connect/device_kind": "../../api/nixl_connect/device_kind.html",
-    "API/nixl_connect/operation_status": "../../api/nixl_connect/operation_status.html",
-    "API/nixl_connect/rdma_metadata": "../../api/nixl_connect/rdma_metadata.html",
-    "API/nixl_connect/read_operation": "../../api/nixl_connect/read_operation.html",
-    "API/nixl_connect/readable_operation": "../../api/nixl_connect/readable_operation.html",
-    "API/nixl_connect/writable_operation": "../../api/nixl_connect/writable_operation.html",
-    "API/nixl_connect/write_operation": "../../api/nixl_connect/write_operation.html",
-    "guides/backend": "../development/backend-guide.html",
-    "runtime/README": "../development/runtime-guide.html",
-    "guides/tool_calling": "../agents/tool-calling.html",
-    "architecture/kvbm_architecture": "../design_docs/kvbm_design.html",
-    "architecture/kvbm_components": "../design_docs/kvbm_design.html",
-    "architecture/kvbm_intro": "../components/kvbm/README.html",
-    "architecture/kvbm_motivation": "../design_docs/kvbm_design.html",
-    "architecture/kvbm_reading": "../design_docs/kvbm_design.html",
-    "guides/run_kvbm_in_trtllm": "../components/kvbm/kvbm_guide.html",
-    "guides/run_kvbm_in_vllm": "../components/kvbm/kvbm_guide.html",
-    "guides/health_check": "../observability/health-checks.html",
-    "guides/logging": "../observability/logging.html",
-    "guides/metrics": "../observability/metrics.html",
-    "guides/disagg_perf_tuning": "../performance/tuning.html",
-    "architecture/load_planner": "../components/planner/README.html",
-    "architecture/planner_intro": "../components/planner/README.html",
-    "architecture/sla_planner": "../components/planner/planner_guide.html",
-    "kubernetes/sla_planner_quickstart": "../components/planner/planner_guide.html",
-    "guides/dynamo_run": "../reference/cli.html",
-    "dynamo_glossary": "../reference/glossary.html",
-    "support_matrix": "../reference/support-matrix.html",
-    # Multimodal documentation consolidation (all redirect to features/multimodal/)
-    "backends/vllm/multimodal": "../../features/multimodal/multimodal_vllm.html",
-    "backends/vllm/multimodal_vllm_guide": "../../features/multimodal/multimodal_vllm.html",
-    "backends/trtllm/multimodal_support": "../../features/multimodal/multimodal_trtllm.html",
-    "backends/trtllm/multimodal_trtllm_guide": "../../features/multimodal/multimodal_trtllm.html",
-    "backends/trtllm/multinode/multinode-multimodal-example": "../../../features/multimodal/multimodal_trtllm.html",
-    "backends/sglang/multimodal_epd": "../../features/multimodal/multimodal_sglang.html",
-    "backends/sglang/multimodal_sglang_guide": "../../features/multimodal/multimodal_sglang.html",
-    "multimodal/multimodal_intro": "../features/multimodal/README.html",
-    # Speculative decoding consolidation
-    "backends/vllm/speculative_decoding": "../../features/speculative_decoding/speculative_decoding_vllm.html",
-    # Multimodal migration to features/multimodal/
-    "multimodal/index": "../features/multimodal/README.html",
-    "multimodal/vllm": "../features/multimodal/multimodal_vllm.html",
-    "multimodal/sglang": "../features/multimodal/multimodal_sglang.html",
-    "multimodal/trtllm": "../features/multimodal/multimodal_trtllm.html",
-    # Component consolidation into docs/components/
-    "router/README": "../components/router/README.html",
-    "router/kv_cache_routing": "../components/router/router_guide.html",
-    "router/kv_events": "../integrations/kv_events_custom_engines.html",
-    "planner/planner_intro": "../components/planner/README.html",
-    "planner/README": "../components/planner/README.html",
-    "planner/planner_guide": "../components/planner/planner_guide.html",
-    "planner/planner_examples": "../components/planner/planner_examples.html",
-    "planner/sla_planner_quickstart": "../components/planner/planner_guide.html",
-    "planner/sla_planner": "../components/planner/planner_guide.html",
-    "planner/load_planner": "../components/planner/README.html",
-    "kvbm/kvbm_intro": "../components/kvbm/README.html",
-    "kvbm/README": "../components/kvbm/README.html",
-    "kvbm/kvbm_guide": "../components/kvbm/kvbm_guide.html",
-    "kvbm/kvbm_design": "../design_docs/kvbm_design.html",
-    # Profiler consolidation
-    "benchmarks/sla_driven_profiling": "../components/profiler/profiler_guide.html",
-}
-
-# Custom extensions
-sys.path.insert(0, os.path.abspath("_extensions"))
-extensions.append("github_alerts")
-
-# Treat ```mermaid fenced code blocks as directives so sphinxcontrib-mermaid renders them
-myst_fence_as_directive = ["mermaid"]
-
-# File extensions (myst_parser automatically handles .md files)
-source_suffix = [".rst", ".md"]
-
-# MyST parser configuration
-myst_enable_extensions = [
-    "colon_fence",  # ::: code blocks
-    "deflist",  # Definition lists
-    "html_image",  # HTML images
-    "tasklist",  # Task lists
-]
-
-# Templates path
-templates_path = ["_templates"]
-
-# List of patterns to ignore when looking for source files
-exclude_patterns = ["_build", "Thumbs.db", ".DS_Store", "build"]
-
-# -- Options for HTML output -------------------------------------------------
-html_theme = "nvidia_sphinx_theme"
-html_static_path = ["_static"]
-html_extra_path = ["project.json"]
-html_theme_options = {
-    "collapse_navigation": False,
-    "icon_links": [
-        {
-            "name": "GitHub",
-            "url": "https://github.com/ai-dynamo/dynamo",
-            "icon": "fa-brands fa-github",
-        }
-    ],
-    "switcher": {
-        # Use single shared URL so all versions see the same switcher list
-        # When a new version is added, all old docs automatically see it
-        "json_url": "https://docs.nvidia.com/dynamo/versions1.json",
-        "version_match": release,
-    },
-    "extra_head": {
-        """
-    <script src="https://assets.adobedtm.com/5d4962a43b79/c1061d2c5e7b/launch-191c2462b890.min.js" ></script>
-    """
-    },
-    "extra_footer": {
-        """
-    <script type="text/javascript">if (typeof _satellite !== "undefined") {_satellite.pageBottom();}</script>
-    """
-    },
-    "navbar_start": ["navbar-logo"],
-    "primary_sidebar_end": [],
-}
-
-# Document settings
-master_doc = "index"
-html_title = f"{project} Documentation"
-html_short_title = project
-html_baseurl = "https://docs.nvidia.com/dynamo/latest/"
-
-# Suppress warnings for external links and missing references
-suppress_warnings = [
-    "myst.xref_missing",  # Missing cross-references of relative links outside docs folder
-]
-
-# Additional MyST configuration
-myst_heading_anchors = 7  # Generate anchors for headers
-myst_substitutions = {}  # Custom substitutions
--- a/fern/convert_callouts.py
+++ b/fern/convert_callouts.py
--- a/docs/design_docs/architecture.md
+++ b/docs/design_docs/architecture.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
-All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# High Level Architecture
-
-Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting TRT-LLM, vLLM, SGLang and others, while capturing essential LLM capabilities:
-
- **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps you balance throughput and latency
- **Dynamic GPU scheduling**: Optimizes performance based on real-time demand
- **LLM-aware request routing**: Eliminates unnecessary KV cache recomputation
- **Accelerated data transfer**: Reduces inference response time using NIXL
- **KV cache offloading**: Uses multiple memory hierarchies for higher system throughput and lower latency
-
-Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, Open Source Software (OSS)-first development approach.
-
-## Motivation behind Dynamo
-
-Scaling inference for generative AI and reasoning models presents complex challenges in three key areas: performance, correctness, and efficiency. The specific challenges are:
-
- *Difficult UX*: User experience is critical for distributed inference runtimes because managing large-scale inference systems is already complex, and poor usability further complicates matters. Developers need a clear, intuitive way to define, optimize, and update inference execution without wrestling with low-level infrastructure details. Without simple UX, inference runtimes remain inaccessible, prone to errors, and inefficient, hindering model deployment and innovation. A modern distributed inference stack must consider usability at its core—empowering developers to scale AI effortlessly for agentic workflows while ensuring correctness and performance.
-
- *GPU underutilization*: Traditional monolithic inference pipelines often leave GPUs idle due to the imbalance between prefill and decode stages. Prefill (which generates large prompt embeddings) is highly compute-intensive, while decode (which generates tokens) is latency-sensitive. A disaggregated approach that separates prefill and decode ensures optimal GPU utilization and increases overall throughput ([DistServe](https://arxiv.org/abs/2401.09670)).
-
- *Expensive KV cache re-computation*: When requests aren't efficiently routed, KV caches (the intermediate states of a transformer model) often get flushed and recomputed, leading to wasted computation cycles and increased latency. KV-aware request routing eliminates redundant KV cache regeneration, significantly boosting efficiency ([DeepSeek](https://arxiv.org/abs/2501.12948)).
-
- *Memory bottlenecks*: Large-scale inference workloads demand extensive KV cache storage, which can quickly overwhelm GPU memory capacity. KV cache offloading across memory hierarchies (HBM, DDR, NVMe or remote storage) enables models to scale beyond GPU memory limits and reduces latency ([Mooncake](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html), [AIBrix](https://blog.vllm.ai/2025/02/21/aibrix-release.html), [LMCache](https://lmcache.ai/)).
-
- *Fluctuating demand and inefficient GPU allocation*: Inference workloads are use-case specific and dynamic. Demand surges are inherently unpredictable, yet traditional serving stacks allocate GPUs statically. Dynamic GPU scheduling ensures that resources are allocated based on real-time demand, preventing over-provisioning and improving utilization ([AzureTrace](https://github.com/Azure/AzurePublicDataset)).
-
- *Inefficient data transfer*: Distributed inference workloads introduce unique and highly dynamic communication patterns that differ fundamentally from training. Unlike training, where worker roles remain largely static, inference requires real-time worker scaling, dynamic load balancing, and adaptive memory management—necessitating a communication layer that can efficiently handle these evolving requirements. Contemporary libraries are built for static, synchronous operations and lack the dynamicity needed for inference serving. While UCX provides high-performance networking, it requires deep networking expertise to configure correctly, making it impractical for broad inference use cases. Developers need a library optimized for inference workloads that can abstract heterogeneous memory (remote memory or storage) and dynamically select the best transport mechanism via a unified API.
-
-To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access.
-
-## Key benefits
-
-The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
-
- [Dynamo Disaggregated Serving](disagg_serving.md)
- [Dynamo Smart Router](../components/router/README.md)
- [Dynamo KV Cache Block Manager](../kvbm/kvbm_intro.rst)
- [Planner](../planner/planner_intro.rst)
- [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
-
-Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.
-
-![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](../images/architecture.png "Dynamo Architecture")
-
-Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if Dynamo detects an increase in requests with long input sequences, the Planner automatically scales up prefill workers to meet the heightened demand.
-
-Beyond efficient event communication, data transfer across multi-node deployments is crucial at scale. To address this, Dynamo utilizes NIXL, a technology designed to expedite transfers through reduced synchronization and intelligent batching. This acceleration is particularly vital for disaggregated serving, ensuring minimal latency when prefill workers pass KV cache data to decode workers.
-
-Dynamo prioritizes seamless integration. Its modular design enables it to work harmoniously with your existing infrastructure and preferred open-source components. To achieve optimal performance and extensibility, Dynamo leverages the strengths of both Rust and Python. We built critical performance-sensitive modules with Rust for speed, memory safety, and robust concurrency. Meanwhile, we used Python for its flexibility, enabling rapid prototyping and effortless customization.
-
-## Performance benefits of key features
-
-### Disaggregated serving
-
-Disaggregating prefill and decode boosts performance, gaining efficiency when more GPUs are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization.
-
-![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](../images/disagg_perf_benefit.png)
-
-* Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL / 150 OSL
-
-
-The disaggregation of prefill and decode phases offers valuable flexibility. Since these phases directly correlate with time-to-first-token (TTFT) and inter-token latency (ITL) respectively, adjusting worker allocation can provide tailored performance. This enables optimization for specific service level agreements (SLAs), whether prioritizing faster TTFT, lower ITL, or higher throughput.
-
-### KV aware routing
-
-![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](../images/kv_routing.png)
-
-* Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL
-
-
-Existing routing methods, including load-based routing, overlook the specific properties of LLMs that could improve performance. Routing user queries to the workers with the highest KV cache hit rate (rather than simply the least busy node) allows for immediate processing, even under heavy load. The preceding figures illustrate the effectiveness of KV aware routing on 100,000 real R1 user queries, achieving a 3x improvement in TTFT and a 2x reduction in average request latency. Depending on traffic, this approach can also enhance throughput.
-
-### KV cache manager
-
-The Dynamo KV Block Manager (KVBM) enables KV cache offloading to system CPU memory, local SSDs, and network-attached storage, allowing more KV blocks to be reused instead of recomputed. In many cases, KV transfer is faster than recomputation, so KVBM helps reduce time-to-first-token (TTFT). The following plot highlights the performance gains achieved through CPU memory offloading. In a scenario involving 20 multi-turn conversations with 15 users, KVBM with CPU memory offloading achieved a 2.2×–12× improvement in TTFT (depending on QPS), demonstrating benefits that extend beyond basic prefix caching.
-![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](../images/kvbm_agg_performance.png)
-
-* Tested with different QPS using Qwen3-8B on H100. Avg 20K ISL / 100 OSL.
-
-### NVIDIA Inference Transfer Library (NIXL)
-
-NIXL streamlines data transfer through simplified synchronization and batching, along with simplified source and destination abstractions. NIXL can abstract data movement across different types of memory and fast storage, whereas other data transfer libraries typically support a single tier of memory. These enhancements yield significant performance gains, improving both time-to-first-token (TTFT) and throughput.
-
-## Acknowledgements
-
-We'd like to acknowledge several open source software stacks that motivated our creation of Dynamo.
-
- vLLM and vLLM-project
- SGLang
- DistServe
- Mooncake
- AIBrix
- BentoML
--- a/docs/design_docs/disagg_serving.md
+++ b/docs/design_docs/disagg_serving.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
-All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-
-# Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance
-
-The prefill and decode phases of LLM requests have different computation characteristics and memory footprints. Disaggregating these phases into specialized LLM engines allows for better hardware allocation, improved scalability, and overall enhanced performance. For example, using a larger tensor-parallel (TP) size for the memory-bound decode phase and a smaller TP size for the compute-bound prefill phase allows both phases to be computed efficiently. In addition, for requests with long context, separating their prefill phase into dedicated prefill engines allows ongoing decode requests to be processed efficiently without being blocked by these long prefills.
-
-Disaggregated execution of a request has three main steps:
-1. Prefill engine computes the prefill phase and generates the KV cache.
-2. Prefill engine transfers the KV cache to the decode engine.
-3. Decode engine computes the decode phase.
-
-The disaggregation design in Dynamo features a flexible framework that delivers strong performance across various conditions.
-
-## Efficient KV Transfer
-
-The key to high-performance disaggregation is efficient KV transfer. Dynamo leverages NIXL to transfer KV cache directly from the VRAM of the prefill engine to the VRAM of the decode engine. The KV transfer is non-blocking, allowing GPU forward passes to continue serving other requests during the transfer.
-
-### Router Orchestration
-
-The disaggregated serving flow is orchestrated by the `PrefillRouter`:
-
-```mermaid
-sequenceDiagram
-    participant Client
-    participant Frontend
-    participant Router as PrefillRouter
-    participant Prefill as Prefill Worker
-    participant Decode as Decode Worker
-
-    Client->>Frontend: Request
-    Frontend->>Router: Preprocessed Request
-    Router->>Router: Select prefill worker
-    Router->>Prefill: Prefill request
-    Prefill->>Prefill: Compute KV cache
-    Prefill-->>Router: disaggregated_params
-    Router->>Router: Select decode worker
-    Router->>Decode: Decode request + transfer metadata
-    Decode<<->>Prefill: KV transfer (NIXL)
-    Decode->>Decode: Generate tokens
-    Decode-->>Frontend: Stream tokens
-    Frontend-->>Client: Response
-```
-
-1. **Worker Selection**: The router selects a prefill worker using KV-aware routing (based on cache overlap scores and load) or simple load balancing.
-
-2. **Prefill Execution**: The router sends the prefill request to the selected prefill worker. The prefill worker computes the KV cache and returns `disaggregated_params` containing backend-specific transfer metadata.
-
-3. **Decode Routing**: The router injects the prefill result into the decode request, then routes to the decode worker.
-
-4. **KV Transfer**: The decode worker uses the transfer metadata to coordinate with the prefill worker. NIXL handles the direct GPU-to-GPU transfer using the optimal available transport (NVLink, InfiniBand/UCX, etc.).
-
-### Backend-Specific Transfer Metadata
-
-The transfer metadata format varies by backend:
-
- **SGLang**: Uses `bootstrap_info` (host, port, room_id) for RDMA bootstrap coordination. SGLang prefill workers publish their bootstrap endpoint to the discovery service during initialization. With this mechanism, prefill can run as a background task, allowing the decode phase to begin immediately while the KV transfer proceeds in parallel.
-
- **vLLM**: Uses `kv_transfer_params` containing block IDs and remote worker connection info. Prefill runs synchronously; decode waits for prefill to complete before proceeding.
-
- **TRTLLM**: Uses `opaque_state` containing serialized TRT-LLM internal metadata. Prefill runs synchronously; decode waits for prefill to complete before proceeding.
-
-
-## Runtime-Reconfigurable xPyD
-
-Dynamo's disaggregation design supports runtime-reconfigurable xPyD (x prefill workers, y decode workers). Workers can be added and removed at runtime:
-
- **Add worker**: Worker registers with the discovery service and publishes its `RuntimeConfig` (including KV capacity).
- **Remove worker**: Worker drains active requests and deregisters from discovery.
-
-The router automatically discovers new workers via the discovery service and incorporates them into routing decisions.
-
--- a/docs/design_docs/distributed_runtime.md
+++ b/docs/design_docs/distributed_runtime.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# Dynamo Distributed Runtime
-
-## Overview
-
-Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (e.g., Python bindings can be found in `/lib/bindings/python`). The runtime supports multiple discovery backends (Kubernetes-native or etcd) and request planes (TCP, HTTP, or NATS). `DistributedRuntime` follows a hierarchical structure:
-
- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It manages connections to discovery backends (K8s API or etcd) and optional messaging (NATS for KV events), and handles lifecycle with cancellation tokens.
- `Namespace`: A `Namespace` is a logical grouping of components that isolates different model deployments from each other.
- `Component`: A `Component` is a discoverable object within a `Namespace` that represents a logical unit of workers.
- `Endpoint`: An `Endpoint` is a network-accessible service that provides a specific service or function.
-
-While each `DistributedRuntime` can in theory have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice each Dynamo component is typically deployed in its own process and thus has its own `DistributedRuntime` object. However, the components share the same namespace so that they can discover each other.
-
-For example, a typical deployment configuration (like `examples/backends/vllm/deploy/agg.yaml` or `examples/backends/sglang/deploy/agg.yaml`) has multiple components:
-
- `Frontend`: Starts an HTTP server (OpenAI-compatible API on port 8000), handles incoming requests, applies chat templates, performs tokenization, and routes requests to workers. The `make_engine` function encapsulates this functionality.
- `Worker` components (e.g., `VllmDecodeWorker`, `VllmPrefillWorker`, `SGLangDecodeWorker`, `TRTLLMWorker`): Perform the actual inference computation using their respective engines (vLLM, SGLang, TensorRT-LLM).
-
-Since these components are deployed in different processes, each has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `sglang-disagg`). Under their namespace, each has its own `Component`:
-
- `Frontend` uses the `make_engine` function which handles HTTP serving, request preprocessing, and worker discovery automatically
- Worker components register with names like `backend`, `prefill`, `decode`, or `encoder` depending on their role
- Workers register endpoints like `generate`, `clear_kv_blocks`, or `load_metrics`
-
-Their `DistributedRuntime`s are initialized in their respective main functions, their `Namespace`s are configured in the deployment YAML, their `Component`s are created programmatically (e.g., `runtime.namespace("dynamo").component("backend")`), and their `Endpoint`s are created using the `component.endpoint()` method.
-
-## Initialization
-
-In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are multiple modes for `DistributedRuntime` initialization based on the deployment environment.
-
-```{caution}
-The hierarchy and naming may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts would remain the same.
-```
-
-### Service Discovery Backends
-
-The `DistributedRuntime` supports two service discovery backends, configured via `DYN_DISCOVERY_BACKEND`:
-
- **KV Store Discovery** (`DYN_DISCOVERY_BACKEND=kv_store`): Uses etcd for service discovery. **This is the global default** for all deployments unless explicitly overridden.
-
- **Kubernetes Discovery** (`DYN_DISCOVERY_BACKEND=kubernetes`): Uses native Kubernetes resources (DynamoWorkerMetadata CRD, EndpointSlices) for service discovery. **Must be explicitly set.** The Dynamo operator automatically sets this environment variable for Kubernetes deployments. **No etcd required.**
-
-> **Note:** There is no automatic detection of the deployment environment. The runtime always defaults to `kv_store`. For Kubernetes deployments, the operator injects `DYN_DISCOVERY_BACKEND=kubernetes` into pod environments.
-
-When using Kubernetes discovery, the KV store backend automatically switches to in-memory storage since etcd is not needed.
-
-### Runtime Initialization
-
- `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections based on the discovery backend:
-    - **Kubernetes mode**: Uses K8s API for service registration via DynamoWorkerMetadata CRD. No external dependencies required.
-    - **KV Store mode**: Connects to etcd for service discovery. Creates a primary lease with a background keep-alive task. All objects registered under this `DistributedRuntime` use this lease_id to maintain their lifecycle.
-    - **NATS** (optional): Used for KV event messaging when using KV-aware routing. Can be disabled via `--no-kv-events` flag, which enables prediction-based routing without event persistence.
-    - **Request Plane**: TCP by default. Can be configured to use HTTP or NATS via `DYN_REQUEST_PLANE` environment variable.
- `Namespace`: `Namespace`s are primarily a logical grouping mechanism. They provide the root path for all components under this `Namespace`.
- `Component`: When a `Component` object is created, it registers a service in the internal registry of the `DistributedRuntime`, which tracks all services and endpoints.
- `Endpoint`: When an Endpoint object is created and started, it performs registration based on the discovery backend:
-  - **Kubernetes mode**: Endpoint information is stored in DynamoWorkerMetadata CRD resources, which are watched by other components for discovery.
-  - **KV Store mode**: Endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that endpoints of different workers of the same type (e.g., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by their different primary `lease_id`.
-
-## Calling Endpoints
-
-Dynamo uses a `Client` object to call an endpoint. When a `Client` is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then watches for endpoint changes:
-
- **Kubernetes mode**: Watches DynamoWorkerMetadata CRD resources for endpoint updates.
- **KV Store mode**: Sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`.
-
-The watcher continuously updates the `Client` with information about available `Endpoint`s.
-
-The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_router.rs](../../lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
-
- `random`: randomly select an endpoint to hit
- `round_robin`: select endpoints in round-robin order
- `direct`: direct the request to a specific endpoint by specifying the instance ID
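
The three strategies above can be illustrated with a small sketch. This is not the `push_router.rs` implementation; `EndpointPicker`, its `pick` method, and the use of plain integers as instance IDs are hypothetical names chosen only for illustration.

```python
import itertools
import random


class EndpointPicker:
    """Toy selector over instance IDs (hypothetical; not Dynamo's router)."""

    def __init__(self, instance_ids, seed=None):
        self.instance_ids = list(instance_ids)
        # Endless iterator used by the round-robin strategy.
        self._cycle = itertools.cycle(self.instance_ids)
        self._rng = random.Random(seed)

    def pick(self, strategy, instance_id=None):
        if strategy == "random":
            # Any known instance, chosen uniformly.
            return self._rng.choice(self.instance_ids)
        if strategy == "round_robin":
            # Next instance in a fixed rotation.
            return next(self._cycle)
        if strategy == "direct":
            # The caller names the target instance explicitly.
            if instance_id not in self.instance_ids:
                raise KeyError(f"unknown instance: {instance_id}")
            return instance_id
        raise ValueError(f"unknown strategy: {strategy}")
```

In this sketch, `direct` fails fast when the requested instance is not in the current set, which mirrors the idea that the watcher keeps the `Client` view of available endpoints up to date.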
-
-After selecting which endpoint to hit, the `Client` sends the request using the configured request plane (TCP by default). The request plane handles the actual transport:
-
- **TCP** (default): Direct TCP connection with connection pooling
- **HTTP**: HTTP/2-based transport
- **NATS**: Message broker-based transport (legacy)
-
-## Examples
-
-We provide native Rust and Python (through bindings) examples for basic usage of `DistributedRuntime`:
-
- Rust: `/lib/runtime/examples/`
- Python: We also provide complete examples of using `DistributedRuntime`. Please refer to the engines in `components/src/dynamo` for full implementation details.
-
-
--- a/docs/design_docs/dynamo_flow.md
+++ b/docs/design_docs/dynamo_flow.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# Dynamo Architecture Flow
-
-This diagram shows the NVIDIA Dynamo disaggregated inference system. Color-coded flows indicate different types of operations.
-
-## 🔵 Main Request Flow (Blue)
-The primary user journey through the system:
-
-1. **Request (S1)**: HTTP client sends API request to Frontend (OpenAI-compatible server on port 8000)
-2. **Preprocess (S2)**: Frontend preprocesses the request (applies chat template, tokenizes) and validates it
-3. **Route to Prefill (S3)**: PrefillRouter selects a prefill worker using KV-aware routing or load balancing
-
-## 🟢 Prefill Flow (Green)
-The prefill processing pipeline:
-
-4. **Prefill (S4)**: Prefill worker executes the prefill computation on the input tokens and generates KV cache
-5. **Return Metadata (S5)**: Prefill worker returns `disaggregated_params` containing backend-specific transfer metadata
-
-## 🟠 Decode Routing Flow (Orange)
-Router orchestration to decode phase:
-
-6. **Route to Decode (S6)**: PrefillRouter injects prefill result into decode request and routes to decode worker
-7. **KV Transfer (S7)**: Decode worker coordinates with prefill worker for direct GPU-to-GPU KV cache transfer via NIXL
-
-## 🟣 Completion Flow (Purple)
-The response generation and delivery:
-
-8. **Decode (S8)**: Decode worker generates tokens using the transferred KV cache
-9. **Response (S9)**: Generated tokens stream back through Frontend for post-processing (detokenization) and delivery to Client
-
-## 🔗 Infrastructure Connections (Dotted lines)
-Coordination and messaging support:
-
-### Service Discovery
- **On Kubernetes** (default): Uses native K8s resources (DynamoWorkerMetadata CRD, EndpointSlices). No etcd required.
- **On bare metal**: Uses etcd or filesystem for service discovery and endpoint registration.
-
-### Request Plane
- **TCP** (default): Direct TCP connections between Frontend and Workers for request/response transport.
- **HTTP/NATS**: Alternative transports configurable via `DYN_REQUEST_PLANE`.
-
-### NATS Connections (Optional, for KV routing)
- **KV Events**: Cache state events for KV-aware routing (can be disabled with `--no-kv-events`)
-
-### Planning Connections (Gold, dotted)
- **Frontend → Planner**: Metrics collection for auto-scaling decisions
- **Planner → Workers**: Resource scaling commands for workers
-
-## Technical Implementation Details
-
-### PrefillRouter Orchestration:
- The `PrefillRouter` sits between the Frontend and workers, orchestrating disaggregated serving
- Selects prefill workers using KV-aware routing (cache overlap scores + load) or simple load balancing
- Injects transfer metadata into decode requests for KV cache coordination
-
-### NIXL (NVIDIA Inference Transfer Library):
- Enables high-speed GPU-to-GPU data transfers using NVLink, InfiniBand/UCX, or PCIe
- Transfer metadata exchanged via `disaggregated_params` in prefill response
- Backend-specific coordination: SGLang uses bootstrap connections, vLLM uses block IDs, TRTLLM uses opaque state
-
-### Disaggregated KV Cache:
- Each worker maintains local KV cache in its GPU memory
- No shared storage bottlenecks—transfers are direct worker-to-worker via NIXL
- Non-blocking transfers allow GPU forward passes to continue during KV transfer
-
-```mermaid
-%%{init: {'theme':'dark', 'themeVariables': {'primaryColor': '#f4f4f4', 'primaryTextColor': '#333333', 'primaryBorderColor': '#888888', 'lineColor': '#4A90E2', 'sectionBkgColor': '#f9f9f9', 'altSectionBkgColor': '#eeeeee', 'tertiaryColor': '#f0f0f0', 'background': '#ffffff', 'mainBkg': '#f8f8f8', 'secondaryColor': '#f4f4f4', 'nodeTextColor': '#333333'}, 'flowchart': {'htmlLabels': true, 'curve': 'basis'}, 'fontFamily': 'Inter, system-ui, -apple-system, "Segoe UI", Roboto, sans-serif', 'fontSize': '18px'}%%
-graph TD
-    %% Top Layer - Client & Frontend
-    Client["<b>HTTP Client</b>"]
-    Frontend["<b>Frontend</b><br/><i>OpenAI Compatible Server<br/>Port 8000</i>"]
-    S1[["<b>1 REQUEST</b>"]]
-    S2[["<b>2 PREPROCESS</b>"]]
-
-    %% Router Layer
-    PrefillRouter["<b>PrefillRouter</b><br/><i>Orchestrates Disaggregated Serving</i>"]
-    S3[["<b>3 ROUTE TO PREFILL</b>"]]
-
-    %% Infrastructure
-    subgraph INF["<b>Infrastructure Layer</b>"]
-        Discovery[("<b>Discovery</b><br/><i>Service Registry<br/>(ETCD or K8s)</i>")]
-        NATS[("<b>NATS</b><br/><i>KV Events<br/>(Optional)</i>")]
-        Planner["<b>Planner</b><br/><i>Auto-scaling</i>"]
-    end
-
-    %% Worker Layer
-    subgraph WL["<b>Worker Layer</b>"]
-        %% Prefill Worker
-        PrefillWorker["<b>Prefill Worker</b><br/><i>Computes KV Cache</i>"]
-        S4[["<b>4 PREFILL</b>"]]
-        S5[["<b>5 RETURN METADATA</b>"]]
-
-        %% Decode Worker
-        DecodeWorker["<b>Decode Worker</b><br/><i>Token Generation</i>"]
-        S6[["<b>6 ROUTE TO DECODE</b>"]]
-        S7[["<b>7 KV TRANSFER</b>"]]
-        S8[["<b>8 DECODE</b>"]]
-        S9[["<b>9 RESPONSE</b>"]]
-
-        %% KV Cache
-        PrefillKVCache[("<b>Prefill KV Cache</b><br/><i>GPU VRAM</i>")]
-        DecodeKVCache[("<b>Decode KV Cache</b><br/><i>GPU VRAM</i>")]
-    end
-
-    %% Main Request Flow (Blue)
-    Client --> S1
-    S1 -->|HTTP API Call| Frontend
-    Frontend --> S2
-    S2 -->|Tokenize & Validate| PrefillRouter
-    PrefillRouter --> S3
-    S3 -->|Select Prefill Worker| PrefillWorker
-
-    %% Prefill Flow (Green)
-    PrefillWorker --> S4
-    S4 -->|Compute KV Cache| PrefillKVCache
-    PrefillWorker --> S5
-    S5 -->|disaggregated_params| PrefillRouter
-
-    %% Decode Routing Flow (Orange)
-    PrefillRouter --> S6
-    S6 -->|Inject Transfer Metadata| DecodeWorker
-    DecodeWorker --> S7
-    S7 -->|NIXL GPU-to-GPU| PrefillKVCache
-    PrefillKVCache -.->|Direct Transfer| DecodeKVCache
-
-    %% Completion Flow (Purple)
-    DecodeWorker --> S8
-    S8 -->|Generate Tokens| DecodeKVCache
-    DecodeWorker --> S9
-    S9 -->|Stream Tokens| Frontend
-    Frontend -->|HTTP Response| Client
-
-    %% Infrastructure Connections
-    Frontend -.->|Service Discovery| Discovery
-    PrefillRouter -.->|Worker Discovery| Discovery
-    PrefillWorker -.->|Register| Discovery
-    DecodeWorker -.->|Register| Discovery
-    Planner -.->|Service Discovery| Discovery
-
-    %% NATS for KV events (optional)
-    PrefillWorker -.->|KV Events| NATS
-    DecodeWorker -.->|KV Events| NATS
-
-    %% Planning Connections
-    Frontend -.->|Metrics| Planner
-    Planner -.->|Auto-scaling| PrefillWorker
-    Planner -.->|Auto-scaling| DecodeWorker
-
-    %% Styling
-    classDef client fill:#e8f5e8,stroke:#2E7D32,stroke-width:3px
-    classDef frontend fill:#fff3e0,stroke:#F57C00,stroke-width:3px
-    classDef router fill:#f3e5f5,stroke:#7B1FA2,stroke-width:3px
-    classDef worker fill:#e3f2fd,stroke:#1565C0,stroke-width:3px
-    classDef prefillWorker fill:#e8f5e9,stroke:#388E3C,stroke-width:3px
-    classDef planner fill:#f1f8e9,stroke:#558B2F,stroke-width:3px
-    classDef storage fill:#e0f2f1,stroke:#00695C,stroke-width:3px
-    classDef discovery fill:#fff9c4,stroke:#F9A825,stroke-width:3px
-    classDef nats fill:#ede7f6,stroke:#5E35B1,stroke-width:3px
-    classDef infraLayer fill:#fff9c4,stroke:#FFC107,stroke-width:3px
-    classDef workerLayer fill:#e3f2fd,stroke:#2196F3,stroke-width:3px
-
-    class Client client
-    class Frontend frontend
-    class PrefillRouter router
-    class DecodeWorker worker
-    class PrefillWorker prefillWorker
-    class Planner planner
-    class PrefillKVCache,DecodeKVCache storage
-    class Discovery discovery
-    class NATS nats
-    class INF infraLayer
-    class WL workerLayer
-
-    %% Flow Colors
-    %% Main Request Flow - Blue
-    linkStyle 0,1,2,3,4,5 stroke:#1565C0,stroke-width:4px
-
-    %% Prefill Flow - Green
-    linkStyle 6,7,8,9 stroke:#2E7D32,stroke-width:4px
-
-    %% Decode Routing Flow - Orange
-    linkStyle 10,11,12,13,14 stroke:#E65100,stroke-width:4px
-
-    %% Completion Flow - Purple
-    linkStyle 15,16,17,18,19 stroke:#6A1B9A,stroke-width:4px
-
-    %% Infrastructure - Gray dotted
-    linkStyle 20,21,22,23,24,25,26,27,28,29 stroke:#757575,stroke-width:2px,stroke-dasharray: 8 8
-```
--- a/docs/design_docs/event_plane.md
+++ b/docs/design_docs/event_plane.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# Event Plane Architecture
-
-This document describes Dynamo's event plane architecture, which handles service discovery, coordination, and event distribution using etcd and NATS.
-
-## Overview
-
-Dynamo's coordination layer adapts to the deployment environment:
-
-| Deployment | Service Discovery | KV Events | Request Plane |
-|------------|-------------------|-----------|---------------|
-| **Kubernetes** (with operator) | Native K8s (CRDs, EndpointSlices) | NATS (optional) | TCP |
-| **Bare metal / Local** (default) | etcd | NATS (optional) | TCP |
-
-> **Note:** The runtime always defaults to `kv_store` (etcd) for service discovery. Kubernetes deployments must explicitly set `DYN_DISCOVERY_BACKEND=kubernetes` - the Dynamo operator handles this automatically.
-
-```
-┌─────────────────────────────────────────────────────────────────────┐
-│                    Coordination Layer                                │
-│                                                                      │
-│  ┌─────────────────────────┐    ┌─────────────────────────────────┐ │
-│  │   Service Discovery     │    │            NATS                 │ │
-│  │                         │    │         (Optional)              │ │
-│  │  • K8s: CRDs + API      │    │  • KV Cache Events              │ │
-│  │  • Bare metal: etcd     │    │  • Router Replica Sync          │ │
-│  │                         │    │  • JetStream Persistence        │ │
-│  └─────────────────────────┘    └─────────────────────────────────┘ │
-│                                                                      │
-└─────────────────────────────────────────────────────────────────────┘
-                    │                          │
-         ┌──────────┴──────────┐    ┌─────────┴──────────┐
-         ▼                     ▼    ▼                    ▼
-    ┌─────────┐          ┌─────────┐              ┌─────────┐
-    │Frontend │          │ Planner │              │ Worker  │
-    └─────────┘          └─────────┘              └─────────┘
-```
-
-## Kubernetes-Native Service Discovery
-
-When running on Kubernetes with the Dynamo operator, service discovery uses native Kubernetes resources instead of etcd.
-
-### Configuration
-
-The operator explicitly sets:
-```bash
-DYN_DISCOVERY_BACKEND=kubernetes
-```
-
-> **Important:** This must be explicitly configured. The runtime defaults to `kv_store` in all environments.
-
-### How It Works
-
-1. **DynamoWorkerMetadata CRD**: Workers register their endpoints by creating/updating DynamoWorkerMetadata custom resources
-2. **EndpointSlices**: Used to signal readiness status to the system
-3. **K8s API Watches**: Components watch for CRD changes to discover available endpoints
-
-### Benefits
-
- No external etcd cluster required
- Native integration with Kubernetes lifecycle
- Automatic cleanup when pods terminate
- Works with standard K8s RBAC
-
-### Environment Variables (Injected by Operator)
-
-| Variable | Description |
-|----------|-------------|
-| `DYN_DISCOVERY_BACKEND` | Set to `kubernetes` |
-| `POD_NAME` | Current pod name |
-| `POD_NAMESPACE` | Current namespace |
-| `POD_UID` | Pod unique identifier |
-
---
-
-## etcd Architecture (Default for All Deployments)
-
-When `DYN_DISCOVERY_BACKEND=kv_store` (the global default), etcd is used for service discovery.
-
-### Connection Configuration
-
-etcd connection is configured via environment variables:
-
-| Variable | Description | Default |
-|----------|-------------|---------|
-| `ETCD_ENDPOINTS` | Comma-separated etcd URLs | `http://localhost:2379` |
-| `ETCD_AUTH_USERNAME` | Basic auth username | None |
-| `ETCD_AUTH_PASSWORD` | Basic auth password | None |
-| `ETCD_AUTH_CA` | CA certificate path (TLS) | None |
-| `ETCD_AUTH_CLIENT_CERT` | Client certificate path | None |
-| `ETCD_AUTH_CLIENT_KEY` | Client key path | None |
-
-Example:
-```bash
-export ETCD_ENDPOINTS=http://etcd-0:2379,http://etcd-1:2379,http://etcd-2:2379
-```
-
-### Lease Management
-
-Each `DistributedRuntime` maintains a primary lease with etcd:
-
-```
-┌────────────────────┐         ┌──────────────┐
-│ DistributedRuntime │◄────────│ Primary Lease │
-│                    │         │  TTL: 10s     │
-│  • Namespace       │         └───────┬───────┘
-│  • Components      │                 │
-│  • Endpoints       │                 │ Keep-Alive
-│                    │                 │ Heartbeat
-└────────────────────┘                 ▼
-                               ┌──────────────┐
-                               │     etcd     │
-                               └──────────────┘
-```
-
-**Lease Lifecycle:**
-
-1. **Creation**: Lease created during `DistributedRuntime` initialization
-2. **Keep-Alive**: Background task sends heartbeats at 50% of remaining TTL
-3. **Expiration**: If heartbeats stop, lease expires after TTL (10 seconds default)
-4. **Cleanup**: All keys associated with the lease are automatically deleted
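
The lifecycle above can be modeled with a toy in-memory table. This is only a sketch of the semantics (keys live as long as their lease is refreshed); `LeaseTable` and its methods are hypothetical and do not correspond to the etcd client API or to Dynamo's runtime code. Time is passed in explicitly so the behavior is easy to follow.

```python
class LeaseTable:
    """Toy model of lease-scoped keys (hypothetical; not the etcd API)."""

    def __init__(self, ttl_s=10.0):
        self.ttl_s = ttl_s
        self._deadline = {}  # lease_id -> time at which the lease expires
        self._keys = {}      # lease_id -> keys attached to that lease

    def grant(self, lease_id, now):
        # Creation: the lease starts with a full TTL.
        self._deadline[lease_id] = now + self.ttl_s
        self._keys[lease_id] = set()

    def put(self, lease_id, key):
        # Attach a key to a lease so it is removed when the lease expires.
        self._keys[lease_id].add(key)

    def keep_alive(self, lease_id, now):
        # Keep-alive: a heartbeat resets the expiry to a full TTL from now.
        if lease_id in self._deadline:
            self._deadline[lease_id] = now + self.ttl_s

    def expire(self, now):
        # Expiration and cleanup: drop expired leases and return their keys.
        deleted = []
        expired = [l for l, d in self._deadline.items() if d <= now]
        for lease_id in expired:
            deleted.extend(sorted(self._keys.pop(lease_id)))
            del self._deadline[lease_id]
        return deleted
```

The list returned by `expire` plays the role of the `DELETE` events that watchers would observe once a worker stops sending heartbeats.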
-
-**Automatic Recovery:**
-
- Reconnection with exponential backoff (50ms to 5s)
- Deadline-based retry logic
- Cancellation token propagation
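
A minimal sketch of reconnection with exponential backoff and a deadline follows. Only the 50ms to 5s range comes from the text above; the doubling factor, the function names `backoff_delays` and `retry_with_deadline`, and the use of `ConnectionError` are assumptions made for illustration, not the runtime's actual implementation.

```python
import time


def backoff_delays(initial_s=0.05, max_s=5.0, factor=2.0):
    """Yield delays that grow from initial_s and are capped at max_s.

    The doubling factor is an assumption; the text only gives the range.
    """
    delay = initial_s
    while True:
        yield min(delay, max_s)
        delay = min(delay * factor, max_s)


def retry_with_deadline(connect, deadline_s, sleep=time.sleep,
                        clock=time.monotonic):
    """Call connect() until it succeeds or the deadline would be exceeded."""
    start = clock()
    for delay in backoff_delays():
        try:
            return connect()
        except ConnectionError:
            # Give up if waiting again would run past the deadline.
            if clock() - start + delay > deadline_s:
                raise
            sleep(delay)
```

Passing `sleep` and `clock` as parameters keeps the retry loop easy to exercise without real waiting.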
-
-### Service Discovery
-
-Endpoints are registered in etcd for dynamic discovery:
-
-**Key Format:**
-```
-/services/{namespace}/{component}/{endpoint}/{instance_id}
-```
-
-**Example:**
-```
-/services/vllm-agg/backend/generate/694d98147d54be25
-```
-
-**Registration Data:**
-```json
-{
-  "namespace": "vllm-agg",
-  "component": "backend",
-  "endpoint": "generate",
-  "instance_id": 7587888160958628000,
-  "transport": {
-    "tcp": "192.168.1.10:9999"
-  }
-}
-```
-
-### Discovery Queries
-
-The discovery system supports multiple query patterns:
-
-| Query Type | Pattern | Use Case |
-|------------|---------|----------|
-| `AllEndpoints` | `/services/` | List all services |
-| `NamespacedEndpoints` | `/services/{namespace}/` | Filter by namespace |
-| `ComponentEndpoints` | `/services/{namespace}/{component}/` | Filter by component |
-| `Endpoint` | `/services/{namespace}/{component}/{endpoint}/` | Specific endpoint |
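
The query patterns in the table are nested prefixes of the key format, which a short sketch makes concrete. The helper name `discovery_prefix` is hypothetical and only mirrors the patterns listed above.

```python
def discovery_prefix(namespace=None, component=None, endpoint=None):
    """Build the etcd prefix for a discovery query (illustrative helper)."""
    parts = []
    seen_gap = False
    for level in (namespace, component, endpoint):
        if level is None:
            seen_gap = True
        elif seen_gap:
            # A deeper level only makes sense if every level above it is set.
            raise ValueError("cannot set a level below an unset one")
        else:
            parts.append(level)
    # Each level contributes "<name>/" so the result is always a prefix.
    return "/services/" + "".join(part + "/" for part in parts)
```

For example, `discovery_prefix("vllm-agg", "backend", "generate")` gives `/services/vllm-agg/backend/generate/`, which is a prefix of the example key shown earlier.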
-
-### Watch Functionality
-
-Clients watch etcd prefixes for real-time updates:
-
-```python
-# Client watches for endpoint changes
-watcher = etcd.watch_prefix("/services/vllm-agg/backend/generate/")
-
-for event in watcher:
-    if event.type == "PUT":
-        # New endpoint registered
-        add_endpoint(event.value)
-    elif event.type == "DELETE":
-        # Endpoint removed (worker died)
-        remove_endpoint(event.key)
-```
-
-**Watch Features:**
-
- Initial state retrieval with `get_and_watch_prefix()`
- Automatic reconnection on stream failure
- Revision tracking for no-event-loss guarantees
- Event types: `PUT` (create/update) and `DELETE`
-
-### Distributed Locks
-
-etcd provides distributed locking for coordination:
-
-**Lock Types:**
-
-| Type | Key Pattern | Behavior |
-|------|-------------|----------|
-| Write Lock | `v1/{prefix}/writer` | Exclusive (no readers/writers) |
-| Read Lock | `v1/{prefix}/readers/{id}` | Shared (multiple readers) |
-
-**Operations:**
-
-```rust
-// Non-blocking write lock
-let lock = client.try_write_lock("my_resource").await?;
-
-// Blocking read lock with polling (100ms intervals)
-let lock = client.read_lock_with_wait("my_resource").await?;
-```
-
-## NATS Architecture
-
-### When NATS is Used
-
-NATS is used for:
-
-1. **KV Cache Events**: Real-time KV cache state updates for routing
-2. **Router Replica Sync**: Synchronizing router state across replicas
-3. **Legacy Request Plane**: NATS-based request transport (optional)
-
-### Configuration
-
-| Variable | Description | Default |
-|----------|-------------|---------|
-| `NATS_SERVER` | NATS server URL | `nats://localhost:4222` |
-
-### Disabling NATS
-
-For deployments without KV-aware routing:
-
-```bash
-# Disable NATS and KV events
-python -m dynamo.frontend --no-kv-events
-```
-
-This enables "approximate mode" for KV routing without event persistence.
-
-### Event Publishing
-
-Components publish events to NATS subjects:
-
-```rust
-pub trait EventPublisher {
-    async fn publish(&self, event: &str, data: &[u8]) -> Result<()>;
-    async fn publish_serialized<T: Serialize>(&self, event: &str, data: &T) -> Result<()>;
-}
-```
-
-**Subject Naming:**
-```
-{base_subject}.{event_name}
-```
-
-Example:
-```
-vllm-agg.backend.kv_cache_update
-```
-
-### Event Subscription
-
-Components subscribe to events:
-
-```rust
-pub trait EventSubscriber {
-    async fn subscribe(&self, topic: &str) -> Result<Subscriber>;
-    async fn subscribe_typed<T: DeserializeOwned>(&self, topic: &str) -> Result<TypedSubscriber<T>>;
-}
-```
-
-### JetStream Persistence
-
-For durable event delivery, NATS JetStream provides:
-
- Message persistence
- Replay from offset
- Consumer groups for load balancing
- Acknowledgment tracking
-
-## Key-Value Store Abstraction
-
-Dynamo provides a unified KV store interface supporting multiple backends:
-
-### Supported Backends
-
-| Backend | Use Case | Configuration |
-|---------|----------|---------------|
-| `EtcdStore` | Production deployments | `ETCD_ENDPOINTS` |
-| `MemoryStore` | Testing, development | Default |
-| `NatsStore` | NATS-only deployments | `NATS_SERVER` |
-| `FileStore` | Local persistence | File path |
-
-### Store Interface
-
-```rust
-pub trait KvStore {
-    async fn get(&self, bucket: &str, key: &str) -> Result<Option<Vec<u8>>>;
-    async fn put(&self, bucket: &str, key: &str, value: &[u8]) -> Result<()>;
-    async fn delete(&self, bucket: &str, key: &str) -> Result<()>;
-    async fn watch(&self, bucket: &str) -> Result<WatchStream>;
-}
-```
-
-### Buckets
-
-Data is organized into logical buckets:
-
-| Bucket | Purpose |
-|--------|---------|
-| `v1/instances` | Endpoint instance registry |
-| `v1/mdc` | Model deployment cards |
-
-## Typed Prefix Watcher
-
-For type-safe watching of etcd prefixes:
-
-```rust
-// Watch and maintain HashMap of deserialized values
-let watcher = watch_prefix_with_extraction::<DiscoveryInstance>(
-    &etcd_client,
-    "/services/vllm-agg/",
-    lease_id_extractor,
-    value_extractor,
-).await?;
-
-// Receive updates via watch channel
-let instances = watcher.borrow();
-```
-
-**Key Extractors:**
-
-| Extractor | Description |
-|-----------|-------------|
-| `lease_id()` | Use lease ID as key |
-| `key_string()` | Extract key with prefix stripping |
-| `full_key_string()` | Use full etcd key |
-
-## Reliability Features
-
-### Connection Resilience
-
-**etcd Reconnection:**
-- Exponential backoff: 50ms to 5s
-- Deadline-based retry logic
-- Mutex ensures single concurrent reconnect
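
The etcd reconnection policy above can be sketched as follows. This is a minimal illustration in Python, not the runtime's implementation: the doubling factor and the names `backoff_delays` and `retry_until_deadline` are assumptions (the source states only the 50ms to 5s range and deadline-based retries), and the mutex that serializes reconnects is not modeled.

```python
import itertools
import time


def backoff_delays(initial_s=0.05, max_s=5.0, factor=2.0):
    """Yield delays that grow geometrically from initial_s and cap at max_s.

    The factor of 2.0 is an assumption; only the 50ms..5s range is documented.
    """
    delay = initial_s
    while True:
        yield delay
        delay = min(delay * factor, max_s)


def retry_until_deadline(connect, deadline_s, clock=time.monotonic, sleep=time.sleep):
    """Call connect() until it succeeds or deadline_s seconds have elapsed."""
    start = clock()
    for delay in backoff_delays():
        try:
            return connect()
        except ConnectionError:
            remaining = deadline_s - (clock() - start)
            if remaining <= 0:
                raise  # deadline exhausted: surface the last error
            # Never sleep past the deadline.
            sleep(min(delay, remaining))


# First nine delays: growth from 50ms, then capped at 5s.
first_delays = list(itertools.islice(backoff_delays(), 9))
print(first_delays)
```

Passing `clock` and `sleep` as parameters keeps the retry loop testable without real waiting.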
-
-**NATS Reconnection:**
-- Built-in reconnection in NATS client
-- Configurable max reconnect attempts
-- Buffering during disconnection
-
-### Lease-Based Cleanup
-
-When a worker crashes or loses connectivity:
-
-1. Keep-alive heartbeats stop
-2. Lease expires after TTL (10 seconds)
-3. All registered endpoints automatically deleted
-4. Clients receive DELETE watch events
-5. Traffic reroutes to healthy workers
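
The cleanup sequence above can be simulated with a small in-memory model. This is a sketch under stated assumptions: `LeaseRegistry` and its methods are hypothetical names used for illustration and are not part of etcd or Dynamo; only the 10-second TTL and the DELETE-on-expiry behavior come from the text.

```python
class LeaseRegistry:
    """Toy model of lease-bound keys; not the etcd API."""

    def __init__(self, ttl_s=10.0):
        self.ttl_s = ttl_s
        self._last_keepalive = {}  # lease_id -> time of last heartbeat
        self._endpoints = {}       # endpoint key -> owning lease_id

    def keep_alive(self, lease_id, now):
        self._last_keepalive[lease_id] = now

    def register(self, key, lease_id):
        self._endpoints[key] = lease_id

    def expire(self, now):
        """Drop keys whose lease missed its TTL and return the DELETE events."""
        dead = {
            lease for lease, seen in self._last_keepalive.items()
            if now - seen > self.ttl_s
        }
        deleted = sorted(k for k, lease in self._endpoints.items() if lease in dead)
        for key in deleted:
            del self._endpoints[key]
        for lease in dead:
            del self._last_keepalive[lease]
        return [("DELETE", key) for key in deleted]

    def live_endpoints(self):
        return sorted(self._endpoints)


registry = LeaseRegistry(ttl_s=10.0)
registry.keep_alive("lease-a", now=0.0)
registry.keep_alive("lease-b", now=0.0)
registry.register("v1/instances/worker-a", "lease-a")
registry.register("v1/instances/worker-b", "lease-b")

# worker-b keeps sending heartbeats; worker-a has crashed.
registry.keep_alive("lease-b", now=8.0)
events = registry.expire(now=11.0)
print(events, registry.live_endpoints())
```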
-
-### Transaction Safety
-
-etcd transactions ensure atomic operations:
-
-```rust
-// Atomic create-if-not-exists
-let txn = Txn::new()
-    .when([Compare::create_revision(key, CompareOp::Equal, 0)])
-    .and_then([Op::put(key, value, options)]);
-
-etcd_client.txn(txn).await?;
-```
-
-This prevents race conditions in concurrent service registration.
-
-## Operational Modes
-
-### Kubernetes Mode (Requires Explicit Configuration)
-
-Native Kubernetes service discovery:
-
-```bash
-# Operator explicitly sets this (not auto-detected):
-export DYN_DISCOVERY_BACKEND=kubernetes
-
-# Workers register via K8s CRDs
-python -m dynamo.vllm --model Qwen/Qwen3-0.6B
-
-# Frontend discovers workers via K8s API
-python -m dynamo.frontend
-```
-
-No etcd or NATS required for basic operation when using K8s discovery.
-
-### KV Store Mode (Global Default)
-
-Full service discovery with etcd:
-
-```bash
-# This is the default - no configuration needed
-# export DYN_DISCOVERY_BACKEND=kv_store  # (implicit)
-
-# Workers register with etcd
-python -m dynamo.vllm --model Qwen/Qwen3-0.6B
-
-# Frontend discovers workers via etcd
-python -m dynamo.frontend
-```
-
-### KV-Aware Routing (Optional)
-
-Enable NATS for KV cache event tracking:
-
-```bash
-# Default: KV events enabled (requires NATS)
-python -m dynamo.frontend --router-mode kv
-
-# Disable KV events for prediction-based routing (no NATS)
-python -m dynamo.frontend --router-mode kv --no-kv-events
-```
-
-With `--no-kv-events`:
-- Router predicts cache state based on routing decisions
-- TTL-based expiration and LRU pruning
-- No NATS infrastructure required
-
-## Best Practices
-
-### 1. Use Kubernetes Discovery on K8s
-
-The Dynamo operator automatically sets `DYN_DISCOVERY_BACKEND=kubernetes` for pods. No additional setup required when using the operator.
-
-### 2. For Bare Metal: Deploy etcd Cluster
-
-For bare-metal production deployments, deploy a 3-node etcd cluster for high availability.
-
-### 3. Configure Appropriate TTLs (etcd mode)
-
-Balance between detection speed and overhead:
-
-- **Short TTL (5s)**: Faster failure detection, more keep-alive traffic
-- **Long TTL (30s)**: Less overhead, slower detection
-
-### 4. KV Routing Without NATS
-
-For simpler deployments without NATS:
-
-```bash
-# Use prediction-based KV routing
-python -m dynamo.frontend --router-mode kv --no-kv-events
-```
-
-This provides KV-aware routing with reduced accuracy but no NATS dependency.
-
-## Related Documentation
-
-- [Distributed Runtime](distributed_runtime.md) - Runtime architecture
-- [Request Plane](request_plane.md) - Request transport configuration
-- [Fault Tolerance](../fault_tolerance/README.md) - Failure handling
--- a/docs/design_docs/kvbm_design.md
+++ b/docs/design_docs/kvbm_design.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# KVBM Design
-
-This document provides an in-depth look at the architecture, components, framework integrations via the connector API, and the detailed workings of the Dynamo KV Block Manager (KVBM). The design of KVBM takes inspiration from the KV block managers used in vLLM and SGLang, with added influence from historical memory tiering strategies common in general GPU programming. For more details, see [Further Reading](#further-reading).
-
-## KVBM Components
-
-![Internal Components of Dynamo KVBM](../images/kvbm-components.png)
-
-
-*Internal Components of Dynamo KVBM*
-
-### Core
-
-- **KvBlockManager**: Public facade. Constructs/owns the internal state and exposes the pools and onboarding APIs.
-- **Scheduler**: Gates transfer execution relative to model progress (iteration/layer completion) when integrated with a framework connector (e.g., vLLM V1).
-- **Config (config.rs)**: Describes model dims, page size, layout choices, and runtime flags used to build pools and layouts.
-- **KvBlockManagerState**: Central object wiring together layouts, storage backends, and pools; owns the OffloadManager, metrics, and events hooks.
-- **Events/Metrics**: Observability components emitting counters/gauges and event hooks for integration/testing.
-
-### Layouts and Blocks
-
-- **LayoutConfig & LayoutType**: Translate tensor shapes into KV cache layouts (layer-separated or fully-contiguous), including block counts and geometry.
-- **Blocks & Metadata**: Typed block handles (mutable/immutable), metadata (e.g., priority), and views by layer/outer dims; used to allocate, register, and match by `sequence_hash`.
-
-### Transfer Manager
-
-- **TransferManager**: Asynchronous transfer orchestrator with per-path queues (Device→Host, Host→Disk, Host→Device, Disk→Device).
-
-### Storage & Pools
-
-- **Device Pool (G1)**: GPU-resident KV block pool. Allocates mutable GPU blocks, registers completed blocks (immutable), serves lookups by sequence hash, and is the target for onboarding (Host→Device, Disk→Device).
-- **Host Pool (G2)**: CPU pinned-memory KV block pool. Receives Device offloads (Device→Host), can onboard to Device (Host→Device), and offloads to Disk. Uses pinned (page-locked) memory for efficient CUDA transfers and NIXL I/O.
-- **Disk Pool (G3)**: Local SSD NVMe-backed KV block pool. Receives Host offloads (Host→Disk) and provides blocks for onboarding to Device (Disk→Device). NIXL descriptors expose file offsets/regions for zero-copy I/O and optional GDS.
-- **Remote Storage (G4)**: Remote or cloud-backed KV block storage. KVBM treats G4 as an opaque blob store accessed through NIXL, unaware of internal layout optimizations.
-
-## KVBM Data Flows
-
-![KVBM Data Flows](../images/kvbm-data-flows.png)
-
-
-*KVBM Data Flows from device to other memory hierarchies*
-
-### Device → Host (Offload)
-
-- Triggered when explicitly requested to offload by the connector scheduler
-- Worker allocates a Host block and performs CUDA D2H/Custom Kernel copy
-- Host pool registers the new immutable block (dedup by sequence hash)
-
-### Host → Disk (Offload)
-
-- **Local Disk (G3)**: NIXL Write via POSIX; GDS when available
-- **Remote Disk (G4)** (Network FS like NFS/Lustre/GPFS): NIXL Write via POSIX to the mounted FS; batching and concurrency are identical to the local-disk path
-- Triggered on registered host blocks or explicit offload requests
-- Worker allocates a Disk block and performs NIXL Write (Host→Disk)
-- Disk pool registers the new immutable block (dedup by sequence hash)
-
-### Host → Device (Onboard)
-
-- Called to bring a host block into GPU memory
-- Worker uses provided Device targets and performs CUDA H2D/Custom Kernel copy
-- Device pool registers the new immutable block
-
-### Disk → Device (Onboard)
-
-- Called to bring a disk block directly into GPU memory
-- Worker uses provided Device targets and performs NIXL Read (Disk→Device), possibly via GDS
-- Device pool registers the new immutable block
-
-## Internal Architecture Deep Dive
-
-![Internal architecture and key modules in the Dynamo KVBM](../images/kvbm-internal-arch.png)
-
-
-*Internal architecture and key modules in the Dynamo KVBM*
-
-### KvBlockManager as Orchestration Layer
-
-The `KvBlockManager<H, D>` acts as a coordinator across memory tiers—host (CPU), device (GPU), and remote—by managing per-backend block pools and exposing consistent block lifecycle APIs. It tracks KV block locations across device memory (G1), CPU memory within and across nodes (G2), local/pooled SSDs (G3), and remote storage (G4). G1-G4 are key tiers enabled by KVBM. Note that KVBM treats G4 storage as an opaque blob store, unaware of internal layout optimizations.
-
-`KvBlockManager<H, D>` owns:
-
-- A device-side `BlockPool<Device>`
-- A host-side `BlockPool<Host>`
-- A remote NIXL agent that supports communication and memory sharing across nodes
-- A block set registry for remote lookup and import/export of block metadata
-
-Implementation-wise, `KvBlockManagerState` holds the logic: it's initialized by `KvBlockManagerConfig`, which merges runtime, model, and layout configurations. `NixlOptions` injects remote awareness.
-
-### Block Layout and Memory Mapping
-
-Each block is a 2D array `[num_layers][page_size × inner_dim]`. The `BlockLayout` trait abstracts the memory layout. The default implementation, `FullyContiguous`, stores all layers for all blocks in one region with alignment-aware stride computation:
-
-```text
-block_stride_in_bytes = align_up(num_layers × layer_stride, alignment);
-```
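
As a worked example of the stride formula above, the following Python sketch computes an aligned block stride. It assumes `layer_stride = page_size × inner_dim × dtype size in bytes`, which the text does not state explicitly; the function names and the sample dimensions are illustrative only.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (value + alignment - 1) // alignment * alignment


def block_stride_in_bytes(num_layers, page_size, inner_dim, dtype_bytes, alignment):
    # Assumption: one layer of one block occupies page_size * inner_dim elements.
    layer_stride = page_size * inner_dim * dtype_bytes
    return align_up(num_layers * layer_stride, alignment)


# 3 layers * (16 * 100 * 2 bytes) = 9600 bytes, rounded up to a 256-byte boundary.
stride = block_stride_in_bytes(
    num_layers=3, page_size=16, inner_dim=100, dtype_bytes=2, alignment=256
)
print(stride)
```

Rounding each block up to the alignment boundary leaves padding between blocks (128 bytes in this example) in exchange for aligned block base addresses.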
-
-Both CPU and GPU pools share this memory layout, but they use storage-specific backends:
-
-- `DeviceStorage` → CUDA device buffer
-- `PinnedStorage` → page-locked host memory
-- `SystemStorage` → CPU heap memory (fallback/test)
-- `NixlStorage` → remote memory through NIXL RDMA handles (includes storage)
-
-Each layout is constructed using a `LayoutConfig`, and storage is either passed directly or allocated using a `StorageAllocator`.
-
-### BlockPool and Memory Pools (Active and Inactive)
-
-Each `BlockPool<T>` (where `T` is `DeviceStorage`, `PinnedStorage`, etc.) tracks two sub-pools:
-
-- **ActivePool**: Contains blocks currently in use by sequences
-- **InactivePool**: Recycled blocks ready for allocation (free list)
-
-When a token block is requested (e.g., `get_mutable_block()`), the allocator pops from `InactivePool`, transitions its state, and returns a writable handle. On sequence commit or eviction, the system resets blocks and returns them to the inactive pool.
-
-### Block State Machine
-
-The state machine (`BlockState`) tracks block lifecycle transitions:
-
-| State | Description | Ownership | Valid Actions/Transitions |
-|-------|-------------|-----------|---------------------------|
-| Reset | Block hasn't been initialized or was reset. No associated sequence. | Held in InactivePool, reusable | `init_sequence(salt_hash)` → Partial |
-| Partial | Block is being filled with tokens for a new sequence. In-progress. | Owned by the sequence creator | `add_token()` / `add_tokens()` (accumulate), `commit()` → Complete, `reset()` → Reset |
-| Complete | Block is fully filled with token data but not yet visible to others. | Still owned by creator thread | `register()` → Registered, `reset()` → Reset |
-| Registered | Block is finalized and visible for reuse. Available in the deduplication cache. | Shared ownership (global registry) | Auto `drop()` → triggers Remove event and transitions to Reset |
-
-#### Valid State Transitions
-
-| From → To | Trigger | Validation |
-|-----------|---------|------------|
-| Reset → Partial | `init_sequence(salt_hash)` | Must not be in use |
-| Partial → Complete | `commit()` | Must be full |
-| Complete → Registered | `register()` | Must be finalized |
-| Registered → Reset | Drop of `RegistrationHandle` | Automatic |
-| Partial → Reset | Aborted sequence | Explicit or drop |
-| Complete → Reset | Invalidated | Explicit or drop |
-
-#### Example Block Lifecycle
-
-A sequence requests a new KV block:
-
-1. Allocator pops from InactivePool → Block is in Reset
-2. `init_sequence()` → Transitions to Partial
-3. Tokens are appended → State remains Partial
-4. On full → `commit()` → State becomes Complete
-5. `register()` → Block is hashed and moved to Registered. Blocks can now be used for lookup.
-6. On eviction or end-of-life → `drop()` of RAII handle returns block to Reset
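
The transition tables and the lifecycle above can be encoded as a lookup table. This Python sketch only mirrors the documented states and triggers; it is not the Rust `BlockState` implementation, the helper names `step` and `run` are hypothetical, and validations such as "must be full" before `commit()` are not modeled.

```python
from enum import Enum


class BlockState(Enum):
    RESET = "Reset"
    PARTIAL = "Partial"
    COMPLETE = "Complete"
    REGISTERED = "Registered"


# (current state, action) -> next state, transcribed from the tables above.
TRANSITIONS = {
    (BlockState.RESET, "init_sequence"): BlockState.PARTIAL,
    (BlockState.PARTIAL, "add_tokens"): BlockState.PARTIAL,
    (BlockState.PARTIAL, "commit"): BlockState.COMPLETE,
    (BlockState.PARTIAL, "reset"): BlockState.RESET,
    (BlockState.COMPLETE, "register"): BlockState.REGISTERED,
    (BlockState.COMPLETE, "reset"): BlockState.RESET,
    (BlockState.REGISTERED, "drop"): BlockState.RESET,
}


def step(state, action):
    """Apply one action, rejecting transitions the tables do not list."""
    key = (state, action)
    if key not in TRANSITIONS:
        raise ValueError(f"invalid transition: {action!r} from {state.value}")
    return TRANSITIONS[key]


def run(actions, state=BlockState.RESET):
    """Fold a sequence of actions over the state machine."""
    for action in actions:
        state = step(state, action)
    return state


# The example lifecycle: allocate, fill, commit, register, then drop the handle.
lifecycle = ["init_sequence", "add_tokens", "commit", "register", "drop"]
final_state = run(lifecycle)
print(final_state.value)
```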
-
-### Lifecycle Management using RAII and Event Plane
-
-The system uses RAII for memory lifecycle management. Every block holds metadata and registration state, and registration is coupled with an `EventManager`. On registration and drop:
-
-- `PublishHandle` triggers Register events
-- Dropping it triggers Remove events
-
-This pattern ensures consistency for shared memory tracking across workers without requiring explicit deallocation logic. The events are propagated in the Dynamo Events plane. Any Dynamo component subscribed to the events plane can listen to these changes. Note that even the storage provider can subscribe to the events plane and create an internal prefix tree representation that is tailored and optimized for the specific platform.
-
-### Remote Memory Integration using NIXL
-
-The NIXL agent exposes remote memory buffers using `NixlBlockSet`, `RemoteBlocks`, and layout descriptors. Key operations include:
-
-- `nixl_register()`: Registers memory region with NIXL runtime
-- `serialize() / deserialize()`: Converts layout and memory into transferable descriptors
-- `import_remote_blockset()`: Loads remote node's block layouts into the manager
-- `get_remote_blocks_mutable()`: Fetches transferable memory views from another node
-
-`RemoteBlocks` is a lightweight abstraction over shared memory for cross-node block usage (through UCX or other backends).
-
-#### Remote Memory Registration Protocol
-
-The following describes a bidirectional remote memory registration and layout synchronization protocol between workers (e.g., Worker 1 and Worker 2) using NIXL:
-
-**1. Agent Creation & Memory Registration**
-
-Each worker independently sets up a NixlAgent:
-- Registers its memory regions (i.e., device memory) through `nixl_register()`
-- These regions correspond to blocks managed in the local BlockPool
-
-Once the worker registers the memory, NIXL creates remote-accessible descriptors, which it binds to the memory layout.
-
-**2. Metadata Exchange**
-
-After memory registration, workers exchange serialized layout metadata, encapsulated in a `SerializedNixlBlockLayout`.
-
-Why is this step critical?
-- LLM inference workloads often differ in *tensor parallel (TP)* configurations:
-  - Worker 1 might have TP=4, while Worker 2 has TP=8
-  - Even if both systems use similar `FullyContiguous` layouts, their internal slicing and alignment assumptions differ
-- The metadata exchange bridges this semantic mismatch by sharing:
-  - LayoutConfig (num_layers, page_size, inner_dim, dtype)
-  - BlockSetID
-  - Base address + stride information (including alignment)
-  - Device ID + memory type (host/device)
-- Once workers share metadata, each can reconstruct the layout on its side using `deserialize()`
-
-This enables NIXL to:
-- Understand where each layer/block resides
-- Perform correct gather-scatter operations during RDMA-like transfers
-
-Without this step, remote fetches would result in data corruption or misaligned tokens.
-
-**3. Serialization & Deserialization: Making Layouts Portable**
-
-In the serialization stage, KVBM exports the layout, and `FullyContiguous::serialize()` encodes:
-- FullyContiguousConfig
-- base_offset
-- Physical memory descriptors (NixlStorage), including:
-  - Memory type (VRAM, DRAM)
-  - Address & size
-  - Device ID
-
-The system sends this using NIXL transfer and then injects it into a KVBM scheduler state.
-
-In the deserialization stage, `SerializedNixlBlockLayout::deserialize()` rehydrates this into:
-- A fully reconstructed memory layout view
-- Local representation of a remote memory slice with correct offsets and size semantics
-
-It also enables direct access to remote memory with consistent logical semantics. This guarantees that even across different system configurations (hardware or LLM shape), both parties agree on the memory view for each KV block.
-
-**4. Ownership Handles and Lifetime Tracking**
-
-Memory ownership in NIXL is tightly coupled with RAII-based handles:
-- When a block is registered, it returns a `PublishHandle` which wraps a `RegistrationHandle`
-- On drop of this handle, an automatic Remove event is published, which:
-  - Deregisters the block from the NIXL layer
-  - Removes it from the remote block registry
-- This ensures that once the block is evicted from the cache or no longer used in inference, all references are invalidated cleanly across nodes
-
-This mechanism avoids:
-- Stale memory access
-- Dangling pointers on GPU or host
-- Manual deregistration bugs
-
-The system can batch and publish registration events using a Publisher, optimizing performance under high concurrency.
-
-### Storage Backends and Pluggability
-
-You can integrate KVBM with a storage backend by extending or wrapping `NixlEnabledStorage` to support cross-node RDMA registration. All layouts and block pools are generic over these backends, allowing for fine-grained control over memory tiers.
-
-```mermaid
----
-title: Example KVBM System Architecture
----
-flowchart TD
-    A["Distributed Inference Engine"] --> B["Dynamo KV Block Manager"]
-
-    B --> C["NIXL Storage Agent<br/>- Volume registration<br/>- get()/put() abstraction"]
-    B --> D["Event Plane<br/>- NATS-based Pub/Sub<br/>- StoreEvent / RemoveEvent"]
-
-    C --> E["G4 Storage Infrastructure<br/>(SSD, Object store, etc.)<br/>- Store KV blocks"]
-    D --> F["Storage Provider Subscriber<br/>- Parse Events<br/>- Build fast tree/index<br/>- Optimize G4 tiering"]
-```
-
-#### NIXL Storage Interface (for Backend Integration)
-
-The NIXL interface abstracts volume interaction and decouples it from mounting, metadata tracking, or direct system I/O. It provides:
-
-- `registerVolume(descriptor)`: Register a logical volume for KV cache data
-- `unregisterVolume()`: Cleanly deregister and release volume mappings
-- `get() / put()`: Block-level APIs used by KVBM to fetch and store token blocks
-
-These abstractions allow backends to be integrated without tying into the host's file system stack, enabling safe interaction with block devices, local filesystems, and RDMA-capable volumes. Note that these APIs are still being finalized.
-
-#### Dynamo Event Plane (Pub/Sub Coordination Layer)
-
-To support external storage optimizations without modifying KVBM logic, we provide an **event plane** built on NATS.io that emits lifecycle events for all block operations:
-
-- **StoreEvent**: Emitted when a KV block is registered
-- **RemoveEvent**: Emitted when a KV block is released or evicted
-
-Each KVEvent (~100 bytes) contains:
-
-| Field | Description |
-|-------|-------------|
-| `sequence_hash` | Unique identifier of the KV block |
-| `prefix_hash` | Prefix grouping for query-level aggregation |
-| `block_size` | Size in bytes |
-| `storage_location` | Logical volume identifier |
-| `event_type` | Store or Remove |
-| `extra_metadata` | Reserved fields for partner-specific optimization |
-
-For scalability, the system batches and publishes these events periodically (e.g., every ~10s, or dynamically based on system load).
-
-#### Conceptual Design of a Storage Advisor
-
-This section provides an overview for storage providers interested in integrating as a custom backend to KVBM. **This is optional for KVBM integration with a backend.**
-
-External storage systems are not tightly coupled with Dynamo's execution pipeline. Instead, they passively observe KV block lifecycle events through a subscription model:
-
-1. Storage volumes are pre-provisioned and mounted by the storage provider
-2. These volumes are registered with Dynamo through the NIXL Storage Agent using `registerVolume()` APIs
-3. Dynamo KV Block Manager interacts only with logical block-level APIs (`get()` and `put()`)
-4. The Event Plane asynchronously broadcasts KV lifecycle events using a NATS-based pub/sub channel
-5. Storage vendors implement a lightweight subscriber process that listens to these events
-
-To enable fast lookup and dynamic tiering, storage vendors may build internal data structures using the received event stream:
-
-- On receiving a **StoreEvent**: Insert a record into an internal prefix tree, hash map, or LRU index with `prefix_hash`, `sequence_hash`, and associated metadata
-- On receiving a **RemoveEvent**: Delete or prune the corresponding record, optionally triggering cleanup or tier migration workflows
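
A storage-side subscriber along these lines might maintain its index as sketched below. This is an illustration only: the field names follow the KVEvent table, but the field types, the `"store"`/`"remove"` string values, and the `BlockIndex` class are assumptions, and the NATS subscription itself is omitted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVEvent:
    sequence_hash: int
    prefix_hash: int
    block_size: int
    storage_location: str
    event_type: str  # assumed encoding: "store" or "remove"


class BlockIndex:
    """Hash-map index of live blocks, grouped by prefix_hash."""

    def __init__(self):
        self._by_sequence = {}  # sequence_hash -> KVEvent that stored it
        self._by_prefix = {}    # prefix_hash -> set of sequence_hash

    def apply(self, event):
        if event.event_type == "store":
            self._by_sequence[event.sequence_hash] = event
            self._by_prefix.setdefault(event.prefix_hash, set()).add(event.sequence_hash)
        elif event.event_type == "remove":
            stored = self._by_sequence.pop(event.sequence_hash, None)
            if stored is not None:
                group = self._by_prefix[stored.prefix_hash]
                group.discard(event.sequence_hash)
                if not group:
                    del self._by_prefix[stored.prefix_hash]
        else:
            raise ValueError(f"unknown event_type: {event.event_type!r}")

    def lookup(self, sequence_hash):
        return self._by_sequence.get(sequence_hash)

    def blocks_for_prefix(self, prefix_hash):
        return set(self._by_prefix.get(prefix_hash, ()))


index = BlockIndex()
index.apply(KVEvent(101, 7, 4096, "vol-0", "store"))
index.apply(KVEvent(102, 7, 4096, "vol-0", "store"))
index.apply(KVEvent(101, 7, 4096, "vol-0", "remove"))
print(index.blocks_for_prefix(7))
```

Removing an unknown `sequence_hash` is treated as a no-op here, so replayed or out-of-order Remove events do not raise.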
-
-With real-time visibility into KV block usage patterns, the storage system can implement smart tiering policies:
-
-- **Hot block promotion**: Frequently accessed KV blocks can be migrated to fast SSD volumes
-- **Cold block demotion**: Infrequently used blocks can be demoted to slower storage (HDDs, cloud object storage)
-- **Proactive compaction**: If block sizes or prefix patterns indicate fragmentation, the storage backend can coalesce or rewrite blocks
-
-This design ensures that performance, resilience, and extensibility scale independently across the KV layer and the storage backend layer.
-
-## Framework Integrations
-
-KVBM integrates with inference frameworks (vLLM, TensorRT-LLM, SGLang) via Connector APIs to influence KV caching behavior, scheduling, and forward pass execution.
-
-### Connector Architecture
-
-There are two components of the interface:
-
-- **Scheduler (Leader)**: Orchestrates KV block offload/onboard and builds metadata that specifies the transfer data for the workers. It also maintains hooks for handling asynchronous transfer completion.
-- **Worker**: Reads the metadata built by the scheduler (leader) and performs asynchronous onboarding/offloading at the end of the forward pass.
-
-![vLLM KVBM Integration](../images/kvbm-integrations.png)
-
-*Typical integration of KVBM with inference frameworks (vLLM shown as example)*
-
-### Onboarding Operations
-
-![Onboarding blocks from Host to Device](../images/kvbm-onboard-host2device.png)
-
-*Onboarding blocks from Host to Device*
-
-![Onboarding blocks from Disk to Device](../images/kvbm-onboard-disk2device.png)
-
-*Onboarding blocks from Disk to Device*
-
-### Offloading Operations
-
-![Offloading blocks from Device to Host & Disk](../images/kvbm-offload.png)
-
-*Offloading blocks from Device to Host & Disk*
-
-## Further Reading
-
-- [vLLM Automatic Prefix Caching](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html)
-- [SGLang HiCache Benchmarks](https://github.com/sgl-project/sglang/tree/main/benchmark/hicache)
-- [EMOGI: Efficient Memory-access for Out-of-memory Graph-traversal](https://arxiv.org/abs/2006.06890)
-
-## See Also
-
-- [KVBM Overview](../components/kvbm/README.md)
-- [KVBM Guide](../components/kvbm/kvbm_guide.md)
-- [NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
--- a/docs/design_docs/planner_design.md
+++ b/docs/design_docs/planner_design.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2024-2026 NVIDIA CORPORATION & AFFILIATES.
-All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-->
-
-# Planner Design
-
-> **Tier 3 design documentation** for contributors and architects. For user-facing docs, see [docs/components/planner/](/docs/components/planner/).
-
-## Overview
-
-The Planner is Dynamo's autoscaling controller. It observes system metrics, predicts future load, and adjusts prefill/decode worker replica counts to proactively meet SLA targets. This document covers the internal architecture, algorithms, and design trade-offs.
-
-## Architecture
-
-```text
-┌──────────────────────────────────────────────────────────┐
-│                    Planner Component                     │
-│                                                          │
-│  ┌───────────────┐ ┌───────────────┐ ┌────────────────┐  │
-│  │    Metric     │ │     Load      │ │  Performance   │  │
-│  │   Collector   │ │   Predictor   │ │  Interpolator  │  │
-│  │  (Prometheus) │ │ (ARIMA/etc.)  │ │  (JSON data)   │  │
-│  └───────┬───────┘ └───────┬───────┘ └───────┬────────┘  │
-│          │                 │                  │          │
-│          ▼                 ▼                  ▼          │
-│  ┌───────────────────────────────────────────────────┐   │
-│  │              Scaling Algorithm                    │   │
-│  └───────────────────────┬───────────────────────────┘   │
-│                          │                               │
-│  ┌───────────────────────▼───────────────────────────┐   │
-│  │               Connector Layer                     │   │
-│  │  ┌───────────────────┐  ┌───────────────────────┐ │   │
-│  │  │ KubernetesConn.   │  │   VirtualConn.        │ │   │
-│  │  │ (PATCH DGD)       │  │   (Runtime bridge)    │ │   │
-│  │  └───────────────────┘  └───────────────────────┘ │   │
-│  └───────────────────────────────────────────────────┘   │
-└──────────────────────────────────────────────────────────┘
-```
-
-## Scaling Algorithm
-
-### Step 1: Metric Collection
-
-Every `adjustment_interval` seconds, the planner queries Prometheus for:
-
-- Average TTFT and ITL over the interval
-- Total request count
-- Average input sequence length (ISL) and output sequence length (OSL)
-
-The Prometheus query targets the Frontend's `/metrics` endpoint, which exposes histograms and counters.
-
-### Step 2: Correction Factor Calculation
-
-The planner maintains correction factors that adapt profiling-based predictions to real-world behavior:
-
-```text
-prefill_correction = actual_ttft / expected_ttft
-decode_correction  = actual_itl  / expected_itl
-```
-
-These factors account for hard-to-model effects such as:
-
-- **Request queueing**: Bursty traffic causes higher TTFT than profiled steady-state
-- **Prefix cache hits**: KV reuse reduces effective prefill tokens, lowering actual TTFT
-- **Chunked prefill in decode**: Small prefills processed in decode engine affect ITL
-- **Metric variance**: Average ISL/OSL may not represent the actual distribution
-
-The correction factors are applied as multipliers to the next scaling decision. Setting `--no-correction` disables this for debugging or when cold-start artifacts dominate.
-
-### Step 3: Load Prediction
-
-The planner forecasts three values for the next interval:
-
-- `next_num_req`: Number of requests
-- `next_isl`: Average input sequence length
-- `next_osl`: Average output sequence length
-
-Four predictor implementations are available:
-
-
-| Predictor    | Algorithm                                | Best For                         |
-| ------------ | ---------------------------------------- | -------------------------------- |
-| **Constant** | `next = current`                         | Stable workloads, long intervals |
-| **ARIMA**    | Auto-ARIMA with optional log1p transform | Trending/seasonal patterns       |
-| **Kalman**   | Local linear trend Kalman filter         | Bursty traffic                   |
-| **Prophet**  | Facebook Prophet time-series model       | Complex seasonality              |
-
-
-All predictors support warm-starting from trace files (`--load-predictor-warmup-trace`).
-
-### Step 4: Replica Calculation
-
-**Prefill replicas:**
-
-```python
-predicted_load = next_num_req * next_isl / interval * min(1, prefill_correction)
-prefill_replicas = ceil(predicted_load / interpolated_throughput / gpus_per_engine)
-```
-
-The prefill correction factor has a linear effect on throughput because prefill is single-batched.
-
-**Decode replicas:**
-
-```python
-# Apply correction to the ITL SLA target
-corrected_itl = target_itl / decode_correction
-
-# Find best throughput/GPU that achieves corrected ITL at predicted context length
-throughput_per_gpu = decode_interpolator.find_best_throughput_per_gpu(
-    itl=corrected_itl,
-    context_length=next_isl + next_osl / 2
-)
-
-# Calculate required replicas
-decode_replicas = ceil(next_num_req * next_osl / interval / throughput_per_gpu / gpus_per_engine)
-```
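
To make the arithmetic concrete, the two formulas above can be evaluated with sample numbers. This sketch assumes throughputs are expressed in tokens per second per GPU and takes the interpolated throughput as a plain input instead of calling the interpolators; the function names and all numeric values are invented for illustration.

```python
from math import ceil


def prefill_replicas(next_num_req, next_isl, interval_s, prefill_correction,
                     interpolated_throughput, gpus_per_engine):
    """Prefill replica count, following the Step 4 prefill formula."""
    predicted_load = next_num_req * next_isl / interval_s * min(1, prefill_correction)
    return ceil(predicted_load / interpolated_throughput / gpus_per_engine)


def decode_replicas(next_num_req, next_osl, interval_s,
                    throughput_per_gpu, gpus_per_engine):
    """Decode replica count; throughput_per_gpu stands in for the interpolator result."""
    return ceil(next_num_req * next_osl / interval_s / throughput_per_gpu / gpus_per_engine)


# 900 requests of 3000 input tokens in a 180 s interval is 15000 tokens/s;
# a correction of 0.8 scales that to 12000 tokens/s.
n_prefill = prefill_replicas(
    next_num_req=900, next_isl=3000, interval_s=180, prefill_correction=0.8,
    interpolated_throughput=5000, gpus_per_engine=1,
)

# 900 requests of 150 output tokens in 180 s is 750 tokens/s.
n_decode = decode_replicas(
    next_num_req=900, next_osl=150, interval_s=180,
    throughput_per_gpu=200, gpus_per_engine=1,
)
print(n_prefill, n_decode)
```

Because of `min(1, prefill_correction)`, a correction factor above 1 leaves the predicted prefill load unchanged, while a factor below 1 reduces it.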
-
-### Step 5: Scaling Execution
-
-The planner calls `connector.set_component_replicas()` with the calculated targets. Scaling is non-blocking by default: the planner continues monitoring while replicas are adjusting.
-
-## Connector Design
-
-### Interface
-
-```python
-class PlannerConnector(ABC):
-    async def add_component(self, component_name)
-    async def remove_component(self, component_name)
-    # Extended interface (not on ABC, but implemented by both connectors):
-    async def set_component_replicas(self, targets, blocking)
-    async def validate_deployment(self, ...)
-    async def wait_for_deployment_ready(self)
-```
-
-### KubernetesConnector
-
-Directly PATCHes the DGD resource to update replica counts. The operator watches for DGD changes and reconciles component deployments.
-
-**Design decisions:**
-
-- Uses `DYN_PARENT_DGD_K8S_NAME` to find its parent DGD (injected by operator)
-- Resolves services by `subComponentType` field (prefill/decode), with fallback to legacy component names
-- Validates deployment structure on startup: checks that prefill and decode services exist and model names match
-
-### VirtualConnector
-
-For non-native environments (e.g., custom orchestrators). Writes scaling decisions to the distributed runtime via `VirtualConnectorCoordinator` (Rust binding). External systems use `VirtualConnectorClient` to poll decisions and report completion.
-
-**Scaling decision flow:**
-
-1. Planner writes `(num_prefill, num_decode, decision_id)` to runtime
-2. External system reads decision via `client.wait()`
-3. External system executes scaling
-4. External system reports completion via `client.complete(decision)`
-5. Planner sees `scaled_decision_id >= decision_id` and proceeds
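
The handshake above can be modeled with a small in-memory stand-in for the runtime state. This is a sketch, not the `VirtualConnectorCoordinator` or `VirtualConnectorClient` API: `DecisionStore` and the helper functions are hypothetical, the incrementing `decision_id` scheme is an assumption, and the 1800s timeout is not modeled.

```python
class DecisionStore:
    """In-memory stand-in for the scaling state kept in the distributed runtime."""

    def __init__(self):
        self.decision = None          # (num_prefill, num_decode, decision_id)
        self.scaled_decision_id = -1  # highest decision acknowledged so far


def planner_write(store, num_prefill, num_decode):
    """Step 1: the planner records a new decision with a fresh id."""
    next_id = 0 if store.decision is None else store.decision[2] + 1
    store.decision = (num_prefill, num_decode, next_id)
    return next_id


def external_poll(store):
    """Step 2: the external system picks up a decision it has not completed yet."""
    if store.decision is not None and store.decision[2] > store.scaled_decision_id:
        return store.decision
    return None


def external_complete(store, decision):
    """Step 4: the external system reports that scaling has finished."""
    store.scaled_decision_id = max(store.scaled_decision_id, decision[2])


def planner_can_proceed(store, decision_id):
    """Step 5: the planner checks scaled_decision_id >= decision_id."""
    return store.scaled_decision_id >= decision_id


store = DecisionStore()
decision_id = planner_write(store, num_prefill=2, num_decode=4)
pending = external_poll(store)
before = planner_can_proceed(store, decision_id)
# Step 3 (the actual scaling) would happen here in the external system.
external_complete(store, pending)
after = planner_can_proceed(store, decision_id)
print(pending, before, after)
```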
-
-**Timeout**: If scaling isn't acknowledged within 1800s (configurable), the planner proceeds with new decisions anyway.
-
-## Performance Interpolation
-
-The planner uses pre-deployment profiling data (NPZ files) to map (throughput, ISL/OSL, context_length) -> (TTFT, ITL). This data comes from the SLA-driven profiling process (either online GPU profiling or AI Configurator estimation).
-
-Two interpolators are maintained:
-
-- **Prefill interpolator**: Maps (throughput_per_gpu, ISL) -> TTFT
-- **Decode interpolator**: Maps (throughput_per_gpu, context_length) -> ITL
-
-The interpolators use the profiling sweep granularity to determine precision. Finer granularity requires more profiling samples but yields more accurate interpolation.
-
-## Initialization
-
-The planner starts with a 30-second delay (`INIT_PLANNER_START_DELAY`) to allow other components (frontend, workers) to register and stabilize. This is a known workaround (marked TODO in code) that should be replaced with a proper readiness check.
-
-After the delay:
-
-1. Initialize the connector (K8s or Virtual based on `--environment`)
-2. Validate deployment structure
-3. Load profiling results
-4. Build interpolators
-5. Initialize load predictor
-6. Enter main scaling loop
-
-## Performance Considerations
-
- **Adjustment interval sizing**: The interval must be long enough for scaling operations to complete. If `adjustment_interval` is shorter than the time to add/remove a worker (which includes pod scheduling, model loading, and registration), scaling decisions will overlap. Default of 180s is conservative; workloads with fast model loading can use shorter intervals.
- **Correction factor stability**: Correction factors are recalculated each interval. During traffic transitions (e.g., ramp-up), they can oscillate. The `--no-correction` flag disables correction for scenarios where cold-start artifacts dominate and distort the factor.
- **Interpolation accuracy vs profiling cost**: Higher `prefillInterpolationGranularity` and `decodeInterpolationGranularity` in the profiling sweep produce more accurate interpolation but increase profiling time linearly. Default granularity (16 prefill, 6 decode) balances accuracy with profiling duration.
- **Predictor warm-up period**: All predictors need observation history before making reliable forecasts. ARIMA and Prophet need multiple adjustment intervals of data. Kalman starts forecasting after `--kalman-min-points` observations. During warm-up, the planner uses the constant predictor as fallback.
-
-## Known Limitations
-
-1. **30-second startup delay**: Hardcoded wait for component registration. It should be replaced with runtime readiness probing.
-2. **Adjustment interval vs scaling latency**: If `adjustment_interval` < time to scale, scaling decisions can pile up. The planner logs warnings but doesn't queue.
-3. **Average-based interpolation**: The planner uses average ISL/OSL, which may not represent bimodal or heavy-tailed distributions well.
-4. **Single DGD scope**: Each planner instance manages exactly one DGD. Multi-model/multi-DGD coordination is not supported.
-5. **Load-based planner deprecated**: The load-based code path exists but is non-functional with current backends (no prefill queue metrics).
-
-## Future Work
-
- Support aggregated (non-disaggregated) scaling mode for single-worker deployments
- Multi-DGD coordination for shared-cluster scenarios
- Distribution-aware interpolation (beyond mean ISL/OSL)
- Adaptive adjustment interval based on observed scaling latency
-
-## File Map
-
-
-| File                         | Size | Purpose                                               |
-| ---------------------------- | ---- | ----------------------------------------------------- |
-| `planner_core.py`            | 36k  | Main scaling loop, algorithm implementation           |
-| `perf_interpolation.py`      | 13k  | NPZ data loading and throughput/latency interpolation |
-| `load_predictor.py`          | 16k  | ARIMA, Prophet, Kalman, Constant predictors           |
-| `pre_swept_results_utils.py` | 12k  | Pre-computed H100/H200 profiling data loader          |
-| `kubernetes_connector.py`    | 11k  | K8s API integration for DGD scaling                   |
-| `kube.py`                    | 7.4k | Low-level K8s client wrapper                          |
-| `exceptions.py`              | 7.2k | Custom exception hierarchy                            |
-| `prometheus.py`              | 7.3k | Prometheus query builder and client                   |
-| `defaults.py`                | 8.1k | Default configs, backend name mappings                |
-| `planner_argparse.py`        | 6.2k | CLI argument definitions                              |
-
-
--- a/docs/design_docs/request_plane.md
+++ b/docs/design_docs/request_plane.md
-<!--
-SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-SPDX-License-Identifier: Apache-2.0
-
-Licensed under the Apache License, Version 2.0 (the "License");
-you may not use this file except in compliance with the License.
-You may obtain a copy of the License at
-
-http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing, software
-distributed under the License is distributed on an "AS IS" BASIS,
-WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-See the License for the specific language governing permissions and
-limitations under the License.
-->
-
-# Dynamo Request Planes User Guide
-
-## Overview
-
-Dynamo supports multiple transport mechanisms for its request plane (the communication layer between services). You can choose from three different request plane modes based on your deployment requirements:
-
- **TCP** (default): Direct TCP connection for optimal performance
- **NATS**: Message broker-based request plane
- **HTTP**: HTTP/2-based request plane
-
-This guide explains how to configure and use the request plane in your Dynamo deployment.
-
-## What is a Request Plane?
-
-The request plane is the transport layer that handles communication between Dynamo services (e.g., frontend to backend, worker to worker). Different request planes offer different trade-offs:
-
-| Request Plane | Suitable For | Characteristics |
-|--------------|----------|-----------------|
-| **NATS** | Production deployments with KV routing | Requires NATS infrastructure, provides pub/sub patterns, highest flexibility |
-| **TCP** | Low-latency direct communication | Direct connections, minimal overhead |
-| **HTTP** | Standard deployments, debugging | HTTP/2 protocol, easier observability with standard tools, widely compatible |
-
-## Request Plane vs KV Event Plane
-
-Dynamo has **two independent communication planes**:
-
- **Request plane** (**`DYN_REQUEST_PLANE`**): how **RPC requests** flow between components (frontend → router → worker), via `tcp`, `http`, or `nats`.
- **KV event plane** (currently only **NATS** is supported): how **KV cache events** (and optional router replica sync) are distributed/persisted for KV-aware routing.
-
-**Note:** If you use the `tcp` or `http` request plane with KV events enabled (the default), NATS is initialized automatically. You can optionally set the `NATS_SERVER` environment variable (e.g., `NATS_SERVER=nats://nats-hostname:port`) to point to a custom NATS server; otherwise, it defaults to `localhost:4222`. To disable NATS completely, use `--no-kv-events` on the frontend.
-
-Because they are independent, you can mix them.
-
-For example, a deployment with the TCP request plane can be paired with any of the following KV event configurations:
- **JetStream KV events**: requests use TCP, KV routing still uses NATS JetStream + object store for persistence.
- **NATS Core KV events (local indexer)**: requests use TCP, KV events use NATS Core pub/sub and persistence lives on workers.
- **No KV events**: requests use TCP and KV routing runs without events (no NATS required, but no event-backed persistence).
-
-## Configuration
-
-### Environment Variable
-
-Set the request plane mode using the `DYN_REQUEST_PLANE` environment variable:
-
-```bash
-export DYN_REQUEST_PLANE=<mode>
-```
-
-Where `<mode>` is one of:
- `tcp` (default)
- `nats`
- `http`
-
-The value is case-insensitive.
-
-### Default Behavior
-
-If `DYN_REQUEST_PLANE` is not set or contains an invalid value, Dynamo defaults to `tcp`.
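
The selection rule described above (case-insensitive, falling back to `tcp`) can be expressed as a short sketch. It mirrors the documented behavior only; `resolve_request_plane` and `VALID_REQUEST_PLANES` are hypothetical names and not part of the Dynamo runtime.

```python
import os

VALID_REQUEST_PLANES = ("tcp", "nats", "http")


def resolve_request_plane(environ=None):
    """Return the request plane mode, falling back to 'tcp' as documented."""
    env = os.environ if environ is None else environ
    mode = env.get("DYN_REQUEST_PLANE", "").lower()  # the value is case-insensitive
    return mode if mode in VALID_REQUEST_PLANES else "tcp"


explicit = resolve_request_plane({"DYN_REQUEST_PLANE": "HTTP"})
unset = resolve_request_plane({})
```

Passing a dictionary instead of reading `os.environ` directly keeps the helper easy to check without touching the real environment.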
-
-## Usage Examples
-
-### Using TCP (Default)
-
-TCP is the default request plane and provides direct, low-latency communication between services.
-
-**Configuration:**
-
-```bash
-# TCP is the default, so no need to set DYN_REQUEST_PLANE explicitly
-# But you can explicitly set it if desired:
-export DYN_REQUEST_PLANE=tcp
-
-# Optional: Configure TCP server host and port
-export DYN_TCP_RPC_HOST=0.0.0.0  # Bind to all interfaces
-# export DYN_TCP_RPC_PORT=9999   # Optional: specify a fixed port
-
-# Run your Dynamo service
-DYN_REQUEST_PLANE=tcp python -m dynamo.frontend --http-port=8000 &
-DYN_REQUEST_PLANE=tcp python -m dynamo.vllm --model Qwen/Qwen3-0.6B
-```
-
-**Note:** By default, TCP uses an OS-assigned free port (port 0). This is ideal for environments where multiple services may run on the same machine or when you want to avoid port conflicts. If you need a specific port (e.g., for firewall rules), set `DYN_TCP_RPC_PORT` explicitly.
-
-**When to use TCP:**
- Simple deployments with direct service-to-service communication (e.g. frontend to backend)
- Minimal infrastructure requirements (NATS is initialized by default for KV events but can be disabled with `--no-kv-events`)
- Low-latency requirements
-
-**TCP Configuration Options:**
-
-Additional TCP-specific environment variables:
- `DYN_TCP_RPC_HOST`: Server host address (default: auto-detected)
- `DYN_TCP_RPC_PORT`: Server port. If not set, the OS assigns a free port automatically (recommended for most deployments). Set explicitly only if you need a specific port for firewall rules.
- `DYN_TCP_MAX_MESSAGE_SIZE`: Maximum message size for TCP client (default: 32MB)
- `DYN_TCP_REQUEST_TIMEOUT`: Request timeout for TCP client (default: 10 seconds)
- `DYN_TCP_POOL_SIZE`: Connection pool size for TCP client (default: 50)
- `DYN_TCP_CONNECT_TIMEOUT`: Connect timeout for TCP client (default: 3 seconds)
- `DYN_TCP_CHANNEL_BUFFER`: Request channel buffer size for TCP client (default: 100)
-
-### Using HTTP
-
-HTTP/2 provides a standards-based request plane that's easy to debug and widely compatible.
-
-**Configuration:**
-
-```bash
-# Optional: Configure HTTP server host and port
-export DYN_HTTP_RPC_HOST=0.0.0.0      # Bind to all interfaces
-export DYN_HTTP_RPC_PORT=8888         # Default port
-export DYN_HTTP_RPC_ROOT_PATH=/v1/rpc # Default path
-
-# Run your Dynamo service
-DYN_REQUEST_PLANE=http python -m dynamo.frontend --http-port=8000 &
-DYN_REQUEST_PLANE=http python -m dynamo.vllm --model Qwen/Qwen3-0.6B
-```
-
-**When to use HTTP:**
- Standard deployments requiring HTTP compatibility
- Debugging scenarios (use curl, browser tools, etc.)
- Integration with HTTP-based infrastructure
- Load balancers and proxies that work with HTTP
-
-**HTTP Configuration Options:**
-
-Additional HTTP-specific environment variables:
- `DYN_HTTP_RPC_HOST`: Server host address (default: auto-detected)
- `DYN_HTTP_RPC_PORT`: Server port (default: 8888)
- `DYN_HTTP_RPC_ROOT_PATH`: Root path for RPC endpoints (default: /v1/rpc)
-
-`DYN_HTTP2_*`: Various HTTP/2 client configuration options
- `DYN_HTTP2_MAX_FRAME_SIZE`: Maximum frame size for HTTP client (default: 1MB)
- `DYN_HTTP2_MAX_CONCURRENT_STREAMS`: Maximum concurrent streams for HTTP client (default: 1000)
- `DYN_HTTP2_POOL_MAX_IDLE_PER_HOST`: Maximum idle connections per host for HTTP client (default: 100)
- `DYN_HTTP2_POOL_IDLE_TIMEOUT_SECS`: Idle timeout for HTTP client (default: 90 seconds)
- `DYN_HTTP2_KEEP_ALIVE_INTERVAL_SECS`: Keep-alive interval for HTTP client (default: 30 seconds)
- `DYN_HTTP2_KEEP_ALIVE_TIMEOUT_SECS`: Keep-alive timeout for HTTP client (default: 10 seconds)
- `DYN_HTTP2_ADAPTIVE_WINDOW`: Enable adaptive flow control (default: true)
-
-### Using NATS
-
-NATS provides durable JetStream messaging for the request plane and can also be used for KV events (and router replica sync).
-
-**Prerequisites:**
- NATS server must be running and accessible
- Configure NATS connection via standard Dynamo NATS environment variables
-
-```bash
-# Explicitly set to NATS
-export DYN_REQUEST_PLANE=nats
-
-# Run your Dynamo service
-DYN_REQUEST_PLANE=nats python -m dynamo.frontend --http-port=8000 &
-DYN_REQUEST_PLANE=nats python -m dynamo.vllm --model Qwen/Qwen3-0.6B
-```
-
-**When to use NATS:**
- Production deployments with service discovery
- KV-aware routing with accurate cache state tracking (requires NATS for event transport). Note: approximate mode (`--no-kv-events`) provides KV routing without NATS but with reduced accuracy.
- Need for message replay and persistence features
-
-**Limitations:**
- NATS does not support payloads beyond 16MB (use TCP for larger payloads)
-
-## Complete Example
-
-See [`examples/backends/vllm/launch/agg_request_planes.sh`](../../examples/backends/vllm/launch/agg_request_planes.sh) for a complete working example that launches Dynamo with the TCP, HTTP, or NATS request plane.
-
-
-## Real-World Example
-
-The Dynamo repository includes a complete example demonstrating all three request planes:
-
-**Location:** `examples/backends/vllm/launch/agg_request_planes.sh`
-
-```bash
-cd examples/backends/vllm/launch
-
-# Run with TCP
-./agg_request_planes.sh --tcp
-
-# Run with HTTP
-./agg_request_planes.sh --http
-
-# Run with NATS
-./agg_request_planes.sh --nats
-```
-
-## Architecture Details
-
-### Network Manager
-
-The request plane implementation is centralized in the Network Manager (`lib/runtime/src/pipeline/network/manager.rs`), which:
-
-1. Reads the `DYN_REQUEST_PLANE` environment variable at startup
-2. Creates the appropriate server and client implementations
-3. Provides a transport-agnostic interface to the rest of the codebase
-4. Manages all network configuration and lifecycle
-
-### Transport Abstraction
-
-All request plane implementations conform to common trait interfaces:
- `RequestPlaneServer`: Server-side interface for receiving requests
- `RequestPlaneClient`: Client-side interface for sending requests
-
-This abstraction means your application code doesn't need to change when switching request planes.
-
-### Configuration Loading
-
-Request plane configuration is loaded from environment variables at startup and cached globally. The configuration hierarchy is:
-
-1. **Mode Selection**: `DYN_REQUEST_PLANE` (defaults to `tcp`)
-2. **Transport-Specific Config**: Mode-specific environment variables (e.g., `DYN_TCP_*`, `DYN_HTTP2_*`)
-
-## Migration Guide
-
-### From NATS to TCP
-
-1. Stop your Dynamo services
-2. Set environment variable `DYN_REQUEST_PLANE=tcp`
-3. Optionally configure TCP-specific settings (e.g., `DYN_TCP_RPC_HOST`). Note: `DYN_TCP_RPC_PORT` is optional; if not set, an OS-assigned free port is used automatically.
-4. Restart your services
-
-
-### From NATS to HTTP
-
-1. Stop your Dynamo services
-2. Set environment variable `DYN_REQUEST_PLANE=http`
-3. Optionally configure HTTP-specific settings (`DYN_HTTP_RPC_PORT`, etc.)
-4. Restart your services
-
-### Testing the Migration
-
-After switching request planes, verify your deployment:
-
-```bash
-# Test with a simple request
-curl http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "Qwen/Qwen3-0.6B",
-    "messages": [{"role": "user", "content": "Hello!"}]
-  }'
-```
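
The same check can be scripted, for example when verifying several deployments in a row. The sketch below uses only the Python standard library and assumes the frontend address and model from the `curl` command above; `build_chat_request` and `smoke_test` are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(base_url="http://localhost:8000", model="Qwen/Qwen3-0.6B"):
    """Build the same POST request as the curl command, without sending it."""
    payload = {"model": model, "messages": [{"role": "user", "content": "Hello!"}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smoke_test(timeout_s=30.0):
    """Send the request; return True if the frontend answers with HTTP 200."""
    with urllib.request.urlopen(build_chat_request(), timeout=timeout_s) as resp:
        return resp.status == 200
```

Calling `smoke_test()` requires a running deployment; connection failures and non-2xx responses surface as `urllib.error.URLError` or `urllib.error.HTTPError`.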
-
-## Troubleshooting
-
-### Issue: Services Can't Communicate
-
-**Symptoms:** Requests time out or fail to reach the backend
-
-**Solutions:**
- Verify all services use the same `DYN_REQUEST_PLANE` setting
- Check that server ports are not blocked by Kubernetes network policies or firewalls
- For TCP/HTTP: Ensure host/port configurations are correct and accessible
- For NATS: Verify NATS server is running and accessible
-
-### Issue: "Invalid request plane mode" Error
-
-**Symptoms:** Service fails to start with configuration error
-
-**Solutions:**
- Check `DYN_REQUEST_PLANE` spelling (valid values: `nats`, `tcp`, `http`)
- Value is case-insensitive but must be one of the three options
- If not set, defaults to `tcp`
-
-### Issue: Port Conflicts
-
-**Symptoms:** Server fails to start due to "address already in use"
-
-**Solutions:**
- TCP: By default, TCP uses an OS-assigned free port, so port conflicts should be rare. If you explicitly set `DYN_TCP_RPC_PORT` to a specific port and get conflicts, either change the port or remove the setting to use automatic port assignment.
- HTTP default port: 8888 (adjust environment variable `DYN_HTTP_RPC_PORT`)
-
-## Performance Considerations
-
-### Latency
-
- **TCP**: Lowest latency due to direct connections and binary serialization
- **HTTP**: Moderate latency with HTTP/2 overhead
- **NATS**: Moderate latency due to NATS JetStream persistence
-
-
-### Resource Usage
-
- **TCP**: Minimal infrastructure (NATS required only if using KV events, can disable with `--no-kv-events`)
- **HTTP**: Minimal infrastructure (NATS required only if using KV events, can disable with `--no-kv-events`)
- **NATS**: Requires running NATS server (additional memory/CPU)