docs: full migration of docs/ to fern format in fern/ (#6050)

Signed-off-by: Dan Gil <dagil@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>

Unverified Commit 2c3066bd authored Feb 06, 2026 by dagil-nvidia Committed by GitHub Feb 06, 2026
20 changed files
--- a/fern/pages/backends/vllm/deepseek-r1.md
+++ b/fern/pages/backends/vllm/deepseek-r1.md
@@ -9,7 +9,7 @@ Dynamo supports running Deepseek R1 with data parallel attention and wide expert

 ## Instructions

-The following script can be adapted to run Deepseek R1 with a variety of different configuration. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [vLLM Backend](README.md) Getting Started section on each node, and then run these two commands.
+The following script can be adapted to run Deepseek R1 with a variety of different configuration. The current configuration uses 2 nodes, 16 GPUs, and a dp of 16. Follow the [ReadMe](README.md) Getting Started section on each node, and then run these two commands.

 node 0
 ```bash

--- a/fern/pages/backends/vllm/multi-node.md
+++ b/fern/pages/backends/vllm/multi-node.md
@@ -80,7 +80,7 @@ python -m dynamo.frontend --router-mode kv &

 # Start prefill worker
 python -m dynamo.vllm \
-  --model meta-llama/Llama-3.3-70B-Instruct \
+  --model meta-llama/Llama-3.3-70B-Instruct
  --tensor-parallel-size 8 \
  --enforce-eager
 ```
@@ -89,7 +89,7 @@ python -m dynamo.vllm \
 ```bash
 # Start decode worker
 python -m dynamo.vllm \
-  --model meta-llama/Llama-3.3-70B-Instruct \
+  --model meta-llama/Llama-3.3-70B-Instruct
  --tensor-parallel-size 8 \
  --enforce-eager \
  --is-prefill-worker

--- a/fern/pages/backends/vllm/prometheus.md
+++ b/fern/pages/backends/vllm/prometheus.md
@@ -11,7 +11,7 @@ When running vLLM through Dynamo, vLLM engine metrics are automatically passed t

 **For the complete and authoritative list of all vLLM metrics**, always refer to the [official vLLM Metrics Design documentation](https://docs.vllm.ai/en/latest/design/metrics.html).

-**For LMCache metrics and integration**, see the [LMCache Integration Guide](LMCache-Integration.md).
+**For LMCache metrics and integration**, see the [LMCache Integration Guide](../../integrations/lmcache-integration.md).

 **For Dynamo runtime metrics**, see the [Dynamo Metrics Guide](../../observability/metrics.md).

@@ -133,10 +133,10 @@ curl -s localhost:8081/metrics | grep "^lmcache:"

 Troubleshooting LMCache-related metrics and logs (including `PrometheusLogger instance already created with different metadata` and `PROMETHEUS_MULTIPROC_DIR` warnings) is documented in:

-- [LMCache Integration Guide](LMCache-Integration.md#troubleshooting)
+- [LMCache Integration Guide](../../integrations/lmcache-integration.md#troubleshooting)

 **For complete LMCache configuration and metric details**, see:
-- [LMCache Integration Guide](LMCache-Integration.md) - Setup and configuration
+- [LMCache Integration Guide](../../integrations/lmcache-integration.md) - Setup and configuration
 - [LMCache Observability Documentation](https://docs.lmcache.ai/production/observability/vllm_endpoint.html) - Complete metrics reference

 ## Implementation Details

--- a/fern/pages/benchmarks/benchmarking.md
+++ b/fern/pages/benchmarks/benchmarking.md
@@ -3,6 +3,7 @@
 # SPDX-License-Identifier: Apache-2.0
 ---

+
 # Dynamo Benchmarking Guide

 This benchmarking framework lets you compare performance across any combination of:
@@ -64,7 +65,7 @@ The framework is a Python-based wrapper around `aiperf` that:

 ---

-## Client-Side Benchmarking (Local)
+# Client-Side Benchmarking (Local)

 Client-side benchmarking runs on your local machine and connects to Kubernetes deployments via port-forwarding.

@@ -87,10 +88,10 @@ Client-side benchmarking runs on your local machine and connects to Kubernetes d
 Follow these steps to benchmark Dynamo deployments using client-side benchmarking:

 ### Step 1: Establish Kubernetes Cluster and Install Dynamo
-Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform. First follow the [installation guide](../kubernetes/installation-guide.md) to install Dynamo Kubernetes Platform, then use [deploy/utils/README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/utils/README.md) to set up benchmarking resources.
+Set up your Kubernetes cluster with NVIDIA GPUs and install the Dynamo Kubernetes Platform. First follow the [installation guide](../kubernetes/installation-guide.md) to install Dynamo Kubernetes Platform, then use [deploy/utils/README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md) to set up benchmarking resources.

 ### Step 2: Deploy DynamoGraphDeployments
-Deploy your DynamoGraphDeployments separately using the [deployment documentation](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/). Each deployment should have a frontend service exposed.
+Deploy your DynamoGraphDeployments separately using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Each deployment should have a frontend service exposed.

 ### Step 3: Port-Forward and Benchmark Deployment A
 ```bash
@@ -298,7 +299,7 @@ Each concurrency directory contains:

 ---

-## Server-Side Benchmarking (In-Cluster)
+# Server-Side Benchmarking (In-Cluster)

 Server-side benchmarking runs directly within the Kubernetes cluster, eliminating the need for port forwarding and providing better resource utilization.

@@ -316,17 +317,17 @@ The server-side benchmarking solution:
 ## Prerequisites

 1. **Kubernetes cluster** with NVIDIA GPUs and Dynamo namespace setup (see [Dynamo Kubernetes Platform docs](../kubernetes/README.md))
-2. **Storage** PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/tree/main/deploy/utils/README.md))
+2. **Storage** PersistentVolumeClaim configured with appropriate permissions (see [deploy/utils README](https://github.com/ai-dynamo/dynamo/blob/main/deploy/utils/README.md))
 3. **Docker image** containing the Dynamo benchmarking tools

 ## Quick Start

 ### Step 1: Deploy Your DynamoGraphDeployment
-Deploy your DynamoGraphDeployment using the [deployment documentation](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/). Ensure it has a frontend service exposed.
+Deploy your DynamoGraphDeployment using the [deployment documentation](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends). Ensure it has a frontend service exposed.

 ### Step 2: Deploy and Run Benchmark Job

-**Note**: The server-side benchmarking job requires a Docker image containing the Dynamo benchmarking tools. Before the 0.5.1 release, you must build your own Docker image using the [container build instructions](https://github.com/ai-dynamo/dynamo/tree/main/container/README.md), push it to your container registry, then update the `image` field in `benchmarks/incluster/benchmark_job.yaml` to use your built image tag.
+**Note**: The server-side benchmarking job requires a Docker image containing the Dynamo benchmarking tools. Before the 0.5.1 release, you must build your own Docker image using the [container build instructions](https://github.com/ai-dynamo/dynamo/blob/main/container/README.md), push it to your container registry, then update the `image` field in `benchmarks/incluster/benchmark_job.yaml` to use your built image tag.

 ```bash
 export NAMESPACE=benchmarking
@@ -519,7 +520,7 @@ The Python benchmarking module provides a complete end-to-end benchmarking exper

 ## Testing with Mocker Backend

-For development and testing purposes, Dynamo provides a [mocker backend](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/mocker/) that simulates LLM inference without requiring actual GPU resources. This is useful for:
+For development and testing purposes, Dynamo provides a [mocker backend](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) that simulates LLM inference without requiring actual GPU resources. This is useful for:

 - **Testing deployments** without expensive GPU infrastructure
 - **Developing and debugging** router, planner, or frontend logic
@@ -528,4 +529,4 @@ For development and testing purposes, Dynamo provides a [mocker backend](https:/

 The mocker backend mimics the API and behavior of real backends (vLLM, SGLang, TensorRT-LLM) but generates mock responses instead of running actual inference.

-See the [mocker directory](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/mocker/) for usage examples and configuration options.
+See the [mocker directory](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/mocker) for usage examples and configuration options.
--- a/fern/pages/benchmarks/kv-router-ab-testing.md
+++ b/fern/pages/benchmarks/kv-router-ab-testing.md
@@ -3,8 +3,6 @@
 # SPDX-License-Identifier: Apache-2.0
 ---

-# Dynamo KV Smart Router A/B Benchmarking Guide
-
 This guide walks you through setting up and running A/B benchmarks to compare Dynamo's KV Smart Router against standard round-robin routing on a Kubernetes cluster.

 ## Overview
@@ -99,7 +97,7 @@ kubectl create secret generic hf-token-secret \

 ### Step 1.3: Install Dynamo Platform (Per-Namespace)

-If your cluster uses namespace-restricted Dynamo operators, you'll need to install the Dynamo platform in each namespace. Follow the [Dynamo Kubernetes Installation Guide](https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/installation_guide.md) to install the platform in both namespaces:
+If your cluster uses namespace-restricted Dynamo operators, you'll need to install the Dynamo platform in each namespace. Follow the [Dynamo Kubernetes Installation Guide](https://github.com/ai-dynamo/dynamo/blob/main/docs/kubernetes/installation-guide.md) to install the platform in both namespaces:

 - `router-off-test`
 - `router-on-test`

--- a/fern/pages/benchmarks/sla-driven-profiling.md
+++ b/fern/pages/benchmarks/sla-driven-profiling.md
----
-# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
-# SPDX-License-Identifier: Apache-2.0
----
-
-# SLA-Driven Profiling with DynamoGraphDeploymentRequest
-
-> [!TIP]
-> **New to DGDR and SLA-Driven Profiling?** Start with the [SLA-Driven Profiling and Planner Deployment Quick Start Guide](../planner/sla-planner-quickstart.md) for step-by-step instructions. This document provides deeper technical details about the profiling process.
-
-## Overview
-
-Dynamo provides automated SLA-driven profiling through **DynamoGraphDeploymentRequests (DGDR)**. Instead of manually running profiling scripts, you declare your performance requirements and let the Dynamo Operator handle profiling and deployment automatically.
-
-**Key Benefits:**
-- **Declarative**: Specify SLAs, not implementation details
-- **Automated**: No manual job setup or result processing
-- **Integrated**: Seamlessly works with Dynamo Operator
-- **Production-Ready**: Generates optimized configurations with SLA planner
-
-This document covers:
-- Technical details of online vs offline profiling
-- Profiling process internals (GPU usage, measurements, interpolation)
-- Direct script usage for advanced scenarios
-- Comprehensive troubleshooting
-
-## Support Matrix
-
-| Backend | Dense Models | MoE Models |
-|---------|-------------|------------|
-| vLLM | ✅ | 🚧 |
-| SGLang | ✅ | ✅ |
-| TensorRT-LLM | ✅ | 🚧 |
-
-Specifically, the profiler sweeps over the following parallelization mapping for prefill and decode:
-| Model Architecture | Prefill Parallelization Mapping | Decode Parallelization Mapping |
-|---------|-------------|------------|
-| MLA+MoE (DeepseekV3ForCausalLM, DeepseekV32ForCausalLM) | TEP, DEP | TEP, DEP |
-| GQA+MoE (Qwen3MoeForCausalLM) | TP, TEP, DEP | TP, TEP, DEP |
-| Other Models | TP | TP |
-
-> [!NOTE]
-> - Exact model x parallelization mapping support is dependent on the backend. The profiler does not guarantee that the recommended P/D engine configuration is supported and bug-free by the backend.
-
-## Using DGDR for Profiling (Recommended)
-
-The recommended way to profile models is through DGDRs. Sample configurations are provided in `deploy/`:
-
-**Available Samples:**
-- **`profile_sla_dgdr.yaml`**: Standard profiling with AIPerf on real engines
-- **`profile_sla_aic_dgdr.yaml`**: Fast profiling with AI Configurator simulation
-- **`profile_sla_moe_dgdr.yaml`**: MoE model profiling
-
-The Dynamo Operator automatically:
-1. Discovers GPU resources (cluster-scoped operators only)
-2. Runs profiling (AIPerf on real engines or AI Configurator simulation)
-3. Generates optimal DGD configuration with SLA planner
-4. Deploys the DGD to your cluster
-
-See the [Quick Start Guide](../planner/sla-planner-quickstart.md) for prerequisites and detailed instructions.
-
-## Hardware Configuration
-
-Hardware parameters have sensible defaults and are **optional** - you can override them if needed:
-
-```yaml
-profilingConfig:
-  config:
-    # Override hardware defaults if needed
-    hardware:
-      min_num_gpus_per_engine: 1
-      max_num_gpus_per_engine: 8
-      num_gpus_per_node: 8
-
-    # Only needed when using AI Configurator (sweep.use_ai_configurator: true)
-    sweep:
-      aic_system: h200_sxm  # GPU type for AI Configurator (h100_sxm, h200_sxm, etc.)
-```
-
-### Automatic GPU Discovery (Optional Feature)
-
-Cluster-scoped operators can optionally enable automatic GPU discovery to detect hardware from cluster nodes. When enabled, hardware config is auto-detected and overrides any manually specified values.
-
-```yaml
-spec:
-  enableGpuDiscovery: true
-```
-
-This feature is only available with cluster-scoped operators (`namespaceRestriction.enabled=false`) as it requires cluster-wide node access permissions. It is not available for namespace-restricted operators.
-
-## Profiling Method
-
-1. **Hardware Setup**: Uses defaults or user-specified hardware configuration. Optionally, cluster-scoped operators can enable automatic GPU discovery to detect specifications from cluster nodes.
-2. **Identify Sweep Ranges**: Automatically determine minimum and maximum number of GPUs per engine. Minimum is determined by the model size and GPU VRAM. Maximum is set to one node for dense model and 4 nodes for MoE models.
-3. **Parallelization Mapping Sweep**: Use the input ISL and OSL, test the performance of the engines with different parallelization mappings.
-   - For dense models, we test different TP sizes for both prefill and decode.
-   - For MoE models (SGLang), we evaluate both TEP and DEP as candidates for prefill and decode.
-   - **Prefill**:
-     - TP/TEP: We measure TTFT with batch size = 1 (assuming ISL is long enough to saturate compute) without KV reuse.
-     - DEP: Attention uses data parallelism. We send a single burst with total concurrency `attention_dp_size × attn_dp_num_req_ratio` (defaults to 4) and compute the reported TTFT as `time_to_first_token.max / attn_dp_num_req_ratio` from the AIPerf summary of that burst. This stabilizes measurements when the first batch may launch before all requests arrive.
-   ![Prefill Performance](../../assets/img/h100-prefill-performance.png)
-   - **Decode**: Since the ITL (or iteration time) is relevant with how many requests are in-flight, we measure the ITL under different number of in-flight requests. The range of the number of in-flight requests is from 1 to the maximum number of requests that the kv cache of the engine can hold. To measure the ITL without being affected by piggy-backed prefill requests, the script will enable kv-reuse and warm up the engine by issuing the same prompts before measuring the ITL. Since the kv cache is sufficient for all the requests, it can hold the kv cache of the pre-computed prompts and skip the prefill phase when measuring the ITL. However, for MoE models, this is not guaranteed because the kv cache in different attention DP ranks is different. We are working on framework-side change to fix this issue. For example, the below plot shows the decode parallelization mapping sweep results for H100 for deepseek-ai/DeepSeek-R1-Distill-Llama-8B.
-   ![Decode Performance](../../assets/img/h100-decode-performance.png)
-4. **Recommendation**: Selects optimal parallelization mapping for prefill and decode that achieves the highest per GPU throughput while adhering the SLA on TTFT and ITL. Specifically, the profiler will choose the point (or a point on the curve for decode) that is left to the vertical red dashed line that represents the SLAs while has the highest y coordinate (throughput per GPU).
-5. **In-Depth Profiling on the Recommended P/D Engine**: After finding the best TP size for prefill and decode, the script will then interpolate the TTFT with ISL and ITL with active KV cache and decode context length. This is to provide a more accurate estimation of the performance when ISL and OSL changes and will be used in the sla-planner.
-![ITL Interpolation](../../assets/img/pd-interpolation.png)
-   - **Prefill**: Measures TTFT and throughput per GPU across different input lengths with batch size=1.
-   - **Decode**: Measures ITL and throughput per GPU under various KV cache loads and decode context lengths. The active kv usage determines the complexity of the memory-bounded attention kernel while the active kv usage divided the average context length determines the complexity of the computation bound MLP kernel. For example, the below figure shows the ITL of DS-Distilled Llama 8b model on H100 TP4. The ITL grows near-linearly with active kv usage under a fixed context length. And the slope increases as the context length decreases.
-
-
-To run the parallelization mapping sweep and the in-depth profiling on the recommended P/D engine, the profiler need to know the engine's forward pass time with different loads. There are two ways to achieve this: run AIPerf on real engines or use AI Configurator to run simulations.
-
-### AIPerf on Real Engines
-
-Profiles your model by creating real test deployments in Kubernetes and measuring their performance.
-
-**Characteristics:**
-- **Duration**: 2-4 hours
-- **Accuracy**: Highest (real measurements)
-- **GPU Requirements**: Full access to test different parallelization mappings
-- **Backends**: vLLM, SGLang, TensorRT-LLM
-
-**DGDR Configuration:**
-```yaml
-profilingConfig:
-  config:
-    sweep:
-      use_ai_configurator: false  # Default
-```
-
-### AI Configurator Simulation
-
-Uses performance simulation to rapidly estimate optimal configurations without running real deployments.
-
-**Characteristics:**
-- **Duration**: 20-30 seconds
-- **Accuracy**: Estimated (may have errors for unusual configurations)
-- **GPU Requirements**: None
-- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon)
-
-**DGDR Configuration:**
-```yaml
-profilingConfig:
-  config:
-    sweep:
-      use_ai_configurator: true
-    aic:
-      system: h200_sxm          # GPU system type
-      model_name: QWEN3_32B     # AIC model identifier
-      backend_version: "0.20.0"
-```
-
-**Supported Configurations:**
-
-For the current list of supported models, systems, and backend versions, see the [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features).
-
-To check from the command line: `aiconfigurator cli --help`
-
-**Currently supports:**
-- **Backends**: TensorRT-LLM (versions 0.20.0, 1.0.0rc3, 1.0.0rc6)
-- **Systems**: H100 SXM, H200 SXM, B200 SXM, GB200 SXM, A100 SXM
-- **Models**: Wide range including GPT, Llama, Mixtral, DeepSeek, Qwen, and more
-
-### Output Format
-
-After profiling, the DGDR status contains:
-
-1. **Recommended Configuration**: Optimal TP for prefill and decode
-2. **Performance Data**: Interpolation models for SLA planner
-3. **Generated DGD**: Complete deployment manifest
-
-**Example Recommendations:**
-```
-Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
-Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
-```
-
-#### Interactive Configuration Selection WebUI
-
-When running the profiler with `--pick-with-webui`, an interactive web interface is launched that allows you to visually explore profiling results and manually select configurations.
-
-**Features:**
-- **Interactive Charts**: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables
-- **Pareto-Optimal Analysis**: The GPU Hours table shows pareto-optimal configurations balancing latency and throughput
-- **DGD Config Preview**: Click "Show Config" on any row to view the corresponding DynamoGraphDeployment YAML
-- **GPU Cost Estimation**: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests)
-- **SLA Visualization**: Red dashed lines indicate your TTFT and ITL targets
-
-**Selection Methods:**
-1. **GPU Hours Table** (recommended): Click any row to select both prefill and decode configurations at once based on the pareto-optimal combination
-2. **Individual Selection**: Click one row in the Prefill table AND one row in the Decode table to manually choose each
-
-**Example DGD Config Output:**
-
-When you click "Show Config", you'll see a DynamoGraphDeployment configuration like:
-
-```yaml
-# DynamoGraphDeployment Configuration
-# Prefill: 1 GPU(s), TP=1
-# Decode: 4 GPU(s), TP=4
-# Model: Qwen/Qwen3-32B-FP8
-# Backend: trtllm
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeployment
-spec:
-  services:
-    PrefillWorker:
-      subComponentType: prefill
-      replicas: 1
-      extraPodSpec:
-        mainContainer:
-          args:
-          - --tensor-parallel-size=1
-    DecodeWorker:
-      subComponentType: decode
-      replicas: 1
-      extraPodSpec:
-        mainContainer:
-          args:
-          - --tensor-parallel-size=4
-```
-
-**Usage:**
-```bash
-python -m benchmarks.profiler.profile_sla \
-  --backend trtllm \
-  --config path/to/disagg.yaml \
-  --pick-with-webui \
-  --use-ai-configurator \
-  --model Qwen/Qwen3-32B-FP8 \
-  --aic-system h200_sxm \
-  --ttft 200 --itl 15
-```
-
-Once you have selected a configuration, the full DynamoGraphDeployment CRD will be saved in your output folder as `config_with_planner.yaml`.
-
-The WebUI launches on port 8000 by default (configurable with `--webui-port`).
-
-#### Output Performance Plots
-
-The profiler will generate the following plots to better visualize the performance data:
-
-**Parallelization Mapping Sweep Plots:**
-- `prefill_performance.png`: TTFT vs Parallelization Mapping size
-- `decode_performance.png`: ITL vs Parallelization Mapping size and in-flight requests
-
-Note these two plots are based on the input ISL and OSL.
-
-**In-Depth Profiling for the Recommended P/D Engine Plots:**
-- `selected_prefill_interpolation/prefill_ttft_interpolation.png`: TTFT vs ISL for the recommended prefill engine
-- `selected_prefill_interpolation/prefill_throughput_interpolation.png`: Throughput vs ISL for the recommended prefill engine
-- `selected_decode_interpolation/decode_itl_interplation.png`: ITL vs KV usage and context length for the recommended decode engine
-- `selected_decode_interpolation/decode_throughput_interpolation.png`: Throughput vs KV usage and context length for the recommended decode engine
-
-
-### Output Interpolation Data
-
-The profiler generates `.npz` files to store the performance data for the recommended P/D engine:
-
-**Prefill Interpolation** (`selected_prefill_interpolation/raw_data.npz`):
-- `prefill_isl`: 1D array of input sequence lengths tested
-- `prefill_ttft`: 1D array of TTFTs (ms) at each ISL
-- `prefill_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each ISL
-
-**Decode Interpolation** (`selected_decode_interpolation/raw_data.npz`):
-- `max_kv_tokens`: Total KV tokens capacity in decode engine
-- `x_kv_usage`: 1D array of active KV usage percentages [0, 1]
-- `y_context_length`: 1D array of average context lengths tested
-- `z_itl`: 1D array of ITLs (ms) at each (KV usage, context length) point
-- `z_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each point
-
-## DGDR Configuration Reference
-
-This section provides detailed explanations of all DGDR `profilingConfig` options. The DGDR controller passes this configuration to the profiler script, which is defined in `benchmarks/profiler/utils/profiler_argparse.py`.
-
-### Configuration Structure
-
-All profiler configuration goes under `spec.profilingConfig.config`:
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: my-deployment
-spec:
-  model: "Qwen/Qwen3-0.6B"         # High-level: model to deploy
-  backend: vllm                    # High-level: inference backend
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"  # Required
-    configMapRef:                  # Optional: base DGD config
-      name: my-config
-      key: disagg.yaml
-
-    config:                        # Profiler configuration
-      sla: { ... }
-      hardware: { ... }
-      sweep: { ... }
-      aic: { ... }
-      planner: { ... }
-
-  deploymentOverrides:             # Optional
-    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
-```
-
-### SLA Configuration (Required)
-
-Define your performance requirements and workload characteristics:
-
-```yaml
-profilingConfig:
-  config:
-    sla:
-      isl: 3000      # Average input sequence length (tokens)
-      osl: 150       # Average output sequence length (tokens)
-      ttft: 200.0    # Target Time To First Token (milliseconds)
-      itl: 20.0      # Target Inter-Token Latency (milliseconds)
-```
-
-**What these control:**
-- **ISL/OSL**: Based on your expected traffic patterns
-- **TTFT**: First token latency target (lower = more GPUs needed, affects prefill engine)
-- **ITL**: Token generation latency target (lower = more GPUs needed, affects decode engine)
-- **Trade-offs**: Tighter SLAs require more GPU resources
-
-### Hardware Configuration (Optional)
-
-Control GPU search space and constraints:
-
-```yaml
-profilingConfig:
-  config:
-    hardware:
-      min_num_gpus_per_engine: 2      # if not provided, will automatically determine based on model and VRAM size
-      max_num_gpus_per_engine: 8      # Maximum GPUs to test
-      num_gpus_per_node: 8            # GPUs per node (for multi-node MoE)
-      gpu_type: h200_sxm              # GPU type hint
-```
-
-**When to use:**
-- **min_num_gpus_per_engine**: Skip small TP sizes if your model is large
-- **max_num_gpus_per_engine**: Limit search space or work around constraints (e.g., [AIC attention heads](#ai-configurator-attention-head-constraint-error))
-- **num_gpus_per_node**: Determine the upper bound of number of GPUs per node for dense models and configure Grove for multi-node MoE engines.
-- **gpu_type**: Informational, auto-detected by controller
-
-> [!TIP]
-> If you don't specify hardware constraints, the controller auto-detects based on your model size and available cluster resources.
-
-### Sweep Configuration (Optional)
-
-Control profiling behavior:
-
-```yaml
-profilingConfig:
-  config:
-    sweep:
-      use_ai_configurator: false              # Use offline profiling (default: false)
-      prefill_interpolation_granularity: 16   # Samples for prefill TTFT curve
-      decode_interpolation_granularity: 6     # Samples for decode ITL curve
-```
-
-**Use cases:**
-- **use_ai_configurator**: Set to `true` for 20-30 second profiling (TensorRT-LLM only)
-- **prefill_interpolation_granularity**: How many samples to benchmark for prefill TTFT curve (lower = faster but may be less accurate)
-- **decode_interpolation_granularity**: How many samples to benchmark for decode ITL curve (lower = faster but may be less accurate). Since ITL interpolation is a 3d plot and takes longer to run, we default to a smaller number of samples. Increasing this value might quadratically increase the profiling time.
-
-### AI Configurator Configuration (Required if `use_ai_configurator: true`)
-
-Configure AI Configurator profiling mode:
-
-```yaml
-profilingConfig:
-  config:
-    sweep:
-      use_ai_configurator: true
-      aic_system: h200_sxm              # GPU system: h100_sxm, h200_sxm, b200_sxm, gb200_sxm, a100_sxm
-      aic_hf_id: Qwen/Qwen3-32B         # Huggingface model id
-      aic_backend_version: "0.20.0"     # TensorRT-LLM version: 0.20.0, 1.0.0rc3
-```
-
-**Supported configurations:** See [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features)
-
-### Planner Configuration (Optional)
-
-Pass arguments to the SLA planner:
-
-```yaml
-profilingConfig:
-  config:
-    planner:
-      planner_min_endpoint: 2                    # Minimum endpoints to maintain
-      planner_adjustment_interval: 60            # Adjustment interval (seconds)
-      planner_load_predictor: linear             # Load prediction method
-```
-
-> [!NOTE]
-> Planner arguments use `planner_` prefix. See planner documentation for full list.
-
-### Engine Configuration (Auto-configured)
-
-The controller automatically sets these from high-level fields:
-
-```yaml
-# You specify:
-spec:
-  model: "Qwen/Qwen3-0.6B"
-  backend: vllm
-
-# Controller auto-injects into config:
-profilingConfig:
-  config:
-    deployment:
-      model: "Qwen/Qwen3-0.6B"       # From spec.model
-    engine:
-      backend: vllm                  # From spec.backend
-      config: /path/to/configmap     # From spec.profilingConfig.configMapRef (if provided)
-```
-
-**You should not manually set** `deployment.model` or `engine.backend` in `profilingConfig.config` - they are automatically injected from the high-level fields.
-
-### Complete Example: AIPerf on Real Engines
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: vllm-dense-online
-spec:
-  model: "Qwen/Qwen3-0.6B"
-  backend: vllm
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
-    config:
-      sla:
-        isl: 3000
-        osl: 150
-        ttft: 200.0
-        itl: 20.0
-
-      hardware:
-        min_num_gpus_per_engine: 1
-        max_num_gpus_per_engine: 8
-
-      sweep:
-        use_ai_configurator: false
-
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
-
-  autoApply: true
-```
-
-### Complete Example: AI Configurator Simulation
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: trtllm-aic-offline
-spec:
-  model: "Qwen/Qwen3-32B"
-  backend: trtllm
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1"
-    config:
-      sla:
-        isl: 4000
-        osl: 500
-        ttft: 300.0
-        itl: 10.0
-
-      sweep:
-        use_ai_configurator: true
-
-      aic:
-        system: h200_sxm
-        model_name: QWEN3_32B
-        backend_version: "0.20.0"
-
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1"
-
-  autoApply: true
-```
-
-### Complete Example: MoE Model
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: sglang-moe
-spec:
-  model: "deepseek-ai/DeepSeek-R1"
-  backend: sglang
-
-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
-    config:
-      sla:
-        isl: 2048
-        osl: 512
-        ttft: 300.0
-        itl: 25.0
-
-      hardware:
-        num_gpus_per_node: 8
-        max_num_gpus_per_engine: 32
-
-      engine:
-        is_moe_model: true       # Enable MoE profiling mode
-
-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
-
-  autoApply: true
-```
-
-## Troubleshooting
-
-### Profiling Takes Too Long
-
-**Solution 1**: Use AI Configurator for rapid profiling (TensorRT-LLM only):
-```yaml
-sweep:
-  use_ai_configurator: true
-```
-
-**Solution 2**: Reduce search space:
-```yaml
-config:
-  sweep:
-    min_num_gpus: 4  # Skip TP1, TP2
-    max_num_gpus: 8  # Don't test beyond TP8
-```
-
-### SLA Cannot Be Met
-
-**Symptoms**: Profiler reports no configuration meets targets
-
-**Solutions:**
-1. Relax SLA targets (increase TTFT/ITL)
-2. Add more GPU resources
-3. Try a different backend
-4. Use a smaller model
-
-### AI Configurator: Attention Head Constraint Error
-
-**Symptoms**: Profiling fails with error:
-```
-AssertionError: num_heads <N> should be divisible by tp_size <M> and the division result should be >= 4
-```
-
-**Cause**: AI Configurator requires **≥4 attention heads per GPU**. Small models with few heads cannot use high TP sizes.
-
-**Affected Models:**
-- **Qwen3-0.6B** (16 heads): Max TP = 4 ❌ Fails at TP=8
-- **GPT-2** (12 heads): Max TP = 3
-- Most models **\<1B parameters**: May hit this constraint
-
-**Solution**: Limit `max_num_gpus_per_engine` in your DGDR:
-
-```yaml
-profilingConfig:
-  profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.1"
-  config:
-    hardware:
-      max_num_gpus_per_engine: 4  # For Qwen3-0.6B (16 heads / 4 = max TP of 4)
-    sweep:
-      use_ai_configurator: true
-    aic:
-      system: h200_sxm
-      model_name: QWEN3_0_6B
-```
-
-**Calculate Max TP**: `max_tp = num_attention_heads / 4`
-
-> **Note**: This is an AI Configurator limitation. Online profiling doesn't have this constraint.
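The constraint in the assertion message above can be checked before submitting a DGDR. The following is a minimal sketch, not AI Configurator code: `max_tp_for_heads` is a hypothetical helper that applies the two conditions from the error text (the head count must be divisible by the TP size, and the quotient must be at least 4).

```python
def max_tp_for_heads(num_heads: int, min_heads_per_gpu: int = 4) -> int:
    """Return the largest TP size that satisfies the constraint from the error message.

    Hypothetical helper for illustration: a TP size is valid when it divides
    num_heads evenly and leaves at least min_heads_per_gpu heads on each GPU.
    """
    valid_tp_sizes = [
        tp
        for tp in range(1, num_heads + 1)
        if num_heads % tp == 0 and num_heads // tp >= min_heads_per_gpu
    ]
    if not valid_tp_sizes:
        raise ValueError(
            f"no TP size leaves >= {min_heads_per_gpu} heads per GPU for {num_heads} heads"
        )
    return max(valid_tp_sizes)
```

For the models listed above, this gives 4 for 16 heads and 3 for 12 heads, which is the value to use for `max_num_gpus_per_engine`.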
-
-### Image Pull Errors
-
-**Symptoms**: `ErrImagePull` or `ImagePullBackOff`
-
-**Solution**: Ensure image pull secrets are configured:
-```bash
-kubectl create secret docker-registry nvcr-imagepullsecret \
-  --docker-server=nvcr.io \
-  --docker-username='$oauthtoken' \
-  --docker-password=<NGC_API_KEY> \
-  --namespace <your-namespace>
-```
-
-### Out of Memory During Profiling
-
-**Symptoms**: OOM errors in profiling jobs
-
-**Solutions:**
-1. Reduce `gpu_memory_utilization` in engine config
-2. Reduce `--max-context-length`
-3. Skip larger TP configurations
-4. Use fewer GPUs per test
-
-### Unsupported Parallelization Mapping in Backend
-
-**Symptoms**: Startup or runtime error in the backend. For example, a prime number of attention heads restricts the TP size to 1 (e.g., falcon-7b with 71 attention heads), or the backend does not support different TP sizes for prefill and decode.
-
-**Solutions:**
-1. Ask the backend maintainers to add support for the use case, then bump the backend version in Dynamo.
-2. Restrict the minimum and maximum number of GPUs per engine to the supported range.
-
-## Next Steps
-
-- **Deploy with DGDR**: See [Quick Start Guide](../planner/sla-planner-quickstart.md)
-- **Understand SLA Planner**: Read [SLA Planner Deep Dive](../planner/sla-planner.md)
-- **Monitor Deployments**: Set up [Observability](../kubernetes/observability/metrics.md)
-- **Optimize Performance**: See [Performance Tuning](../performance/tuning.md)
-
-## Related Documentation
-
-- [DGDR API Reference](../kubernetes/api-reference.md)
-- [SLA Planner Quick Start](../planner/sla-planner-quickstart.md)
-- [SLA Planner Architecture](../planner/sla-planner.md)
-- [Profiler Arguments Reference](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/profiler/utils/profiler_argparse.py)
--- a/fern/pages/components/frontend/README.md
+++ b/fern/pages/components/frontend/README.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+---
+
+# Frontend
+
+The Dynamo Frontend is the API gateway for serving LLM inference requests. It provides OpenAI-compatible HTTP endpoints and KServe gRPC endpoints, handling request preprocessing, routing, and response formatting.
+
+## Feature Matrix
+
+| Feature | Status |
+|---------|--------|
+| OpenAI Chat Completions API | ✅ Supported |
+| OpenAI Completions API | ✅ Supported |
+| KServe gRPC v2 API | ✅ Supported |
+| Streaming responses | ✅ Supported |
+| Multi-model serving | ✅ Supported |
+| Integrated routing | ✅ Supported |
+| Tool calling | ✅ Supported |
+
+## Quick Start
+
+### Prerequisites
+
+- Dynamo platform installed
+- `etcd` and `nats-server -js` running
+- At least one backend worker registered
+
+### HTTP Frontend
+
+```bash
+python -m dynamo.frontend --http-port 8000
+```
+
+This starts an OpenAI-compatible HTTP server with integrated preprocessing and routing. Backends are auto-discovered when they call `register_llm`.
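Because the server is OpenAI-compatible, a client can talk to it with nothing but the standard library. The sketch below is illustrative only: `build_chat_request` and `send_chat_request` are hypothetical helpers, and the model name, port, and `max_tokens` value are assumptions that must match your own deployment.

```python
import json
import urllib.request


def build_chat_request(
    model: str, prompt: str, base_url: str = "http://localhost:8000"
) -> urllib.request.Request:
    """Build a POST request for the OpenAI-style chat completions endpoint."""
    payload = {
        "model": model,  # assumption: must match a model registered by a backend
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
        "max_tokens": 30,
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request: urllib.request.Request) -> str:
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(request, timeout=30) as response:
        body = json.loads(response.read())
    return body["choices"][0]["message"]["content"]
```

With a frontend and at least one worker running, `send_chat_request(build_chat_request("Qwen/Qwen3-0.6B", "Hello, how are you?"))` should return the generated reply.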
+
+### KServe gRPC Frontend
+
+```bash
+python -m dynamo.frontend --kserve-grpc-server
+```
+
+See the [Frontend Guide](frontend-guide.md) for KServe-specific configuration and message formats.
+
+### Kubernetes
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: frontend-example
+spec:
+  graphs:
+    - name: frontend
+      replicas: 1
+      services:
+        - name: Frontend
+          image: nvcr.io/nvidia/dynamo/dynamo-vllm:latest
+          command:
+            - python
+            - -m
+            - dynamo.frontend
+            - --http-port
+            - "8000"
+```
+
+## Configuration
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `--http-port` | 8000 | HTTP server port |
+| `--kserve-grpc-server` | false | Enable KServe gRPC server |
+| `--router-mode` | `round_robin` | Routing strategy: `round_robin`, `random`, `kv` |
+
+See the [Frontend Guide](frontend-guide.md) for full configuration options.
+
+## Next Steps
+
+| Document | Description |
+|----------|-------------|
+| [Frontend Guide](frontend-guide.md) | KServe gRPC configuration and integration |
+| [Router Documentation](../router/README.md) | KV-aware routing configuration |
--- a/fern/pages/frontends/kserve.md
+++ b/fern/pages/frontends/kserve.md
@@ -3,11 +3,15 @@
 # SPDX-License-Identifier: Apache-2.0
 ---

-# KServe gRPC frontend
+# Frontend Guide

-## Motivation
+This guide covers the KServe gRPC frontend configuration and integration for the Dynamo Frontend.

-[KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry standard protocol for machine learning model inference. Triton inference server is one of the inference solutions that comply with KServe v2 API and it has gained a lot of adoption. To quickly enable Triton users to explore with Dynamo benefits, Dynamo provides a KServe gRPC frontend.
+## KServe gRPC Frontend
+
+### Motivation
+
+[KServe v2 API](https://github.com/kserve/kserve/tree/master/docs/predict-api/v2) is one of the industry-standard protocols for machine learning model inference. Triton Inference Server is one of the inference solutions that complies with the KServe v2 API, and it has been widely adopted. To help Triton users quickly explore Dynamo's benefits, Dynamo provides a KServe gRPC frontend.

 This documentation assumes readers are familiar with the KServe v2 API. It focuses on the Dynamo components that work together to support the KServe API and on how users can migrate an existing KServe deployment to Dynamo.

@@ -20,8 +24,9 @@ This documentation assumes readers are familiar with the usage of KServe v2 API

 ## Starting the Frontend

-To start the KServe frontend, run the below command
-```
+To start the KServe frontend, run the following command:
+
+```bash
 python -m dynamo.frontend --kserve-grpc-server
 ```

@@ -45,54 +50,58 @@ python -m dynamo.frontend --kserve-grpc-server

 If these variables are not set, the server uses tonic's default values.

-> **Note**: Tune these values based on your workload. Connection window should accommodate `concurrent_requests × request_size`. Memory overhead equals the connection window size (shared across all streams). See [gRPC performance best practices](https://grpc.io/docs/guides/performance/) and [gRPC channel arguments](https://grpc.github.io/grpc/core/group__grpc__arg__keys.html) for more details.
+<Note>
+Tune these values based on your workload. Connection window should accommodate `concurrent_requests x request_size`. Memory overhead equals the connection window size (shared across all streams). See [gRPC performance best practices](https://grpc.io/docs/guides/performance/) and [gRPC channel arguments](https://grpc.github.io/grpc/core/group__grpc__arg__keys.html) for more details.
+</Note>
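As a rough worked example of the sizing rule in the note above, the connection window can be estimated from the expected load. This is a sketch under stated assumptions: the function name and the example figures (64 concurrent requests of 256 KiB each) are hypothetical and not taken from Dynamo.

```python
def min_connection_window_bytes(concurrent_requests: int, request_size_bytes: int) -> int:
    """Lower bound on the connection window: concurrent_requests x request_size."""
    return concurrent_requests * request_size_bytes


# Hypothetical workload: 64 in-flight requests, each carrying 256 KiB of tensor data.
window_bytes = min_connection_window_bytes(64, 256 * 1024)
window_mib = window_bytes / (1024 * 1024)
```

Here `window_bytes` is 16777216 (16 MiB), which per the note is also the approximate memory overhead per connection.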

 ## Registering a Backend

 Similar to the HTTP frontend, a registered backend is auto-discovered and added to the frontend's list of served models. To register a backend, use the same `register_llm()` API. Currently, the frontend supports the following model type and model input combinations:
+
 * `ModelType::Completions` and `ModelInput::Text`: Combination for LLM backend that uses custom preprocessor
 * `ModelType::Completions` and `ModelInput::Token`: Combination for LLM backend that uses Dynamo preprocessor (i.e. Dynamo vLLM / SGLang / TRTLLM backend)
-* `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor based inference
+* `ModelType::TensorBased` and `ModelInput::Tensor`: Combination for backend that is used for generic tensor-based inference

 The first two combinations are backed by the OpenAI Completions API; see the [OpenAI Completions section](#openai-completions) for details. The last combination is most closely aligned with the KServe API, and users can replace an existing deployment with Dynamo once their backend implements an adaptor for `NvCreateTensorRequest/NvCreateTensorResponse`; see the [Tensor section](#tensor) for details.

 ### OpenAI Completions

-Most of the Dynamo features are tailored for LLM inference and the combinations that are backed by OpenAI API can enable those features and are best suited for exploring those Dynamo features. However, this implies specific conversion between generic tensor based messages and OpenAI message and imposes specific structure of the KServe request message.
+Most Dynamo features are tailored for LLM inference, so the combinations backed by the OpenAI API enable those features and are best suited for exploring them. However, this requires a specific conversion between generic tensor-based messages and OpenAI messages, and it imposes a specific structure on the KServe request message.

 #### Model Metadata / Config

 The metadata and config endpoints report the registered backend with fields like the following. Note that this is not the exact response.
-```
+
+```json
 {
-    name: $MODEL_NAME,
-    version: 1,
-    platform: "dynamo",
-    backend: "dynamo", # model config specific
-    inputs: [
+    "name": "$MODEL_NAME",
+    "version": 1,
+    "platform": "dynamo",
+    "backend": "dynamo",
+    "inputs": [
        {
-            name: "text_input",
-            datatype: "BYTES",
-            shape: [1]
+            "name": "text_input",
+            "datatype": "BYTES",
+            "shape": [1]
        },
        {
-            name: "streaming",
-            datatype: "BOOL",
-            shape: [1],
-            optional: true
+            "name": "streaming",
+            "datatype": "BOOL",
+            "shape": [1],
+            "optional": true
        }
-    ]
-    outputs: [
+    ],
+    "outputs": [
        {
-            name: "text_output",
-            datatype: "BYTES",
-            shape: [-1]
+            "name": "text_output",
+            "datatype": "BYTES",
+            "shape": [-1]
        },
        {
-            name: "finish_reason",
-            datatype: "BYTES",
-            shape: [-1],
-            optional: true
+            "name": "finish_reason",
+            "datatype": "BYTES",
+            "shape": [-1],
+            "optional": true
        }
    ]
 }
@@ -101,26 +110,57 @@ The metadata and config endpoint will report the registered backend to have the
 #### Inference

 On receiving an inference request, the following conversion is performed:
+
 * `text_input`: the element is expected to contain the user prompt string and will be converted to `prompt` field in OpenAI Completion request
 * `streaming`: the element will be converted to `stream` field in OpenAI Completion request
+
 On receiving a model response, the following conversion is performed:
+
 * `text_output`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `text` of the choice.
 * `finish_reason`: each element corresponds to one choice in OpenAI Completion response, and the content will be set to `finish_reason` of the choice.
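The two conversions above can be sketched with plain dictionaries. This is an illustration of the field mapping only, not Dynamo's implementation: both function names are hypothetical, and tensors are modelled as Python lists keyed by tensor name.

```python
from typing import Any


def kserve_inputs_to_completion_request(
    model: str, inputs: dict[str, list[Any]]
) -> dict[str, Any]:
    """Map KServe input tensors to the fields of an OpenAI Completion request."""
    request: dict[str, Any] = {"model": model, "prompt": inputs["text_input"][0]}
    # `streaming` is optional in the model metadata, so only set `stream` when present.
    if "streaming" in inputs:
        request["stream"] = bool(inputs["streaming"][0])
    return request


def completion_response_to_kserve_outputs(
    response: dict[str, Any]
) -> dict[str, list[Any]]:
    """Map each choice of an OpenAI Completion response to one output element."""
    choices = response["choices"]
    return {
        "text_output": [choice["text"] for choice in choices],
        "finish_reason": [choice.get("finish_reason") for choice in choices],
    }
```

Each output tensor has one element per choice, which matches the `[-1]` shape reported for `text_output` and `finish_reason` in the model metadata.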

 ### Tensor

-This combination is used when the user is migrating an existing KServe based backend into Dynamo ecosystem.
+This combination is used when the user is migrating an existing KServe-based backend into the Dynamo ecosystem.

 #### Model Metadata / Config

-When registering the backend, the backend must provide the model's metadata as tensor based deployment is generic and the frontend can't make any assumptions like for OpenAI Completions model. There are two methods to provide model metadata:
-* [TensorModelConfig](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs): This is Dynamo defined structure for model metadata, the backend can provide the model metadata as shown in this [example](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/tests/test_tensor.py). For metadata provided in such way, the following field will be set to a fixed value: `version: 1`, `platform: "dynamo"`, `backend: "dynamo"`. Note that for model config endpoint, the rest of the fields will be set to their default values.
-* [triton_model_config](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs): For users that already have Triton model config and require the full config to be returned for client side logic, they can set the config in `TensorModelConfig::triton_model_config` which will supersedes other fields in `TensorModelConfig` and be used for endpoint responses. `triton_model_config` is expected to be the serialized string of the `ModelConfig` protobuf message, see [echo_tensor_worker.py](https://github.com/ai-dynamo/dynamo/blob/main/tests/frontend/grpc/echo_tensor_worker.py) for example.
+When registering the backend, the backend must provide the model's metadata, because a tensor-based deployment is generic and the frontend can't make the assumptions it makes for an OpenAI Completions model. There are two ways to provide model metadata:
+
+* [TensorModelConfig](https://github.com/ai-dynamo/dynamo/tree/main/lib/llm/src/protocols/tensor.rs): This is a Dynamo-defined structure for model metadata; the backend can provide the model metadata as shown in this [example](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/tests/test_tensor.py). For metadata provided this way, the following fields are set to fixed values: `version: 1`, `platform: "dynamo"`, `backend: "dynamo"`. Note that for the model config endpoint, the remaining fields are set to their default values.
+* [triton_model_config](https://github.com/ai-dynamo/dynamo/tree/main/lib/llm/src/protocols/tensor.rs): Users who already have a Triton model config and need the full config returned for client-side logic can set it in `TensorModelConfig::triton_model_config`, which supersedes the other fields in `TensorModelConfig` and is used for endpoint responses. `triton_model_config` is expected to be the serialized string of the `ModelConfig` protobuf message; see [echo_tensor_worker.py](https://github.com/ai-dynamo/dynamo/blob/main/tests/frontend/grpc/echo_tensor_worker.py) for an example.

 #### Inference

-When receiving inference request, the backend will receive [NvCreateTensorRequest](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs) and be expected to return [NvCreateTensorResponse](https://github.com/ai-dynamo/dynamo/blob/main/lib/llm/src/protocols/tensor.rs), which are the mapping of ModelInferRequest / ModelInferResponse protobuf message in Dynamo.
+When an inference request arrives, the backend receives an [NvCreateTensorRequest](https://github.com/ai-dynamo/dynamo/tree/main/lib/llm/src/protocols/tensor.rs) and is expected to return an [NvCreateTensorResponse](https://github.com/ai-dynamo/dynamo/tree/main/lib/llm/src/protocols/tensor.rs); these are Dynamo's mappings of the ModelInferRequest / ModelInferResponse protobuf messages.

 ## Python Bindings

-The frontend may be started via Python binding, this is useful when integrating Dynamo in existing system that desire the frontend to be run in the same process with other components. See [server.py](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/kserve_grpc_service/server.py) for example.
+The frontend can also be started via the Python bindings, which is useful when integrating Dynamo into an existing system that needs the frontend to run in the same process as other components. See [server.py](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/examples/kserve_grpc_service/server.py) for an example.
+
+## Integration
+
+### With Router
+
+The frontend includes an integrated router for request distribution. Configure routing mode:
+
+```bash
+python -m dynamo.frontend --router-mode kv --http-port 8000
+```
+
+See [Router Documentation](../router/README.md) for routing configuration details.
+
+### With Backends
+
+Backends auto-register with the frontend when they call `register_llm()`. Supported backends:
+
+- [vLLM Backend](../../backends/vllm/README.md)
+- [SGLang Backend](../../backends/sglang/README.md)
+- [TensorRT-LLM Backend](../../backends/trtllm/README.md)
+
+## See Also
+
+| Document | Description |
+|----------|-------------|
+| [Frontend Overview](README.md) | Quick start and feature matrix |
+| [Router Documentation](../router/README.md) | KV-aware routing configuration |
--- a/fern/pages/components/kvbm/README.md
+++ b/fern/pages/components/kvbm/README.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+---
+
+# KV Block Manager (KVBM)
+
+The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM and TensorRT-LLM.
+
+KVBM offers:
+- A **unified memory API** spanning GPU memory, pinned host memory, remote RDMA-accessible memory, local/distributed SSDs, and remote file/object/cloud storage systems
+- Support for **block lifecycles** (allocate → register → match) with event-based state transitions
+- Integration with **[NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)**, a dynamic memory exchange layer for remote registration, sharing, and access of memory blocks
+
+> **Get started:** See the [KVBM Guide](kvbm-guide.md) for installation and deployment instructions.
+
+## When to Use KV Cache Offloading
+
+KV Cache offloading avoids expensive KV Cache recomputation, resulting in faster response times and better user experience. Providers benefit from higher throughput and lower cost per token, making inference services more scalable and efficient.
+
+Offloading KV cache to CPU or storage is most effective when KV Cache exceeds GPU memory and cache reuse outweighs the overhead of transferring data. It is especially valuable in:
+
+| Scenario | Benefit |
+|----------|---------|
+| **Long sessions and multi-turn conversations** | Preserves large prompt prefixes, avoids recomputation, improves first-token latency and throughput |
+| **High concurrency** | Idle or partial conversations can be moved out of GPU memory, allowing active requests to proceed without hitting memory limits |
+| **Shared or repeated content** | Reuse across users or sessions (system prompts, templates) increases cache hits, especially with remote or cross-instance sharing |
+| **Memory- or cost-constrained deployments** | Offloading to RAM or SSD reduces GPU demand, allowing longer prompts or more users without adding hardware |
+
+## Feature Support Matrix
+
+|  | Feature | Support |
+|--|---------|---------|
+| **Backend** | Local | ✅ |
+|  | Kubernetes | ✅ |
+| **LLM Framework** | vLLM | ✅ |
+|  | TensorRT-LLM | ✅ |
+|  | SGLang | ❌ |
+| **Serving Type** | Aggregated | ✅ |
+|  | Disaggregated | ✅ |
+
+## Architecture
+
+![KVBM Architecture](/assets/img/kvbm-architecture.png)
+*High-level layered architecture view of Dynamo KV Block Manager and how it interfaces with different components of the LLM inference ecosystem*
+
+KVBM has three primary logical layers:
+
+**LLM Inference Runtime Layer** — The top layer includes inference runtimes (TensorRT-LLM, vLLM) that integrate through dedicated connector modules to the Dynamo KVBM. These connectors act as translation layers, mapping runtime-specific operations and events into KVBM's block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and memory tiering.
+
+**KVBM Logic Layer** — The middle layer encapsulates core KV block manager logic and serves as the runtime substrate for managing block memory. The KVBM adapter normalizes representations and data layout for incoming requests across runtimes and forwards them to the core memory manager. This layer implements table lookups, memory allocation, block layout management, lifecycle state transitions, and block reuse/eviction policies.
+
+**NIXL Layer** — The bottom layer provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, RDMA and NVLink remote memory sharing, dynamic block registration and metadata exchange, and provides a plugin interface for storage backends including block memory (GPU HBM, Host DRAM, Remote DRAM, Local SSD), local/remote filesystems, object stores, and cloud storage.
+
+> **Learn more:** See the [KVBM Design Document](../../design-docs/kvbm-design.md) for detailed architecture, components, and data flows.
+
+## Next Steps
+
+- **[KVBM Guide](kvbm-guide.md)** — Installation, configuration, and deployment instructions
+- **[KVBM Design](../../design-docs/kvbm-design.md)** — Architecture deep dive, components, and data flows
+- **[LMCache Integration](../../integrations/lmcache-integration.md)** — Use LMCache with Dynamo vLLM backend
+- **[FlexKV Integration](../../integrations/flexkv-integration.md)** — Use FlexKV for KV cache management
+- **[SGLang HiCache](../../integrations/sglang-hicache.md)** — Enable SGLang's hierarchical cache with NIXL
+- **[NIXL Documentation](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** — NIXL communication library details
--- a/fern/pages/components/kvbm/kvbm-guide.md
+++ b/fern/pages/components/kvbm/kvbm-guide.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+---
+
+# KVBM Guide
+The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM and TensorRT-LLM.
+
+KVBM is modular and can be used standalone via `pip install kvbm` or as the memory management component in the full Dynamo stack. This guide covers installation, configuration, and deployment of the Dynamo KV Block Manager (KVBM) and other KV cache management systems.
+
+## Table of Contents
+
+- [Quick Start](#quick-start)
+- [Run KVBM Standalone](#run-kvbm-standalone)
+- [Run KVBM in Dynamo with vLLM](#run-kvbm-in-dynamo-with-vllm)
+- [Run KVBM in Dynamo with TensorRT-LLM](#run-kvbm-in-dynamo-with-tensorrt-llm)
+- [Run Dynamo with SGLang HiCache](#run-dynamo-with-sglang-hicache)
+- [Disaggregated Serving with KVBM](#disaggregated-serving-with-kvbm)
+- [Configuration](#configuration)
+- [Enable and View KVBM Metrics](#enable-and-view-kvbm-metrics)
+- [Benchmarking KVBM](#benchmarking-kvbm)
+- [Troubleshooting](#troubleshooting)
+- [Developing Locally](#developing-locally)
+
+## Quick Start
+
+## Run KVBM Standalone
+
+KVBM can be used independently without using the rest of the Dynamo stack:
+
+```bash
+pip install kvbm
+```
+
+See the [support matrix](../../reference/support-matrix.md) for version compatibility.
+
+### Build from Source
+
+To build KVBM from source, see the detailed instructions in the [KVBM bindings README](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/kvbm/README.md#build-from-source).
+
+## Run KVBM in Dynamo with vLLM
+
+### Docker Setup
+
+```bash
+# Start up etcd for KVBM leader/worker registration and discovery
+docker compose -f deploy/docker-compose.yml up -d
+
+# Build a dynamo vLLM container (KVBM is built in by default)
+./container/build.sh --framework vllm
+
+# Launch the container
+./container/run.sh --framework vllm -it --mount-workspace --use-nixl-gds
+```
+
+### Aggregated Serving
+
+```bash
+cd $DYNAMO_HOME/examples/backends/vllm
+./launch/agg_kvbm.sh
+```
+
+#### Verify Deployment
+
+```bash
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-0.6B",
+    "messages": [{"role": "user", "content": "Hello, how are you?"}],
+    "stream": false,
+    "max_tokens": 10
+  }'
+```
+
+#### Alternative: Using Direct vllm serve
+
+You can also use `vllm serve` directly with KVBM:
+
+```bash
+vllm serve --kv-transfer-config '{"kv_connector":"DynamoConnector","kv_role":"kv_both", "kv_connector_module_path": "kvbm.vllm_integration.connector"}' Qwen/Qwen3-0.6B
+```
+
+## Run KVBM in Dynamo with TensorRT-LLM
+
+> [!NOTE]
+> **Prerequisites:**
+> - Ensure `etcd` and `nats` are running before starting
+> - KVBM only supports TensorRT-LLM's PyTorch backend
+> - Disable partial reuse (`enable_partial_reuse: false`) to increase offloading cache hits
+> - KVBM requires TensorRT-LLM v1.2.0rc2 or newer
+
+### Docker Setup
+
+```bash
+# Start up etcd for KVBM leader/worker registration and discovery
+docker compose -f deploy/docker-compose.yml up -d
+
+# Build a dynamo TRTLLM container (KVBM is built in by default)
+./container/build.sh --framework trtllm
+
+# Launch the container
+./container/run.sh --framework trtllm -it --mount-workspace --use-nixl-gds
+```
+
+### Aggregated Serving
+
+```bash
+# Write the LLM API config
+cat > "/tmp/kvbm_llm_api_config.yaml" <<EOF
+backend: pytorch
+cuda_graph_config: null
+kv_cache_config:
+  enable_partial_reuse: false
+  free_gpu_memory_fraction: 0.80
+kv_connector_config:
+  connector_module: kvbm.trtllm_integration.connector
+  connector_scheduler_class: DynamoKVBMConnectorLeader
+  connector_worker_class: DynamoKVBMConnectorWorker
+EOF
+
+# Start dynamo frontend
+python3 -m dynamo.frontend --http-port 8000 &
+
+# Serve the model with KVBM
+python3 -m dynamo.trtllm \
+  --model-path Qwen/Qwen3-0.6B \
+  --served-model-name Qwen/Qwen3-0.6B \
+  --extra-engine-args /tmp/kvbm_llm_api_config.yaml &
+```
+
+#### Verify Deployment
+
+```bash
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-0.6B",
+    "messages": [{"role": "user", "content": "Hello, how are you?"}],
+    "stream": false,
+    "max_tokens": 30
+  }'
+```
+
+#### Alternative: Using trtllm-serve
+
+```bash
+trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml
+```
+
+## Run Dynamo with SGLang HiCache
+
+SGLang's Hierarchical Cache (HiCache) extends KV cache storage beyond GPU memory to include host CPU memory. When using NIXL as the storage backend, HiCache integrates with Dynamo's memory infrastructure.
+
+### Quick Start
+
+```bash
+# Start SGLang worker with HiCache enabled
+python -m dynamo.sglang \
+  --model-path Qwen/Qwen3-0.6B \
+  --host 0.0.0.0 --port 8000 \
+  --enable-hierarchical-cache \
+  --hicache-ratio 2 \
+  --hicache-write-policy write_through \
+  --hicache-storage-backend nixl
+
+# In a separate terminal, start the frontend
+python -m dynamo.frontend --http-port 8000
+
+# Send a test request
+curl localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "Qwen/Qwen3-0.6B",
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "stream": false,
+    "max_tokens": 30
+  }'
+```
+
+> **Learn more:** See the [SGLang HiCache Integration Guide](../../integrations/sglang-hicache.md) for detailed configuration, deployment examples, and troubleshooting.
+
+## Disaggregated Serving with KVBM
+
+KVBM supports disaggregated serving where prefill and decode operations run on separate workers. KVBM is enabled on the prefill worker to offload KV cache.
+
+### Disaggregated Serving with vLLM
+
+```bash
+# 1P1D - one prefill worker and one decode worker
+# NOTE: requires at least 2 GPUs
+cd $DYNAMO_HOME/examples/backends/vllm
+./launch/disagg_kvbm.sh
+
+# 2P2D - two prefill workers and two decode workers
+# NOTE: requires at least 4 GPUs
+cd $DYNAMO_HOME/examples/backends/vllm
+./launch/disagg_kvbm_2p2d.sh
+```
+
+### Disaggregated Serving with TRT-LLM
+
+```bash
+# Launch prefill worker with KVBM
+python3 -m dynamo.trtllm \
+  --model-path Qwen/Qwen3-0.6B \
+  --served-model-name Qwen/Qwen3-0.6B \
+  --extra-engine-args /tmp/kvbm_llm_api_config.yaml \
+  --disaggregation-mode prefill &
+```
+
+## Configuration
+
+### Cache Tier Configuration
+
+Configure KVBM cache tiers using environment variables:
+
+```bash
+# Option 1: CPU cache only (GPU -> CPU offloading)
+export DYN_KVBM_CPU_CACHE_GB=4  # 4GB of pinned CPU memory
+
+# Option 2: Both CPU and Disk cache (GPU -> CPU -> Disk tiered offloading)
+export DYN_KVBM_CPU_CACHE_GB=4
+export DYN_KVBM_DISK_CACHE_GB=8  # 8GB of disk
+
+# [Experimental] Option 3: Disk cache only (GPU -> Disk direct offloading)
+# NOTE: Experimental, may not provide optimal performance
+# NOTE: Disk offload filtering not supported with this option
+export DYN_KVBM_DISK_CACHE_GB=8
+```
+
+You can also specify exact block counts instead of GB:
+- `DYN_KVBM_CPU_CACHE_OVERRIDE_NUM_BLOCKS`
+- `DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS`
+
+### SSD Lifespan Protection
+
+When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy only offloads KV blocks from CPU to disk if the blocks have frequency ≥ 2. Frequency doubles on cache hit (initialized at 1) and decrements by 1 on each time decay step.
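The frequency rule described above can be sketched as a small counter. This is a simplified illustration, not KVBM's implementation: the class and method names are hypothetical, and flooring the counter at its initial value of 1 during decay is an assumption, since the text does not say what happens below that.

```python
class BlockFrequency:
    """Toy model of the per-block frequency used by disk offload filtering."""

    OFFLOAD_THRESHOLD = 2  # blocks need frequency >= 2 to be offloaded to disk

    def __init__(self) -> None:
        self.value = 1  # frequency is initialized at 1

    def on_cache_hit(self) -> None:
        self.value *= 2  # frequency doubles on each cache hit

    def on_decay_step(self) -> None:
        # Frequency decrements by 1 per decay step.
        # Assumption: it never drops below the initial value of 1.
        self.value = max(1, self.value - 1)

    def eligible_for_disk_offload(self) -> bool:
        return self.value >= self.OFFLOAD_THRESHOLD
```

Under this model a block that is never reused stays at frequency 1 and is never written to disk, while a single cache hit makes it eligible until a decay step brings it back down.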
+
+To disable disk offload filtering:
+
+```bash
+export DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER=true
+```
+
+## Enable and View KVBM Metrics
+
+### Setup Monitoring Stack
+
+```bash
+# Start basic services (etcd & natsd), along with Prometheus and Grafana
+docker compose -f deploy/docker-observability.yml up -d
+```
+
+### Enable Metrics for vLLM
+
+```bash
+DYN_KVBM_METRICS=true \
+DYN_KVBM_CPU_CACHE_GB=20 \
+python -m dynamo.vllm \
+    --model Qwen/Qwen3-0.6B \
+    --enforce-eager \
+    --connector kvbm
+```
+
+### Enable Metrics for TensorRT-LLM
+
+```bash
+DYN_KVBM_METRICS=true \
+DYN_KVBM_CPU_CACHE_GB=20 \
+python3 -m dynamo.trtllm \
+  --model-path Qwen/Qwen3-0.6B \
+  --served-model-name Qwen/Qwen3-0.6B \
+  --extra-engine-args /tmp/kvbm_llm_api_config.yaml &
+```
+
+### Firewall Configuration (Optional)
+
+```bash
+# If firewall blocks KVBM metrics ports
+sudo ufw allow 6880/tcp
+```
+
+### View Metrics
+
+Access Grafana at http://localhost:3000 (default login: `dynamo`/`dynamo`) and look for the **KVBM Dashboard**.
+
+### Available Metrics
+
+| Metric | Description |
+|--------|-------------|
+| `kvbm_matched_tokens` | Number of matched tokens |
+| `kvbm_offload_blocks_d2h` | Offload blocks from device to host |
+| `kvbm_offload_blocks_h2d` | Offload blocks from host to disk |
+| `kvbm_offload_blocks_d2d` | Offload blocks from device to disk (bypassing host) |
+| `kvbm_onboard_blocks_d2d` | Onboard blocks from disk to device |
+| `kvbm_onboard_blocks_h2d` | Onboard blocks from host to device |
+| `kvbm_host_cache_hit_rate` | Host cache hit rate (0.0-1.0) |
+| `kvbm_disk_cache_hit_rate` | Disk cache hit rate (0.0-1.0) |
+
+## Benchmarking KVBM
+
+Use [LMBenchmark](https://github.com/LMCache/LMBenchmark) to evaluate KVBM performance.
+
+### Setup
+
+```bash
+git clone https://github.com/LMCache/LMBenchmark.git
+cd LMBenchmark/synthetic-multi-round-qa
+```
+
+### Run Benchmark
+
+```bash
+# Synthetic multi-turn chat dataset
+# Arguments: model, endpoint, output prefix, qps
+./long_input_short_output_run.sh \
+    "Qwen/Qwen3-0.6B" \
+    "http://localhost:8000" \
+    "benchmark_kvbm" \
+    1
+```
+
+The script output reports average TTFT and other performance numbers.
+
+> **TIP:** If metrics are enabled, observe KV offloading and onboarding in the Grafana dashboard.
+
+### Baseline Comparison
+
+#### vLLM Baseline (without KVBM)
+
+```bash
+vllm serve Qwen/Qwen3-0.6B
+```
+
+#### TensorRT-LLM Baseline (without KVBM)
+
+```bash
+# Create config without kv_connector_config
+cat > "/tmp/llm_api_config.yaml" <<EOF
+backend: pytorch
+cuda_graph_config: null
+kv_cache_config:
+  enable_partial_reuse: false
+  free_gpu_memory_fraction: 0.80
+EOF
+
+trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/llm_api_config.yaml
+```
+
+## Troubleshooting
+
+### No TTFT Performance Gain
+
+**Symptom:** Enabling KVBM does not show TTFT improvement or causes performance degradation.
+
+**Cause:** Not enough prefix cache hits on KVBM to reuse offloaded KV blocks.
+
+**Solution:** Enable KVBM metrics and check the Grafana dashboard for `Onboard Blocks - Host to Device` and `Onboard Blocks - Disk to Device`. Large numbers of onboarded KV blocks indicate good cache reuse:
+
+![Grafana Example](/assets/img/kvbm-metrics-grafana.png)
+
+### KVBM Worker Initialization Timeout
+
+**Symptom:** KVBM fails to start when allocating large memory or disk storage.
+
+**Solution:** Increase the leader-worker initialization timeout (default: 1800 seconds):
+
+```bash
+export DYN_KVBM_LEADER_WORKER_INIT_TIMEOUT_SECS=3600  # 1 hour
+```
+
+### Disk Offload Fails to Start
+
+**Symptom:** KVBM fails to start when disk offloading is enabled.
+
+**Cause:** `fallocate()` is not supported on the filesystem (e.g., Lustre, certain network filesystems).
+
+**Solution:** Enable disk zerofill fallback:
+
+```bash
+export DYN_KVBM_DISK_ZEROFILL_FALLBACK=true
+```
+
+If you encounter "write all error" or EINVAL (errno 22), also try:
+
+```bash
+export DYN_KVBM_DISK_DISABLE_O_DIRECT=true
+```
+
+## Developing Locally
+
+Inside the Dynamo container, after changing KVBM-related code (Rust and/or Python):
+
+```bash
+cd /workspace/lib/bindings/kvbm
+uv pip install "maturin[patchelf]"  # quoted so the shell does not glob-expand the brackets
+maturin build --release --out /workspace/dist
+uv pip install --upgrade --force-reinstall --no-deps /workspace/dist/kvbm*.whl
+```
+
+## See Also
+
+- [KVBM Overview](README.md) for a quick overview of KV Caching, KVBM and its architecture
+- [KVBM Design](../../design-docs/kvbm-design.md) for a deep dive into KVBM architecture
+- [LMCache Integration](../../integrations/lmcache-integration.md)
+- [FlexKV Integration](../../integrations/flexkv-integration.md)
+- [SGLang HiCache](../../integrations/sglang-hicache.md)
--- a/fern/pages/components/planner/README.md
+++ b/fern/pages/components/planner/README.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+---
+
+# Planner
+
+The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
+
+> **New to the Planner?** Start with the [SLA Planner Quick Start Guide](planner-guide.md) for a complete workflow including profiling and deployment.
+
+## Feature Matrix
+
+| Category | Feature | Status |
+|----------|---------|--------|
+| **Backend** | Local (bare metal) | Deprecated |
+| | Kubernetes | Supported |
+| **LLM Framework** | vLLM | Supported |
+| | TensorRT-LLM | Supported |
+| | SGLang | Supported |
+| **Serving Type** | Aggregated | Unsupported |
+| | Disaggregated | Supported |
+| **Scaling Mode** | SLA-based (TTFT/ITL targets) | Supported (primary) |
+| | Load-based (KV cache/queue thresholds) | Deprecated |
+| **Load Predictors** | ARIMA | Supported |
+| | Prophet | Supported |
+| | Kalman filter | Supported |
+| | Constant (current = next) | Supported |
+| **Connectors** | KubernetesConnector (native DGD scaling) | Supported |
+| | VirtualConnector (external environments) | Supported |
+
+## Quick Start
+
+### Prerequisites
+
+- Dynamo platform installed on Kubernetes ([Installation Guide](../../kubernetes/installation-guide.md))
+- kube-prometheus-stack installed ([Metrics Setup](../../kubernetes/observability/metrics.md))
+- Pre-deployment profiling completed ([Profiling Guide](../profiler/profiler-guide.md))
+
+### Deploy with DGDR (Recommended)
+
+The fastest path to a planner-enabled deployment is through a DynamoGraphDeploymentRequest:
+
+```bash
+kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
+```
+
+This automatically profiles your model and deploys with the SLA planner. See [SLA Planner Guide](planner-guide.md) for the full workflow.
+
+### Deploy with DGD (Manual)
+
+For manual control, use the disaggregated planner templates:
+
+```bash
+# After profiling is complete
+kubectl apply -f examples/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE
+```
+
+## Documentation
+
+| Document | Description |
+|----------|-------------|
+| [Planner Guide](planner-guide.md) | Deployment, configuration, integration, troubleshooting |
+| [Planner Examples](planner-examples.md) | DGDR YAML examples, sample configurations, advanced patterns |
+| [SLA Planner Guide](planner-guide.md) | End-to-end DGDR workflow: define SLAs, profile, deploy, monitor |
+| [SLA-based Planner](planner-guide.md) | Scaling algorithm, correction factors, load prediction details |
+| [Load-based Planner](README.md) | Legacy load-based scaling (deprecated) |
+| [SLA-Driven Profiling](../profiler/profiler-guide.md) | Pre-deployment profiling process and configuration |
+| [Planner Design](../../design-docs/planner-design.md) | Architecture deep-dive for contributors |
+
+## Configuration Reference
+
+### Key Arguments
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--namespace` | `$DYN_NAMESPACE` or `dynamo` | Dynamo logical namespace |
+| `--backend` | `vllm` | Backend framework (`vllm`, `sglang`, `trtllm`) |
+| `--environment` | `kubernetes` | Deployment environment |
+| `--adjustment-interval` | `180` | Seconds between scaling decisions |
+| `--ttft` | `500.0` | Target Time To First Token (ms) |
+| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
+| `--isl` | `3000` | Expected average input sequence length |
+| `--osl` | `150` | Expected average output sequence length |
+| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
+| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
+| `--min-endpoint` | `1` | Minimum replicas per worker type |
+| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
+| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
+| `--no-operation` | `false` | Observation mode (no actual scaling) |
+| `--no-correction` | `false` | Disable correction factors |
+| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
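+
+As an illustrative sketch, the documented defaults can be spelled out explicitly in the planner's argument list. The `args:` list style is borrowed from the planner examples. Where this list lives in your deployment spec is an assumption that depends on your DGD template, and the values shown are simply the defaults from the table above.
+
+```yaml
+# Sketch only: explicit planner arguments using the defaults listed above.
+args:
+  - --backend vllm
+  - --environment kubernetes
+  - --adjustment-interval 180
+  - --ttft 500.0
+  - --itl 50.0
+  - --isl 3000
+  - --osl 150
+  - --load-predictor arima
+  - --max-gpu-budget 8
+  - --min-endpoint 1
+```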
+
+### Environment Variables
+
+| Variable | Default | Description |
+|----------|---------|-------------|
+| `DYN_NAMESPACE` | `dynamo` | Dynamo logical namespace |
+| `DYN_PARENT_DGD_K8S_NAME` | (required) | Parent DGD K8s resource name |
+| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` | Prometheus URL |
+| `PLANNER_PROMETHEUS_PORT` | `0` (disabled) | Port for planner's own Prometheus metrics |
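+
+A minimal sketch of how these variables might be set on the planner container, using the standard Kubernetes `env` list. The resource name `my-dgd` and the port `9085` are hypothetical placeholders, and the exact location of this block in your DGD spec is an assumption.
+
+```yaml
+# Sketch only: names are the documented variables, values are illustrative.
+env:
+  - name: DYN_NAMESPACE
+    value: "dynamo"
+  - name: DYN_PARENT_DGD_K8S_NAME
+    value: "my-dgd"   # hypothetical DGD resource name
+  - name: PROMETHEUS_ENDPOINT
+    value: "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
+  - name: PLANNER_PROMETHEUS_PORT
+    value: "9085"     # hypothetical port; the default 0 leaves the planner's own metrics disabled
+```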
+
+## Monitoring
+
+### Grafana Dashboard
+
+Deploy the planner dashboard:
+
+```bash
+kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
+```
+
+The dashboard shows:
+- Worker counts and GPU usage over time
+- Observed TTFT, ITL, request rate, sequence lengths
+- Predicted load and recommended replica counts
+- Correction factors (actual vs. expected performance)
+
+### Prometheus Metrics
+
+The planner queries the frontend's `/metrics` endpoint via Prometheus. Required metrics:
+- Request count and duration
+- TTFT and ITL distributions
+- Input/output sequence lengths
--- a/fern/pages/components/planner/planner-examples.md
+++ b/fern/pages/components/planner/planner-examples.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+---
+
+# Planner Examples
+
+Practical examples for deploying the SLA Planner with different configurations. For deployment concepts, see the [Planner Guide](planner-guide.md). For a quick overview, see the [Planner README](README.md).
+
+## Basic Examples
+
+### Minimal DGDR with AIC (Fastest)
+
+This is the simplest way to deploy with the SLA planner. It uses AI Configurator for offline profiling, which takes 20-30 seconds instead of hours:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: sla-aic
+spec:
+  model: Qwen/Qwen3-32B
+  backend: vllm
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
+    config:
+      sla:
+        isl: 3000
+        osl: 150
+        ttft: 200
+        itl: 20
+      sweep:
+        useAiConfigurator: true
+        aicSystem: h200_sxm
+        aicHfId: Qwen/Qwen3-32B
+        aicBackendVersion: "0.20.0"
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
+
+  autoApply: true
+```
+
+Deploy:
+```bash
+export NAMESPACE=your-namespace
+kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
+```
+
+### Online Profiling (Real Measurements)
+
+Standard online profiling runs real GPU measurements for more accurate results, and takes 2-4 hours:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: sla-online
+spec:
+  model: meta-llama/Llama-3.3-70B-Instruct
+  backend: vllm
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
+    config:
+      sla:
+        isl: 3000
+        osl: 150
+        ttft: 200
+        itl: 20
+      sweep:
+        useAiConfigurator: false
+        prefillInterpolationGranularity: 16
+        decodeInterpolationGranularity: 6
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"
+
+  autoApply: true
+```
+
+Deploy:
+```bash
+kubectl apply -f benchmarks/profiler/deploy/profile_sla_dgdr.yaml -n $NAMESPACE
+```
+
+Available sample DGDRs in `benchmarks/profiler/deploy/`:
+- **`profile_sla_dgdr.yaml`**: Standard online profiling for dense models
+- **`profile_sla_aic_dgdr.yaml`**: Fast offline profiling using AI Configurator
+- **`profile_sla_moe_dgdr.yaml`**: Online profiling for MoE models (SGLang)
+
+> **Profiling Config Cases**: Prior to 0.8.1, fields under `profilingConfig.config` use snake_case. Starting with 0.8.1, fields use camelCase. snake_case is still accepted for backwards compatibility, but the example DGDRs use camelCase.
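+
+To make the naming difference concrete, the AI Configurator `sweep` block from the examples above would look roughly like this in the older snake_case form. This is a sketch for reading pre-0.8.1 configs, not a recommendation to use snake_case in new DGDRs.
+
+```yaml
+# snake_case spelling (pre-0.8.1 style) of the camelCase sweep fields
+sweep:
+  use_ai_configurator: true       # camelCase: useAiConfigurator
+  aic_system: h200_sxm            # camelCase: aicSystem
+  aic_hf_id: Qwen/Qwen3-32B       # camelCase: aicHfId
+  aic_backend_version: "0.20.0"   # camelCase: aicBackendVersion
+```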
+
+## Kubernetes Examples
+
+### MoE Models (SGLang)
+
+For Mixture-of-Experts models like DeepSeek-R1, use the SGLang backend:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: sla-moe
+spec:
+  model: deepseek-ai/DeepSeek-R1
+  backend: sglang
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
+    config:
+      sla:
+        isl: 4000
+        osl: 500
+        ttft: 300
+        itl: 10
+      sweep:
+        useAiConfigurator: false
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
+
+  autoApply: true
+```
+
+Deploy:
+```bash
+kubectl apply -f benchmarks/profiler/deploy/profile_sla_moe_dgdr.yaml -n $NAMESPACE
+```
+
+### Using Existing DGD Configs (Custom Setups)
+
+Reference an existing DynamoGraphDeployment config via ConfigMap:
+
+**Step 1: Create ConfigMap from your DGD config:**
+
+```bash
+kubectl create configmap deepseek-r1-config \
+  --from-file=disagg.yaml=/path/to/your/disagg.yaml \
+  --namespace $NAMESPACE \
+  --dry-run=client -o yaml | kubectl apply -f -
+```
+
+**Step 2: Reference it in your DGDR:**
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: deepseek-r1
+spec:
+  model: deepseek-ai/DeepSeek-R1
+  backend: sglang
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
+    configMapRef:
+      name: deepseek-r1-config
+      key: disagg.yaml  # Must match the key used in --from-file
+    config:
+      sla:
+        isl: 4000
+        osl: 500
+        ttft: 300
+        itl: 10
+      sweep:
+        useAiConfigurator: true
+        aicSystem: h200_sxm
+        aicHfId: deepseek-ai/DeepSeek-V3
+        aicBackendVersion: "0.20.0"
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
+
+  autoApply: true
+```
+
+The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` and `spec.backend` into the final configuration.
+
+### Inline Configuration (Simple Use Cases)
+
+For simple use cases without a custom DGD config, provide profiler configuration directly. The profiler auto-generates a basic DGD configuration:
+
+```yaml
+profilingConfig:
+  config:
+    sla:
+      isl: 8000
+      osl: 200
+      ttft: 200.0
+      itl: 10.0
+
+    hardware:
+      minNumGpusPerEngine: 2
+      maxNumGpusPerEngine: 8
+      gpuType: h200_sxm
+
+    sweep:
+      prefillInterpolationGranularity: 16
+      decodeInterpolationGranularity: 6
+```
+
+### Mocker Deployment (Testing)
+
+Deploy a mocker backend that simulates GPU timing behavior without real GPUs. Useful for:
+- Large-scale experiments without GPU resources
+- Testing planner behavior and infrastructure
+- Validating deployment configurations
+
+```yaml
+spec:
+  model: <model-name>
+  backend: trtllm  # Real backend for profiling
+  useMocker: true  # Deploy mocker instead of real backend
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/dynamo/trtllm-runtime:<image-tag>"
+    config:
+      sla:
+        isl: 3000
+        osl: 150
+        ttft: 200
+        itl: 20
+      sweep:
+        useAiConfigurator: true
+        aicSystem: h100_sxm
+  autoApply: true
+```
+
+Profiling runs against the real backend (via GPUs or AIC). The mocker deployment then uses profiling data to simulate realistic timing.
+
+### Model Cache PVC (0.8.1+)
+
+For large models, use a pre-populated PVC instead of downloading from HuggingFace.
+
+See [SLA-Driven Profiling](../profiler/profiler-guide.md) for configuration details.
+
+## Advanced Examples
+
+### Custom Load Predictors
+
+#### Warm-starting with Trace Data
+
+Pre-load predictors with historical request patterns before live traffic:
+
+```yaml
+# In planner arguments
+args:
+  - --load-predictor arima
+  - --load-predictor-warmup-trace /data/trace.jsonl
+  - --load-predictor-log1p
+```
+
+The trace file should be in mooncake-style JSONL format with request-count, ISL, and OSL samples.
+
+#### Kalman Filter Tuning
+
+For workloads with rapid changes, tune the Kalman filter:
+
+```yaml
+args:
+  - --load-predictor kalman
+  - --kalman-q-level 2.0      # Higher = more responsive to level changes
+  - --kalman-q-trend 0.5      # Higher = trend changes faster
+  - --kalman-r 5.0            # Lower = trusts new measurements more
+  - --kalman-min-points 3     # Fewer points before forecasting starts
+  - --load-predictor-log1p    # Often helps with request-rate series
+```
+
+#### Prophet for Seasonal Workloads
+
+For workloads with daily/weekly patterns:
+
+```yaml
+args:
+  - --load-predictor prophet
+  - --prophet-window-size 100   # Larger window for seasonal detection
+  - --load-predictor-log1p
+```
+
+### Virtual Connector
+
+For non-Kubernetes environments, use the VirtualConnector to communicate scaling decisions:
+
+```python
+from dynamo._core import DistributedRuntime, VirtualConnectorClient
+
+
+async def run_scaling_loop(distributed_runtime: DistributedRuntime, namespace: str) -> None:
+    # `await` is only valid inside a coroutine, so the loop is wrapped in an async function.
+    # scale_prefill_workers / scale_decode_workers stand in for your own scaling hooks.
+    client = VirtualConnectorClient(distributed_runtime, namespace)
+
+    # Main loop: watch for planner decisions and execute them
+    while True:
+        # Block until the planner makes a new scaling decision
+        await client.wait()
+
+        # Read the decision
+        decision = await client.get()
+        print(f"Scale to: prefill={decision.num_prefill_workers}, "
+              f"decode={decision.num_decode_workers}, "
+              f"id={decision.decision_id}")
+
+        # Execute scaling in your environment
+        scale_prefill_workers(decision.num_prefill_workers)
+        scale_decode_workers(decision.num_decode_workers)
+
+        # Report completion
+        await client.complete(decision)
+```
+
+See `components/planner/test/test_virtual_connector.py` for a full working example.
+
+### Planner Configuration Passthrough
+
+Pass planner-specific settings through the DGDR:
+
+```yaml
+profilingConfig:
+  config:
+    planner:
+      plannerMinEndpoint: 2
+```
+
+### Review Before Deploy (autoApply: false)
+
+Disable auto-deployment to inspect the generated DGD:
+
+```yaml
+spec:
+  autoApply: false
+```
+
+After profiling completes:
+
+```bash
+# Extract and review generated DGD
+kubectl get dgdr sla-aic -n $NAMESPACE \
+  -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
+
+# Review and modify as needed
+vi my-dgd.yaml
+
+# Deploy manually
+kubectl apply -f my-dgd.yaml -n $NAMESPACE
+```
+
+### Profiling Artifacts with PVC
+
+Save detailed profiling artifacts (plots, logs, raw data) to a PVC:
+
+```yaml
+spec:
+  profilingConfig:
+    outputPVC: "dynamo-pvc"
+    config:
+      sla:
+        isl: 3000
+        osl: 150
+        ttft: 200
+        itl: 20
+```
+
+Setup:
+```bash
+export NAMESPACE=your-namespace
+deploy/utils/setup_benchmarking_resources.sh
+```
+
+Access results:
+```bash
+kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
+kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
+kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
+kubectl delete pod pvc-access-pod -n $NAMESPACE
+```
+
+## Related Documentation
+
+- [Planner README](README.md) -- Overview and quick start
+- [Planner Guide](planner-guide.md) -- Deployment, configuration, integration
+- [Planner Design](../../design-docs/planner-design.md) -- Architecture deep-dive
+- [DGDR Configuration Reference](../profiler/profiler-guide.md#dgdr-configuration-structure)
+- [SLA-Driven Profiling](../profiler/profiler-guide.md)
--- a/fern/pages/planner/sla-planner-quickstart.md
+++ b/fern/pages/planner/sla-planner-quickstart.md
@@ -3,83 +3,35 @@
 # SPDX-License-Identifier: Apache-2.0
 ---

-# SLA-Driven Profiling and Planner Deployment Quick Start Guide
+# Planner Guide

-Complete workflow to deploy SLA-optimized Dynamo models using DynamoGraphDeploymentRequests (DGDR). This guide shows how to automatically profile models and deploy them with optimal configurations that meet your Service Level Agreements (SLAs).
+Deployment, configuration, and integration guide for the Dynamo SLA Planner. For a quick overview, see the [Planner README](README.md). For architecture internals, see [Planner Design](../../design-docs/planner-design.md).

-> [!WARNING]
-> **Prerequisites**: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the [Dynamo Platform installation](../kubernetes/installation-guide.md).
+## Deployment

-## Overview
+### Prerequisites

-The DGDR workflow automates the entire process from SLA specification to deployment:
-
-1. **Define SLAs**: Specify performance requirements (TTFT, ITL) and model information in a DGDR Custom Resource
-2. **Automatic Profiling**: The Dynamo Operator automatically profiles your model to find optimal configurations
-3. **Auto-Deploy**: The system automatically deploys the optimal configuration that meets your SLAs
+Before deploying the planner, ensure:

-```mermaid
-flowchart TD
-    A[Create DGDR] --> B[DGDR Controller]
-    B --> C{Profiling Method}
-    C -->|Online| D[Run Profiling Job<br/>2-4 hours]
-    C -->|Offline/AIC| E[AI Configurator<br/>20-30 seconds]
-    D --> F[Generate DGD Config]
-    E --> F
-    F --> G[Auto-Deploy DGD]
-    G --> H[Monitor & Scale]
-
-    style A fill:#e1f5fe
-    style D fill:#fff3e0
-    style E fill:#e8f5e8
-    style G fill:#f3e5f5
-    style H fill:#fff8e1
-```
-
-## What is a DynamoGraphDeploymentRequest (DGDR)?
-
-A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that serves as the primary interface for users to request model deployments with specific performance and resource constraints. Think of it as a "deployment order" where you specify:
-
- **What** model you want to deploy (`model`)
- **How** it should perform (SLA targets: `ttft`, `itl`)
- **Where** it should run (optional GPU preferences)
- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
-
-The Dynamo Operator watches for DGDRs and automatically:
-1. Discovers available GPU resources in your cluster
-2. Runs profiling (online or offline) to find optimal configurations
-3. Generates an optimized DynamoGraphDeployment (DGD) configuration
-4. Deploys the DGD to your cluster
-
-**Key Benefits:**
- **Declarative**: Specify what you want, not how to achieve it
- **Automated**: No manual profiling job setup or result processing
- **SLA-Driven**: Ensures deployments meet your performance requirements
- **Integrated**: Works seamlessly with the Dynamo Operator
-
-## Prerequisites
-
-Before creating a DGDR, ensure:
- **Dynamo platform installed** with the operator running (see [Installation Guide](../kubernetes/installation-guide.md))
- **[kube-prometheus-stack](../kubernetes/observability/metrics.md) installed and running** (required for SLA planner)
+- **Dynamo platform installed** with the operator running (see [Installation Guide](../../kubernetes/installation-guide.md))
+- **[kube-prometheus-stack](../../kubernetes/observability/metrics.md) installed and running** (required for SLA planner metric collection)
 - **Image pull secrets configured** if using private registries (typically `nvcr-imagepullsecret` for NVIDIA images)
 - **Sufficient GPU resources** available in your cluster for profiling
 - **Runtime images available** that contain both profiler and runtime components

 ### Container Images

-Each DGDR requires you to specify container images for the profiling and deployment process:
+Each DGDR requires container images for the profiling and deployment process:

 **profilingConfig.profilerImage** (Required):
-Specifies the container image used for the profiling job itself. This image must contain the profiler code and dependencies needed for SLA-based profiling.
+The container image used for the profiling job. Must contain the profiler code and dependencies for SLA-based profiling.

 **deploymentOverrides.workersImage** (Optional):
-Specifies the container image used for DynamoGraphDeployment worker components (frontend, workers, planner). This image is used for:
+The container image used for DGD worker components (frontend, workers, planner). Used for:
 - Temporary DGDs created during online profiling (for performance measurements)
 - The final DGD deployed after profiling completes

-If `workersImage` is omitted, the image from the base config file (e.g., `disagg.yaml`) is used. You may use our public images (0.6.1 and later) or build and push your own.
+If `workersImage` is omitted, the image from the base config file (e.g., `disagg.yaml`) is used. Public images are available from 0.6.1 onward.

 ```yaml
 spec:
@@ -89,64 +41,57 @@ spec:
    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"  # Optional
 ```

-## Quick Start: Deploy with DGDR
+### What is a DynamoGraphDeploymentRequest (DGDR)?

-### Step 1: Create Your DGDR
-
-Dynamo provides sample DGDR configurations in `benchmarks/profiler/deploy/`. You can use these as starting points:
-
-**Available Sample DGDRs:**
- **`profile_sla_dgdr.yaml`**: Standard online profiling for dense models
- **`profile_sla_aic_dgdr.yaml`**: Fast offline profiling using AI Configurator (TensorRT-LLM)
- **`profile_sla_moe_dgdr.yaml`**: Online profiling for MoE models (SGLang)
-
-Or, you can create your own DGDR for your own needs:
-
-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: my-model-deployment  # Change the name
-  namespace: default         # Change the namespace
-spec:
-  model: "Qwen/Qwen3-0.6B"     # Update to your model
-  backend: vllm                # Backend: vllm, sglang, or trtllm
+A **DGDR** is a Kubernetes Custom Resource that serves as the primary interface for deploying models with specific performance and resource constraints. It specifies:

-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"  # Required
-    config:
-      sla:
-        isl: 3000    # Adjust to your workload
-        osl: 150     # Adjust to your workload
-        ttft: 200    # Your target (ms)
-        itl: 20      # Your target (ms)
+- **What** model to deploy (`model`)
+- **How** it should perform (SLA targets: `ttft`, `itl`)
+- **Where** it should run (optional GPU preferences)
+- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
+- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)

-      sweep:
-        use_ai_configurator: false  # Set to true for fast profiling (TensorRT-LLM only)
+The Dynamo Operator watches for DGDRs and automatically:
+1. Discovers available GPU resources in your cluster
+2. Runs profiling (online or offline) to find optimal configurations
+3. Generates an optimized DynamoGraphDeployment (DGD) configuration
+4. Deploys the DGD to your cluster

-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.1"  # Optional
+**Key Benefits:**
+- **Declarative**: Specify what you want, not how to achieve it
+- **Automated**: No manual profiling job setup or result processing
+- **SLA-Driven**: Ensures deployments meet your performance requirements
+- **Integrated**: Works seamlessly with the Dynamo Operator

-  autoApply: true  # Auto-deploy after profiling
-```
+### DGDR Workflow

-> [!TIP]
-> For detailed explanations of all configuration options (SLA, hardware, sweep, AIC, planner), see the [DGDR Configuration Reference](../benchmarks/sla-driven-profiling.md#dgdr-configuration-reference).
+The DGDR workflow automates the entire process from SLA specification to deployment:

-### Step 2: Apply the DGDR
+1. **Define SLAs**: Specify performance requirements (TTFT, ITL) and model information
+2. **Automatic Profiling**: The operator profiles your model to find optimal configurations
+3. **Auto-Deploy**: The system deploys the optimal configuration that meets your SLAs

-The rest of this quickstart will use the DGDR sample that uses AIC profiling. If you use a different DGDR file and/or name, be sure to adjust the commands accordingly.
+```mermaid
+flowchart TD
+    A[Create DGDR] --> B[DGDR Controller]
+    B --> C{Profiling Method}
+    C -->|Online| D[Run Profiling Job<br/>2-4 hours]
+    C -->|Offline/AIC| E[AI Configurator<br/>20-30 seconds]
+    D --> F[Generate DGD Config]
+    E --> F
+    F --> G[Auto-Deploy DGD]
+    G --> H[Monitor & Scale]

-```bash
-export NAMESPACE=your-namespace
-kubectl apply -f benchmarks/profiler/deploy/profile_sla_aic_dgdr.yaml -n $NAMESPACE
+    style A fill:#e1f5fe
+    style D fill:#fff3e0
+    style E fill:#e8f5e8
+    style G fill:#f3e5f5
+    style H fill:#fff8e1
 ```

-The Dynamo Operator will immediately begin processing your request.
+### Monitoring Progress

-### Step 3: Monitor Progress
-
-Watch the DGDR status:
+Watch DGDR status:

 ```bash
 # View status
@@ -166,65 +111,47 @@ kubectl logs -f job/profile-sla-aic -n $NAMESPACE
 - `Ready`: DGD successfully deployed and running
 - `Failed`: Error occurred (check events for details)

-> [!NOTE]
-> With AI Configurator, profiling completes in **20-30 seconds**! This is much faster than online profiling which takes 2-4 hours.
-
-### Step 4: Access Your Deployment
-
-Once the DGDR reaches `Ready` state, your model is deployed and ready to serve:
-
-```bash
-# Find the frontend service
-kubectl get svc -n $NAMESPACE | grep trtllm-disagg
-
-# Port-forward to access locally
-kubectl port-forward svc/trtllm-disagg-frontend 8000:8000 -n $NAMESPACE
-
-# Test the endpoint
-curl http://localhost:8000/v1/models
-```
+### Relationship to DGD

-### Step 5 (Optional): Access the Planner Grafana Dashboard
+- **DGDR**: High-level "intent" -- what you want deployed
+- **DGD**: Low-level "implementation" -- how it's deployed

-If you want to monitor the SLA Planner's decision-making in real-time, you can deploy the Planner Grafana dashboard.
+The DGDR controller generates a DGD that:
+- Uses optimal TP configurations from profiling
+- Includes the SLA planner for autoscaling
+- Has deployment and engine settings tuned for your SLAs

-```bash
-kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
+The generated DGD is tracked via labels:
+```yaml
+metadata:
+  labels:
+    dgdr.nvidia.com/name: sla-aic
+    dgdr.nvidia.com/namespace: your-namespace
 ```

-Follow the instructions in [Dynamo Metrics Collection on Kubernetes](../kubernetes/observability/metrics.md) to access the Grafana UI and select the **Dynamo Planner Dashboard**.
+## Configuration

-The dashboard displays:
- **Worker Counts & GPU Usage**: Current prefill/decode worker counts and cumulative GPU hours
- **Observed Metrics**: Real-time TTFT, ITL, request rate, and sequence lengths from Prometheus
- **Predicted Metrics**: Planner's load predictions and recommended replica counts
- **Correction Factors**: How the planner adjusts predictions based on observed vs expected performance
-
-> [!TIP]
-> Use the **Namespace** dropdown at the top of the dashboard to filter metrics for your specific deployment namespace.
-
-## DGDR Configuration Details
+### DGDR Configuration

-### Required Fields
+#### Required Fields

 | Field | Type | Description |
 |-------|------|-------------|
-| `spec.model` | string | Model identifier (e.g., "meta-llama/Llama-3-70b") |
+| `spec.model` | string | Model identifier (e.g., `meta-llama/Llama-3-70b`) |
 | `spec.backend` | enum | Inference backend: `vllm`, `sglang`, or `trtllm` |
 | `spec.profilingConfig.profilerImage` | string | Container image for profiling job |
 | `spec.profilingConfig.config.sla` | object | SLA targets (isl, osl, ttft, itl) |

-### Optional Fields
+#### Optional Fields

 | Field | Type | Description |
 |-------|------|-------------|
-| `spec.deploymentOverrides.workersImage` | string | Container image for DGD worker components. If omitted, uses image from base config file. |
+| `spec.deploymentOverrides.workersImage` | string | Container image for DGD workers. If omitted, uses image from base config. |
 | `spec.autoApply` | boolean | Automatically deploy DGD after profiling (default: false) |
-| `spec.deploymentOverrides` | object | Customize metadata (name, namespace, labels, annotations) and image for auto-created DGD |
+| `spec.useMocker` | boolean | Deploy mocker instead of real backend (default: false) |
+| `spec.deploymentOverrides` | object | Customize metadata and image for auto-created DGD |

-### SLA Configuration
-
-The `sla` section defines performance requirements and workload characteristics:
+#### SLA Configuration

 ```yaml
 sla:
@@ -240,6 +167,8 @@ sla:
 - **ITL**: Token generation latency target (lower = more GPUs needed)
 - **Trade-offs**: Tighter SLAs require more GPU resources

+For comprehensive documentation of all configuration options, see the [DGDR Configuration Reference](../profiler/profiler-guide.md#dgdr-configuration-structure).
+
 ### Profiling Methods

 Choose between **online profiling** (real measurements, 2-4 hours) or **offline profiling** with AI Configurator (estimated, 20-30 seconds):
@@ -247,154 +176,190 @@ Choose between **online profiling** (real measurements, 2-4 hours) or **offline
 ```yaml
 # Online Profiling (Default)
 sweep:
-  use_ai_configurator: false
+  useAiConfigurator: false

-# Offline Profiling (AI Configurator - TensorRT-LLM only)
+# Offline Profiling (AI Configurator)
 sweep:
-  use_ai_configurator: true
-  aic_system: h200_sxm
-  aic_hf_id: Qwen/Qwen3-32B
-  aic_backend_version: "0.20.0"
+  useAiConfigurator: true
+  aicSystem: h200_sxm
+  aicHfId: Qwen/Qwen3-32B
+  aicBackendVersion: "0.20.0"
 ```

-> [!NOTE]
-> For detailed comparison, supported configurations, and limitations, see [SLA-Driven Profiling Documentation](../benchmarks/sla-driven-profiling.md#profiling-method).
+For a detailed comparison of the two methods, their supported configurations, and their limitations, see [SLA-Driven Profiling Documentation](../profiler/profiler-guide.md#profiling-method).
+
+### Load Predictors
+
+The SLA planner forecasts the number of requests, the input sequence length (ISL), and the output sequence length (OSL) for the next adjustment interval. Four prediction models are supported:
+
+#### Constant Predictor
+- **Use case**: Stable workloads with long prediction intervals
+- **Behavior**: Assumes next load equals current load
+- **Configuration**: `load-predictor: "constant"`
+
+#### ARIMA Predictor
+- **Use case**: Time-series data with trends and seasonality
+- **Behavior**: Uses auto-ARIMA to fit optimal model parameters
+- **Configuration**: `load-predictor: "arima"`
+- **Tunable parameters**:
+  - `--load-predictor-log1p`: model `log1p(y)` instead of `y`. If not set, ARIMA first fits in raw space; if the fitted model collapses to `(0,d,0)` (no autoregressive or moving-average terms), it falls back to `log1p` automatically.
+
+#### Kalman Predictor
+- **Use case**: Low-latency online forecasting (observe 1 -> predict 1) with smooth adaptation
+- **Behavior**: Local linear trend Kalman filter (fast online updates; good default when ARIMA collapses to mean-only)
+- **Configuration**: `load-predictor: "kalman"`
+- **Tunable parameters**:
+  - `--kalman-q-level`: process noise for level (higher = more responsive)
+  - `--kalman-q-trend`: process noise for trend (higher = trend changes faster)
+  - `--kalman-r`: measurement noise (lower = trusts new measurements more)
+  - `--kalman-min-points`: minimum points before forecasting
+  - `--load-predictor-log1p`: model `log1p(y)` instead of `y` (often helps request-rate/count series)
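To make the tunables above more concrete, here is a minimal sketch of a scalar local linear trend Kalman filter. It is not the planner's implementation: the class name, the default values, and the initial covariance are assumptions chosen for illustration, and only the roles of the parameters mirror the flags listed above.

```python
import math


class LocalLinearTrendKalman:
    """Illustrative local linear trend filter; state is (level, trend)."""

    def __init__(self, q_level=1.0, q_trend=0.1, r=10.0, min_points=5, log1p=False):
        self.q_level = q_level        # process noise for the level
        self.q_trend = q_trend        # process noise for the trend
        self.r = r                    # measurement noise
        self.min_points = min_points  # observations required before forecasting
        self.log1p = log1p            # model log1p(y) instead of y
        self.n = 0
        self.level = 0.0
        self.trend = 0.0
        # Assumed large initial covariance so early observations dominate.
        self.p11, self.p12, self.p22 = 1e3, 0.0, 1e3

    def observe(self, y):
        y = math.log1p(y) if self.log1p else float(y)
        self.n += 1
        if self.n == 1:
            self.level = y  # initialise the level from the first sample
            return
        # Predict step: level advances by trend, covariance grows by Q.
        level_pred = self.level + self.trend
        p11 = self.p11 + 2.0 * self.p12 + self.p22 + self.q_level
        p12 = self.p12 + self.p22
        p22 = self.p22 + self.q_trend
        # Update step: blend the prediction with the new measurement.
        s = p11 + self.r
        k1, k2 = p11 / s, p12 / s
        innovation = y - level_pred
        self.level = level_pred + k1 * innovation
        self.trend = self.trend + k2 * innovation
        self.p11 = (1.0 - k1) * p11
        self.p12 = (1.0 - k1) * p12
        self.p22 = p22 - k2 * p12

    def predict_next(self):
        """One-step-ahead forecast, or None until min_points are observed."""
        if self.n < self.min_points:
            return None
        y = self.level + self.trend
        return math.expm1(y) if self.log1p else y
```

In this sketch, raising `q_level` or `q_trend` increases the filter gain so the forecast reacts faster, while raising `r` makes it smoother. This is the same trade-off the flags describe.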
+
+#### Prophet Predictor
+- **Use case**: Complex seasonal patterns and trend changes
+- **Behavior**: Facebook's [Prophet](https://facebook.github.io/prophet/) model for time-series forecasting
+- **Configuration**: `load-predictor: "prophet"`
+- **Tunable parameters**:
+  - `--prophet-window-size`: bounds internal history to control refit cost
+  - `--load-predictor-log1p`: model `log1p(y)` instead of `y`
+
+#### Warm-starting Load Predictors (Optional)
+
+You can warm-start load predictors with a mooncake-style JSONL trace file:
+
+- **CLI argument**: `--load-predictor-warmup-trace <path/to/trace.jsonl>`
+- **Effect**: preloads predictors with historical request-count / ISL / OSL samples extracted from the trace
+
+### Planner Scaling Parameters
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--adjustment-interval` | `180` | Seconds between scaling decisions |
+| `--ttft` | `500.0` | Target Time To First Token (ms) |
+| `--itl` | `50.0` | Target Inter-Token Latency (ms) |
+| `--isl` | `3000` | Expected average input sequence length |
+| `--osl` | `150` | Expected average output sequence length |
+| `--max-gpu-budget` | `8` | Maximum GPUs across all workers |
+| `--min-endpoint` | `1` | Minimum replicas per worker type |
+| `--decode-engine-num-gpu` | `1` | GPUs per decode engine |
+| `--prefill-engine-num-gpu` | `1` | GPUs per prefill engine |
+| `--no-operation` | `false` | Observation mode (no actual scaling) |
+| `--no-correction` | `false` | Disable correction factors |

-### Hardware Configuration
+#### Planner Configuration Passthrough

-For details on hardware configuration and GPU discovery options, see [Hardware Configuration in SLA-Driven Profiling](../benchmarks/sla-driven-profiling.md#hardware-configuration).
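+Add planner-specific settings under `profilingConfig.config.planner` in the DGDR. Planner arguments take a `planner` prefix, for example `plannerMinEndpoint` for `--min-endpoint`: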

-### Advanced Configuration
+```yaml
+profilingConfig:
+  config:
+    planner:
+      plannerMinEndpoint: 2
+```

-#### Using Existing DGD Configs (Recommended for Custom Setups)
+## Integration

-If you have an existing DynamoGraphDeployment config (e.g., from `examples/backends/*/deploy/disagg.yaml` or custom recipes), you can reference it via ConfigMap:
+### Prometheus Setup

-**Step 1: Create ConfigMap from your DGD config file:**
+The planner queries Prometheus to collect frontend request metrics. The architecture:

-```bash
-kubectl create configmap deepseek-r1-config \
-  --from-file=disagg.yaml=/path/to/your/disagg.yaml \
-  --namespace $NAMESPACE \
-  --dry-run=client -o yaml | kubectl apply -f -
+```mermaid
+flowchart LR
+  Frontend --"/metrics"--> Prometheus
+  Planner --"query API"--> Prometheus
+  Planner --"scaling decisions"--> Workers
+  Frontend -.->|"requests"| Workers
 ```

-**Step 2: Reference the ConfigMap in your DGDR:**
+**Components:**
+- **Frontend**: Serves requests and exposes `/metrics`
+- **Prometheus**: Scrapes frontend metrics every 5s (configurable in the PodMonitor manifest)
+- **Planner**: Queries Prometheus and adjusts worker scaling every adjustment interval
+- **Workers**: Prefill and backend workers handle inference

-```yaml
-apiVersion: nvidia.com/v1alpha1
-kind: DynamoGraphDeploymentRequest
-metadata:
-  name: deepseek-r1
-spec:
-  model: deepseek-ai/DeepSeek-R1
-  backend: sglang
+The planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with request count, ISL, OSL, TTFT, and ITL in the correct format. The Dynamo frontend provides these metrics automatically.

-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
-    configMapRef:
-      name: deepseek-r1-config
-      key: disagg.yaml  # Must match the key used in --from-file
-    config:
-      sla:
-        isl: 4000
-        osl: 500
-        ttft: 300
-        itl: 10
-      sweep:
-        use_ai_configurator: true
-      aic:
-        system: h200_sxm
-        model_name: DEEPSEEK_V3
-        backend_version: "0.20.0"
+**Prometheus endpoint configuration:**

-  deploymentOverrides:
-    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.1"
+| Variable | Default |
+|----------|---------|
+| `PROMETHEUS_ENDPOINT` | `http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090` |

-  autoApply: true
-```
+If you see errors like "Failed to resolve prometheus service", ensure `PROMETHEUS_ENDPOINT` points to your Prometheus service.
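As a rough illustration of how `PROMETHEUS_ENDPOINT` is used, the sketch below builds a URL for the instant-query endpoint of the Prometheus HTTP API (`/api/v1/query`). The helper name is hypothetical and this is not the planner's own code. It only shows how the environment variable and its default combine with a PromQL expression.

```python
import os
from urllib.parse import urlencode

# Default taken from the table above.
DEFAULT_PROMETHEUS_ENDPOINT = (
    "http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local:9090"
)


def build_query_url(promql, endpoint=None):
    """Return the instant-query URL for a PromQL expression (hypothetical helper)."""
    base = endpoint or os.environ.get("PROMETHEUS_ENDPOINT", DEFAULT_PROMETHEUS_ENDPOINT)
    return f"{base.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"
```

Opening the resulting URL from inside the cluster (for example with `curl`) is a quick way to check that the endpoint resolves before starting the planner.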

-> **What's happening**: The profiler uses the DGD config from the ConfigMap as a **base template**, then optimizes it based on your SLA targets. The controller automatically injects `spec.model` into `deployment.model` and `spec.backend` into `engine.backend` in the final configuration.
+### Virtual Deployment

-#### Inline Configuration (Simple Use Cases)
+The SLA planner supports virtual deployment mode for customized environments (e.g., custom orchestrators) through the `VirtualConnector`. This connector enables the planner to communicate scaling decisions without directly managing Kubernetes resources.

-For simple use cases without a custom DGD config, provide profiler configuration directly. The profiler will auto-generate a basic DGD configuration from your `model` and `backend`:
+The `VirtualConnector` acts as a bridge between the SLA planner and external deployment environments. Instead of PATCHing DGD resources, it writes scaling decisions and waits for the external environment to acknowledge completion.

-```yaml
-profilingConfig:
-  config:
-    # SLA targets (required for profiling)
-    sla:
-      isl: 8000   # Input sequence length
-      osl: 200    # Output sequence length
-      ttft: 200.0 # Time To First Token (ms)
-      itl: 10.0   # Inter-Token Latency (ms)
-
-    # Hardware constraints (optional)
-    hardware:
-      min_num_gpus_per_engine: 2
-      max_num_gpus_per_engine: 8
-      gpu_type: h200_sxm
-
-    # Profiling sweep settings (optional)
-    sweep:
-      prefill_interpolation_granularity: 16  # Number of samples for prefill ISL sweep
-      decode_interpolation_granularity: 6    # Number of samples for decode sweep
-```
+#### Scaling Decision Flow

-> **Note**: `engine.config` is a **file path** to a DGD YAML file, not inline configuration. Use ConfigMapRef (recommended) or leave it unset to auto-generate.
+1. **Decision Generation**: The planner calculates optimal worker counts
+2. **Change Detection**: Skips scaling if target counts match current counts, logging: `"No scaling needed (prefill=X, decode=Y)"`
+3. **Readiness Check**: Verifies previous scaling operations completed by checking `scaled_decision_id >= decision_id`
+4. **Timeout Handling**: If the previous decision is not acknowledged within 30 minutes (1800 seconds), the planner proceeds with new decisions
+5. **Completion Tracking**: Optionally waits for scaling completion confirmation (blocking mode)
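The change-detection, readiness, and timeout steps above can be sketched as a small pure function. This is an illustration under stated assumptions, not the `VirtualConnector` source: the `ConnectorState` dataclass, the function name, and the returned reason strings are hypothetical. Only the `scaled_decision_id >= decision_id` check and the 1800-second timeout come from the description above.

```python
from dataclasses import dataclass

ACK_TIMEOUT_SECONDS = 1800  # 30 minutes, as described in step 4


@dataclass
class ConnectorState:
    """Hypothetical snapshot of what the connector knows."""
    num_prefill: int
    num_decode: int
    decision_id: int = -1          # last decision written by the planner
    scaled_decision_id: int = -1   # last decision acknowledged externally
    last_decision_time: float = 0.0


def should_publish(state, target_prefill, target_decode, now):
    """Return (publish, reason) for a newly computed target."""
    # Step 2: skip when nothing changes.
    if (target_prefill, target_decode) == (state.num_prefill, state.num_decode):
        return False, "no scaling needed"
    # Step 3: the previous decision has been acknowledged.
    if state.scaled_decision_id >= state.decision_id:
        return True, "previous decision acknowledged"
    # Step 4: give up waiting after the timeout.
    if now - state.last_decision_time >= ACK_TIMEOUT_SECONDS:
        return True, "acknowledgement timed out"
    return False, "waiting for acknowledgement"
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the timeout branch easy to exercise in tests.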

-#### Planner Configuration Passthrough
-Add planner-specific settings. Planner arguments use a `planner_` prefix:
+#### Configuration
+
+To use virtual deployment mode:

 ```yaml
-profilingConfig:
-  config:
-    planner:
-      planner_min_endpoint: 2
+environment: "virtual"
+backend: "vllm"  # or "sglang"
 ```

-## Understanding Profiling Results
+#### Deployment Environment Requirements

-For details about the profiling process, performance plots, and interpolation data, see [SLA-Driven Profiling Documentation](../benchmarks/sla-driven-profiling.md).
+The external deployment environment must use `VirtualConnectorClient`:

-## Advanced Topics
+```python
+from dynamo._core import DistributedRuntime, VirtualConnectorClient

-### Mocker Deployment
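+# `distributed_runtime` is an existing DistributedRuntime instance and
+# `namespace` is the Dynamo namespace the planner is running in.
+client = VirtualConnectorClient(distributed_runtime, namespace)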
+```

-Instead of a real DGD that uses GPU resources, you can deploy a mocker deployment that uses simulated engines rather than GPUs. Mocker is available in all backend images and uses profiling data to simulate realistic GPU timing behavior. It is useful for:
- Large-scale experiments without GPU resources
- Testing Planner behavior and infrastructure
- Validating deployment configurations
+1. **Monitor Planner**: Continuously watch for scaling decisions: `await client.wait()` (blocks until change)
+2. **Parse Decisions**: Read values: `decision = await client.get()`
+3. **Execute Scaling**: Apply the scaling decisions to your infrastructure
+4. **Acknowledge Completion**: Mark done: `await client.complete(decision)`

-To deploy mocker instead of the real backend, set `useMocker: true`:
+A scaling decision (returned by `client.get()`) contains:
+- `num_prefill_workers`: Target number of prefill workers (-1 if not set)
+- `num_decode_workers`: Target number of decode workers (-1 if not set)
+- `decision_id`: Incremental ID for each scaling decision
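The four steps above can be sketched as one loop. To keep the example runnable without a Dynamo runtime, it uses a hypothetical in-memory `FakeClient` that exposes only the `wait()`, `get()`, and `complete()` calls named above. The `Decision` dataclass, the `apply_decisions` helper, and its `scale_fn` callback are also assumptions made for illustration.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class Decision:
    """Hypothetical container mirroring the fields listed above."""
    num_prefill_workers: int
    num_decode_workers: int
    decision_id: int


class FakeClient:
    """In-memory stand-in for VirtualConnectorClient (illustration only)."""

    def __init__(self, decisions):
        self._pending = list(decisions)
        self.completed = []

    async def wait(self):
        await asyncio.sleep(0)  # the real client blocks until a change arrives

    async def get(self):
        return self._pending.pop(0)

    async def complete(self, decision):
        self.completed.append(decision.decision_id)


async def apply_decisions(client, scale_fn, max_decisions):
    """Monitor, parse, execute, and acknowledge a bounded number of decisions."""
    for _ in range(max_decisions):
        await client.wait()
        decision = await client.get()
        targets = {}
        # A value of -1 means the planner did not set a target for that worker type.
        if decision.num_prefill_workers != -1:
            targets["prefill"] = decision.num_prefill_workers
        if decision.num_decode_workers != -1:
            targets["decode"] = decision.num_decode_workers
        scale_fn(targets)  # apply to your own infrastructure
        await client.complete(decision)
```

A real integration would loop indefinitely and pass the actual client. The bounded `max_decisions` here exists only so that the sketch terminates.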

-```yaml
-spec:
-  model: <model-name>
-  backend: trtllm  # Real backend for profiling (vllm, sglang, or trtllm)
-  useMocker: true  # Deploy mocker instead of real backend
+See `components/planner/test/test_virtual_connector.py` for a full example.

-  profilingConfig:
-    profilerImage: "nvcr.io/nvidia/dynamo/trtllm-runtime:<image-tag>"
-    ...
-  autoApply: true
+### Grafana Dashboard
+
+Deploy the planner Grafana dashboard:
+
+```bash
+kubectl apply -n monitoring -f deploy/observability/k8s/grafana-planner-dashboard-configmap.yaml
 ```

-Profiling still runs against the real backend (via GPUs or AIC) to collect performance data. The mocker deployment then uses this data to simulate realistic timing behavior.
+Follow [Dynamo Metrics Collection on Kubernetes](../../kubernetes/observability/metrics.md) to access the Grafana UI and select the **Dynamo Planner Dashboard**.
+
+The dashboard displays:
+- **Worker Counts & GPU Usage**: Current prefill/decode worker counts and cumulative GPU hours
+- **Observed Metrics**: Real-time TTFT, ITL, request rate, and sequence lengths from Prometheus
+- **Predicted Metrics**: Planner's load predictions and recommended replica counts
+- **Correction Factors**: How the planner adjusts predictions based on observed vs expected performance

-### DGDR Immutability
+> Use the **Namespace** dropdown at the top of the dashboard to filter metrics for your deployment namespace.

-DGDRs are **immutable** - if you need to update SLAs or configuration:
+## DGDR Immutability
+
+DGDRs are **immutable**. To update SLAs or configuration:

 1. Delete the existing DGDR: `kubectl delete dgdr sla-aic`
 2. Create a new DGDR with updated specifications

-### Manual Deployment Control
-
-There are two ways to manually control deployment after profiling:
+## Manual Deployment Control

-#### Option 1: Use DGDR-Generated Configuration (Recommended)
+### Option 1: Use DGDR-Generated Configuration (Recommended)

 Disable auto-deployment to review the generated DGD before applying:

@@ -403,7 +368,7 @@ spec:
  autoApply: false
 ```

-Then manually extract and apply the generated DGD:
+Then extract and apply the generated DGD manually:

 ```bash
 # Extract generated DGD from DGDR status
@@ -411,90 +376,43 @@ kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment

 # Or save to file first for review/modification
 kubectl get dgdr sla-aic -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
-
 vi my-dgd.yaml
 kubectl apply -f my-dgd.yaml -n $NAMESPACE
 ```

-The generated DGD includes optimized configurations and the SLA planner component. The required `planner-profile-data` ConfigMap is automatically created when profiling completes, so the DGD will deploy successfully.
+### Option 2: Use Standalone Planner Templates (Advanced)

-#### Option 2: Use Standalone Planner Templates (Advanced)
-
-For advanced use cases, you can manually deploy using the standalone planner templates in `examples/backends/*/deploy/disagg_planner.yaml`:
+For advanced use cases, use the standalone planner templates in `examples/backends/*/deploy/disagg_planner.yaml`:

 ```bash
-# After profiling completes, profiling data is automatically stored in ConfigMaps
-
-# OPTIONAL: Inspect profiling results stored in ConfigMaps
-# View the generated DGD configuration
+# After profiling completes, profiling data is stored in ConfigMaps
 kubectl get configmap dgdr-output-<dgdr-name> -n $NAMESPACE -o yaml
-
-# View the planner profiling data (JSON format)
 kubectl get configmap planner-profile-data -n $NAMESPACE -o yaml

-# Update the PROMETHEUS_ENDPOINT environment variable in the planner template
-# to match your cluster's Prometheus service location (see comments in the template)
-
-# Update backend planner manifest as needed, then deploy
+# Update PROMETHEUS_ENDPOINT in the template, then deploy
 kubectl apply -f examples/backends/<backend>/deploy/disagg_planner.yaml -n $NAMESPACE
 ```

-> **Note**: The standalone templates are provided as examples and may need customization for your model and requirements. The DGDR-generated configuration (Option 1) is recommended as it's automatically tuned to your profiling results and SLA targets.
->
-> **Important - Prometheus Configuration**: The planner queries Prometheus to get frontend request metrics for scaling decisions. If you see errors like "Failed to resolve prometheus service", ensure the `PROMETHEUS_ENDPOINT` environment variable in your planner configuration correctly points to your Prometheus service. See the comments in the example templates for details.
-
-### Relationship to DynamoGraphDeployment (DGD)
+## Accessing Profiling Artifacts

- **DGDR**: High-level "intent" - what you want deployed
- **DGD**: Low-level "implementation" - how it's deployed
+By default, profiling jobs save essential data to ConfigMaps. For detailed artifacts, configure the DGDR to use `dynamo-pvc`:

-The DGDR controller generates a DGD that:
- Uses optimal TP configurations from profiling
- Includes SLA planner for autoscaling
- Has deployment and engine settings tuned for your SLAs
-
-The generated DGD is tracked via labels:
-```yaml
-metadata:
-  labels:
-    dgdr.nvidia.com/name: sla-aic
-    dgdr.nvidia.com/namespace: your-namespace
-```
-
-### Accessing Detailed Profiling Artifacts
-
-By default, profiling jobs save essential data to ConfigMaps for planner integration. For advanced users who need access to detailed artifacts (logs, performance plots, AIPerf results, etc), configure the DGDR to use `dynamo-pvc`. This is optional and will not affect the functionality of profiler or Planner.
-
-**What's available in ConfigMaps (always created):**
+**ConfigMaps (always created):**
 - Generated DGD configuration
 - Profiling data for Planner (`.json` files)

-**What's available in PVC if attached to DGDR (optional):**
+**PVC (optional):**
 - Performance plots (PNGs)
- DGD configuration and logs of all services for each profiled deployment
- AIPerf profiling artifacts for each AIPerf run
+- DGD configuration and logs for each profiled deployment
+- AIPerf profiling artifacts
 - Raw profiling data (`.npz` files)
 - Profiler log

-**Setup:**
-
-1. Set up the benchmarking PVC:
 ```bash
-export NAMESPACE=your-namespace
+# Set up the PVC
 deploy/utils/setup_benchmarking_resources.sh
-```

-2. Add `outputPVC` to your DGDR's `profilingConfig`:
-```yaml
-spec:
-  profilingConfig:
-    outputPVC: "dynamo-pvc"
-    config:
-      # ... rest of config
-```
-
-3. After profiling completes, access results:
-```bash
+# Access results after profiling
 kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
 kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
 kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
@@ -525,25 +443,15 @@ kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE
 | **Profiling fails** | Check job logs: `kubectl logs -l job-name=profile-sla-aic -n $NAMESPACE` |
 | **SLA cannot be met** | Relax TTFT/ITL targets or add more GPUs |
 | **DGD not deployed** | Verify `autoApply: true` in DGDR spec |
+| **Prometheus errors** | Ensure `PROMETHEUS_ENDPOINT` env var points to your Prometheus service |

-> [!NOTE]
-> For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](../benchmarks/sla-driven-profiling.md#troubleshooting).
-
-## Configuration Reference
-
-For comprehensive documentation of all DGDR configuration options, see the [DGDR Configuration Reference](../benchmarks/sla-driven-profiling.md#dgdr-configuration-reference).
-
-This includes detailed explanations of:
- **SLA Configuration**: ISL, OSL, TTFT, ITL with use cases and trade-offs
- **Hardware Configuration**: GPU constraints and search space control
- **Sweep Configuration**: Profiling behavior and interpolation settings
- **AI Configurator Configuration**: System types, model mappings, backend versions
- **Planner Configuration**: Autoscaling and adjustment parameters
- **Complete Examples**: Full DGDRs for online, offline (AIC), and MoE profiling
+For comprehensive troubleshooting including AI Configurator constraints, performance debugging, and backend-specific issues, see [SLA-Driven Profiling Troubleshooting](../profiler/profiler-guide.md#troubleshooting).

 ## Related Documentation

- [DGDR API Reference](../kubernetes/api-reference.md)
- [Pre-Deployment Profiling Details](../benchmarks/sla-driven-profiling.md)
- [SLA Planner Architecture](sla-planner.md)
- [Dynamo Operator Guide](../kubernetes/dynamo-operator.md)
+- [Planner README](README.md) -- Overview and quick start
+- [Planner Examples](planner-examples.md) -- DGDR YAML examples and sample configurations
+- [Planner Design](../../design-docs/planner-design.md) -- Architecture deep-dive for contributors
+- [DGDR API Reference](../../kubernetes/api-reference.md)
+- [Pre-Deployment Profiling](../profiler/profiler-guide.md)
+- [Dynamo Operator Guide](../../kubernetes/dynamo-operator.md)
--- a/fern/pages/components/profiler/README.md
+++ b/fern/pages/components/profiler/README.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+---
+
+# Profiler
+
+The Dynamo Profiler is an automated performance analysis tool that measures model inference characteristics to optimize deployment configurations. It determines optimal tensor parallelism (TP) settings for prefill and decode phases, generates performance interpolation data, and enables SLA-driven autoscaling through the Planner.
+
+## Feature Matrix
+
+| Feature | vLLM | SGLang | TensorRT-LLM |
+|---------|------|--------|--------------|
+| Dense Model Profiling | ✅ | ✅ | ✅ |
+| MoE Model Profiling | 🚧 | ✅ | 🚧 |
+| AI Configurator (Offline) | ❌ | ❌ | ✅ |
+| Online Profiling (AIPerf) | ✅ | ✅ | ✅ |
+| Interactive WebUI | ✅ | ✅ | ✅ |
+| Runtime Profiling Endpoints | ❌ | ✅ | ❌ |
+
+## Quick Start
+
+### Prerequisites
+
+- Dynamo platform installed (see [Installation Guide](../../kubernetes/installation-guide.md))
+- Kubernetes cluster with GPU nodes (for DGDR-based profiling)
+- kube-prometheus-stack installed (required for SLA planner)
+
+### Using DynamoGraphDeploymentRequest (Recommended)
+
+The recommended way to profile models is through DGDRs, which automate the entire profiling and deployment workflow.
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: my-model-profiling
+spec:
+  model: "Qwen/Qwen3-0.6B"
+  backend: vllm
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
+    config:
+      sla:
+        isl: 3000      # Average input sequence length
+        osl: 150       # Average output sequence length
+        ttft: 200.0    # Target Time To First Token (ms)
+        itl: 20.0      # Target Inter-Token Latency (ms)
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
+
+  autoApply: true
+```
+
+```bash
+kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE
+```
+
+### Using AI Configurator (Fast Offline Profiling)
+
+For TensorRT-LLM, use AI Configurator for rapid profiling (~30 seconds):
+
+```yaml
+profilingConfig:
+  config:
+    sweep:
+      useAiConfigurator: true
+      aicSystem: h200_sxm
+      aicHfId: Qwen/Qwen3-32B
+      aicBackendVersion: "0.20.0"
+```
+
+### Direct Script Usage (Advanced)
+
+For advanced scenarios, run the profiler directly:
+
+```bash
+python -m benchmarks.profiler.profile_sla \
+  --backend vllm \
+  --config path/to/disagg.yaml \
+  --model meta-llama/Llama-3-8B \
+  --ttft 200 --itl 15 \
+  --isl 3000 --osl 150
+```
+
+## Configuration
+
+| Parameter | Default | Description |
+|-----------|---------|-------------|
+| `sla.isl` | - | Average input sequence length (tokens) |
+| `sla.osl` | - | Average output sequence length (tokens) |
+| `sla.ttft` | - | Target Time To First Token (milliseconds) |
+| `sla.itl` | - | Target Inter-Token Latency (milliseconds) |
+| `sweep.useAiConfigurator` | `false` | Use offline simulation instead of real profiling |
+| `hardware.minNumGpusPerEngine` | auto | Minimum GPUs per engine (auto-detected from model size) |
+| `hardware.maxNumGpusPerEngine` | 8 | Maximum GPUs per engine |
+
+## Profiling Methods
+
+| Method | Duration | Accuracy | GPU Required | Backends |
+|--------|----------|----------|--------------|----------|
+| Online (AIPerf) | 2-4 hours | Highest | Yes | All |
+| Offline (AI Configurator) | 20-30 seconds | Estimated | No | TensorRT-LLM |
+
+## Output
+
+The profiler generates:
+
+1. **Optimal Configuration**: Recommended TP sizes for prefill and decode engines
+2. **Performance Data**: Interpolation models for the SLA Planner
+3. **Generated DGD**: Complete deployment manifest with optimized settings
+
+Example recommendations:
+```text
+Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
+Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
+```
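If you want to consume these recommendations from a script, a small parser is usually enough. The sketch below assumes the log lines follow exactly the format shown in the example above. The function name and the output dictionary layout are hypothetical, and the profiler may emit additional or differently formatted lines.

```python
import re

# Matches lines such as:
#   Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
RECOMMENDATION = re.compile(
    r"Suggested (?P<phase>prefill|decode) TP:(?P<tp>\d+) "
    r"\((?P<metric>TTFT|ITL) (?P<latency_ms>[\d.]+) ms, "
    r"throughput (?P<throughput>[\d.]+) tokens/s/GPU\)"
)


def parse_recommendations(text):
    """Return {phase: {...}} for every recommendation line found in text."""
    results = {}
    for match in RECOMMENDATION.finditer(text):
        results[match["phase"]] = {
            "tp": int(match["tp"]),
            "metric": match["metric"],
            "latency_ms": float(match["latency_ms"]),
            "throughput_per_gpu": float(match["throughput"]),
        }
    return results
```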
+
+## Next Steps
+
+| Document | Description |
+|----------|-------------|
+| [Profiler Guide](profiler-guide.md) | Configuration, methods, and troubleshooting |
+| [Profiler Examples](profiler-examples.md) | Complete DGDR YAMLs, WebUI, script examples |
+| [SLA Planner Guide](../planner/planner-guide.md) | End-to-end deployment workflow |
+| [SLA Planner Architecture](../planner/planner-guide.md) | How the Planner uses profiling data |
--- a/fern/pages/components/profiler/profiler-examples.md
+++ b/fern/pages/components/profiler/profiler-examples.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+---
+
+# Profiler Examples
+
+Complete examples for profiling with DGDRs, the interactive WebUI, and direct script usage.
+
+## DGDR Examples
+
+### Dense Model: AIPerf on Real Engines
+
+Standard online profiling with real GPU measurements:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: vllm-dense-online
+spec:
+  model: "Qwen/Qwen3-0.6B"
+  backend: vllm
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
+    config:
+      sla:
+        isl: 3000
+        osl: 150
+        ttft: 200.0
+        itl: 20.0
+
+      hardware:
+        minNumGpusPerEngine: 1
+        maxNumGpusPerEngine: 8
+
+      sweep:
+        useAiConfigurator: false
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
+
+  autoApply: true
+```
+
+### Dense Model: AI Configurator Simulation
+
+Fast offline profiling (~30 seconds, TensorRT-LLM only):
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: trtllm-aic-offline
+spec:
+  model: "Qwen/Qwen3-32B"
+  backend: trtllm
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
+    config:
+      sla:
+        isl: 4000
+        osl: 500
+        ttft: 300.0
+        itl: 10.0
+
+      sweep:
+        useAiConfigurator: true
+        aicSystem: h200_sxm  # Also supports h100_sxm, b200_sxm, gb200_sxm, a100_sxm
+        aicHfId: Qwen/Qwen3-32B
+        aicBackendVersion: "0.20.0"
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.9.0"
+
+  autoApply: true
+```
+
+### MoE Model
+
+Multi-node MoE profiling with SGLang:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: sglang-moe
+spec:
+  model: "deepseek-ai/DeepSeek-R1"
+  backend: sglang
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
+    config:
+      sla:
+        isl: 2048
+        osl: 512
+        ttft: 300.0
+        itl: 25.0
+
+      hardware:
+        numGpusPerNode: 8
+        maxNumGpusPerEngine: 32
+
+      engine:
+        isMoeModel: true
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
+
+  autoApply: true
+```
+
+### Using Existing DGD Config (ConfigMap)
+
+Reference a custom DGD configuration via ConfigMap:
+
+```bash
+# Create ConfigMap from your DGD config file
+kubectl create configmap deepseek-r1-config \
+  --from-file=/path/to/your/disagg.yaml \
+  --namespace $NAMESPACE \
+  --dry-run=client -o yaml | kubectl apply -f -
+```
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: deepseek-r1
+spec:
+  model: deepseek-ai/DeepSeek-R1
+  backend: sglang
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
+    configMapRef:
+      name: deepseek-r1-config
+      key: disagg.yaml
+    config:
+      sla:
+        isl: 4000
+        osl: 500
+        ttft: 300
+        itl: 10
+      sweep:
+        useAiConfigurator: true
+        aicSystem: h200_sxm
+        aicHfId: deepseek-ai/DeepSeek-V3
+        aicBackendVersion: "0.20.0"
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.9.0"
+
+  autoApply: true
+```
+
+## Interactive WebUI
+
+Launch an interactive configuration selection interface:
+
+```bash
+python -m benchmarks.profiler.profile_sla \
+  --backend trtllm \
+  --config path/to/disagg.yaml \
+  --pick-with-webui \
+  --use-ai-configurator \
+  --model Qwen/Qwen3-32B-FP8 \
+  --aic-system h200_sxm \
+  --ttft 200 --itl 15
+```
+
+The WebUI launches on port 8000 by default (configurable with `--webui-port`).
+
+### Features
+
+- **Interactive Charts**: Visualize prefill TTFT, decode ITL, and GPU hours analysis with hover-to-highlight synchronization between charts and tables
+- **Pareto-Optimal Analysis**: The GPU Hours table shows pareto-optimal configurations balancing latency and throughput
+- **DGD Config Preview**: Click "Show Config" on any row to view the corresponding DynamoGraphDeployment YAML
+- **GPU Cost Estimation**: Toggle GPU cost display to convert GPU hours to cost ($/1000 requests)
+- **SLA Visualization**: Red dashed lines indicate your TTFT and ITL targets
+
+### Selection Methods
+
+1. **GPU Hours Table** (recommended): Click any row to select both prefill and decode configurations at once based on the pareto-optimal combination
+2. **Individual Selection**: Click one row in the Prefill table AND one row in the Decode table to manually choose each
+
+### Example DGD Config Output
+
+When you click "Show Config", you see a DynamoGraphDeployment configuration:
+
+```yaml
+# DynamoGraphDeployment Configuration
+# Prefill: 1 GPU(s), TP=1
+# Decode: 4 GPU(s), TP=4
+# Model: Qwen/Qwen3-32B-FP8
+# Backend: trtllm
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+spec:
+  services:
+    PrefillWorker:
+      subComponentType: prefill
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --tensor-parallel-size=1
+    DecodeWorker:
+      subComponentType: decode
+      replicas: 1
+      extraPodSpec:
+        mainContainer:
+          args:
+          - --tensor-parallel-size=4
+```
+
+Once you select a configuration, the full DGD CRD is saved as `config_with_planner.yaml`.
+
+## Direct Script Examples
+
+### Basic Profiling
+
+```bash
+python -m benchmarks.profiler.profile_sla \
+  --backend vllm \
+  --config path/to/disagg.yaml \
+  --model meta-llama/Llama-3-8B \
+  --ttft 200 --itl 15 \
+  --isl 3000 --osl 150
+```
+
+### With GPU Constraints
+
+```bash
+python -m benchmarks.profiler.profile_sla \
+  --backend sglang \
+  --config examples/backends/sglang/deploy/disagg.yaml \
+  --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
+  --ttft 200 --itl 15 \
+  --isl 3000 --osl 150 \
+  --min-num-gpus 2 \
+  --max-num-gpus 8
+```
+
+### AI Configurator (Offline)
+
+```bash
+python -m benchmarks.profiler.profile_sla \
+  --backend trtllm \
+  --config path/to/disagg.yaml \
+  --use-ai-configurator \
+  --model Qwen/Qwen3-32B-FP8 \
+  --aic-system h200_sxm \
+  --ttft 200 --itl 15 \
+  --isl 4000 --osl 500
+```
+
+## SGLang Runtime Profiling
+
+Profile SGLang workers at runtime via HTTP endpoints:
+
+```bash
+# Start profiling
+curl -X POST http://localhost:9090/engine/start_profile \
+  -H "Content-Type: application/json" \
+  -d '{"output_dir": "/tmp/profiler_output"}'
+
+# Run inference requests to generate profiling data...
+
+# Stop profiling
+curl -X POST http://localhost:9090/engine/stop_profile
+```
+
+A test script is provided at `examples/backends/sglang/test_sglang_profile.py`:
+
+```bash
+python examples/backends/sglang/test_sglang_profile.py
+```
+
+View traces using Chrome's `chrome://tracing`, [Perfetto UI](https://ui.perfetto.dev/), or TensorBoard.
--- a/fern/pages/components/profiler/profiler-guide.md
+++ b/fern/pages/components/profiler/profiler-guide.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+---
+
+# Profiler Guide
+
+This guide covers deployment, configuration, integration, and troubleshooting for the Dynamo Profiler.
+
+## What is a DynamoGraphDeploymentRequest (DGDR)?
+
+A **DynamoGraphDeploymentRequest (DGDR)** is a Kubernetes Custom Resource that serves as the primary interface for users to request model deployments with specific performance and resource constraints. You specify:
+
+- **What** model you want to deploy (`model`)
+- **How** it should perform (SLA targets: `ttft`, `itl`)
+- **Where** it should run (optional GPU preferences)
+- **Which** backend to use (`backend`: vllm, sglang, or trtllm)
+- **Which** images to use (`profilingConfig.profilerImage`, `deploymentOverrides.workersImage`)
+
+The Dynamo Operator watches for DGDRs and automatically:
+1. Discovers available GPU resources in your cluster
+2. Runs profiling (online or offline) to find optimal configurations
+3. Generates an optimized DynamoGraphDeployment (DGD) configuration
+4. Deploys the DGD to your cluster
+
+**Relationship to DGD:**
+- **DGDR**: High-level "intent" - what you want deployed
+- **DGD**: Low-level "implementation" - how it's deployed
+
+## Support Matrix
+
+| Backend | Dense Models | MoE Models |
+|---------|-------------|------------|
+| vLLM | ✅ | 🚧 |
+| SGLang | ✅ | ✅ |
+| TensorRT-LLM | ✅ | 🚧 |
+
+The profiler sweeps over the following parallelization mappings for prefill and decode:
+
+| Model Architecture | Prefill Parallelization Mapping | Decode Parallelization Mapping |
+|---------|-------------|------------|
+| MLA+MoE (DeepseekV3ForCausalLM, DeepseekV32ForCausalLM) | TEP, DEP | TEP, DEP |
+| GQA+MoE (Qwen3MoeForCausalLM) | TP, TEP, DEP | TP, TEP, DEP |
+| Other Models | TP | TP |
+
+> [!NOTE]
+> Exact support for each model × parallelization mapping combination depends on the backend. The profiler does not guarantee that the backend supports the recommended P/D engine configuration or runs it without bugs.
+
+## Deployment
+
+### Kubernetes Deployment (DGDR)
+
+The recommended deployment method is through DGDRs. Sample configurations are provided in `benchmarks/profiler/deploy/`:
+
+| Sample | Description |
+|--------|-------------|
+| `profile_sla_dgdr.yaml` | Standard online profiling with AIPerf |
+| `profile_sla_aic_dgdr.yaml` | Fast offline profiling with AI Configurator |
+| `profile_sla_moe_dgdr.yaml` | MoE model profiling (SGLang) |
+
+#### Container Images
+
+Each DGDR requires container images for profiling and deployment:
+
+- **`profilingConfig.profilerImage`** (Required): Container image for the profiling job. Must contain the profiler code and dependencies.
+- **`deploymentOverrides.workersImage`** (Optional): Container image for DGD worker components (frontend, workers, planner). If omitted, the image from the base config file is used.
+
+```yaml
+spec:
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
+```
+
+#### Quick Start: Deploy with DGDR
+
+**Step 1: Create Your DGDR**
+
+Use a sample configuration or create your own:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: my-model-profiling
+spec:
+  model: "Qwen/Qwen3-0.6B"
+  backend: vllm
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
+    config:
+      sla:
+        isl: 3000
+        osl: 150
+        ttft: 200.0
+        itl: 20.0
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
+  autoApply: true
+```
+
+**Step 2: Apply the DGDR**
+
+```bash
+export NAMESPACE=your-namespace
+kubectl apply -f my-profiling-dgdr.yaml -n $NAMESPACE
+```
+
+**Step 3: Monitor Progress**
+
+```bash
+# View status
+kubectl get dgdr -n $NAMESPACE
+
+# Detailed status
+kubectl describe dgdr my-model-profiling -n $NAMESPACE
+
+# Watch profiling job logs
+kubectl logs -f job/profile-my-model-profiling -n $NAMESPACE
+```
+
+**DGDR Status States:**
+- `Pending`: Initial state, preparing to profile
+- `Profiling`: Running profiling job (20-30 seconds for AIC, 2-4 hours for online)
+- `Deploying`: Generating and applying DGD configuration
+- `Ready`: DGD successfully deployed and running
+- `Failed`: Error occurred (check events for details)
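The status states above lend themselves to a simple polling loop. The following is a minimal sketch, not part of the operator: `wait_for_dgdr` and its `get_state` callback are hypothetical names, and how the state string is read from the cluster is left to the caller.

```python
import time
from typing import Callable

# States listed in this guide; Ready and Failed end the workflow.
TERMINAL_STATES = {"Ready", "Failed"}
KNOWN_STATES = {"Pending", "Profiling", "Deploying"} | TERMINAL_STATES


def wait_for_dgdr(get_state: Callable[[], str],
                  timeout_s: float = 4 * 3600,
                  poll_interval_s: float = 30.0,
                  sleep: Callable[[float], None] = time.sleep) -> str:
    """Poll until the DGDR reaches Ready or Failed, or the timeout expires.

    get_state is a hypothetical callback returning the current state string.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state not in KNOWN_STATES:
            raise ValueError(f"unexpected DGDR state: {state!r}")
        if state in TERMINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            raise TimeoutError(f"DGDR still in {state!r} after {timeout_s}s")
        sleep(poll_interval_s)


# Usage with a canned sequence of states instead of a live cluster.
states = iter(["Pending", "Profiling", "Profiling", "Deploying", "Ready"])
final = wait_for_dgdr(lambda: next(states), sleep=lambda _: None)
print(final)
```

The default timeout of four hours mirrors the upper end of the online profiling duration quoted above; a shorter value would suit AI Configurator runs.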
+
+**Step 4: Access Your Deployment**
+
+```bash
+# Find the frontend service
+kubectl get svc -n $NAMESPACE | grep frontend
+
+# Port-forward to access locally
+kubectl port-forward svc/<deployment>-frontend 8000:8000 -n $NAMESPACE
+
+# Test the endpoint
+curl http://localhost:8000/v1/models
+```
+
+> [!NOTE]
+> DGDRs are **immutable**. To update SLAs or configuration, delete the existing DGDR and create a new one.
+
+### Direct Script Execution
+
+For advanced use cases or local development:
+
+```bash
+python -m benchmarks.profiler.profile_sla \
+  --backend vllm \
+  --config path/to/disagg.yaml \
+  --model meta-llama/Llama-3-8B \
+  --ttft 200 --itl 15 \
+  --isl 3000 --osl 150 \
+  --min-num-gpus 1 \
+  --max-num-gpus 8
+```
+
+## Profiling Method
+
+The profiler follows a 5-step process:
+
+1. **Hardware Setup**: Uses defaults or user-specified hardware configuration. Optionally, cluster-scoped operators can enable automatic GPU discovery to detect specifications from cluster nodes.
+2. **Identify Sweep Ranges**: Automatically determine the minimum and maximum number of GPUs per engine. The minimum is determined by the model size and GPU VRAM. The maximum is set to one node for dense models and four nodes for MoE models.
+3. **Parallelization Mapping Sweep**: Test performance of engines with different parallelization mappings using the input ISL and OSL.
+   - For dense models, test different TP sizes for both prefill and decode.
+   - For MoE models (SGLang), evaluate both TEP and DEP as candidates for prefill and decode.
+   - **Prefill**:
+     - TP/TEP: Measure TTFT with batch size = 1 (assuming ISL is long enough to saturate compute) without KV reuse.
+     - DEP: Attention uses data parallelism. Send a single burst with total concurrency `attention_dp_size × attn_dp_num_req_ratio` (defaults to 4) and compute the reported TTFT as `time_to_first_token.max / attn_dp_num_req_ratio` from the AIPerf summary of that burst.
+   ![Prefill Performance](/assets/img/h100-prefill-performance.png)
+   - **Decode**: Measure the ITL under different numbers of in-flight requests, from 1 to the maximum the KV cache can hold. To measure ITL without being affected by piggy-backed prefill requests, the script enables KV-reuse and warms up the engine by issuing the same prompts before measuring.
+   ![Decode Performance](/assets/img/h100-decode-performance.png)
+4. **Recommendation**: Select optimal parallelization mapping for prefill and decode that achieves the highest per-GPU throughput while adhering to the SLA on TTFT and ITL.
+5. **In-Depth Profiling on the Recommended P/D Engine**: Interpolate TTFT with ISL and ITL with active KV cache and decode context length for more accurate performance estimation.
+![ITL Interpolation](/assets/img/pd-interpolation.png)
+   - **Prefill**: Measures TTFT and throughput per GPU across different input lengths with batch size=1.
+   - **Decode**: Measures ITL and throughput per GPU under various KV cache loads and decode context lengths.
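Two parts of the process above can be written down compactly: the DEP prefill burst arithmetic from step 3 and the selection rule from step 4. This is a sketch under stated assumptions, not the profiler's code: the `Candidate` type, the function names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


def dep_prefill_burst(attention_dp_size: int, attn_dp_num_req_ratio: int = 4) -> int:
    """Total concurrency of the single burst sent to a DEP prefill engine."""
    return attention_dp_size * attn_dp_num_req_ratio


def dep_reported_ttft(max_ttft_ms: float, attn_dp_num_req_ratio: int = 4) -> float:
    """Reported TTFT: the burst's maximum TTFT divided by the request ratio."""
    return max_ttft_ms / attn_dp_num_req_ratio


@dataclass
class Candidate:
    mapping: str          # illustrative label such as "TP2"
    latency_ms: float     # TTFT for prefill, ITL for decode
    thpt_per_gpu: float   # tokens/s/GPU


def recommend(candidates: list[Candidate], sla_ms: float) -> Optional[Candidate]:
    """Pick the SLA-compliant candidate with the highest per-GPU throughput."""
    feasible = [c for c in candidates if c.latency_ms <= sla_ms]
    return max(feasible, key=lambda c: c.thpt_per_gpu, default=None)


# Made-up numbers for illustration only, not measurements.
prefill = [
    Candidate("TP1", latency_ms=410.0, thpt_per_gpu=7300.0),
    Candidate("TP2", latency_ms=190.0, thpt_per_gpu=7900.0),
    Candidate("TP4", latency_ms=120.0, thpt_per_gpu=6200.0),
]
best = recommend(prefill, sla_ms=200.0)
print(best.mapping)
print(dep_prefill_burst(8), dep_reported_ttft(800.0))
```

Returning `None` when no candidate is feasible corresponds to the "SLA cannot be met" case covered under Troubleshooting.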
+
+### AIPerf on Real Engines
+
+Profiles your model by creating real test deployments in Kubernetes and measuring their performance.
+
+- **Duration**: 2-4 hours
+- **Accuracy**: Highest (real measurements)
+- **GPU Requirements**: Full access to test different parallelization mappings
+- **Backends**: vLLM, SGLang, TensorRT-LLM
+
+```yaml
+profilingConfig:
+  config:
+    sweep:
+      useAiConfigurator: false  # Default
+```
+
+### AI Configurator Simulation
+
+Uses performance simulation to rapidly estimate optimal configurations without running real deployments.
+
+- **Duration**: 20-30 seconds
+- **Accuracy**: Estimated (may have errors for unusual configurations)
+- **GPU Requirements**: None
+- **Backends**: TensorRT-LLM only (vLLM/SGLang coming soon)
+
+```yaml
+profilingConfig:
+  config:
+    sweep:
+      useAiConfigurator: true
+      aicSystem: h200_sxm
+      aicHfId: Qwen/Qwen3-32B
+      aicBackendVersion: "0.20.0"      # TRT-LLM version simulated by AIC
+```
+
+> [!NOTE]
+> `aicBackendVersion` specifies the TensorRT-LLM version that AI Configurator simulates. See the [AI Configurator supported features](https://github.com/ai-dynamo/aiconfigurator#supported-features) for available versions.
+
+**Currently supports:**
+- **Backends**: TensorRT-LLM (versions 0.20.0, 1.0.0rc3, 1.0.0rc6)
+- **Systems**: H100 SXM, H200 SXM, B200 SXM, GB200 SXM, A100 SXM
+- **Models**: Wide range including GPT, Llama, Mixtral, DeepSeek, Qwen, and more
+
+See [AI Configurator documentation](https://github.com/ai-dynamo/aiconfigurator#supported-features) for the full list.
+
+### Automatic GPU Discovery
+
+Cluster-scoped operators can optionally enable automatic GPU discovery:
+
+```yaml
+spec:
+  enableGpuDiscovery: true
+```
+
+This is only available with cluster-scoped operators (`namespaceRestriction.enabled=false`) as it requires cluster-wide node access permissions.
+
+## Configuration
+
+### DGDR Configuration Structure
+
+All profiler configuration goes under `spec.profilingConfig.config`:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeploymentRequest
+metadata:
+  name: my-deployment
+spec:
+  model: "Qwen/Qwen3-0.6B"
+  backend: vllm
+
+  profilingConfig:
+    profilerImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
+    configMapRef:                  # Optional: base DGD config
+      name: my-config
+      key: disagg.yaml
+
+    config:
+      sla: { ... }
+      hardware: { ... }
+      sweep: { ... }
+      planner: { ... }
+
+  deploymentOverrides:
+    workersImage: "nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.9.0"
+```
+
+### SLA Configuration (Required)
+
+```yaml
+sla:
+  isl: 3000      # Average input sequence length (tokens)
+  osl: 150       # Average output sequence length (tokens)
+  ttft: 200.0    # Target Time To First Token (milliseconds)
+  itl: 20.0      # Target Inter-Token Latency (milliseconds)
+```
+
+- **ISL/OSL**: Based on your expected traffic patterns
+- **TTFT**: First token latency target (lower = more GPUs needed, affects prefill engine)
+- **ITL**: Token generation latency target (lower = more GPUs needed, affects decode engine)
+- **Trade-offs**: Tighter SLAs require more GPU resources
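To get a feel for what a given pair of targets means for a single request, the SLA values can be combined into an approximate end-to-end latency. The helper below is an illustrative back-of-the-envelope calculation (the function name is hypothetical), assuming the first token arrives after TTFT and each later token follows one ITL apart.

```python
def request_latency_ms(ttft_ms: float, itl_ms: float, osl: int) -> float:
    """First token after ttft_ms, then (osl - 1) further tokens spaced itl_ms apart."""
    return ttft_ms + (osl - 1) * itl_ms


# Values from the SLA block above.
sla = {"isl": 3000, "osl": 150, "ttft": 200.0, "itl": 20.0}

total_ms = request_latency_ms(sla["ttft"], sla["itl"], sla["osl"])
per_request_tokens_per_s = 1000.0 / sla["itl"]
print(total_ms, per_request_tokens_per_s)  # 3180.0 50.0
```

With these targets a 150-token response completes in roughly 3.2 seconds, and each request streams at about 50 tokens per second during decode.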
+
+### Hardware Configuration (Optional)
+
+```yaml
+hardware:
+  minNumGpusPerEngine: 2      # Auto-determined from model size and VRAM if not provided
+  maxNumGpusPerEngine: 8      # Maximum GPUs to test
+  numGpusPerNode: 8           # GPUs per node (for multi-node MoE)
+  gpuType: h200_sxm           # GPU type hint (informational, auto-detected)
+```
+
+- **minNumGpusPerEngine**: Skip small TP sizes if your model is large
+- **maxNumGpusPerEngine**: Limit search space or work around constraints (e.g., [AIC attention heads](#ai-configurator-attention-head-constraint-error))
+- **numGpusPerNode**: Determine the upper bound of GPUs per node for dense models and configure Grove for multi-node MoE engines
+- **gpuType**: Informational only, auto-detected by the controller. For AI Configurator, use `aicSystem` in the [sweep configuration](#ai-configurator-configuration) instead
+
+> [!TIP]
+> If you don't specify hardware constraints, the controller auto-detects them based on your model size and available cluster resources.
+
+### Sweep Configuration (Optional)
+
+```yaml
+sweep:
+  useAiConfigurator: false              # Use real profiling (default)
+  prefillInterpolationGranularity: 16   # Samples for prefill TTFT curve
+  decodeInterpolationGranularity: 6     # Samples for decode ITL curve
+```
+
+- **useAiConfigurator**: Set to `true` for 20-30 second profiling (TensorRT-LLM only)
+- **prefillInterpolationGranularity**: Samples for prefill TTFT curve (lower = faster but less accurate)
+- **decodeInterpolationGranularity**: Samples for decode ITL curve. Since ITL interpolation is 3D and takes longer, we default to fewer samples. Increasing this value may quadratically increase profiling time.
+
+### AI Configurator Configuration
+
+Required if `useAiConfigurator: true`:
+
+```yaml
+sweep:
+  useAiConfigurator: true
+  aicSystem: h200_sxm              # h100_sxm, h200_sxm, b200_sxm, gb200_sxm, a100_sxm
+  aicHfId: Qwen/Qwen3-32B         # HuggingFace model ID
+  aicBackendVersion: "0.20.0"      # TensorRT-LLM version
+```
+
+### Planner Configuration (Optional)
+
+Pass arguments to the SLA planner:
+
+```yaml
+planner:
+  planner_min_endpoint: 2                    # Minimum endpoints to maintain
+  planner_adjustment_interval: 60            # Adjustment interval (seconds)
+  planner_load_predictor: linear             # Load prediction method
+```
+
+> [!NOTE]
+> Planner arguments use the `planner_` prefix. See the [SLA Planner documentation](../planner/planner-guide.md) for the full list.
+
+### Model Cache PVC (Advanced)
+
+For large models, use a pre-populated PVC containing model weights instead of downloading from HuggingFace:
+
+```yaml
+deployment:
+  modelCache:
+    pvcName: "model-cache"
+    pvcPath: "hub/models--deepseek-ai--DeepSeek-R1"
+    mountPath: "/opt/model-cache"
+```
+
+Requirements:
+- The PVC must exist in the same namespace as the DGDR
+- The model weights must be accessible at `{mountPath}/{pvcPath}`
+
+### Engine Configuration (Auto-configured)
+
+The controller automatically injects these from high-level fields:
+
+```yaml
+# You specify:
+spec:
+  model: "Qwen/Qwen3-0.6B"
+  backend: vllm
+
+# Controller auto-injects:
+profilingConfig:
+  config:
+    deployment:
+      model: "Qwen/Qwen3-0.6B"
+    engine:
+      backend: vllm
+      config: /path/to/configmap
+```
+
+You should **not** manually set `deployment.model` or `engine.backend` in `profilingConfig.config`.
+
+### Using Existing DGD Configs (ConfigMap)
+
+Reference an existing DGD config via ConfigMap:
+
+```bash
+kubectl create configmap my-config \
+  --from-file=disagg.yaml=/path/to/your/disagg.yaml \
+  --namespace $NAMESPACE \
+  --dry-run=client -o yaml | kubectl apply -f -
+```
+
+```yaml
+profilingConfig:
+  configMapRef:
+    name: my-config
+    key: disagg.yaml
+```
+
+The profiler uses the DGD config as a **base template**, then optimizes it based on your SLA targets.
+
+### CLI Arguments
+
+| Argument | Type | Default | Description |
+|----------|------|---------|-------------|
+| `--backend` | string | - | Inference backend: vllm, sglang, trtllm |
+| `--config` | string | - | Path to DGD YAML config file |
+| `--model` | string | - | HuggingFace model ID |
+| `--ttft` | float | - | Target TTFT in milliseconds |
+| `--itl` | float | - | Target ITL in milliseconds |
+| `--isl` | int | - | Average input sequence length |
+| `--osl` | int | - | Average output sequence length |
+| `--min-num-gpus` | int | auto | Minimum GPUs per engine |
+| `--max-num-gpus` | int | 8 | Maximum GPUs per engine |
+| `--use-ai-configurator` | flag | false | Use offline AI Configurator |
+| `--pick-with-webui` | flag | false | Launch interactive WebUI |
+| `--webui-port` | int | 8000 | Port for WebUI |
+
+> [!NOTE]
+> CLI arguments map to DGDR config fields: `--min-num-gpus` = `hardware.minNumGpusPerEngine`, `--max-num-gpus` = `hardware.maxNumGpusPerEngine`, `--use-ai-configurator` = `sweep.useAiConfigurator`. See [DGDR Configuration Structure](#dgdr-configuration-structure) for all field mappings.
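The mapping in the note above can be sketched as a small translation from a `profilingConfig.config`-style dictionary to `profile_sla` arguments. This is an illustration only: `to_cli_args` is a hypothetical helper, and the `sla` pairings are assumed from matching names, since the note lists only the three hardware and sweep fields.

```python
# Flag-to-field pairs stated in the note above.
STATED = {
    "--min-num-gpus": ("hardware", "minNumGpusPerEngine"),
    "--max-num-gpus": ("hardware", "maxNumGpusPerEngine"),
    "--use-ai-configurator": ("sweep", "useAiConfigurator"),
}
# Assumed from matching names; the note does not list these explicitly.
ASSUMED = {f"--{k}": ("sla", k) for k in ("ttft", "itl", "isl", "osl")}


def to_cli_args(config: dict) -> list[str]:
    """Turn a profilingConfig.config-style dict into profile_sla arguments."""
    args: list[str] = []
    for flag, (section, key) in {**ASSUMED, **STATED}.items():
        value = config.get(section, {}).get(key)
        if value is None:
            continue
        if isinstance(value, bool):
            if value:  # boolean fields become bare flags
                args.append(flag)
        else:
            args += [flag, str(value)]
    return args


config = {
    "sla": {"isl": 3000, "osl": 150, "ttft": 200.0, "itl": 20.0},
    "hardware": {"maxNumGpusPerEngine": 8},
    "sweep": {"useAiConfigurator": False},
}
cli_args = to_cli_args(config)
print(" ".join(cli_args))
```

Fields that are absent or `False` produce no argument, so the script falls back to its own defaults for them.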
+
+## Integration
+
+### With SLA Planner
+
+The Profiler generates interpolation data that the SLA Planner uses for autoscaling decisions.
+
+**Prefill Interpolation** (`selected_prefill_interpolation/raw_data.npz`):
+- `prefill_isl`: 1D array of input sequence lengths tested
+- `prefill_ttft`: 1D array of TTFTs (ms) at each ISL
+- `prefill_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each ISL
+
+**Decode Interpolation** (`selected_decode_interpolation/raw_data.npz`):
+- `max_kv_tokens`: Total KV tokens capacity in decode engine
+- `x_kv_usage`: 1D array of active KV usage percentages [0, 1]
+- `y_context_length`: 1D array of average context lengths tested
+- `z_itl`: 1D array of ITLs (ms) at each (KV usage, context length) point
+- `z_thpt_per_gpu`: 1D array of throughput (tokens/s/GPU) at each point
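As a rough illustration of how such arrays can be consumed, the sketch below estimates TTFT at an unseen ISL by piecewise-linear interpolation. It is not the Planner's implementation: the sample values are synthetic stand-ins for `prefill_isl` and `prefill_ttft`, and in practice the arrays would be read from the `.npz` file (for example with `numpy.load`).

```python
from bisect import bisect_left


def interp(x: float, xs: list[float], ys: list[float]) -> float:
    """Piecewise-linear interpolation, clamped to the sampled range."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # index of the first sample >= x
    x0, x1, y0, y1 = xs[i - 1], xs[i], ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Synthetic stand-ins for the prefill_isl and prefill_ttft arrays.
prefill_isl = [1000.0, 2000.0, 4000.0]
prefill_ttft = [60.0, 130.0, 290.0]

est = interp(3000.0, prefill_isl, prefill_ttft)
print(est)  # 210.0
```

The decode data would need a two-dimensional analogue over KV usage and context length, which is omitted here.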
+
+### With Dynamo Operator
+
+When using DGDR, the Dynamo Operator:
+
+1. Creates profiling jobs automatically
+2. Stores profiling data in ConfigMaps (`planner-profile-data`)
+3. Generates optimized DGD configurations
+4. Deploys the DGD with SLA Planner integration
+
+The generated DGD is tracked via labels:
+```yaml
+metadata:
+  labels:
+    dgdr.nvidia.com/name: my-deployment
+    dgdr.nvidia.com/namespace: your-namespace
+```
+
+### With Observability
+
+Monitor profiling jobs:
+
+```bash
+kubectl logs -f job/profile-<dgdr-name> -n $NAMESPACE
+kubectl describe dgdr <name> -n $NAMESPACE
+```
+
+## Advanced Topics
+
+### Manual Deployment Control
+
+Disable auto-deployment to review the generated DGD before applying:
+
+```yaml
+spec:
+  autoApply: false
+```
+
+Then manually extract and apply:
+
+```bash
+# Extract generated DGD from DGDR status
+kubectl get dgdr my-deployment -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' | kubectl apply -f -
+
+# Or save to file for review
+kubectl get dgdr my-deployment -n $NAMESPACE -o jsonpath='{.status.generatedDeployment}' > my-dgd.yaml
+```
+
+### Mocker Deployment
+
+Deploy a mocker deployment that simulates engines without GPUs:
+
+```yaml
+spec:
+  model: <model-name>
+  backend: trtllm
+  useMocker: true    # Deploy mocker instead of real backend
+  autoApply: true
+```
+
+Profiling still runs against the real backend to collect performance data. The mocker uses this data to simulate realistic timing behavior. Useful for large-scale experiments, testing Planner behavior, and validating configurations.
+
+### Accessing Profiling Artifacts
+
+By default, profiling data is stored in ConfigMaps. For detailed artifacts (plots, logs, raw data), attach a PVC:
+
+```yaml
+profilingConfig:
+  outputPVC: "dynamo-pvc"
+```
+
+**ConfigMaps (always created):**
+- `dgdr-output-<name>`: Generated DGD configuration
+- `planner-profile-data`: Profiling data for Planner (JSON)
+
+**PVC artifacts (optional):**
+- Performance plots (PNGs)
+- DGD configurations for each profiled deployment
+- AIPerf profiling artifacts
+- Raw profiling data (`.npz` files)
+- Profiler logs
+
+Access PVC results:
+```bash
+kubectl apply -f deploy/utils/manifests/pvc-access-pod.yaml -n $NAMESPACE
+kubectl wait --for=condition=Ready pod/pvc-access-pod -n $NAMESPACE --timeout=60s
+kubectl cp $NAMESPACE/pvc-access-pod:/data ./profiling-results
+kubectl delete pod pvc-access-pod -n $NAMESPACE
+```
+
+### Output Performance Plots
+
+The profiler generates plots to visualize performance data:
+
+**Parallelization Mapping Sweep Plots:**
+- `prefill_performance.png`: TTFT vs Parallelization Mapping size
+- `decode_performance.png`: ITL vs Parallelization Mapping size and in-flight requests
+
+**In-Depth Profiling Plots:**
+- `selected_prefill_interpolation/prefill_ttft_interpolation.png`: TTFT vs ISL
+- `selected_prefill_interpolation/prefill_throughput_interpolation.png`: Throughput vs ISL
+- `selected_decode_interpolation/decode_itl_interplation.png`: ITL vs KV usage and context length
+- `selected_decode_interpolation/decode_throughput_interpolation.png`: Throughput vs KV usage and context length
+
+## Runtime Profiling (SGLang)
+
+SGLang workers expose profiling endpoints for runtime performance analysis:
+
+```bash
+# Start profiling
+curl -X POST http://localhost:9090/engine/start_profile \
+  -H "Content-Type: application/json" \
+  -d '{"output_dir": "/tmp/profiler_output"}'
+
+# Run inference requests...
+
+# Stop profiling
+curl -X POST http://localhost:9090/engine/stop_profile
+```
+
+View traces using Chrome's `chrome://tracing`, [Perfetto UI](https://ui.perfetto.dev/), or TensorBoard.
+
+## Troubleshooting
+
+### Profiling Takes Too Long
+
+**Solution 1**: Use AI Configurator for rapid profiling (TensorRT-LLM only):
+```yaml
+sweep:
+  useAiConfigurator: true
+```
+
+**Solution 2**: Reduce search space:
+```yaml
+hardware:
+  minNumGpusPerEngine: 4  # Skip TP1, TP2
+  maxNumGpusPerEngine: 8  # Don't test beyond TP8
+```
+
+### SLA Cannot Be Met
+
+**Symptoms**: Profiler reports no configuration meets targets
+
+**Solutions:**
+1. Relax SLA targets (increase TTFT/ITL)
+2. Add more GPU resources
+3. Try a different backend
+4. Use a smaller model
+
+### AI Configurator: Attention Head Constraint Error
+
+**Symptoms**: Profiling fails with error:
+```text
+AssertionError: num_heads <N> should be divisible by tp_size <M> and the division result should be >= 4
+```
+
+**Cause**: AI Configurator requires **≥4 attention heads per GPU**. Small models with few heads cannot use high TP sizes.
+
+**Affected Models:**
+- **Qwen3-0.6B** (16 heads): Max TP = 4
+- **GPT-2** (12 heads): Max TP = 3
+- Most models **\<1B parameters**: May hit this constraint
+
+**Solution**: Limit `maxNumGpusPerEngine`:
+```yaml
+hardware:
+  maxNumGpusPerEngine: 4  # For Qwen3-0.6B (16 heads / 4 = max TP of 4)
+```
+
+**Calculate Max TP**: `max_tp = num_attention_heads / 4`
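The assertion message implies two conditions: the head count must be divisible by the TP size, and the quotient must be at least 4. A minimal sketch that enumerates the TP sizes satisfying both (the function name is hypothetical):

```python
def valid_tp_sizes(num_attention_heads: int, max_gpus: int,
                   min_heads_per_gpu: int = 4) -> list[int]:
    """TP sizes that divide the head count and leave enough heads per GPU."""
    return [
        tp for tp in range(1, max_gpus + 1)
        if num_attention_heads % tp == 0
        and num_attention_heads // tp >= min_heads_per_gpu
    ]


print(valid_tp_sizes(16, 8))  # [1, 2, 4]
print(valid_tp_sizes(12, 8))  # [1, 2, 3]
```

The largest value in each list matches the maximum TP quoted above for 16 and 12 heads.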
+
+> [!NOTE]
+> This is an AI Configurator limitation. Online profiling doesn't have this constraint.
+
+### Image Pull Errors
+
+**Symptoms**: `ErrImagePull` or `ImagePullBackOff`
+
+**Solution**: Ensure image pull secrets are configured:
+```bash
+kubectl create secret docker-registry nvcr-imagepullsecret \
+  --docker-server=nvcr.io \
+  --docker-username='$oauthtoken' \
+  --docker-password=<NGC_API_KEY> \
+  --namespace <your-namespace>
+```
+
+### Out of Memory During Profiling
+
+**Symptoms**: OOM errors in profiling jobs
+
+**Solutions:**
+1. Reduce `gpu_memory_utilization` in engine config
+2. Reduce `--max-context-length`
+3. Skip larger TP configurations
+4. Use fewer GPUs per test
+
+### Unsupported Parallelization Mapping in Backend
+
+**Symptoms**: Startup/runtime error in the backend (e.g., prime number of attention heads constraining TP to 1, or backend not supporting different TP sizes for prefill and decode).
+
+**Solutions:**
+1. Ask the backend maintainers to add support, then bump the backend version in Dynamo
+2. Constrain the max and min number of GPUs per engine to the supported range
+
+## See Also
+
+- [Profiler Examples](profiler-examples.md) - Complete DGDR YAML examples
+- [SLA Planner Guide](../planner/planner-guide.md) - End-to-end deployment workflow
+- [SLA Planner Architecture](../planner/planner-guide.md) - How the Planner uses profiling data
+- [DGDR API Reference](../../kubernetes/api-reference.md) - DGDR specification
+- [Profiler Arguments Reference](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/profiler/utils/profiler_argparse.py) - Full CLI reference
--- a/fern/pages/components/router/README.md
+++ b/fern/pages/components/router/README.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+---
+
+# Router
+
+The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
+
+## Quick Start
+
+### Python / CLI Deployment
+
+To launch the Dynamo frontend with the KV Router:
+
+```bash
+python -m dynamo.frontend --router-mode kv --http-port 8000
+```
+
+This command:
+- Launches the Dynamo frontend service with KV routing enabled
+- Exposes the service on port 8000 (configurable)
+- Automatically handles all backend workers registered to the Dynamo endpoint
+
+Backend workers register themselves using the `register_llm` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap.
+
+#### CLI Arguments
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--router-mode kv` | `round_robin` | Enable KV cache-aware routing |
+| `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
+| `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
+| `--kv-events` / `--no-kv-events` | `--kv-events` | Enable/disable real-time KV event tracking |
+| `--kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) |
+
+For all available options: `python -m dynamo.frontend --help`
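To make the effect of `--kv-overlap-score-weight` and `--router-temperature` more concrete, here is a simplified sketch of cost-based worker selection. The cost formula and the softmax sampling are assumptions made for illustration, not the router's confirmed algorithm, and all names are hypothetical.

```python
import math
import random


def worker_cost(prefill_blocks: int, decode_blocks: int,
                overlap_score_weight: float = 1.0) -> float:
    """Assumed cost: weighted new prefill blocks plus active decode blocks."""
    return overlap_score_weight * prefill_blocks + decode_blocks


def pick_worker(costs: dict[str, float], temperature: float = 0.0, rng=random) -> str:
    """Temperature 0 picks the cheapest worker; higher values sample more randomly."""
    if temperature == 0.0:
        return min(costs, key=costs.get)
    lowest = min(costs.values())
    weights = [math.exp(-(c - lowest) / temperature) for c in costs.values()]
    return rng.choices(list(costs), weights=weights, k=1)[0]


# A request of 10 blocks: worker-a has 8 cached (2 left to prefill) but is busy
# decoding; worker-b has nothing cached but is mostly idle.
def costs_for(weight: float) -> dict[str, float]:
    return {
        "worker-a": worker_cost(2, 30, weight),
        "worker-b": worker_cost(10, 5, weight),
    }


print(pick_worker(costs_for(1.0)))  # worker-b
print(pick_worker(costs_for(5.0)))  # worker-a
```

In this toy setup, raising the weight shifts the choice toward the worker with more cached blocks, which is consistent with the "higher = better TTFT" description in the table.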
+
+### Kubernetes Deployment
+
+To enable the KV Router in Kubernetes, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-deployment
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: my-namespace
+      componentType: frontend
+      replicas: 1
+      envs:
+        - name: DYN_ROUTER_MODE
+          value: kv  # Enable KV Smart Router
+```
+
+**Key Points:**
+- Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only
+- Workers automatically report KV cache events to the router
+- No worker-side configuration changes needed
+
+#### Environment Variables
+
+All CLI arguments can be configured via environment variables using the `DYN_` prefix:
+
+| CLI Argument | Environment Variable | Default |
+|--------------|---------------------|---------|
+| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round_robin` |
+| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` |
+| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific |
+| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` |
+| `--kv-overlap-score-weight` | `DYN_KV_OVERLAP_SCORE_WEIGHT` | `1.0` |
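The naming pattern in the table can be expressed as a short helper. This is a sketch inferred from the five rows above, not a function provided by Dynamo; `to_env` is a hypothetical name, and treating a `--no-` prefix as `false` is generalized from the `--no-kv-events` row.

```python
def to_env(flag: str, value: str = "") -> tuple[str, str]:
    """Map a CLI flag to its DYN_-prefixed environment variable and value."""
    name = flag.lstrip("-")
    if name.startswith("no-"):
        # Negated flags map to the positive variable set to false.
        name, value = name[len("no-"):], "false"
    return "DYN_" + name.replace("-", "_").upper(), value


print(to_env("--router-mode", "kv"))
print(to_env("--no-kv-events"))
```

The resulting pairs can be placed directly in the `envs` list of the Frontend service.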
+
+For complete K8s examples and advanced configuration, see [K8s Examples](router-examples.md#k8s-examples).
+
+For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
+
+For more configuration options and tuning guidelines, see the [Router Guide](router-guide.md).
+
+## Prerequisites and Limitations
+
+**Requirements:**
+- **Dynamic endpoints only**: KV router requires `register_llm()` with `model_input=ModelInput.Tokens`. Your backend handler receives pre-tokenized requests with `token_ids` instead of raw text.
+- Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../../development/backend-guide.md))
+- You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead)
+
+**Multimodal Support:**
+- **vLLM and TRT-LLM**: Multimodal routing supported for images via multimodal hashes
+- **SGLang**: Image routing not yet supported
+- **Other modalities** (audio, video, etc.): Not yet supported
+
+**Limitations:**
+- Static endpoints are not supported: the KV router requires dynamic model discovery via etcd to track worker instances and their KV cache states
+
+For basic model registration without KV routing, use `--router-mode round-robin` or `--router-mode random` with both static and dynamic endpoints.
+
+## Next Steps
+
+- **[Router Guide](router-guide.md)**: Deep dive into KV cache routing, configuration, disaggregated serving, and tuning
+- **[Router Examples](router-examples.md)**: Python API usage, K8s examples, and custom routing patterns
+- **[Router Design](../../design-docs/router-design.md)**: Architecture details, algorithms, and event transport modes
--- a/fern/pages/components/router/router-examples.md
+++ b/fern/pages/components/router/router-examples.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+---
+
+# Router Examples
+
+For quick start instructions, see the [Router README](README.md). This document provides further examples for using the Dynamo Router, including Python API usage, Kubernetes deployments, and custom routing patterns.
+
+## Table of Contents
+
+- [Using KvPushRouter Python API](#using-kvpushrouter-python-api)
+- [K8s Examples](#k8s-examples)
+- [Routing Patterns](#routing-patterns)
+- [Custom Routing Example: Minimizing TTFT](#custom-routing-example-minimizing-ttft)
+- [KV Event Publishing for Custom Engines](#kv-event-publishing-for-custom-engines)
+
+## Using KvPushRouter Python API
+
+Instead of launching the KV Router via command line, you can create a `KvPushRouter` object directly in Python. This allows per-request routing configuration overrides.
+
+> [!WARNING]
+> **Multiple Routers in Same Process**: If you need to run multiple `KvPushRouter` instances for fault tolerance or load distribution, you must launch them in **separate processes** (e.g., using `python -m dynamo.frontend` with different ports). Creating multiple `KvPushRouter` objects in the same Python process is not supported - they share the same cancellation token from the component's primary lease, so dropping one router will cancel all routers in that process. For in-process routing, use a single `KvPushRouter` instance.
+
+### Methods
+
+The `KvPushRouter` provides the following methods:
+
+- **`generate(token_ids, model, ...)`**: Route and execute a request, returning an async stream of responses. Automatically handles worker selection, state tracking, and lifecycle management.
+
+- **`best_worker(token_ids, router_config_override=None, request_id=None)`**: Query which worker would be selected for given tokens. Returns `(worker_id, dp_rank, overlap_blocks)`.
+  - Without `request_id`: Query-only, doesn't update router state
+  - With `request_id`: Updates router state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking
+
+- **`get_potential_loads(token_ids)`**: Get detailed load information for all workers, including potential prefill tokens and active decode blocks. Returns a list of load dictionaries.
+
+- **`mark_prefill_complete(request_id)`**: Signal that a request has completed its prefill phase. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`.
+
+- **`free(request_id)`**: Signal that a request has completed and its resources should be released. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker()` for manual routing instead of `generate()`.
+
+- **`dump_events()`**: Dump all KV cache events from the router's indexer as a JSON string. Useful for debugging and analysis.
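The manual lifecycle described above (`best_worker()` with a `request_id`, then `mark_prefill_complete()`, then `free()`) can be sketched as follows. To keep the example runnable without a Dynamo deployment it uses a local `StandInRouter` that only mirrors the method names listed here. It is not the real `KvPushRouter`, the awaitable methods are an assumption, and `route_manually` and `fake_send` are hypothetical helpers.

```python
import asyncio


class StandInRouter:
    """Local stand-in mirroring the method names above; records calls only."""

    def __init__(self):
        self.calls = []

    async def best_worker(self, token_ids, router_config_override=None, request_id=None):
        self.calls.append(("best_worker", request_id))
        return 0, 0, 0  # (worker_id, dp_rank, overlap_blocks), placeholder values

    async def mark_prefill_complete(self, request_id):
        self.calls.append(("mark_prefill_complete", request_id))

    async def free(self, request_id):
        self.calls.append(("free", request_id))


async def route_manually(router, token_ids, request_id, send):
    """Select a worker, then report prefill completion and release the request."""
    worker_id, dp_rank, _overlap = await router.best_worker(token_ids, request_id=request_id)
    try:
        first = True
        async for _chunk in send(worker_id, dp_rank, token_ids):
            if first:
                # The first output chunk implies prefill has finished.
                await router.mark_prefill_complete(request_id)
                first = False
    finally:
        # Always release the request, even if the stream fails.
        await router.free(request_id)


async def fake_send(worker_id, dp_rank, token_ids):
    """Hypothetical transport that yields two dummy chunks."""
    for chunk in ("a", "b"):
        yield chunk


router = StandInRouter()
asyncio.run(route_manually(router, [1, 2, 3], "req-1", fake_send))
print([name for name, _ in router.calls])
```

Placing `free()` in a `finally` block keeps load tracking accurate when a request errors out midway, which is the concern raised in the note on `best_worker()`.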
+
+### Setup
+
+First, launch your backend engines:
+```bash
+python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf
+```
+
+### Example Script
+
+```python
+import asyncio
+from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig
+
+async def main():
+    # Get runtime and create endpoint
+    runtime = DistributedRuntime.detached()
+    namespace = runtime.namespace("dynamo")
+    component = namespace.component("backend")
+    endpoint = component.endpoint("generate")
+
+    # Create KV router
+    kv_router_config = KvRouterConfig()
+    router = KvPushRouter(
+        endpoint=endpoint,
+        block_size=16,
+        kv_router_config=kv_router_config
+    )
+
+    # Your input tokens
+    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+
+    # Generate with per-request routing override
+    stream = await router.generate(
+        token_ids=token_ids,
+        model="meta-llama/Llama-2-7b-hf",
+        stop_conditions={
+            "max_tokens": 20,        # Generate exactly 20 tokens
+            "ignore_eos": True,      # Don't stop at EOS token
+        },
+        sampling_options={
+            "temperature": 0.7,
+            "top_p": 0.9,
+        },
+        router_config_override={
+            "overlap_score_weight": 2.0,    # Prioritize cache hits for this request
+            "router_temperature": 0.5,       # Add routing randomness
+        }
+    )
+
+    # Collect generated tokens
+    generated_tokens = []
+    async for response in stream:
+        if isinstance(response, dict) and "token_ids" in response:
+            generated_tokens.extend(response["token_ids"])
+
+    print(f"Generated {len(generated_tokens)} tokens: {generated_tokens}")
+
+if __name__ == "__main__":
+    asyncio.run(main())
+```
+
+## K8s Examples
+
+For basic Kubernetes deployment with the KV Router, see the [Kubernetes Deployment section](README.md#kubernetes-deployment) in the Quick Start guide.
+
+### Complete K8s Examples
+
+- [TRT-LLM aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/deploy/agg_router.yaml)
+- [vLLM aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/vllm/deploy/agg_router.yaml)
+- [SGLang aggregated router example](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/agg_router.yaml)
+- [Distributed inference tutorial](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/kubernetes/Distributed_Inference/agg_router.yaml)
+
+**For A/B Testing and Advanced K8s Setup:**
+See the comprehensive [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md) for step-by-step instructions on deploying, configuring, and benchmarking the KV router in Kubernetes.
+
+### Example with Advanced Configuration
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-deployment
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: my-namespace
+      componentType: frontend
+      replicas: 1
+      envs:
+        - name: DYN_ROUTER_MODE
+          value: kv
+        - name: DYN_ROUTER_TEMPERATURE
+          value: "0.5"  # Add some randomness to prevent worker saturation
+        - name: DYN_KV_OVERLAP_SCORE_WEIGHT
+          value: "1.5"  # Prioritize TTFT over ITL
+        - name: DYN_KV_CACHE_BLOCK_SIZE
+          value: "16"
+      extraPodSpec:
+        mainContainer:
+          image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
+```
+
+### Alternative: Using Command Args in K8s
+
+You can also pass CLI arguments directly in the container command:
+
+```yaml
+extraPodSpec:
+  mainContainer:
+    image: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0
+    command:
+      - /bin/sh
+      - -c
+    args:
+      - "python3 -m dynamo.frontend --router-mode kv --router-temperature 0.5 --http-port 8000"
+```
+
+**Recommendation:** Use environment variables for easier configuration management and consistency with Dynamo's K8s patterns.
+
+## Routing Patterns
+
+The `KvPushRouter` supports multiple usage patterns depending on your control requirements:
+
+### 1. Automatic Routing (Recommended)
+Call `generate()` directly and let the router handle everything:
+```python
+stream = await router.generate(token_ids=tokens, model="model-name")
+```
+- **Best for**: Most use cases
+- **Router automatically**: Selects best worker, updates state, routes request, tracks lifecycle
+
+### 2. Manual State Management (Advanced)
+Use `best_worker(request_id=...)` to select and track, then manage the request yourself:
+```python
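+worker_id, _dp_rank, overlap = await router.best_worker(tokens, request_id="req-123")
+# `client` stands for your own request path; you are responsible for sending the request to `worker_id`
+response = await client.generate(tokens, request_id="req-123")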
+# await anext(response)  # Get first token
+await router.mark_prefill_complete("req-123")  # After first token
+# async for _ in response:  # Continue generating
+#     ...
+await router.free("req-123")  # After completion
+```
+- **Best for**: Custom request handling with router state tracking
+- **Requires**: Calling `mark_prefill_complete()` and `free()` at correct lifecycle points
+- **Caution**: Incorrect lifecycle management degrades load balancing accuracy
+
+### 3. Hierarchical Router Probing
+Query without state updates, then route through a chosen router:
+```python
+# Probe multiple routers without updating state
+worker_id_1, _dp_rank_1, overlap_1 = await router_1.best_worker(tokens)  # No request_id
+worker_id_2, _dp_rank_2, overlap_2 = await router_2.best_worker(tokens)
+
+# Pick the router (and the worker it selected) with the larger cache overlap
+if overlap_1 > overlap_2:
+    chosen_router, worker_id = router_1, worker_id_1
+else:
+    chosen_router, worker_id = router_2, worker_id_2
+stream = await chosen_router.generate(tokens, model="model-name", worker_id=worker_id)
+```
+- **Best for**: Multi-tier deployments (e.g., Envoy Gateway routing to multiple router groups)
+- **Advantage**: Query multiple routers before committing to one
+
+### 4. Custom Load-Based Routing
+Use `get_potential_loads()` to implement custom routing logic:
+```python
+loads = await router.get_potential_loads(tokens)
+# Apply custom logic (e.g., weighted scoring, constraints)
+best_worker = min(loads, key=lambda x: custom_cost_fn(x))
+stream = await router.generate(tokens, model="model-name", worker_id=best_worker['worker_id'])
+```
+- **Best for**: Custom optimization strategies beyond the built-in cost function
+- **Advantage**: Full control over worker selection logic
+- **See also**: Detailed example below in "Custom Routing Example: Minimizing TTFT"
+
+All patterns support `router_config_override` to adjust routing behavior per-request without recreating the router.
+
+## Custom Routing Example: Minimizing TTFT
+
+Here's an example of using `get_potential_loads()` to implement custom routing that minimizes Time To First Token (TTFT) by selecting the worker with the least prefill work:
+
+```python
+import asyncio
+from dynamo.llm import DistributedRuntime, KvPushRouter, KvRouterConfig
+
+async def minimize_ttft_routing():
+    # Setup router
+    runtime = DistributedRuntime.detached()
+    namespace = runtime.namespace("dynamo")
+    component = namespace.component("backend")
+    endpoint = component.endpoint("generate")
+
+    router = KvPushRouter(
+        endpoint=endpoint,
+        block_size=16,
+        kv_router_config=KvRouterConfig()
+    )
+
+    # Your input tokens
+    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
+
+    # Get potential loads for all workers
+    potential_loads = await router.get_potential_loads(token_ids)
+
+    # Find worker with minimum prefill tokens (best for TTFT)
+    best_worker = min(potential_loads, key=lambda x: x['potential_prefill_tokens'])
+
+    print(f"Worker loads: {potential_loads}")
+    print(f"Selected worker {best_worker['worker_id']} with {best_worker['potential_prefill_tokens']} prefill tokens")
+
+    # Route directly to the selected worker
+    stream = await router.generate(
+        token_ids=token_ids,
+        model="meta-llama/Llama-2-7b-hf",
+        worker_id=best_worker['worker_id'],  # Force routing to optimal worker
+        stop_conditions={"max_tokens": 20}
+    )
+
+    # Process response
+    async for response in stream:
+        if isinstance(response, dict) and "token_ids" in response:
+            print(f"Generated tokens: {response['token_ids']}")
+
+if __name__ == "__main__":
+    asyncio.run(minimize_ttft_routing())
+```
+
+This approach gives you complete control over routing decisions, allowing you to optimize for different metrics based on your specific requirements. For example:
+
+- **Minimize TTFT**: Select worker with lowest `potential_prefill_tokens`
+- **Maximize cache reuse**: Use `best_worker()` which considers both prefill and decode loads
+- **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together
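A balanced cost function along these lines could look like the sketch below. It is only an illustration: the dictionary keys are the ones named in the text above (the real return value of `get_potential_loads()` may carry different or additional fields), the load numbers are made-up sample data, and `custom_cost_fn` with its `block_size` and `prefill_weight` parameters is a hypothetical helper.

```python
def custom_cost_fn(load: dict, block_size: int = 16, prefill_weight: float = 1.0) -> float:
    """Combine prefill and decode load into one score (lower is better)."""
    # Convert prefill tokens to blocks so both terms use the same unit.
    prefill_blocks = load["potential_prefill_tokens"] / block_size
    return prefill_weight * prefill_blocks + load["potential_decode_blocks"]


# Made-up sample data shaped like the load dictionaries described above.
sample_loads = [
    {"worker_id": 0, "potential_prefill_tokens": 128, "potential_decode_blocks": 10},
    {"worker_id": 1, "potential_prefill_tokens": 80, "potential_decode_blocks": 5},
    {"worker_id": 2, "potential_prefill_tokens": 32, "potential_decode_blocks": 9},
]

best = min(sample_loads, key=custom_cost_fn)
print(best["worker_id"], custom_cost_fn(best))
```

Raising `prefill_weight` pushes the choice toward workers with less prefill work (better TTFT), while lowering it favors workers with fewer active decode blocks.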
+
+See [Router Design](../../design-docs/router-design.md) for architecture details and the cost function algorithm.
+
+## KV Event Publishing for Custom Engines
+
+The KV Router relies on real-time events from backend workers to track which KV cache blocks are stored on each worker. When your custom engine allocates or evicts KV cache blocks, it should publish these events so the router can make well-informed routing decisions. There are two publishing pathways. Direct NATS publishing (`KvEventPublisher`) sends events straight to NATS and is the simplest approach for custom engines. ZMQ-based publishing suits engines that already emit events over ZMQ (like vLLM): the engine runs a ZMQ publisher, and `ZmqKvEventPublisher` forwards those events to NATS.
+
+### Event Types
+
+The KV cache supports three event types:
+
+| Event Type | Description | When to Publish |
+|------------|-------------|-----------------|
+| `BlockStored` | New blocks added to cache | After KV cache allocation succeeds |
+| `BlockRemoved` | Blocks evicted from cache | When blocks are evicted or freed |
+| `AllBlocksCleared` | All blocks removed | On cache reset or worker restart |
+
+### Event Structure
+
+Each event contains:
+- **`event_id`**: Monotonically increasing identifier per worker
+- **`dp_rank`**: Data parallel rank (0 if DP not enabled)
+- **`data`**: One of `Stored`, `Removed`, or `Cleared`
+
+For `BlockStored` events:
+- **`token_ids`**: List of token IDs for the stored blocks
+- **`block_hashes`**: List of **sequence block hashes** from the engine's block manager. These are cumulative hashes that incorporate all tokens from the start of the sequence up to and including the current block (not just the tokens within that block). This enables prefix matching across requests.
+- **`num_block_tokens`**: Number of tokens per block (should all equal `kv_block_size`)
+- **`parent_hash`**: Hash of the parent block. Required for all blocks except the first block in a sequence (which has no parent).
+- **`lora_id`**: LoRA adapter ID (0 if not using LoRA)
+
+For `BlockRemoved` events:
+- **`block_hashes`**: List of sequence block hashes being evicted
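To make the "cumulative" property of `block_hashes` concrete, here is a minimal sketch that chains each block's hash to its parent. The `chain_block_hashes` helper and its SHA-256 construction are hypothetical and are not the hash function used by Dynamo or any particular engine; only the chaining structure is the point.

```python
import hashlib
from typing import Optional


def chain_block_hashes(token_ids: list, block_size: int) -> list:
    """Return (block_hash, parent_hash) for every full block of token_ids."""
    out = []
    parent: Optional[int] = None
    full_len = len(token_ids) - len(token_ids) % block_size  # drop the partial tail block
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        h = hashlib.sha256()
        # Mixing in the parent hash makes each hash depend on the whole prefix.
        h.update(b"root" if parent is None else parent.to_bytes(8, "big"))
        h.update(b"".join(t.to_bytes(4, "big") for t in block))
        block_hash = int.from_bytes(h.digest()[:8], "big")
        out.append((block_hash, parent))
        parent = block_hash
    return out


a = chain_block_hashes([1, 2, 3, 4, 5, 6, 7, 8], block_size=4)
b = chain_block_hashes([1, 2, 3, 4, 9, 9, 9, 9], block_size=4)
c = chain_block_hashes([0, 0, 0, 0, 5, 6, 7, 8], block_size=4)
print(a[0][0] == b[0][0])  # shared first block: hashes match
print(a[1][0] == c[1][0])  # same tokens in block 2, different prefix: hashes differ
```

Because the second block's hash depends on its parent, two requests match in the router's index only as far as they share an actual prefix. This is also why `parent_hash` accompanies every block after the first.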
+
+### Option 1: Direct NATS Publishing (Recommended)
+
+The `KvEventPublisher` class publishes events directly to NATS. This is the simplest approach for custom engines.
+
+```mermaid
+flowchart LR
+    subgraph Engine["Custom Engine"]
+        cache["KV Cache Manager"]
+    end
+
+    subgraph Worker["Dynamo Worker Process"]
+        pub["KvEventPublisher"]
+    end
+
+    subgraph NATS["NATS"]
+        subject["kv-events subject"]
+    end
+
+    subgraph Router["KV Router"]
+        indexer["KvIndexer"]
+    end
+
+    cache -->|"on_blocks_stored()<br/>on_blocks_removed()"| pub
+    pub -->|"publish to NATS"| subject
+    subject --> indexer
+```
+
+**When to use:**
+- Building a custom inference engine from scratch
+- Your engine doesn't have a ZMQ-based event system
+- You want the simplest integration path
+
+#### Basic Setup
+
+```python
+from dynamo.llm import KvEventPublisher
+
+class CustomEnginePublisher:
+    def __init__(self, component, worker_id: int, block_size: int, dp_rank: int = 0):
+        self.block_size = block_size
+        self.event_id = 0
+        self.kv_publisher = KvEventPublisher(
+            component=component,
+            worker_id=worker_id,
+            kv_block_size=block_size,
+            dp_rank=dp_rank,
+            enable_local_indexer=False,
+        )
+
+    def on_blocks_stored(self, token_ids: list[int], block_hashes: list[int],
+                         lora_id: int = 0, parent_hash: int | None = None):
+        """Call after KV cache blocks are allocated."""
+        self.event_id += 1
+        num_block_tokens = [self.block_size] * len(block_hashes)
+        self.kv_publisher.publish_stored(
+            event_id=self.event_id,
+            token_ids=token_ids,
+            num_block_tokens=num_block_tokens,
+            block_hashes=block_hashes,
+            lora_id=lora_id,
+            parent_hash=parent_hash,
+        )
+
+    def on_blocks_removed(self, block_hashes: list[int]):
+        """Call when KV cache blocks are evicted."""
+        self.event_id += 1
+        self.kv_publisher.publish_removed(event_id=self.event_id, block_hashes=block_hashes)
+```
+
+#### Integration with Your Engine
+
+```python
+from dynamo.llm import register_llm
+
+async def main():
+    # Register your engine with Dynamo
+    component, endpoint = await register_llm(
+        model="my-model",
+        generator=my_generate_fn,
+    )
+
+    # Initialize publisher
+    publisher = CustomEnginePublisher(
+        component=component,
+        worker_id=endpoint.connection_id(),
+        block_size=16,  # Match your engine's block size
+    )
+
+    # Hook into your engine's cache events
+    def on_prefill_complete(request_id, token_ids, blocks):
+        block_hashes = [block.hash for block in blocks]
+        publisher.on_blocks_stored(token_ids=token_ids, block_hashes=block_hashes)
+
+    def on_cache_eviction(evicted_blocks):
+        block_hashes = [block.hash for block in evicted_blocks]
+        publisher.on_blocks_removed(block_hashes=block_hashes)
+```
+
+### Option 2: ZMQ-based Publishing
+
+For engines that publish events via ZMQ (like vLLM), this option uses two components that work together:
+
+1. **ZMQ Publisher** (in your engine) - Publishes events to a ZMQ socket
+2. **ZmqKvEventPublisher** (Dynamo binding) - Subscribes to ZMQ and forwards to NATS
+
+```mermaid
+flowchart LR
+    subgraph Engine["Custom Engine / vLLM"]
+        cache["KV Cache Manager"]
+        zmq_pub["ZMQ Publisher<br/>(Pure Python)"]
+    end
+
+    subgraph ZMQ["ZMQ Socket"]
+        socket["tcp://127.0.0.1:5557"]
+    end
+
+    subgraph Worker["Dynamo Worker Process"]
+        zmq_sub["ZmqKvEventPublisher<br/>(Rust bindings)"]
+    end
+
+    subgraph NATS["NATS"]
+        subject["kv-events subject"]
+    end
+
+    subgraph Router["KV Router"]
+        indexer["KvIndexer"]
+    end
+
+    cache --> zmq_pub
+    zmq_pub -->|"PUB"| socket
+    socket -->|"SUB"| zmq_sub
+    zmq_sub --> subject
+    subject --> indexer
+```
+
+**When to use:**
+- Your engine already has a ZMQ-based event system (like vLLM)
+- You're integrating with a consolidator (like KVBM)
+- You want to decouple event publishing from your engine's main loop
+
+#### Part 1: ZMQ Subscriber (Dynamo Bindings)
+
+If your engine already publishes to ZMQ, use `ZmqKvEventPublisher` to subscribe and forward to NATS:
+
+```python
+from dynamo.llm import ZmqKvEventPublisher, ZmqKvEventPublisherConfig
+
+# Configure the ZMQ subscriber
+config = ZmqKvEventPublisherConfig(
+    worker_id=endpoint.connection_id(),
+    kv_block_size=block_size,
+    zmq_endpoint="tcp://127.0.0.1:5557",  # Where your engine publishes
+    zmq_topic="",                          # Subscribe to all topics
+    enable_local_indexer=False,
+)
+
+# Create publisher - it automatically subscribes to ZMQ and forwards to NATS
+kv_publisher = ZmqKvEventPublisher(
+    component=component,
+    config=config,
+)
+```
+
+#### Part 2: ZMQ Publisher (Pure Python)
+
+If your engine needs to publish to ZMQ (e.g., for consolidator integration), implement the ZMQ protocol:
+
+```python
+import zmq
+import msgpack
+import time
+
+class ZmqKvEventPublisher:
+    """Pure Python ZMQ publisher for KV events (vLLM-compatible format)."""
+
+    def __init__(self, zmq_endpoint: str, kv_block_size: int, topic: str = ""):
+        self.kv_block_size = kv_block_size
+        self.topic = topic
+        self.ctx = zmq.Context()
+        self.socket = self.ctx.socket(zmq.PUB)
+        self.socket.bind(zmq_endpoint)
+        self.sequence = 0
+        self.data_parallel_rank = 0
+
+    def _to_signed_i64(self, value: int | None) -> int | None:
+        if value is None:
+            return None
+        return value - 0x10000000000000000 if value > 0x7FFFFFFFFFFFFFFF else value
+
+    def publish_stored(self, event_id: int, token_ids: list[int], num_block_tokens: list[int],
+                       block_hashes: list[int], lora_id: int = 0, parent_hash: int | None = None):
+        event = {
+            "type": "BlockStored",
+            "block_hashes": [self._to_signed_i64(h) for h in block_hashes],
+            "parent_block_hash": self._to_signed_i64(parent_hash),
+            "token_ids": token_ids,
+            "block_size": self.kv_block_size,
+            "lora_id": lora_id if lora_id != 0 else None,
+        }
+        self._publish_event(event)
+
+    def publish_removed(self, event_id: int, block_hashes: list[int]):
+        event = {"type": "BlockRemoved", "block_hashes": [self._to_signed_i64(h) for h in block_hashes]}
+        self._publish_event(event)
+
+    def publish_all_cleared(self):
+        self._publish_event({"type": "AllBlocksCleared"})
+
+    def _publish_event(self, event: dict):
+        batch = [time.time(), [event], self.data_parallel_rank]
+        payload = msgpack.packb(batch, use_bin_type=True)
+        sequence_bytes = self.sequence.to_bytes(8, byteorder="big")
+        self.sequence += 1
+        self.socket.send_multipart([self.topic.encode(), sequence_bytes, payload])
+
+    def shutdown(self):
+        self.socket.close()
+        self.ctx.term()
+```
+
+### ZMQ Wire Format
+
+The ZMQ message format (compatible with vLLM):
+
+| Frame | Description |
+|-------|-------------|
+| 1 | Topic (empty string for all topics) |
+| 2 | Sequence number (8 bytes, big-endian) |
+| 3 | Msgpack payload: `[timestamp, [events], dp_rank]` |
+
+Each event in the payload is a dictionary with `type` field (`BlockStored`, `BlockRemoved`, or `AllBlocksCleared`).
+
+### Best Practices
+
+1. **Event IDs must be monotonically increasing** per worker (use a thread-safe counter)
+
+2. **Block size must match** your engine's actual `kv_block_size`
+
+3. **`parent_hash` is required** for all blocks except the first in a sequence, because it links blocks together to enable prefix matching
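For the first practice, one way to get a thread-safe counter is a small lock-protected class. This is a sketch under the assumption that several engine threads may emit events concurrently; `EventIdCounter` is a hypothetical helper, not part of the Dynamo API.

```python
import threading


class EventIdCounter:
    """Hands out strictly increasing event IDs, safe to call from many threads."""

    def __init__(self, start: int = 0):
        self._value = start
        self._lock = threading.Lock()

    def next(self) -> int:
        # The lock makes increment-and-read atomic, so no ID is issued twice.
        with self._lock:
            self._value += 1
            return self._value


counter = EventIdCounter()
ids = []


def emit_events():
    for _ in range(1000):
        ids.append(counter.next())


threads = [threading.Thread(target=emit_events) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(ids), len(set(ids)), max(ids))
```

The counter only guarantees unique, increasing IDs at the moment they are issued. If events must also reach the publisher in ID order, the publish call would need to happen under the same lock or on a single publishing thread.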
+
+## See Also
+
+- **[Router README](README.md)**: Quick start guide for the KV Router
+- **[Router Guide](router-guide.md)**: Configuration, tuning, and production setup
+- **[Router Design](../../design-docs/router-design.md)**: Architecture details and event transport modes
--- a/fern/pages/router/kv-cache-routing.md
+++ b/fern/pages/router/kv-cache-routing.md
@@ -3,12 +3,135 @@
 # SPDX-License-Identifier: Apache-2.0
 ---

-# KV Cache Routing
+# Router Guide

-This document explains how Dynamo's Key-Value (KV) cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data, while maintaining load balance through worker utilization metrics.
+## Overview
+
+The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
+This guide helps you get started with using the Dynamo router, with further details on configuration, disaggregated serving setup, and parameter tuning.
+
+## Quick Start
+
+### Python / CLI Deployment
+
+To launch the Dynamo frontend with the KV Router:
+
+```bash
+python -m dynamo.frontend --router-mode kv --http-port 8000
+```
+
+This command:
+- Launches the Dynamo frontend service with KV routing enabled
+- Exposes the service on port 8000 (configurable)
+- Automatically handles all backend workers registered to the Dynamo endpoint
+
+Backend workers register themselves using the `register_llm` API, after which the KV Router automatically tracks worker state and makes routing decisions based on KV cache overlap.
+
+#### CLI Arguments
+
+| Argument | Default | Description |
+|----------|---------|-------------|
+| `--router-mode kv` | `round-robin` | Enable KV cache-aware routing |
+| `--router-temperature <float>` | `0.0` | Controls routing randomness (0.0 = deterministic, higher = more random) |
+| `--kv-cache-block-size <size>` | Backend-specific | KV cache block size (should match backend config) |
+| `--kv-events` / `--no-kv-events` | `--kv-events` | Enable/disable real-time KV event tracking |
+| `--kv-overlap-score-weight <float>` | `1.0` | Balance prefill vs decode optimization (higher = better TTFT) |
+
+For all available options: `python -m dynamo.frontend --help`
+
+For detailed configuration options and tuning parameters, see [Using the KV Cache Router](#using-the-kv-cache-router).
+
+### Kubernetes Deployment
+
+To enable the KV Router in Kubernetes, add the `DYN_ROUTER_MODE` environment variable to your frontend service:
+
+```yaml
+apiVersion: nvidia.com/v1alpha1
+kind: DynamoGraphDeployment
+metadata:
+  name: my-deployment
+spec:
+  services:
+    Frontend:
+      dynamoNamespace: my-namespace
+      componentType: frontend
+      replicas: 1
+      envs:
+        - name: DYN_ROUTER_MODE
+          value: kv  # Enable KV Smart Router
+```
+
+**Key Points:**
+- Set `DYN_ROUTER_MODE=kv` on the **Frontend** service only
+- Workers automatically report KV cache events to the router
+- No worker-side configuration changes needed
+
+#### Environment Variables
+
+All CLI arguments can be configured via environment variables using the `DYN_` prefix:
+
+| CLI Argument | Environment Variable | Default |
+|--------------|---------------------|---------|
+| `--router-mode kv` | `DYN_ROUTER_MODE=kv` | `round-robin` |
+| `--router-temperature` | `DYN_ROUTER_TEMPERATURE` | `0.0` |
+| `--kv-cache-block-size` | `DYN_KV_CACHE_BLOCK_SIZE` | Backend-specific |
+| `--no-kv-events` | `DYN_KV_EVENTS=false` | `true` |
+| `--kv-overlap-score-weight` | `DYN_KV_OVERLAP_SCORE_WEIGHT` | `1.0` |
+
+For complete K8s examples and advanced configuration, see [K8s Examples](router-examples.md#k8s-examples).
+For A/B testing and advanced K8s setup, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).
+
+## KV Cache Routing

-To enable KV cache aware routing start the frontend node like this:
+KV cache routing optimizes large language model inference by intelligently directing requests to workers with the most relevant cached data. By maximizing cache reuse, it reduces redundant computation and improves both throughput and latency.
+
+```mermaid
+graph TD
+    T[Tokens] --> R[KV Aware Router]
+
+    R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
+    R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
+    R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]
+
+    style T fill:#fff3e0,stroke:#333,color:#333
+    style R fill:#2e8b57,stroke:#333,color:#fff
+    style W1 fill:#f3e5f5,stroke:#333,color:#333
+    style W2 fill:#c8e6c9,stroke:#333,color:#333
+    style W3 fill:#f3e5f5,stroke:#333,color:#333
+
+    linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
 ```
+
+KV Cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:
+- Missed cache reuse opportunities due to suboptimal worker selection
+- System throughput degradation from uneven request distribution across workers
+
+The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make routing decisions.
+
+### Cost Calculation
+
+1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
+
+2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
+
+3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks`
+   - Lower costs indicate better routing choices
+   - `overlap_score_weight` balances cache hit optimization against load distribution
+   - Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL)
+
+### Worker Selection
+
+The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.
+
+Example calculation with `overlap_score_weight = 1.0`:
+- Worker 1: cost = 1.0 * 8 + 10 = 18
+- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost)
+- Worker 3: cost = 1.0 * 2 + 9 = 11
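The cost formula and selection step above can be sketched in plain Python. This is an illustration rather than Dynamo's implementation: `route_cost` and `pick_worker` are hypothetical names, and the min-max normalization applied before the softmax is an assumption, since the text only states that the cost logits are normalized.

```python
import math
import random
from typing import Optional


def route_cost(prefill_blocks: float, decode_blocks: float,
               overlap_score_weight: float = 1.0) -> float:
    """cost = overlap_score_weight * prefill_blocks + decode_blocks (lower is better)."""
    return overlap_score_weight * prefill_blocks + decode_blocks


def pick_worker(costs: dict, temperature: float = 0.0,
                rng: Optional[random.Random] = None) -> str:
    """Pick the lowest-cost worker, or sample via softmax when temperature > 0."""
    if temperature <= 0.0:
        return min(costs, key=costs.get)
    rng = rng or random.Random()
    low, high = min(costs.values()), max(costs.values())
    span = (high - low) or 1.0  # avoid dividing by zero when all costs are equal
    # Assumed normalization: scale costs to [0, 1]; negate so low cost gives a high logit.
    logits = {w: -((c - low) / span) / temperature for w, c in costs.items()}
    peak = max(logits.values())
    weights = {w: math.exp(v - peak) for w, v in logits.items()}
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


# (prefill_blocks, decode_blocks) per worker, taken from the diagram above.
workers = {"worker-1": (8, 10), "worker-2": (5, 5), "worker-3": (2, 9)}
costs = {name: route_cost(p, d) for name, (p, d) in workers.items()}
print(costs)
print(pick_worker(costs))
```

With `overlap_score_weight = 2.0` the same inputs give costs of 26, 15, and 13, so the choice shifts to Worker 3, the worker with the most cached blocks. That is the TTFT-versus-ITL trade-off the weight controls.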
+
+### Using the KV Cache Router
+
+To enable KV cache-aware routing, start the frontend node like this:
+```bash
 python -m dynamo.frontend --router-mode kv
 ```

@@ -28,11 +151,13 @@ The main KV-aware routing arguments:

 - `--router-reset-states`: When specified, resets the router state on startup by clearing both the JetStream event stream and NATS object store, starting with a fresh state. By default (when this flag is not provided), the router persists state across restarts, downloading any available snapshot from NATS object store and continuing to consume events from where it left off. This enables routers to maintain KV cache awareness across restarts. **Warning**: Using `--router-reset-states` can bring existing router replicas into an inconsistent state. Only use this flag when launching the first router replica in a component, or consider using a different namespace/component for a clean slate.

- `--router-snapshot-threshold`: Sets the number of messages in the JetStream before triggering a snapshot. When the message count exceeds this threshold, a router will attempt to purge acknowledged messages from the stream and create a snapshot of the current radix tree state in NATs object store. Defaults to 1000000. This helps manage stream size and provides faster initialization for routers that restart.
+- `--router-snapshot-threshold`: Sets the number of messages in the JetStream before triggering a snapshot. When the message count exceeds this threshold, a router will attempt to purge acknowledged messages from the stream and create a snapshot of the current radix tree state in NATS object store. Defaults to 1000000. This helps manage stream size and provides faster initialization for routers that restart.

 - `--no-track-active-blocks`: Disables tracking of active blocks (blocks being used for ongoing generation/decode phases). By default, the router tracks active blocks for load balancing. Disable this when routing to workers that only perform prefill (no decode phase), as tracking decode load is not relevant. This reduces router overhead and simplifies state management.

- `--active-decode-blocks-threshold`: Initial threshold (0.0-1.0) for determining when a worker is considered busy based on KV cache block utilization. When a worker's KV cache active blocks exceed this percentage of total blocks, it will be marked as busy and excluded from routing. If not set, blocks-based busy detection is disabled. This feature works with all routing modes (`--router-mode kv|round-robin|random`) as long as backend engines emit `ForwardPassMetrics`. The threshold can be dynamically updated at runtime via the `/busy_threshold` HTTP endpoint (see [Dynamic Threshold Configuration](#dynamic-threshold-configuration)).
+- `--no-assume-kv-reuse`: When tracking active blocks, disables the assumption of KV cache reuse. By default (`router_assume_kv_reuse=true`), the router computes actual block hashes for sequence tracking to deduplicate blocks and optimize load balancing. When disabled via this flag, the router generates random hashes for sequence blocks, treating each request's blocks as unique. This is useful in disaggregated setups where prefill transfers blocks to decode workers that may already have those blocks cached, but the engine cannot coordinate transfers to avoid duplication. Without this flag, the router's load balancing heuristics would undercount decode blocks when duplicates exist.
+
+- `--active-decode-blocks-threshold`: Initial threshold (0.0-1.0) for determining when a worker is considered busy based on KV cache block utilization. When a worker's KV cache active blocks exceed this percentage of total blocks, it will be marked as busy and excluded from routing. If not set, blocks-based busy detection is disabled. This feature works with all routing modes (`--router-mode kv|round-robin|random`) as long as backend engines publish load metrics. The threshold can be dynamically updated at runtime via the `/busy_threshold` HTTP endpoint (see [Dynamic Threshold Configuration](#dynamic-threshold-configuration)).

 - `--active-prefill-tokens-threshold`: Literal token count threshold for determining when a worker is considered busy based on prefill token utilization. When active prefill tokens exceed this threshold, the worker is marked as busy. If not set, tokens-based busy detection is disabled.

@@ -50,32 +175,81 @@ The main KV-aware routing arguments:
 >
 > **Request plane is independent of KV event transport.**
 > `DYN_REQUEST_PLANE` controls how **requests** are sent (TCP/HTTP/NATS), but KV-aware routing still uses **NATS** for KV events in both JetStream and NATS Core + Local Indexer modes.
-> If you run with `DYN_REQUEST_PLANE=tcp` (or `http`) and KV events enabled (default), you must also configure NATS, e.g. `NATS_SERVER=nats://...`.
-> Only `--no-kv-events` removes the NATS requirement.
+> When KV events are enabled (default), NATS is automatically initialized. You can optionally set `NATS_SERVER=nats://...` to specify a custom NATS server; otherwise, it defaults to `localhost:4222`.
+> Use `--no-kv-events` to disable KV events and remove the NATS requirement entirely (with request plane being `tcp` or `http`).
+>
+> When `--kv-overlap-score-weight` is set to 0, no `KvIndexer` is created and prefix matching is disabled (pure load balancing). When `--no-kv-events` is set, a `KvIndexer` is still created but no event subscriber is launched to consume KV events from workers. Instead, the router predicts cache state based on its own routing decisions with TTL-based expiration and pruning.
 >
-> When `--kv-overlap-score-weight` is set to 0, no KvIndexer is created and prefix matching is disabled (pure load balancing). When `--no-kv-events` is set, a KvIndexer is still created but no event subscriber is launched to consume KV events from workers. Instead, the router predicts cache state based on its own routing decisions with TTL-based expiration and pruning. In both cases, it's recommended to disable your backend workers from publishing events through `KvEventPublisher` to avoid event accumulation in JetStream. WIP to enable disabling publishing of KV events completely in these cases.
+> **Backend Configuration:** When using `--no-kv-events`, configure your backend workers to disable KV event publishing:
+> - **vLLM**: Use `--kv-events-config '{"enable_kv_cache_events": false}'`
+> - **SGLang**: Do not use `--kv-events-config`
+> - **TRT-LLM**: Do not use `--publish-events-and-metrics`
 >
 > The CLI arguments `--router-ttl`, `--router-max-tree-size`, and `--router-prune-target-ratio` control local cache management when the router operates without receiving events from workers. When KV events are enabled (default), the router relies on worker-side eviction events and these parameters are ignored.
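
As an illustration of the local cache management described above, here is a minimal sketch of TTL-based expiration with size-triggered pruning. The class name `PredictedBlockCache`, its methods, and the exact pruning rule (evict the oldest entries until the size drops to `max_size * prune_target_ratio`) are assumptions made for illustration; they are not the router's actual implementation, and the real router tracks blocks per worker in a radix tree.

```python
import time
from collections import OrderedDict


class PredictedBlockCache:
    """Hypothetical stand-in for the router's predicted cache state."""

    def __init__(self, ttl_secs, max_size, prune_target_ratio, clock=time.monotonic):
        self.ttl_secs = ttl_secs                      # assumed meaning of --router-ttl
        self.max_size = max_size                      # assumed meaning of --router-max-tree-size
        self.prune_target_ratio = prune_target_ratio  # assumed meaning of --router-prune-target-ratio
        self.clock = clock
        self._expiry = OrderedDict()                  # block_hash -> expiry time, oldest first

    def record(self, block_hash):
        """Record a block the router predicts is cached after a routing decision."""
        self._expiry.pop(block_hash, None)            # re-recording refreshes the TTL
        self._expiry[block_hash] = self.clock() + self.ttl_secs
        if len(self._expiry) > self.max_size:
            target = int(self.max_size * self.prune_target_ratio)
            while len(self._expiry) > target:
                self._expiry.popitem(last=False)      # evict the oldest prediction

    def expire(self):
        """Drop every prediction whose TTL has elapsed."""
        now = self.clock()
        while self._expiry:
            _, expires_at = next(iter(self._expiry.items()))
            if expires_at > now:
                break
            self._expiry.popitem(last=False)

    def __contains__(self, block_hash):
        self.expire()
        return block_hash in self._expiry

    def __len__(self):
        self.expire()
        return len(self._expiry)
```

Because every entry gets the same TTL and a refreshed entry moves to the end, the oldest entry always expires first, so both expiration and pruning only need to look at the front of the ordered map.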

-## Prerequisites and Limitations
+To implement KV event publishing for custom inference engines, enabling them to participate in Dynamo's KV cache-aware routing, see [KV Event Publishing for Custom Engines](../../integrations/kv-events-custom-engines.md).

->[!Note]
-> **KV Router Requirements**: The KV router currently works only with **dynamic endpoints** that are registered via [`register_llm()`](../development/backend-guide.md) with `model_input=ModelInput.Tokens`. Your backend handler receives pre-tokenized requests with `token_ids` instead of raw text.
+## Basic Routing
+
+Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.
+
+First, create a client tied to a component's endpoint using the labels defined above. Here we get a client tied to the `generate` endpoint of the `VllmWorker` component.
+
+```python
+client = namespace('dynamo').component('VllmWorker').endpoint('generate').client()
+```
+
+We can then use the default routing methods exposed by the client class to send requests to the `VllmWorker` component.

-**Current Limitations (WIP):**
- **Static endpoints**: Not yet supported. The KV router requires dynamic model discovery via etcd to track worker instances and their KV cache states.
- **Multimodal models**: Not yet supported. The KV router currently tracks token-based blocks only.
+- **Random routing**: Default strategy, available via `client.generate()` or `client.random()`
+- **Round-robin routing**: Cycles through available workers via `client.round_robin()`
+- **Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`
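
The three strategies listed above can be sketched with a toy worker selector. `ToyClient` is a hypothetical stand-in written for this illustration, not the Dynamo client class: its methods only return the chosen worker ID instead of sending a request, and the worker IDs are made up.

```python
import itertools
import random


class ToyClient:
    """Hypothetical selector that mimics random, round-robin, and direct routing."""

    def __init__(self, worker_ids, seed=None):
        self.worker_ids = list(worker_ids)
        self._cycle = itertools.cycle(self.worker_ids)
        self._rng = random.Random(seed)

    def random(self):
        # Random routing: any available worker, chosen uniformly.
        return self._rng.choice(self.worker_ids)

    def round_robin(self):
        # Round-robin routing: cycle through workers in a fixed order.
        return next(self._cycle)

    def direct(self, worker_id):
        # Direct routing: the caller names the target worker explicitly.
        if worker_id not in self.worker_ids:
            raise ValueError(f"unknown worker: {worker_id}")
        return worker_id


client = ToyClient(["worker_a", "worker_b", "worker_c"], seed=0)
print([client.round_robin() for _ in range(4)])
print(client.direct("worker_b"))
```

In this framing, a KV-aware router is a direct router whose `worker_id` argument comes from a cost-based selection step rather than from the caller.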

-**What this means for your setup:**
-1. Backend workers must call `register_llm()` with `model_input=ModelInput.Tokens` (see [Backend Guide](../development/backend-guide.md) or [example implementations](https://github.com/ai-dynamo/dynamo/tree/main/lib/bindings/python/examples/hello_world))
-2. Your handler receives requests with pre-tokenized `token_ids`, not raw text or multimodal inputs
-3. You cannot use `--static-endpoint` mode with KV routing (use dynamic discovery instead)
+KV Cache routing uses direct routing with a special worker selection algorithm.

-For basic model registration without KV routing, you can use `--router-mode round-robin` or `--router-mode random` with both static and dynamic endpoints.
+For benchmarking KV router performance, see the [KV Router A/B Benchmarking Guide](../../benchmarks/kv-router-ab-testing.md).

-## Disaggregated Serving (Prefill and Decode)
+For custom routing logic and advanced patterns, see [Routing Patterns](router-examples.md#routing-patterns) in the examples documentation.

-Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../development/backend-guide.md)), the frontend automatically detects them and activates an internal prefill router.
+## Tuning Guidelines
+
+### 1. Understand Your Workload Characteristics
+
+- **Prefill-heavy workloads** (long prompts, short generations): Increase `kv-overlap-score-weight`
+- **Decode-heavy workloads** (short prompts, long generations): Decrease `kv-overlap-score-weight`
+
+### 2. Monitor Key Metrics
+
+The router logs the cost calculation for each worker:
+```text
+Formula for worker_1: 125.5 = 1.0 * 100.5 + 25.0 (cached_blocks: 15)
+```
+
+This shows:
+- Total cost (125.5)
+- Overlap weight × prefill blocks (1.0 × 100.5)
+- Active blocks (25.0)
+- Cached blocks that contribute to overlap (15)
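
The cost terms in that log line can be reproduced with a few lines of Python. This is a sketch of the formula `cost = overlap_score_weight * prefill_blocks + active_blocks` only; the function names and the three-worker load figures are hypothetical values chosen for illustration.

```python
def routing_cost(overlap_score_weight, prefill_blocks, active_blocks):
    """Cost of sending a request to one worker; lower is better."""
    return overlap_score_weight * prefill_blocks + active_blocks


def cheapest_worker(loads, overlap_score_weight=1.0):
    """Return (best_worker, costs) for loads of the form {worker: (prefill, active)}."""
    costs = {
        worker: routing_cost(overlap_score_weight, prefill, active)
        for worker, (prefill, active) in loads.items()
    }
    return min(costs, key=costs.get), costs


# Hypothetical loads: (blocks still needing prefill, active decode blocks).
loads = {"worker_1": (8, 10), "worker_2": (5, 5), "worker_3": (2, 9)}

best, costs = cheapest_worker(loads, overlap_score_weight=1.0)
print(best, costs)  # worker_2 has the lowest cost (10.0)

best, costs = cheapest_worker(loads, overlap_score_weight=2.0)
print(best, costs)  # worker_3 has the lowest cost (13.0)
```

Doubling the weight moves the choice from `worker_2` to `worker_3`, the worker with the fewest blocks left to prefill, which is the cache-reuse bias that a higher `kv-overlap-score-weight` is meant to produce.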
+
+### 3. Temperature-Based Routing
+
+The `router_temperature` parameter controls routing randomness:
+- **0.0 (default)**: Deterministic selection of the best worker
+- **> 0.0**: Probabilistic selection, higher values increase randomness
+- Useful for preventing worker saturation and improving load distribution
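
To make the effect of the temperature concrete, here is a rough sketch of softmax sampling over worker costs. The min-max normalization, the function names, and the sample costs are assumptions for illustration; the router's actual normalization may differ.

```python
import math
import random


def selection_probabilities(costs, temperature):
    """Map {worker: cost} to {worker: probability}; lower cost means higher probability."""
    if temperature <= 0.0:
        # Temperature 0.0: deterministic choice of the cheapest worker.
        best = min(costs, key=costs.get)
        return {w: (1.0 if w == best else 0.0) for w in costs}
    lo, hi = min(costs.values()), max(costs.values())
    span = (hi - lo) or 1.0  # avoid dividing by zero when all costs are equal
    # Assumed normalization: scale costs to [0, 1], negate, divide by temperature.
    logits = {w: -((c - lo) / span) / temperature for w, c in costs.items()}
    peak = max(logits.values())
    exps = {w: math.exp(l - peak) for w, l in logits.items()}  # numerically stable softmax
    total = sum(exps.values())
    return {w: e / total for w, e in exps.items()}


def pick_worker(costs, temperature, rng=random):
    """Sample one worker according to the softmax probabilities."""
    probs = selection_probabilities(costs, temperature)
    workers = list(probs)
    return rng.choices(workers, weights=[probs[w] for w in workers], k=1)[0]


costs = {"worker_1": 18.0, "worker_2": 10.0, "worker_3": 11.0}
for temperature in (0.0, 0.5, 2.0):
    print(temperature, selection_probabilities(costs, temperature))
```

As the temperature grows, the probabilities flatten toward uniform, so near-tied workers such as `worker_2` and `worker_3` share traffic instead of the cheapest one absorbing every request.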
+
+### 4. Iterative Optimization
+
+1. Begin with default settings
+2. Monitor TTFT and ITL metrics
+3. Adjust `kv-overlap-score-weight` to meet your performance goals:
+   - To reduce TTFT: Increase the weight
+   - To reduce ITL: Decrease the weight
+4. If you observe severe load imbalance, increase the temperature setting
+
+## Disaggregated Serving
+
+Dynamo supports disaggregated serving where prefill (prompt processing) and decode (token generation) are handled by separate worker pools. When you register workers with `ModelType.Prefill` (see [Backend Guide](../../development/backend-guide.md)), the frontend automatically detects them and activates an internal prefill router.

 ### Automatic Prefill Router Activation

@@ -120,7 +294,7 @@ await register_llm(
 await prefill_endpoint.serve_endpoint(prefill_handler.generate)
 ```

-> [!NOTE]
+> [!Note]
 > The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. See example script: [`examples/backends/sglang/launch/disagg_router.sh`](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/launch/disagg_router.sh).

 ### Request Flow
@@ -152,190 +326,13 @@ graph TD
    linkStyle 5 stroke:#2196f3,stroke-width:2px
 ```

-## Overview
-
-The KV-aware router operates on two key principles to optimize request routing:
-
-### Global KV Cache State Synchronization
-
-KV events from engines are collected by the router to maintain a global view of cached blocks across all workers. The router supports two event transport modes:
-
-#### Mode 1: JetStream (Default)
-
-KV events are sent to a persistent NATS JetStream. Each KV router/indexer replica acts as a durable consumer, pulling messages from this shared stream. This architecture ensures consistency across router replicas and persistence across restarts.
-
- **Best for**: Production deployments requiring durability and multi-replica router consistency
- **Tradeoffs**: Requires JetStream setup; slightly higher latency due to persistence guarantees
-
-```mermaid
-graph TD
-    subgraph Engines
-        E1[Engine 1<br/>KVPublisher]
-        E2[Engine 2<br/>KVPublisher]
-        E3[Engine 3<br/>KVPublisher]
-    end
-
-    subgraph "NATS JetStream"
-        JS[(Persistent KV Events Stream<br/>- Block created<br/>- Block removed)]
-    end
-
-    subgraph "NATS Object Store"
-        OS[(Radix Tree<br/>State Snapshot)]
-    end
-
-    subgraph "Router Replicas"
-        R1[Router 1<br/>KVIndexer]
-        R2[Router 2<br/>KVIndexer]
-    end
-
-    E1 -->|Publish Events| JS
-    E2 -->|Publish Events| JS
-    E3 -->|Publish Events| JS
-
-    JS -->|Consume as Durable Consumer| R1
-    JS -->|Consume as Durable Consumer| R2
-    JS -->|Periodic Snapshot| OS
-
-    style JS fill:#e1f5fe,stroke:#333,color:#333
-    style OS fill:#e1f5fe,stroke:#333,color:#333
-    style E1 fill:#f3e5f5,stroke:#333,color:#333
-    style E2 fill:#f3e5f5,stroke:#333,color:#333
-    style E3 fill:#f3e5f5,stroke:#333,color:#333
-    style R1 fill:#2e8b57,stroke:#333,color:#fff
-    style R2 fill:#2e8b57,stroke:#333,color:#fff
-
-    linkStyle 0,1,2,3,4,5 stroke:#2196f3,stroke-width:2px
-```
-
-#### Mode 2: NATS Core with Local Indexer
-
-When workers are started with `--enable-local-indexer`, each worker maintains its own local radix tree (local indexer) and publishes events over NATS Core (fire-and-forget pub/sub) instead of JetStream. Each worker assigns monotonically increasing event IDs to its events. The router detects gaps in event sequences and recovers missed events by querying the worker's local indexer directly.
-
- **Best for**: Lower-latency setups; simpler deployments without JetStream; single-router scenarios
- **Tradeoffs**: State persists on workers (not centralized); recovery depends on workers being available
- **Enable with**: `--enable-local-indexer` flag on workers (vLLM, mocker)
-
-```mermaid
-graph TD
-    subgraph Engines
-        E1[Engine 1<br/>LocalKvIndexer]
-        E2[Engine 2<br/>LocalKvIndexer]
-        E3[Engine 3<br/>LocalKvIndexer]
-    end
-
-    subgraph "NATS Core"
-        NC[KV Events Pub/Sub<br/>- Block created<br/>- Block removed]
-    end
-
-    subgraph "Router Replicas"
-        R1[Router 1<br/>KVIndexer]
-        R2[Router 2<br/>KVIndexer]
-    end
-
-    E1 -->|Publish Events| NC
-    E2 -->|Publish Events| NC
-    E3 -->|Publish Events| NC
-
-    NC -->|Subscribe| R1
-    NC -->|Subscribe| R2
-
-    style NC fill:#e1f5fe,stroke:#333,color:#333
-    style E1 fill:#f3e5f5,stroke:#333,color:#333
-    style E2 fill:#f3e5f5,stroke:#333,color:#333
-    style E3 fill:#f3e5f5,stroke:#333,color:#333
-    style R1 fill:#2e8b57,stroke:#333,color:#fff
-    style R2 fill:#2e8b57,stroke:#333,color:#fff
-
-    linkStyle 0,1,2,3,4 stroke:#2196f3,stroke-width:2px
-```
-
-**How gap detection works:**
-1. Each worker assigns monotonically increasing event IDs starting from 0
-2. The router tracks the last received event ID per worker
-3. If an event arrives with `event_id > last_id + 1`, the router detects a gap
-4. The router queries the worker's local indexer for the missing event range `[last_id+1, event_id-1]`
-5. On worker discovery (Added event), the router dumps the worker's entire local indexer state
-
-**Startup behavior:**
- When a worker is discovered, the router queries and ingests its full local indexer state
- When a worker is removed, the router removes all its blocks from the global radix tree
-
->[!Note]
-> The router automatically selects the transport mode based on worker configuration. If all connected workers have `enable_local_indexer=true`, the router uses NATS Core mode. Otherwise, it uses JetStream mode.
-
-### Local Active Block Management with Replica Sync
-
-Second, in addition to cached blocks, each router replica needs to track active blocks (blocks being used for ongoing generation) as load metrics. Since this information is highly time-sensitive, it should be predicted immediately when:
- The router receives and routes a request
- The first token is generated (prefill complete)
- The response ends (request freed)
-
-This is managed locally in each router via a "slot manager". To maintain consistency across the system, router replicas synchronize these local predictions with each other through NATS core messaging.
-
-```mermaid
-sequenceDiagram
-    participant C1 as Client 1
-    participant R1 as Router 1<br/>(Slot Manager)
-    participant R2 as Router 2<br/>(Slot Manager)
-    participant C2 as Client 2
-
-    Note over R1,R2: Router Replica Sync Enabled
-
-    C1->>R1: Request A
-    activate R1
-    R1->>R1: Predict blocks & route to worker
-    R1-->>R2: Sync: AddRequest(A)
-
-    C2->>R2: Request B
-    activate R2
-    R2->>R2: Predict blocks & route to worker
-    R2-->>R1: Sync: AddRequest(B)
-
-    R1->>R1: First token received<br/>(prefill complete)
-    R1-->>R2: Sync: MarkPrefillCompleted(A)
-    R1->>C1: Stream response
-
-    R2->>R2: First token received<br/>(prefill complete)
-    R2-->>R1: Sync: MarkPrefillCompleted(B)
-    R2->>C2: Stream response
-
-    R1->>R1: Response complete<br/>(free blocks)
-    R1-->>R2: Sync: Free(A)
-    deactivate R1
-
-    R2->>R2: Response complete<br/>(free blocks)
-    R2-->>R1: Sync: Free(B)
-    deactivate R2
-
-    Note over R1,R2: Both routers have consistent<br/>view of active blocks
-```
-
-This dual-layer approach—persistent global KV cache state via JetStream and ephemeral active block synchronization via router replicas—enables the system to make optimal routing decisions that balance cache reuse with load distribution.
-
-## Basic Routing
-Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.
-
-First, we must create a client tied to a components endpoint, we can do this using the labels defined above. Here we are getting a client tied to the `generate` endpoint of the `VllmWorker` component.
-
-```python
-client = namespace('dynamo').component('VllmWorker').endpoint('generate').client()
-```
-
-We can then use the default routing methods exposed by the client class to send requests to the `VllmWorker` component.
-
- **Random routing**: Default strategy, available via `client.generate()` or `client.random()`
- **Round-robin routing**: Cycles through available workers via `client.round_robin()`
- **Direct routing**: Explicitly targets a specific worker via `client.direct(input, component_id)`
-
-KV Cache routing uses direct routing with a special worker selection algorithm.
-
 ## Serving Multiple Router Replicas

 For improved fault tolerance, you can launch multiple frontend + router replicas. Since the frontend and router are currently tied together, you'll need to use different HTTP ports for each instance. (The separation of the frontend and Router is WIP.)

 ### Router State Management

-The KV Router tracks two types of state (see [KV Router Architecture](README.md) for details):
+The KV Router tracks two types of state (see [Router Design](../../design-docs/router-design.md) for details):

 1. **Prefix blocks (cached KV blocks)**: Maintained in a radix tree, tracking which blocks are cached on each worker. This state is **persistent** - backed by NATS JetStream events and object store snapshots. New router replicas automatically sync this state on startup, ensuring consistent cache awareness across restarts.

@@ -345,16 +342,16 @@ The KV Router tracks two types of state (see [KV Router Architecture](README.md)

 ```bash
 # Router replica 1
-python -m dynamo.frontend --router-mode kv --port 8000 --router-replica-sync
+python -m dynamo.frontend --router-mode kv --http-port 8000 --router-replica-sync

 # Router replica 2 (can be started later)
-python -m dynamo.frontend --router-mode kv --port 8001 --router-replica-sync
+python -m dynamo.frontend --router-mode kv --http-port 8001 --router-replica-sync
 ```

 The `--router-replica-sync` flag enables active block synchronization between replicas:
 - Active blocks are shared via NATS core messaging (fire-and-forget)
 - Replicas exchange routing decisions to maintain consistent load estimates
- A new replica start with zero active blocks but quickly converge through request handling, by itself and active syncing with other replicas
+- A new replica starts with zero active blocks but quickly converges, both by handling its own requests and by syncing with other replicas

 Without this flag, each replica maintains its own isolated view of active blocks, potentially leading to suboptimal routing.

@@ -369,7 +366,7 @@ Persistence behavior depends on which event transport mode is active:
 - You can launch a third Router replica even if the first two are down, and it will recover the full prefix state

 ```bash
-python -m dynamo.frontend --router-mode kv --port 8002 --router-replica-sync
+python -m dynamo.frontend --router-mode kv --http-port 8002 --router-replica-sync
 ```

 **NATS Core with Local Indexer Mode:**
@@ -380,325 +377,13 @@ python -m dynamo.frontend --router-mode kv --port 8002 --router-replica-sync

 >[!Note]
 > If you need to start with a fresh state in JetStream mode, you have two options:
-> 1. **Recommended**: Use a different namespace/component (see [Distributed Runtime](../design-docs/distributed-runtime.md)) which will start a new stream and NATS object store path
+> 1. **Recommended**: Use a different namespace/component (see [Distributed Runtime](../../design-docs/distributed-runtime.md)) which will start a new stream and NATS object store path
 > 2. **Use with caution**: Launch a router with the `--router-reset-states` flag, which will purge the entire stream and radix snapshot. This should only be done when launching the first router replica in a component, as it can bring existing router replicas into an inconsistent state.

-## Understanding KV Cache
-The leading Large Language Models (LLMs) today are auto-regressive and based off of the [transformer architecture](https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf). One key inference optimization technique is to cache the already computed keys and values and to reuse them for the future tokens. This is called the [KV Cache](https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/#key-value_caching).
-
-### KV Cache Optimizations
-Every inference framework will have a KV Cache for each worker. A popular inference framework library is [vLLM](https://github.com/vllm-project/vllm) where a key contribution was [PagedAttention](https://arxiv.org/abs/2309.06180), which allowed them to manage KV Cache in an efficient way by chunking requests into blocks.
-
-Another popular inference framework, [SGLang](https://github.com/sgl-project/sglang), contributed [RadixAttention](https://arxiv.org/abs/2312.07104) which introduced a
-prefix tree which allows for efficient matching, inserting and eviction of KV Cache blocks. The prefix tree structure popularized KV Cache reuse.
-
-In Dynamo, we introduce a KVPublisher which emits KV Cache events that occur at each worker and a KVIndexer which keeps track of these events globally.
-
-To get a feel for how KV Cache management works on a single worker with KV Cache reuse turned on and where the KVPublisher gets plugged in, we can walk through the KV Block management flow:
-1. Request tokenization: The incoming prompt is converted into tokens
-2. Block partitioning: The token sequence is divided into fixed-size blocks (e.g., 16 or 64 tokens per block)
-3. Block hashing: Each block of tokens is hashed to create a unique identifier
-4. Cache lookup:
-    - For each block, the system checks if a matching block already exists in the KV cache
-    - If a match is found, the existing KV cache block is reused
-    - If no match is found, the system proceeds to the next step
-5. Resource allocation:
-    - For blocks without matches, the system attempts to allocate new memory space
-    - If sufficient memory is available, allocate memory space and proceed to step 7
-    - If memory is constrained, proceed to step 6
-6. Cache eviction (when necessary):
-    - The system applies an eviction policy (e.g., LRU, LFU) to identify blocks for removal
-    - Selected blocks are evicted from the cache
-    - **KVPublisher emits a KV removed event notifying KVIndexer about the removed block.**
-    - Alternatively, some systems may offload less-frequently used blocks to CPU memory.
-7. KV computation:
-    - For new blocks, the model computes key and value tensors
-    - These tensors are stored in the newly allocated cache blocks
-    - **KVPublisher emits a kv stored event notifying KVIndexer about newly stored blocks**.
-
-Further details can be found for: [TRT-LLM](https://developer.nvidia.com/blog/introducing-new-kv-cache-reuse-optimizations-in-nvidia-tensorrt-llm/), [vLLM](https://docs.vllm.ai/en/latest/design/automatic_prefix_caching.html#design-automatic-prefix-caching) and [SGLang](https://lmsys.org/blog/2024-01-17-sglang/).
-
-## KV Cache Routing and Load Balancing
-```mermaid
-graph TD
-    T[Tokens] --> R[KV Aware Router]
-
-    R -.-> W1["Worker 1<br/>Cached: 2 blocks<br/>Prefill: 8 blks<br/>Decode: 10 blks"]
-    R ==>|Selected| W2["Worker 2<br/>Cached: 5 blocks<br/>Prefill: 5 blks<br/>Decode: 5 blks"]
-    R -.-> W3["Worker 3<br/>Cached: 8 blocks<br/>Prefill: 2 blks<br/>Decode: 9 blks"]
-
-    style T fill:#fff3e0,stroke:#333,color:#333
-    style R fill:#2e8b57,stroke:#333,color:#fff
-    style W1 fill:#f3e5f5,stroke:#333,color:#333
-    style W2 fill:#c8e6c9,stroke:#333,color:#333
-    style W3 fill:#f3e5f5,stroke:#333,color:#333
-
-    linkStyle 0,1,2,3 stroke:#8b4513,stroke-width:2px
-```
-
-KV Cache reuse introduces complexity to LLM serving load balancing. While it can significantly reduce computation costs, routing strategies that ignore worker-specific KV states can lead to:
- Missed cache reuse opportunities due to suboptimal worker selection
- System throughput degradation from uneven request distribution across workers
-
-The router uses a cost function that considers both the prefill cost (influenced by cached blocks) and the decode load to make optimal routing decisions:
-
-### Cost Calculation
-
-1. **Prefill blocks**: Calculated by dividing the number of tokens requiring prefill processing by the block size. The system predicts this based on input tokens and available cached blocks per worker, updating the count when the first output token signals prefill completion.
-
-2. **Decode blocks**: Estimated from the request's input tokens and each worker's active sequences. The count updates when requests complete and their blocks are freed.
-
-3. **Cost formula**: `cost = overlap_score_weight * prefill_blocks + decode_blocks`
-   - Lower costs indicate better routing choices
-   - `overlap_score_weight` balances cache hit optimization against load distribution
-   - Higher weights favor cache reuse (improving TTFT), while lower weights prioritize even load distribution (improving ITL)
-
-### Worker Selection
-
-The router selects the worker with the lowest cost. When `router_temperature` is set to a non-zero value, the router uses softmax sampling on the normalized cost logits to introduce randomness in the selection, which can help with load distribution.
-
-Example calculation with `overlap_score_weight = 1.0`:
- Worker 1: cost = 1.0 * 8 + 10 = 18
- **Worker 2: cost = 1.0 * 5 + 5 = 10** (selected - lowest cost)
- Worker 3: cost = 1.0 * 2 + 9 = 11
-
-## Events
-
-### KVPublisher
-The KVPublisher can be initialized and then called in the inference framework where blocks are allocated and removed.
-
-The two types of events are:
- KV stored event
- KV removed event
-
-The publisher can be initialized and used through C bindings or Python bindings.
-
-### Deterministic Event IDs
-
-Engines do not need to emit deterministic block identifiers in KV events, as the router uses local block hashes (computed from token content) for tracking and matching blocks across workers. However, it is strongly preferred that engines do emit deterministic block identifiers, as this keeps the KvIndexer's internal lookup table smaller and more efficient. To ensure deterministic behavior, all workers should use identical engine versions/configuration. If your engine relies on Python's builtin `hash()` for any event IDs, set `PYTHONHASHSEED=0`; otherwise this setting has no effect.
-
-### KVIndexer
-The KVIndexer builds and maintains a global view of cached blocks in a prefix tree. We modify the original prefix tree by also storing the worker id on each node. This is so we can return the number of matched blocks for each worker.
-
-The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks.
-
-### Inter-Router Communication
-
-In distributed deployments with multiple routers, each router maintains visibility over only a portion of the total requests. To ensure consistent routing decisions, routers synchronize their states through three event types:
-
-1. **AddRequest**: Notifies other routers when a request is assigned to a worker. Includes request ID, worker ID, token sequence blocks, and overlap score to track block usage across the system.
-
-2. **MarkPrefillCompleted**: Signals when a request moves from prefill to decode phase, allowing routers to update their worker load calculations by excluding completed prefill tokens.
-
-3. **Free**: Indicates request completion and resource release, enabling accurate block reference counting across all routers.
-
-Each event carries a unique router ID to prevent self-event processing. This asynchronous communication system ensures optimal routing decisions by maintaining consistent KV cache state across all routers, even as they handle different request streams.
-
-## Using KvPushRouter Python API
-
-Instead of launching the KV Router via command line, you can create a `KvPushRouter` object directly in Python. This allows per-request routing configuration overrides.
-
->[!Warning]
-> **Multiple Routers in Same Process**: If you need to run multiple `KvPushRouter` instances for fault tolerance or load distribution, you must launch them in **separate processes** (e.g., using `python -m dynamo.frontend` with different ports). Creating multiple `KvPushRouter` objects in the same Python process is not supported - they share the same cancellation token from the component's primary lease, so dropping one router will cancel all routers in that process. For in-process routing, use a single `KvPushRouter` instance.
-
-### Methods
-
-The `KvPushRouter` provides the following methods:
-
- **`generate(token_ids, model, ...)`**: Route and execute a request, returning an async stream of responses. Automatically handles worker selection, state tracking, and lifecycle management.
-
- **`best_worker(token_ids, router_config_override=None, request_id=None)`**: Query which worker would be selected for given tokens. Returns `(worker_id, dp_rank, overlap_blocks)`.
-  - Without `request_id`: Query-only, doesn't update router state
-  - With `request_id`: Updates router state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking
-
- **`best_worker_id(token_ids, router_config_override=None, request_id=None)`**: **[DEPRECATED - use `best_worker()` instead]** Query which worker would be selected for given tokens. Returns `(worker_id, overlap_blocks)`.
-  - Without `request_id`: Query-only, doesn't update router state
-  - With `request_id`: Updates router state to track the request. **Note**: If used with `request_id`, you must call `mark_prefill_complete()` and `free()` at the appropriate lifecycle points to maintain accurate load tracking
-
- **`get_potential_loads(token_ids)`**: Get detailed load information for all workers, including potential prefill tokens and active decode blocks. Returns a list of load dictionaries.
-
- **`mark_prefill_complete(request_id)`**: Signal that a request has completed its prefill phase. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker_id()` for manual routing instead of `generate()`.
-
- **`free(request_id)`**: Signal that a request has completed and its resources should be released. Only used for [manual lifecycle management](#2-manual-state-management-advanced) when using `best_worker_id()` for manual routing instead of `generate()`.
-
- **`dump_events()`**: Dump all KV cache events from the router's indexer as a JSON string. Useful for debugging and analysis.
-
-### Setup
-
-First, launch your backend engines:
-```bash
-python -m dynamo.vllm --model meta-llama/Llama-2-7b-hf
-```
-
-### Example Script
-
-```python
-import asyncio
-from dynamo._core import DistributedRuntime, KvPushRouter, KvRouterConfig
-
-async def main():
-    # Get runtime and create endpoint
-    runtime = DistributedRuntime.detached()
-    namespace = runtime.namespace("dynamo")
-    component = namespace.component("backend")
-    endpoint = component.endpoint("generate")
-
-    # Create KV router
-    kv_router_config = KvRouterConfig()
-    router = KvPushRouter(
-        endpoint=endpoint,
-        block_size=16,
-        kv_router_config=kv_router_config
-    )
-
-    # Your input tokens
-    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
-
-    # Generate with per-request routing override
-    stream = await router.generate(
-        token_ids=token_ids,
-        model="meta-llama/Llama-2-7b-hf",
-        stop_conditions={
-            "max_tokens": 20,        # Generate exactly 20 tokens
-            "ignore_eos": True,      # Don't stop at EOS token
-        },
-        sampling_options={
-            "temperature": 0.7,
-            "top_p": 0.9,
-        },
-        router_config_override={
-            "overlap_score_weight": 2.0,    # Prioritize cache hits for this request
-            "router_temperature": 0.5,       # Add routing randomness
-        }
-    )
-
-    # Collect generated tokens
-    generated_tokens = []
-    async for response in stream:
-        if isinstance(response, dict) and "token_ids" in response:
-            generated_tokens.extend(response["token_ids"])
-
-    print(f"Generated {len(generated_tokens)} tokens: {generated_tokens}")
-
-if __name__ == "__main__":
-    asyncio.run(main())
-```
-
-### Routing Patterns
-
-The `KvPushRouter` supports multiple usage patterns depending on your control requirements:
-
-#### 1. Automatic Routing (Recommended)
-Call `generate()` directly and let the router handle everything:
-```python
-stream = await router.generate(token_ids=tokens, model="model-name")
-```
- **Best for**: Most use cases
- **Router automatically**: Selects best worker, updates state, routes request, tracks lifecycle
-
-#### 2. Manual State Management (Advanced)
-Use `best_worker_id(request_id=...)` to select and track, then manage the request yourself:
-```python
-worker_id, overlap = await router.best_worker_id(tokens, request_id="req-123")
-response = await client.generate(tokens, request_id="req-123")
-# await anext(response)  # Get first token
-await router.mark_prefill_complete("req-123")  # After first token
-# async for _ in response:  # Continue generating
-#     ...
-await router.free("req-123")  # After completion
-```
- **Best for**: Custom request handling with router state tracking
- **Requires**: Calling `mark_prefill_complete()` and `free()` at correct lifecycle points
- **Caution**: Incorrect lifecycle management degrades load balancing accuracy
-
-#### 3. Hierarchical Router Probing
-Query without state updates, then route through a chosen router:
-```python
-# Probe multiple routers without updating state
-worker_id_1, overlap_1 = await router_1.best_worker_id(tokens)  # No request_id
-worker_id_2, overlap_2 = await router_2.best_worker_id(tokens)
-
-# Pick the best router based on results
-chosen_router = router_1 if overlap_1 > overlap_2 else router_2
-stream = await chosen_router.generate(tokens, model="model-name", worker_id=worker_id)
-```
- **Best for**: Multi-tier deployments (e.g., Envoy Gateway routing to multiple router groups)
- **Advantage**: Query multiple routers before committing to one
-
-#### 4. Custom Load-Based Routing
-Use `get_potential_loads()` to implement custom routing logic:
-```python
-loads = await router.get_potential_loads(tokens)
-# Apply custom logic (e.g., weighted scoring, constraints)
-best_worker = min(loads, key=lambda x: custom_cost_fn(x))
-stream = await router.generate(tokens, model="model-name", worker_id=best_worker['worker_id'])
-```
- **Best for**: Custom optimization strategies beyond the built-in cost function
- **Advantage**: Full control over worker selection logic
- **See also**: Detailed example below in "Custom Routing Example: Minimizing TTFT"
-
-All patterns support `router_config_override` to adjust routing behavior per-request without recreating the router.
-
-### Custom Routing Example: Minimizing TTFT
-
-Here's an example of using `get_potential_loads()` to implement custom routing that minimizes Time To First Token (TTFT) by selecting the worker with the least prefill work:
-
-```python
-import asyncio
-from dynamo._core import DistributedRuntime, KvPushRouter, KvRouterConfig
-
-async def minimize_ttft_routing():
-    # Setup router
-    runtime = DistributedRuntime.detached()
-    namespace = runtime.namespace("dynamo")
-    component = namespace.component("backend")
-    endpoint = component.endpoint("generate")
-
-    router = KvPushRouter(
-        endpoint=endpoint,
-        block_size=16,
-        kv_router_config=KvRouterConfig()
-    )
-
-    # Your input tokens
-    token_ids = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
-
-    # Get potential loads for all workers
-    potential_loads = await router.get_potential_loads(token_ids)
-
-    # Find worker with minimum prefill tokens (best for TTFT)
-    best_worker = min(potential_loads, key=lambda x: x['potential_prefill_tokens'])
-
-    print(f"Worker loads: {potential_loads}")
-    print(f"Selected worker {best_worker['worker_id']} with {best_worker['potential_prefill_tokens']} prefill tokens")
-
-    # Route directly to the selected worker
-    stream = await router.generate(
-        token_ids=token_ids,
-        model="meta-llama/Llama-2-7b-hf",
-        worker_id=best_worker['worker_id'],  # Force routing to optimal worker
-        stop_conditions={"max_tokens": 20}
-    )
-
-    # Process response
-    async for response in stream:
-        if isinstance(response, dict) and "token_ids" in response:
-            print(f"Generated tokens: {response['token_ids']}")
-
-if __name__ == "__main__":
-    asyncio.run(minimize_ttft_routing())
-```
-
-This approach gives you complete control over routing decisions, allowing you to optimize for different metrics based on your specific requirements. As some examples:
-
-- **Minimize TTFT**: Select worker with lowest `potential_prefill_tokens`
-- **Maximize cache reuse**: Use `best_worker_id()` which considers both prefill and decode loads
-- **Balance load**: Consider both `potential_prefill_tokens` and `potential_decode_blocks` together
-
-See [KV Router Architecture](README.md) for performance tuning details.
-
 ## Dynamic Threshold Configuration

+Dynamic threshold configuration lets you tune load balancing in real time by adjusting worker busy thresholds based on observed system performance.
+
 The busy thresholds can be updated at runtime without restarting the frontend. The frontend exposes HTTP endpoints at `/busy_threshold`:

 **Get or set a model's thresholds (POST):**
@@ -728,3 +413,10 @@ curl -X POST http://localhost:8000/busy_threshold \
 curl http://localhost:8000/busy_threshold
 # Response: {"thresholds": [{"model": "meta-llama/Llama-2-7b-hf", "active_decode_blocks_threshold": 0.85, "active_prefill_tokens_threshold": 1000}]}
 ```
+
+## See Also
+
+- **[Router README](README.md)**: Quick start guide for the KV Router
+- **[Router Examples](router-examples.md)**: Python API usage, K8s examples, and custom routing patterns
+- **[Router Design](../../design-docs/router-design.md)**: Architecture details and event transport modes
+- **[KV Event Publishing for Custom Engines](../../integrations/kv-events-custom-engines.md)**: Integrate custom inference engines with KV-aware routing
--- a/fern/pages/design-docs/architecture.md
+++ b/fern/pages/design-docs/architecture.md
@@ -40,14 +40,14 @@ To address the growing demands of distributed inference serving, NVIDIA introduc
 The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:

 - [Dynamo Disaggregated Serving](disagg-serving.md)
-- [Dynamo Smart Router](../router/kv-cache-routing.md)
-- [Dynamo KV Cache Block Manager](../kvbm/kvbm-intro.md)
-- [Planner](../planner/planner-intro.md)
+- [Dynamo Smart Router](../components/router/README.md)
+- [Dynamo KV Cache Block Manager](../components/kvbm/README.md)
+- [Planner](../components/planner/README.md)
 - [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)

 Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.

-![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](../../assets/img/architecture.png "Dynamo Architecture")
+![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](/assets/img/architecture.png "Dynamo Architecture")

 Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if Dynamo detects an increase in requests with long input sequences, the Planner automatically scales up prefill workers to meet the heightened demand.

@@ -61,7 +61,7 @@ Dynamo prioritizes seamless integration. Its modular design enables it to work h

 Disaggregating prefill and decode boosts performance, gaining efficiency when more GPUs are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization.

-![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](../../assets/img/disagg-perf-benefit.png)
+![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](/assets/img/disagg-perf-benefit.png)

 * Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL

@@ -70,7 +70,7 @@ The disaggregation of prefill and decode phases offers valuable flexibility. Sin

 ### KV aware routing

-![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](../../assets/img/kv-routing.png)
+![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](/assets/img/kv-routing.png)

 * Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL

@@ -80,7 +80,7 @@ Existing routing methods, including load-based routing, overlook the specific pr
 ### KV cache manager

 The Dynamo KV Block Manager (KVBM) enables KV cache offloading to system CPU memory, local SSDs, and network-attached storage, allowing more KV blocks to be reused instead of recomputed. In many cases, KV transfer is faster than recomputation, so KVBM helps reduce time-to-first-token (TTFT). The following plot highlights the performance gains achieved through CPU memory offloading. In a scenario involving 20 multi-turn conversations with 15 users, KVBM with CPU memory offloading achieved a 2.2×–12× improvement in TTFT (depending on QPS), demonstrating benefits that extend beyond basic prefix caching.
-![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](../../assets/img/kvbm-agg-performance.png)
+![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](/assets/img/kvbm-agg-performance.png)

 * Tested with different QPS using Qwen3-8B on H100. Avg 20K ISL / 100 OSL.