docs: full migration of docs/ to fern format in fern/ (#6050)

Signed-off-by: Dan Gil <dagil@nvidia.com> Co-authored-by: Cursor <cursoragent@cursor.com>

Unverified Commit 2c3066bd authored Feb 06, 2026 by dagil-nvidia Committed by GitHub Feb 06, 2026
20 changed files
--- a/fern/assets/img/arch-comparison.svg
+++ b/fern/assets/img/arch-comparison.svg
--- a/fern/assets/img/decision-flowchart.svg
+++ b/fern/assets/img/decision-flowchart.svg
--- a/fern/assets/img/e2e-workflow.svg
+++ b/fern/assets/img/e2e-workflow.svg
--- a/fern/assets/img/grafana-disagg-trace.png
+++ b/fern/assets/img/grafana-disagg-trace.png
--- a/fern/assets/img/grafana1.png
+++ b/fern/assets/img/grafana1.png
--- a/fern/assets/img/param-mapping.svg
+++ b/fern/assets/img/param-mapping.svg
--- a/fern/pages/README.md
+++ b/fern/pages/README.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+---
+This directory contains the documentation source files for NVIDIA Dynamo.
+## Prerequisites
+- Python 3.11 or later
+- [uv](https://docs.astral.sh/uv/) package manager
+## Build Instructions
+### Option 1: Dedicated Docs Environment (Recommended)
+This approach builds the docs without requiring the full project dependencies (including `ai-dynamo-runtime`):
+```bash
+# One-time setup: Create docs environment and install dependencies
+uv venv .venv-docs
+uv pip install --python .venv-docs --group docs
+# Generate documentation
+uv run --python .venv-docs --no-project docs/generate_docs.py
+```
+The generated HTML will be available in `docs/build/html/`.
+### Option 2: Using Full Development Environment
+If you already have the full project dependencies installed (i.e., you're actively developing the codebase), you can use `uv run` directly:
+```bash
+uv run --group docs docs/generate_docs.py
+```
+This will use your existing project environment and add the docs dependencies.
+### Option 3: Using Docker
+Build the docs in a Docker container with all dependencies isolated:
+```bash
+docker build -f container/Dockerfile.docs -t dynamo-docs .
+```
+The documentation will be built inside the container. To extract the built docs:
+```bash
+# Run the container and copy the output
+docker run --rm -v $(pwd)/docs/build:/workspace/dynamo/docs/build dynamo-docs
+# Or create a container to copy files from
+docker create --name temp-docs dynamo-docs
+docker cp temp-docs:/workspace/dynamo/docs/build ./docs/build
+docker rm temp-docs
+```
+This approach is ideal for CI/CD pipelines or when you want complete isolation from your local environment.
+## Directory Structure
+- `docs/` - Documentation source files (Markdown and reStructuredText)
+- `docs/conf.py` - Sphinx configuration
+- `docs/_static/` - Static assets (CSS, JS, images)
+- `docs/_extensions/` - Custom Sphinx extensions
+- `docs/build/` - Generated documentation output (not tracked in git)
+## Redirect Creation
+When moving or renaming files a redirect must be created.
+Redirect entries should be added to the `redirects` dictionary in `conf.py`. For detailed information on redirect syntax, see the [sphinx-reredirects usage documentation](https://documatt.com/sphinx-reredirects/usage/#introduction).
+## Dependency Management
+Documentation dependencies are defined in `pyproject.toml` under the `[dependency-groups]` section:
+```toml
+[dependency-groups]
+docs = [
+    "sphinx>=8.1",
+    "nvidia-sphinx-theme>=0.0.8",
+    # ... other doc dependencies
+]
+```
+## Troubleshooting
+### Build Warnings
+The build process treats warnings as errors. Common issues:
+- **Missing toctree entries**: Documents must be referenced in a table of contents
+- **Non-consecutive headers**: Don't skip header levels (e.g., H1 → H3)
+- **Broken links**: Ensure all internal and external links are valid
+### Missing Dependencies
+If you encounter import errors, ensure the docs dependencies are installed:
+```bash
+uv pip install --python .venv-docs --group docs
+```
+## Viewing the Documentation
+After building, open `docs/build/html/index.html` in your browser, or use Python's built-in HTTP server:
+```bash
+cd docs/build/html
+python -m http.server 8000
+# Then visit http://localhost:8000 in your browser
+```
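The redirect step described in the README above can be sketched as follows. This is a minimal illustration only: the page names are hypothetical, the helper `sources_with_extension` is not part of Sphinx or sphinx-reredirects, and the key/value convention (source docname without an extension mapped to a target path) is taken from the sphinx-reredirects usage documentation, which should be checked for the exact rules.

```python
# Hypothetical excerpt in the style of docs/conf.py; page names are invented.
redirects = {
    # old docname (no file extension) -> new location
    "guides/old-name": "new-name.html",
    "architecture/overview": "../design-docs/overview.html",
}


def sources_with_extension(mapping):
    """Return source keys that wrongly include a file extension."""
    return [src for src in mapping if src.endswith((".md", ".rst", ".html"))]


# An empty list means every source key is a bare docname.
print(sources_with_extension(redirects))
```

Because the build treats warnings as errors, a small check like this could be run before building to catch a malformed entry early.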
--- a/fern/pages/agents/tool-calling.md
+++ b/fern/pages/agents/tool-calling.md
@@ -3,8 +3,6 @@
 # SPDX-License-Identifier: Apache-2.0
 ---
-# Tool Calling with Dynamo
 You can connect Dynamo to external tools and services using function calling (also known as tool calling). By providing a list of available functions, Dynamo can choose
 to output function arguments for the relevant function(s) which you can execute to augment the prompt with relevant external information.

--- a/fern/pages/api/nixl-connect/README.md
+++ b/fern/pages/api/nixl-connect/README.md
@@ -11,13 +11,14 @@ The relaxed registration comes with some performance overheads, but simplifies t
 Especially for larger data transfer operations, such as between models in a multi-model graph, the overhead would be marginal.
 The `dynamo.nixl_connect` library can be imported by any Dynamo container hosted application.
-> [!NOTE]
+> [!Note]
 > Dynamo NIXL Connect will pick the best available method of data transfer available to it.
 > The available methods depend on the hardware and software configuration of the machines and network running the graph.
 > GPU Direct RDMA operations require that both ends of the operation have:
 > - NIC and GPU capable of performing RDMA operations
 > - Device drivers that support GPU-NIC direct interactions (aka "zero copy") and RDMA operations
 > - Network that supports InfiniBand or RoCE
+>
 > With any of the above not satisfied, GPU Direct RDMA will not be available to the graph's workers, and less-optimal methods will be utilized to ensure basic functionality.
 > For additional information, please read this [GPUDirect RDMA](https://docs.nvidia.com/cuda/pdf/GPUDirect_RDMA.pdf) document.
@@ -85,12 +86,12 @@ flowchart LR
  e2@{ animate: true; }
 ```
-> [!NOTE]
+> [!Note]
 > When RDMA isn't available, the NIXL data transfer will still complete using non-accelerated methods.
 ### Multimodal Example
-In the case of the [Dynamo Multimodal Disaggregated Example](../../multimodal/vllm.md):
+In the case of the [Dynamo Multimodal Disaggregated Example](../../features/multimodal/multimodal-vllm.md):
 1. The HTTP frontend accepts a text prompt and a URL to an image.
@@ -134,17 +135,17 @@ flowchart LR
  o2@{ animate: true; }
 ```
-> [!NOTE]
+> [!Note]
 > In this example, it is the data transfer between the Prefill Worker and the Encode Worker that utilizes the Dynamo NIXL Connect library.
 > The KV Cache transfer between Decode Worker and Prefill Worker utilizes a different connector that also uses the NIXL-based I/O subsystem underneath.
 #### Code Examples
-See [MultimodalPDWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) or [MultimodalDecodeWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) from our Multimodal example,
+See [MultimodalPDWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) or [MultimodalDecodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/worker_handler.py) from our Multimodal example,
 for how they coordinate directly with the Encode Worker by creating a [`WritableOperation`](writable-operation.md),
 sending the operation's metadata via Dynamo's round-robin dispatcher, and awaiting the operation for completion before making use of the transferred data.
-See [MultimodalEncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/tree/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) from our Multimodal example,
+See [MultimodalEncodeWorkerHandler](https://github.com/ai-dynamo/dynamo/blob/main/components/src/dynamo/vllm/multimodal_handlers/encode_worker_handler.py) from our Multimodal example,
 for how the resulting embeddings are registered with the NIXL subsystem by creating a [`Descriptor`](descriptor.md),
 a [`WriteOperation`](write-operation.md) is created using the metadata provided by the requesting worker,
 and the worker awaits for the data transfer to complete for yielding a response.
@@ -165,5 +166,5 @@ and the worker awaits for the data transfer to complete for yielding a response.
  - [NVIDIA Dynamo](https://developer.nvidia.com/dynamo) @ [GitHub](https://github.com/ai-dynamo/dynamo)
  - [NVIDIA Inference Transfer Library (NIXL)](https://developer.nvidia.com/blog/introducing-nvidia-dynamo-a-low-latency-distributed-inference-framework-for-scaling-reasoning-ai-models/#nvidia_inference_transfer_library_nixl_low-latency_hardware-agnostic_communication%C2%A0) @ [GitHub](https://github.com/ai-dynamo/nixl)
-  - [Dynamo Multimodal Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal)
+  - [Dynamo Multimodal Example](https://github.com/ai-dynamo/dynamo/tree/main/examples/multimodal.md)
  - [NVIDIA GPU Direct](https://developer.nvidia.com/gpudirect)
--- a/fern/pages/api/nixl-connect/connector.md
+++ b/fern/pages/api/nixl-connect/connector.md
@@ -15,7 +15,7 @@ The connector provides two methods of moving data between workers:
  - Preparing local memory to be read by a remote worker.
-In both cases, local memory is registered with the NIXL-based I/O subsystem via the [`Descriptor`](descriptor.md) class and provided to the connector.
+In both cases, local memory is registered with the NIXL-based I/O subsystem via the [`Descriptor`](#descriptor) class and provided to the connector.
 When RDMA is available, the connector then configures the RDMA subsystem to expose the memory for the requested operation and returns an operation control object;
 otherwise the connector will select the best available RDMA alternative.
 The operation control object, either a [`ReadableOperation`](readable-operation.md) or a [`WritableOperation`](writable-operation.md),
@@ -24,7 +24,7 @@ provides NIXL metadata ([RdmaMetadata](rdma-metadata.md)) via its `.metadata()`
 The NIXL metadata must be provided to the remote worker expected to complete the operation.
 The metadata contains required information (identifiers, keys, etc.) which enables the remote worker to interact with the provided memory.
-> [!WARNING]
+> [!Warning]
 > NIXL metadata contains a worker's address as well as security keys to access specific registered memory descriptors.
 > This data provides direct memory access between workers, and should be considered sensitive and therefore handled accordingly.
@@ -37,7 +37,7 @@ The metadata contains required information (identifiers, keys, etc.) which enabl
      self.connector = dynamo.nixl_connect.Connector()
 ```
-> [!TIP]
+> [!Tip]
 > See [`ReadOperation`](read-operation.md#example-usage), [`ReadableOperation`](readable-operation.md#example-usage),
 > [`WritableOperation`](writable-operation.md#example-usage), and [`WriteOperation`](write-operation.md#example-usage)
 > for additional examples.

--- a/fern/pages/api/nixl-connect/rdma-metadata.md
+++ b/fern/pages/api/nixl-connect/rdma-metadata.md
@@ -9,13 +9,13 @@ A Pydantic type intended to provide JSON serialized NIXL metadata about a [`Read
 NIXL metadata contains detailed information about a worker process and how to access memory regions registered with the corresponding agent.
 This data is required to perform data transfers using the NIXL-based I/O subsystem.
-> [!WARNING]
+> [!Warning]
 > NIXL metadata contains information to connect corresponding backends across agents, as well as identification keys to access specific registered memory regions.
 > This data provides direct memory access between workers, and should be considered sensitive and therefore handled accordingly.
 Use the respective class's `.metadata()` method to generate an `RdmaMetadata` object for an operation.
-> [!TIP]
+> [!Tip]
 > Classes using `RdmaMetadata` objects must be paired correctly.
 > [`ReadableOperation`](readable-operation.md) with [`ReadOperation`](read-operation.md), and
 > [`WritableOperation`](write-operation.md) with [`WriteOperation`](write-operation.md).

--- a/fern/pages/backends/sglang/README.md
+++ b/fern/pages/backends/sglang/README.md
@@ -24,8 +24,8 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 - [Dynamo SGLang Integration](#dynamo-sglang-integration)
 - [Installation](#installation)
 - [Quick Start](#quick-start)
- [Aggregated Serving](#aggregated-serving)
+- [Single Node Examples](#run-single-node-examples)
- [Disaggregated Serving](#disaggregated-serving)
+- [Multi-Node and Advanced Examples](#advanced-examples)
 - [Deploy on SLURM or Kubernetes](#deployment)
 ## Feature Support Matrix
@@ -35,11 +35,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | Feature | SGLang | Notes |
 |---------|--------|-------|
 | [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ |  |
-| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md#conditional-disaggregation) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
+| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md) | 🚧 | WIP [PR](https://github.com/sgl-project/sglang/pull/7730) |
-| [**KV-Aware Routing**](../../router/kv-cache-routing.md) | ✅ |  |
+| [**KV-Aware Routing**](../../components/router/README.md) | ✅ |  |
-| [**SLA-Based Planner**](../../planner/sla-planner.md) | ✅ |  |
+| [**SLA-Based Planner**](../../components/planner/planner-guide.md) | ✅ |  |
-| [**Multimodal Support**](../../multimodal/sglang.md) | ✅ |  |
+| [**Multimodal Support**](../../features/multimodal/multimodal-sglang.md) | ✅ |  |
-| [**KVBM**](../../kvbm/kvbm-architecture.md) | ❌ | Planned |
+| [**KVBM**](../../components/kvbm/README.md) | ❌ | Planned |
 ## Dynamo SGLang Integration
@@ -55,7 +55,6 @@ Dynamo SGLang uses SGLang's native argument parser, so **most SGLang engine argu
 | Argument | Description | Default | SGLang Equivalent |
 |----------|-------------|---------|-------------------|
 | `--endpoint` | Dynamo endpoint in `dyn://namespace.component.endpoint` format | Auto-generated based on mode | N/A |
-| `--migration-limit` | Max times a request can migrate between workers for fault tolerance. See [Request Migration Architecture](../../fault-tolerance/request-migration.md). | `0` (disabled) | N/A |
 | `--dyn-tool-call-parser` | Tool call parser for structured outputs (takes precedence over `--tool-call-parser`) | `None` | `--tool-call-parser` |
 | `--dyn-reasoning-parser` | Reasoning parser for CoT models (takes precedence over `--reasoning-parser`) | `None` | `--reasoning-parser` |
 | `--use-sglang-tokenizer` | Use SGLang's tokenizer instead of Dynamo's | `False` | N/A |
@@ -90,23 +89,18 @@ For more details, see the [Request Cancellation Architecture](../../fault-tolera
 ### Install latest release
 We suggest using uv to install the latest release of ai-dynamo[sglang]. You can install it with `curl -LsSf https://astral.sh/uv/install.sh | sh`
-<details>
+<Accordion title="Expand for instructions">
-<summary>Expand for instructions</summary>
 ```bash
 # create a virtual env
 uv venv --python 3.12 --seed
 # install the latest release (which comes bundled with a stable sglang version)
 uv pip install "ai-dynamo[sglang]"
 ```
+</Accordion>
-</details>
 ### Install editable version for development
-<details>
+<Accordion title="Expand for instructions">
-<summary>Expand for instructions</summary>
 This requires having rust installed. We also recommend having a proper installation of the cuda toolkit as sglang requires `nvcc` to be available.
 ```bash
@@ -123,14 +117,11 @@ uv pip install -e .
 # install any sglang version >= 0.5.3.post2
 uv pip install "sglang[all]==0.5.3.post2"
 ```
+</Accordion>
-</details>
 ### Using docker containers
-<details>
+<Accordion title="Expand for instructions">
-<summary>Expand for instructions</summary>
 We are in the process of shipping pre-built docker containers that contain installations of DeepEP, DeepGEMM, and NVSHMEM in order to support WideEP and P/D. For now, you can quickly build the container from source with the following command.
 ```bash
@@ -156,8 +147,7 @@ docker run \
    --ipc host \
    dynamo-sglang:latest
 ```
+</Accordion>
-</details>
 ## Quick Start
@@ -178,6 +168,7 @@ docker compose -f deploy/docker-compose.yml up -d
 > [!TIP]
 > Each example corresponds to a simple bash script that runs the OpenAI compatible server, processor, and optional router (written in Rust) and LLM engine (written in Python) in a single terminal. You can easily take each command and run them in separate terminals.
+>
 > Additionally - because we use sglang's argument parser, you can pass in any argument that sglang supports to the worker!
@@ -204,9 +195,7 @@ cd $DYNAMO_HOME/examples/backends/sglang
 ./launch/agg_embed.sh
 ```
-<details>
+<Accordion title="Send the following request to verify your deployment:">
-<summary>Send the following request to verify your deployment:</summary>
 ```bash
 curl localhost:8000/v1/embeddings \
  -H "Content-Type: application/json" \
@@ -215,8 +204,7 @@ curl localhost:8000/v1/embeddings \
    "input": "Hello, world!"
  }'
 ```
+</Accordion>
-</details>
 ### Disaggregated serving
@@ -273,4 +261,4 @@ We currently provide deployment examples for Kubernetes and SLURM.
 - **[Deploying Dynamo with SGLang on Kubernetes](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/deploy/README.md)**
 ## SLURM
- **[Deploying Dynamo with SGLang on SLURM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/slurm_jobs/README.md)**
+- **[Deploying Dynamo with SGLang on SLURM](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/sglang/slurm-jobs/README.md)**
--- a/fern/pages/backends/sglang/diffusion-lm.md
+++ b/fern/pages/backends/sglang/diffusion-lm.md
+---
+# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-License-Identifier: Apache-2.0
+---
+# Running Diffusion LMs with SGLang
+Diffusion Language Models (Diffusion LMs) are a class of generative models that use diffusion processes for text generation. This guide shows how to deploy diffusion models like LLaDA2.0 using SGLang as the backend with Dynamo. Diffusion LMs work differently from autoregressive models - they iteratively refine generated text through a diffusion process.
+## Launch the Deployment
+### Using the Launch Script (Recommended)
+The easiest way to start the diffusion LM service is using the provided launch script:
+```bash
+bash examples/backends/sglang/launch/diffusion_llada.sh
+```
+### Manual Launch Steps
+If you prefer to launch components manually:
+**Start frontend**
+```bash
+python -m dynamo.frontend --http-port 8001 &
+```
+**Run diffusion worker**
+```bash
+export CUDA_VISIBLE_DEVICES=0,1
+python -m dynamo.sglang \
+  --model-path inclusionAI/LLaDA2.0-mini-preview \
+  --tp-size 2 \
+  --skip-tokenizer-init \
+  --trust-remote-code \
+  --endpoint dyn://dynamo.backend.generate \
+  --enable-metrics \
+  --disable-cuda-graph \
+  --disable-overlap-schedule \
+  --attention-backend triton \
+  --dllm-algorithm LowConfidence
+```
+## Diffusion Algorithms
+The diffusion worker uses the **LowConfidence** algorithm for the iterative refinement process. This algorithm refines tokens with low confidence scores, progressively replacing masked tokens with the model's predictions until confidence thresholds are met.
+For more details on diffusion algorithms and configuration options, refer to the [SGLang Diffusion Language Models documentation](https://github.com/sgl-project/sglang/blob/main/docs/supported_models/diffusion_language_models.md).
+## Testing the Deployment
+Once deployed, you can test the service using curl:
+```bash
+curl -X POST http://localhost:8001/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "inclusionAI/LLaDA2.0-mini-preview",
+    "messages": [
+      {
+        "role": "user",
+        "content": "Hello! How are you?"
+      }
+    ],
+    "temperature": 0.7,
+    "max_tokens": 512
+  }'
+```
+Or use the completions endpoint:
+```bash
+curl -X POST http://localhost:8001/v1/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "inclusionAI/LLaDA2.0-mini-preview",
+    "prompt": "Once upon a time",
+    "max_tokens": 256
+  }'
+```
\ No newline at end of file
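For readers who prefer Python over curl, the chat completions request from the guide above could be built roughly as follows. This is a sketch using only the standard library; `build_chat_request` and `send` are hypothetical names, and the endpoint, port, model name, and payload fields are copied from the curl example rather than from any Dynamo client API.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8001",
                       model="inclusionAI/LLaDA2.0-mini-preview"):
    """Build the same POST request as the curl example, without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
        "max_tokens": 512,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send the request and return the decoded JSON body (needs a running deployment)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Usage, once the frontend and worker are running:
#   body = send(build_chat_request("Hello! How are you?"))
```

Keeping request construction separate from sending makes it possible to inspect the payload without a live deployment.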
--- a/fern/pages/backends/sglang/profiling.md
+++ b/fern/pages/backends/sglang/profiling.md
@@ -5,6 +5,9 @@
 # Profiling SGLang Workers in Dynamo
+> [!NOTE]
+> **See also**: [Profiler Component Overview](../../components/profiler/README.md) for SLA-driven profiling and deployment optimization.
 Dynamo exposes profiling endpoints for SGLang workers via the system server's `/engine/*` routes. This allows you to start and stop PyTorch profiling on running inference workers without restarting them.
 These endpoints wrap SGLang's internal `TokenizerManager.start_profile()` and `stop_profile()` methods. See SGLang's documentation for the full list of supported parameters.

--- a/fern/pages/backends/trtllm/README.md
+++ b/fern/pages/backends/trtllm/README.md
@@ -23,7 +23,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 ## Table of Contents
 - [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#tensorrt-llm-quick-start)
+- [Quick Start](#quick-start)
 - [Single Node Examples](#single-node-examples)
 - [Advanced Examples](#advanced-examples)
 - [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
@@ -31,7 +31,9 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 - [Benchmarking](#benchmarking)
 - [Multimodal Support](#multimodal-support)
 - [Logits Processing](#logits-processing)
+- [DP Rank Routing](#dp-rank-routing-attention-data-parallelism)
 - [Performance Sweep](#performance-sweep)
+- [Known Issues and Mitigations](#known-issues-and-mitigations)
 ## Feature Support Matrix
@@ -40,11 +42,11 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | Feature | TensorRT-LLM | Notes |
 |---------|--------------|-------|
 | [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ |  |
-| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
+| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md) | 🚧 | Not supported yet |
-| [**KV-Aware Routing**](../../router/kv-cache-routing.md) | ✅ |  |
+| [**KV-Aware Routing**](../../components/router/README.md) | ✅ |  |
-| [**SLA-Based Planner**](../../planner/sla-planner.md) | ✅ |  |
+| [**SLA-Based Planner**](../../components/planner/planner-guide.md) | ✅ |  |
-| [**Load Based Planner**](../../planner/load-planner.md) | 🚧 | Planned |
+| [**Load Based Planner**](../../components/planner/README.md) | 🚧 | Planned |
-| [**KVBM**](../../kvbm/kvbm-architecture.md) | ✅ | |
+| [**KVBM**](../../components/kvbm/README.md) | ✅ | |
 ### Large Scale P/D and WideEP Features
@@ -97,10 +99,10 @@ apt-get update && apt-get -y install git git-lfs
 ## Single Node Examples
-> [!WARNING]
+> [!IMPORTANT]
 > Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
-For detailed information about the architecture and how KV-aware routing works, see the [KV Cache Routing documentation](../../router/kv-cache-routing.md).
+For detailed information about the architecture and how KV-aware routing works, see the [Router Guide](../../components/router/router-guide.md).
 ### Aggregated
 ```bash
@@ -123,7 +125,7 @@ cd $DYNAMO_HOME/examples/backends/trtllm
 ### Disaggregated with KV Routing
-> [!WARNING]
+> [!IMPORTANT]
 > In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
 ```bash
@@ -152,10 +154,10 @@ Below we provide a selected list of advanced examples. Please open up an issue i
 ### Multinode Deployment
-For comprehensive instructions on multinode serving, see the [multinode-examples.md](multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](llama4-plus-eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
+For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. See the [Llama4+eagle](./llama4-plus-eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.
 ### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](llama4-plus-eagle.md)**
+- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4-plus-eagle.md)**
 ### Kubernetes Deployment
@@ -170,26 +172,16 @@ NOTE: To send a request to a multi-node deployment, target the node which is run
 ### Benchmarking
 To benchmark your deployment with AIPerf, see this utility script, configuring the
-`model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/llm/perf.sh)
+`model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/llm/perf.sh)
 ## KV Cache Transfer in Disaggregated Serving
-Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](kv-cache-transfer.md).
+Dynamo with TensorRT-LLM supports two methods for transferring KV cache in disaggregated serving: UCX (default) and NIXL (experimental). For detailed information and configuration instructions for each method, see the [KV cache transfer guide](./kv-cache-transfer.md).
 ## Request Migration
-You can enable [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
+Dynamo supports [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for configuration details.
-```bash
-# For decode and aggregated workers
-python3 -m dynamo.trtllm ... --migration-limit=3
-```
-> [!WARNING]
-> **Prefill workers do not support request migration** and must use `--migration-limit=0` (the default). Prefill workers only process prompts and return KV cache state - they don't maintain long-running generation requests that would benefit from migration.
-See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for details on how this works.
 ## Request Cancellation
@@ -213,11 +205,11 @@ NOTE: To send a request to a multi-node deployment, target the node which is run
 ## Benchmarking
 To benchmark your deployment with AIPerf, see this utility script, configuring the
-`model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/tree/main/benchmarks/llm/perf.sh)
+`model` name and `host` based on your deployment: [perf.sh](https://github.com/ai-dynamo/dynamo/blob/main/benchmarks/llm/perf.sh)
 ## Multimodal support
-Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../multimodal/trtllm.md).
+Dynamo with the TensorRT-LLM backend supports multimodal models, enabling you to process both text and images (or pre-computed embeddings) in a single request. For detailed setup instructions, example requests, and best practices, see the [TensorRT-LLM Multimodal Guide](../../features/multimodal/multimodal-trtllm.md).
 ## Logits Processing
@@ -276,12 +268,67 @@ sampling_params.logits_processor = create_trtllm_adapters(processors)
 - Processors must modify logits in-place and not return a new tensor.
 - If your processor needs tokenization, ensure the tokenizer is initialized (do not skip tokenizer init).
+## DP Rank Routing (Attention Data Parallelism)
+TensorRT-LLM supports [attention data parallelism](https://lmsys.org/blog/2024-12-04-sglang-v0-4/#data-parallelism-attention-for-deepseek-models) (attention DP) for models like DeepSeek. When enabled, multiple attention DP ranks run within a single worker, each with its own KV cache. Dynamo can route requests to specific DP ranks based on KV cache state.
+### Dynamo vs TRT-LLM Internal Routing
+- **Dynamo DP Rank Routing**: The router selects the optimal DP rank based on KV cache overlap and instructs TRT-LLM to use that rank with strict routing (`attention_dp_relax=False`). Use this with `--router-mode kv` for cache-aware routing.
+- **TRT-LLM Internal Routing**: TRT-LLM's scheduler assigns DP ranks internally. Use this with `--router-mode round-robin` or `random` when KV-aware routing isn't needed.
+### Enabling DP Rank Routing
+```bash
+# Worker with attention DP
+# (TP=2 acts as the "world size", in effect creating 2 attention DP ranks)
+CUDA_VISIBLE_DEVICES=0,1 python3 -m dynamo.trtllm \
+  --model-path <MODEL_PATH> \
+  --tensor-parallel-size 2 \
+  --enable-attention-dp \
+  --publish-events-and-metrics
+# Frontend with KV routing
+python3 -m dynamo.frontend --router-mode kv
+```
+The `--enable-attention-dp` flag sets `attention_dp_size = tensor_parallel_size` and configures Dynamo to publish KV events per DP rank. The router automatically creates routing targets for each `(worker_id, dp_rank)` combination.
+> [!NOTE]
+> Attention DP requires TRT-LLM's PyTorch backend. AutoDeploy does not support attention DP.
 ## Performance Sweep
-For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance_sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
+For detailed instructions on running comprehensive performance sweeps across both aggregated and disaggregated serving configurations, see the [TensorRT-LLM Benchmark Scripts for DeepSeek R1 model](https://github.com/ai-dynamo/dynamo/tree/main/examples/backends/trtllm/performance-sweeps/README.md). This guide covers recommended benchmarking setups, usage of provided scripts, and best practices for evaluating system performance.
 ## Dynamo KV Block Manager Integration
 Dynamo with TensorRT-LLM currently supports integration with the Dynamo KV Block Manager. This integration can significantly reduce time-to-first-token (TTFT) latency, particularly in usage patterns such as multi-turn conversations and repeated long-context requests.
-Here is the instruction: [Running KVBM in TensorRT-LLM](../../kvbm/trtllm-setup.md) .
+For setup instructions, see [Running KVBM in TensorRT-LLM](../../components/kvbm/kvbm-guide.md#run-kvbm-in-dynamo-with-tensorrt-llm).
+## Known Issues and Mitigations
+### KV Cache Exhaustion Causing Worker Deadlock (Disaggregated Serving)
+**Issue:** In disaggregated serving mode, TensorRT-LLM workers can become stuck and unresponsive after sustained high-load traffic. Once in this state, workers require a pod/process restart to recover.
+**Symptoms:**
+- Workers function normally initially but hang after heavy load testing
+- Inference requests get stuck and eventually time out
+- Logs show warnings: `num_fitting_reqs=0 and fitting_disagg_gen_init_requests is empty, may not have enough kvCache`
+- Error logs may contain: `asyncio.exceptions.InvalidStateError: invalid state`
+**Root Cause:** When `max_tokens_in_buffer` in the cache transceiver config is smaller than the maximum input sequence length (ISL) being processed, KV cache exhaustion can occur under heavy load. This causes context transfers to time out, leaving workers waiting on transfers that will never complete and putting them in an unrecoverable deadlock.
+**Mitigation:** Ensure `max_tokens_in_buffer` exceeds your maximum expected input sequence length. Update your engine configuration files (e.g., `prefill.yaml` and `decode.yaml`):
+```yaml
+cache_transceiver_config:
+  backend: DEFAULT
+  max_tokens_in_buffer: 65536  # Must exceed max ISL
+```
+For example, see `examples/backends/trtllm/engine_configs/gpt-oss-120b/prefill.yaml`.
+**Related Issue:** [#4327](https://github.com/ai-dynamo/dynamo/issues/4327)
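A minimal pre-flight check for this mitigation might look like the sketch below. It assumes the engine config has already been loaded into a Python dict (for example with a YAML parser); the function name `check_transceiver_buffer` and the sample values are hypothetical and not part of Dynamo or TensorRT-LLM.

```python
def check_transceiver_buffer(engine_config, max_isl):
    """Return (ok, message) for the max_tokens_in_buffer mitigation.

    Hypothetical helper: engine_config is a dict mirroring prefill.yaml or
    decode.yaml; max_isl is the largest input sequence length you expect.
    """
    transceiver = engine_config.get("cache_transceiver_config", {})
    buffer_tokens = transceiver.get("max_tokens_in_buffer")
    if buffer_tokens is None:
        return False, "max_tokens_in_buffer is not set"
    # The mitigation requires the buffer to strictly exceed the max ISL.
    if buffer_tokens <= max_isl:
        return False, (
            f"max_tokens_in_buffer={buffer_tokens} does not exceed max ISL={max_isl}"
        )
    return True, "ok"


# Dict equivalent of the YAML snippet above.
config = {"cache_transceiver_config": {"backend": "DEFAULT", "max_tokens_in_buffer": 65536}}
print(check_transceiver_buffer(config, max_isl=32768))
print(check_transceiver_buffer(config, max_isl=131072))
```

Running a check like this against both the prefill and decode configs before a load test can surface a too-small buffer before workers hang.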
--- a/fern/pages/backends/trtllm/gemma3-sliding-window-attention.md
+++ b/fern/pages/backends/trtllm/gemma3-sliding-window-attention.md
@@ -8,7 +8,7 @@
 This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Since google/gemma-3-1b-it is a small model, each aggregated, decode, or prefill worker only requires one H100 GPU or one GB200 GPU.
 VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. An example of this is Gemma 3, which incorporates both global attention layers and sliding window layers.
-> [!NOTE]
+> [!Note]
 > - Ensure that required services such as `nats` and `etcd` are running before starting.
 > - Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.
 > - It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.

--- a/fern/pages/backends/trtllm/kv-cache-transfer.md
+++ b/fern/pages/backends/trtllm/kv-cache-transfer.md
@@ -16,43 +16,11 @@ By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX
 ### Specify Backends for NIXL
-NIXL supports multiple communication backends that can be configured via environment variables. By default, UCX is used if no backends are explicitly specified.
+TensorRT-LLM supports two NIXL communication backends: UCX and LIBFABRIC. By default, UCX is used if no backend is explicitly specified. Dynamo currently only supports the UCX backend, as LIBFABRIC support is still a work in progress. Please do not change the NIXL backend in the Dynamo runtime image.
-**Environment Variable Format:**
-```bash
-DYN_KVBM_NIXL_BACKEND_<BACKEND>=<value>
-```
-**Supported Backends:**
- `UCX` - Unified Communication X (default)
- `GDS` - GPU Direct Storage
-**Examples:**
-```bash
-# Enable UCX backend (default behavior)
-export DYN_KVBM_NIXL_BACKEND_UCX=true
-# Enable GDS backend
-export DYN_KVBM_NIXL_BACKEND_GDS=true
-# Enable multiple backends
-export DYN_KVBM_NIXL_BACKEND_UCX=true
-export DYN_KVBM_NIXL_BACKEND_GDS=true
-# Explicitly disable a backend
-export DYN_KVBM_NIXL_BACKEND_GDS=false
-```
-**Valid Values:**
- `true`, `1`, `on`, `yes` - Enable the backend
- `false`, `0`, `off`, `no` - Disable the backend
-> [!NOTE]
-> If no `DYN_KVBM_NIXL_BACKEND_*` environment variables are set, UCX is used as the default backend.
 ## Alternative Method: UCX
 TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. To enable UCX as the KV cache transfer backend, set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.
-> [!NOTE]
+> [!Note]
-> The environment variable `TRTLLM_USE_UCX_KV_CACHE=1` with `cache_transceiver_config.backend: DEFAULT` does not enable UCX. You must explicitly set `backend: UCX` in the configuration.
+> The environment variable `TRTLLM_USE_UCX_KVCACHE=1` with `cache_transceiver_config.backend: DEFAULT` does not enable UCX. You must explicitly set `backend: UCX` in the configuration.
--- a/fern/pages/backends/trtllm/llama4-plus-eagle.md
+++ b/fern/pages/backends/trtllm/llama4-plus-eagle.md
@@ -5,7 +5,7 @@
 # Llama 4 Maverick Instruct with Eagle Speculative Decoding on SLURM
-This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](multinode/multinode-examples.md) to set up the environment for the following scenarios:
+This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](./multinode/multinode-examples.md) to set up the environment for the following scenarios:
 - **Aggregated Serving:**
  Deploy the entire Llama 4 model on a single GB200x4 node for end-to-end serving.
@@ -34,7 +34,7 @@ export MODEL_PATH="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
 export SERVED_MODEL_NAME="nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8"
 ```
-See [this](multinode/multinode-examples.md#setup) section from multinode guide to learn more about the above options.
+See the [setup section](./multinode/multinode-examples.md#setup) of the multinode guide to learn more about the above options.
 ## Aggregated Serving
@@ -56,7 +56,7 @@ export DECODE_ENGINE_CONFIG="/mnt/examples/backends/trtllm/engine_configs/llama4
 ## Example Request
-See [here](multinode/multinode-examples.md#example-request) to learn how to send a request to the deployment.
+See [here](./multinode/multinode-examples.md#example-request) to learn how to send a request to the deployment.
 ```
 curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{

--- a/fern/pages/backends/trtllm/multinode/multinode-examples.md
+++ b/fern/pages/backends/trtllm/multinode/multinode-examples.md
@@ -38,7 +38,7 @@ For simplicity of the example, we will make some assumptions about your slurm cl
   If your cluster supports similar container based plugins, you may be able to
   modify the script to use that instead.
 3. Third, we assume you have already built a recent Dynamo+TRTLLM container image as
-   described [here](https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container).
+   described [here](../README.md#build-container).
   This is the image that can be set to the `IMAGE` environment variable in later steps.
 4. Fourth, we assume you pre-allocate a group of nodes using `salloc`. We
   will allocate 8 nodes below as a reference command to have enough capacity
@@ -77,7 +77,7 @@ following environment variables based:
 ```bash
 # NOTE: IMAGE must be set manually for now
 # To build an image, see the steps here:
-# https://github.com/ai-dynamo/dynamo/tree/main/docs/backends/trtllm/README.md#build-container
+# ../README.md#build-container
 export IMAGE="<dynamo_trtllm_image>"
 # MOUNTS are the host:container path pairs that are mounted into the containers
@@ -149,7 +149,7 @@ Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode)
 following the setup above, follow these steps below to launch a **disaggregated**
 deployment across 8 nodes:
-> [!TIP]
+> [!Tip]
 > Make sure you have a fresh environment and that the aggregated
 > example above is no longer deployed on the same set of nodes.
@@ -176,7 +176,7 @@ deployment across 8 nodes:
 ./srun_disaggregated.sh
 ```
-> [!TIP]
+> [!Tip]
 > To launch multiple replicas of the configured prefill/decode workers, you can set
 > NUM_PREFILL_WORKERS and NUM_DECODE_WORKERS respectively (default: 1).

--- a/fern/pages/backends/vllm/README.md
+++ b/fern/pages/backends/vllm/README.md
@@ -23,7 +23,7 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 ## Table of Contents
 - [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#vllm-quick-start)
+- [Quick Start](#quick-start)
 - [Single Node Examples](#run-single-node-examples)
 - [Advanced Examples](#advanced-examples)
 - [Deploy on Kubernetes](#kubernetes-deployment)
@@ -36,13 +36,13 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
 | Feature | vLLM | Notes |
 |---------|------|-------|
 | [**Disaggregated Serving**](../../design-docs/disagg-serving.md) | ✅ |  |
-| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md#conditional-disaggregation) | 🚧 | WIP |
+| [**Conditional Disaggregation**](../../design-docs/disagg-serving.md) | 🚧 | WIP |
-| [**KV-Aware Routing**](../../router/kv-cache-routing.md) | ✅ |  |
+| [**KV-Aware Routing**](../../components/router/README.md) | ✅ |  |
-| [**SLA-Based Planner**](../../planner/sla-planner.md) | ✅ |  |
+| [**SLA-Based Planner**](../../components/planner/planner-guide.md) | ✅ |  |
-| [**Load Based Planner**](../../planner/load-planner.md) | 🚧 | WIP |
+| [**Load Based Planner**](../../components/planner/README.md) | 🚧 | WIP |
-| [**KVBM**](../../kvbm/kvbm-architecture.md) | ✅ |  |
+| [**KVBM**](../../components/kvbm/README.md) | ✅ |  |
-| [**LMCache**](LMCache-Integration.md) | ✅ |  |
+| [**LMCache**](../../integrations/lmcache-integration.md) | ✅ |  |
-| [**Prompt Embeddings**](prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag |
+| [**Prompt Embeddings**](./prompt-embeddings.md) | ✅ | Requires `--enable-prompt-embeds` flag |
 ### Large Scale P/D and WideEP Features
@@ -87,7 +87,7 @@ This includes the specific commit [vllm-project/vllm#19790](https://github.com/v
 ## Run Single Node Examples
-> [!WARNING]
+> [!IMPORTANT]
 > Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
 ### Aggregated Serving
@@ -144,7 +144,9 @@ Below we provide a selected list of advanced deployments. Please open up an issu
 Run **Meta-Llama-3.1-8B-Instruct** with **Eagle3** as a draft model using **aggregated speculative decoding** on a single node.
 This setup demonstrates how to use Dynamo to create an instance using Eagle-based speculative decoding under the **vLLM aggregated serving framework** for faster inference while maintaining accuracy.
-**Guide:** [Speculative Decoding Quickstart](speculative-decoding.md)
+**Guide:** [Speculative Decoding Quickstart](../../features/speculative-decoding/speculative-decoding-vllm.md)
+> **See also:** [Speculative Decoding Feature Overview](../../features/speculative-decoding/README.md) for cross-backend documentation.
 ### Kubernetes Deployment
@@ -177,17 +179,11 @@ When using KV-aware routing, ensure deterministic hashing across processes to av
 ```bash
 vllm serve ... --enable-prefix-caching --prefix-caching-algo sha256
 ```
-See the high-level notes in [KV Cache Routing](../../router/kv-cache-routing.md) on deterministic event IDs.
+See the high-level notes in [Router Design](../../design-docs/router-design.md#deterministic-event-ids) on deterministic event IDs.
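To see why a content-based hash such as SHA-256 matters here, the sketch below chains SHA-256 over fixed-size token blocks so that identical prefixes produce identical block IDs in any process. It is an illustration only and not vLLM's actual hashing scheme; the function name `block_hashes`, the block size, and the byte encoding are assumptions. By contrast, Python's built-in `hash()` of strings and bytes is salted per process by default, so IDs derived from it would not match across workers.

```python
import hashlib


def block_hashes(token_ids, block_size=4):
    """Chain SHA-256 over full token blocks.

    Hypothetical sketch: each block hash covers the parent hash plus the
    block's tokens, so a hash identifies the entire prefix up to that block.
    """
    hashes = []
    parent = b""
    # Only full blocks are hashed; a trailing partial block is ignored.
    full_length = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_length, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent.hex())
    return hashes


a = block_hashes([1, 2, 3, 4, 5, 6, 7, 8])
b = block_hashes([1, 2, 3, 4, 9, 9, 9, 9])
print(a[0] == b[0])  # shared first block
print(a[1] == b[1])  # diverging second block
```

Because the digest depends only on the token content, two processes that see the same prefix agree on the block IDs, which is the property KV-aware routing relies on.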
 ## Request Migration
-You can enable [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. Use the `--migration-limit` flag to specify how many times a request can be migrated to another worker:
+Dynamo supports [request migration](../../fault-tolerance/request-migration.md) to handle worker failures gracefully. When enabled, requests can be automatically migrated to healthy workers if a worker fails mid-generation. See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for configuration details.
-```bash
-python3 -m dynamo.vllm ... --migration-limit=3
-```
-This allows a request to be migrated up to 3 times before failing. See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for details on how this works.
 ## Request Cancellation