docs: update Fern docs for main branch (#5706)

Signed-off-by: Jont828 <jt572@cornell.edu>

Commit 7ca6a562 (unverified) · authored Jan 30, 2026 by Jonathan Tong · committed by GitHub Jan 30, 2026
20 changed files
--- a/fern/pages/backends/trtllm/README.md
+++ b/fern/pages/backends/trtllm/README.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "LLM Deployment using TensorRT-LLM"
 ---

+# LLM Deployment using TensorRT-LLM
+
 This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.

 ## Use the Latest Release
@@ -57,14 +58,19 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

 Below we provide a guide that lets you run all of our the common deployment patterns on a single node.

-### Start NATS and ETCD in the background
+### Start Infrastructure Services (Local Development Only)

-Start using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml)
+For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml):

 ```bash
 docker compose -f deploy/docker-compose.yml up -d
 ```

+> [!NOTE]
+> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
+> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
+> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
+
 ### Build container

 ```bash
@@ -91,9 +97,8 @@ apt-get update && apt-get -y install git git-lfs

 ## Single Node Examples

-<Warning>
-Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.
-</Warning>
+> [!WARNING]
+> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `python3 -m dynamo.frontend <args>` to start up the ingress and using `python3 -m dynamo.trtllm <args>` to start up the workers. You can easily take each command and run them in separate terminals.

 For detailed information about the architecture and how KV-aware routing works, see the [KV Cache Routing documentation](../../router/kv-cache-routing.md).

@@ -118,9 +123,8 @@ cd $DYNAMO_HOME/examples/backends/trtllm

 ### Disaggregated with KV Routing

-<Warning>
-In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.
-</Warning>
+> [!WARNING]
+> In disaggregated workflow, requests are routed to the prefill worker to maximize KV cache reuse.

 ```bash
 cd $DYNAMO_HOME/examples/backends/trtllm
@@ -182,9 +186,8 @@ You can enable [request migration](../../fault-tolerance/request-migration.md) t
 python3 -m dynamo.trtllm ... --migration-limit=3
 ```

-<Warning>
-**Prefill workers do not support request migration** and must use `--migration-limit=0` (the default). Prefill workers only process prompts and return KV cache state - they don't maintain long-running generation requests that would benefit from migration.
-</Warning>
+> [!WARNING]
+> **Prefill workers do not support request migration** and must use `--migration-limit=0` (the default). Prefill workers only process prompts and return KV cache state - they don't maintain long-running generation requests that would benefit from migration.

 See the [Request Migration Architecture](../../fault-tolerance/request-migration.md) documentation for details on how this works.


--- a/fern/pages/backends/trtllm/gemma3-sliding-window-attention.md
+++ b/fern/pages/backends/trtllm/gemma3-sliding-window-attention.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Gemma 3 with Variable Sliding Window Attention"
 ---

+# Gemma 3 with Variable Sliding Window Attention
+
 This guide demonstrates how to deploy google/gemma-3-1b-it with Variable Sliding Window Attention (VSWA) using Dynamo. Since google/gemma-3-1b-it is a small model, each aggregated, decode, or prefill worker only requires one H100 GPU or one GB200 GPU.
 VSWA is a mechanism in which a model’s layers alternate between multiple sliding window sizes. An example of this is Gemma 3, which incorporates both global attention layers and sliding window layers.

-<Note>
- Ensure that required services such as `nats` and `etcd` are running before starting.
- Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.
- It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.
-</Note>
+> [!NOTE]
+> - Ensure that required services such as `nats` and `etcd` are running before starting.
+> - Request access to `google/gemma-3-1b-it` on Hugging Face and set your `HF_TOKEN` environment variable for authentication.
+> - It's recommended to continue using the VSWA feature with the Dynamo 0.5.0 release and the TensorRT-LLM dynamo runtime image nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.0. The 0.5.1 release bundles TensorRT-LLM v1.1.0rc5, which has a regression that breaks VSWA.

 ## Aggregated Serving
 ```bash

--- a/fern/pages/backends/trtllm/gpt-oss.md
+++ b/fern/pages/backends/trtllm/gpt-oss.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Running gpt-oss-120b Disaggregated with TensorRT-LLM"
 ---

+# Running gpt-oss-120b Disaggregated with TensorRT-LLM
+
 Dynamo supports disaggregated serving of gpt-oss-120b with TensorRT-LLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single B200 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.

 ## Overview

--- a/fern/pages/backends/trtllm/kv-cache-transfer.md
+++ b/fern/pages/backends/trtllm/kv-cache-transfer.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "KV Cache Transfer in Disaggregated Serving"
 ---

+# KV Cache Transfer in Disaggregated Serving
+
 In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:

+## Using NIXL for KV Cache Transfer
+
+Start the disaggregated service: See [Disaggregated Serving](./README.md#disaggregated) to learn how to start the deployment.
+
 ## Default Method: NIXL
 By default, TensorRT-LLM uses **NIXL** (NVIDIA Inference Xfer Library) with UCX (Unified Communication X) as backend for KV cache transfer between prefill and decode workers. [NIXL](https://github.com/ai-dynamo/nixl) is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.

 ### Specify Backends for NIXL

-TODO: Add instructions for how to specify different backends for NIXL.
+NIXL supports multiple communication backends that can be configured via environment variables. By default, UCX is used if no backends are explicitly specified.

-## Alternative Method: UCX
+**Environment Variable Format:**
+```bash
+DYN_KVBM_NIXL_BACKEND_<BACKEND>=<value>
+```
+
+**Supported Backends:**
+- `UCX` - Unified Communication X (default)
+- `GDS` - GPU Direct Storage
+
+**Examples:**
+```bash
+# Enable UCX backend (default behavior)
+export DYN_KVBM_NIXL_BACKEND_UCX=true

-TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. There are two ways to enable UCX as the KV cache transfer backend:
+# Enable GDS backend
+export DYN_KVBM_NIXL_BACKEND_GDS=true
+
+# Enable multiple backends
+export DYN_KVBM_NIXL_BACKEND_UCX=true
+export DYN_KVBM_NIXL_BACKEND_GDS=true
+
+# Explicitly disable a backend
+export DYN_KVBM_NIXL_BACKEND_GDS=false
+```
+
+**Valid Values:**
+- `true`, `1`, `on`, `yes` - Enable the backend
+- `false`, `0`, `off`, `no` - Disable the backend
+
+> [!NOTE]
+> If no `DYN_KVBM_NIXL_BACKEND_*` environment variables are set, UCX is used as the default backend.
+
+## Alternative Method: UCX

-1. **Recommended:** Set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.
-2. Alternatively, set the environment variable `TRTLLM_USE_UCX_KV_CACHE=1` and configure `cache_transceiver_config.backend: DEFAULT` in the engine configuration YAML.
+TensorRT-LLM can also leverage **UCX** (Unified Communication X) directly for KV cache transfer between prefill and decode workers. To enable UCX as the KV cache transfer backend, set `cache_transceiver_config.backend: UCX` in your engine configuration YAML file.

-This flexibility allows users to choose the most suitable method for their deployment and compatibility requirements.
+> [!NOTE]
+> The environment variable `TRTLLM_USE_UCX_KV_CACHE=1` with `cache_transceiver_config.backend: DEFAULT` does not enable UCX. You must explicitly set `backend: UCX` in the configuration.
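
Review note: the backend-selection rules added to `kv-cache-transfer.md` can be sketched as below. This is a minimal illustration in Python, not Dynamo's or NIXL's actual parser. The helper name `resolve_nixl_backends` is hypothetical, and three behaviors are assumptions the doc does not state: values are matched case-insensitively, unrecognized values raise an error, and a configuration where every listed backend is disabled yields an empty set.

```python
import os
from typing import Mapping, Optional, Set

PREFIX = "DYN_KVBM_NIXL_BACKEND_"
# Value lists copied from the "Valid Values" section of the doc.
TRUTHY = {"true", "1", "on", "yes"}
FALSY = {"false", "0", "off", "no"}


def resolve_nixl_backends(env: Optional[Mapping[str, str]] = None) -> Set[str]:
    """Return the enabled backend names (hypothetical helper, not a Dynamo API)."""
    env = os.environ if env is None else env
    enabled: Set[str] = set()
    seen_any = False
    for key, raw in env.items():
        if not key.startswith(PREFIX):
            continue
        seen_any = True
        backend = key[len(PREFIX):].upper()
        value = raw.strip().lower()  # assumption: case-insensitive matching
        if value in TRUTHY:
            enabled.add(backend)
        elif value not in FALSY:
            # Assumption: reject anything outside the documented value lists.
            raise ValueError(f"invalid value {raw!r} for {key}")
    # Documented default: UCX when no DYN_KVBM_NIXL_BACKEND_* variable is set.
    return enabled if seen_any else {"UCX"}
```

Passing a plain dictionary instead of relying on `os.environ` keeps the sketch easy to check: for example, `resolve_nixl_backends({})` falls back to UCX, while setting only `DYN_KVBM_NIXL_BACKEND_GDS=true` selects GDS alone.
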
--- a/fern/pages/backends/trtllm/llama4-plus-eagle.md
+++ b/fern/pages/backends/trtllm/llama4-plus-eagle.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Llama 4 Maverick Instruct with Eagle Speculative Decoding on SLURM"
 ---

+# Llama 4 Maverick Instruct with Eagle Speculative Decoding on SLURM
+
 This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](multinode/multinode-examples.md) to set up the environment for the following scenarios:

 - **Aggregated Serving:**

--- a/fern/pages/backends/trtllm/multinode/multinode-examples.md
+++ b/fern/pages/backends/trtllm/multinode/multinode-examples.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Example: Multi-node TRTLLM Workers with Dynamo on Slurm"
 ---

+# Example: Multi-node TRTLLM Workers with Dynamo on Slurm
+
 > **Note:** The scripts referenced in this example (such as `srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).

 To run a single Dynamo+TRTLLM Worker that spans multiple nodes (ex: TP16),
@@ -148,10 +149,9 @@ Assuming you have at least 8 nodes allocated (4 for prefill, 4 for decode)
 following the setup above, follow these steps below to launch a **disaggregated**
 deployment across 8 nodes:

-<Tip>
-Make sure you have a fresh environment and don't still have the aggregated
-example above still deployed on the same set of nodes.
-</Tip>
+> [!TIP]
+> Make sure you have a fresh environment and don't still have the aggregated
+> example above still deployed on the same set of nodes.

 ```bash
 # Defaults set in srun_disaggregated.sh, but can customize here.
@@ -176,10 +176,9 @@ example above still deployed on the same set of nodes.
 ./srun_disaggregated.sh
 ```

-<Tip>
-To launch multiple replicas of the configured prefill/decode workers, you can set
-NUM_PREFILL_WORKERS and NUM_DECODE_WORKERS respectively (default: 1).
-</Tip>
+> [!TIP]
+> To launch multiple replicas of the configured prefill/decode workers, you can set
+> NUM_PREFILL_WORKERS and NUM_DECODE_WORKERS respectively (default: 1).

 ## Understanding the Output


--- a/fern/pages/backends/trtllm/prometheus.md
+++ b/fern/pages/backends/trtllm/prometheus.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "TensorRT-LLM Prometheus Metrics"
 ---

+# TensorRT-LLM Prometheus Metrics
+
 ## Overview

 When running TensorRT-LLM through Dynamo, TensorRT-LLM's Prometheus metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with `trtllm_`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
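
Review note: because engine and runtime metrics share one endpoint, a consumer typically separates them by name prefix. The sketch below is a minimal illustration in Python that groups metric names from Prometheus text-exposition output by the two prefixes named in the overview. The function name `group_metric_names` and the metric names in the sample are hypothetical and do not come from Dynamo or TensorRT-LLM.

```python
from collections import defaultdict
from typing import Dict, List

# Prefixes taken from the overview paragraph; everything else lands in "other".
PREFIXES = ("trtllm_", "dynamo_")


def group_metric_names(exposition_text: str) -> Dict[str, List[str]]:
    """Group unique metric names by prefix (hypothetical helper)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or whitespace (value).
        name = line.split("{", 1)[0].split()[0]
        bucket = next((p for p in PREFIXES if name.startswith(p)), "other")
        if name not in groups[bucket]:
            groups[bucket].append(name)
    return dict(groups)


# Illustrative input only; these metric names are made up for the example.
SAMPLE = """\
# HELP trtllm_example_total illustrative counter
trtllm_example_total{model="m"} 3
dynamo_example_seconds 0.5
other_metric 1
"""

grouped = group_metric_names(SAMPLE)
```

In practice the input text would come from the worker's `/metrics` endpoint (port 8081 by default, per the overview); the host to query depends on the deployment.
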

--- a/fern/pages/backends/vllm/LMCache-Integration.md
+++ b/fern/pages/backends/vllm/LMCache-Integration.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "LMCache Integration in Dynamo"
 ---

+# LMCache Integration in Dynamo
+
 ## Introduction

 LMCache is a high-performance KV cache layer that supercharges LLM serving by enabling **prefill-once, reuse-everywhere** semantics. As described in the [official documentation](https://docs.lmcache.ai/index.html), LMCache lets LLMs prefill each text only once by storing the KV caches of all reusable texts, allowing reuse of KV caches for any reused text (not necessarily prefix) across any serving engine instance.

--- a/fern/pages/backends/vllm/README.md
+++ b/fern/pages/backends/vllm/README.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "LLM Deployment using vLLM"
 ---

+# LLM Deployment using vLLM
+
 This directory contains reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.

 ## Use the Latest Release
@@ -55,14 +56,19 @@ git checkout $(git describe --tags $(git rev-list --tags --max-count=1))

 Below we provide a guide that lets you run all of our the common deployment patterns on a single node.

-### Start NATS and ETCD in the background
+### Start Infrastructure Services (Local Development Only)

-Start using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml)
+For local/bare-metal development, start etcd and optionally NATS using [Docker Compose](https://github.com/ai-dynamo/dynamo/tree/main/deploy/docker-compose.yml):

 ```bash
 docker compose -f deploy/docker-compose.yml up -d
 ```

+> [!NOTE]
+> - **etcd** is optional but is the default local discovery backend. You can also use `--kv_store file` to use file system based discovery.
+> - **NATS** is optional - only needed if using KV routing with events (default). You can disable it with `--no-kv-events` flag for prediction-based routing
+> - **On Kubernetes**, neither is required when using the Dynamo operator, which explicitly sets `DYN_DISCOVERY_BACKEND=kubernetes` to enable native K8s service discovery (DynamoWorkerMetadata CRD)
+
 ### Pull or build container

 We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
@@ -81,9 +87,8 @@ This includes the specific commit [vllm-project/vllm#19790](https://github.com/v

 ## Run Single Node Examples

-<Warning>
-Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
-</Warning>
+> [!WARNING]
+> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.

 ### Aggregated Serving

@@ -127,9 +132,8 @@ cd examples/backends/vllm
 bash launch/dep.sh
 ```

-<Tip>
-Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
-</Tip>
+> [!TIP]
+> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.

 ## Advanced Examples


--- a/fern/pages/backends/vllm/deepseek-r1.md
+++ b/fern/pages/backends/vllm/deepseek-r1.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Running Deepseek R1 with Wide EP"
 ---

-Dynamo supports running Deepseek R1 with data parallel attention and wide expert parallelism. Each data parallel attention rank is a seperate dynamo component that will emit its own KV Events and Metrics. vLLM controls the expert parallelism using the flag `--enable-expert-parallel`
+# Running Deepseek R1 with Wide EP
+
+Dynamo supports running Deepseek R1 with data parallel attention and wide expert parallelism. Each data parallel attention rank is a separate dynamo component that will emit its own KV Events and Metrics. vLLM controls the expert parallelism using the flag `--enable-expert-parallel`

 ## Instructions


--- a/fern/pages/backends/vllm/gpt-oss.md
+++ b/fern/pages/backends/vllm/gpt-oss.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Running gpt-oss-120b Disaggregated with vLLM"
 ---

+# Running gpt-oss-120b Disaggregated with vLLM
+
 Dynamo supports disaggregated serving of gpt-oss-120b with vLLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single H100 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.

 ## Overview

--- a/fern/pages/backends/vllm/multi-node.md
+++ b/fern/pages/backends/vllm/multi-node.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Multi-node Examples"
 ---

+# Multi-node Examples
+
 This guide covers deploying vLLM across multiple nodes using Dynamo's distributed capabilities.

 ## Prerequisites

--- a/fern/pages/backends/vllm/prometheus.md
+++ b/fern/pages/backends/vllm/prometheus.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "vLLM Prometheus Metrics"
 ---

+# vLLM Prometheus Metrics
+
 ## Overview

 When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both vLLM engine metrics (prefixed with `vllm:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.

--- a/fern/pages/backends/vllm/prompt-embeddings.md
+++ b/fern/pages/backends/vllm/prompt-embeddings.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Prompt Embeddings"
 ---

+# Prompt Embeddings
+
 Dynamo supports prompt embeddings (also known as prompt embeds) as a secure alternative input method to traditional text prompts. By allowing applications to use pre-computed embeddings for inference, this feature not only offers greater flexibility in prompt engineering but also significantly enhances privacy and data security. With prompt embeddings, sensitive user data can be transformed into embeddings before ever reaching the inference server, reducing the risk of exposing confidential information during the AI workflow.



--- a/fern/pages/backends/vllm/speculative-decoding.md
+++ b/fern/pages/backends/vllm/speculative-decoding.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Running **Meta-Llama-3.1-8B-Instruct** with Speculative Decoding (Eagle3)"
 ---

+# Running **Meta-Llama-3.1-8B-Instruct** with Speculative Decoding (Eagle3)
+
 This guide walks through how to deploy **Meta-Llama-3.1-8B-Instruct** using **aggregated speculative decoding** with **Eagle3** on a single node.
 Since the model is only **8B parameters**, you can run it on **any GPU with at least 16GB VRAM**.


--- a/fern/pages/benchmarks/benchmarking.md
+++ b/fern/pages/benchmarks/benchmarking.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Dynamo Benchmarking Guide"
 ---

+# Dynamo Benchmarking Guide
+
 This benchmarking framework lets you compare performance across any combination of:
 - **DynamoGraphDeployments**
 - **External HTTP endpoints** (existing services deployed following standard documentation from vLLM, llm-d, AIBrix, etc.)

--- a/fern/pages/benchmarks/kv-router-ab-testing.md
+++ b/fern/pages/benchmarks/kv-router-ab-testing.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Dynamo KV Smart Router A/B Benchmarking Guide"
 ---

+# Dynamo KV Smart Router A/B Benchmarking Guide
+
 This guide walks you through setting up and running A/B benchmarks to compare Dynamo's KV Smart Router against standard round-robin routing on a Kubernetes cluster.

 ## Overview

--- a/fern/pages/benchmarks/sla-driven-profiling.md
+++ b/fern/pages/benchmarks/sla-driven-profiling.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "SLA-Driven Profiling with DynamoGraphDeploymentRequest"
 ---

-<Tip>
-**New to DGDR and SLA-Driven Profiling?** Start with the [SLA-Driven Profiling and Planner Deployment Quick Start Guide](../planner/sla-planner-quickstart.md) for step-by-step instructions. This document provides deeper technical details about the profiling process.
-</Tip>
+# SLA-Driven Profiling with DynamoGraphDeploymentRequest
+
+> [!TIP]
+> **New to DGDR and SLA-Driven Profiling?** Start with the [SLA-Driven Profiling and Planner Deployment Quick Start Guide](../planner/sla-planner-quickstart.md) for step-by-step instructions. This document provides deeper technical details about the profiling process.

 ## Overview

@@ -39,9 +39,8 @@ Specifically, the profiler sweeps over the following parallelization mapping for
 | GQA+MoE (Qwen3MoeForCausalLM) | TP, TEP, DEP | TP, TEP, DEP |
 | Other Models | TP | TP |

-<Note>
- Exact model x parallelization mapping support is dependent on the backend. The profiler does not guarantee that the recommended P/D engine configuration is supported and bug-free by the backend.
-</Note>
+> [!NOTE]
+> - Exact model x parallelization mapping support is dependent on the backend. The profiler does not guarantee that the recommended P/D engine configuration is supported and bug-free by the backend.

 ## Using DGDR for Profiling (Recommended)

@@ -344,9 +343,8 @@ profilingConfig:
 - **num_gpus_per_node**: Determine the upper bound of number of GPUs per node for dense models and configure Grove for multi-node MoE engines.
 - **gpu_type**: Informational, auto-detected by controller

-<Tip>
-If you don't specify hardware constraints, the controller auto-detects based on your model size and available cluster resources.
-</Tip>
+> [!TIP]
+> If you don't specify hardware constraints, the controller auto-detects based on your model size and available cluster resources.

 ### Sweep Configuration (Optional)

@@ -395,9 +393,8 @@ profilingConfig:
      planner_load_predictor: linear             # Load prediction method
 ```

-<Note>
-Planner arguments use `planner_` prefix. See planner documentation for full list.
-</Note>
+> [!NOTE]
+> Planner arguments use `planner_` prefix. See planner documentation for full list.

 ### Engine Configuration (Auto-configured)


--- a/fern/pages/design-docs/architecture.md
+++ b/fern/pages/design-docs/architecture.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "High Level Architecture"
 ---

+# High Level Architecture
+
 Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting TRT-LLM, vLLM, SGLang and others, while capturing essential LLM capabilities:

 - **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps you balance throughput and latency

--- a/fern/pages/design-docs/disagg-serving.md
+++ b/fern/pages/design-docs/disagg-serving.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance"
 ---

+# Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance
+
 The prefill and decode phases of LLM requests have different computation characteristics and memory footprints. Disaggregating these phases into specialized llm engines allows for better hardware allocation, improved scalability, and overall enhanced performance. For example, using a larger TP for the memory-bound decoding phase while a smaller TP for the computation-bound prefill phase allows both phases to be computed efficiently. In addition, for requests with long context, separating their prefill phase into dedicated prefill engines allows the ongoing decoding requests to be efficiently processed without being blocked by these long prefills.

 Disaggregated execution of a request has three main steps: