docs: update Fern docs for main branch (#5706)

Signed-off-by: Jont828 <jt572@cornell.edu>

Unverified Commit 7ca6a562 authored Jan 30, 2026 by Jonathan Tong Committed by GitHub Jan 30, 2026
20 changed files
--- a/fern/pages/kvbm/kvbm-intro.md
+++ b/fern/pages/kvbm/kvbm-intro.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "KV Block Manager"
 ---

+# KV Block Manager
+
 The Dynamo KV Block Manager (KVBM) is a scalable runtime component
 designed to handle memory allocation, management, and remote sharing of
 Key-Value (KV) blocks for inference tasks across heterogeneous and

--- a/fern/pages/kvbm/kvbm-motivation.md
+++ b/fern/pages/kvbm/kvbm-motivation.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Motivation behind KVBM"
 ---

+# Motivation behind KVBM
+
 Large language models (LLMs) and other AI workloads increasingly rely on KV caches that extend beyond GPU and local CPU memory into remote storage tiers. However, efficiently managing the lifecycle of KV blocks in remote storage presents challenges:

 * Tailored for GenAI use-cases

--- a/fern/pages/kvbm/kvbm-reading.md
+++ b/fern/pages/kvbm/kvbm-reading.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "KVBM Further Reading"
 ---

+# KVBM Further Reading
+
 - [vLLM](https://docs.vllm.ai/en/latest/features/automatic_prefix_caching.html)
 - [SGLang](https://github.com/sgl-project/sglang/tree/main/benchmark/hicache)
 - [EMOGI](https://arxiv.org/abs/2006.06890)
\ No newline at end of file
--- a/fern/pages/kvbm/trtllm-setup.md
+++ b/fern/pages/kvbm/trtllm-setup.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Running KVBM in TensorRT-LLM"
 ---

+# Running KVBM in TensorRT-LLM
+
 This guide explains how to leverage KVBM (KV Block Manager) to manage KV cache and do KV offloading in TensorRT-LLM (trtllm).

 To learn what KVBM is, please check [here](kvbm-architecture.md)

-<Note>
- Ensure that `etcd` and `nats` are running before starting.
- KVBM only supports TensorRT-LLM’s PyTorch backend.
- Disable partial reuse `enable_partial_reuse: false` in the LLM API config’s `kv_connector_config` to increase offloading cache hits.
- KVBM requires TensorRT-LLM v1.1.0rc5 or newer.
- Enabling KVBM metrics with TensorRT-LLM is still a work in progress.
-</Note>
+> [!NOTE]
+> - Ensure that `etcd` and `nats` are running before starting.
+> - KVBM only supports TensorRT-LLM's PyTorch backend.
+> - Disable partial reuse `enable_partial_reuse: false` in the LLM API config's `kv_connector_config` to increase offloading cache hits.
+> - KVBM requires TensorRT-LLM v1.2.0rc2 or newer.
+> - Enabling KVBM metrics with TensorRT-LLM is still a work in progress.

 ## Quick Start

@@ -50,10 +50,9 @@ export DYN_KVBM_DISK_CACHE_GB=8
 # DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS to specify exact block counts instead of GB
 ```

-<Note>
-When disk offloading is enabled, to extend SSD lifespan, disk offload filtering would be enabled by default. The current policy is only offloading KV blocks from CPU to disk if the blocks have frequency equal or more than `2`. Frequency is determined via doubling on cache hit (init with 1) and decrement by 1 on each time decay step.
-To disable disk offload filtering, set `DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER` to true or 1.
-</Note>
+> [!NOTE]
+> When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy offloads a KV block from CPU to disk only if its frequency is at least `2`. A block's frequency starts at 1, doubles on each cache hit, and decreases by 1 on each time decay step.
+> To disable disk offload filtering, set `DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER` to `true` or `1`.
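
The filtering policy described in the note above can be sketched in a few lines of Python. This is a minimal illustration, not KVBM's implementation: the `FrequencyFilter` class and its method names are hypothetical, and flooring the count at 0 during decay is an assumption the note does not confirm.

```python
from dataclasses import dataclass, field

# Threshold stated in the note: offload only blocks with frequency >= 2.
OFFLOAD_MIN_FREQUENCY = 2


@dataclass
class FrequencyFilter:
    """Hypothetical frequency tracker for KV blocks (not the KVBM API)."""

    freq: dict = field(default_factory=dict)

    def on_insert(self, block_id):
        # A newly tracked block starts with frequency 1.
        self.freq.setdefault(block_id, 1)

    def on_cache_hit(self, block_id):
        # Frequency doubles on every cache hit.
        self.freq[block_id] = self.freq.get(block_id, 1) * 2

    def on_decay_step(self):
        # Frequency drops by 1 per time decay step.
        # Assumption: the count never goes below 0.
        for block_id in self.freq:
            self.freq[block_id] = max(0, self.freq[block_id] - 1)

    def should_offload_to_disk(self, block_id):
        # Only blocks that are reused often enough are written to disk.
        return self.freq.get(block_id, 0) >= OFFLOAD_MIN_FREQUENCY
```

Under these assumptions, a block that is inserted and never hit again stays below the threshold and is never written to disk, which is how the filter limits SSD writes.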

 ```bash
 # write an example LLM API config
@@ -94,6 +93,16 @@ curl localhost:8000/v1/chat/completions   -H "Content-Type: application/json"

 ```

+KVBM is also supported on the prefill worker of disaggregated serving. To launch the prefill worker, run:
+```bash
+# [DYNAMO] To serve an LLM model with dynamo
+python3 -m dynamo.trtllm \
+  --model-path Qwen/Qwen3-0.6B \
+  --served-model-name Qwen/Qwen3-0.6B \
+  --extra-engine-args /tmp/kvbm_llm_api_config.yaml \
+  --disaggregation-mode prefill &
+```
+
 Alternatively, you can use `trtllm-serve` with KVBM by replacing the two [DYNAMO] commands above with the following:
 ```bash
 trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --extra_llm_api_options /tmp/kvbm_llm_api_config.yaml

--- a/fern/pages/kvbm/vllm-setup.md
+++ b/fern/pages/kvbm/vllm-setup.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Running KVBM in vLLM"
 ---

+# Running KVBM in vLLM
+
 This guide explains how to leverage KVBM (KV Block Manager) to manage KV cache and do KV offloading in vLLM.

 To learn what KVBM is, please check [here](kvbm-architecture.md)
@@ -43,29 +44,27 @@ cd $DYNAMO_HOME/examples/backends/vllm
 ./launch/disagg_kvbm_2p2d.sh
 ```

-<Note>
-Configure or tune KVBM cache tiers (choose one of the following options):
-```bash
-# Option 1: CPU cache only (GPU -> CPU offloading)
-# 4 means 4GB of pinned CPU memory would be used
-export DYN_KVBM_CPU_CACHE_GB=4
-# Option 2: Both CPU and Disk cache (GPU -> CPU -> Disk tiered offloading)
-export DYN_KVBM_CPU_CACHE_GB=4
-# 8 means 8GB of disk would be used
-export DYN_KVBM_DISK_CACHE_GB=8
-# [Experimental] Option 3: Disk cache only (GPU -> Disk direct offloading, bypassing CPU)
-# NOTE: this option is only experimental and it might not give out the best performance.
-# NOTE: disk offload filtering is not supported when using this option.
-export DYN_KVBM_DISK_CACHE_GB=8
-```
-You can also use "DYN_KVBM_CPU_CACHE_OVERRIDE_NUM_BLOCKS" or
-"DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS" to specify exact block counts instead of GB
-</Note>
-
-<Note>
-When disk offloading is enabled, to extend SSD lifespan, disk offload filtering would be enabled by default. The current policy is only offloading KV blocks from CPU to disk if the blocks have frequency equal or more than `2`. Frequency is determined via doubling on cache hit (init with 1) and decrement by 1 on each time decay step.
-To disable disk offload filtering, set `DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER` to true or 1.
-</Note>
+> [!NOTE]
+> Configure or tune KVBM cache tiers (choose one of the following options):
+> ```bash
+> # Option 1: CPU cache only (GPU -> CPU offloading)
+> # 4 means 4GB of pinned CPU memory would be used
+> export DYN_KVBM_CPU_CACHE_GB=4
+> # Option 2: Both CPU and Disk cache (GPU -> CPU -> Disk tiered offloading)
+> export DYN_KVBM_CPU_CACHE_GB=4
+> # 8 means 8GB of disk would be used
+> export DYN_KVBM_DISK_CACHE_GB=8
+> # [Experimental] Option 3: Disk cache only (GPU -> Disk direct offloading, bypassing CPU)
+> # NOTE: this option is only experimental and it might not give out the best performance.
+> # NOTE: disk offload filtering is not supported when using this option.
+> export DYN_KVBM_DISK_CACHE_GB=8
+> ```
+> You can also use "DYN_KVBM_CPU_CACHE_OVERRIDE_NUM_BLOCKS" or
+> "DYN_KVBM_DISK_CACHE_OVERRIDE_NUM_BLOCKS" to specify exact block counts instead of GB
+
+> [!NOTE]
+> When disk offloading is enabled, disk offload filtering is enabled by default to extend SSD lifespan. The current policy offloads a KV block from CPU to disk only if its frequency is at least `2`. A block's frequency starts at 1, doubles on each cache hit, and decreases by 1 on each time decay step.
+> To disable disk offload filtering, set `DYN_KVBM_DISABLE_DISK_OFFLOAD_FILTER` to `true` or `1`.

 ### Sample Request
 ```bash

--- a/fern/pages/multimodal/index.md
+++ b/fern/pages/multimodal/index.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Multimodal Inference in Dynamo"
 ---

+# Multimodal Inference in Dynamo
+
 Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.

-<Warning>
-**Security Requirement**: Multimodal processing must be explicitly enabled at startup.
-See the relevant documentation for each backend for the necessary flags.
-This prevents unintended processing of multimodal data from untrusted sources.
-</Warning>
+> [!WARNING]
+> **Security Requirement**: Multimodal processing must be explicitly enabled at startup.
+> See the relevant documentation for each backend for the necessary flags.
+> This prevents unintended processing of multimodal data from untrusted sources.

 ## Backend Documentation


--- a/fern/pages/multimodal/sglang.md
+++ b/fern/pages/multimodal/sglang.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "SGLang Multimodal"
 ---

+# SGLang Multimodal
+
 This document provides a comprehensive guide for multimodal inference using the SGLang backend in Dynamo. SGLang multimodal uses specialized **E/PD or E/P/D** flows with **NIXL (RDMA)** for zero-copy tensor transfer.

 ## Support Matrix

--- a/fern/pages/multimodal/trtllm.md
+++ b/fern/pages/multimodal/trtllm.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "TensorRT-LLM Multimodal"
 ---

+# TensorRT-LLM Multimodal
+
 This document provides a comprehensive guide for multimodal inference using TensorRT-LLM backend in Dynamo.

 You can provide multimodal inputs in the following ways:

--- a/fern/pages/multimodal/vllm.md
+++ b/fern/pages/multimodal/vllm.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "vLLM Multimodal"
 ---

+# vLLM Multimodal
+
 This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo.

-<Warning>
-**Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
-This flag is analogous to `--enable-mm-embeds` in vllm serve but also extends it to all multimodal content (url, embeddings, b64).
-</Warning>
+> [!WARNING]
+> **Security Requirement**: All multimodal workers require the `--enable-multimodal` flag to be explicitly set at startup. This is a security feature to prevent unintended processing of multimodal data from untrusted sources. Workers will fail at startup if multimodal flags (e.g., `--multimodal-worker`, `--multimodal-processor`) are used without `--enable-multimodal`.
+> This flag is analogous to `--enable-mm-embeds` in vllm serve but also extends it to all multimodal content (url, embeddings, b64).

 ## Support Matrix

@@ -160,9 +160,8 @@ cd $DYNAMO_HOME/examples/backends/vllm
 bash launch/disagg_multimodal_epd.sh --model llava-hf/llava-1.5-7b-hf
 ```

-<Note>
-Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.
-</Note>
+> [!NOTE]
+> Disaggregation is currently only confirmed to work with LLaVA. Qwen2.5-VL is not confirmed to be supported.

 ## ECConnector Serving


--- a/fern/pages/observability/README.md
+++ b/fern/pages/observability/README.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Dynamo Observability"
 ---

+# Dynamo Observability
+
 ## Getting Started Quickly

 This is an example to get started quickly on a single machine.

--- a/fern/pages/observability/health-checks.md
+++ b/fern/pages/observability/health-checks.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Dynamo Health Checks"
 ---

+# Dynamo Health Checks
+
 ## Overview

 Dynamo provides health check and liveness HTTP endpoints for each component which

--- a/fern/pages/observability/logging.md
+++ b/fern/pages/observability/logging.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Dynamo Logging"
 ---

+# Dynamo Logging
+
 ## Overview

 Dynamo provides structured logging in both text as well as JSONL. When

--- a/fern/pages/observability/metrics-developer-guide.md
+++ b/fern/pages/observability/metrics-developer-guide.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Metrics Developer Guide"
 ---

+# Metrics Developer Guide
+
 This guide explains how to create and use custom metrics in Dynamo components using the Dynamo metrics API.

 ## Metrics Exposure

--- a/fern/pages/observability/metrics.md
+++ b/fern/pages/observability/metrics.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Dynamo Metrics"
 ---

+# Dynamo Metrics
+
 ## Overview

 Dynamo provides built-in metrics capabilities through the Dynamo metrics API, which is automatically available whenever you use the `DistributedRuntime` framework. This document serves as a reference for all available metrics in Dynamo.

--- a/fern/pages/observability/prometheus-grafana.md
+++ b/fern/pages/observability/prometheus-grafana.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Metrics Visualization with Prometheus and Grafana"
 ---

+# Metrics Visualization with Prometheus and Grafana
+
 ## Overview

 This guide shows how to set up Prometheus and Grafana for visualizing Dynamo metrics on a single machine for demo purposes.

--- a/fern/pages/observability/tracing.md
+++ b/fern/pages/observability/tracing.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Distributed Tracing with Tempo"
 ---

+# Distributed Tracing with Tempo
+
 ## Overview

 Dynamo supports OpenTelemetry-based distributed tracing for visualizing request flows across Frontend and Worker components. Traces are exported to Tempo via OTLP (OpenTelemetry Protocol) and visualized in Grafana.

--- a/fern/pages/performance/aiconfigurator.md
+++ b/fern/pages/performance/aiconfigurator.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Finding Best Initial Configs using AIConfigurator"
 ---

+# Finding Best Initial Configs using AIConfigurator
+
 [AIConfigurator](https://github.com/ai-dynamo/aiconfigurator/tree/main) is a performance optimization tool that helps you find the optimal configuration for deploying LLMs with Dynamo. It automatically determines the best number of prefill and decode workers, parallelism settings, and deployment parameters to meet your SLA targets while maximizing throughput.

 ## Why Use AIConfigurator?

--- a/fern/pages/performance/tuning.md
+++ b/fern/pages/performance/tuning.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Disaggregation and Performance Tuning"
 ---

+# Disaggregation and Performance Tuning
+
 Disaggregation gains performance by separating prefill and decode into different engines to reduce interference between the two.
 However, performant disaggregation requires careful tuning of the inference parameters.
 Specifically, there are three sets of parameters that need to be tuned:
@@ -32,10 +33,9 @@ Typically, the number of GPUs vs the performance follows the following pattern:
 | Maximum number limited by communication scalability | Worst overall throughput/GPU, best latency/user                                           |
 | More than maximum                                   | Communication overhead dominates, poor performance                                        |

-<Note>
-for decode-only engines, sometimes larger number of GPUs has to larger KV cache per GPU and more decoding requests running in parallel, which leads to both better throughput/GPU and better latency/user.
-For example, for Llama-3.3-70b NVFP4 quantization on B200 in vLLM with 0.9 free GPU memory fraction:
-</Note>
+> [!NOTE]
+> For decode-only engines, a larger number of GPUs sometimes leads to a larger KV cache per GPU and more decoding requests running in parallel, which improves both throughput/GPU and latency/user.
+> For example, for Llama-3.3-70b NVFP4 quantization on B200 in vLLM with 0.9 free GPU memory fraction:

 | TP Size | KV Cache Size (GB) | KV Cache per GPU (GB) | Per GPU Improvement over TP1 |
 | ------: | -----------------: | --------------------: | ---------------------------: |
@@ -46,9 +46,8 @@ For example, for Llama-3.3-70b NVFP4 quantization on B200 in vLLM with 0.9 free
 The best number of GPUs to use in the prefill and decode engines can be determined by running a few fixed ISL/OSL/concurrency tests using [AIPerf](https://github.com/ai-dynamo/aiperf/tree/main) and comparing the results with the SLA.
 AIPerf is pre-installed in the dynamo container.

-<Tip>
-If you are unfamiliar with AIPerf, please see this helpful [tutorial](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorial.md) to get you started.
-</Tip>
+> [!TIP]
+> If you are unfamiliar with AIPerf, please see this helpful [tutorial](https://github.com/ai-dynamo/aiperf/blob/main/docs/tutorial.md) to get you started.

 Besides the parallelization mapping, other common knobs to tune are maximum batch size, maximum number of tokens, and block size.
 For prefill engines, a small batch size and a large `max_num_token` are usually preferred.

--- a/fern/pages/planner/load-planner.md
+++ b/fern/pages/planner/load-planner.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Load-based Planner"
 ---

+# Load-based Planner
+
 This document covers the load-based planner in `examples/llm/components/planner.py`.

-<Warning>
-Load-based planner is inoperable as vllm, sglang, and trtllm examples all do not use prefill queues. Please use SLA planner for now.
-</Warning>
+> [!WARNING]
+> The load-based planner is currently inoperable because none of the vllm, sglang, and trtllm examples use prefill queues. Please use the SLA planner for now.

-<Warning>
-Bare metal deployment with local connector is deprecated. The only option to deploy load-based planner is via k8s. We will update the examples in this document soon.
-</Warning>
+> [!WARNING]
+> Bare metal deployment with local connector is deprecated. The only option to deploy load-based planner is via k8s. We will update the examples in this document soon.

 ## Load-based Scaling Up/Down Prefill/Decode Workers


--- a/fern/pages/planner/planner-intro.md
+++ b/fern/pages/planner/planner-intro.md
 ---
 # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 # SPDX-License-Identifier: Apache-2.0
-title: "Planner"
 ---

+# Planner
+
 The planner monitors the state of the system and adjusts workers to
 ensure that the system runs efficiently.

@@ -17,8 +18,7 @@ Key features include:
 - **Graceful scaling** that ensures no requests are dropped during
  scale-down operations

-<Tip>
-**New to SLA Planner?** Start with the [SLA Planner Quick Start Guide](sla-planner-quickstart.md) for a complete, step-by-step workflow.
-
-**Prerequisites**: SLA-based planner requires pre-deployment profiling (2-4 hours on real silicon or a few minutes using simulator) before deployment. The Quick Start guide includes everything you need.
-</Tip>
+> [!TIP]
+> **New to SLA Planner?** Start with the [SLA Planner Quick Start Guide](sla-planner-quickstart.md) for a complete, step-by-step workflow.
+>
+> **Prerequisites**: SLA-based planner requires pre-deployment profiling (2-4 hours on real silicon or a few minutes using simulator) before deployment. The Quick Start guide includes everything you need.