docs: remove duplicate H1 headings from Fern pages (#6410)

Signed-off-by: Dan Gil <dagil@nvidia.com>

Commit 03360b84 (unverified), authored Feb 25, 2026 by dagil-nvidia, committed by GitHub Feb 25, 2026
20 changed files
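
Every hunk below makes the same mechanical change: the page already has a `title:` in its frontmatter, so the H1 that opens the body is dropped. As a rough illustration only, that change could be scripted as in the sketch below. `strip_duplicate_h1` and the `FRONTMATTER` pattern are hypothetical names, not part of this commit or of Fern's tooling. The sketch assumes frontmatter is delimited by `---` lines at the very top of the file.

```python
import re

# Hypothetical helper for illustration; not taken from the commit itself.
# Assumes frontmatter is a block delimited by '---' lines at the top of the file.
FRONTMATTER = re.compile(r"\A---\n(.*?)\n---\n", re.DOTALL)


def strip_duplicate_h1(text: str) -> str:
    """Drop the body's leading H1 when the frontmatter already has a title."""
    match = FRONTMATTER.match(text)
    if not match or not re.search(r"^title:", match.group(1), re.MULTILINE):
        return text  # no frontmatter title: leave the page untouched

    head, body = text[:match.end()], text[match.end():]
    lines = body.split("\n")

    # Skip leading blank lines to find the first real line of the body.
    i = 0
    while i < len(lines) and not lines[i].strip():
        i += 1

    # Match a level-1 heading only ('# Title'), never '## Section'.
    if i < len(lines) and re.match(r"#\s+\S", lines[i]):
        del lines[i]
        # Also drop one blank line that followed the heading, if present.
        if i < len(lines) and not lines[i].strip():
            del lines[i]

    return head + "\n".join(lines)
```

Pages without a frontmatter title, or whose body does not open with an H1, are returned unchanged, so a script like this could be run over a whole docs tree and the result reviewed as a diff.
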
--- a/docs/pages/backends/trtllm/kv-cache-transfer.md
+++ b/docs/pages/backends/trtllm/kv-cache-transfer.md
@@ -4,8 +4,6 @@
 title: KV Cache Transfer
 ---
-# KV Cache Transfer in Disaggregated Serving
 In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:
 ## Using NIXL for KV Cache Transfer

--- a/docs/pages/backends/trtllm/llama4-plus-eagle.md
+++ b/docs/pages/backends/trtllm/llama4-plus-eagle.md
@@ -4,8 +4,6 @@
 title: Llama4 + Eagle
 ---
-# Llama 4 Maverick Instruct with Eagle Speculative Decoding on SLURM
 This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](./multinode/multinode-examples.md) to set up the environment for the following scenarios:
 - **Aggregated Serving:**

--- a/docs/pages/backends/trtllm/multinode/multinode-examples.md
+++ b/docs/pages/backends/trtllm/multinode/multinode-examples.md
@@ -4,8 +4,6 @@
 title: Multinode Examples
 ---
-# Example: Multi-node TRTLLM Workers with Dynamo on Slurm
 > **Note:** The scripts referenced in this example (such as `srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
 To run a single Dynamo+TRTLLM Worker that spans multiple nodes (ex: TP16),

--- a/docs/pages/backends/trtllm/prometheus.md
+++ b/docs/pages/backends/trtllm/prometheus.md
@@ -4,8 +4,6 @@
 title: Prometheus
 ---
-# TensorRT-LLM Prometheus Metrics
 ## Overview
 When running TensorRT-LLM through Dynamo, TensorRT-LLM's Prometheus metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with `trtllm_`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.

--- a/docs/pages/backends/vllm/README.md
+++ b/docs/pages/backends/vllm/README.md
@@ -4,8 +4,6 @@
 title: vLLM
 ---
-# LLM Deployment using vLLM
 This directory contains reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
 ## Use the Latest Release

--- a/docs/pages/backends/vllm/deepseek-r1.md
+++ b/docs/pages/backends/vllm/deepseek-r1.md
@@ -4,8 +4,6 @@
 title: DeepSeek-R1
 ---
-# Running Deepseek R1 with Wide EP
 Dynamo supports running Deepseek R1 with data parallel attention and wide expert parallelism. Each data parallel attention rank is a separate dynamo component that will emit its own KV Events and Metrics. vLLM controls the expert parallelism using the flag `--enable-expert-parallel`
 ## Instructions

--- a/docs/pages/backends/vllm/gpt-oss.md
+++ b/docs/pages/backends/vllm/gpt-oss.md
@@ -4,8 +4,6 @@
 title: GPT-OSS
 ---
-# Running gpt-oss-120b Disaggregated with vLLM
 Dynamo supports disaggregated serving of gpt-oss-120b with vLLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single H100 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
 ## Overview

--- a/docs/pages/backends/vllm/multi-node.md
+++ b/docs/pages/backends/vllm/multi-node.md
@@ -4,8 +4,6 @@
 title: Multi-Node
 ---
-# Multi-node Examples
 This guide covers deploying vLLM across multiple nodes using Dynamo's distributed capabilities.
 ## Prerequisites

--- a/docs/pages/backends/vllm/prometheus.md
+++ b/docs/pages/backends/vllm/prometheus.md
@@ -4,8 +4,6 @@
 title: Prometheus
 ---
-# vLLM Prometheus Metrics
 ## Overview
 When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both vLLM engine metrics (prefixed with `vllm:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.

--- a/docs/pages/backends/vllm/prompt-embeddings.md
+++ b/docs/pages/backends/vllm/prompt-embeddings.md
@@ -4,8 +4,6 @@
 title: Prompt Embeddings
 ---
-# Prompt Embeddings
 Dynamo supports prompt embeddings (also known as prompt embeds) as a secure alternative input method to traditional text prompts. By allowing applications to use pre-computed embeddings for inference, this feature not only offers greater flexibility in prompt engineering but also significantly enhances privacy and data security. With prompt embeddings, sensitive user data can be transformed into embeddings before ever reaching the inference server, reducing the risk of exposing confidential information during the AI workflow.

--- a/docs/pages/backends/vllm/vllm-omni.md
+++ b/docs/pages/backends/vllm/vllm-omni.md
@@ -4,8 +4,6 @@
 title: vLLM-Omni
 ---
-# [Experimental] Omni Models with vLLM
 Dynamo supports multimodal generation through the [vLLM-Omni](https://github.com/vllm-project/vllm-omni) backend. This integration exposes text-to-text, text-to-image, and text-to-video capabilities via OpenAI-compatible API endpoints.
 ## Prerequisites

--- a/docs/pages/benchmarks/benchmarking.md
+++ b/docs/pages/benchmarks/benchmarking.md
@@ -5,9 +5,6 @@ title: Dynamo Benchmarking
 subtitle: Benchmark and compare performance across Dynamo deployment configurations
 ---
-# Dynamo Benchmarking Guide
 This benchmarking framework lets you compare performance across any combination of:
 - **DynamoGraphDeployments**
 - **External HTTP endpoints** (existing services deployed following standard documentation from vLLM, llm-d, AIBrix, etc.)

--- a/docs/pages/components/frontend/README.md
+++ b/docs/pages/components/frontend/README.md
@@ -4,8 +4,6 @@
 title: Frontend
 ---
-# Frontend
 The Dynamo Frontend is the API gateway for serving LLM inference requests. It provides OpenAI-compatible HTTP endpoints and KServe gRPC endpoints, handling request preprocessing, routing, and response formatting.
 ## Feature Matrix

--- a/docs/pages/components/frontend/frontend-guide.md
+++ b/docs/pages/components/frontend/frontend-guide.md
@@ -4,8 +4,6 @@
 title: Frontend Guide
 ---
-# Frontend Guide
 This guide covers the KServe gRPC frontend configuration and integration for the Dynamo Frontend.
 ## KServe gRPC Frontend

--- a/docs/pages/components/frontend/nvext.md
+++ b/docs/pages/components/frontend/nvext.md
@@ -4,8 +4,6 @@
 title: NVIDIA Request Extensions (nvext)
 ---
-# NVIDIA Request Extensions (`nvext`)
 `nvext` is a top-level JSON object on the request body that provides NVIDIA-specific extensions to the OpenAI-compatible API. `nvext` fields are consumed by the Dynamo frontend, preprocessor, router, and backend workers to control routing, preprocessing, response metadata, scheduling, and engine-level priority.
 ## Usage

--- a/docs/pages/components/kvbm/README.md
+++ b/docs/pages/components/kvbm/README.md
@@ -4,8 +4,6 @@
 title: KVBM
 ---
-# KV Block Manager (KVBM)
 The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer and write-through cache for frameworks like vLLM and TensorRT-LLM.
 KVBM offers:

--- a/docs/pages/components/kvbm/kvbm-guide.md
+++ b/docs/pages/components/kvbm/kvbm-guide.md
@@ -5,7 +5,6 @@ title: KVBM Guide
 subtitle: Enable KV offloading using KV Block Manager (KVBM) for Dynamo deployments
 ---
-# KVBM Guide
 The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer and write-through cache for frameworks like vLLM and TensorRT-LLM.
 KVBM is modular and can be used standalone via `pip install kvbm` or as the memory management component in the full Dynamo stack. This guide covers installation, configuration, and deployment of the Dynamo KV Block Manager (KVBM) and other KV cache management systems.

--- a/docs/pages/components/planner/README.md
+++ b/docs/pages/components/planner/README.md
@@ -4,8 +4,6 @@
 title: Planner
 ---
-# Planner
 The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
 The SLA Planner supports two scaling modes:

--- a/docs/pages/components/planner/planner-examples.md
+++ b/docs/pages/components/planner/planner-examples.md
@@ -4,8 +4,6 @@
 title: Planner Examples
 ---
-# Planner Examples: Throughput-Based Scaling
 Practical examples for deploying the SLA Planner with throughput-based scaling. All examples below use the DGDR workflow with pre-deployment profiling. For deployment concepts, see the [Planner Guide](planner-guide.md). For a quick overview, see the [Planner README](README.md).
 ## Basic Examples

--- a/docs/pages/components/planner/planner-guide.md
+++ b/docs/pages/components/planner/planner-guide.md
@@ -4,8 +4,6 @@
 title: Planner Guide
 ---
-# Planner Guide
 The Dynamo SLA Planner is an autoscaling controller that adjusts prefill and decode engine replica counts at runtime to meet latency SLAs. It reads traffic signals (Prometheus metrics or load predictor output) and engine performance profiles to decide when to scale up or down.
 For a quick overview, see the [Planner README](README.md). For architecture internals, see [Planner Design](../../design-docs/planner-design.md).