Commit 03360b84 authored by dagil-nvidia, committed by GitHub

docs: remove duplicate H1 headings from Fern pages (#6410)


Signed-off-by: Dan Gil <dagil@nvidia.com>
parent 01ecc8c7
@@ -4,8 +4,6 @@
 title: KV Cache Transfer
 ---
-# KV Cache Transfer in Disaggregated Serving
 In disaggregated serving architectures, KV cache must be transferred between prefill and decode workers. TensorRT-LLM supports two methods for this transfer:
 ## Using NIXL for KV Cache Transfer
......
@@ -4,8 +4,6 @@
 title: Llama4 + Eagle
 ---
-# Llama 4 Maverick Instruct with Eagle Speculative Decoding on SLURM
 This guide demonstrates how to deploy Llama 4 Maverick Instruct with Eagle Speculative Decoding on GB200x4 nodes. We will be following the [multi-node deployment instructions](./multinode/multinode-examples.md) to set up the environment for the following scenarios:
 - **Aggregated Serving:**
......
@@ -4,8 +4,6 @@
 title: Multinode Examples
 ---
-# Example: Multi-node TRTLLM Workers with Dynamo on Slurm
 > **Note:** The scripts referenced in this example (such as `srun_aggregated.sh` and `srun_disaggregated.sh`) can be found in [`examples/basics/multinode/trtllm/`](https://github.com/ai-dynamo/dynamo/tree/main/examples/basics/multinode/trtllm/).
 To run a single Dynamo+TRTLLM Worker that spans multiple nodes (ex: TP16),
......
@@ -4,8 +4,6 @@
 title: Prometheus
 ---
-# TensorRT-LLM Prometheus Metrics
 ## Overview
 When running TensorRT-LLM through Dynamo, TensorRT-LLM's Prometheus metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both TensorRT-LLM engine metrics (prefixed with `trtllm_`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
......
@@ -4,8 +4,6 @@
 title: vLLM
 ---
-# LLM Deployment using vLLM
 This directory contains reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
 ## Use the Latest Release
......
@@ -4,8 +4,6 @@
 title: DeepSeek-R1
 ---
-# Running Deepseek R1 with Wide EP
 Dynamo supports running Deepseek R1 with data parallel attention and wide expert parallelism. Each data parallel attention rank is a separate dynamo component that will emit its own KV Events and Metrics. vLLM controls the expert parallelism using the flag `--enable-expert-parallel`
 ## Instructions
......
@@ -4,8 +4,6 @@
 title: GPT-OSS
 ---
-# Running gpt-oss-120b Disaggregated with vLLM
 Dynamo supports disaggregated serving of gpt-oss-120b with vLLM. This guide demonstrates how to deploy gpt-oss-120b using disaggregated prefill/decode serving on a single H100 node with 8 GPUs, running 1 prefill worker on 4 GPUs and 1 decode worker on 4 GPUs.
 ## Overview
......
@@ -4,8 +4,6 @@
 title: Multi-Node
 ---
-# Multi-node Examples
 This guide covers deploying vLLM across multiple nodes using Dynamo's distributed capabilities.
 ## Prerequisites
......
@@ -4,8 +4,6 @@
 title: Prometheus
 ---
-# vLLM Prometheus Metrics
 ## Overview
 When running vLLM through Dynamo, vLLM engine metrics are automatically passed through and exposed on Dynamo's `/metrics` endpoint (default port 8081). This allows you to access both vLLM engine metrics (prefixed with `vllm:`) and Dynamo runtime metrics (prefixed with `dynamo_*`) from a single worker backend endpoint.
......
@@ -4,8 +4,6 @@
 title: Prompt Embeddings
 ---
-# Prompt Embeddings
 Dynamo supports prompt embeddings (also known as prompt embeds) as a secure alternative input method to traditional text prompts. By allowing applications to use pre-computed embeddings for inference, this feature not only offers greater flexibility in prompt engineering but also significantly enhances privacy and data security. With prompt embeddings, sensitive user data can be transformed into embeddings before ever reaching the inference server, reducing the risk of exposing confidential information during the AI workflow.
......
@@ -4,8 +4,6 @@
 title: vLLM-Omni
 ---
-# [Experimental] Omni Models with vLLM
 Dynamo supports multimodal generation through the [vLLM-Omni](https://github.com/vllm-project/vllm-omni) backend. This integration exposes text-to-text, text-to-image, and text-to-video capabilities via OpenAI-compatible API endpoints.
 ## Prerequisites
......
@@ -5,9 +5,6 @@ title: Dynamo Benchmarking
 subtitle: Benchmark and compare performance across Dynamo deployment configurations
 ---
-# Dynamo Benchmarking Guide
 This benchmarking framework lets you compare performance across any combination of:
 - **DynamoGraphDeployments**
 - **External HTTP endpoints** (existing services deployed following standard documentation from vLLM, llm-d, AIBrix, etc.)
......
@@ -4,8 +4,6 @@
 title: Frontend
 ---
-# Frontend
 The Dynamo Frontend is the API gateway for serving LLM inference requests. It provides OpenAI-compatible HTTP endpoints and KServe gRPC endpoints, handling request preprocessing, routing, and response formatting.
 ## Feature Matrix
......
@@ -4,8 +4,6 @@
 title: Frontend Guide
 ---
-# Frontend Guide
 This guide covers the KServe gRPC frontend configuration and integration for the Dynamo Frontend.
 ## KServe gRPC Frontend
......
@@ -4,8 +4,6 @@
 title: NVIDIA Request Extensions (nvext)
 ---
-# NVIDIA Request Extensions (`nvext`)
 `nvext` is a top-level JSON object on the request body that provides NVIDIA-specific extensions to the OpenAI-compatible API. `nvext` fields are consumed by the Dynamo frontend, preprocessor, router, and backend workers to control routing, preprocessing, response metadata, scheduling, and engine-level priority.
 ## Usage
......
@@ -4,8 +4,6 @@
 title: KVBM
 ---
-# KV Block Manager (KVBM)
 The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer and write-through cache for frameworks like vLLM and TensorRT-LLM.
 KVBM offers:
......
@@ -5,7 +5,6 @@ title: KVBM Guide
 subtitle: Enable KV offloading using KV Block Manager (KVBM) for Dynamo deployments
 ---
-# KVBM Guide
 The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer and write-through cache for frameworks like vLLM and TensorRT-LLM.
 KVBM is modular and can be used standalone via `pip install kvbm` or as the memory management component in the full Dynamo stack. This guide covers installation, configuration, and deployment of the Dynamo KV Block Manager (KVBM) and other KV cache management systems.
......
@@ -4,8 +4,6 @@
 title: Planner
 ---
-# Planner
 The Planner monitors system performance and automatically scales prefill/decode workers to meet latency SLAs. It runs as a component inside the Dynamo inference graph on Kubernetes.
 The SLA Planner supports two scaling modes:
......
@@ -4,8 +4,6 @@
 title: Planner Examples
 ---
-# Planner Examples: Throughput-Based Scaling
 Practical examples for deploying the SLA Planner with throughput-based scaling. All examples below use the DGDR workflow with pre-deployment profiling. For deployment concepts, see the [Planner Guide](planner-guide.md). For a quick overview, see the [Planner README](README.md).
 ## Basic Examples
......
@@ -4,8 +4,6 @@
 title: Planner Guide
 ---
-# Planner Guide
 The Dynamo SLA Planner is an autoscaling controller that adjusts prefill and decode engine replica counts at runtime to meet latency SLAs. It reads traffic signals (Prometheus metrics or load predictor output) and engine performance profiles to decide when to scale up or down.
 For a quick overview, see the [Planner README](README.md). For architecture internals, see [Planner Design](../../design-docs/planner-design.md).
......
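
Every hunk above makes the same edit: the first H1 after the frontmatter's closing `---` is dropped, and the `title:` field stays. A minimal sketch of that transformation, assuming the pages are Markdown files that open with `---`-delimited frontmatter. The function name `strip_leading_h1` and the handling of the blank line after the heading are illustrative assumptions, not the tooling used for this commit:

```python
def strip_leading_h1(text: str) -> str:
    """Remove the first H1 that directly follows the frontmatter block.

    Hypothetical helper: assumes frontmatter opens on line 1 with '---'
    and closes with a later line that is exactly '---'.
    """
    lines = text.split("\n")
    # No frontmatter: leave the page untouched.
    if not lines or lines[0].strip() != "---":
        return text
    try:
        end = lines.index("---", 1)  # closing frontmatter delimiter
    except ValueError:
        return text
    # Skip blank lines between the frontmatter and the first content line.
    i = end + 1
    while i < len(lines) and lines[i].strip() == "":
        i += 1
    # Act only when the first content line is an H1 ('# '), not '## '.
    if i < len(lines) and lines[i].startswith("# "):
        del lines[i]
        # Assumption: also drop one blank line left behind by the heading.
        if i < len(lines) and lines[i].strip() == "":
            del lines[i]
    return "\n".join(lines)


page = "---\ntitle: Planner\n---\n\n# Planner\n\nThe Planner monitors.\n## Modes\n"
cleaned = strip_leading_h1(page)
```

Pages whose first content line is already an `##` heading, or that have no frontmatter, come back unchanged, so the function is safe to run repeatedly over the same tree.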