Unverified commit 03360b84, authored by dagil-nvidia, committed by GitHub

docs: remove duplicate H1 headings from Fern pages (#6410)


Signed-off-by: Dan Gil <dagil@nvidia.com>
parent 01ecc8c7
@@ -4,8 +4,6 @@
 title: Runtime Guide
 ---
-# Dynamo Runtime
 <h4>A Datacenter Scale Distributed Inference Serving Framework</h4>
 [![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
......
@@ -5,8 +5,6 @@ title: Fault Tolerance
 subtitle: Handle failures gracefully with request migration, cancellation, and graceful shutdown
 ---
-# Fault Tolerance
 Dynamo provides comprehensive fault tolerance mechanisms to ensure reliable LLM inference in production deployments. This section covers the various strategies and features that enable Dynamo to handle failures gracefully and maintain service availability.
 ## Overview
......
@@ -4,8 +4,6 @@
 title: Graceful Shutdown
 ---
-# Graceful Shutdown
 This document describes how Dynamo components handle shutdown signals to ensure in-flight requests complete successfully and resources are properly cleaned up.
 ## Overview
......
@@ -4,8 +4,6 @@
 title: Request Rejection
 ---
-# Request Rejection (Load Shedding)
 This document describes how Dynamo implements request rejection to prevent system overload and maintain service stability under high load conditions.
 ## Overview
......
@@ -4,8 +4,6 @@
 title: Testing
 ---
-# Fault Tolerance Testing
 This document describes the test infrastructure for validating Dynamo's fault tolerance mechanisms. The testing framework supports request cancellation, migration, etcd HA, and hardware fault injection scenarios.
 ## Overview
......
@@ -5,8 +5,6 @@ title: Disaggregated Serving
 subtitle: Find optimal prefill/decode configuration for disaggregated serving deployments
 ---
-# Disaggregated Serving Guide
 [AIConfigurator](https://github.com/ai-dynamo/aiconfigurator/tree/main) is a performance optimization tool that helps you find the optimal configuration for deploying LLMs with Dynamo. It automatically determines the best number of prefill and decode workers, parallelism settings, and deployment parameters to meet your SLA targets while maximizing throughput.
 ## Why Use AIConfigurator?
......
@@ -5,8 +5,6 @@ title: LoRA Adapters
 subtitle: Serve fine-tuned LoRA adapters with dynamic loading and routing in Dynamo
 ---
-# LoRA Adapters
 LoRA (Low-Rank Adaptation) enables efficient fine-tuning and serving of specialized model variants without duplicating full model weights. Dynamo provides built-in support for dynamic LoRA adapter loading, caching, and inference routing.
 ## Backend Support
......
@@ -5,8 +5,6 @@ title: Multimodality Support
 subtitle: Deploy multimodal models with image, video, and audio support in Dynamo
 ---
-# Multimodal Inference in Dynamo
 Dynamo supports multimodal inference across multiple LLM backends, enabling models to process images, video, and audio alongside text. This section provides comprehensive documentation for deploying multimodal models.
 > [!IMPORTANT]
......
@@ -4,8 +4,6 @@
 title: SGLang Multimodal
 ---
-# SGLang Multimodal
 This document provides a comprehensive guide for multimodal inference using SGLang backend in Dynamo. SGLang multimodal supports **EPD**, **E/PD**, and **E/P/D** flows, with NIXL (RDMA) for zero-copy tensor transfer in disaggregated modes.
 ## Support Matrix
......
@@ -4,8 +4,6 @@
 title: TensorRT-LLM Multimodal
 ---
-# TensorRT-LLM Multimodal
 This document provides a comprehensive guide for multimodal inference using TensorRT-LLM backend in Dynamo.
 You can provide multimodal inputs in the following ways:
......
@@ -4,8 +4,6 @@
 title: vLLM Multimodal
 ---
-# vLLM Multimodal
 This document provides a comprehensive guide for multimodal inference using vLLM backend in Dynamo.
 <Warning>
......
@@ -4,8 +4,6 @@
 title: Speculative Decoding
 ---
-# Speculative Decoding
 Speculative decoding is an optimization technique that uses a smaller "draft" model to predict multiple tokens, which are then verified by the main model in parallel. This can significantly reduce latency for autoregressive generation.
 ## Backend Support
......
@@ -4,8 +4,6 @@
 title: Speculative Decoding with vLLM
 ---
-# Speculative Decoding with vLLM
 Using Speculative Decoding with the vLLM backend.
 > **See also**: [Speculative Decoding Overview](./README.md) for cross-backend documentation.
......
@@ -4,8 +4,6 @@
 title: FlexKV
 ---
-# FlexKV Integration in Dynamo
 ## Introduction
 [FlexKV](https://github.com/taco-project/FlexKV) is a scalable, distributed runtime for KV cache offloading developed by Tencent Cloud's TACO team in collaboration with the community. It acts as a unified KV caching layer for inference engines like SGLang, TensorRT-LLM, and vLLM.
......
@@ -4,8 +4,6 @@
 title: KV Events for Custom Engines
 ---
-# KV Event Publishing for Custom Engines
 This document explains how to implement KV event publishing for custom inference engines, enabling them to participate in Dynamo's KV cache-aware routing.
 ## Overview
......
@@ -4,8 +4,6 @@
 title: LMCache
 ---
-# LMCache Integration in Dynamo
 ## Introduction
 LMCache is a high-performance KV cache layer that supercharges LLM serving by enabling **prefill-once, reuse-everywhere** semantics. As described in the [official documentation](https://docs.lmcache.ai/index.html), LMCache lets LLMs prefill each text only once by storing the KV caches of all reusable texts, allowing reuse of KV caches for any reused text (not necessarily prefix) across any serving engine instance.
......
@@ -4,8 +4,6 @@
 title: SGLang HiCache
 ---
-# Enable SGLang Hierarchical Cache (HiCache)
 This guide shows how to enable SGLang's Hierarchical Cache (HiCache) inside Dynamo.
 ## 1) Start the SGLang worker with HiCache enabled
......
@@ -4,8 +4,6 @@
 title: Deployment Guide
 ---
-# Deploying Dynamo on Kubernetes
 High-level guide to Dynamo Kubernetes deployments. Start here, then dive into specific guides.
 ## Important Terminology
......
@@ -4,8 +4,6 @@
 title: Integration with Dynamo
 ---
-# Checkpoint/Restore for Fast Pod Startup
 > ⚠️ **Experimental Feature**: ChReK is currently in **beta/preview**. The ChReK DaemonSet runs in privileged mode to perform CRIU operations. See [Limitations](#limitations) for details.
 Checkpointing captures the complete state of a running worker pod (including GPU memory) and saves it to storage. New pods can restore from this checkpoint instead of performing a full cold start.
......
@@ -4,8 +4,6 @@
 title: Minikube Setup
 ---
-# Minikube Setup Guide
 Don't have a Kubernetes cluster? No problem! You can set up a local development environment using Minikube. This guide walks through the set up of everything you need to run Dynamo Kubernetes Platform locally.
 ## 1. Install Minikube
......