Unverified commit 03360b84, authored by dagil-nvidia, committed by GitHub

docs: remove duplicate H1 headings from Fern pages (#6410)
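
A minimal sketch of the kind of transformation this commit applies, assuming each page starts with a `---` delimited frontmatter block and that the duplicate heading is the first H1 after it. The function name `strip_leading_h1` and the regular expression are hypothetical and are not taken from the repository; the commit itself may have been made by hand or with different tooling.

```python
import re

# Hypothetical helper: matches a leading frontmatter block delimited by '---' lines.
FRONTMATTER = re.compile(r"\A---\n.*?\n---\n", re.DOTALL)


def strip_leading_h1(text: str) -> str:
    """Remove the first H1 directly after the frontmatter, plus one blank line after it."""
    match = FRONTMATTER.match(text)
    if not match:
        return text  # no frontmatter: leave the page untouched
    head, body = text[:match.end()], text[match.end():]
    lines = body.split("\n")

    # Skip blank lines between the frontmatter and the first content line.
    i = 0
    while i < len(lines) and lines[i].strip() == "":
        i += 1

    # Only an H1 ("# ...") is removed; "## ..." and deeper headings are kept.
    if i < len(lines) and lines[i].startswith("# "):
        del lines[i]
        # Drop one blank line left behind by the removed heading, if present.
        if i < len(lines) and lines[i].strip() == "":
            del lines[i]
    return head + "\n".join(lines)


page = "---\ntitle: Profiler\n---\n\n# Profiler\n\nThe Dynamo Profiler...\n\n## Feature Matrix\n"
cleaned = strip_leading_h1(page)
```

The sketch removes the H1 regardless of whether it matches the `title` field, since several pages in this diff (for example "Overall Architecture" with the heading "High Level Architecture") use a heading that differs from the title.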


Signed-off-by: Dan Gil <dagil@nvidia.com>
parent 01ecc8c7
......@@ -4,8 +4,6 @@
title: Profiler
---
# Profiler
The Dynamo Profiler is an automated performance analysis tool that measures model inference characteristics to optimize deployment configurations. It determines optimal tensor parallelism (TP) settings for prefill and decode phases, generates performance interpolation data, and enables SLA-driven autoscaling through the Planner.
## Feature Matrix
......
......@@ -4,8 +4,6 @@
title: Profiler Examples
---
# Profiler Examples
Complete examples for profiling with DGDRs, the interactive WebUI, and direct script usage.
## DGDR Examples
......
......@@ -4,8 +4,6 @@
title: Profiler Guide
---
# Profiler Guide
## Overview
The Dynamo Profiler analyzes model inference performance and generates optimized deployment configurations (DynamoGraphDeployments). Given a model, hardware, and SLA targets, it determines the best parallelization strategy, selects optimal prefill and decode engine configurations, and produces a ready-to-deploy DGD YAML.
......
......@@ -4,8 +4,6 @@
title: Router
---
# Router
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
## Quick Start
......
......@@ -5,8 +5,6 @@ title: Agent Hints
subtitle: Per-request hints for scheduling, load balancing, and KV cache optimization
---
# Agent Hints
Agent hints are optional per-request hints passed via the `nvext.agent_hints` field in the request body. They allow the calling agent or application to communicate request-level metadata that the router uses to improve scheduling, load balancing, and KV cache utilization.
```json
......
......@@ -4,8 +4,6 @@
title: Router Examples
---
# Router Examples
For quick start instructions, see the [Router README](README.md). This document provides further examples for using the Dynamo Router, including Python API usage, Kubernetes deployments, and custom routing patterns.
## Table of Contents
......
......@@ -5,8 +5,6 @@ title: Router Guide
subtitle: Enable KV-aware routing using Router for Dynamo deployments
---
# Router Guide
## Overview
The Dynamo KV Router intelligently routes requests by evaluating their computational costs across different workers. It considers both decoding costs (from active blocks) and prefill costs (from newly computed blocks), using KV cache overlap to minimize redundant computation. Optimizing the KV Router is critical for achieving maximum throughput and minimum latency in distributed inference setups.
......
......@@ -5,8 +5,6 @@ title: Standalone KV Indexer
subtitle: Run the KV cache indexer as an independent service for querying block state
---
# Standalone KV Indexer
## Overview
The standalone KV indexer runs the KV cache radix tree as an independent service, separate from the router. It subscribes to KV events from workers, maintains a radix tree of cached blocks, and exposes a query endpoint (`kv_indexer_query`) that external clients can use to inspect or query KV cache state.
......
......@@ -4,8 +4,6 @@
title: Overall Architecture
---
# High Level Architecture
Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting SGLang, TRT-LLM, vLLM and others, while capturing essential LLM capabilities:
- **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps you balance throughput and latency
......
......@@ -4,8 +4,6 @@
title: Disaggregated Serving
---
# Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance
The prefill and decode phases of LLM requests have different computation characteristics and memory footprints. Disaggregating these phases into specialized llm engines allows for better hardware allocation, improved scalability, and overall enhanced performance. For example, using a larger TP for the memory-bound decoding phase while a smaller TP for the computation-bound prefill phase allows both phases to be computed efficiently. In addition, for requests with long context, separating their prefill phase into dedicated prefill engines allows the ongoing decoding requests to be efficiently processed without being blocked by these long prefills.
Disaggregated execution of a request has three main steps:
......
......@@ -4,8 +4,6 @@
title: Discovery Plane
---
# Discovery Plane
Dynamo's service discovery layer lets components find each other at runtime. Workers register their endpoints when they start, and frontends discover them automatically.
The discovery backend adapts to the deployment environment.
......
......@@ -4,8 +4,6 @@
title: Distributed Runtime
---
# Dynamo Distributed Runtime
## Overview
Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (i.e., Python bindings can be found in `/lib/bindings/python`). The runtime supports multiple discovery backends (Kubernetes-native or etcd) and request planes (TCP, HTTP, or NATS). `DistributedRuntime` follows a hierarchical structure:
......
......@@ -4,8 +4,6 @@
title: Architecture Flow
---
# Dynamo Architecture Flow
This diagram shows the NVIDIA Dynamo disaggregated inference system. Color-coded flows indicate different types of operations.
## 🔵 Main Request Flow (Blue)
......
......@@ -4,8 +4,6 @@
title: Event Plane
---
# Dynamo Event Plane
The event plane provides Dynamo with a pub/sub layer for near real-time event exchange between components. It delivers KV cache updates, worker load metrics, and sequence tracking events, enabling features like KV-aware routing and disaggregated serving.
## When Is the Event Plane Used?
......
......@@ -4,8 +4,6 @@
title: KVBM Design
---
# KVBM Design
This document provides an in-depth look at the architecture, components, framework integrations via the connector API, and the detailed workings of the Dynamo KV Block Manager (KVBM). The design of KVBM takes inspiration from the KV block managers used in SGLang and vLLM, with added influence from historical memory tiering strategies common in general GPU programming. For more details, see [Further Reading](#further-reading).
## KVBM Components
......
......@@ -4,8 +4,6 @@
title: Planner Design
---
# Planner Design
> **Tier 3 design documentation** for contributors and architects. For user-facing docs, see [docs/components/planner/](../components/planner/README.md).
## Overview
......
......@@ -4,8 +4,6 @@
title: Request Plane
---
# Dynamo Request Planes User Guide
## Overview
Dynamo supports multiple transport mechanisms for its request plane (the communication layer between services). You can choose from three different request plane modes based on your deployment requirements:
......
......@@ -4,8 +4,6 @@
title: Router Design
---
# Router Design
This document describes the internal architecture of the Dynamo KV Router, including block tracking mechanisms, the KV cache optimization system, event handling, and transport modes.
## KV Router Architecture
......
......@@ -5,8 +5,6 @@ title: Writing Python Workers in Dynamo
subtitle: Create custom Python workers and engines for Dynamo
---
# Writing Python Workers in Dynamo
This guide explains how to create your own Python worker in Dynamo.
The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo.
......
......@@ -4,8 +4,6 @@
title: Jail Stream
---
# JailedStream Implementation
## Overview
The `JailedStream` is a standalone implementation for handling "jail" detection in token streams. It provides a clean, builder-based API for accumulating tokens when certain sequences are detected, then releasing them as a single chunk when the jail ends.
......