Unverified Commit ece08dc9 authored by Neal Vaidya's avatar Neal Vaidya Committed by GitHub
Browse files

docs: restructure docs directory and move fern config to fern/ (#6700)


Signed-off-by: default avatarNeal Vaidya <nealv@nvidia.com>
Co-authored-by: default avatarClaude Opus 4.6 <noreply@anthropic.com>
parent 1412e44b
......@@ -40,7 +40,7 @@ Offloading KV cache to CPU or storage is most effective when KV Cache exceeds GP
## Architecture
![KVBM Architecture](../../../assets/img/kvbm-architecture.png)
![KVBM Architecture](../../assets/img/kvbm-architecture.png)
*High-level layered architecture view of Dynamo KV Block Manager and how it interfaces with different components of the LLM inference ecosystem*
KVBM has three primary logical layers:
......
......@@ -364,7 +364,7 @@ trtllm-serve Qwen/Qwen3-0.6B --host localhost --port 8000 --backend pytorch --ex
**Solution:** Enable KVBM metrics and check the Grafana dashboard for `Onboard Blocks - Host to Device` and `Onboard Blocks - Disk to Device`. Large numbers of onboarded KV blocks indicate good cache reuse:
![Grafana Example](../../../assets/img/kvbm-metrics-grafana.png)
![Grafana Example](../../assets/img/kvbm-metrics-grafana.png)
### KVBM Worker Initialization Timeout
......
......@@ -68,7 +68,7 @@ args:
- --loadbased-adjustment-interval=5
```
The planner will auto-discover the frontend metrics endpoint from the DGD. See [disagg_planner_load.yaml](../../../../tests/planner/scaling/disagg_planner_load.yaml) for a complete example.
The planner will auto-discover the frontend metrics endpoint from the DGD. See [disagg_planner_load.yaml](https://github.com/ai-dynamo/dynamo/blob/main/tests/planner/scaling/disagg_planner_load.yaml) for a complete example.
### Manual DGD Deployment
......
......@@ -271,12 +271,12 @@ The profiler follows a 5-step process:
- **Prefill**:
- TP/TEP: Measure TTFT with batch size = 1 (assuming ISL is long enough to saturate compute) without KV reuse.
- DEP: Attention uses data parallelism. Send a single burst with total concurrency `attention_dp_size × attn_dp_num_req_ratio` (defaults to 4) and compute the reported TTFT as `time_to_first_token.max / attn_dp_num_req_ratio` from the AIPerf summary of that burst.
![Prefill Performance](../../../assets/img/h100-prefill-performance.png)
![Prefill Performance](../../assets/img/h100-prefill-performance.png)
- **Decode**: Measure the ITL under different numbers of in-flight requests, from 1 to the maximum the KV cache can hold. To measure ITL without being affected by piggy-backed prefill requests, the script enables KV-reuse and warms up the engine by issuing the same prompts before measuring.
![Decode Performance](../../../assets/img/h100-decode-performance.png)
![Decode Performance](../../assets/img/h100-decode-performance.png)
4. **Recommendation**: Select optimal parallelization mapping for prefill and decode that achieves the highest per-GPU throughput while adhering to the SLA on TTFT and ITL.
5. **In-Depth Profiling on the Recommended P/D Engine**: Interpolate TTFT with ISL and ITL with active KV cache and decode context length for more accurate performance estimation.
![ITL Interpolation](../../../assets/img/pd-interpolation.png)
![ITL Interpolation](../../assets/img/pd-interpolation.png)
- **Prefill**: Measures TTFT and throughput per GPU across different input lengths with batch size=1.
- **Decode**: Measures ITL and throughput per GPU under various KV cache loads and decode context lengths.
......
......@@ -176,7 +176,7 @@ The main KV-aware routing arguments (frontend uses the same `--router-*` flag na
- `--router-prune-target-ratio`: Target size ratio to prune down to when `--router-max-tree-size` is exceeded. For example, with a value of 0.8 (default) and max tree size of 1048576, the router will prune down to approximately 838860 blocks when the threshold is exceeded. Defaults to 0.8 when `--no-router-kv-events` is used. This creates headroom before the next pruning cycle.
- `--router-event-threads`: Number of event processing threads for the KV indexer (default: 4). When set to 1, the router uses a single-threaded radix tree with channel-based event processing. When set to a value greater than 1 (the default), the router uses a concurrent radix tree with a thread pool of the specified size for higher event throughput. This setting only applies when KV events are enabled (the default). When `--no-router-kv-events` is set (approximate mode), the router always uses a single-threaded indexer with TTL-based expiration and pruning regardless of this setting. Can be set via `DYN_ROUTER_EVENT_THREADS` env var. For details on the underlying index data structures (`RadixTree`, `ConcurrentRadixTree`, `PositionalIndexer`) and their concurrency model (inline reads, sticky-routed writes via thread pool), see the [KV Router Index documentation](../../../../lib/kv-router/README.md).
- `--router-event-threads`: Number of event processing threads for the KV indexer (default: 4). When set to 1, the router uses a single-threaded radix tree with channel-based event processing. When set to a value greater than 1 (the default), the router uses a concurrent radix tree with a thread pool of the specified size for higher event throughput. This setting only applies when KV events are enabled (the default). When `--no-router-kv-events` is set (approximate mode), the router always uses a single-threaded indexer with TTL-based expiration and pruning regardless of this setting. Can be set via `DYN_ROUTER_EVENT_THREADS` env var. For details on the underlying index data structures (`RadixTree`, `ConcurrentRadixTree`, `PositionalIndexer`) and their concurrency model (inline reads, sticky-routed writes via thread pool), see the [KV Router Index documentation](../../../lib/kv-router/README.md).
<Note>
......@@ -319,7 +319,7 @@ await register_model(
await prefill_endpoint.serve_endpoint(prefill_handler.generate)
```
<Note>The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. The standalone router (`python -m dynamo.router`) uses `--router-*`-prefixed flags (e.g., `--router-block-size`, `--router-kv-events`). See the [Standalone Router README](../../../../components/src/dynamo/router/README.md) and example script: [`examples/backends/sglang/launch/disagg_router.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/sglang/launch/disagg_router.sh).</Note>
<Note>The unified frontend with automatic prefill routing is currently enabled for vLLM and TensorRT-LLM backends. For SGLang (work in progress), you need to launch a separate standalone router as the prefill router targeting the prefill endpoints. The standalone router (`python -m dynamo.router`) uses `--router-*`-prefixed flags (e.g., `--router-block-size`, `--router-kv-events`). See the [Standalone Router README](../../../components/src/dynamo/router/README.md) and example script: [`examples/backends/sglang/launch/disagg_router.sh`](https://github.com/ai-dynamo/dynamo/blob/main/examples/backends/sglang/launch/disagg_router.sh).</Note>
### Request Flow
......@@ -445,7 +445,7 @@ curl http://localhost:8000/busy_threshold
- **[Router README](README.md)**: Quick start guide for the KV Router
- **[Router Examples](router-examples.md)**: Python API usage, K8s examples, and custom routing patterns
- **[KV Router Index Data Structures](../../../../lib/kv-router/README.md)**: `RadixTree`, `ConcurrentRadixTree`, and `PositionalIndexer` internals and concurrency model
- **[KV Router Index Data Structures](../../../lib/kv-router/README.md)**: `RadixTree`, `ConcurrentRadixTree`, and `PositionalIndexer` internals and concurrency model
- **[Router Design](../../design-docs/router-design.md)**: Architecture details and event transport modes
- **[KV Event Publishing for Custom Engines](../../integrations/kv-events-custom-engines.md)**: Integrate custom inference engines with KV-aware routing
- **[Prometheus and Grafana Setup](../../observability/prometheus-grafana.md)**: General Prometheus/Grafana configuration
......
......@@ -9,7 +9,7 @@ subtitle: Run the KV cache indexer as an independent service for querying block
The standalone KV indexer runs the KV cache radix tree as an independent service, separate from the router. It subscribes to KV events from workers, maintains a radix tree of cached blocks, and exposes a query endpoint (`kv_indexer_query`) that external clients can use to inspect or query KV cache state.
This is distinct from the [Standalone Router](../../../../components/src/dynamo/router/README.md), which is a full routing service. The standalone indexer provides only the indexing and query layer without routing logic.
This is distinct from the [Standalone Router](../../../components/src/dynamo/router/README.md), which is a full routing service. The standalone indexer provides only the indexing and query layer without routing logic.
## Use Cases
......@@ -123,4 +123,4 @@ The standalone indexer internally:
- **[Router Guide](router-guide.md)**: Full KV router configuration and tuning
- **[Router Design](../../design-docs/router-design.md)**: Architecture and event transport modes
- **[Standalone Router](../../../../components/src/dynamo/router/README.md)**: Full routing service (routes requests to workers)
- **[Standalone Router](../../../components/src/dynamo/router/README.md)**: Full routing service (routes requests to workers)
......@@ -46,7 +46,7 @@ The following diagram outlines Dynamo's high-level architecture. To enable large
Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.
![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](../../assets/img/architecture.png "Dynamo Architecture")
![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](../assets/img/architecture.png "Dynamo Architecture")
Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if Dynamo detects an increase in requests with long input sequences, the Planner automatically scales up prefill workers to meet the heightened demand.
......@@ -60,7 +60,7 @@ Dynamo prioritizes seamless integration. Its modular design enables it to work h
Disaggregating prefill and decode boosts performance, gaining efficiency when more GPUs are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization.
![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](../../assets/img/disagg-perf-benefit.png)
![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](../assets/img/disagg-perf-benefit.png)
* Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL
......@@ -69,7 +69,7 @@ The disaggregation of prefill and decode phases offers valuable flexibility. Sin
### KV aware routing
![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](../../assets/img/kv-routing.png)
![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](../assets/img/kv-routing.png)
* Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL
......@@ -79,7 +79,7 @@ Existing routing methods, including load-based routing, overlook the specific pr
### KV cache manager
The Dynamo KV Block Manager (KVBM) enables KV cache offloading to system CPU memory, local SSDs, and network-attached storage, allowing more KV blocks to be reused instead of recomputed. In many cases, KV transfer is faster than recomputation, so KVBM helps reduce time-to-first-token (TTFT). The following plot highlights the performance gains achieved through CPU memory offloading. In a scenario involving 20 multi-turn conversations with 15 users, KVBM with CPU memory offloading achieved a 2.2×–12× improvement in TTFT (depending on QPS), demonstrating benefits that extend beyond basic prefix caching.
![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](../../assets/img/kvbm-agg-performance.png)
![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](../assets/img/kvbm-agg-performance.png)
* Tested with different QPS using Qwen3-8B on H100. Avg 20K ISL / 100 OSL.
......
......@@ -7,7 +7,7 @@ title: Discovery Plane
Dynamo's service discovery layer lets components find each other at runtime. Workers register their endpoints when they start, and frontends discover them automatically.
The discovery backend adapts to the deployment environment.
![Discovery plane architecture showing Kubernetes and etcd backends](../../assets/img/discovery-plane.svg)
![Discovery plane architecture showing Kubernetes and etcd backends](../assets/img/discovery-plane.svg)
## Discovery Backends
......@@ -85,7 +85,7 @@ Frontends and routers discover available workers by watching the relevant prefix
Each runtime maintains a lease with etcd (default TTL: 10 seconds). If a worker crashes or loses connectivity:
![Lease lifecycle showing DistributedRuntime keep-alive heartbeat to etcd](../../assets/img/discovery-plane-lease.svg)
![Lease lifecycle showing DistributedRuntime keep-alive heartbeat to etcd](../assets/img/discovery-plane-lease.svg)
1. Keep-alive heartbeats stop.
2. The lease expires after the TTL.
......
......@@ -14,7 +14,7 @@ Key use cases:
- **Worker load metrics** -- Workers report utilization so the router can balance load.
- **Sequence tracking** -- Coordinates active sequences across router replicas for fault-tolerant routing.
![Event plane architecture showing NATS and ZMQ transport options connecting Frontend, Planner, and Worker](../../assets/img/event-plane-transport.svg)
![Event plane architecture showing NATS and ZMQ transport options connecting Frontend, Planner, and Worker](../assets/img/event-plane-transport.svg)
## Choosing a Transport
......
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment