---
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
title: Overall Architecture
subtitle: Architecture and components of the Dynamo inference runtime
---

# Dynamo Architecture

Dynamo is a distributed inference runtime for generative AI systems that must operate at high throughput, low latency, and high reliability under changing traffic conditions. It is backend-agnostic (SGLang, TRT-LLM, vLLM, and others) and is built around three cooperating concerns:

- A fast **request path** for token generation
- A responsive **control path** for scaling and placement
- A resilient **state path** for KV reuse and failure recovery

This document presents Dynamo as an architecture, not a feature list: what each plane owns, how requests move, how the system adapts, and how it remains correct under failure.

## Design Goals

Dynamo is designed to satisfy the following goals simultaneously:

1. **Latency stability**: keep TTFT and ITL predictable under bursty and mixed-length traffic.
2. **GPU efficiency**: disaggregate prefill and decode so each can scale independently.
3. **Compute reuse**: minimize KV recomputation through KV-aware routing and cache lifecycle management.
4. **Operational resilience**: treat worker crashes, restarts, and overload as normal operating events.
5. **Deployment portability**: support Kubernetes-native control paths and non-Kubernetes runtime modes.

## Why This Architecture Exists

Modern LLM serving hits recurring bottlenecks:

- **Prefill/decode imbalance** leaves GPUs underutilized when traffic mix shifts ([DistServe](https://arxiv.org/abs/2401.09670)).
- **KV recomputation** increases TTFT and wastes compute when routing ignores cache overlap ([DeepSeek](https://arxiv.org/abs/2501.12948)).
- **Memory pressure** from long contexts and concurrency exceeds HBM capacity without multi-tier cache management ([KVBM](https://docs.nvidia.com/dynamo/components/kvbm), [Mooncake](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html), [AIBrix](https://blog.vllm.ai/2025/02/21/aibrix-release.html), [FlexKV](https://github.com/taco-project/FlexKV), [LMCache](https://lmcache.ai/)).
- **Dynamic demand** breaks static provisioning assumptions ([AzureTrace](https://github.com/Azure/AzurePublicDataset)).
- **Real-world failures** (pod restart, partition, hot-spot overload) require first-class recovery behavior.

Dynamo addresses these constraints by separating serving, control, and state propagation into explicit planes and control loops.

## Architecture Overview

![Dynamo architecture showing Request Plane (Client, Frontend, Router, Prefill/Decode workers), Control Plane (Planner, Dynamo Operator, Dynamo Graph, Grove, Model Express, Runtime Resources), and Storage & Events Plane (KVBM, NIXL, Local SSD/NFS/Remote Storage)](../assets/img/dynamo-architecture.svg "Dynamo Architecture")

## System Model

### Request Plane (critical data path)

The request plane is responsible for request/response execution:

- **Frontend** accepts and normalizes requests.
- **Router** selects workers based on load and KV overlap.
- **Prefill workers** compute prompt KV state.
- **Decode workers** generate output tokens.

This path is optimized for low overhead and continuous token streaming.

### Control Plane (adaptation and orchestration path)

The control plane is responsible for desired-state management:

- **Planner** computes scaling targets from live metrics.
- **Dynamo Operator** reconciles Kubernetes resources from Dynamo CRDs.
- **Discovery + Endpoints/CRD** establish liveness and discoverability.
- **Grove/KAI Scheduler path** provides topology-aware placement and grouped scaling in multinode Kubernetes deployments.
- **Model Express** is an optional model-management endpoint, used only when configured.

This path is optimized for correctness and convergence to target capacity.

### Storage & Events Plane (state propagation path)

The storage/events plane is responsible for cache state visibility and movement:

- **KV Events** publish cache lifecycle transitions.
- **KVBM** manages block reuse, eviction, and offload/recall across memory tiers.
- **NIXL** performs high-speed KV/data transfer across workers and memory domains.

This path is optimized for cache reuse and cross-worker handoff efficiency.
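
As a rough illustration of how published cache events could feed routing decisions, the sketch below keeps a per-worker index of cached block hashes and answers prefix-overlap queries. The event shape (`KvEvent` with `stored` and `removed` kinds) and every name here are assumptions made for illustration; they are not Dynamo's actual event schema or API.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class KvEvent:
    """Hypothetical cache lifecycle event (not Dynamo's wire format)."""
    worker_id: str
    kind: str        # assumed kinds: "stored" or "removed"
    block_hash: int


class KvIndex:
    """Tracks which KV blocks each worker is believed to hold."""

    def __init__(self) -> None:
        self._blocks: dict[str, set[int]] = defaultdict(set)

    def apply(self, event: KvEvent) -> None:
        """Update the index from one lifecycle event."""
        if event.kind == "stored":
            self._blocks[event.worker_id].add(event.block_hash)
        elif event.kind == "removed":
            self._blocks[event.worker_id].discard(event.block_hash)
        else:
            raise ValueError(f"unknown event kind: {event.kind}")

    def overlap(self, worker_id: str, request_blocks: list[int]) -> int:
        """Count the leading request blocks already cached on the worker."""
        cached = self._blocks.get(worker_id, set())
        matched = 0
        for block in request_blocks:
            if block not in cached:
                break  # prefix reuse stops at the first missing block
            matched += 1
        return matched
```

A router could call `overlap` for each candidate worker and prefer the one with the longest cached prefix, subject to load.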

## End-to-End Request Narrative (Disaggregated Mode)

1. Client sends request to **Frontend**.
2. Frontend validates/preprocesses and forwards to **Router**.
3. Router chooses a **Prefill worker**.
4. Prefill computes KV and returns transfer metadata.
5. Router chooses a **Decode worker**.
6. Decode receives KV state (typically via the **NIXL** transfer path).
7. Decode streams tokens back through Frontend.
8. **KV Events** update cache visibility for future routing decisions.
9. **KVBM** may offload or recall KV blocks based on pressure and reuse potential.
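
Steps 2 through 7 above can be sketched as a small in-process simulation. This is a toy model under stated assumptions: `Worker`, `pick_least_loaded`, `prefill`, `decode`, and `handle_request` are hypothetical names, worker selection uses load only, and no real KV transfer takes place.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    active_requests: int = 0


def pick_least_loaded(workers: list[Worker]) -> Worker:
    """Stand-in for the router's choice (steps 3 and 5); load only, no KV overlap."""
    return min(workers, key=lambda w: w.active_requests)


def prefill(worker: Worker, prompt_tokens: list[int]) -> dict:
    """Step 4: pretend to compute KV state and return transfer metadata."""
    return {"source": worker.name, "num_tokens": len(prompt_tokens)}


def decode(worker: Worker, kv_meta: dict, max_tokens: int):
    """Steps 6-7: pretend to receive KV state, then stream placeholder tokens."""
    for i in range(max_tokens):
        yield f"tok{i}"


def handle_request(prompt_tokens, prefill_pool, decode_pool, max_tokens=3):
    """Drive one request through the disaggregated path as a token stream."""
    if not prompt_tokens:                              # step 2: validate
        raise ValueError("empty prompt")
    prefill_worker = pick_least_loaded(prefill_pool)   # step 3
    kv_meta = prefill(prefill_worker, prompt_tokens)   # step 4
    decode_worker = pick_least_loaded(decode_pool)     # step 5
    yield from decode(decode_worker, kv_meta, max_tokens)  # steps 6-7
```

Because `handle_request` is a generator, tokens reach the caller as they are produced, which mirrors the streaming behavior of step 7.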

For flow-level detail, see [Architecture Flow](dynamo-flow.md).
For request transport options, see [Request Plane](request-plane.md).

## Control Loops

### Serving Loop

Maintains low-latency request execution across frontend, router, prefill, and decode workers.

### Planning Loop

Maintains capacity alignment with demand:

- Planner consumes runtime metrics.
- Planner computes prefill/decode targets.
- Connector layer applies targets to runtime resources.

Planner supports throughput-based and load-based strategies. See [Planner Design](planner-design.md).
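
To make the planning loop concrete, here is a minimal sketch of a load-based target computation. It assumes demand and per-replica capacity are expressed in the same unit (for example tokens per second) and that targets are clamped to configured bounds. The function names, parameters, and the formula itself are illustrative assumptions, not the Planner's actual algorithm.

```python
import math


def target_replicas(demand: float, per_replica_capacity: float,
                    min_replicas: int = 1, max_replicas: int = 64) -> int:
    """Replicas needed to serve `demand`, clamped to [min_replicas, max_replicas]."""
    if per_replica_capacity <= 0:
        raise ValueError("per_replica_capacity must be positive")
    needed = math.ceil(demand / per_replica_capacity)
    return max(min_replicas, min(max_replicas, needed))


def plan(prefill_demand: float, decode_demand: float,
         prefill_capacity: float, decode_capacity: float) -> dict[str, int]:
    """Compute independent prefill and decode targets from observed demand."""
    return {
        "prefill": target_replicas(prefill_demand, prefill_capacity),
        "decode": target_replicas(decode_demand, decode_capacity),
    }
```

Keeping the two targets independent is what lets prefill and decode pools scale separately as the traffic mix shifts.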

### Resilience Loop

Maintains system continuity under failure:

- Health checks detect unhealthy workers.
- Discovery liveness removes stale endpoints.
- Graceful shutdown drains in-flight work.
- Request migration/cancellation controls in-flight behavior.
- Load shedding prevents cascading collapse under overload.

See [Fault Tolerance](../fault-tolerance/README.md).
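
The liveness bullet above can be illustrated with a lease-based registry: an endpoint stays discoverable only while its heartbeats keep renewing a time-limited lease. The class name, the TTL parameter, and the explicit `now` argument (used here to keep the example deterministic) are assumptions for this sketch and do not reflect Dynamo's discovery API.

```python
class LeaseRegistry:
    """Toy discovery registry where endpoints expire unless renewed."""

    def __init__(self, ttl_s: float) -> None:
        self.ttl_s = ttl_s
        self._expires_at: dict[str, float] = {}

    def heartbeat(self, endpoint: str, now: float) -> None:
        """Register the endpoint or renew its lease."""
        self._expires_at[endpoint] = now + self.ttl_s

    def live_endpoints(self, now: float) -> list[str]:
        """Drop endpoints whose lease has expired and return the rest, sorted."""
        self._expires_at = {
            endpoint: expiry
            for endpoint, expiry in self._expires_at.items()
            if expiry > now
        }
        return sorted(self._expires_at)
```

A crashed worker simply stops sending heartbeats, so it disappears from `live_endpoints` once its lease runs out, with no explicit deregistration required.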

## Kubernetes-Native Realization (CRD + Grove)

In Kubernetes deployments, the same architecture maps to declarative resources:

- Dynamo Operator reconciles `DynamoGraphDeployment`.
- Discoverability is derived from `DynamoWorkerMetadata` + EndpointSlices.
- Grove-backed multinode deployments model worker groups as `PodCliqueSet` and `PodClique`.
- Independent prefill/decode elasticity is represented via `PodCliqueScalingGroup` with separate `replicas` and `min` targets.

The diagram labels such as `PodClique A/B`, `ScalingGroup "Prefill"`, `ScalingGroup "Decode"`, and `(replicas, min)` represent this grouped scaling model.

## Fault Tolerance Architecture

Fault tolerance is embedded across layers:

| Layer | Mechanism | Practical effect |
|------|-----------|------------------|
| Request | Migration, cancellation | In-flight work can continue or terminate intentionally |
| Worker | Health checks, graceful shutdown, endpoint draining | Failed/terminating workers stop taking new traffic safely |
| System | Request rejection/load shedding | Prevents overload from propagating across workers |
| Infrastructure | Discovery lease expiry, event-path recovery | Stale membership is removed and traffic reroutes |

This model assumes failures are routine, not exceptional.
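
The System row of the table (request rejection and load shedding) can be sketched as a bounded admission gate: once a fixed number of requests are in flight, new ones are rejected immediately rather than queued. `AdmissionGate` and its in-flight limit are hypothetical; the source does not specify how Dynamo decides when to shed load.

```python
import threading


class AdmissionGate:
    """Rejects new work once `max_in_flight` requests are being served."""

    def __init__(self, max_in_flight: int) -> None:
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self) -> bool:
        """Return True and count the request, or False if it should be shed."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        """Mark one admitted request as finished."""
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1
```

Rejecting early keeps latency bounded for admitted requests and gives callers a clear signal to retry elsewhere.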

## Performance Rationale

### Disaggregated Serving

Separating prefill and decode improves utilization and enables phase-specific scaling.

![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](../assets/img/disagg-perf-benefit.png)

*Tested on H100 with R1 Distilled Llama 70B FP8 on vLLM. 3K ISL / 150 OSL.*

### KV-Aware Routing

Routing on both cache-overlap and load signals reduces prefill recomputation and improves latency.
For an external production case study, see [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/#how-baseten-uses-nvidia-dynamo).

![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](../assets/img/kv-routing.png)

*Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 H100 nodes. Avg 4K ISL / 800 OSL.*
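
One way to picture the trade-off between cache overlap and load is a per-worker cost: the number of request blocks that would still need prefill, plus a weighted load term. The cost function, `WorkerState`, and `load_weight` below are assumptions chosen for illustration and are not the router's actual scoring formula.

```python
from dataclasses import dataclass, field


@dataclass
class WorkerState:
    cached_blocks: set[int] = field(default_factory=set)
    active_blocks: int = 0  # hypothetical proxy for current load


def prefix_overlap(cached: set[int], request_blocks: list[int]) -> int:
    """Number of leading request blocks already present in the cache."""
    matched = 0
    for block in request_blocks:
        if block not in cached:
            break
        matched += 1
    return matched


def choose_worker(workers: dict[str, WorkerState],
                  request_blocks: list[int],
                  load_weight: float = 1.0) -> str:
    """Pick the worker with the lowest (new prefill blocks + weighted load) cost."""
    def cost(name: str) -> float:
        state = workers[name]
        reused = prefix_overlap(state.cached_blocks, request_blocks)
        new_blocks = len(request_blocks) - reused
        return new_blocks + load_weight * state.active_blocks

    # Iterating names in sorted order makes tie-breaking deterministic.
    return min(sorted(workers), key=cost)
```

With a small `load_weight` the router favors cache hits; with a large one it behaves more like a least-loaded balancer.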

### KV Block Manager (KVBM)

KVBM extends effective cache capacity using multi-tier memory offload/recall.

![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](../assets/img/kvbm-agg-performance.png)

*Tested across QPS values using Qwen3-8B on H100. Avg 20K ISL / 100 OSL.*
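
The offload/recall idea can be sketched with a two-tier LRU cache: blocks evicted from a small "GPU" tier are offloaded to a larger "host" tier, and a hit in the host tier recalls the block back to the GPU tier. The class, the two-tier layout, and the LRU policy are simplifying assumptions; KVBM's real tiers and eviction policy are not described in this document.

```python
from collections import OrderedDict


class TieredKvCache:
    """Toy two-tier block cache with LRU offload (GPU to host) and recall."""

    def __init__(self, gpu_capacity: int, host_capacity: int) -> None:
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity
        self.gpu: OrderedDict = OrderedDict()   # least recently used first
        self.host: OrderedDict = OrderedDict()

    def put(self, block_hash: int, data) -> None:
        """Insert a block into the GPU tier, offloading older blocks if needed."""
        self.host.pop(block_hash, None)          # keep a block in one tier only
        self.gpu[block_hash] = data
        self.gpu.move_to_end(block_hash)
        self._offload_overflow()

    def get(self, block_hash: int):
        """Return the block if cached in either tier, else None."""
        if block_hash in self.gpu:
            self.gpu.move_to_end(block_hash)
            return self.gpu[block_hash]
        if block_hash in self.host:
            data = self.host.pop(block_hash)     # recall from the host tier
            self.put(block_hash, data)
            return data
        return None

    def _offload_overflow(self) -> None:
        while len(self.gpu) > self.gpu_capacity:
            victim, data = self.gpu.popitem(last=False)
            self.host[victim] = data             # offload instead of dropping
            while len(self.host) > self.host_capacity:
                self.host.popitem(last=False)    # host tier full: evict for good
```

A block is lost only when both tiers overflow, which is how a second tier extends effective cache capacity beyond the first.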

### NIXL Data Transfer

NIXL reduces KV handoff cost in distributed serving by optimizing cross-worker transfer behavior across heterogeneous memory.

## Implementation Model

- **Rust** for performance-sensitive runtime components.
- **Python** for backend integration and extensibility.
- Modular subsystem boundaries so routing, planning, memory, and transport can evolve independently.

## Related Documentation

- [Architecture Flow](dynamo-flow.md)
- [Router Design](router-design.md)
- [Planner Design](planner-design.md)
- [Discovery Plane](discovery-plane.md)
- [Event Plane](event-plane.md)
- [Request Plane](request-plane.md)
- [Fault Tolerance](../fault-tolerance/README.md)
- [Grove](../kubernetes/grove.md)

## Acknowledgements

Dynamo is informed by prior open-source work from:

- vLLM
- SGLang
- DistServe
- Mooncake
- AIBrix
- BentoML