Commit 5b415abb authored by Anish, committed by GitHub

docs: rewrite architecture.md as a clearer system architecture narrative (#7022)


Signed-off-by: athreesh <anish.maddipoti@utexas.edu>
Signed-off-by: Anish <80174047+athreesh@users.noreply.github.com>
Signed-off-by: akshatha-k <akshutk@gmail.com>
Co-authored-by: akshatha-k <33278067+akshatha-k@users.noreply.github.com>
Co-authored-by: akshatha-k <akshutk@gmail.com>
Co-authored-by: dagil-nvidia <dagil@nvidia.com>
parent be2f1dc1
...@@ -302,7 +302,7 @@ Place images in `docs/assets/` and reference them with relative paths from your ...@@ -302,7 +302,7 @@ Place images in `docs/assets/` and reference them with relative paths from your
markdown files: markdown files:
```markdown ```markdown
![Architecture diagram](../assets/architecture.png) ![Architecture Diagram](../assets/img/dynamo-architecture.svg)
``` ```
### Custom components ### Custom components
......
<svg viewBox="0 0 1000 550" xmlns="http://www.w3.org/2000/svg">
<defs>
<marker id="arrowhead" markerWidth="6" markerHeight="4" refX="5.5" refY="2" orient="auto">
<polygon points="0 0, 6 2, 0 4" fill="#234CDA" />
</marker>
<marker id="arrowhead-dashed" markerWidth="6" markerHeight="4" refX="5.5" refY="2" orient="auto">
<polygon points="0 0, 6 2, 0 4" fill="#234CDA" />
</marker>
</defs>
<rect width="1000" height="550" fill="#FFFFFF"/>
<rect x="170" y="30" width="300" height="220" fill="#D2DBFD" stroke="#234CDA" stroke-width="0.7"/>
<rect x="610" y="30" width="370" height="430" fill="#D2DBFD" stroke="#234CDA" stroke-width="0.7"/>
<rect x="90" y="320" width="410" height="210" fill="#D2DBFD" stroke="#234CDA" stroke-width="0.7"/>
<rect x="170" y="30" width="300" height="30" fill="#A6B8F8" stroke="#234CDA" stroke-width="0.7"/>
<rect x="610" y="30" width="370" height="30" fill="#A6B8F8" stroke="#234CDA" stroke-width="0.7"/>
<rect x="90" y="320" width="410" height="35" fill="#A6B8F8" stroke="#234CDA" stroke-width="0.7"/>
<g font-family="Helvetica, Arial, sans-serif" font-size="14" font-weight="bold" fill="#000">
<text x="180" y="50">REQUEST PLANE</text>
<text x="620" y="50">CONTROL PLANE</text>
<text x="100" y="343">STORAGE &amp; EVENTS PLANE</text>
</g>
<rect x="10" y="30" width="80" height="150" fill="#E7ECFF" stroke="#234CDA" stroke-width="0.7"/>
<text x="25" y="110" font-family="Helvetica" font-size="12">CLIENT</text>
<rect x="190" y="80" width="90" height="35" fill="#E7ECFF" stroke="#234CDA" stroke-width="0.7"/>
<text x="205" y="102" font-family="Helvetica" font-size="11">Frontend</text>
<rect x="330" y="80" width="90" height="35" fill="#E7ECFF" stroke="#234CDA" stroke-width="0.7"/>
<text x="355" y="102" font-family="Helvetica" font-size="11">Router</text>
<rect x="210" y="170" width="80" height="50" fill="#E7ECFF" stroke="#234CDA" stroke-width="0.7"/>
<text x="225" y="190" font-family="Helvetica" font-size="10">Prefill</text>
<text x="225" y="202" font-family="Helvetica" font-size="10">workers</text>
<rect x="370" y="170" width="80" height="50" fill="#E7ECFF" stroke="#234CDA" stroke-width="0.7"/>
<text x="390" y="190" font-family="Helvetica" font-size="10">Decode</text>
<text x="390" y="202" font-family="Helvetica" font-size="10">workers</text>
<rect x="630" y="75" width="140" height="90" fill="#E7ECFF" stroke="#234CDA" stroke-width="0.7"/>
<text x="640" y="95" font-family="Helvetica" font-size="11">Planner</text>
<rect x="640" y="105" width="120" height="40" fill="#E7ECFF" stroke="#234CDA" stroke-width="0.7"/>
<text x="650" y="130" font-family="Helvetica" font-size="11">AIConfigurator</text>
<rect x="820" y="75" width="120" height="50" fill="#E7ECFF" stroke="#234CDA" stroke-width="0.7"/>
<text x="830" y="105" font-family="Helvetica" font-size="11">Dynamo Operator</text>
<rect x="810" y="210" width="100" height="35" fill="#E7ECFF" stroke="#234CDA" stroke-width="0.7"/>
<text x="820" y="232" font-family="Helvetica" font-size="11">Dynamo Graph</text>
<rect x="870" y="280" width="70" height="35" fill="#E7ECFF" stroke="#234CDA" stroke-width="0.7"/>
<text x="890" y="302" font-family="Helvetica" font-size="11">Grove</text>
<rect x="650" y="390" width="110" height="35" fill="#E7ECFF" stroke="#234CDA" stroke-width="0.7"/>
<text x="665" y="412" font-family="Helvetica" font-size="11">Model Express</text>
<rect x="105" y="375" width="70" height="30" fill="#E7ECFF" stroke="#234CDA" stroke-width="0.7"/>
<text x="125" y="395" font-family="Helvetica" font-size="11">KVBM</text>
<rect x="310" y="375" width="70" height="35" fill="#E7ECFF" stroke="#234CDA" stroke-width="0.7"/>
<text x="330" y="397" font-family="Helvetica" font-size="11">NIXL</text>
<rect x="280" y="460" width="110" height="50" fill="#E7ECFF" stroke="#234CDA" stroke-width="0.7"/>
<text x="295" y="480" font-family="Helvetica" font-size="9">Local SSD/NFS/</text>
<text x="295" y="495" font-family="Helvetica" font-size="9">Remote Storage</text>
<g stroke="#234CDA" stroke-width="1" fill="none">
<path d="M 90 60 L 168 60" stroke-dasharray="4,4" marker-end="url(#arrowhead-dashed)" />
<path d="M 170 140 L 92 140" stroke-dasharray="4,4" marker-end="url(#arrowhead-dashed)" />
<path d="M 470 140 L 608 140" stroke-dasharray="4,4" marker-end="url(#arrowhead-dashed)" />
<path d="M 610 200 L 472 200" stroke-dasharray="4,4" marker-end="url(#arrowhead-dashed)" />
<path d="M 195 250 L 195 318" stroke-dasharray="4,4" marker-end="url(#arrowhead-dashed)" />
<path d="M 315 320 L 315 252" stroke-dasharray="4,4" marker-end="url(#arrowhead-dashed)" />
<path d="M 425 320 L 425 252" stroke-dasharray="4,4" marker-end="url(#arrowhead-dashed)" />
<path d="M 705 425 L 705 480 L 502 480" stroke-dasharray="4,4" marker-end="url(#arrowhead-dashed)" />
<path d="M 280 97 L 328 97" marker-end="url(#arrowhead)" />
<path d="M 375 115 L 375 140 L 250 140 L 250 168" marker-end="url(#arrowhead)" />
<path d="M 375 140 L 410 140 L 410 168" marker-end="url(#arrowhead)" />
<path d="M 290 195 L 368 195" marker-end="url(#arrowhead)" />
<path d="M 770 115 L 818 115" marker-end="url(#arrowhead)" />
<path d="M 880 125 L 880 208" marker-end="url(#arrowhead)" />
<path d="M 860 245 L 860 358" marker-end="url(#arrowhead)" />
<path d="M 900 245 L 900 278" marker-end="url(#arrowhead)" />
<path d="M 900 315 L 900 358" marker-end="url(#arrowhead)" />
<path d="M 175 390 L 308 390" marker-end="url(#arrowhead)" />
<path d="M 325 410 L 325 458" marker-end="url(#arrowhead)" />
<path d="M 355 460 L 355 412" marker-end="url(#arrowhead)" />
</g>
<g font-family="Helvetica" font-size="10" fill="#000">
<text x="100" y="55">request</text>
<text x="100" y="130">response</text>
<text x="485" y="130">observability</text>
<text x="485" y="190">capacity, placement</text>
<text x="485" y="215">fault tolerance</text>
<text x="100" y="285">lifecycle, handoff,</text>
<text x="100" y="300">blocks (KVBM)</text>
<text x="320" y="285">model</text>
<text x="320" y="300">weights</text>
<text x="435" y="285">cache visibility/</text>
<text x="435" y="300">KV events to router</text>
<text x="890" y="180">(Reconcile)</text>
<text x="845" y="375">Runtime Resources</text>
<text x="270" y="430">KV block</text>
<text x="270" y="445">offload</text>
<text x="365" y="445">Model weights,</text>
<text x="365" y="430">KV block onboard</text>
<text x="330" y="190" font-size="9" text-anchor="middle">KV state</text>
<text x="330" y="208" font-size="9" text-anchor="middle">Checkpoints</text>
</g>
</svg>
...@@ -2,96 +2,200 @@ ...@@ -2,96 +2,200 @@
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved. # SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0 # SPDX-License-Identifier: Apache-2.0
title: Overall Architecture title: Overall Architecture
subtitle: Architecture and components of the Dynamo inference runtime
--- ---
Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting SGLang, TRT-LLM, vLLM and others, while capturing essential LLM capabilities: # Dynamo Architecture
- **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps you balance throughput and latency Dynamo is a distributed inference runtime for generative AI systems that must operate at high throughput, low latency, and high reliability under changing traffic conditions. It is backend-agnostic (SGLang, TRT-LLM, vLLM, and others) and is built around three cooperating concerns:
- **Dynamic GPU scheduling**: Optimizes performance based on real-time demand
- **LLM-aware request routing**: Eliminates unnecessary KV cache recomputation
- **Accelerated data transfer**: Reduces inference response time using NIXL
- **KV cache offloading**: Uses multiple memory hierarchies for higher system throughput and lower latency
Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, Open Source Software (OSS)-first development approach - A fast **request path** for token generation
- A responsive **control path** for scaling and placement
- A resilient **state path** for KV reuse and failure recovery
## Motivation behind Dynamo This document presents Dynamo as an architecture, not a feature list: what each plane owns, how requests move, how the system adapts, and how it remains correct under failure.
Scaling inference for generative AI and reasoning models presents complex challenges in three key areas: performance, correctness, and efficiency. Here's what we're solving: ## Design Goals
There are multi-faceted challenges: Dynamo is designed to satisfy the following goals simultaneously:
- *Difficult UX*: User experience is critical for distributed inference runtimes because managing large-scale inference systems is already complex, and poor usability further complicates matters. Developers need a clear, intuitive way to define, optimize, and update inference execution without wrestling with low-level infrastructure details. Without simple UX, inference runtimes remain inaccessible, prone to errors, and inefficient, hindering model deployment and innovation. A modern distributed inference stack must consider usability at its core—empowering developers to scale AI effortlessly for agentic workflows while ensuring correctness and performance. 1. **Latency stability**: keep TTFT and ITL predictable under bursty and mixed-length traffic.
2. **GPU efficiency**: disaggregate prefill and decode so each can scale independently.
3. **Compute reuse**: minimize KV recomputation through KV-aware routing and cache lifecycle management.
4. **Operational resilience**: treat worker crashes, restarts, and overload as normal operating events.
5. **Deployment portability**: support Kubernetes-native control paths and non-Kubernetes runtime modes.
- *GPU underutilization*: Traditional monolithic inference pipelines often leave GPUs idle due to the imbalance between prefill and decode stages. Prefill (which generates large prompt embeddings) is highly compute-intensive, while decode (which generates tokens) is latency-sensitive. A disaggregated approach that separate prefill and decode ensures optimal GPU utilization and increases overall throughput ([DistServe](https://arxiv.org/abs/2401.09670)). ## Why This Architecture Exists
- *Expensive KV cache re-computation*: When requests aren't efficiently routed, KV caches (intermediate states of transformer model) often get flushed and recomputed, leading to wasted computation cycles and increased latency. KV-aware request routing eliminates redundant KV cache regeneration, significantly boosting efficiency.([DeepSeek](https://arxiv.org/abs/2501.12948)) Modern LLM serving hits recurring bottlenecks:
- *Memory bottlenecks*: Large-scale inference workloads demand extensive KV cache storage, which can quickly overwhelm GPU memory capacity. KV cache offloading across memory hierarchies (HBM, DDR, NVMe or remote storage) enables models to scale beyond GPU memory limits and speeds up latency. ([Mooncake](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html), [AIBrix](https://blog.vllm.ai/2025/02/21/aibrix-release.html), [FlexKV](https://github.com/taco-project/FlexKV), [LMCache](https://lmcache.ai/)) - **Prefill/decode imbalance** leaves GPUs underutilized when traffic mix shifts ([DistServe](https://arxiv.org/abs/2401.09670)).
- **KV recomputation** increases TTFT and wastes compute when routing ignores cache overlap ([DeepSeek](https://arxiv.org/abs/2501.12948)).
- **Memory pressure** from long contexts and concurrency exceeds HBM capacity without multi-tier cache management ([KVBM](https://docs.nvidia.com/dynamo/components/kvbm), [Mooncake](https://kvcache-ai.github.io/Mooncake/design/mooncake-store.html), [AIBrix](https://blog.vllm.ai/2025/02/21/aibrix-release.html), [FlexKV](https://github.com/taco-project/FlexKV), [LMCache](https://lmcache.ai/)).
- **Dynamic demand** breaks static provisioning assumptions ([AzureTrace](https://github.com/Azure/AzurePublicDataset)).
- **Real-world failures** (pod restart, partition, hot-spot overload) require first-class recovery behavior.
- *Fluctuating demand and inefficient GPU allocation*: Inference workloads are use-case specific and dynamic—demand surges inherently cause unpredictably, yet traditional serving stacks allocate GPUs statically. Dynamic GPU scheduling ensures that resources are allocated based on real-time demand, preventing over-provisioning and improving utilization ([AzureTrace](https://github.com/Azure/AzurePublicDataset)) Dynamo addresses these constraints by separating serving, control, and state propagation into explicit planes and control loops.
- *Inefficient data transfer*: Distributed inference workloads introduce unique and highly dynamic communication patterns that differ fundamentally from training. Unlike training, where worker roles remain largely static, inference requires real-time worker scaling, dynamic load balancing, and adaptive memory management—necessitating a communication layer that can efficiently handle these evolving requirements. Contemporary libraries are built for static, synchronous operations and lack the dynamicity needed for inference serving. While UCX provides high-performance networking, it requires deep networking expertise to configure correctly, making it impractical for broad inference use cases. Developers need a library optimized for inference workloads that can abstract heterogeneous memory (remote memory or storage) and dynamically select the best transport mechanism via a unified API. ## Architecture Overview
To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access. ![Dynamo architecture showing Request Plane (Client, Frontend, Router, Prefill/Decode workers), Control Plane (Planner, Dynamo Operator, Dynamo Graph, Grove, Model Express, Runtime Resources), and Storage &amp; Events Plane (KVBM, NIXL, Local SSD/NFS/Remote Storage)](../assets/img/dynamo-architecture.svg "Dynamo Architecture")
## Key benefits ## System Model
The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features: ### Request Plane (critical data path)
- [Dynamo Disaggregated Serving](disagg-serving.md) The request plane is responsible for request/response execution:
- [Dynamo Smart Router](../components/router/README.md)
- [Dynamo KV Cache Block Manager](../components/kvbm/README.md)
- [Planner](../components/planner/README.md)
- [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths. - **Frontend** accepts and normalizes requests.
- **Router** selects workers based on load and KV overlap.
- **Prefill workers** compute prompt KV state.
- **Decode workers** generate output tokens.
![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](../assets/img/architecture.png "Dynamo Architecture") This path is optimized for low overhead and continuous token streaming.
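The Router bullet above says workers are selected "based on load and KV overlap". The following is a minimal sketch of that idea and is not part of this commit or of Dynamo's code. The `Worker` fields, the `overlap` and `pick_worker` helpers, and the cost formula (missing prefix blocks plus weighted active requests) are all assumptions made for illustration; the actual router may weigh these signals differently.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    active_requests: int          # load signal (hypothetical)
    cached_prefix_blocks: tuple   # KV block hashes held, in prefix order


def overlap(worker_blocks, request_blocks):
    """Count how many leading blocks of the request are already cached."""
    matched = 0
    for have, want in zip(worker_blocks, request_blocks):
        if have != want:
            break
        matched += 1
    return matched


def pick_worker(workers, request_blocks, load_weight=1.0):
    """Pick the worker with the lowest illustrative cost.

    cost = blocks that still need prefill + load_weight * active requests
    (an assumed formula, chosen only to show the trade-off).
    """
    def cost(worker):
        hit = overlap(worker.cached_prefix_blocks, request_blocks)
        missing = len(request_blocks) - hit
        return missing + load_weight * worker.active_requests

    return min(workers, key=cost)


workers = [
    Worker("w0", active_requests=1, cached_prefix_blocks=("a", "b", "c")),
    Worker("w1", active_requests=0, cached_prefix_blocks=()),
    Worker("w2", active_requests=5, cached_prefix_blocks=("a", "b", "c", "d")),
]
chosen = pick_worker(workers, ("a", "b", "c", "d"))
print(chosen.name)
```

In this toy input the fully cached worker `w2` is not chosen because it is also the busiest, which is the balance between cache reuse and load that the bullet describes.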
Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if Dynamo detects an increase in requests with long input sequences, the Planner automatically scales up prefill workers to meet the heightened demand. ### Control Plane (adaptation and orchestration path)
Beyond efficient event communication, data transfer across multi-node deployments is crucial at scale. To address this, Dynamo utilizes NIXL, a technology designed to expedite transfers through reduced synchronization and intelligent batching. This acceleration is particularly vital for disaggregated serving, ensuring minimal latency when prefill workers pass KV cache data to decode workers. The control plane is responsible for desired-state management:
Dynamo prioritizes seamless integration. Its modular design enables it to work harmoniously with your existing infrastructure and preferred open-source components. To achieve optimal performance and extensibility, Dynamo leverages the strengths of both Rust and Python. We built critical performance-sensitive modules with Rust for speed, memory safety, and robust concurrency. Meanwhile, we used Python for its flexibility, enabling rapid prototyping and effortless customization. - **Planner** computes scaling targets from live metrics.
- **Dynamo Operator** reconciles Kubernetes resources from Dynamo CRDs.
- **Discovery + Endpoints/CRD** establish liveness and discoverability.
- **Grove/KAI Scheduler path** provides topology-aware placement and grouped scaling in multinode Kubernetes deployments.
- **Model Express** is an optional model-management endpoint when configured.
## Performance benefits of key features This path is optimized for correctness and convergence to target capacity.
### Disaggregated serving ### Storage & Events Plane (state propagation path)
Disaggregating prefill and decode boosts performance, gaining efficiency when more GPUs are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization. The storage/events plane is responsible for cache state visibility and movement:
![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](../assets/img/disagg-perf-benefit.png) - **KV Events** publish cache lifecycle transitions.
- **KVBM** manages block reuse, eviction, and offload/recall across memory tiers.
- **NIXL** performs high-speed KV/data transfer across workers and memory domains.
* Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL This path is optimized for cache reuse and cross-worker handoff efficiency.
## End-to-End Request Narrative (Disaggregated Mode)
The disaggregation of prefill and decode phases offers valuable flexibility. Since these phases directly correlate with time-to-first-token (TTFT) and inter-token latency (ITL) respectively, adjusting worker allocation can provide tailored performance. This enables optimization for specific service level agreements (SLAs), whether prioritizing faster TTFT, lower ITL, or higher throughput. 1. Client sends request to **Frontend**.
2. Frontend validates/preprocesses and forwards to **Router**.
3. Router chooses a **Prefill worker**.
4. Prefill computes KV and returns transfer metadata.
5. Router chooses a **Decode worker**.
6. Decode receives KV state (typically via **NIXL** transfer path).
7. Decode streams tokens back through Frontend.
8. **KV Events** update cache visibility for future routing decisions.
9. **KVBM** may offload or recall KV blocks based on pressure and reuse potential.
### KV aware routing For flow-level detail, see [Architecture Flow](dynamo-flow.md).
For request transport options, see [Request Plane](request-plane.md).
![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](../assets/img/kv-routing.png) ## Control Loops
### Serving Loop
Maintains low-latency request execution across frontend, router, prefill, and decode workers.
### Planning Loop
Maintains capacity alignment with demand:
- Planner consumes runtime metrics.
- Planner computes prefill/decode targets.
- Connector layer applies targets to runtime resources.
Planner supports throughput-based and load-based strategies. See [Planner Design](planner-design.md).
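The planning loop above (consume metrics, compute prefill/decode targets, apply them) can be sketched as follows. This is an illustration only, not the Planner's implementation: the metric names, the per-replica capacity figures, and the `replica_target` and `plan` helpers are hypothetical, and the rule shown is a simple throughput-style ceiling division with clamping.

```python
import math


def replica_target(demand_per_s, capacity_per_replica, min_replicas=1, max_replicas=8):
    """Replicas needed to cover demand, clamped to [min_replicas, max_replicas]."""
    needed = math.ceil(demand_per_s / capacity_per_replica)
    return max(min_replicas, min(max_replicas, needed))


def plan(metrics, capacity):
    """Compute independent prefill and decode targets from observed demand."""
    return {
        "prefill": replica_target(
            metrics["prefill_tokens_per_s"], capacity["prefill_tokens_per_s"]
        ),
        "decode": replica_target(
            metrics["decode_tokens_per_s"], capacity["decode_tokens_per_s"]
        ),
    }


# Assumed numbers, for illustration only.
targets = plan(
    {"prefill_tokens_per_s": 90_000, "decode_tokens_per_s": 4_000},
    {"prefill_tokens_per_s": 40_000, "decode_tokens_per_s": 2_500},
)
print(targets)
```

Keeping the two targets separate mirrors the design goal of scaling prefill and decode independently; a connector layer would then apply `targets` to runtime resources.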
### Resilience Loop
Maintains system continuity under failure:
- Health checks detect unhealthy workers.
- Discovery liveness removes stale endpoints.
- Graceful shutdown drains in-flight work.
- Request migration/cancellation controls in-flight behavior.
- Load shedding prevents cascading collapse under overload.
See [Fault Tolerance](../fault-tolerance/README.md).
## Kubernetes-Native Realization (CRD + Grove)
In Kubernetes deployments, the same architecture maps to declarative resources:
- Dynamo Operator reconciles `DynamoGraphDeployment`.
- Discoverability is derived from `DynamoWorkerMetadata` + EndpointSlices.
- Grove-backed multinode deployments model worker groups as `PodCliqueSet` and `PodClique`.
- Independent prefill/decode elasticity is represented via `PodCliqueScalingGroup` with separate `replicas` and `min` targets.
The diagram labels such as `PodClique A/B`, `ScalingGroup "Prefill"`, `ScalingGroup "Decode"`, and `(replicas, min)` represent this grouped scaling model.
## Fault Tolerance Architecture
* Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL Fault tolerance is embedded across layers:
| Layer | Mechanism | Practical effect |
|------|-----------|------------------|
| Request | Migration, cancellation | In-flight work can continue or terminate intentionally |
| Worker | Health checks, graceful shutdown, endpoint draining | Failed/terminating workers stop taking new traffic safely |
| System | Request rejection/load shedding | Prevents overload from propagating across workers |
| Infrastructure | Discovery lease expiry, event-path recovery | Stale membership is removed and traffic reroutes |
Existing routing methods, including load-based routing, overlook the specific properties of LLMs that could improve performance. Addressing this, routing user queries to workers with the highest KV cache hit rate (rather than simply the least busy node) allows for immediate processing, even under heavy load. The preceeding figures illustrate the effectiveness of KV aware routing on 100,000 real R1 user queries, achieving a 3x improvement in TTFT and a 2x reduction in average request latency. Depending on traffic, this approach can also enhance throughput. This model assumes failures are routine, not exceptional.
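The "Discovery lease expiry" row in the table above can be illustrated with a small lease registry. This is a sketch under stated assumptions, not Dynamo's discovery mechanism: the `LeaseRegistry` class, its TTL value, and the explicit `now` arguments (used instead of a real clock so the behavior is deterministic) are all hypothetical.

```python
class LeaseRegistry:
    """Tracks workers by lease; a worker that stops renewing drops out."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._expires_at = {}  # worker name -> lease expiry time

    def heartbeat(self, worker, now):
        """Register a worker or renew its lease."""
        self._expires_at[worker] = now + self.ttl_s

    def live_workers(self, now):
        """Drop expired leases and return the workers that remain routable."""
        self._expires_at = {
            worker: expiry
            for worker, expiry in self._expires_at.items()
            if expiry > now
        }
        return sorted(self._expires_at)


registry = LeaseRegistry(ttl_s=10.0)
registry.heartbeat("decode-0", now=0.0)
registry.heartbeat("decode-1", now=0.0)
registry.heartbeat("decode-0", now=8.0)   # decode-1 never renews (crashed)
live = registry.live_workers(now=12.0)
print(live)
```

No explicit failure notification is needed: the crashed worker simply ages out, and routing sees only the remaining membership.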
### KV cache manager ## Performance Rationale
### Disaggregated Serving
Separating prefill and decode improves utilization and enables phase-specific scaling.
![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](../assets/img/disagg-perf-benefit.png)
*Tested on H100 with R1 Distilled Llama 70B FP8 on vLLM. 3K ISL / 150 OSL.*
### KV-Aware Routing
Routing with cache overlap + load signals reduces prefill recomputation and improves latency.
For an external production case study, see [How Baseten achieved 2x faster inference with NVIDIA Dynamo](https://www.baseten.co/blog/how-baseten-achieved-2x-faster-inference-with-nvidia-dynamo/#how-baseten-uses-nvidia-dynamo).
![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](../assets/img/kv-routing.png)
*Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 H100 nodes. Avg 4K ISL / 800 OSL.*
### KV Block Manager (KVBM)
KVBM extends effective cache capacity using multi-tier memory offload/recall.
The Dynamo KV Block Manager (KVBM) enables KV cache offloading to system CPU memory, local SSDs, and network-attached storage, allowing more KV blocks to be reused instead of recomputed. In many cases, KV transfer is faster than recomputation, so KVBM helps reduce time-to-first-token (TTFT). The following plot highlights the performance gains achieved through CPU memory offloading. In a scenario involving 20 multi-turn conversations with 15 users, KVBM with CPU memory offloading achieved a 2.2×–12× improvement in TTFT (depending on QPS), demonstrating benefits that extend beyond basic prefix caching.
![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](../assets/img/kvbm-agg-performance.png) ![Line graph comparing Pure GPU prefix caching with vLLM and KVBM host offloading for TTFT (Time To First Token)](../assets/img/kvbm-agg-performance.png)
* Tested with different QPS using Qwen3-8B on H100. Avg 20K ISL / 100 OSL. *Tested across QPS values using Qwen3-8B on H100. Avg 20K ISL / 100 OSL.*
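The offload/recall behavior described for KVBM can be pictured as a two-tier cache. The sketch below is not KVBM's API or eviction policy; the `TieredBlockCache` class, the LRU ordering, and the two tier names are assumptions chosen to show how a block evicted from the fast tier can later be recalled instead of recomputed.

```python
from collections import OrderedDict


class TieredBlockCache:
    """Toy two-tier KV block cache: a small 'gpu' tier backed by a 'host' tier."""

    def __init__(self, gpu_capacity, host_capacity):
        self.gpu = OrderedDict()   # least recently used entry first
        self.host = OrderedDict()
        self.gpu_capacity = gpu_capacity
        self.host_capacity = host_capacity

    def put(self, block_id, data):
        self.gpu[block_id] = data
        self.gpu.move_to_end(block_id)
        while len(self.gpu) > self.gpu_capacity:
            old_id, old_data = self.gpu.popitem(last=False)
            self.host[old_id] = old_data          # offload to the slower tier
            while len(self.host) > self.host_capacity:
                self.host.popitem(last=False)     # gone; would need recompute

    def get(self, block_id):
        """Return (data, tier) where tier is 'gpu', 'host', or 'miss'."""
        if block_id in self.gpu:
            self.gpu.move_to_end(block_id)
            return self.gpu[block_id], "gpu"
        if block_id in self.host:
            data = self.host.pop(block_id)
            self.put(block_id, data)              # recall into the fast tier
            return data, "host"
        return None, "miss"


cache = TieredBlockCache(gpu_capacity=2, host_capacity=2)
for i in range(3):
    cache.put(f"b{i}", f"kv{i}")   # inserting b2 offloads b0 to host

first = cache.get("b0")    # served from host, then recalled to gpu
second = cache.get("b0")   # now resident in gpu
print(first, second)
```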
### NIXL Data Transfer
NIXL reduces KV handoff cost in distributed serving by optimizing cross-worker transfer behavior across heterogeneous memory.
## Implementation Model
- **Rust** for performance-sensitive runtime components.
- **Python** for backend integration and extensibility.
- Modular subsystem boundaries so routing, planning, memory, and transport can evolve independently.
### NVIDIA Inference Transfer Library (NIXL) ## Related Documentation
NIXL streamlines data transfer through simplified synchronization and batching and simplified source and destination abstractions. NIXL can abstract data movement across different types of memory and fast storage, whereas other data transfer libraries typically support a single tier of memory. These enhancements yield significant performance gains, accelerating both time-to-first-token (TTFT) and throughput. - [Architecture Flow](dynamo-flow.md)
- [Router Design](router-design.md)
- [Planner Design](planner-design.md)
- [Discovery Plane](discovery-plane.md)
- [Event Plane](event-plane.md)
- [Request Plane](request-plane.md)
- [Fault Tolerance](../fault-tolerance/README.md)
- [Grove](../kubernetes/grove.md)
## Acknowledgements ## Acknowledgements
We'd like to acknowledge several open source software stacks that motivated our creation Dynamo. Dynamo is informed by prior open-source work from:
- vLLM and vLLM-project - vLLM
- SGLang - SGLang
- DistServe - DistServe
- Mooncake - Mooncake
......