Unverified commit 8d636ebd, authored by Suman Tatiraju, committed by GitHub
parent 6d46288c
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
name: Generate Documentation
on:
push:
branches:
- main
- release/*
pull_request:
paths:
- 'docs/**'
- 'container/Dockerfile.docs'
- '.github/workflows/generate-docs.yml'
jobs:
build-docs:
name: Build Documentation
runs-on: ubuntu-latest
steps:
- name: Checkout repository
uses: actions/checkout@v4
- name: Set up Docker Buildx
uses: docker/setup-buildx-action@v3
- name: Generate documentation
run: |
docker build -t docs-builder -f container/Dockerfile.docs .
- name: Copy documentation out of container
run: |
docker create --name docs-container docs-builder
docker cp docs-container:/workspace/dynamo/docs/build/html dynamo-docs/
- name: Remove documentation container
if: always()
run: |
docker rm docs-container || true
- name: Upload documentation artifact
uses: actions/upload-artifact@v4
with:
name: dynamo-docs-${{ github.run_id }}
path: dynamo-docs
retention-days: 15
\ No newline at end of file
......@@ -21,7 +21,7 @@ limitations under the License.
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
[![Discord](https://dcbadge.limes.pink/api/server/D92uqZRjCZ?style=flat)](https://discord.gg/nvidia-dynamo)
| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Support Matrix](support_matrix.md)** | **[Guides](docs/guides)** | **[Architecture and Features](docs/architecture.md)** | **[APIs](lib/bindings/python/README.md)** | **[SDK](deploy/sdk/README.md)** |
| **[Roadmap](https://github.com/ai-dynamo/dynamo/issues/762)** | **[Support Matrix](docs/support_matrix.md)** | **[Guides](docs/guides)** | **[Architecture and Features](docs/architecture/architecture.md)** | **[APIs](lib/bindings/python/README.md)** | **[SDK](deploy/dynamo/sdk/README.md)** |
### 📢 **Please join us for our** [ **first Dynamo in-person meetup with vLLM and SGLang leads**](https://events.nvidia.com/nvidiadynamousermeetups) **on 6/5 (Thu) in SF!** ###
......@@ -38,7 +38,7 @@ Built in Rust for performance and in Python for extensibility, Dynamo is fully o
### Installation
The following examples require a few system level packages.
Recommended to use Ubuntu 24.04 with a x86_64 CPU. See [support_matrix.md](support_matrix.md)
Recommended to use Ubuntu 24.04 with an x86_64 CPU. See [docs/support_matrix.md](docs/support_matrix.md)
```
apt-get update
......
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
FROM ubuntu:24.04
COPY --from=ghcr.io/astral-sh/uv:latest /uv /uvx /bin/
RUN apt-get update && \
apt-get install -y --no-install-recommends \
build-essential \
curl \
doxygen \
pandoc \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /workspace/dynamo
ENV VIRTUAL_ENV=/workspace/dynamo/.venv
RUN uv venv $VIRTUAL_ENV --python 3.12 && \
uv pip install ablog \
attrs \
breathe \
docutils \
exhale \
httplib2 \
ipython \
myst-nb \
nbclient \
nbsphinx \
nvidia-sphinx-theme \
sphinx \
sphinx-book-theme \
sphinx-copybutton \
sphinx-design \
sphinx-prompt \
sphinx-sitemap \
sphinx-tabs \
sphinxcontrib-bibtex \
sphinxcontrib-mermaid
# Set visitor script to be included on every HTML page
ENV VISITS_COUNTING_SCRIPT="//assets.adobedtm.com/b92787824f2e0e9b68dc2e993f9bd995339fe417/satelliteLib-7ba51e58dc61bcb0e9311aadd02a0108ab24cc6c.js"
COPY . /workspace/dynamo
RUN . .venv/bin/activate && \
python3 docs/generate_docs.py
\ No newline at end of file
# Dynamo CLI Documentation
The Dynamo CLI is a powerful tool for serving, containerizing, and deploying Dynamo applications. It leverages core pieces of the BentoML deployment stack and provides a range of commands to manage your Dynamo services.
Overview
At a high level, the Dynamo CLI allows you to:
- `run` - quickly chat with a model
- `serve` - run a set of services locally (via `depends()` or `.link()`)
- `build` - create an archive of your services (called a `bento`)
# Commands
## `run`
The `run` command allows you to quickly chat with a model. Under the hood, it runs the `dynamo-run` Rust binary. The arguments it takes are listed in the [dynamo-run docs](../../../../../launch/README.md).
**Example**
```bash
dynamo run deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
```
## `serve`
The `serve` command lets you run a defined inference graph locally. Point to your file and intended class using `file:Class` syntax.
**Usage**
```bash
dynamo serve [SERVICE]
```
**Arguments**
- `SERVICE` - The service to start. You use file:Class syntax to specify the service.
**Flags**
- `--file`/`-f` - Path to optional YAML configuration file. An example of the YAML file can be found in the configuration section of the [SDK docs](../sdk/README.md)
- `--dry-run` - Print out the dependency graph and values without starting any services.
- `--service-name` - Only serve the specified service name. The rest of the discoverable components in the graph are not started.
- `--working-dir` - Specify the directory to find the Service instance
- Any additional flags that follow Class.key=value will be passed to the service constructor for the target service and parsed. Please see the configuration section of the [SDK docs](../sdk/README.md) for more details.
**Example**
```bash
cd examples
# Spin up Frontend, Middle, and Backend components
dynamo serve hello_world:Frontend
# Spin up only the Middle component in the graph that is discoverable from the Frontend service
dynamo serve --service-name Middle hello_world:Frontend
```
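The `file:Class` syntax can be read as a module name and a class name joined by a colon. The following is a minimal sketch of how such a string could be split and resolved. `parse_service_spec` and `load_service` are hypothetical helpers written for illustration; they are not the CLI's actual implementation.

```python
import importlib


def parse_service_spec(spec: str) -> tuple[str, str]:
    """Split a 'file:Class' string into (module_name, class_name)."""
    module_name, sep, class_name = spec.partition(":")
    if not sep or not module_name or not class_name:
        raise ValueError(f"expected file:Class, got {spec!r}")
    return module_name, class_name


def load_service(spec: str):
    """Import the module and return the named class (hypothetical helper)."""
    module_name, class_name = parse_service_spec(spec)
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


print(parse_service_spec("hello_world:Frontend"))
# ('hello_world', 'Frontend')
```

In this sketch the module must be importable from the current directory, which is why the example above runs `cd examples` first.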
## `build`
The `build` command allows you to package your inference graph and its dependencies into an archive. This is commonly paired with the `--containerize` flag to create a single Docker container that runs your inference graph. As with `serve`, you point toward the first service in your dependency graph.
**Usage**
```bash
dynamo build [SERVICE]
```
**Arguments**
- `SERVICE` - The service to build. You use file:Class syntax to specify the service.
**Flags**
- `--working-dir` - Specify the directory to find the Service instance
- `--containerize` - Whether to containerize the Bento after building
**Example**
```bash
cd examples/hello_world
dynamo build hello_world:Frontend
```
../../../../docs/guides/cli_overview.md
\ No newline at end of file
# Documentation for the Dynamo SDK
# Table of Contents
- [Introduction](#introduction)
- [Installation](#installation)
- [Core Concepts](#core-concepts)
- [Writing a Service](#writing-a-service)
- [Configuring a Service](#configuring-a-service)
- [Composing Services into an Graph](#composing-services-into-an-graph)
# Introduction
Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. Docs and examples for those can be found [here](../../../../../README.md).
Dynamo SDK is a layer on top of the core. It is a Python framework that makes it easy to create inference graphs and deploy them locally and onto a target K8s cluster. The SDK was heavily inspired by [BentoML's](https://github.com/bentoml/BentoML) open source deployment patterns and leverages many of its core primitives. The Dynamo CLI is a companion tool that allows you to spin up an inference pipeline locally, containerize it, and deploy it. You can find a toy hello-world example [here](../../README.md).
# Installation
The SDK can be installed using pip:
```bash
pip install ai-dynamo
```
# Core Concepts
As you read about each concept, it is helpful to have the [basic example](../../README.md) up as well so you can refer back to it.
## Defining a Service
A Service is a core building block for a project. You can think of it as a logical unit of work. For example, you might have a service responsible for preprocessing and tokenizing and another service running the model worker itself. You define a service using the `@service` decorator on a class.
```python
@service(
dynamo={
"namespace": "dynamo",
},
resources={"gpu": 2, "cpu": "10", "memory": "20Gi"},
workers=1,
)
```
Key configuration options:
1. `dynamo`: Dictionary that defines the Dynamo configuration and enables/disables it. When enabled, a dynamo worker is created under the hood which can register with the [Distributed Runtime](../../../../../docs/architecture.md)
2. `resources`: Dictionary defining resource requirements. Used primarily when deploying to K8s, but gpu is also used for local execution.
3. `workers`: Number of parallel instances of the service to spin up.
## Writing a Service
Let's walk through an example to understand how you write a dynamo service.
```python
import ServiceB
@service(dynamo={"namespace": "dynamo"}, resources={"gpu": 1})
class ServiceA:
# Define service dependencies
service_b = depends(ServiceB)
def __init__(self, model_name="meta-llama/Llama-3.1-8B-Instruct"):
self.model_name = model_name
self.engine = None
@async_on_start
async def async_init(self):
# Initialize resources that require async operations
self.engine = await initialize_model_engine(self.model_name)
print(f"ServiceA initialized with model: {self.model_name}")
@async_on_shutdown
async def async_shutdown(self):
# Clean up resources
if self.engine:
await self.engine.shutdown()
print("ServiceA engine shut down")
@endpoint()
async def generate(self, request: ChatCompletionRequest):
# Call dependent service
processed_request = await self.service_b.preprocess(request)
# Use the engine to generate a response
response = await self.engine.generate(processed_request)
return response
```
### Class-Based Architecture
Dynamo follows a class-based architecture similar to BentoML's, which makes it intuitive for users familiar with that framework. Each service is defined as a Python class, with the following components:
1. Class attributes for dependencies using `depends()`
2. An `__init__` method for standard initialization
3. Optional lifecycle hooks like `@async_on_start` and `@async_on_shutdown`
4. Endpoints defined with `@endpoint()`. Optionally, an endpoint can be given a name
via `@endpoint("my_endpoint_name")`, but otherwise will default to the name of the
function being decorated if omitted.
This approach provides a clean separation of concerns and makes the service structure easy to understand.
### Service Dependencies with `depends()`
The `depends()` function is a powerful BentoML feature that lets you create a dependency between services. When you use `depends(ServiceB)`, several things happen:
1. It ensures that `ServiceB` is deployed when `ServiceA` is deployed by adding it to an internal service dependency graph
2. It creates a client, under the hood, to the endpoints that `ServiceB` is serving.
3. You are able to call `ServiceB` endpoints as if they were local functions!
```python
# What happens internally when you use depends(ServiceB)
service_b = await runtime.namespace("dynamo").component("ServiceB").endpoint("preprocess").client()
# But with Dynamo SDK, you simply write:
service_b = depends(ServiceB)
# And then call methods directly:
result = await service_b.preprocess(data)
```
**NOTE** - through the SDK, we also provide you with a way to access the underlying bindings if you need. Sometimes you might want to write complicated logic that causes you to directly create a client to another Service without depending on it. You can do this via:
```python
import VllmWorker
runtime = dynamo_context["runtime"]
comp_ns, comp_name = VllmWorker.dynamo_address() # dynamo://{namespace}/{name}
print(f"[Processor] comp_ns: {comp_ns}, comp_name: {comp_name}")
self.worker_client = (
await runtime.namespace(comp_ns)
.component(comp_name)
.endpoint("generate")
.client()
)
```
This is used in some of our prebuilt examples and is a powerful way to leverage the benefits of the SDK while being able to access Dynamo's core primitives.
You can find more docs on depends [here](https://docs.bentoml.com/en/latest/build-with-bentoml/distributed-services.html#interservice-communication)
### Lifecycle Hooks
Dynamo supports key lifecycle hooks to manage service initialization and cleanup. We currently only support a subset of BentoML's lifecycle hooks but are working on adding support for the rest.
#### `@async_on_start`
The `@async_on_start` hook is called when the service is started and is used to run an async process outside of the main `__init__` function.
```python
@async_on_start
async def async_init(self):
# Perfect for operations that need to be awaited
self.db = await connect_to_db()
self.tokenizer = await load_tokenizer()
self.engine = await initialize_engine(self.model)
```
This is especially useful for:
- Initializing external connections
- Setting up runtime resources that require async operations
#### `@async_on_shutdown`
The `@async_on_shutdown` hook is called when the service is shut down and handles cleanup.
```python
@async_on_shutdown
async def async_shutdown(self):
if self._engine_context is not None:
await self._engine_context.__aexit__(None, None, None)
print("VllmWorkerRouterLess shutting down")
```
This ensures resources are properly released, preventing memory leaks and making sure external connections are properly closed. This is helpful to clean up vLLM engines that have been started outside of the main process.
## Configuring a Service
Dynamo SDK provides a flexible configuration system that allows you to define service parameters through multiple methods:
1. Directly in the `@service` decorator
2. Through YAML configuration files
3. Via command-line arguments
4. Using environment variables
These methods can be used together with clear precedence rules, giving you fine-grained control over service configuration across different environments.
### Configuration via Service Decorator
The most basic method is to specify parameters directly in the service decorator:
```python
@service(
dynamo={"namespace": "prod"},
resources={"gpu": 2, "cpu": "4", "memory": "16Gi"},
workers=2,
)
class MyService:
def __init__(self, model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", temperature=0.7):
self.model_name = model_name
self.temperature = temperature
```
This defines static configuration values in code. Note that the constructor parameters (`model_name` and `temperature`) are also configurable values that can be overridden.
### Configuration via YAML
For more flexible configuration, especially across environments, you can use YAML files:
```yaml
# config.yaml
MyService:
# Override service decorator settings
ServiceArgs:
workers: 4
resources:
gpu: 4
# Service instance parameters
model_name: "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
temperature: 0.8
```
The YAML file has a hierarchical structure:
- Top level keys are service class names
- `ServiceArgs` contains parameters for the service decorator
- Other keys are passed as arguments to the service constructor
- Additional keys specific to the service can be accessed via the config system
### Loading YAML Configuration
Use the CLI to load configuration from a YAML file:
```bash
dynamo serve service:MyService -f config.yaml
```
The configuration is parsed and stored in the `DYNAMO_SERVICE_CONFIG` environment variable, which is then passed to the service workers.
### Configuration Precedence
When multiple configuration sources are used, they follow this precedence order (highest to lowest):
1. Command-line arguments
2. YAML configuration
3. Service decorator defaults
4. Constructor defaults
### Accessing Configuration in Services
Inside a service, you can access configuration using the `ServiceConfig` class:
```python
from dynamo.sdk.lib.config import ServiceConfig
class MyService:
def __init__(self):
config = ServiceConfig.get_instance()
# Get with default value
self.model_name = config.get("MyService", {}).get("model_name", "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
self.temperature = config.get("MyService", {}).get("temperature", 0.7)
# Require a config value (raises error if missing)
self.api_key = config.require("MyService", "api_key")
# Get all config for this service
all_my_config = config.get("MyService", {})
```
### Parsing Configuration as CLI Arguments
For services that need to extract their configuration as command-line arguments (common when integrating and validating with external libraries), the SDK provides a helper method:
```python
from dynamo.sdk.lib.config import ServiceConfig
def setup_my_lib():
config = ServiceConfig.get_instance()
# Get all MyService config with prefix "lib_" as CLI args
cli_args = config.as_args("MyService", prefix="lib_")
# Returns: ["--option1", "value1", "--flag2", "--option3", "value3"]
# Pass to an external library's argument parser
lib_parser = MyLibArgumentParser()
lib_args = lib_parser.parse_args(cli_args)
return lib_args
```
This pattern is used in the example vLLM integration:
```python
def parse_vllm_args(service_name, prefix) -> AsyncEngineArgs:
config = ServiceConfig.get_instance()
vllm_args = config.as_args(service_name, prefix=prefix)
parser = FlexibleArgumentParser()
# Add custom arguments
parser.add_argument("--router", type=str, choices=["random", "round-robin", "kv"], default="random")
parser.add_argument("--remote-prefill", action="store_true")
# Add VLLM's arguments
parser = AsyncEngineArgs.add_cli_args(parser)
# Parse both custom and VLLM arguments
args = parser.parse_args(vllm_args)
# Convert to engine arguments
engine_args = AsyncEngineArgs.from_cli_args(args)
# Add custom args to the engine args
engine_args.router = args.router
engine_args.remote_prefill = args.remote_prefill
return engine_args
```
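The conversion from prefixed configuration keys to an argument list can be sketched as follows. This is an illustration under assumptions, not the SDK's `as_args` implementation: `to_cli_args` is a hypothetical function, and its rules (strip the prefix, emit a bare flag for `True`, omit `False`) are inferred from the example return value shown earlier.

```python
def to_cli_args(service_config: dict, prefix: str = "") -> list[str]:
    """Turn prefixed config keys into a flat CLI-style argument list."""
    args: list[str] = []
    for key, value in service_config.items():
        if not key.startswith(prefix):
            continue  # key is not meant for the target library
        flag = "--" + key[len(prefix):]
        if isinstance(value, bool):
            if value:
                args.append(flag)  # True becomes a bare flag
            continue  # False is omitted entirely
        args.extend([flag, str(value)])
    return args


# Hypothetical service config; only the "lib_" keys are forwarded.
config = {
    "lib_option1": "value1",
    "lib_flag2": True,
    "lib_option3": "value3",
    "workers": 4,
}
print(to_cli_args(config, prefix="lib_"))
# ['--option1', 'value1', '--flag2', '--option3', 'value3']
```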
### Overriding Service Decorator with ServiceArgs
The `ServiceArgs` section in YAML configuration allows you to override any parameter in the `@service` decorator:
```yaml
MyService:
ServiceArgs:
dynamo:
namespace: "staging" # Override namespace
resources:
gpu: 4 # Use more GPUs
workers: 8 # Scale up workers
```
This is particularly useful for:
- Changing resource allocations between environments
- Modifying worker counts based on expected load
- Switching between namespaces for different deployments
Under the hood, the `DynamoService` class reads these arguments during initialization:
```python
def _get_service_args(self, service_name: str) -> Optional[dict]:
"""Get ServiceArgs from environment config if specified"""
config_str = os.environ.get("DYNAMO_SERVICE_CONFIG")
if config_str:
config = json.loads(config_str)
service_config = config.get(service_name, {})
return service_config.get("ServiceArgs")
return None
```
### Complete Configuration Example
Here's a comprehensive example showing how all these pieces fit together:
1. First, define your service with basic defaults:
```python
@service(
dynamo={"namespace": "default"},
resources={"gpu": 1},
workers=1,
)
class LLMService:
def __init__(self, model_name="gpt-2", temperature=0.7, max_tokens=1024):
self.model_name = model_name
self.temperature = temperature
self.max_tokens = max_tokens
# Get additional configuration
config = ServiceConfig.get_instance()
service_config = config.get("LLMService", {})
# Extract service-specific parameters
self.cache_size = service_config.get("cache_size", 1000)
self.use_kv_cache = service_config.get("use_kv_cache", True)
```
2. Create a YAML configuration for production:
```yaml
# prod_config.yaml
LLMService:
ServiceArgs:
dynamo:
namespace: "prod"
resources:
gpu: 4
memory: "64Gi"
workers: 8
# Constructor parameters
model_name: "llama-3-70b-instruct"
temperature: 0.8
max_tokens: 2048
# Service-specific parameters
cache_size: 10000
use_kv_cache: true
```
3. Deploy with mixed configuration:
```bash
dynamo serve service:LLMService -f prod_config.yaml --LLMService.temperature=0.9
```
The service will receive the combined configuration with the command-line value taking precedence, resulting in effective configuration of:
- `dynamo.namespace = "prod"`
- `resources.gpu = 4`
- `workers = 8`
- `model_name = "llama-3-70b-instruct"`
- `temperature = 0.9` (from CLI override)
- `max_tokens = 2048`
- `cache_size = 10000`
- `use_kv_cache = true`
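
The layering above can be reproduced with a small merge routine. This is a sketch of the precedence idea only, assuming each source is a plain dictionary; `deep_merge` and the layer names are hypothetical and do not mirror the SDK's internals.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a copy of base with override applied; nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Layers ordered from lowest to highest precedence.
code_defaults = {
    "ServiceArgs": {"dynamo": {"namespace": "default"}, "resources": {"gpu": 1}, "workers": 1},
    "model_name": "gpt-2",
    "temperature": 0.7,
    "max_tokens": 1024,
}
yaml_config = {
    "ServiceArgs": {
        "dynamo": {"namespace": "prod"},
        "resources": {"gpu": 4, "memory": "64Gi"},
        "workers": 8,
    },
    "model_name": "llama-3-70b-instruct",
    "temperature": 0.8,
    "max_tokens": 2048,
    "cache_size": 10000,
    "use_kv_cache": True,
}
cli_overrides = {"temperature": 0.9}

effective: dict = {}
for layer in (code_defaults, yaml_config, cli_overrides):
    effective = deep_merge(effective, layer)

print(effective["temperature"], effective["ServiceArgs"]["workers"])
# 0.9 8
```

Because later layers only replace the keys they define, the YAML values survive everywhere except `temperature`, which the command-line layer overrides.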
### Service Configuration Best Practices
1. **Use the Service Decorator for Defaults**: Put reasonable defaults in the service decorator
2. **Use Constructor Parameters for Runtime Options**: Parameters that might change between deployments
3. **Use YAML for Environment Configuration**: Separate configuration by environment (dev/staging/prod)
4. **Use CLI for Quick Testing**: Override specific values for experimentation
5. **Document Configuration Keys**: Make sure to document all available configuration options
Following these practices will help you create flexible and maintainable Dynamo services that can be easily configured for different environments and use cases.
### Composing Services into an Graph
There are two main ways to compose services in Dynamo:
1. Use `depends()` (Recommended)
The depends() approach is the recommended way for production deployments:
- Automatically deploys all dependencies
- Creates a static inference graph at deployment time
- Provides type hints and better IDE support
2. Use `.link()` (Experimental)
Our `.link()` syntax is a flexible, experimental way to compose services. Linking lets you change how services are composed at runtime and observe the resulting behavior. Under the hood, it edits the dependency graph between services. This is useful for experimentation and development, but we suggest writing a static graph for your final production deployment.
### Understanding the `.link()` syntax
Let's take the example of a `Processor` component. This component can currently do two things:
1. Process a request and send it to a `Router` to decide what worker to send it to.
2. Process a request and send it to a `Worker` directly.
A snippet of the Processor is shown below:
```python
class Processor(ProcessMixIn):
"""
vLLM pre and post processing
"""
worker = depends(VllmWorker)
router = depends(Router)
# logic for processing a request based on router or worker
```
You can think of all the depends statements as the maximal set of edges for the processor. At runtime, you may want to follow only a single path. By default, our processor will spin up both the VllmWorker and Router as separate services (because `depends()` is defined for both). However, if you want to only spin up the Router, you can do this by linking the Router to the Processor which will remove the `worker` dependency from the Processor.
```python
Processor.link(Router)
```
This will remove the `worker` dependency from the Processor and only spin up the Router.
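
The edge-pruning behavior described above can be pictured with a toy model. `ServiceNode` and its methods are hypothetical stand-ins, not the SDK's classes; the sketch assumes that linking keeps only the dependency edge pointing at the linked service.

```python
from typing import Optional


class ServiceNode:
    """Toy stand-in for a service; not the SDK's service class."""

    def __init__(self, name: str, dependencies: Optional[dict] = None):
        self.name = name
        # Attribute name -> dependency, like `worker = depends(VllmWorker)`.
        self.dependencies = dict(dependencies or {})

    def link(self, target: "ServiceNode") -> "ServiceNode":
        """Keep only the edge that points at target and drop the others."""
        kept = {attr: dep for attr, dep in self.dependencies.items() if dep is target}
        if not kept:
            raise ValueError(f"{self.name} does not depend on {target.name}")
        self.dependencies = kept
        return target  # lets calls be chained in this toy model

    def services_to_start(self) -> list:
        """Walk the remaining edges and list every reachable service."""
        names = [self.name]
        for dep in self.dependencies.values():
            names.extend(dep.services_to_start())
        return names


worker = ServiceNode("VllmWorker")
router = ServiceNode("Router")
processor = ServiceNode("Processor", {"worker": worker, "router": router})

print(processor.services_to_start())  # ['Processor', 'VllmWorker', 'Router']
processor.link(router)
print(processor.services_to_start())  # ['Processor', 'Router']
```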
\ No newline at end of file
../../../../docs/API/sdk.md
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo Python Bindings
Python bindings for the Dynamo runtime system, enabling distributed computing capabilities for machine learning workloads.
## 🚀 Quick Start
1. Install `uv`: https://docs.astral.sh/uv/#getting-started
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```
2. Install `protoc` protobuf compiler: https://grpc.io/docs/protoc-installation/.
For example on an Ubuntu/Debian system:
```
apt install protobuf-compiler
```
3. Setup a virtualenv
```
uv venv
source .venv/bin/activate
uv pip install maturin
```
4. Build and install dynamo wheel
```
maturin develop --uv
```
## Run Examples
### Prerequisite
See [README.md](../runtime/README.md#prerequisites).
### Hello World Example
1. Start 3 separate shells, and activate the virtual environment in each
```
source .venv/bin/activate
```
2. In one shell (shell 1), run the example server instance-1
```
python3 ./examples/hello_world/server.py
```
3. (Optional) In another shell (shell 2), run the example server instance-2
```
python3 ./examples/hello_world/server.py
```
4. In the last shell (shell 3), run the example client:
```
python3 ./examples/hello_world/client.py
```
If you run the example client in rapid succession, and you started more than
one server instance above, you should see the requests from the client being
distributed across the server instances in each server's output. If only one
server instance is started, you should see the requests go to that server
each time.
## Performance
The performance impact of synchronizing the Python and Rust async runtimes
is a critical consideration when optimizing the performance of a highly
concurrent and parallel distributed system.
The Python GIL is a global critical section and is ultimately the death of
parallelism. To compound that, when Rust async futures become ready,
accessing the GIL on those async event loops needs to be considered carefully.
Under high load, accessing the GIL or performing CPU-intensive tasks
on the event loop threads can starve other async tasks of CPU resources.
However, performing a `tokio::task::spawn_blocking` is not without overhead
either.
If you are bouncing many small messages back and forth between the Python and
Rust event loops, where Rust requires GIL access, this is a pattern where
moving the code from Python to Rust will give you significant gains.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo SDK
# Table of Contents
- [Introduction](#introduction)
- [Installation](#installation)
- [Core Concepts](#core-concepts)
- [Writing a Service](#writing-a-service)
- [Configuring a Service](#configuring-a-service)
- [Composing Services into an Graph](#composing-services-into-an-graph)
## Introduction
Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. See [Python Bindings](./python_bindings.md).
Dynamo SDK is a layer on top of the core. It is a Python framework that makes it easy to create inference graphs and deploy them locally and onto a target K8s cluster. The SDK was heavily inspired by [BentoML's](https://github.com/bentoml/BentoML) open source deployment patterns and leverages many of its core primitives. The Dynamo CLI is a companion tool that allows you to spin up an inference pipeline locally, containerize it, and deploy it. You can find a toy hello-world example and instructions for deploying it [here](../examples/hello_world.md).
## Installation
The SDK can be installed using pip:
```bash
pip install ai-dynamo
```
## Core Concepts
As you read about each concept, it is helpful to have the [basic example](../examples/hello_world.md) up as well so you can refer back to it.
### Defining a Service
A Service is a core building block for a project. You can think of it as a logical unit of work. For example, you might have a service responsible for preprocessing and tokenizing and another service running the model worker itself. You define a service using the `@service` decorator on a class.
```python
@service(
dynamo={
"namespace": "dynamo",
},
resources={"gpu": 2, "cpu": "10", "memory": "20Gi"},
workers=1,
)
```
Key configuration options:
1. `dynamo`: Dictionary that defines the Dynamo configuration and enables/disables it. When enabled, a dynamo worker is created under the hood which can register with the [Distributed Runtime](../architecture/architecture.md)
2. `resources`: Dictionary defining resource requirements. The GPUs field is used for local and remote deployment. The other fields are used to determine resources when deploying to K8s.
3. `workers`: Number of parallel instances of the service to spin up.
### Writing a Service
Let's walk through an example to understand how you write a dynamo service.
```python
import ServiceB
@service(dynamo={"namespace": "dynamo"}, resources={"gpu": 1})
class ServiceA:
# Define service dependencies
service_b = depends(ServiceB)
def __init__(self, model_name="meta-llama/Llama-3.1-8B-Instruct"):
self.model_name = model_name
self.engine = None
@async_on_start
async def async_init(self):
# Initialize resources that require async operations
self.engine = await initialize_model_engine(self.model_name)
print(f"ServiceA initialized with model: {self.model_name}")
@async_on_shutdown
async def async_shutdown(self):
# Clean up resources
if self.engine:
await self.engine.shutdown()
print("ServiceA engine shut down")
@endpoint()
async def generate(self, request: ChatCompletionRequest):
# Call dependent service
processed_request = await self.service_b.preprocess(request)
# Use the engine to generate a response
response = await self.engine.generate(processed_request)
return response
```
#### Class-Based Architecture
Dynamo follows a class-based architecture similar to BentoML's, which makes it intuitive for users familiar with that framework. Each service is defined as a Python class, with the following components:
1. Class attributes for dependencies using `depends()`
2. An `__init__` method for standard initialization
3. Optional lifecycle hooks like `@async_on_start` and `@async_on_shutdown`
4. Endpoints defined with `@endpoint()`. Optionally, an endpoint can be given a name
via `@endpoint("my_endpoint_name")`, but otherwise defaults to the name of the
function being decorated if omitted.
This approach provides a clean separation of concerns and makes the service structure easy to understand.
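
The default-naming rule for endpoints can be illustrated with a toy decorator. `toy_endpoint` is a hypothetical stand-in written for this sketch; it only records a name on the function and does not reproduce what the SDK's `@endpoint()` does.

```python
from typing import Callable, Optional


def toy_endpoint(name: Optional[str] = None) -> Callable:
    """Toy decorator: record an endpoint name, defaulting to the function name."""

    def decorator(func: Callable) -> Callable:
        func.endpoint_name = name or func.__name__
        return func

    return decorator


class ServiceA:
    @toy_endpoint()
    async def generate(self, request):
        return request

    @toy_endpoint("my_endpoint_name")
    async def other(self, request):
        return request


print(ServiceA.generate.endpoint_name)  # generate
print(ServiceA.other.endpoint_name)     # my_endpoint_name
```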
#### Service Dependencies with `depends()`
The `depends()` function is a powerful BentoML feature that lets you create a dependency between services. When you use `depends(ServiceB)`, several things happen:
1. It ensures that `ServiceB` is deployed when `ServiceA` is deployed by adding it to an internal service dependency graph
2. It creates, under the hood, a client for the `ServiceB` endpoints that are being served.
3. You can call `ServiceB` endpoints as if they were local functions!
```python
# What happens internally when you use depends(ServiceB)
service_b = await runtime.namespace("dynamo").component("ServiceB").endpoint("preprocess").client()
# But with Dynamo SDK, you simply write:
service_b = depends(ServiceB)
# And then call methods directly:
result = await service_b.preprocess(data)
```
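To make the dependency-graph idea concrete, here is a small sketch of how a set of services to deploy could be derived from `depends()` declarations. It is an assumption-laden toy: the `depends` class and the `deployment_set` helper are hypothetical and only model the first step in the list above (deploying `ServiceB` whenever `ServiceA` is deployed).

```python
class depends:
    """Toy marker recording that one service class needs another."""

    def __init__(self, target):
        self.target = target


def deployment_set(service):
    """Return the service plus everything reachable through depends() markers."""
    seen, stack = [], [service]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.append(current)
        # Class attributes that are depends() markers are outgoing graph edges.
        for attr in vars(current).values():
            if isinstance(attr, depends):
                stack.append(attr.target)
    return seen


class ServiceC:
    pass


class ServiceB:
    service_c = depends(ServiceC)


class ServiceA:
    service_b = depends(ServiceB)


names = [s.__name__ for s in deployment_set(ServiceA)]
print(names)
```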
```{note}
Through the SDK, we also give you access to the underlying bindings when you need them. For example, complex logic might require you to create a client to another service directly, without declaring a dependency on it. To do so:
```
```python
import VllmWorker
# this runtime object gives you access to the underlying python bindings
runtime = dynamo_context["runtime"]
comp_ns, comp_name = VllmWorker.dynamo_address() # dynamo://{namespace}/{name}
print(f"[Processor] comp_ns: {comp_ns}, comp_name: {comp_name}")
self.worker_client = (
    await runtime.namespace(comp_ns)
    .component(comp_name)
    .endpoint("generate")
    .client()
)
```
This is used in some of our prebuilt examples and is a powerful way to leverage the benefits of the SDK while being able to access Dynamo's core primitives.
You can find more documentation on `depends()` [here](https://docs.bentoml.com/en/latest/build-with-bentoml/distributed-services.html#interservice-communication).
#### Lifecycle Hooks
Dynamo supports key lifecycle hooks to manage service initialization and cleanup. We currently only support a subset of BentoML's lifecycle hooks but are working on adding support for the rest.
##### `@async_on_start`
The `@async_on_start` hook is called when the service is started and is used to run an async process outside of the main `__init__` function.
```python
@async_on_start
async def async_init(self):
    # Perfect for operations that need to be awaited
    self.db = await connect_to_db()
    self.tokenizer = await load_tokenizer()
    self.engine = await initialize_engine(self.model)
```
This is especially useful for:
- Initializing external connections
- Setting up runtime resources that require async operations
##### `@async_on_shutdown`
The `@async_on_shutdown` hook is called when the service is shut down and handles cleanup.
```python
@async_on_shutdown
async def async_shutdown(self):
    if self._engine_context is not None:
        await self._engine_context.__aexit__(None, None, None)
    print("VllmWorkerRouterLess shutting down")
```
This ensures resources are properly released, preventing memory leaks and making sure external connections are properly closed. This is helpful to clean up vLLM engines that have been started outside of the main process.
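The ordering of the two hooks relative to `__init__` can be sketched with plain `asyncio`. This is only an illustration under stated assumptions: the decorators below just tag methods, and `run_hooks` is a hypothetical runner, not how the Dynamo SDK schedules hooks internally.

```python
import asyncio


def async_on_start(func):
    func._hook = "start"  # tag only; a runner decides when to call it
    return func


def async_on_shutdown(func):
    func._hook = "shutdown"
    return func


async def run_hooks(instance, kind):
    """Await every method on the instance tagged with the given hook kind."""
    for name in dir(instance):
        method = getattr(instance, name)
        if callable(method) and getattr(method, "_hook", None) == kind:
            await method()


class DemoWorker:
    def __init__(self):
        self.events = []  # __init__ stays synchronous

    @async_on_start
    async def async_init(self):
        await asyncio.sleep(0)  # stands in for an awaited setup call
        self.events.append("started")

    @async_on_shutdown
    async def async_shutdown(self):
        self.events.append("stopped")


async def main():
    worker = DemoWorker()
    await run_hooks(worker, "start")
    await run_hooks(worker, "shutdown")
    return worker.events


events = asyncio.run(main())
print(events)
```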
### Configuring a Service
Dynamo SDK provides a flexible configuration system that allows you to define service parameters through multiple methods:
1. Directly in the `@service` decorator
2. Through YAML configuration files
3. Via command-line arguments
4. Using environment variables
These methods can be used together with clear precedence rules, giving you fine-grained control over service configuration across different environments.
#### Configuration via Service Decorator
The most basic method is to specify parameters directly in the service decorator:
```python
@service(
    dynamo={"namespace": "prod"},
    resources={"gpu": 2, "cpu": "4", "memory": "16Gi"},
    workers=2,
)
class MyService:
    def __init__(self, model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", temperature=0.7):
        self.model_name = model_name
        self.temperature = temperature
```
This defines static configuration values in code. Note that the constructor parameters (`model_name` and `temperature`) are also configurable values that can be overridden.
#### Configuration via YAML
For more flexible configuration, especially across environments, you can use YAML files:
```yaml
# config.yaml
MyService:
  # Override service decorator settings
  ServiceArgs:
    workers: 4
    resources:
      gpu: 4

  # Service instance parameters
  model_name: "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
  temperature: 0.8
```
The YAML file has a hierarchical structure:
- Top level keys are service class names
- `ServiceArgs` contains parameters for the service decorator
- Other keys are passed as arguments to the service constructor
- Additional keys specific to the service can be accessed via the config system
#### Loading YAML Configuration
Use the CLI to load configuration from a YAML file:
```bash
dynamo serve service:MyService -f config.yaml
```
The configuration is parsed and stored in the `DYNAMO_SERVICE_CONFIG` environment variable, which is then passed to the service workers.
#### Configuration Precedence
When multiple configuration sources are used, they follow this precedence order (highest to lowest):
1. Command-line arguments
2. YAML configuration
3. Service decorator defaults
4. Constructor defaults
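A minimal sketch of how this precedence could be modeled, assuming each source has already been loaded into a plain dictionary. The `merge` and `resolve` helpers and all sample values are hypothetical and are not part of the Dynamo SDK; layers are applied from lowest to highest precedence so that later layers win.

```python
def merge(base: dict, override: dict) -> dict:
    """Recursively merge override into a copy of base; override wins on conflicts."""
    result = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = merge(result[key], value)
        else:
            result[key] = value
    return result


def resolve(*layers_low_to_high: dict) -> dict:
    """Apply layers in order, so the last layer has the highest precedence."""
    resolved: dict = {}
    for layer in layers_low_to_high:
        resolved = merge(resolved, layer)
    return resolved


# Hypothetical values for each configuration source.
constructor_defaults = {"temperature": 0.7, "model_name": "demo-model"}
decorator_defaults = {"workers": 1, "resources": {"gpu": 1}}
yaml_config = {"workers": 4, "resources": {"gpu": 4}, "temperature": 0.8}
cli_overrides = {"temperature": 0.9}

effective = resolve(constructor_defaults, decorator_defaults, yaml_config, cli_overrides)
print(effective)
```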
#### Accessing Configuration in Services
Inside a service, you can access configuration using the `ServiceConfig` class:
```python
from dynamo.sdk.lib.config import ServiceConfig


class MyService:
    def __init__(self):
        config = ServiceConfig.get_instance()

        # Get with default value
        self.model_name = config.get("MyService", {}).get("model_name", "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
        self.temperature = config.get("MyService", {}).get("temperature", 0.7)

        # Require a config value (raises error if missing)
        self.api_key = config.require("MyService", "api_key")

        # Get all config for this service
        all_my_config = config.get("MyService", {})
```
#### Parsing Configuration as CLI Arguments
For services that need to extract their configuration as command-line arguments (common when integrating and validating with external libraries), the SDK provides a helper method:
```python
from dynamo.sdk.lib.config import ServiceConfig


def setup_my_lib():
    config = ServiceConfig.get_instance()

    # Get all MyService config with prefix "lib_" as CLI args
    cli_args = config.as_args("MyService", prefix="lib_")
    # Returns: ["--option1", "value1", "--flag2", "--option3", "value3"]

    # Pass to an external library's argument parser
    lib_parser = MyLibArgumentParser()
    lib_args = lib_parser.parse_args(cli_args)
    return lib_args
```
This pattern is used in the example vLLM integration:
```python
def parse_vllm_args(service_name, prefix) -> AsyncEngineArgs:
    config = ServiceConfig.get_instance()
    vllm_args = config.as_args(service_name, prefix=prefix)
    parser = FlexibleArgumentParser()

    # Add custom arguments
    parser.add_argument("--router", type=str, choices=["random", "round-robin", "kv"], default="random")
    parser.add_argument("--remote-prefill", action="store_true")

    # Add VLLM's arguments
    parser = AsyncEngineArgs.add_cli_args(parser)

    # Parse both custom and VLLM arguments
    args = parser.parse_args(vllm_args)

    # Convert to engine arguments
    engine_args = AsyncEngineArgs.from_cli_args(args)

    # Add custom args to the engine args
    engine_args.router = args.router
    engine_args.remote_prefill = args.remote_prefill
    return engine_args
```
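To show what the config-to-arguments conversion might look like, here is a standalone sketch that uses only the standard library. The `as_args_sketch` function is a hypothetical approximation: it assumes prefixed keys are stripped of their prefix, booleans become bare flags, and everything else becomes a flag followed by its value. The real `ServiceConfig.as_args` may behave differently.

```python
import argparse


def as_args_sketch(config: dict, service_name: str, prefix: str = "") -> list:
    """Hypothetical approximation of turning service config into CLI arguments."""
    args = []
    for key, value in config.get(service_name, {}).items():
        # Skip decorator overrides and keys that do not carry the prefix.
        if key == "ServiceArgs" or not key.startswith(prefix):
            continue
        flag = "--" + key[len(prefix):]
        if isinstance(value, bool):
            if value:
                args.append(flag)  # True booleans become bare flags
        else:
            args.extend([flag, str(value)])
    return args


config = {
    "MyService": {
        "lib_option1": "value1",
        "lib_flag2": True,
        "lib_option3": "value3",
        "temperature": 0.7,  # no prefix, so it is not forwarded
    }
}
cli_args = as_args_sketch(config, "MyService", prefix="lib_")

# Any argparse-compatible parser can then consume the generated list.
parser = argparse.ArgumentParser()
parser.add_argument("--option1")
parser.add_argument("--flag2", action="store_true")
parser.add_argument("--option3")
parsed = parser.parse_args(cli_args)
print(cli_args, parsed)
```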
#### Overriding Service Decorator with ServiceArgs
The `ServiceArgs` section in YAML configuration allows you to override any parameter in the `@service` decorator:
```yaml
MyService:
  ServiceArgs:
    dynamo:
      namespace: "staging"  # Override namespace
    resources:
      gpu: 4  # Use more GPUs
    workers: 8  # Scale up workers
```
This is particularly useful for:
- Changing resource allocations between environments
- Modifying worker counts based on expected load
- Switching between namespaces for different deployments
Under the hood, the `DynamoService` class reads these arguments during initialization:
```python
def _get_service_args(self, service_name: str) -> Optional[dict]:
    """Get ServiceArgs from environment config if specified"""
    config_str = os.environ.get("DYNAMO_SERVICE_CONFIG")
    if config_str:
        config = json.loads(config_str)
        service_config = config.get(service_name, {})
        return service_config.get("ServiceArgs")
    return None
```
#### Complete Configuration Example
Here's a comprehensive example showing how all these pieces fit together:
1. First, define your service with basic defaults:
```python
@service(
    dynamo={"namespace": "default"},
    resources={"gpu": 1},
    workers=1,
)
class LLMService:
    def __init__(self, model_name="gpt-2", temperature=0.7, max_tokens=1024):
        self.model_name = model_name
        self.temperature = temperature
        self.max_tokens = max_tokens

        # Get additional configuration
        config = ServiceConfig.get_instance()
        service_config = config.get("LLMService", {})

        # Extract service-specific parameters
        self.cache_size = service_config.get("cache_size", 1000)
        self.use_kv_cache = service_config.get("use_kv_cache", True)
```
2. Create a YAML configuration for production:
```yaml
# prod_config.yaml
LLMService:
  ServiceArgs:
    dynamo:
      namespace: "prod"
    resources:
      gpu: 4
      memory: "64Gi"
    workers: 8

  # Constructor parameters
  model_name: "llama-3-70b-instruct"
  temperature: 0.8
  max_tokens: 2048

  # Service-specific parameters
  cache_size: 10000
  use_kv_cache: true
```
3. Deploy with mixed configuration:
```bash
dynamo serve service:LLMService -f prod_config.yaml --LLMService.temperature=0.9
```
The service receives the combined configuration with the command-line value taking precedence, resulting in effective configuration of:
- `dynamo.namespace = "prod"`
- `resources.gpu = 4`
- `workers = 8`
- `model_name = "llama-3-70b-instruct"`
- `temperature = 0.9` (from CLI override)
- `max_tokens = 2048`
- `cache_size = 10000`
- `use_kv_cache = true`
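The CLI override in step 3 can be worked through by hand. The sketch below is a hypothetical parser for `--Service.key=value` arguments, written to show why `temperature` ends up as `0.9`; it is not the parser that `dynamo serve` uses, and `parse_overrides` is an invented name.

```python
import json


def parse_overrides(argv):
    """Turn ['--Service.key=value', ...] into {'Service': {'key': value}}."""
    overrides = {}
    for arg in argv:
        if not arg.startswith("--") or "=" not in arg:
            continue
        dotted, raw = arg[2:].split("=", 1)
        service, _, key = dotted.partition(".")
        if not key:
            continue
        try:
            value = json.loads(raw)  # numbers, booleans, quoted strings
        except json.JSONDecodeError:
            value = raw  # fall back to the raw string
        overrides.setdefault(service, {})[key] = value
    return overrides


# Values taken from prod_config.yaml above, already loaded into a dict.
yaml_values = {"LLMService": {"temperature": 0.8, "max_tokens": 2048}}
overrides = parse_overrides(["--LLMService.temperature=0.9"])

# CLI values are applied last, so they win over the YAML values.
effective = {**yaml_values["LLMService"], **overrides.get("LLMService", {})}
print(effective)
```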
#### Service Configuration Best Practices
1. **Use the Service Decorator for Defaults**: Put reasonable defaults in the service decorator
2. **Use Constructor Parameters for Runtime Options**: Parameters that might change between deployments
3. **Use YAML for Environment Configuration**: Separate configuration by environment (dev/staging/prod)
4. **Use CLI for Quick Testing**: Override specific values for experimentation
5. **Document Configuration Keys**: Make sure to document all available configuration options
Following these practices helps create flexible and maintainable Dynamo services that can be easily configured for different environments and use cases.
### Deploying a Single Service
You can deploy a single service for local development, even if you have defined a dependency graph with `depends()`, by running `dynamo serve --service-name <ClassName> <entrypoint> <configuration arguments>`. You can see an example of this in our multinode documentation [here](../examples/multinode.md).
### Composing Services into a Graph
There are two main ways to compose services in Dynamo:
1. Use `depends()` (Recommended)
The depends() approach is the recommended way for production deployments:
- Automatically deploys all dependencies
- Creates a static inference graph at deployment time
- Provides type hints and better IDE support
2. Use `.link()` (Experimental)
Our `.link()` syntax is a flexible, experimental way to compose services. Linking lets you change the composition at runtime and observe the resulting behavior. Under the hood, it edits the dependency graph between services. This is useful for experimentation and development, but we suggest writing a static graph for your final production deployment.
#### Understanding the `.link()` syntax
Let's take the example of a `Processor` component. This component can currently do 2 things:
1. Process a request and send it to a `Router` to decide what worker to send it to.
2. Process a request and send it to a `Worker` directly.
Consider this snippet of the Processor:
```python
class Processor(ProcessMixIn):
    """
    vLLM pre and post processing
    """

    worker = depends(VllmWorker)
    router = depends(Router)

    # logic for processing a request based on router or worker
```
Think of all the `depends()` statements as the maximal set of edges for the Processor. At runtime, you may want to follow only a single path. By default, the Processor spins up both the VllmWorker and the Router as separate services (because `depends()` is defined for both). If you want to spin up only the Router, link the Router to the Processor; this removes the `worker` dependency from the Processor.
```python
Processor.link(Router)
```
This removes the `worker` dependency from the Processor and spins up only the Router.
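The edge-pruning behavior can be modeled with a few lines of plain Python. This is a simplified sketch, not the SDK's implementation: the `depends` marker, the `Service` base class, and the `declared`, `link`, and `active` helpers are all hypothetical, and the model only captures the idea that linking keeps the edge to the linked service and drops the other declared edges.

```python
class depends:
    """Toy marker for a declared dependency edge."""

    def __init__(self, target):
        self.target = target


class Service:
    """Toy base class; `link` here is a simplified model of the behavior."""

    @classmethod
    def declared(cls):
        # The maximal set of edges: every depends() attribute on the class.
        return {n: a.target for n, a in vars(cls).items() if isinstance(a, depends)}

    @classmethod
    def link(cls, target):
        # Keep only the edge to `target`; other declared edges are dropped.
        cls._active = {n: t for n, t in cls.declared().items() if t is target}

    @classmethod
    def active(cls):
        # Without any link() call, all declared edges stay active.
        return vars(cls).get("_active", cls.declared())


class VllmWorker(Service):
    pass


class Router(Service):
    pass


class Processor(Service):
    worker = depends(VllmWorker)
    router = depends(Router)


before = dict(Processor.active())
Processor.link(Router)
after = Processor.active()
print(sorted(before), sorted(after))
```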
\ No newline at end of file
# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line, and also
# from the environment for the first two.
SPHINXOPTS ?=
SPHINXBUILD ?= sphinx-build
SOURCEDIR = .
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

clean:
	@rm -fr ${BUILDDIR}
.PHONY: help Makefile clean
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%:
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
// Add RunLLM widget
document.addEventListener("DOMContentLoaded", function () {
var script = document.createElement("script");
script.type = "module";
script.id = "runllm-widget-script"
script.src = "https://widget.runllm.com";
script.setAttribute("version", "stable");
script.setAttribute("runllm-keyboard-shortcut", "Mod+j"); // cmd-j or ctrl-j to open the widget.
script.setAttribute("runllm-name", "dynamo");
script.setAttribute("runllm-position", "BOTTOM_RIGHT");
script.setAttribute("runllm-position-y", "120px");
script.setAttribute("runllm-position-x", "20px");
script.setAttribute("runllm-assistant-id", "758");
script.async = true;
document.head.appendChild(script);
});
[
{
"name": "0.1.0 (current release)",
"version": "0.1.0",
"url": "https://docs.nvidia.com/dynamo/latest/index.html"
},
{
"name": "older releases",
"version": "archives",
"url": "https://docs.nvidia.com/dynamo/archives/"
}
]
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
# Dynamo architecture and key features
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# High Level Architecture
Dynamo is high-throughput low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. Dynamo is designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLang or others) and captures LLM-specific capabilities such as
Dynamo is a high-throughput, low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments. Designed to be inference engine agnostic (supports TRT-LLM, vLLM, SGLang or others), it captures LLM-specific capabilities such as:
- **Disaggregated prefill & decode inference** – Maximizes GPU throughput and facilitates trade off between throughput and latency.
- **Dynamic GPU scheduling** – Optimizes performance based on fluctuating demand
......@@ -11,97 +28,83 @@ Dynamo is high-throughput low-latency inference framework designed for serving g
Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach
## Motivation
## Motivation behind Dynamo
Scaling inference for generative AI and reasoning models are fundamentally hard problems—not just in terms of performance, but also in correctness and efficiency. Today, most inference serving frameworks struggle to handle the sheer complexity of large-scale distributed execution.
Scaling inference for generative AI and reasoning models are fundamentally hard problems—not just in terms of performance, but also in correctness and efficiency. Most inference serving frameworks struggle to handle the sheer complexity of large-scale distributed execution.
There are multi-faceted challenges:
- *Extremely hard UX*: User experience is critical for distributed inference runtimes because managing large-scale inference systems is already complex, and poor usability further amplifies inefficiencies. Developers need a clear, intuitive way to define, optimize, and modify inference execution without wrestling with low-level infrastructure details. Without simple UX, inference runtimes remain inaccessible, prone to errors, and inefficient, slowing down model deployment and innovation. A modern distributed inference stack must be designed with usability at its core—empowering developers to scale AI effortlessly for agentic workflows while ensuring correctness and performance.
- *Difficult UX*: User experience is critical for distributed inference runtimes because managing large-scale inference systems is already complex, and poor usability further complicates matters. Developers need a clear, intuitive way to define, optimize, and update inference execution without wrestling with low-level infrastructure details. Without simple UX, inference runtimes remain inaccessible, prone to errors, and inefficient, hindering model deployment and innovation. A modern distributed inference stack must consider usability at its core—empowering developers to scale AI effortlessly for agentic workflows while ensuring correctness and performance.
- *GPU underutilization*: Traditional monolithic inference pipelines often leave GPUs idle due to the imbalance between prefill and decode stages. Prefill (where large prompt embeddings are generated) is highly compute-intensive, while decode (where tokens are generated) is latency-sensitive. A disaggregated approach is needed to separate prefill and decode, ensuring optimal GPU utilization and increasing overall throughput ([DistServe](https://arxiv.org/abs/2401.09670)).
- *GPU underutilization*: Traditional monolithic inference pipelines often leave GPUs idle due to the imbalance between prefill and decode stages. Prefill (which generates large prompt embeddings) is highly compute-intensive, while decode (which generates tokens) is latency-sensitive. A disaggregated approach that separates prefill and decode ensures optimal GPU utilization and increases overall throughput ([DistServe](https://arxiv.org/abs/2401.09670)).
- *Expensive KV cache re-computation*: When requests are not efficiently routed, KV caches (intermediate states of transformer model) often get flushed and recomputed, leading to wasted computation cycles and increased latency. KV-aware request routing eliminates redundant KV cache regeneration, significantly boosting efficiency.([DeepSeek](https://arxiv.org/abs/2501.12948))
- *Expensive KV cache re-computation*: When requests aren't efficiently routed, KV caches (intermediate states of the transformer model) often get flushed and recomputed, leading to wasted computation cycles and increased latency. KV-aware request routing eliminates redundant KV cache regeneration, significantly boosting efficiency ([DeepSeek](https://arxiv.org/abs/2501.12948)).
- *Memory bottlenecks*: Large-scale inference workloads demand extensive KV cache storage, which can quickly overwhelm GPU memory capacity. KV cache offloading across memory hierarchies (HBM, DDR, NVMe or remote storage) enables models to scale beyond GPU memory limits and speeds up latency. ([Mooncake](https://github.com/kvcache-ai/Mooncake/blob/main/doc/en/mooncake-store-preview.md), [AIBrix](https://blog.vllm.ai/2025/02/21/aibrix-release.html), [LMCache](https://lmcache.ai/))
- *Fluctuating demand and inefficient GPU allocation*: Inference workloads are use case specific and are inherently dynamic—demand surges unpredictably, yet traditional serving stacks allocate GPUs statically. Dynamic GPU scheduling ensures that resources are allocated based on real-time demand, preventing over-provisioning and improving utilization ([AzureTrace](https://github.com/Azure/AzurePublicDataset))
- *Fluctuating demand and inefficient GPU allocation*: Inference workloads are use-case specific and inherently dynamic: demand surges unpredictably, yet traditional serving stacks allocate GPUs statically. Dynamic GPU scheduling ensures that resources are allocated based on real-time demand, preventing over-provisioning and improving utilization ([AzureTrace](https://github.com/Azure/AzurePublicDataset)).
- *Inefficient data transfer*: Distributed inference workloads introduce unique and highly dynamic communication patterns that differ fundamentally from training. Unlike training, where worker roles remain largely static, inference requires real-time worker scaling, dynamic load balancing, and adaptive memory management—necessitating a communication layer that can efficiently handle these evolving requirements. Existing contemporary libraries are built for static, synchronous operations and lack the dynamicity needed for inference serving. While UCX provides high-performance networking, it demands deep networking expertise to configure correctly, making it impractical for broad inference use cases. What developers really need is a library, optimized for inference workloads that can abstract heterogeneous memory (remote memory, or storage) and dynamically selects the best transport backend via a unified API.
- *Inefficient data transfer*: Distributed inference workloads introduce unique and highly dynamic communication patterns that differ fundamentally from training. Unlike training, where worker roles remain largely static, inference requires real-time worker scaling, dynamic load balancing, and adaptive memory management—necessitating a communication layer that can efficiently handle these evolving requirements. Contemporary libraries are built for static, synchronous operations and lack the dynamicity needed for inference serving. While UCX provides high-performance networking, it requires deep networking expertise to configure correctly, making it impractical for broad inference use cases. Developers need a library optimized for inference workloads that can abstract heterogeneous memory (remote memory or storage) and dynamically select the best transport mechanism via a unified API.
To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Furthermore, Dynamo features NIXL (Nvidia Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access.
To address the growing demands of distributed inference serving, NVIDIA introduces Dynamo. This innovative product tackles key challenges in scheduling, memory management, and data transfer. Dynamo employs KV-aware routing for optimized decoding, leveraging existing KV caches. For efficient global memory management at scale, it strategically stores and evicts KV caches across multiple memory tiers—GPU, CPU, SSD, and object storage—enhancing both time-to-first-token and overall throughput. Dynamo features NIXL (NVIDIA Inference tranXfer Library), a new data transfer engine designed for dynamic scaling and low-latency storage access.
## High level architecture and key benefits
The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes four key features.
The following diagram outlines Dynamo's high-level architecture. To enable large-scale distributed and disaggregated inference serving, Dynamo includes five key features:
- [Dynamo Disaggregated Serving](disagg_serving.md)
- [Dynamo Smart Router](kv_cache_routing.md)
- [Dynamo Distributed KV Cache Manager](kv_cache_manager.md)
- [Dynamo KV Cache Block Manager](kvbm_intro.rst)
- [Planner](../guides/planner.md)
- [NVIDIA Inference Transfer Library (NIXL)](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)
Every component in the Dynamo architecture is independently scalable and portable. The API server can adapt to task-specific deployment. A smart router processes user requests to route them to the optimal worker for performance. Specifically, for Large Language Models (LLMs), Dynamo employs KV cache-aware routing, which directs requests to the worker with the highest cache hit rate while maintaining load balance, expediting decoding. This routing strategy leverages a KV cache manager that maintains a global radix tree registry for hit rate calculation. The KV cache manager also oversees a multi-tiered memory system, enabling rapid KV cache storage and eviction. This design results in substantial TTFT reductions, increased throughput, and the ability to process extensive context lengths.
![](images/architecture.png "Dynamo Architecture")
![Diagram of the NVIDIA Dynamo architecture for distributed AI inference, including User Requests, Planner, API Server, Smart Router, and Disaggregated Serving](../images/architecture.png "Dynamo Architecture")
Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if an increase in requests with long input sequences is detected, the Planner automatically scales up prefill workers to meet the heightened demand.
Dynamo enables dynamic worker scaling, responding to real-time deployment signals. These signals, captured and communicated through an event plane, empower the Planner to make intelligent, zero-downtime adjustments. For instance, if Dynamo detects an increase in requests with long input sequences, the Planner automatically scales up prefill workers to meet the heightened demand.
Beyond efficient event communication, data transfer across multi-node deployments is crucial at scale. To address this, Dynamo utilizes NIXL, a technology designed to expedite transfers through reduced synchronization and intelligent batching. This acceleration is particularly vital for disaggregated serving, ensuring minimal latency when prefill workers pass KV cache data to decode workers.
Dynamo prioritizes seamless integration. Its modular design allows it to work harmoniously with your existing infrastructure and preferred open-source components. To achieve optimal performance and extensibility, Dynamo leverages the strengths of both Rust and Python. Critical performance-sensitive modules are built with Rust for speed, memory safety, and robust concurrency. Meanwhile, Python is employed for its flexibility, enabling rapid prototyping and effortless customization.
Dynamo prioritizes seamless integration. Its modular design enables it to work harmoniously with your existing infrastructure and preferred open-source components. To achieve optimal performance and extensibility, Dynamo leverages the strengths of both Rust and Python. We built critical performance-sensitive modules with Rust for speed, memory safety, and robust concurrency. Meanwhile, we used Python for its flexibility, enabling rapid prototyping and effortless customization.
## Performance benefits of key features
### Disaggregated serving
Disaggregating prefill and decode significantly boosts performance, gaining efficiency the more GPUs that are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization.
<figure>
<img src='images/disagg_perf_benefit.png' alt='missing' width="1200" height="400" />
<p>Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL</p>
</figure>
Disaggregating prefill and decode boosts performance, gaining efficiency when more GPUs are involved in inference. For example, for Llama 70B, single-node tests show a 30% throughput/GPU improvement, while two-node setups achieve over 2X gains due to better parallelization.
<!--
![](images/disagg_perf_benefit.png)[1]
![Two scatter plots comparing the performance of disagg and baseline configurations on one node versus two nodes](../images/disagg_perf_benefit.png)
[1]: Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL
-->
* Tested on H100s with R1 Distilled Llama 70B model FP8 using vLLM. 3K ISL/ 150 OSL
The disaggregation of prefill and decode phases offers valuable flexibility. Since these phases directly correlate with time-to-first-token (TTFT) and inter-token latency (ITL) respectively, adjusting worker allocation allows for tailored performance. This enables optimization for specific service level agreements (SLAs), whether prioritizing faster TTFT, lower ITL, or higher throughput.
The disaggregation of prefill and decode phases offers valuable flexibility. Since these phases directly correlate with time-to-first-token (TTFT) and inter-token latency (ITL) respectively, adjusting worker allocation can provide tailored performance. This enables optimization for specific service level agreements (SLAs), whether prioritizing faster TTFT, lower ITL, or higher throughput.
### KV aware routing
![Two bar charts comparing Random routing and Dynamo with KV aware routing for Time To First Token (3x faster with Dynamo) and Avg request latency (2x faster with Dynamo).](../images/kv_routing.png)
<figure>
<img src='images/kv_routing.png' alt='missing' />
<p>Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL</p>
</figure>
<!--
![](images/kv_routing.png)[2]
* Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL
[2]: Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL
-->
Existing routing methods, including load-based routing, overlook the specific properties of LLMs that could improve performance. Addressing this, routing user queries to workers with the highest KV cache hit rate (rather than simply the least busy node) allows for immediate processing, even under heavy load. The figures above illustrate the effectiveness of KV aware routing on 100,000 real R1 user queries, achieving a 3x improvement in TTFT and a 2x reduction in average request latency. Depending on traffic, this approach can also enhance throughput.
Existing routing methods, including load-based routing, overlook the specific properties of LLMs that could improve performance. Addressing this, routing user queries to workers with the highest KV cache hit rate (rather than simply the least busy node) allows for immediate processing, even under heavy load. The preceding figures illustrate the effectiveness of KV aware routing on 100,000 real R1 user queries, achieving a 3x improvement in TTFT and a 2x reduction in average request latency. Depending on traffic, this approach can also enhance throughput.
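The trade-off between cache hit rate and load can be sketched with a toy scoring function. This is an illustration under stated assumptions and not Dynamo's router: it compares token prefixes with a linear scan instead of a radix tree, and the `pick_worker` function, the `load_weight` parameter, and the worker records are all hypothetical.

```python
def prefix_overlap(cached: list, request: list) -> int:
    """Count how many leading tokens of the request are already cached."""
    count = 0
    for cached_token, request_token in zip(cached, request):
        if cached_token != request_token:
            break
        count += 1
    return count


def pick_worker(request_tokens: list, workers: list, load_weight: float = 1.0) -> str:
    """Choose the worker with the best (hit rate - weighted load) score."""

    def score(worker: dict) -> float:
        best = max(
            (prefix_overlap(p, request_tokens) for p in worker["cached_prefixes"]),
            default=0,
        )
        hit_rate = best / max(len(request_tokens), 1)
        return hit_rate - load_weight * worker["load"]

    return max(workers, key=score)["name"]


workers = [
    {"name": "w0", "cached_prefixes": [[1, 2, 3, 4]], "load": 0.5},
    {"name": "w1", "cached_prefixes": [[1, 2, 9]], "load": 0.1},
    {"name": "w2", "cached_prefixes": [], "load": 0.0},
]
request = [1, 2, 3, 4, 5, 6, 7, 8]

balanced = pick_worker(request, workers)                       # load matters as much as hits
cache_first = pick_worker(request, workers, load_weight=0.2)   # hits matter more than load
print(balanced, cache_first)
```

With equal weighting, the lightly loaded worker with a partial hit wins; lowering `load_weight` shifts the choice to the busier worker that holds the longest cached prefix.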
### KV cache manager
Dynamo's design enables KV cache offloading to system CPU memory, and will be extended to support SSDs and networked object storage in subsequent releases. In many accelerated servers, the CPU (system) memory is much larger than the GPU memory and fast enough to store and serve KV cache data. The following plot highlights the performance gains achieved through system memory offloading, even with prefix caching enabled via inference engine. In a scenario involving 10 multi-turn conversations with 80 users, system memory offloading resulted in a 40% improvement in TTFT, demonstrating additional benefits beyond basic prefix caching.
Dynamo's design enables KV cache offloading to system CPU memory. In accelerated servers, the CPU (system) memory is often larger than the GPU memory and fast enough to store and serve KV cache data. The following plot highlights the performance gains achieved through system memory offloading, even with prefix caching enabled via inference engine. In a scenario involving 10 multi-turn conversations with 80 users, system memory offloading resulted in a 40% improvement in TTFT, demonstrating benefits beyond basic prefix caching.
![Line graph comparing Pure GPU prefix caching and Dynamo KV manager host offloading for TTFT (Time To First Token) across rounds with 80 users](../images/kv_manager.png)
<figure>
<img src='images/kv_manager.png' alt='missing' />
<p>Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL</p>
</figure>
* Tested with 100K requests to R1 using R1 Distilled Llama 70B FP8 on 2 nodes of H100s. Avg 4K ISL / 800 OSL
### NIXL
### NVIDIA Inference Transfer Library (NIXL)
NIXL streamlines data transfer through simplified synchronization and batching and simplified source and destination abstractions. NIXL is able to abstract data movement across different types of memory and fast storage, whereas other data transfer libraries typically support only one tier of memory. These enhancements yield significant performance gains, accelerating both time-to-first-token (TTFT) and overall throughput.
NIXL streamlines data transfer through simplified synchronization and batching and simplified source and destination abstractions. NIXL can abstract data movement across different types of memory and fast storage, whereas other data transfer libraries typically support a single tier of memory. These enhancements yield significant performance gains, accelerating both time-to-first-token (TTFT) and throughput.
## Acknowledgement
## Acknowledgements
We would like to acknowledge several open source software stacks for motivating us to create Dynamo.
We'd like to acknowledge several open source software stacks that motivated us to create Dynamo.
- vLLM and vLLM-project
- SGLang
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance
The prefill and decode phases of LLM requests have different computation characteristics and memory footprints. Disaggregating these phases into specialized LLM engines allows for better hardware allocation, improved scalability, and overall enhanced performance. For example, using a larger TP for the memory-bound decoding phase and a smaller TP for the computation-bound prefill phase allows both phases to be computed efficiently. In addition, for requests with long context, separating their prefill phase into dedicated prefill engines allows the ongoing decoding requests to be efficiently processed without being blocked by these long prefills.
......@@ -103,7 +121,7 @@ The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows r
#### Auto-Discovery for new workers
In Dynamo, we use etcd (a distributed key-value pair store) as a way to register and discover new components. When a new decode/aggregated worker starts, it will add it's endpoint information to etcd allowing the router to discover it and route requests to it. For the KV-cache transfer process, newly added decode workers will put memory descriptors of their kv cache (used in NIXL transfer) in etcd. Newly added prefill workers also register with etcd for discovery and simply start pulling prefill requests from the global prefill queue after they spin up. Prefill workers will lazy-pull the descriptors when they start serving a remote prefill request for the first time.
In Dynamo, we use `etcd` (a distributed key-value pair store) as a way to register and discover new components. When a new decode/aggregated worker starts, it adds its endpoint information to `etcd` allowing the router to discover it and route requests to it. For the KV-cache transfer process, newly added decode workers put memory descriptors of their KV cache (used in NIXL transfer) in `etcd`. Newly added prefill workers also register with `etcd` for discovery and simply start pulling prefill requests from the global prefill queue after they spin up. Prefill workers lazy-pull the descriptors when they start serving a remote prefill request for the first time.
You can watch this happen live by running the following:
......@@ -173,4 +191,4 @@ Since worker information is stored in etcd, we can shutdown workers by simply re
- Decode/aggregated worker endpoints are immediately removed from etcd so that they would not accept new requests. They finish any in-flight requests, shut down their engine, and exit gracefully
- Prefill workers stop pulling from the prefill queue and exit gracefully after all pending remote kv cache writes finish
You can also visualize this by revoking a workers etcd lease while it has ongoing requests. We have an example script in the repo that does this [here](../lib/bindings/python/examples/hello_world/revoke_lease.py)
\ No newline at end of file
You can also visualize this by revoking a worker's etcd lease while it has ongoing requests. Refer to this example script: [revoke_lease.py](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/revoke_lease.py).
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo Distributed Runtime
## Overview
Dynamo `DistributedRuntime` is the core infrastructure in Dynamo that enables distributed communication and coordination between different Dynamo components. It is implemented in Rust (`/lib/runtime`) and exposed to other programming languages via bindings (for example, the Python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure:
- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It maintains connection to external services (e.g., ETCD for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens.
- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It maintains connection to external services (e.g., etcd for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens.
- `Namespace`: A `Namespace` is a logical grouping of components that isolates different model deployments from one another.
- `Component`: A `Component` is a discoverable object within a `Namespace` that represents a logical unit of workers.
- `Endpoint`: An `Endpoint` is a network-accessible service that provides a specific service or function.
While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice, each dynamo components typically are deployed with its own process and thus has its own `DistributedRuntime` object. However, they share the same namespace to discover each other, which will be covered later.
While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice, each Dynamo component is typically deployed in its own process and thus has its own `DistributedRuntime` object. However, they share the same namespace to discover each other.
For example, the deployment configuration `examples/llm/configs/disagg.yaml` has four workers:
......@@ -21,27 +38,28 @@ Since the four workers are deployed in different processes, each of them have th
## Initialization
In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are two modes for `DistributedRuntime` initialization: dynamic and static. In static mode, components and endpoints are defined using known addresses and do not change during runtime. In dynamic modes, components and endpoints are discovered through the network and can change during runtime. We focus on the dynamic mode in the rest of this document. Static mode is basically dynamic mode without registration and discovery and hence does not rely on ETCD.
In this section, we explain what happens under the hood when `DistributedRuntime/Namespace/Component/Endpoint` objects are created. There are two modes for `DistributedRuntime` initialization: dynamic and static. In static mode, components and endpoints are defined using known addresses and do not change during runtime. In dynamic mode, components and endpoints are discovered through the network and can change during runtime. We focus on the dynamic mode in the rest of this document. Static mode is basically dynamic mode without registration and discovery and hence does not rely on etcd.
> [!CAUTION]
> The hierarchy and naming in ETCD and NATS might be changed and this document might not reflect the latest changes. However, the main idea would remain the same.
```{caution}
The hierarchy and naming in etcd and NATS may change over time, and this document might not reflect the latest changes. Regardless of such changes, the main concepts would remain the same.
```
- `DistributedRuntime`: When a `DistributedRuntime` object is created, it will establish connections to the following two services:
- ETCD (dynamic mode only): for service discovery. In static mode, `DistributedRuntime` can operate without ETCD.
- `DistributedRuntime`: When a `DistributedRuntime` object is created, it establishes connections to the following two services:
- etcd (dynamic mode only): for service discovery. In static mode, `DistributedRuntime` can operate without etcd.
- NATS (both static and dynamic mode): for messaging.
where ETCD and NATS are two global services (there could be multiple ETCD and NATS services for high availability).
where etcd and NATS are two global services (there could be multiple etcd and NATS services for high availability).
For ETCD, it also creates a primary lease and spin up a background task to keep the lease alive. All objects registered under this `DistributedRuntime` will use this lease_id to maintain their life cycle. There is also a cancellation token that is tied to the primary lease. When the cancellation token is triggered or the background task failed, the primary lease will be revoked or expired and the kv pairs stored with this lease_id will be removed.
- `Namespace`: `Namespace`s are primarily a logical grouping mechanism and is not registered in ETCD. It provides the root path for all components under this `Namespace`.
- `Component`: When a `Component` object is created, similar to `Namespace`, it will not be registered in ETCD. When `create_service` is called, it creates a NATS service group using `{namespace_name}.{service_name}` and registers a service in the registry of the `Component`, where the registry is an internal data structure that tracks all services and endpoints within the `DistributedRuntime`.
For etcd, it also creates a primary lease and spins up a background task to keep the lease alive. All objects registered under this `DistributedRuntime` use this lease_id to maintain their life cycle. There is also a cancellation token that is tied to the primary lease. When the cancellation token is triggered or the background task fails, the primary lease is revoked or expires, and the KV pairs stored with this lease_id are removed.
- `Namespace`: `Namespace`s are primarily a logical grouping mechanism and are not registered in etcd. A `Namespace` provides the root path for all components under it.
- `Component`: When a `Component` object is created, similar to `Namespace`, it isn't registered in etcd. When `create_service` is called, it creates a NATS service group using `{namespace_name}.{service_name}` and registers a service in the registry of the `Component`, where the registry is an internal data structure that tracks all services and endpoints within the `DistributedRuntime`.
- `Endpoint`: When an Endpoint object is created and started, it performs two key registrations:
- NATS Registration: The endpoint is registered with the NATS service group created during service creation. The endpoint is assigned a unique subject following the naming: `{namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}`.
- ETCD Registration: The endpoint information is stored in ETCD at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (i.e., two `PrefillWorker`s in one deployment) will share the same `Namespace`, `Componenet`, and `Endpoint` name. They are distinguished by their different primary `lease_id` of their `DistributedRuntime`.
- etcd Registration: The endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (for example, two `PrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by the different primary `lease_id` of their `DistributedRuntime`.
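The two naming schemes above can be sketched as plain string formatting. This is a minimal illustration, not Dynamo's implementation: the helper names `nats_subject` and `etcd_key` are hypothetical, the sample names and lease IDs are made up, and rendering the lease ID in hexadecimal for the etcd key is an assumption (the text only states hex for the NATS subject).

```python
def nats_subject(namespace: str, service: str, endpoint: str, lease_id: int) -> str:
    """Build {namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}."""
    return f"{namespace}.{service}.{endpoint}-{lease_id:x}"


def etcd_key(namespace: str, component: str, endpoint: str, lease_id: int) -> str:
    """Build /services/{namespace}/{component}/{endpoint}-{lease_id}.

    Assumption: the lease ID is rendered in hex here as well.
    """
    return f"/services/{namespace}/{component}/{endpoint}-{lease_id:x}"


# Two workers of the same type differ only in the lease ID of their runtime.
lease_a, lease_b = 0x1A2B, 0x3C4D  # made-up lease IDs
print(nats_subject("dynamo", "PrefillWorker", "generate", lease_a))
print(etcd_key("dynamo", "PrefillWorker", "generate", lease_a))
print(etcd_key("dynamo", "PrefillWorker", "generate", lease_b))
```

Because the lease ID is the only varying part, revoking a lease removes exactly one worker's entries without touching its peers.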
## Calling Endpoints
Dynamo uses `Client` object to call an endpoint. When a `Client` objected is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then sets up an ETCD watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The ETCD watcher will continuously updates the `Client` with the information, including `lease_id` and NATS subject of the available `Endpoint`s.
Dynamo uses a `Client` object to call an endpoint. When a `Client` object is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The etcd watcher continuously updates the `Client` with the information, including `lease_id` and NATS subject, of the available `Endpoint`s.
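The watcher behavior can be pictured with a small simulation. The sketch below does not talk to etcd; `EndpointTable` and the `put`/`delete` event names are hypothetical stand-ins for whatever the real watcher delivers, and the keys and subjects are made-up samples.

```python
from dataclasses import dataclass, field


@dataclass
class EndpointTable:
    """Hypothetical client-side view of the live instances of one endpoint."""
    prefix: str
    instances: dict = field(default_factory=dict)  # etcd key -> NATS subject

    def on_event(self, kind: str, key: str, value: str = "") -> None:
        """Apply one watch event; keys outside the watched prefix are ignored."""
        if not key.startswith(self.prefix):
            return
        if kind == "put":
            self.instances[key] = value
        elif kind == "delete":
            self.instances.pop(key, None)


table = EndpointTable(prefix="/services/dynamo/VllmWorker/generate")
table.on_event("put", "/services/dynamo/VllmWorker/generate-1a2b",
               "dynamo.VllmWorker.generate-1a2b")
table.on_event("put", "/services/dynamo/VllmWorker/generate-3c4d",
               "dynamo.VllmWorker.generate-3c4d")
# A revoked or expired lease shows up as a delete of that worker's key.
table.on_event("delete", "/services/dynamo/VllmWorker/generate-1a2b")
print(sorted(table.instances.values()))
```

Keeping the table keyed by the etcd path means a delete event needs no extra lookup to find the instance that went away.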
The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [PushRouter](/lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
......
......@@ -16,10 +16,10 @@ limitations under the License.
-->
# KV Cache Routing in Dynamo
# KV Cache Routing
This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.
## Dynamo Architecture
## Architecture
Dynamo's architecture consists of three key concepts:
- **Namespace**: Groups related components (similar to directories in a file system). In our examples, we use the label `dynamo`. This avoids collisions between two different dynamo graphs.
......@@ -28,11 +28,11 @@ Dynamo's architecture consists of three key concepts:
A Dynamo graph is a collection of components that are linked together to form a graph. There are two paths through the graphs. The request path and the response path. For LLMs the request path is single-in (a single message) and the response path is many-out (streamed output).
A common pattern is to spin up multiple of the same components which serve the same endpoints, for example, when you want to duplicate models to serve more requests. Each endpoint will get a unique identifier and you will have to tell Dynamo how to route requests between these endpoints.
A common pattern is to spin up multiple instances of the same component that serve the same endpoints, for example, when you want to duplicate models to serve more requests. Each endpoint gets a unique identifier, and you have to tell Dynamo how to route requests between these endpoints.
Colloquially, we will refer to a dynamo component that serves an endpoint for LLM inference as a **worker**.
Colloquially, we refer to a Dynamo component that serves an endpoint for LLM inference as a **worker**.
## Basic Routing in Dynamo
## Basic Routing
Dynamo supports several routing strategies when sending requests from one component to another component's endpoint.
First, we must create a client tied to a component's endpoint; we can do this using the labels defined above. Here we are getting a client tied to the `generate` endpoint of the `VllmWorker` component.
......@@ -76,7 +76,7 @@ To get a feel for how KV Cache management works on a single worker with KV Cache
- The system applies an eviction policy (e.g., LRU, LFU) to identify blocks for removal
- Selected blocks are evicted from the cache
- **KVPublisher emits a KV removed event notifying KVIndexer about the removed block.**
- Alternatively, some systems may offload less-frequently used blocks to CPU memory. See [KV Offloading in Dynamo](kv_cache_manager.md).
- Alternatively, some systems may offload less-frequently used blocks to CPU memory.
7. KV computation:
- For new blocks, the model computes key and value tensors
- These tensors are stored in the newly allocated cache blocks
......@@ -111,7 +111,7 @@ In the above image, our cost function is (KV match - Load) so we select Worker 2
- **Worker 2 = (0.50 - 0.50) = 0**
- Worker 3 = (0.75 - 0.80) = -0.05
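The selection rule above can be checked with a few lines of Python. This is a sketch of the stated cost function (KV match minus load) using only the two workers whose values are listed above; the function name `routing_cost` is hypothetical, and rounding to two decimals is added only to keep floating-point noise out of the printed scores.

```python
def routing_cost(kv_match: float, load: float) -> float:
    """Cost function from the text: higher is better."""
    return kv_match - load


# (KV match, load) pairs taken from the worked example above.
workers = {
    "Worker 2": (0.50, 0.50),
    "Worker 3": (0.75, 0.80),
}

scores = {name: round(routing_cost(kv, load), 2) for name, (kv, load) in workers.items()}
best = max(scores, key=scores.get)
print(scores)
print("selected:", best)
```

Worker 3 has the better cache match, but its higher load pushes its score below that of Worker 2.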
## Dynamo Events
## Events
In Dynamo, we want to support KV Cache Routing and load balancing for many backends that have different implementations of KV Cache and record different metrics. To that end, we built a KVPublisher that can be plugged into any framework to publish KV Events and a KvMetricsPublisher that can publish Metric Events.
......@@ -169,7 +169,10 @@ Sample Output:
543219876: 7,
}
```
> **Note**: This example is for building understanding, it will not run outside of the context of dynamo serve. See the examples/ folder for runnable examples.
```{note}
This example is designed to help you understand KV cache routing; it won't run outside of the context of dynamo serve. See the examples/ directory for runnable examples.
```
### KvMetricsPublisher
We added a KvMetrics Publisher which sends the following metrics to the KvMetricsAggregator:
......@@ -217,10 +220,12 @@ Number of Requests Waiting: 1
GPU Prefix Cache Hit Rate: 0.1
***
```
> **Note**: This example is for building understanding, it will not run outside of the context of dynamo serve. See the examples/ folder for runnable examples.
```{note}
This example is for building understanding; it will not run outside of the context of dynamo serve. See the examples/ folder for runnable examples.
```
### [KV Router](../examples/llm/components/kv_router.py)
### [KV Router](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/components/kv_router.py)
The Router component makes intelligent worker selection decisions:
1. Receives incoming requests as tokens
2. Queries the KVIndexer to find potential cache hits across workers
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# KVBM Architecture
The KVBM serves as a critical infrastructure component for scaling LLM inference workloads efficiently. By cleanly separating runtime logic from memory management, and by enabling distributed block sharing, KVBM lays the foundation for high-throughput, multi-node, and memory-disaggregated AI systems.
![A block diagram showing a layered architecture view of Dynamo KV Block manager.](../images/kvbm-arch.png)
**High level layered architecture view of Dynamo KV Block manager and how it interfaces with different components of LLM inference ecosystem**
The KVBM has three primary logical layers. The top layer, the LLM inference runtimes (TRTLLM, vLLM, and SGLang), integrates through a dedicated connector module to the Dynamo KVBM module. These connectors act as translation layers, mapping runtime-specific operations and events into the KVBM’s block-oriented memory interface. This decouples memory management from the inference runtime, enabling backend portability and providing memory tiering.
The middle layer, the KVBM layer, encapsulates the core logic of the KV block manager and serves as the runtime substrate for managing block memory. The KVBM adapter layer normalizes the representations and data layout for the incoming requests across runtimes and forwards them to the core memory manager. The KVBM and the core modules implement required internal functionality, such as table lookups, memory allocation, block layout management, lifecycle and state transitions, and policy-based block reuse or eviction. The KVBM layer also has required abstractions for external components to override or augment its behavior.
The last layer, the NIXL layer, provides unified support for all data and storage transactions. NIXL enables P2P GPU transfers, RDMA and NVLink remote memory sharing, and dynamic block registration and metadata exchange, and it provides a plugin interface for storage backends.
NIXL integrates with several backends:
- Block memory (for example, GPU HBM, Host DRAM, Remote DRAM, Local SSD when exposed as a block device)
- Local file system (for example, POSIX)
- Remote file system (for example, NFS)
- Object stores (for example, S3-compatible)
- Cloud storage (for example, blob storage APIs)
**[NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md)** abstracts away the registration and integration complexity for each backend via a custom, optimizable plugin architecture and enables memory blocks to be published, serialized, and accessed remotely, allowing the disaggregation of compute and memory across nodes. Combined with the Dynamo KV Block Manager (KVBM), storage providers no longer need to retrofit or optimize individual LLM inference engines. Instead, they can focus on tuning their own stack and providing optimized endpoints, knowing that integration is smooth, standardized, and efficient. And for those who *do* want to go further, Dynamo KVBM offers a clean separation of concerns, making custom optimization not only possible, but simple.
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Understanding KVBM components
The design of the KVBM is inspired by the vLLM and SGLang KV block managers, with a twist drawn from the historical memory tiering designs used in general GPU programming \[1,2\]. Figure 2 shows the internal architecture of KVBM and how it works across workers using NIXL.
![Internal architecture and key modules in the Dynamo KVBM. ](../images/kvbm-internal-arch.png)
**Internal architecture and key modules in the Dynamo KVBM**
#### KvBlockManager as Orchestration Layer
The `KvBlockManager<H, D>` acts as a coordinator across the host (CPU), device (GPU), and remote memory tiers by managing per-backend block pools and exposing consistent block lifecycle APIs. It tracks KV block locations across device memory (G1), CPU memory within and across nodes (G2), local/pooled SSDs (G3), and remote storage (G4). G1-G4 are the key tiers enabled by KVBM. Note that KVBM treats G4 storage as an opaque blob store, unaware of internal layout optimizations.
`KvBlockManager<H, D>` owns:
* A device-side `BlockPool<Device>`
* A host-side `BlockPool<Host>`
* A remote NIXL agent that allows communication and memory sharing across nodes
* A block set registry for remote lookup and import/export of block metadata
Implementation-wise, `KvBlockManagerState` holds the actual logic: it's initialized by `KvBlockManagerConfig`, which merges runtime, model, and layout configs. Remote awareness is injected by `NixlOptions`.
#### Block Layout and Memory Mapping
Each block is a 2D array `[num_layers][page_size × inner_dim]`. The memory layout is abstracted by the `BlockLayout` trait. The default implementation is `FullyContiguous`, which stores all layers for all blocks in one region with alignment-aware stride computation:
```
block_stride_in_bytes = align_up(num_layers × layer_stride, alignment);
```
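The stride formula above can be worked through in Python. This is a minimal sketch: `align_up` is written here as the usual round-up-to-multiple helper, which is an assumption about its meaning, and the numbers are made up for illustration.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return (value + alignment - 1) // alignment * alignment


def block_stride_in_bytes(num_layers: int, layer_stride: int, alignment: int) -> int:
    """Distance in bytes between the starts of two consecutive blocks."""
    return align_up(num_layers * layer_stride, alignment)


# Made-up sizes: 3 layers of 1000 bytes each, blocks aligned to 256 bytes.
stride = block_stride_in_bytes(num_layers=3, layer_stride=1000, alignment=256)
print(stride)                      # 3000 rounded up to a multiple of 256
print([i * stride for i in range(3)])  # byte offsets of blocks 0, 1, 2
```

With these values the raw block size of 3000 bytes is padded to 3072, so every block starts on a 256-byte boundary.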
This memory layout is shared by both CPU and GPU pools but uses storage-specific backends:
* `DeviceStorage` → CUDA device buffer
* `PinnedStorage` → page-locked host memory
* `SystemStorage` → CPU heap memory (fallback/test)
* `NixlStorage` → remote memory via NIXL RDMA handles (includes storage)
Each layout is constructed using a `LayoutConfig`, and storage is either passed directly or allocated via a `StorageAllocator`.
#### BlockPool and Memory Pools (Active \+ Inactive)
Each `BlockPool<T>` (where `T` is `DeviceStorage`, `PinnedStorage`, etc.) tracks two sub-pools:
* `ActivePool`: Contains blocks currently in use by sequences
* `InactivePool`: Recycled blocks ready for allocation. Think free list.
When a token block is requested (for example, `get_mutable_block()`), the allocator pops from `InactivePool`, transitions its state, and returns a writable handle. On sequence commit or eviction, blocks are reset and returned to the inactive pool.
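The active/inactive split can be sketched as a free list plus a set. The class below is a toy model, not the KVBM implementation: blocks are plain integer IDs, `release` is a hypothetical name for the reset-and-return step, and eviction policy is left out.

```python
from collections import deque


class BlockPool:
    """Toy pool with an inactive free list and an active set (illustrative only)."""

    def __init__(self, num_blocks: int) -> None:
        self.inactive = deque(range(num_blocks))  # recycled block IDs, ready to allocate
        self.active = set()                       # block IDs currently used by sequences

    def get_mutable_block(self) -> int:
        """Pop a block from the inactive pool and mark it active."""
        if not self.inactive:
            raise RuntimeError("no free blocks; eviction would be needed")
        block_id = self.inactive.popleft()
        self.active.add(block_id)
        return block_id

    def release(self, block_id: int) -> None:
        """Reset a block and return it to the inactive pool."""
        self.active.remove(block_id)
        self.inactive.append(block_id)


pool = BlockPool(num_blocks=2)
first = pool.get_mutable_block()
second = pool.get_mutable_block()
pool.release(first)                 # block goes back on the free list
reused = pool.get_mutable_block()   # the released block is handed out again
print(first, second, reused)
```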
The state machine (`BlockState`) that tracks the block lifecycle transitions includes:
| State | Description | Ownership | Valid Actions / Transitions |
| ----- | ----- | ----- | ----- |
| Reset | Block is uninitialized or has been reset. No sequence is associated. | Held in InactivePool, reusable | init\_sequence(salt\_hash) → Partial |
| Partial | Block is being filled with tokens for a new sequence. In-progress. | Owned by the sequence creator | add\_token() / add\_tokens() (accumulate); commit() → Complete; reset() → Reset |
| Complete | Block is fully filled with token data but not yet visible to others. | Still owned by creator thread | register() → Registered; reset() → Reset |
| Registered | Block is finalized and visible for reuse. Available in the deduplication cache. Can be used for lookups. | Shared ownership (global registry) | Auto drop() → triggers Remove event and transitions to Reset |
The valid KVBM block manager transitions are:
| From → To | Trigger | Validation |
| ----- | ----- | ----- |
| Reset → Partial | init\_sequence(salt\_hash) | Must not be in use |
| Partial → Complete | commit() | Must be full |
| Complete → Registered | register() | Must be finalized |
| Registered → Reset | Drop of RegistrationHandle | Automatic |
| Partial → Reset | Aborted sequence | Explicit or drop |
| Complete → Reset | Invalidated | Explicit or drop |
As an example of a block's lifecycle in the KVBM block manager, suppose a sequence requests a new KV block:
1. Allocator pops from InactivePool → Block is in Reset
2. init\_sequence() → Transitions to Partial
3. Tokens are appended → State remains Partial
4. On full → commit() → State becomes Complete
5. register() → Block is hashed and moved to Registered. The block can now be used for lookups.
6. On eviction or end-of-life → drop() of RAII handle returns block to Reset
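The transition table and the lifecycle walk-through above can be encoded as a small state machine. This is a sketch for illustration: the `transition` helper is hypothetical, and the per-transition validation (for example, "must be full") is reduced to comments.

```python
from enum import Enum


class BlockState(Enum):
    RESET = "Reset"
    PARTIAL = "Partial"
    COMPLETE = "Complete"
    REGISTERED = "Registered"


# Allowed (from, to) pairs, copied from the transition table above.
VALID_TRANSITIONS = {
    (BlockState.RESET, BlockState.PARTIAL),        # init_sequence(salt_hash)
    (BlockState.PARTIAL, BlockState.COMPLETE),     # commit(), block must be full
    (BlockState.COMPLETE, BlockState.REGISTERED),  # register()
    (BlockState.REGISTERED, BlockState.RESET),     # drop of RegistrationHandle
    (BlockState.PARTIAL, BlockState.RESET),        # aborted sequence
    (BlockState.COMPLETE, BlockState.RESET),       # invalidated
}


def transition(current: BlockState, target: BlockState) -> BlockState:
    """Return the new state, or raise if the table does not allow the move."""
    if (current, target) not in VALID_TRANSITIONS:
        raise ValueError(f"invalid transition: {current.value} -> {target.value}")
    return target


# Walk the example lifecycle: Reset -> Partial -> Complete -> Registered -> Reset.
state = BlockState.RESET
for nxt in (BlockState.PARTIAL, BlockState.COMPLETE,
            BlockState.REGISTERED, BlockState.RESET):
    state = transition(state, nxt)
print(state.value)
```

Skipping a step, such as going from Reset straight to Complete, raises a `ValueError` in this sketch.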
#### Lifecycle Management via RAII and Event Plane
The system uses RAII for memory lifecycle management. Every block holds metadata and registration state, and registration is coupled with an `EventManager`. On registration and drop:
* `PublishHandle` triggers Register events
* Dropping it triggers Remove events
This pattern ensures consistency for shared memory tracking across workers without requiring explicit deallocation logic. The events are propagated in the Dynamo Events plane and any Dynamo component subscribed to the events plane can listen to these changes. Note that even the storage provider can subscribe to the events plane and create an internal prefix tree representation tailored and optimized for their platform.
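The register-on-create, remove-on-drop pattern can be mimicked in Python with a context manager, which is the closest analog to RAII since Python has no deterministic destructors. Everything here is a stand-in: `publish_handle` is a hypothetical function, the `events` list plays the role of the events plane, and the block hash is a made-up value.

```python
from contextlib import contextmanager

events = []  # stand-in for the events plane


@contextmanager
def publish_handle(block_hash: int):
    """Emit a Register event on entry and a Remove event on exit."""
    events.append(("Register", block_hash))
    try:
        yield block_hash
    finally:
        # Runs even if the body raises, so the Remove event is never skipped.
        events.append(("Remove", block_hash))


with publish_handle(0xABC):
    pass  # the block is visible to subscribers while the handle is alive

print(events)
```

Tying the Remove event to scope exit means callers cannot forget to deregister a block, which is the property the RAII design is after.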
#### Remote Memory Integration via NIXL
The NIXL agent exposes remote memory buffers using `NixlBlockSet`, `RemoteBlocks`, and layout descriptors. Key operations include:
* `nixl_register()`: Registers memory region with NIXL runtime
* `serialize() / deserialize()`: Converts layout and memory into transferable descriptors
* `import_remote_blockset()`: Loads remote node’s block layouts into the manager
* `get_remote_blocks_mutable()`: Fetches transferable memory views from another node
`RemoteBlocks` is a lightweight abstraction over shared memory for cross-node block usage (via UCX or other backends).
The left side of Figure 2 illustrates a bidirectional remote memory registration and layout synchronization protocol between workers (e.g., Worker 1 and Worker 2) using NIXL.
1. *Agent Creation & Memory Registration:*
Each worker independently sets up a NixlAgent:
* Registers its memory regions (e.g., device memory) via nixl\_register().
* These regions correspond to blocks managed in the local BlockPool.
Once memory is registered, NIXL creates remote-accessible descriptors, which are bound to the memory layout.
2. *Metadata exchange:*
After memory registration, workers exchange serialized layout metadata, encapsulated in a `SerializedNixlBlockLayout`.
Why is this step critical?
* LLM inference workloads often differ in *tensor parallel (TP)* configurations.
* Worker 1 might have TP=4, while Worker 2 has TP=8.
* Hence, even if both systems use similar \`FullyContiguous\` layouts, their internal slicing and alignment assumptions differ.
* The metadata exchange bridges this semantic mismatch by sharing:
* LayoutConfig (num\_layers, page\_size, inner\_dim, dtype)
* BlockSetID
* Base address \+ stride information (including alignment)
* Device ID \+ memory type (host/device)
* Once shared, each worker can reconstruct the layout on its side using deserialize().
This enables NIXL to:
* Understand where each layer/block lives
* Perform correct gather-scatter operations during RDMA-like transfers
Without this step, remote fetches would result in data corruption or misaligned tokens.
3. *Serialization & Deserialization: Making Layouts Portable*
In the serialization stage, KVBM exports the layout: \`FullyContiguous::serialize()\` encodes:
* FullyContiguousConfig
* base\_offset
* Physical memory descriptors (NixlStorage) including:
* Memory type (VRAM, DRAM)
* Address & size
* Device ID
This is sent using a NIXL transfer and then injected into the KVBM scheduler state. In the deserialization stage, \`SerializedNixlBlockLayout::deserialize()\` rehydrates this into:
* A fully reconstructed memory layout view
* A local representation of a remote memory slice with correct offsets and size semantics
* A view that enables direct access to remote memory with consistent logical semantics
This guarantees that even across different system configurations (hardware or LLM shape), both parties agree on the memory view for each KV block.
4. *Ownership handles and lifetime tracking*
Memory ownership in NIXL is tightly coupled with RAII-based handles:
* Registering a block returns a \`PublishHandle\`, which wraps a \`RegistrationHandle\`.
* Dropping this handle automatically publishes a Remove event, which:
* Deregisters the block from the NIXL layer
* Removes it from the remote block registry
* This ensures that once the block is evicted from the cache or no longer used in inference, all references are invalidated cleanly across nodes.
This mechanism avoids:
* Stale memory access
* Dangling pointers on GPU or host
* Manual deregistration bugs
The system can batch and publish registration events via a Publisher, optimizing performance under high concurrency.
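The serialization and deserialization stage in step 3 can be sketched as a round trip of a layout descriptor through a byte buffer. This is an illustration under stated assumptions, not the KVBM wire format: `LayoutDescriptor`, its fixed little-endian encoding, `dtype_bytes` as a stand-in for dtype, and the contiguous addressing formula are all hypothetical. Only the field names follow the LayoutConfig list above.

```rust
/// Hypothetical, simplified layout descriptor (not the actual KVBM type).
#[derive(Debug, Clone, PartialEq)]
pub struct LayoutDescriptor {
    pub num_layers: u64,
    pub page_size: u64,
    pub inner_dim: u64,
    pub dtype_bytes: u64, // assumed stand-in for dtype: bytes per element
    pub base_addr: u64,
    pub device_id: u64,
}

impl LayoutDescriptor {
    /// Encodes every field as a little-endian u64, in declaration order.
    pub fn serialize(&self) -> Vec<u8> {
        let fields = [
            self.num_layers,
            self.page_size,
            self.inner_dim,
            self.dtype_bytes,
            self.base_addr,
            self.device_id,
        ];
        fields.iter().flat_map(|f| f.to_le_bytes()).collect()
    }

    /// Rebuilds the descriptor; returns None if the buffer has the wrong length.
    pub fn deserialize(bytes: &[u8]) -> Option<Self> {
        if bytes.len() != 6 * 8 {
            return None;
        }
        let mut words = bytes.chunks_exact(8).map(|chunk| {
            let mut buf = [0u8; 8];
            buf.copy_from_slice(chunk);
            u64::from_le_bytes(buf)
        });
        Some(LayoutDescriptor {
            num_layers: words.next()?,
            page_size: words.next()?,
            inner_dim: words.next()?,
            dtype_bytes: words.next()?,
            base_addr: words.next()?,
            device_id: words.next()?,
        })
    }

    /// Bytes occupied by one layer of one block (assumed contiguous packing).
    pub fn layer_stride(&self) -> u64 {
        self.page_size * self.inner_dim * self.dtype_bytes
    }

    /// Address of a given layer inside a given block, derived only from the metadata.
    pub fn layer_addr(&self, block: u64, layer: u64) -> u64 {
        let block_stride = self.num_layers * self.layer_stride();
        self.base_addr + block * block_stride + layer * self.layer_stride()
    }
}
```

Because the receiving side computes `layer_addr` from the deserialized fields alone, both workers resolve a given block and layer to the same remote offset, which is the property the metadata exchange is meant to provide.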
#### Storage backends and pluggability
Integrating KVBM with a storage backend is straightforward: extend or wrap \`NixlEnabledStorage\` to support cross-node RDMA registration. All layouts and block pools are generic over these backends, allowing for fine-grained control over memory tiers. We are deferring detailed integration guidance as we are actively collaborating with storage partners to simplify and standardize these integration paths.
```
An example system architecture

                +------------------------------+
                | Distributed Inference engine |
                +------------------------------+
                                |
                                v
                +------------------------------+
                |   Dynamo KV Block Manager    |
                +------------------------------+
                                |
               +----------------+----------------+
               |                                 |
               v                                 v
+------------------------------+  +----------------------------+
| NIXL Storage Agent           |  | Event Plane                |
| - Volume registration        |  | - NATS-based Pub/Sub       |
| - get()/put() abstraction    |  | - StoreEvent / RemoveEvent |
+------------------------------+  +----------------------------+
               |                                 |
               v                                 v
+-----------------------------+   +-----------------------------+
| G4 Storage Infrastructure   |   | Storage Provider Subscriber |
| (SSD, Object store, etc.)   |   | - Parse Events              |
| - Store KV blocks           |   | - Build fast tree/index     |
+-----------------------------+   | - Optimize G4 tiering       |
                                  +-----------------------------+
```
For now, the following breakdown provides a high-level understanding of how KVBM interacts with external storage using the NIXL storage interface and the Dynamo Event Plane:
##### NIXL Storage Interface (for Backend Integration)
The NIXL interface abstracts volume interaction and decouples it from mounting, metadata tracking, or direct system I/O. It provides:
* registerVolume(descriptor): Register a logical volume for KV cache data.
* unregisterVolume(): Cleanly deregister and release volume mappings.
* get() / put(): Block-level APIs used by KVBM to fetch and store token blocks.
These abstractions allow backends to be integrated without tying into the host’s file system stack, enabling safe interaction with block devices, local filesystems, and RDMA-capable volumes. Please note that these APIs are still being finalized.
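Since these APIs are still being finalized, the following is only a sketch of what a block-level backend interface of this shape could look like. The `VolumeBackend` trait, its method signatures, and the `InMemoryVolume` type are hypothetical; only the operation names (register, unregister, get, put) come from the list above.

```rust
use std::collections::HashMap;

/// Hypothetical block-level backend interface modelled on the operations listed above.
pub trait VolumeBackend {
    fn register_volume(&mut self, descriptor: &str) -> Result<(), String>;
    fn unregister_volume(&mut self);
    fn put(&mut self, sequence_hash: u64, block: Vec<u8>) -> Result<(), String>;
    fn get(&self, sequence_hash: u64) -> Option<&[u8]>;
}

/// Toy backend that keeps blocks in a map; a real backend would talk to storage.
#[derive(Default)]
pub struct InMemoryVolume {
    volume: Option<String>,
    blocks: HashMap<u64, Vec<u8>>,
}

impl VolumeBackend for InMemoryVolume {
    fn register_volume(&mut self, descriptor: &str) -> Result<(), String> {
        if self.volume.is_some() {
            return Err("a volume is already registered".to_string());
        }
        self.volume = Some(descriptor.to_string());
        Ok(())
    }

    fn unregister_volume(&mut self) {
        // Releasing the volume also drops every block mapping it held.
        self.volume = None;
        self.blocks.clear();
    }

    fn put(&mut self, sequence_hash: u64, block: Vec<u8>) -> Result<(), String> {
        if self.volume.is_none() {
            return Err("no volume registered".to_string());
        }
        self.blocks.insert(sequence_hash, block);
        Ok(())
    }

    fn get(&self, sequence_hash: u64) -> Option<&[u8]> {
        self.blocks.get(&sequence_hash).map(|b| b.as_slice())
    }
}
```

The caller never sees a mount point or file path; it addresses blocks by hash only, which matches the decoupling from the host file system stack described above.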
##### Dynamo Event Plane (Pub/Sub Coordination Layer)
To support external storage optimizations without modifying KVBM logic, we provide an **event plane** built on NATS.io that emits lifecycle events for all block operations. Two events are emitted:
* StoreEvent: Emitted when a KV block is registered.
* RemoveEvent: Emitted when a KV block is released or evicted.
Each KVEvent (\~100 bytes) contains:
* sequence\_hash: Unique identifier of the KV block
* prefix\_hash: Prefix grouping for query-level aggregation
* block\_size: Size in bytes
* storage\_location: Logical volume identifier
* event\_type: Store or Remove
* extra\_metadata: Reserved fields for partner-specific optimization
These events are batched and published periodically (e.g., every \~10s or dynamically based on system load) for scalability.
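As a rough illustration of the event shape and of batching, the sketch below models a `KvEvent` with the fields listed above and a size-triggered batcher with a manual flush. The field types, the `EventBatcher` type, and the `store_event` helper are assumptions made for this example; the real encoding and batching policy may differ.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum EventType {
    Store,
    Remove,
}

/// Field names follow the list above; the types are assumed for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct KvEvent {
    pub sequence_hash: u64,
    pub prefix_hash: u64,
    pub block_size: u64,
    pub storage_location: String,
    pub event_type: EventType,
    pub extra_metadata: Vec<u8>,
}

/// Hypothetical batcher: buffers events and hands them out in batches.
pub struct EventBatcher {
    max_batch: usize,
    pending: Vec<KvEvent>,
}

impl EventBatcher {
    pub fn new(max_batch: usize) -> Self {
        EventBatcher { max_batch, pending: Vec::new() }
    }

    /// Buffers one event; returns a full batch once `max_batch` events are pending.
    pub fn push(&mut self, event: KvEvent) -> Option<Vec<KvEvent>> {
        self.pending.push(event);
        if self.pending.len() >= self.max_batch {
            Some(self.flush())
        } else {
            None
        }
    }

    /// Returns everything buffered so far; intended to be called on a periodic tick.
    pub fn flush(&mut self) -> Vec<KvEvent> {
        std::mem::take(&mut self.pending)
    }
}

/// Convenience constructor for a Store event with placeholder size and location.
pub fn store_event(sequence_hash: u64, prefix_hash: u64) -> KvEvent {
    KvEvent {
        sequence_hash,
        prefix_hash,
        block_size: 4096,
        storage_location: "volume-0".to_string(),
        event_type: EventType::Store,
        extra_metadata: Vec::new(),
    }
}
```

A publisher would call `push` for each lifecycle event and `flush` from a timer, sending each returned batch as a single message so that the per-event overhead stays small.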
##### A conceptual design of a storage advisor
This section gives an overview for storage providers interested in integrating with KVBM as a custom backend and optimizing performance. ***Please note, this is optional and not required for KVBM to integrate with a backend.***
External storage systems are not tightly coupled with Dynamo’s execution pipeline. Instead, they passively observe KV block lifecycle events through a subscription model:
* Storage volumes are pre-provisioned and mounted by the storage provider.
* These volumes are then registered with Dynamo via the NIXL Storage Agent using registerVolume() APIs. Dynamo itself does not manage mounts or provisioning.
* The Dynamo KV Block Manager interacts only with logical block-level APIs (i.e., get() and put()).
* In parallel, the Event Plane asynchronously broadcasts KV lifecycle events using a NATS-based pub/sub channel.
* Storage vendors implement a lightweight subscriber process that listens to these events without interfering with the KV Manager’s runtime behavior.
* This decoupling ensures that external storage systems can optimize block placement and lifecycle tracking without modifying or instrumenting the core Dynamo codebase.
To enable fast lookup and dynamic tiering, storage vendors may build internal data structures from the received event stream. Here is a high-level conceptual design:
* On receiving a StoreEvent, the storage system:
* Inserts a record into an internal prefix tree, hash map, or LRU index.
* This record includes the prefix\_hash and sequence\_hash, which logically identify the token block and its grouping.
* Associated metadata (e.g., block\_size, storage\_location) is also captured.
* On receiving a RemoveEvent, the system:
* Deletes or prunes the corresponding record from its index.
* Optionally triggers cleanup or tier migration workflows.
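The handling of StoreEvent and RemoveEvent described above can be sketched as a small index on the subscriber side. This is one possible design, not a prescribed one: `BlockIndex`, `BlockRecord`, and the choice of two hash maps (instead of a prefix tree or LRU index) are assumptions for illustration.

```rust
use std::collections::{HashMap, HashSet};

/// Simplified event shape; only the fields the index needs are modelled.
pub enum KvEvent {
    Store { sequence_hash: u64, prefix_hash: u64, block_size: u64, storage_location: String },
    Remove { sequence_hash: u64 },
}

pub struct BlockRecord {
    pub prefix_hash: u64,
    pub block_size: u64,
    pub storage_location: String,
}

/// Hypothetical subscriber-side index of live KV blocks.
#[derive(Default)]
pub struct BlockIndex {
    by_sequence: HashMap<u64, BlockRecord>,
    by_prefix: HashMap<u64, HashSet<u64>>,
}

impl BlockIndex {
    /// Applies one lifecycle event to the index.
    pub fn apply(&mut self, event: KvEvent) {
        match event {
            KvEvent::Store { sequence_hash, prefix_hash, block_size, storage_location } => {
                // Replace any stale record for the same block before inserting.
                self.remove(sequence_hash);
                self.by_prefix.entry(prefix_hash).or_default().insert(sequence_hash);
                let record = BlockRecord { prefix_hash, block_size, storage_location };
                self.by_sequence.insert(sequence_hash, record);
            }
            KvEvent::Remove { sequence_hash } => self.remove(sequence_hash),
        }
    }

    fn remove(&mut self, sequence_hash: u64) {
        if let Some(record) = self.by_sequence.remove(&sequence_hash) {
            let now_empty = match self.by_prefix.get_mut(&record.prefix_hash) {
                Some(group) => {
                    group.remove(&sequence_hash);
                    group.is_empty()
                }
                None => false,
            };
            if now_empty {
                self.by_prefix.remove(&record.prefix_hash);
            }
        }
    }

    pub fn lookup(&self, sequence_hash: u64) -> Option<&BlockRecord> {
        self.by_sequence.get(&sequence_hash)
    }

    /// Number of live blocks that share a prefix group.
    pub fn blocks_with_prefix(&self, prefix_hash: u64) -> usize {
        self.by_prefix.get(&prefix_hash).map_or(0, |group| group.len())
    }

    /// Total bytes of all live blocks, useful for space accounting.
    pub fn live_bytes(&self) -> u64 {
        self.by_sequence.values().map(|r| r.block_size).sum()
    }
}
```

Cleanup or tier migration workflows would hook into `remove`, where the pruned record is still available.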
This event-driven indexing lets the storage system track which KV blocks are live and where they belong, which enables low-latency lookup, efficient space reclamation, and multi-tier coordination. With real-time visibility into KV block usage patterns, the storage system can implement smart tiering policies, such as:
* Hot block promotion: Frequently accessed KV blocks can be migrated to fast SSD volumes.
* Cold block demotion: Infrequently used blocks can be demoted to slower storage (e.g., HDDs, cloud object storage).
* Proactive compaction: If block sizes or prefix patterns indicate fragmentation, the storage backend can coalesce or rewrite blocks.
These optimizations are performed entirely outside of Dynamo, with the assumption that storage providers adhere to SLA guarantees and volume availability.
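A threshold-based version of the promotion and demotion policies above might look like the following sketch. The per-block access counts are an assumption: the Store and Remove events do not carry them, so a storage system would have to collect them itself (for example, by counting `get()` calls per block). `TieringPolicy` and both thresholds are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TierAction {
    Promote,
    Demote,
    Keep,
}

/// Hypothetical policy: thresholds on accesses observed in a recent time window.
pub struct TieringPolicy {
    pub hot_threshold: u32,
    pub cold_threshold: u32,
}

impl TieringPolicy {
    /// Classifies one block by how often it was accessed in the window.
    pub fn decide(&self, accesses_in_window: u32) -> TierAction {
        if accesses_in_window >= self.hot_threshold {
            TierAction::Promote
        } else if accesses_in_window <= self.cold_threshold {
            TierAction::Demote
        } else {
            TierAction::Keep
        }
    }

    /// Returns only the blocks that need to move, as (sequence_hash, action) pairs.
    pub fn plan(&self, access_counts: &[(u64, u32)]) -> Vec<(u64, TierAction)> {
        access_counts
            .iter()
            .map(|&(hash, count)| (hash, self.decide(count)))
            .filter(|&(_, action)| action != TierAction::Keep)
            .collect()
    }
}
```

The plan is computed entirely from the storage system's own bookkeeping, so applying it requires no call into Dynamo.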
Critically, this entire system is designed to be non-intrusive:
* The Dynamo KV Block Manager remains agnostic to how data is stored or optimized.
* The Event Plane does not block or intercept any critical path of inference.
* Storage vendors are given the freedom to innovate and optimize without requiring changes to the inference runtime.
This design ensures that performance, resilience, and extensibility scale independently across the KV layer and the storage backend layer.
..
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
KV Block Manager
================
The Dynamo KV Block Manager (KVBM) is a scalable runtime component designed to handle memory allocation, management, and remote sharing of Key-Value (KV) blocks for inference tasks across heterogeneous and distributed environments. It acts as a unified memory layer for frameworks like vLLM, SGLang, and TRT-LLM.
It offers:
* A **unified memory API** that spans GPU memory, pinned host memory, remote RDMA-accessible memory, local or distributed pools of SSDs, and remote file/object/cloud storage systems.
* Support for evolving **block lifecycles** (allocate → register → match) with event-based state transitions that storage can subscribe to.
* Integration with **NIXL**, a dynamic memory exchange layer used for remote registration, sharing, and access of memory blocks over RDMA/NVLink.
The Dynamo KV Block Manager serves as a reference implementation that emphasizes modularity and extensibility. Its pluggable design enables developers to customize components and optimize for specific performance, memory, and deployment needs.
.. toctree::
:hidden:
Motivation <kvbm_motivation.md>
KVBM Architecture <kvbm_architecture.md>
Understanding KVBM components <kvbm_components.md>
KVBM Further Reading <kvbm_reading>
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Motivation behind KVBM
Large language models (LLMs) and other AI workloads increasingly rely on KV caches that extend beyond GPU and local CPU memory into remote storage tiers. However, efficiently managing the lifecycle of KV blocks in remote storage presents challenges:
* Lack of a solution tailored for GenAI use cases.
* Lack of visibility into real-time block usage patterns.
* Need for lightweight, ownership-driven memory management rather than complex object stores with unneeded overhead.
* Need for a modular, memory-safe design with a simplified UX.
* Inability to differentiate between hot (frequently accessed) and cold (infrequently accessed) blocks across the stack without intrusive application-level changes.
* Difficulty in optimizing storage placement across heterogeneous storage tiers (for example, SSDs, object storage, and cloud storage).
Conventional systems either lack dynamic feedback mechanisms or require deep integration into core storage paths, which both increases complexity and reduces portability.