> To ensure compatibility, please refer to the examples in the release branch or tag that matches the version you installed.
### Building the Dynamo Base Image
To coordinate across the data center, Dynamo relies on an etcd and NATS cluster. To run locally, both need to be available.
Although not needed for local development, deploying your Dynamo pipelines to Kubernetes requires you to push a Dynamo base image to your container registry. You can use any container registry of your choice, such as:
- Docker Hub (docker.io)
- NVIDIA NGC Container Registry (nvcr.io)
- Any private registry
We publish our images on [nvcr.io](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime), and you can use them directly.
- [etcd](https://etcd.io/) can be run directly as `./etcd`.
Alternatively you could build and push an image from source:
- For specific details on the `--framework vllm` build, [read about the vLLM backend](components/backends/vllm/README.md).
- For specific details on the `--framework tensorrtllm` build, [read about the TensorRT-LLM backend](components/backends/trtllm/README.md).
Note about AWS environments:
We publish Python wheels specialized for each of our supported engines: vllm, sglang, llama.cpp and trtllm. The examples that follow use sglang; read on for other engines.
- If deploying Dynamo in AWS, make sure to build the container with EFA support using the `--make-efa` flag.
After building, you can use this image by setting the `DYNAMO_IMAGE` environment variable to point to your built image:
Okay, so I'm trying to figure out how to respond to the user's greeting. They said, "Hello, how are you?" and then followed it with "Hello! I'm just a program, but thanks for asking." Hmm, I need to come up with a suitable reply. ...
### LLM Serving
If the model is not available locally, it is downloaded from HuggingFace and cached.
You can also pass a local path: `python -m dynamo.sglang.worker --model-path ~/llms/Qwen3-0.6B`
### Running an LLM API server
Dynamo provides a simple way to spin up a local set of inference components including:
- **OpenAI Compatible Frontend** – High performance OpenAI compatible HTTP API server written in Rust.
- **Basic and KV Aware Router** – Route and load balance traffic to a set of workers.
- **Workers** – Set of pre-configured LLM serving engines.
To run a minimal configuration, you can use a pre-configured example.
#### Start Dynamo Distributed Runtime Services
First start the Dynamo Distributed Runtime services:
```bash
docker compose -f deploy/metrics/docker-compose.yml up -d
```
```
# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router:
python -m dynamo.frontend [--http-port 8080]
```
Then start the vllm engine, which connects to nats and etcd to receive requests. You can run several of these, both for the same model and for multiple models. The frontend node will discover them.
#### Start Dynamo LLM Serving Components
Next, serve a minimal configuration with an HTTP server, a basic round-robin router, and a single worker.
```bash
cd examples/llm
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
```
Rerun with `curl -N` and change `stream` in the request to `true` to get the responses as soon as the engine issues them.
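With `stream` set to `true`, an OpenAI-compatible server typically answers with Server-Sent Events: one `data: {...}` line per chunk, ending with `data: [DONE]`. The following is a minimal sketch of how a client could reassemble such a stream. The sample chunks are made up for illustration and were not captured from Dynamo.

```python
import json

def collect_stream(lines):
    """Join the text deltas from OpenAI-style SSE lines into one string."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)

# Hypothetical chunks, shaped like chat-completion stream events.
sample = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
text = collect_stream(sample)
print(text)
```

Processing each line as it arrives, instead of collecting them first, is what lets a client show tokens as soon as the engine issues them.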
### Engines
In the introduction we installed the `sglang` engine. There are other options.
All of these require nats and etcd, as well as a frontend (`python -m dynamo.frontend [--interactive]`).
# vllm
```
uv pip install ai-dynamo[vllm]
```
Run the backend/worker like this:
```
python -m dynamo.vllm --help
```
vllm attempts to allocate enough KV cache for the full context length at startup. If that does not fit in your available memory, pass `--context-length <value>`.
To specify which GPUs to use, set the environment variable `CUDA_VISIBLE_DEVICES`.
# sglang
```
uv pip install ai-dynamo[sglang]
```
Run the backend/worker like this:
```
python -m dynamo.sglang.worker --help
```
You can pass any sglang flags directly to this worker; see https://docs.sglang.ai/backend/server_arguments.html for the full list, including how to use multiple GPUs.
# TRT-LLM
This currently requires a container. TODO: add the docs.
If you have multiple GPUs, llama.cpp does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.
### Local Development
1. Install libraries
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
2. Install Rust
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
3. Create a Python virtual env:
```
uv venv dynamo
source dynamo/bin/activate
```
4. Install build tools
```
uv pip install pip maturin
```
[Maturin](https://github.com/PyO3/maturin) is the Rust<->Python bindings build tool.
5. Build the Rust bindings
```
cd lib/bindings/python
maturin develop --uv
```
6. Install the wheel
```
cd $PROJECT_ROOT
uv pip install .
```
Note that editable (`-e`) installs do not work because the `dynamo` package is split over multiple directories, one per backend.
You should now be able to run `python -m dynamo.frontend`.
Remember that nats and etcd must be running (see earlier).
Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.
If you use vscode or cursor, we have a .devcontainer folder built on [Microsoft's Extension](https://code.visualstudio.com/docs/devcontainers/containers). For instructions, see the [ReadMe](.devcontainer/README.md).
Otherwise, to develop locally, we recommend working inside the container.
#### Conda Environment
Alternatively, you can use a conda environment:
```bash
conda activate <ENV_NAME>
pip install nixl # Or install https://github.com/ai-dynamo/nixl from source
cargo build --release
# To install ai-dynamo-runtime from source
cd lib/bindings/python
pip install .
cd ../../../
pip install ".[all]"
```
### Deployment to Kubernetes
Follow the [Quickstart Guide](docs/guides/dynamo_deploy/quickstart.md) to deploy to Kubernetes.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo SDK
## Introduction
Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. See [Python Bindings](./python_bindings.md).
Dynamo SDK is a layer on top of the core. It is a Python framework that makes it easy to create inference graphs and deploy them locally and onto a target K8s cluster. The SDK was heavily inspired by [BentoML's](https://github.com/bentoml/BentoML) open source deployment patterns. The Dynamo CLI is a companion tool that allows you to spin up an inference pipeline locally, containerize it, and deploy it. You can find a toy hello-world example and instructions for deploying it [here](../examples/hello_world.md).
## Installation
The SDK can be installed using pip:
```bash
pip install ai-dynamo
```
## Core Concepts
As you read about each concept, it is helpful to have the [basic example](../examples/hello_world.md) up as well so you can refer back to it.
### Defining a Service
A Service is a core building block for a project. You can think of it as a logical unit of work. For example, you might have a service responsible for preprocessing and tokenizing and another service running the model worker itself. You define a service using the `@service` decorator on a class.
```python
@service(
    dynamo={
        "namespace": "dynamo",
    },
    resources={"gpu": 2, "cpu": "10", "memory": "20Gi"},
    workers=1,
)
```
Key configuration options:
1. `dynamo`: Dictionary that defines the Dynamo configuration and enables/disables it. When enabled, a dynamo worker is created under the hood which can register with the [Distributed Runtime](../architecture/architecture.md).
2. `resources`: Dictionary defining resource requirements. The GPUs field is used for local and remote deployment. The other fields are used to determine resources when deploying to K8s.
3. `workers`: Number of parallel instances of the service to spin up.
### Writing a Service
Let's walk through an example to understand how you write a dynamo service.
Dynamo follows a class-based architecture similar to BentoML, making it intuitive for users familiar with that framework. Each service is defined as a Python class, with the following components:
1. Class attributes for dependencies using `depends()`
2. An `__init__` method for standard initialization
3. Optional lifecycle hooks like `@async_on_start` and `@on_shutdown`
4. Endpoints defined with `@endpoint()`. Optionally, an endpoint can be given a name
via `@endpoint("my_endpoint_name")`, but otherwise defaults to the name of the
function being decorated if omitted.
This approach provides a clean separation of concerns and makes the service structure easy to understand.
#### Service Dependencies with `depends()`
The `depends()` function is a powerful feature that lets you create a dependency between services. When you use `depends(ServiceB)`, several things happen:
1. It ensures that `ServiceB` is deployed when `ServiceA` is deployed by adding it to an internal service dependency graph
2. It creates a client to the endpoints of `ServiceB` that is being served under the hood.
3. You are able to access `ServiceB` endpoints as if it were a local function!
```python
# What happens internally when you use depends(ServiceB)
# ...
```
Through the SDK, we also provide a way to access the underlying bindings if you need them. Sometimes you might want to write complicated logic that requires you to create a client to another Service directly, without depending on it. To do so:
```python
import VllmWorker
# this runtime object gives you access to the underlying python bindings
# ...
```
This is used in some of our prebuilt examples and is a powerful way to leverage the benefits of the SDK while being able to access Dynamo's core primitives.
#### Lifecycle Hooks
Dynamo supports key lifecycle hooks to manage service initialization and cleanup.
##### `@async_on_start`
The `@async_on_start` hook is called when the service is started and is used to run an async process outside of the main `__init__` function.
```python
@async_on_start
async def async_init(self):
    # Perfect for operations that need to be awaited
    self.db = await connect_to_db()
    self.tokenizer = await load_tokenizer()
    self.engine = await initialize_engine(self.model)
```
This is especially useful for:
- Initializing external connections
- Setting up runtime resources that require async operations
##### `@on_shutdown`
The `@on_shutdown` hook is called when the service is shut down and handles cleanup.
```python
@on_shutdown
def shutdown(self):
    # Gracefully handle shutdown / cleanup
    logger.info("worker shutting down")
```
This ensures resources are properly released, preventing memory leaks and making sure external connections are properly closed. This is helpful to clean up vLLM engines that have been started outside of the main process.
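To make the ordering of the two hooks concrete, here is a small self-contained sketch. The `async_on_start` and `on_shutdown` decorators below are stand-ins defined inside the example, not the SDK's own, and the `run` helper is a hypothetical substitute for whatever the runtime does when it starts and stops a service.

```python
import asyncio

def async_on_start(fn):
    fn._hook = "start"      # stand-in marker, not the SDK decorator
    return fn

def on_shutdown(fn):
    fn._hook = "shutdown"   # stand-in marker, not the SDK decorator
    return fn

class Worker:
    def __init__(self):
        self.events = ["init"]  # only cheap, synchronous setup belongs here
        self.engine = None

    @async_on_start
    async def async_init(self):
        await asyncio.sleep(0)  # placeholder for an awaited setup call
        self.engine = "ready"
        self.events.append("start")

    @on_shutdown
    def shutdown(self):
        self.engine = None
        self.events.append("shutdown")

def hooks(obj, kind):
    """Return the bound methods of obj that carry the given hook marker."""
    return [getattr(obj, name) for name, attr in vars(type(obj)).items()
            if getattr(attr, "_hook", None) == kind]

async def run(worker):
    for hook in hooks(worker, "start"):
        await hook()
    try:
        return worker.engine    # a real service would handle requests here
    finally:
        for hook in hooks(worker, "shutdown"):
            hook()

worker = Worker()
state_while_running = asyncio.run(run(worker))
print(state_while_running, worker.events)
```

Running the shutdown hooks in a `finally` block is one way to make sure cleanup happens even when request handling raises.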
### Configuring a Service
Dynamo SDK provides a flexible configuration system that allows you to define service parameters through multiple methods:
1. Directly in the `@service` decorator
2. Through YAML configuration files
3. Via command-line arguments
4. Using environment variables
These methods can be used together with clear precedence rules, giving you fine-grained control over service configuration across different environments.
#### Configuration via Service Decorator
The most basic method is to specify parameters directly in the service decorator:
This defines static configuration values in code. Note that the constructor parameters (`model_name` and `temperature`) are also configurable values that can be overridden.
#### Configuration via YAML
For more flexible configuration, especially across environments, you can use YAML files:
For services that need to extract their configuration as command-line arguments (common when integrating and validating with external libraries), the SDK provides a helper method:
```python
from dynamo.sdk.lib.config import ServiceConfig

def setup_my_lib():
    config = ServiceConfig.get_instance()
    # Get all MyService config with prefix "lib_" as CLI args
    ...
```
```bash
dynamo serve service:LLMService -f prod_config.yaml --LLMService.temperature=0.9
```
The service receives the combined configuration with the command-line value taking precedence, resulting in effective configuration of:
- `dynamo.namespace = "prod"`
- `resources.gpu = 4`
- `workers = 8`
- `model_name = "llama-3-70b-instruct"`
- `temperature = 0.9` (from CLI override)
- `max_tokens = 2048`
- `cache_size = 10000`
- `use_kv_cache = true`
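The layering behind that effective configuration can be sketched in plain Python. This is not the SDK's implementation: `deep_merge` and `parse_cli_overrides` are hypothetical helpers, the decorator defaults and YAML contents are invented to reproduce the list above, and the order (decorator, then YAML, then CLI) is an assumption; the source only states that the command-line value wins.

```python
from copy import deepcopy

def deep_merge(base, override):
    """Return a copy of base with override applied; nested dicts merge key by key."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

def parse_cli_overrides(args, service):
    """Turn ['--LLMService.temperature=0.9'] into {'temperature': 0.9}."""
    prefix = f"--{service}."
    overrides = {}
    for arg in args:
        if arg.startswith(prefix) and "=" in arg:
            key, raw = arg[len(prefix):].split("=", 1)
            try:
                overrides[key] = float(raw) if "." in raw else int(raw)
            except ValueError:
                overrides[key] = raw  # leave non-numeric values as strings
    return overrides

# Hypothetical decorator defaults and YAML contents.
decorator_defaults = {"dynamo": {"namespace": "dynamo"}, "resources": {"gpu": 1},
                      "workers": 1, "temperature": 0.7}
yaml_config = {"dynamo": {"namespace": "prod"}, "resources": {"gpu": 4}, "workers": 8,
               "model_name": "llama-3-70b-instruct", "max_tokens": 2048,
               "cache_size": 10000, "use_kv_cache": True}
cli = parse_cli_overrides(["--LLMService.temperature=0.9"], "LLMService")

# Later layers override earlier ones.
effective = deep_merge(deep_merge(decorator_defaults, yaml_config), cli)
print(effective)
```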
#### Service Configuration Best Practices
1. **Use the Service Decorator for Defaults**: Put reasonable defaults in the service decorator
2. **Use Constructor Parameters for Runtime Options**: Parameters that might change between deployments
3. **Use YAML for Environment Configuration**: Separate configuration by environment (dev/staging/prod)
4. **Use CLI for Quick Testing**: Override specific values for experimentation
5. **Document Configuration Keys**: Make sure to document all available configuration options
Following these practices helps create flexible and maintainable Dynamo services that can be easily configured for different environments and use cases.
### Deploying a Single Service
You can deploy a single service for local development, even if you have a dependency graph defined with `depends()`, by running `dynamo serve --service-name <ClassName> <entrypoint> <configuration arguments>`. You can see an example of this in our multinode documentation [here](../examples/multinode.md).
### Composing Services into a Graph
There are two main ways to compose services in Dynamo:
1. Use `depends()` (Recommended)
The depends() approach is the recommended way for production deployments:
- Automatically deploys all dependencies
- Creates a static inference graph at deployment time
- Provides type hints and better IDE support
2. Use `.link()` (Experimental)
Our `.link()` syntax is a flexible and experimental way to compose various services. Linking allows you to compose checks at runtime and view behavior. Under the hood, we are editing the dependency graph between various services. This is useful for experimentation and development, but we suggest writing a static graph for your final production deployment.
#### Understanding the `.link()` syntax
Let's take the example of a `Processor` component. This component can currently do two things:
1. Process a request and send it to a `Router` to decide what worker to send it to.
2. Process a request and send it to a `Worker` directly.
Consider this snippet of the Processor:
```python
class Processor(ProcessMixIn):
    """
    vLLM pre and post processing
    """
    worker = depends(VllmWorker)
    router = depends(Router)
    # logic for processing a request based on router or worker
```
Think of all the `depends()` statements as the maximal set of edges for the processor. At runtime, you may want to follow only a single path. By default, our processor spins up both the VllmWorker and Router as separate services (because `depends()` is defined for both). However, if you want to spin up only the Router, you can link the Router to the Processor:
```python
Processor.link(Router)
```
This removes the `worker` dependency from the Processor and only spins up the Router.
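The effect of linking on the dependency graph can be modeled with a toy example. The `Service` class and `services_to_start` function below are hypothetical and only track edges; they are not the SDK's classes, and returning the target from `link` is a choice made for this sketch.

```python
class Service:
    """Stand-in for a service class; holds only the dependency edges."""
    def __init__(self, name, depends=()):
        self.name = name
        self.edges = list(depends)   # maximal set of edges, as declared

    def link(self, target):
        """Keep only the edge to target, dropping this service's other dependencies."""
        if target not in self.edges:
            raise ValueError(f"{self.name} does not depend on {target.name}")
        self.edges = [target]
        return target   # returning the target lets calls be chained in this sketch

def services_to_start(root):
    """Walk the graph from root and list every reachable service once."""
    seen, stack = [], [root]
    while stack:
        svc = stack.pop()
        if svc not in seen:
            seen.append(svc)
            stack.extend(reversed(svc.edges))
    return [svc.name for svc in seen]

worker = Service("VllmWorker")
router = Service("Router")
processor = Service("Processor", depends=[worker, router])

before = services_to_start(processor)
processor.link(router)
after = services_to_start(processor)
print(before, after)
```

Before linking, all three services are reachable from the Processor; afterwards only the Processor and the Router are.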
# Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance
...
@@ -117,78 +105,3 @@ The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows r
- Add prefill worker: no explicit action needed.
- Delete prefill worker: flush engine.
### How this works under the hood
#### Auto-Discovery for new workers
In Dynamo, we use `etcd` (a distributed key-value pair store) as a way to register and discover new components. When a new decode/aggregated worker starts, it adds its endpoint information to `etcd` allowing the router to discover it and route requests to it. For the KV-cache transfer process, newly added decode workers put memory descriptors of their KV cache (used in NIXL transfer) in `etcd`. Newly added prefill workers also register with `etcd` for discovery and simply start pulling prefill requests from the global prefill queue after they spin up. Prefill workers lazy-pull the descriptors when they start serving a remote prefill request for the first time.
You can watch this happen live by running the following:
```bash
# in terminal 1 - run the disaggregated serving example
dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
```
```bash
# in terminal 2 - watch the namespace in etcd
watch -cd etcdctl get --prefix <namespace>
```
You should see something like this show up as the disaggregated serving example starts up:
Since worker information is stored in etcd, we can shut down workers by simply revoking their etcd leases. After a lease is revoked:
- Decode/aggregated worker endpoints are immediately removed from etcd so that they no longer accept new requests. They finish any in-flight requests, shut down their engine, and exit gracefully.
- Prefill workers stop pulling from the prefill queue and exit gracefully after all pending remote KV cache writes finish.
You can also visualize this by revoking a worker's etcd lease while it has ongoing requests. Refer to this example script that does this: [revoke_lease.py](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/revoke_lease.py).
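The reason revoking a lease removes a worker from routing is that, in etcd, every key attached to a lease is deleted when the lease is revoked. The sketch below mimics that behavior with an in-memory class; `LeaseRegistry` and the key layout are hypothetical and do not use the etcd client API.

```python
class LeaseRegistry:
    """In-memory stand-in for a key-value store whose keys are tied to leases."""
    def __init__(self):
        self._next_lease = 1
        self._keys = {}          # key -> (lease_id, value)

    def grant_lease(self):
        lease_id = self._next_lease
        self._next_lease += 1
        return lease_id

    def put(self, key, value, lease_id):
        self._keys[key] = (lease_id, value)

    def get_prefix(self, prefix):
        """Return all key/value pairs under prefix, like a prefix query."""
        return {k: v for k, (_, v) in sorted(self._keys.items()) if k.startswith(prefix)}

    def revoke(self, lease_id):
        """Delete every key attached to the lease."""
        self._keys = {k: lv for k, lv in self._keys.items() if lv[0] != lease_id}

registry = LeaseRegistry()
lease_a, lease_b = registry.grant_lease(), registry.grant_lease()
# Hypothetical key layout; the real key names are not shown in this document.
registry.put("dynamo/VllmWorker/generate/a", "10.0.0.1:9000", lease_a)
registry.put("dynamo/VllmWorker/generate/b", "10.0.0.2:9000", lease_b)

visible_before = registry.get_prefix("dynamo/")   # both workers are discoverable
registry.revoke(lease_a)
visible_after = registry.get_prefix("dynamo/")    # only worker b remains routable
print(len(visible_before), list(visible_after))
```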
# KV Cache Routing
> [!NOTE]
> This information is temporary and will change soon.

This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.
To enable KV cache aware routing, start the frontend node like this:
```
python -m dynamo.frontend --router-mode kv
```
The engine announces when a KV block is created or removed. The Dynamo router finds the worker with the best match for those KV blocks and directs the traffic to that node.
For performance testing, compare a typical workload with `--router-mode random|round-robin` to see if it can benefit from KV-aware routing.
The KV-aware routing arguments:
- `--kv-overlap-score-weight`: Sets the amount of weighting on overlaps with prefix caches, which directly contributes to the prefill cost. A large weight is expected to yield a better TTFT (at the expense of worse ITL). When set to 0, prefix caches are not considered at all (falling back to pure load balancing behavior on the active blocks).
- `--router-temperature`: Sets the temperature when randomly selecting workers to route to via softmax sampling on the router cost logits. Setting it to 0 recovers the deterministic behavior where the min logit is picked.
- `--use-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the KV cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the KV cache hit ratio in each engine. Set false if your backend engine does not emit KV events.
## Architecture
Dynamo's architecture consists of three key concepts:
- **Namespace**: Groups related components (similar to directories in a file system). In our examples, we use the label `dynamo`. This avoids collisions between two different dynamo graphs.
- **Component**: The deployable unit in Dynamo. Components are self-contained and typically map to separate Docker containers. In our examples, we use labels like `VllmWorker`, `Router`, `Processor` for the components. Components can be created in Python or Rust.
- **Endpoint**: Functions attached to components that transform inputs into outputs. Endpoints are discoverable and callable by other components. In our examples we use the label `generate` for most of the endpoints.

A Dynamo graph is a collection of components that are linked together to form a graph. There are two paths through the graph: the request path and the response path. For LLMs the request path is single-in (a single message) and the response path is many-out (streamed output).
A common pattern is to spin up multiple instances of the same component that serve the same endpoints, for example, when you want to duplicate models to serve more requests. Each endpoint will get a unique identifier and you will have to tell Dynamo how to route requests between these endpoints.
Colloquially, we refer to a Dynamo component that serves an endpoint for LLM inference as a **worker**.
...
@@ -150,32 +144,6 @@ The KVIndexer builds and maintains a global view of cached blocks in a prefix tr
The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks.
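As a rough illustration of what such a lookup returns, the toy indexer below hashes each block together with all tokens before it and counts, per worker, how many leading blocks of a request are already cached. It uses a hash map of prefix hashes rather than a prefix tree, and `ToyKvIndexer`, `on_block_stored`, and `BLOCK_SIZE` are names invented for this sketch, not Dynamo's API.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # tokens per KV block; an arbitrary value for this sketch

def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash each full block together with everything before it (prefix hashing)."""
    hashes, prefix = [], ()
    for start in range(0, len(tokens) - block_size + 1, block_size):
        prefix = prefix + tuple(tokens[start:start + block_size])
        hashes.append(hash(prefix))
    return hashes

class ToyKvIndexer:
    def __init__(self):
        self._workers_by_block = defaultdict(set)   # block hash -> worker ids

    def on_block_stored(self, worker_id, tokens):
        """Record that worker_id now caches every full block of tokens."""
        for h in block_hashes(tokens):
            self._workers_by_block[h].add(worker_id)

    def find_matches_for_request(self, tokens):
        """Return {worker_id: number of leading blocks that worker has cached}."""
        scores = {}
        candidates = None
        for h in block_hashes(tokens):
            holders = self._workers_by_block.get(h, set())
            candidates = holders if candidates is None else candidates & holders
            if not candidates:
                break   # no worker holds this prefix, so longer prefixes cannot match
            for worker_id in candidates:
                scores[worker_id] = scores.get(worker_id, 0) + 1
        return scores

indexer = ToyKvIndexer()
indexer.on_block_stored(worker_id=1, tokens=list(range(12)))             # 3 blocks
indexer.on_block_stored(worker_id=2, tokens=list(range(4)) + [99] * 4)   # shares block 0 only
matches = indexer.find_matches_for_request(list(range(8)) + [7, 7, 7, 7])
print(matches)
```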
This example is designed to help you understand KV cache routing; it won't run outside of the context of dynamo serve. See the examples/ directory for runnable examples.
### WorkerMetricsPublisher
We added a KvMetrics Publisher which sends the following metrics to the KvMetricsAggregator:
- num_requests_waiting
...
@@ -191,48 +159,3 @@ Currently, the WorkerMetricsPublisher exists as a Python binding.
### KvMetricsAggregator
The KvMetricsAggregator receives these metrics and aggregates them. It has a method `get_metrics` which returns an object of `AggregatedMetrics`.
The Router component makes intelligent worker selection decisions:
1. Receives incoming requests as tokens
2. Queries the KVIndexer to find potential cache hits across workers
3. Collects performance metrics from workers (via KvMetricsAggregator)
4. Uses a cost function to determine the optimal worker for each request
5. Returns the chosen worker

The processor manages tokenizing the request, sending it to the KV Router, and then, once it receives a response, directing the request to the selected worker using `direct()` routing.
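The selection steps above can be sketched as follows. The cost formula is an assumption made for illustration (a weighted count of blocks still to prefill plus the blocks already active on the worker); it is not taken from Dynamo's source. `worker_costs` and `pick_worker` are hypothetical names, and the temperature handling follows the description of softmax sampling where the lowest cost wins at temperature 0.

```python
import math
import random

def worker_costs(overlap_blocks, active_blocks, request_blocks, overlap_weight=1.0):
    """Hypothetical cost: weighted blocks left to prefill plus blocks already active."""
    costs = {}
    for worker_id, active in active_blocks.items():
        to_prefill = request_blocks - overlap_blocks.get(worker_id, 0)
        costs[worker_id] = overlap_weight * to_prefill + active
    return costs

def pick_worker(costs, temperature=0.0, rng=random):
    """Lower cost is better. Temperature 0 picks the minimum deterministically."""
    if temperature == 0:
        return min(costs, key=costs.get)
    best = min(costs.values())
    # Softmax over negated costs, shifted by the best cost for numerical stability.
    weights = {w: math.exp(-(c - best) / temperature) for w, c in costs.items()}
    threshold = rng.random() * sum(weights.values())
    for worker_id, weight in weights.items():
        threshold -= weight
        if threshold <= 0:
            return worker_id
    return worker_id  # guard against floating-point leftovers

overlaps = {1: 2, 2: 1}   # matched blocks per worker, e.g. from the KV indexer
active = {1: 10, 2: 3}    # hypothetical load reported per worker
costs = worker_costs(overlaps, active, request_blocks=3, overlap_weight=1.0)
choice_low_weight = pick_worker(costs)
choice_high_weight = pick_worker(worker_costs(overlaps, active, 3, overlap_weight=10.0))
print(costs, choice_low_weight, choice_high_weight)
```

With a weight of 1 the lightly loaded worker 2 wins; raising the weight to 10 shifts the choice to worker 1, which has more of the prefix cached. With a weight of 0 the cost reduces to the active blocks alone, which corresponds to pure load balancing.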
@@ -19,8 +19,6 @@ If you are a **🧑💻 Dynamo Contributor** first follow the instructions in
You would have to rebuild the dynamo platform images as the code evolves. For more details please look at the [Cloud Guide](../guides/dynamo_deploy/dynamo_cloud.md)
Export the [Dynamo Base Image](../get_started.md#building-the-dynamo-base-image) you want to use (or built during the prerequisites step) as the `DYNAMO_IMAGE` environment variable.
# Hello World: Aggregated and Disaggregated Deployment Examples
The `example` directory contains a [hello world example](../examples/hello_world.md) that implements a simplified disaggregated serving architecture used for deploying Large Language Models (LLMs). It removes the LLM related inference code and focuses on how Dynamo handles routing, task queue, and metadata communication between prefill and decode workers.
## Components
- frontend: A simple HTTP server that handles incoming requests
- processor: A pre/post-processing server that invokes the router server
- router: Handles API requests and routes them to appropriate workers based on the specified strategy
- worker: A dummy decode worker
- prefill-worker: A dummy prefill worker
## Deployment Architectures
This figure shows an overview of the major components to deploy:
4. Go to Node 2 and start Worker service as in step 3.
Now you should see both workers are ready in Node 1's terminal.
5. Query the Frontend with the following two prompts. The router assigns a different worker to each prompt, which you can observe in the responses.
- `Response: {"worker_output":"Tell me a joke_GeneratedBy_NODE1HOSTNAME","request_id":"id_number"}`
- `Response: {"worker_output":"Which team won 2020 World Series_GeneratedBy_NODE2HOSTNAME","request_id":"id_number"}`
```
curl -X 'POST' \
'http://localhost:8000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Tell me a joke",
"request_id":"id_number"
}'
curl -X 'POST' \
'http://localhost:8000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Which team won 2020 World Series",
"request_id":"id_number"
}'
```
6. Then modify the prompt; prompts with similar prefixes are routed to the same worker because of the routing algorithm used in this demo. For example, the following query is routed to the worker that processed the `Tell me a joke` prompt.
```
curl -X 'POST' \
'http://localhost:8000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Tell me a fact",
"request_id":"id_number"
}'
```
- `Response: {"worker_output":"Tell me a fact_GeneratedBy_NODE1HOSTNAME","request_id":"id_number"}`
## The Disaggregated Deployment
This example uses 3 nodes to demonstrate disaggregated serving.
- Node 1
  - Runs NATS and etcd services
  - Deploys Frontend and Processor
  - Deploys DummyWorker as the decode worker
- Node 2
  - Deploys DummyWorker as the decode worker
- Node 3
  - Deploys Prefill as the prefill worker
### Run the Deployment
1. Repeat steps 1 to 4 to deploy the Frontend, Processor, Router, and 2 Workers as decode workers.
```
dynamo serve components.prefill_worker:PrefillWorker
```
3. Query the Frontend. This time, decode workers push requests to the prefill queue, and the prefill worker pulls tasks from the queue to simulate the prefill task. The actual prefill is skipped in this demo.
```
curl -X 'POST' \
'http://localhost:8000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "This is prefill disagg serving example",
...
```
Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to have performance issues and to require extensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
You can tune the number of parallel build jobs for building vLLM from source on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
For example, on an ARM machine with low system resources:
The planner component is enabled by default for all deployment architectures but is set to no-op mode. This means the planner observes metrics but doesn't take scaling actions. To enable active scaling, you can add `--Planner.no-operation=false` to your `dynamo serve` command.
For more details, see [Planner Architecture Overview](../architecture/planner_intro.rst).
You can deploy Dynamo on multiple nodes via NATS/etcd-based discovery and communication. Here's an example of deploying disaggregated serving on 3 nodes using `nvidia/Llama-3.1-405B-Instruct-FP8`. Each node must be properly configured with InfiniBand and/or RoCE for communication between decode and prefill workers.
Note that this can be easily extended to more nodes. You can also run the Frontend, Processor, and Router on a separate CPU-only node if you'd like, as long as all nodes have access to the NATS/etcd endpoints!
**Step 1**: Start NATS/etcd on your head node. Ensure you have the correct firewall rules to allow communication between the nodes, because the NATS/etcd endpoints must be accessible by all other nodes.
```bash
# node 1
docker compose -f deploy/metrics/docker-compose.yml up -d
```
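Before starting workers on the other nodes, it can help to confirm that they can actually reach the head node. The following is a minimal sketch, assuming the compose file exposes the default client ports (4222 for NATS, 2379 for etcd); the helper names `is_reachable` and `check_head_node` are illustrative and not part of Dynamo.

```python
import socket

# Default client ports: NATS listens on 4222 and etcd on 2379.
# Adjust these if your deployment maps them differently (assumption).
DEFAULT_PORTS = {"nats": 4222, "etcd": 2379}


def is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_head_node(host: str, ports: dict = None) -> dict:
    """Check every service port on the head node and report per-service status."""
    ports = DEFAULT_PORTS if ports is None else ports
    return {name: is_reachable(host, port) for name, port in ports.items()}
```

Running `check_head_node("<node1-ip>")` from node 2 or node 3 returns a dictionary such as `{"nats": True, "etcd": True}` when both endpoints accept connections. A `False` entry usually points to a firewall rule or a service that is not listening on that port.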
**Step 2**: Create the inference graph for this node. Here we use the `agg_router.py` (even though we are doing disaggregated serving) graph because we want the `Frontend`, `Processor`, `Router`, and `VllmWorker` to spin up (we spin up the other decode worker and prefill worker separately on different nodes later).
**Step 3**: Create a configuration file for this node. We've provided a sample one for you in `configs/multinode-405b.yaml` for the 405B model. Note that we still include the `PrefillWorker` component in the configuration file even though we are not using it on node 1. This is because we can reuse the same configuration file on all nodes and just spin up individual workers on the other ones.
**Step 4**: Start the frontend, processor, router, and VllmWorker on node 1.
```bash
# node 1
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg_router:Frontend -f ./configs/multinode-405b.yaml
```
**Step 5**: Start the first prefill worker on node 2.
Since we only want to start the `PrefillWorker` on node 2, you can simply run just the PrefillWorker component directly with the configuration file from before.
```bash
# node 2
export NATS_SERVER='<your-nats-server-address>' # note this should start with nats://...
dynamo serve components.worker:VllmWorker -f ./configs/multinode-405b.yaml --service-name VllmWorker
```
Note the use of `--service-name`. This spins up only the worker that you request and ignores any `depends` statements.
###### Client
In another terminal:
```bash
# this test request has around 200 tokens isl
curl <node1-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
"model": "nvidia/Llama-3.1-405B-Instruct-FP8",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": true,
"max_tokens": 300
}'
```
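Because the request sets `"stream": true`, the response arrives as server-sent events. The sketch below shows one way a client might reassemble the generated text, assuming the Frontend emits OpenAI-style `chat.completion.chunk` events terminated by `data: [DONE]`. The function name `iter_stream_text` is hypothetical.

```python
import json
from typing import Iterable, Iterator


def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield content fragments from OpenAI-style streaming chat completion lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = (choice.get("delta") or {}).get("content")
            if text:
                yield text


if __name__ == "__main__":
    sample = [
        'data: {"choices":[{"delta":{"content":"Hel"}}]}',
        "",
        'data: {"choices":[{"delta":{"content":"lo"}}]}',
        "data: [DONE]",
    ]
    print("".join(iter_stream_text(sample)))
```

The same function can be fed the decoded lines of a real HTTP response, for example from an HTTP client's line iterator, instead of the hard-coded sample.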
#### Multi-node sized models
Multinode model support is coming soon. You can track progress [here](https://github.com/ai-dynamo/dynamo/issues/513)!
##### Aggregated Deployment
The steps for aggregated deployment of multi-node sized models are similar to those for
single-node sized models, except that you first need to configure the nodes
to be interconnected according to the framework's multi-node deployment guide.
In the example below, vLLM is used as the framework to serve the `DeepSeek-R1` model
with a tensor parallel size of 16 on two H100x8 nodes.
**Step 1**: On each of the nodes, set up a Ray cluster so that vLLM can access the resources
collectively:
```bash
# head node
ray start --head --port=6379
# example output; keep note of the IP address of the head node
# Local node IP: <head-node-address>
# set vLLM env arg
export VLLM_HOST_IP=<head-node-address>

# other node
ray start --address=<head-node-address>:6379
export VLLM_HOST_IP=<current-node-address>

# verify the accessibility by checking aggregated GPU count shown in ray status
```
**Step 2**: On the head node, follow the [LLM deployment examples](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/README.md) to
set up the Dynamo deployment for aggregated serving, using the configuration file
`configs/multinode_agg_r1.yaml` for DeepSeek-R1:
```bash
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg:Frontend -f ./configs/multinode_agg_r1.yaml
```
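The GPU-count check mentioned in Step 1 can also be scripted before sending requests. This is a small sketch that assumes Ray is installed on the head node and the cluster is already running; `cluster_has_gpus` and `ray_cluster_resources` are illustrative helper names, and the expected count of 16 comes from the two H100x8 nodes in this example.

```python
def cluster_has_gpus(resources: dict, expected_gpus: int) -> bool:
    """Return True if the resource map reports at least expected_gpus GPUs."""
    return resources.get("GPU", 0) >= expected_gpus


def ray_cluster_resources() -> dict:
    """Connect to the already running Ray cluster and return its total resources."""
    import ray  # imported lazily so cluster_has_gpus works without Ray installed

    ray.init(address="auto")
    return ray.cluster_resources()


# On the head node (requires a running Ray cluster):
#   ok = cluster_has_gpus(ray_cluster_resources(), expected_gpus=16)
```

If the check fails, one of the nodes has most likely not joined the cluster yet, or its GPUs are not visible to Ray.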
###### Client
In another terminal, you can send the same curl request as described above but
with `"model": "deepseek-ai/DeepSeek-R1"`
```bash
# this test request has around 200 tokens isl
curl <node1-ip>:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Accept: text/event-stream" \
  -d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": true,
"max_tokens": 300
}'
```
##### Disaggregated Deployment
In this example, we deploy two replicas of the model (one prefill worker
and one decode worker). We use 4 H100x8 nodes and group every two of them
into one Ray cluster in the same way as described in the aggregated deployment.
However, we run the etcd and NATS servers on only one node,
which we treat as the head node of the whole deployment.
Note that if you start the etcd server directly instead of using `docker compose`,
you need to pass additional arguments so that it is discoverable from the other nodes.
# LLM Deployment Examples using TensorRT-LLM
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
## Use the Latest Release
We recommend using the latest stable release of Dynamo to avoid breaking changes.
See [Deployment Architectures](llm_deployment.md#deployment-architectures) to learn about the general idea of the architecture.
Note that this TensorRT-LLM version does not support all the options yet.
```{note}
TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can only configure the deployment to always use aggregate or disaggregated serving.
```
## Getting Started
1. Choose a deployment architecture based on your requirements
2. Configure the components as needed
3. Deploy using the provided scripts
### Prerequisites
Start the required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml):
```bash
docker compose -f deploy/metrics/docker-compose.yml up -d
```
### Build docker
#### Step 1: Build TensorRT-LLM base container image
Because of a known C++11 ABI compatibility issue within the NGC PyTorch container, we rebuild TensorRT-LLM from source.
See [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html) for more information.
Use the helper script to build a TensorRT-LLM container base image. The script uses a specific commit ID from the TensorRT-LLM main branch.
```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
# The script uses python packages like docker-squash to squash image
```
See [here](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#option-1-build-tensorrt-llm-in-one-step) for more details on building from source.
If you already have a TensorRT-LLM container image, you can skip this step.
This build script internally points to the base container image built in step 1. If you skipped the previous step because you already have a container image available, you can run the build script with that image as the base.
```bash
# Build dynamo image with other TRTLLM base image.
Okay, so I'm trying to figure out how to respond to the user's greeting.
They said, "Hello, how are you?" and then followed it with "Hello! I'm just a program, but thanks for asking."
Hmm, I need to come up with a suitable reply. ...
```
## LLM Serving
Dynamo provides a simple way to spin up a local set of inference components including:
- **OpenAI-compatible Frontend**: High-performance OpenAI-compatible HTTP API server written in Rust.
- **Basic and KV-aware Router**: Route and load balance traffic to a set of workers.
- **Workers**: Set of pre-configured LLM serving engines.
To run a minimal configuration, use a pre-configured example.
### Start Dynamo Distributed Runtime Services
To start the Dynamo Distributed Runtime services the first time:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
### Start Dynamo LLM Serving Components
[Explore the VLLM Example](../components/backends/vllm/README.md)
## Local Development
If you use VS Code or Cursor, use the `.devcontainer` folder built on [Microsoft's Extension](https://code.visualstudio.com/docs/devcontainers/containers).
For instructions, see the Dynamo repository's [devcontainer README](https://github.com/ai-dynamo/dynamo/blob/main/.devcontainer/README.md).
Otherwise, to develop locally, we recommend working inside of the container:
```python
# Decrement the number of active requests when complete with one
self.request_active_slots -= 1
```
The granularity at which metrics are published is up to the backend/worker implementation.
For example, you may want to update metrics on every single generation step during token
generation, or you may only want to update once per request, depending on your use case.
Assuming long generation time or long output token sequence lengths, it would be more
accurate to publish metrics at every generation step.
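The trade-off between per-step and per-request publishing can be sketched with a toy worker. This is an illustration only: the `publish` callback is a hypothetical stand-in for whatever metrics publisher your backend uses, and `MetricsTrackingWorker` is not a Dynamo class.

```python
from typing import Callable, Iterator


class MetricsTrackingWorker:
    """Toy worker that reports its active-request count through a callback."""

    def __init__(self, publish: Callable[[dict], None], per_step: bool):
        self.publish = publish      # hypothetical metrics sink
        self.per_step = per_step    # True: also publish on every generation step
        self.request_active_slots = 0

    def _publish(self, tokens_generated: int) -> None:
        self.publish({
            "request_active_slots": self.request_active_slots,
            "tokens_generated": tokens_generated,
        })

    def generate(self, prompt: str, max_tokens: int) -> Iterator[str]:
        self.request_active_slots += 1
        produced = 0
        try:
            for step in range(max_tokens):
                yield f"tok{step}"  # placeholder for a real generated token
                produced += 1
                if self.per_step:
                    self._publish(produced)
        finally:
            # Decrement the active request count once the request completes,
            # then publish a final update in both modes.
            self.request_active_slots -= 1
            self._publish(produced)
```

With `per_step=True`, a three-token request produces four updates (one per step plus the final one), so a router sees the load while the request is still running. With `per_step=False` it produces a single update at the end, which is cheaper but stale for long generations.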
With the worker publishing KV metrics, you should now be able to connect it
to a KV Router through the `KvMetricsAggregator`.
These metrics can then be inputs to a cost function that determines which
of the available workers the request should be routed to.
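As a rough illustration of such a cost function, the sketch below combines prefix-cache overlap, cache usage, and queue depth into a single score. The `WorkerMetrics` type, the `pick_worker` name, and the weights are assumptions made for this example and are not Dynamo's API or its built-in routing policy.

```python
from dataclasses import dataclass


@dataclass
class WorkerMetrics:
    """Hypothetical stand-in for the per-worker metrics a router might receive."""
    worker_id: str
    gpu_cache_usage: float      # fraction of KV cache in use, 0.0 to 1.0
    num_requests_waiting: int   # requests queued on the worker


def pick_worker(overlap_blocks: dict, metrics: list,
                token_length: int, block_size: int = 16) -> str:
    """Return the worker ID with the lowest cost (illustrative weights only)."""
    if not metrics:
        raise ValueError("no workers available")
    # Number of KV blocks the request needs (ceiling division)
    request_blocks = max(1, -(-token_length // block_size))
    max_waiting = max(m.num_requests_waiting for m in metrics) or 1

    def cost(m: WorkerMetrics) -> float:
        # Fraction of the request's blocks already cached on this worker
        hit_ratio = min(1.0, overlap_blocks.get(m.worker_id, 0) / request_blocks)
        load = m.num_requests_waiting / max_waiting
        # Lower is better: reward prefix hits, penalize cache pressure and queueing
        return -2.0 * hit_ratio + m.gpu_cache_usage + load

    return min(metrics, key=cost).worker_id
```

For two equally loaded workers, the one that already holds the request's prefix blocks wins. When no worker has any overlap, the least loaded worker is selected instead.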
For a [python-based KV Router](../../examples/llm/components/kv_router.py)
implementation, the router is like any other worker, and it can expose
an endpoint that can do arbitrary things based on your use case.
For example, you can initialize the `KvMetricsAggregator` and `KvIndexer`
in your class implementation:

```python
@service(
    dynamo={
        "namespace": "your_namespace",
    },
)
class Router:
    # ...
    @async_on_start
    async def async_init(self):
        self.runtime = dynamo_context["runtime"]
        # Initialize a listener/aggregator for collecting KV metrics
        # from the specified component (workers) publishing them
        # ...
```

With this flexibility, you can also define your own cost function that takes the
KV metrics from the KvMetricsAggregator above and the set of available workers
as inputs, and returns which worker ID that the request should be routed to.
Since the router is like any other worker in this context, you can also track
your own custom metrics and use them in your cost function:

```python
@service(
    dynamo={
        "namespace": "your_namespace",
    },
)
class Router:
    # ...
    def _cost_function(
        self,
        scores: OverlapScores | None,
        metrics: AggregatedMetrics | None,
        token_length: int,
        custom_metrics: dict = {},
    ):
        """
        Args:
            scores (OverlapScores | None): The number of matching blocks between
                the request and the prefix cache of each worker.
            metrics (AggregatedMetrics | None): Several worker metrics polled
                by the `KvMetricsAggregator`, currently including the
                GPU cache usage, number of waiting requests, and the
                GPU prefix cache hit rate.
            token_length (int): The number of tokens in the request.
            custom_metrics (dict): Arbitrary (optional) data from your implementation.
        Returns:
            worker_id (str): The best worker ID based on the inputs.
        """
        # This is a dummy implementation for demonstration purposes, see the
        # llm/tensorrt_llm/hello_world examples for more realistic implementations.
        worker_ids = []
        # KV cache block hit scores
        for worker_id, score in scores.scores.items():
            print(f"{worker_id} # of matching KV Blocks of size {self.indexer.block_size()}: {score}")
            worker_ids.append(worker_id)
        # Aggregated KVMetrics published by workers
        for endpoint_metrics in metrics.endpoints:
            print(f"Endpoint metrics: {endpoint_metrics}")
        # Replace this random choice with your custom criteria to
        # select the best worker ID.
        best_worker_id = random.choice(worker_ids)
        return best_worker_id
```

For more details on receiving and routing based on the worker's published KV
metrics, see the [KV Cache Routing Guide](../../docs/architecture/kv_cache_routing.md).

The `model_path` can be:

- A HuggingFace repo ID, optionally prefixed with `hf://`. It is downloaded and cached locally.
- The path to a checkout of a HuggingFace repo - any folder containing safetensor files as well as `config.json`, `tokenizer.json` and `tokenizer_config.json`.
- The path to a GGUF file, if your engine supports that.

The `model_type` can be:

- ModelType.Backend. Dynamo handles pre-processing. Your `generate` method receives a `request` dict containing a `token_ids` array of int. It must return a dict also containing a `token_ids` array and an optional `finish_reason` string.
- ModelType.Chat. Your `generate` method receives a `request` and must return a response dict of type [OpenAI Chat Completion](https://platform.openai.com/docs/api-reference/chat). Your engine handles pre-processing.
- ModelType.Completion. Your `generate` method receives a `request` and must return a response dict of the older [Completions](https://platform.openai.com/docs/api-reference/completions). Your engine handles pre-processing.

`register_llm` can also take the following kwargs:

- `model_name`: The name to call the model. Your incoming HTTP requests model name must match this. Defaults to the hugging face repo name, the folder name, or the GGUF file name.
- `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
- `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.

See `components/backends` for full code examples.

### Component names

A worker needs three names to register itself: namespace.component.endpoint

- *Namespace*: A pipeline. Usually a model. e.g "llama_8b". Just a name.
- *Component*: A load balanced service needed to run that pipeline. "backend", "prefill", "decode", "preprocessor", "draft", etc. This typically has some configuration (which model to use, for example).
- *Endpoint*: Like a URL. "generate", "load_metrics".
- *Instance*: A process. Unique. Dynamo assigns each one a unique instance_id. The thing that is running is always an instance. Namespace/component/endpoint can refer to multiple instances.

If you run two models, that is two pipelines. An exception would be if doing speculative decoding. The draft model is part of the pipeline of a bigger model.

If you run two instances of the same model ("data parallel") they are the same namespace+component+endpoint but different instances. The router will spread traffic over all the instances of a namespace+component+endpoint. If you have four prefill workers in a pipeline, they all have the same namespace+component+endpoint and are automatically assigned unique instance_ids.

Example 1: Data parallel load balanced, one model one pipeline two instances.
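The `namespace.component.endpoint` naming and the idea that several instances can share one name can be illustrated with a toy in-memory registry. Everything here is hypothetical: `EndpointPath` and `ToyRegistry` are not Dynamo classes, and round-robin is only one possible way to spread traffic over instances.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class EndpointPath:
    namespace: str
    component: str
    endpoint: str

    @classmethod
    def parse(cls, path: str) -> "EndpointPath":
        parts = path.split(".")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected namespace.component.endpoint, got {path!r}")
        return cls(*parts)


class ToyRegistry:
    """In-memory registry where many instances can share one endpoint path."""

    def __init__(self):
        self._ids = count(1)   # source of unique instance IDs
        self._instances = {}   # EndpointPath -> list of instance IDs
        self._cursor = {}      # EndpointPath -> next round-robin index

    def register(self, path: str) -> int:
        key = EndpointPath.parse(path)
        instance_id = next(self._ids)
        self._instances.setdefault(key, []).append(instance_id)
        return instance_id

    def route(self, path: str) -> int:
        """Pick the next instance registered under the path, round-robin."""
        key = EndpointPath.parse(path)
        instances = self._instances[key]
        i = self._cursor.get(key, 0)
        self._cursor[key] = (i + 1) % len(instances)
        return instances[i]
```

Registering `"llama_8b.backend.generate"` twice yields two distinct instance IDs under the same name, and successive `route` calls alternate between them, which mirrors the data-parallel case described above.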
### Disaggregated Serving
#### NIXL
NIXL (NVIDIA Inter-process Link) enables efficient GPU memory sharing between processes. In Prefill/Decode disaggregation, we use NIXL to transfer computed KV cache blocks from prefill workers to decode workers. Here are the core concepts:
- GPU memory registration for sharing between processes
- Connection establishment between Prefill and Decode workers
- Efficient block-based KV cache transfers
- Asynchronous transfer notifications
For a complete implementation example using NIXL for disaggregated serving, see the [vLLM example](../examples/llm_deployment.md).
#### Disaggregation in Dynamo
Aside from the NIXL specifics above, at its core, disaggregation in Dynamo builds
on the same concepts used for any Dynamo client<->worker or worker<->worker
interaction over the DistributedRuntime.
First you can define a worker for each as usual:
```python
class DecodeWorker:
    # ...
class PrefillWorker:
    # ...
```
Second, you decide when/how the (Decode) worker should decide to Prefill remotely
(by calling a separate Prefill worker), or decide to simply do the Prefill itself.
In some scenarios, it may be more efficient for the Decode worker to just do the
Prefill itself rather than do the extra communication, such as if the input
sequence length is below some small threshold. If you wanted to disable
disaggregation, the DecodeWorker could just always do the Prefill step as well.

```python
@service(
    dynamo={
        "namespace": "your_namespace",
    },
)
class DecodeWorker:
    def __init__(self):
        self.runtime = dynamo_context["runtime"]
        # Whether the decode worker should call a separate Prefill worker or not
        # ...
```

Example 3: Different endpoints.

The KV metrics publisher in vLLM adds a `load_metrics` endpoint to the current component. If the `llama3-1-8b.backend` component above is using patched vLLM it will also expose `llama3-1-8b.backend.load_metrics`.
Depending on the load distribution of requests and number of Prefill/Decode
worker instances, instead of directly forwarding requests to the Prefill
worker endpoint, it may be advantageous to send Prefill requests into a queue
that the Prefill workers can pull from on-demand instead. You can see an example
of that [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/hello_world/disagg_skeleton/components/prefill_worker.py).
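The two ideas in this section, a length threshold for deciding between local and remote prefill and a queue that prefill workers pull from, can be sketched together with plain `asyncio`. This is a simplified simulation rather than Dynamo code: the function names, the threshold of 512 tokens, and the in-process `asyncio.Queue` (standing in for a distributed queue) are all assumptions.

```python
import asyncio


def should_prefill_remotely(prompt_tokens: int, threshold: int) -> bool:
    """Short prompts are prefilled locally; longer ones go to a prefill worker."""
    return prompt_tokens > threshold


async def decode_worker(request_id: str, prompt_tokens: int,
                        prefill_queue: asyncio.Queue, threshold: int = 512) -> str:
    if not should_prefill_remotely(prompt_tokens, threshold):
        return f"{request_id}: prefilled locally"
    done = asyncio.get_running_loop().create_future()
    await prefill_queue.put((request_id, prompt_tokens, done))
    return await done  # resolved by whichever prefill worker takes the task


async def prefill_worker(name: str, prefill_queue: asyncio.Queue) -> None:
    while True:
        request_id, prompt_tokens, done = await prefill_queue.get()
        await asyncio.sleep(0)  # placeholder for the real prefill computation
        done.set_result(f"{request_id}: prefilled by {name}")
        prefill_queue.task_done()


async def main() -> list:
    queue = asyncio.Queue()
    worker = asyncio.create_task(prefill_worker("prefill-0", queue))
    results = await asyncio.gather(
        decode_worker("req-1", 64, queue),    # below threshold: local prefill
        decode_worker("req-2", 4096, queue),  # above threshold: queued
    )
    worker.cancel()
    return list(results)


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Because prefill workers pull from the queue only when they are free, adding more `prefill_worker` tasks increases throughput without the decode side needing to know how many there are.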
For an introductory example on doing disaggregation with Dynamo using simple models, see
[this example](../examples/disagg_skeleton.md).
For more information on Disaggregated Serving, see the
[general guide](../architecture/disagg_serving.md) and [performance tuning guide](disagg_perf_tuning.md).
## Best Practices
1. **Resource Management**: Configure resource requirements based on your needs:
   ```python
   @service(
       resources={
           "cpu": "10",
           "memory": "20Gi",
           "gpu": "1",
       }
   )
   ```
2. **Async Operations**: Use async/await for I/O operations:
   ```python
   @endpoint()
   async def generate(self, request):
       # Use async operations for better performance
       result = await self.some_async_operation()
   ```
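To make the benefit of async I/O concrete, here is a small standalone sketch in which two simulated I/O calls wait concurrently. `fetch_metadata` is a hypothetical stand-in for any network or disk call, and this `generate` is a plain coroutine rather than a Dynamo endpoint.

```python
import asyncio


async def fetch_metadata(name: str, delay: float) -> str:
    """Stand-in for an I/O-bound call such as a network or disk request."""
    await asyncio.sleep(delay)
    return f"{name}-ready"


async def generate(request: dict) -> list:
    # Both calls wait concurrently, so total latency is close to the slower one
    results = await asyncio.gather(
        fetch_metadata(request["model"], 0.05),
        fetch_metadata("tokenizer", 0.05),
    )
    return list(results)
```

Calling `asyncio.run(generate({"model": "m"}))` returns the two results in the order the calls were passed to `asyncio.gather`, regardless of which one finishes first.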
## Additional Resources
In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill.generate` (possibly multiple instances of this) and `deepseek-distill-llama8b.decode.generate`.
- Check the [examples](https://github.com/ai-dynamo/dynamo/tree/main/examples) directory for more detailed implementations
- Refer to the [Dynamo SDK Docs](../API/sdk.md) for API details.
- For Disaggregated Serving, see the [general guide](../architecture/disagg_serving.md) and [performance tuning guide](disagg_perf_tuning.md).