"deploy/observability/grafana_dashboards/kvbm.json" did not exist on "55659eae70847ee32f200ea25b50792c47310806"
Unverified commit 8d636ebd, authored by Suman Tatiraju and committed by GitHub
parent 6d46288c
......@@ -15,18 +15,9 @@ See the License for the specific language governing permissions and
limitations under the License.
-->
# Using `dynamo serve` to deploy inference graphs locally
# Serving Inference Graphs (`dynamo serve`)
This guide explains how to create, configure, and deploy inference graphs for large language models using the `dynamo serve` command.
## Table of Contents
- [What are inference graphs?](#what-are-inference-graphs)
- [Creating an inference graph](#creating-an-inference-graph)
- [Serving the inference graph](#deploying-the-inference-graph)
- [Guided Example](#guided-example)
## What are inference graphs?
This guide explains how to create, configure, and deploy inference graphs locally for large language models using the `dynamo serve` command.
Inference graphs are compositions of service components that work together to handle LLM inference. A typical graph might include:
......@@ -37,9 +28,13 @@ Inference graphs are compositions of service components that work together to ha
## Creating an inference graph
Once you've written your various Dynamo services (docs on how to write these can be found [here](../../deploy/sdk/docs/sdk/README.md)), you can create an inference graph by composing these services together using the following two mechanisms:
Once you've written Dynamo services ([see the SDK](https://github.com/ai-dynamo/dynamo/blob/main/deploy/dynamo/sdk/docs/sdk/README.md)), create an inference graph by composing them together using the following mechanisms:
1. Dependencies with `depends()`
2. Dynamic composition with `.link()`
See the following sections for more details.
### 1. Dependencies with `depends()`
### Dependencies with `depends()`
```python
from components.worker import VllmWorker
......@@ -58,7 +53,7 @@ Benefits of `depends()`:
- Creates type-safe client connections between services
- Allows calling dependent service methods directly
### 2. Dynamic composition with `.link()`
### Dynamic composition with `.link()`
```python
# From examples/llm/graphs/agg.py
......@@ -82,15 +77,24 @@ The `.link()` method is useful for:
## Deploying the inference graph
Once you've defined your inference graph and its configuration, you can deploy it locally using the `dynamo serve` command! We recommend running the `--dry-run` command so you can see what arguments will be passed into your final graph. And then
Once you've defined your inference graph and its configuration, deploy it locally using the `dynamo serve` command. We recommend running the `--dry-run` command to see what arguments will be passed into your final graph.
Lets walk through an example.
Consider the following example.
## Guided Example
### Guided Example
The files referenced here can be found [here](../../examples/llm/components/). You will need 1 GPU minimum to run this example. This example can be run from the `examples/llm` directory
The files referenced in this example can be found [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/components). You need 1 GPU minimum to run this example. This example can be run from the `examples/llm` directory.
### 1. Define your components
This example walks through:
1. [Defining your components](#define-your-components)
2. [Defining your graph](#define-your-graph)
3. [Defining your configuration](#define-your-configuration)
4. [Serving your graph](#serve-your-graph)
See the following sections for details.
#### Define your components
In this example we'll be deploying an aggregated serving graph. Our components include:
......@@ -125,9 +129,9 @@ class VllmWorker:
...
```
Note that our prebuilt components have the maximal set of dependencies needed to run the component. This allows you to plug in different components to the same graph to create different architectures. When you write your own components, you can be as flexible as you'd like.
Note that our prebuilt components have the maximal set of dependencies needed to run the component, which allows you to plug different components into the same graph to create different architectures. When writing your own components, you can be as flexible as you like.
### 2. Define your graph
#### Define your graph
```python
# graphs/agg.py
......@@ -138,18 +142,17 @@ from components.worker import VllmWorker
Frontend.link(Processor).link(VllmWorker)
```
### 3. Define your configuration
#### Define your configuration
We've provided a set of basic configurations for this example [here](../../examples/llm/configs/agg.yaml). All of these can be changed and also be overridden by passing in CLI flags to serve!
We provide [basic configurations](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/configs/agg.yaml) that you can change; you can also override them by passing in CLI flags to `dynamo serve`.
### 4. Serve your graph
#### Serve your graph
As a prerequisite, ensure you have NATS and etcd running by running the docker compose in the deploy directory. You can find it [here](../../deploy/metrics/docker-compose.yml).
Before serving your graph, ensure that NATS and etcd are running using the [docker compose file](https://github.com/ai-dynamo/dynamo/blob/main/deploy/metrics/docker-compose.yml) in the deploy directory.
```bash
docker compose up -d
```
Note that we point toward the first node in our graph. In this case, it's the `Frontend` service.
```bash
......@@ -157,7 +160,7 @@ Note that the we point toward the first node in our graph. In this case, it's th
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --dry-run
```
This will print out something like
This returns output like:
```bash
Service Configuration:
......@@ -200,7 +203,7 @@ You can override any of these configuration options by passing in CLI flags to s
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --Processor.router=random --dry-run
```
Which will print out something like
This prints output like:
```bash
#...
......@@ -237,8 +240,8 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json"
## Close deployment
> [!IMPORTANT]
> We are aware of an issue where vLLM subprocesses might not be killed when `ctrl-c` is pressed.
> We are working on addressing this. Relevant vLLM issues can be found [here](https://github.com/vllm-project/vllm/pull/8492) and [here](https://github.com/vllm-project/vllm/issues/6219#issuecomment-2439257824).
```{important}
We are aware of an issue where vLLM subprocesses might not be killed when `ctrl-c` is pressed.
We are working on addressing this. Relevant vLLM issues can be found [here](https://github.com/vllm-project/vllm/pull/8492) and [here](https://github.com/vllm-project/vllm/issues/6219#issuecomment-2439257824).
To stop the serve, you can press `ctrl-c` which will kill the different components. In order to kill the remaining vLLM subprocesses you can run `nvidia-smi` and `kill -9` the remaining processes or run `pkill python3` from inside of the container.
To stop the serve, you can press `ctrl-c` which kills the components. In order to kill the remaining vLLM subprocesses you can run `nvidia-smi` and `kill -9` the remaining processes or run `pkill python3` from inside of the container.
......@@ -17,7 +17,7 @@ limitations under the License.
# Planner
The planner is a component that monitors the state of the system and makes adjustments to workers to ensure that the system is running efficiently. Currently, planner can scale up and down the number of vllm workers based on the kv cache load and prefill queue size:
The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently. Currently, the planner can scale the number of vllm workers up and down based on the kv cache load and prefill queue size:
* Backend:
* local ✅
* kubernetes ✅
......@@ -40,12 +40,12 @@ To adjust the number of prefill/decode workers, planner monitors the following m
* Prefill worker: planner monitors the number of requests pending in the prefill queue to estimate the prefill workload.
* Decode/aggregated worker: planner monitors the average KV cache utilization rate to estimate the decode/aggregated workload.
Every `metric-pulling-interval`, planner will gather the aforementioned metrics. Every `adjustment-interval`, planner compares the aggregated metrics in this interval with pre-set thresholds and decides to scale up/down prefill/decode workers. To avoid over-compensation, planner only changes the number of workers by 1 in one adjustment interval. In addition, when the number of workers is being adjusted, the planner will block the metric pulling and adjustment.
Every `metric-pulling-interval`, planner gathers the aforementioned metrics. Every `adjustment-interval`, planner compares the aggregated metrics in this interval with pre-set thresholds and decides to scale up/down prefill/decode workers. To avoid over-compensation, planner only changes the number of workers by 1 in one adjustment interval. In addition, when the number of workers is being adjusted, the planner blocks the metric pulling and adjustment.
To scale up a prefill/decode worker, planner just needs to launch the worker in the correct namespace. The auto-discovery mechanism will pick up the workers and add them to the routers. To scale down a prefill worker, planner sends a SIGTERM signal to the prefill worker. The prefill worker stores the signal and exits when it finishes the current request pulled from the prefill queue. This ensures that no remote prefill request is dropped. To scale down a decode worker, currently, planner revokes the etcd lease of the decode worker. When the etcd lease is revoked, the corresponding decode worker will be immediately removed from the router and will not get any new requests. The decode worker will then finish all the current requests in their original stream and exit gracefully.
To scale up a prefill/decode worker, planner just needs to launch the worker in the correct namespace. The auto-discovery mechanism picks up the workers and adds them to the routers. To scale down a prefill worker, planner sends a SIGTERM signal to the prefill worker. The prefill worker stores the signal and exits when it finishes the current request pulled from the prefill queue. This ensures that no remote prefill request is dropped. To scale down a decode worker, planner revokes the etcd lease of the decode worker. When the etcd lease is revoked, the corresponding decode worker is immediately removed from the router and won't get any new requests. The decode worker then finishes all the current requests in their original stream and exits gracefully.
There are two additional rules set by planner to prevent over-compensation:
1. After a new decode worker is added, since it needs time to populate the kv cache, planner will not scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals.
1. After a new decode worker is added, since it needs time to populate the kv cache, planner doesn't scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals.
1. We do not scale up prefill worker if the prefill queue size is estimated to reduce below the `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals following the trend observed in the current adjustment interval.
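The adjustment rules above can be sketched in Python as follows. This is a minimal illustration, not the planner's actual code: `PlannerState`, `decide_decode`, `decide_prefill`, and the threshold parameter names are hypothetical, and the linear extrapolation of the queue trend is an assumption about how "following the trend" is estimated. Only the change-by-one rule and the two period constants come from the text.

```python
from dataclasses import dataclass

# Values taken from the text; both are counted in adjustment intervals.
NEW_DECODE_WORKER_GRACE_PERIOD = 3
NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD = 3


@dataclass
class PlannerState:
    """Hypothetical bookkeeping kept between adjustment intervals."""
    decode_grace_left: int = 0  # intervals in which decode scale-down is blocked


def decide_decode(state, kv_util, scale_up_thr, scale_down_thr):
    """Return +1, -1, or 0 decode workers for this adjustment interval."""
    blocked = state.decode_grace_left > 0
    if blocked:
        state.decode_grace_left -= 1
    if kv_util > scale_up_thr:
        # A new worker needs time to populate its KV cache: start the grace period.
        state.decode_grace_left = NEW_DECODE_WORKER_GRACE_PERIOD
        return 1
    if kv_util < scale_down_thr and not blocked:
        return -1
    return 0


def decide_prefill(queue_start, queue_end, scale_up_thr, scale_down_thr):
    """Return +1, -1, or 0 prefill workers for this adjustment interval."""
    # Assumption: extrapolate the change seen in this interval linearly.
    trend = queue_end - queue_start
    predicted = queue_end + trend * NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD
    if queue_end > scale_up_thr and predicted >= scale_up_thr:
        return 1
    if queue_end < scale_down_thr:
        return -1
    return 0
```

Each function returns at most one worker of change per call, which mirrors the rule that planner only changes the number of workers by 1 in one adjustment interval.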
## Comply with SLA
......@@ -144,10 +144,11 @@ We currently support two backends:
### Local Backend
Circus is a Python program which can be used to monitor and control processes and sockets. Dynamo serve uses circus to start each node in a graph and monitor each subprocess. We leverage a core feature to do this called `Watcher`. A `Watcher` is the target program that you would like to run (which in our case is `serve_dynamo.py`). When planner decides to scale up or down, it will either add or remove a watcher from the existing `circus`.
Circus is a Python program that can be used to monitor and control processes and sockets. Dynamo serve uses circus to start each node in a graph and monitor each subprocess. We leverage a core feature to do this called `Watcher`. A `Watcher` is the target program that you would like to run (which in our case is `serve_dynamo.py`). When planner decides to scale up or down, it either adds or removes a watcher from the existing `circus`.
> [!NOTE]
> Although circus allows you to `increment` an existing watcher, it was not designed to allow variables to be passed in which does not allow us to schedule on a GPU. So instead we start a new watcher per process. When planner decides to add or remove a worker, we have logic to handle this adding/removing and incrementing/decrementing the workers.
``` {note}
Although circus allows you to `increment` an existing watcher, it was not designed to allow variables to be passed in which does not allow us to schedule on a GPU. So instead we start a new watcher per process. When planner decides to add or remove a worker, we have logic to handle this adding/removing and incrementing/decrementing the workers.
```
#### Statefile
......@@ -155,7 +156,7 @@ The statefile is a json file created when initially running `dynamo serve` and i
When one Decode worker is spun up, the statefile looks like:
```json
```none
{
"dynamo_VllmWorker": {..., resources={...}},
}
......@@ -163,7 +164,7 @@ When one Decode worker is spun up, the statefile looks like:
Now another decode worker is added:
```json
```none
{
"dynamo_VllmWorker": {..., resources={...}},
"dynamo_VllmWorker_1": {..., resources={...}},
......@@ -172,7 +173,7 @@ Now another decode worker is added:
Then one decode worker is removed:
```json
```none
{
"dynamo_VllmWorker": {..., resources={...}},
}
......@@ -180,16 +181,17 @@ Then one decode worker is removed:
If the last decode worker is removed, the statefile looks like:
```json
```none
{
"dynamo_VllmWorker": {...},
}
```
Note that we keep the initial non-suffix entry in order to know what cmd we will need to spin up another worker. This is the same for prefill workers as well.
We keep the initial non-suffix entry in order to know what cmd we'll need to spin up another worker. This is the same for prefill workers as well.
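The statefile transitions shown above can be sketched in Python. This is a minimal illustration under stated assumptions, not the code in `dynamo serve`: the helper names, the `cmd` field, and the choice to remove the highest-numbered suffix first are all hypothetical. The suffix naming and the retained non-suffix entry follow the examples above.

```python
def _suffixed_keys(state, base):
    """Return keys of the form '<base>_<n>', ordered by n."""
    prefix = base + "_"
    keys = [k for k in state if k.startswith(prefix) and k[len(prefix):].isdigit()]
    return sorted(keys, key=lambda k: int(k[len(prefix):]))


def add_worker(state, base, resources):
    """Record a new worker and return the statefile key used for it."""
    entry = state[base]  # the non-suffix entry is always kept as a template
    if "resources" not in entry:
        # No worker is running: reuse the non-suffix entry.
        entry["resources"] = resources
        return base
    n = 1
    while f"{base}_{n}" in state:
        n += 1
    key = f"{base}_{n}"
    # "cmd" is a hypothetical field holding the command needed to start a worker.
    state[key] = {"cmd": entry["cmd"], "resources": resources}
    return key


def remove_worker(state, base):
    """Remove one worker; keep the non-suffix entry so a worker can be restarted."""
    suffixed = _suffixed_keys(state, base)
    if suffixed:
        del state[suffixed[-1]]
    else:
        state[base].pop("resources", None)
```

Starting from a single `dynamo_VllmWorker` entry, one `add_worker` call creates `dynamo_VllmWorker_1`, and two `remove_worker` calls leave only the non-suffix entry without its `resources`.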
> [!NOTE]
> At the moment, planner works best if your initial replicas per worker are 1. This is because if you specify replicas > 1 when you initially start `dynamo serve`, the current implementation in `serving.py` starts each process in the same watcher.
``` {note}
At the moment, planner works best if your initial replicas per worker are 1. This is because if you specify replicas > 1 when you initially start `dynamo serve`, the current implementation in `serving.py` starts each process in the same watcher.
```
### Kubernetes Backend
......
......@@ -35,7 +35,7 @@ python sin_synth.py \
--osl2 150
```
This will generate a [mooncake style trace](https://github.com/kvcache-ai/Mooncake) with
This generates a [mooncake style trace](https://github.com/kvcache-ai/Mooncake) with
* duration = 600 seconds
* isl/osl = 3000/150
* request rate varies sinusoidally from 0.75 to 3 requests with a period of 150 seconds
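As a rough illustration of the request-rate profile listed above, the following Python sketch evaluates a sinusoid that swings between 0.75 and 3 with a period of 150 seconds. The function name and the phase (starting at the midpoint at `t = 0`) are assumptions; this is not taken from `sin_synth.py`.

```python
import math


def request_rate(t, low=0.75, high=3.0, period=150.0):
    """Request rate at time t (seconds) for a sinusoid between low and high."""
    mid = (low + high) / 2  # 1.875 with the defaults
    amp = (high - low) / 2  # 1.125 with the defaults
    # Assumed phase: the rate starts at the midpoint and rises first.
    return mid + amp * math.sin(2 * math.pi * t / period)
```

With these defaults the rate peaks a quarter period in (`t = 37.5`) and bottoms out at three quarters of a period (`t = 112.5`).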
......@@ -76,7 +76,7 @@ and open `http://localhost:6006` in your browser. The following metrics are avai
* `num_decode_workers`: the number of decode workers
* `num_gpu`: the total number of GPUs used
The benchmark results will be printed out in terminal 3 that runs the `genai-perf` command.
The benchmark results are printed out in terminal 3 that runs the `genai-perf` command.
In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--no-operation` flag to watch and log the metrics without making any adjustments:
......@@ -92,7 +92,7 @@ genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deeps
The below two figures show the performance comparison between planner and the baseline 2p2d deployment. Planner achieves 1.5x speedup while using 7.4% less GPU resources.
![Planner Performance Comparison](./images/planner_perf.png)
![Two bar charts comparing 2P2D and Planner. Planner shows lower GPU usage and lower average sequence latency.](../../images/planner_perf.png)
![Planner Tensorboard](./images/planner_tensorboard.png)
![Planner Tensorboard; four line graphs comparing two runs: 2p2d_rr5-20_2 and planner_rr5-20.](../../images/planner_tensorboard.png)
..
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
.. This hidden toctree includes readmes etc that aren't meant to be in the main table of contents but should be accounted for in the sphinx project structure
.. toctree::
:maxdepth: 2
:hidden:
guides/README.md
runtime/README.md
examples/disagg_skeleton.md
\ No newline at end of file
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
..
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Welcome to NVIDIA Dynamo
========================
NVIDIA Dynamo is a high-throughput low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
Dive in: Examples
-----------------------
.. grid:: 1 2 2 2
:gutter: 3
:margin: 0
:padding: 3 4 0 0
.. grid-item-card:: :doc:`Hello World </examples/hello_world>`
:link: /examples/hello_world
:link-type: doc
Demonstrates the basic concepts of Dynamo by creating a simple multi-service pipeline.
.. grid-item-card:: :doc:`LLM Deployment </examples/llm_deployment>`
:link: /examples/llm_deployment
:link-type: doc
Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.
.. grid-item-card:: :doc:`Multinode </examples/multinode>`
:link: /examples/multinode
:link-type: doc
Demonstrates deployment for disaggregated serving on 3 nodes using `nvidia/Llama-3.1-405B-Instruct-FP8`.
.. grid-item-card:: :doc:`TensorRT-LLM </examples/trtllm>`
:link: /examples/trtllm
:link-type: doc
Presents TensorRT-LLM examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.
Overview
--------
The NVIDIA Dynamo Platform is a high-performance, low-latency inference platform
designed to serve all AI models—across any framework, architecture, or deployment scale.
Dynamo is inference engine agnostic, supporting TRT-LLM, vLLM, SGLang, and others, and captures
LLM-specific capabilities such as:
* **Disaggregated prefill & decode inference** - Maximizes GPU throughput and facilitates trade off between throughput and latency.
* **Dynamic GPU scheduling** - Optimizes performance based on fluctuating demand.
* **LLM-aware request routing** - Eliminates unnecessary KV cache re-computation.
* **Accelerated data transfer** - Reduces inference response time using NIXL.
* **KV cache offloading** - Leverages several memory hierarchies for higher system throughput.
Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source
and driven by a transparent, OSS (Open Source Software) first development approach.
.. toctree::
:hidden:
Welcome to Dynamo <self>
Support Matrix <support_matrix.md>
Getting Started <get_started.md>
.. toctree::
:hidden:
:caption: Architecture & Features
High Level Architecture <architecture/architecture.md>
Distributed Runtime <architecture/distributed_runtime.md>
Disaggregated Serving <architecture/disagg_serving.md>
KV Block Manager <architecture/kvbm_intro.rst>
KV Cache Routing <architecture/kv_cache_routing.md>
Planner <guides/planner.md>
.. toctree::
:hidden:
:caption: Dynamo Command Line Interface
CLI Overview <guides/cli_overview.md>
Running Dynamo (dynamo run) <guides/dynamo_run.md>
Serving Inference Graphs (dynamo serve) <guides/dynamo_serve.md>
Building Dynamo (dynamo build) <guides/dynamo_build.md>
Deploying Inference Graphs (dynamo deploy) <guides/dynamo_deploy/README.md>
.. toctree::
:hidden:
:caption: Usage Guides
Writing Python Workers in Dynamo <guides/backend.md>
Disaggregation and Performance Tuning <guides/disagg_perf_tuning.md>
KV Cache Router Performance Tuning <guides/kv_router_perf_tuning.md>
Planner Benchmark Example <guides/planner_benchmark/benchmark_planner.md>
.. toctree::
:hidden:
:caption: Deployment Guides
Dynamo Cloud Kubernetes Platform <guides/dynamo_deploy/dynamo_cloud.md>
Deploying Dynamo Inference Graphs to Kubernetes using the Dynamo Cloud Platform <guides/dynamo_deploy/operator_deployment.md>
Manual Helm Deployment <guides/dynamo_deploy/manual_helm_deployment.md>
Minikube Setup Guide <guides/dynamo_deploy/minikube.md>
.. toctree::
:hidden:
:caption: API
SDK Reference <API/sdk.md>
Python API <API/python_bindings.md>
.. toctree::
:hidden:
:caption: Examples
Hello World Example <examples/hello_world.md>
LLM Deployment Examples <examples/llm_deployment.md>
Multinode Examples <examples/multinode.md>
LLM Deployment Examples using TensorRT-LLM <examples/trtllm.md>
# Dynamo Distributed KV Cache Manager
Calculating LLM KV values for user requests is resource-intensive and thus expensive. Leveraging KV cache to minimize the need for its recomputation is common practice. However, as AI demand increases, relying solely on GPU memory for KV cache is not sustainable for meeting SLAs under a fixed budget. This creates a need for a more effective mechanism for managing KV cache reuse.
The Dynamo KV Cache Manager feature addresses this challenge by enabling the offloading of older or less frequently accessed KV cache blocks to more cost-effective memory and storage solutions, such as CPU memory, local storage or networked object or file storage. This capability enables organizations to store up to petabytes of KV cache data at a fraction of the cost of keeping it in GPU memory. By offloading KV cache to alternative memory hierarchies, developers can free up valuable GPU resources while still retaining and reusing historical KV cache to reduce inference computation costs.
<figure>
<img src='images/kv_cache_mgr.png' alt='missing' />
<p>Figure 1. Dynamo Distributed KV Cache Manager offloads less frequently accessed KV cache to more economical memory hierarchies </p>
</figure>
The Dynamo KV Cache Manager uses advanced caching policies that prioritize placing frequently accessed data in GPU memory, while less accessed data is moved to shared CPU memory, SSDs, or networked object storage. It incorporates eviction policies that strike a balance between over-caching (which can introduce lookup latencies) and under-caching (which leads to missed lookups and KV cache re-computation).
Additionally, this feature can manage KV cache across multiple GPU nodes, supporting both distributed and disaggregated inference serving, and offers hierarchical caching capabilities, creating offloading strategies at the GPU, node, and cluster levels.
The Dynamo KV Cache Manager is designed to be framework-agnostic to support various backends, including TensorRT-LLM, vLLM, and SGLang, and to facilitate the scaling of KV cache storage across large, distributed clusters using NVLink, NVIDIA Quantum switches, and NVIDIA Spectrum switches. It integrates with [NIXL](https://github.com/ai-dynamo/nixl/blob/main/docs/nixl.md) to enable data transfers across different worker instances and storage backends.
## Design
- Separation of Mechanism and Policy
- Mechanism: Manages memory allocation, caching hierarchy, and data flow.
- Policy: Determines caching strategies, including the choice of data structures (e.g., radix tree, distributed hash tables) and eviction algorithms.
This separation ensures that the underlying infrastructure can evolve without disrupting the caching logic. This design decision was created to enable each customer to come up with their own policies and mechanisms to manage memory that fits their access pattern.
- Hierarchical caching
- A radix tree provides a clean, structured approach for organizing KV storage in distributed inference. A local tree can be built per node, with a global tree at the cluster level, ensuring an efficient abstraction.
- The hierarchy spans HBM, local node KV stores, and external storage, with each layer caching data for the next to optimize lookups. Data movements across the tiers are handled using NIXL APIs for seamless communication. The data flow is fully asynchronous and is transparent to worker instances.
- Multiple backends are supported as long as they are compatible with KV manager APIs.
- RDMA transfers are preferred for optimal performance.
- Registration with runtimes
- Distributed KV manager registers with inference engine runtimes to enable KV offloading to the pool.
- Registration creates a two-way communication queue between the runtime and the pool.
- Management and transfer granularities
- KV blocks are managed at the block level (a group of tokens); however, transfers of KV state can be performed at the layer level.
- If multiple tokens need to be fetched, these layer transfers are parallelized to ensure maximum throughput from the KV pool.
## V1 Implementation
Dynamo Distributed KV Manager has two implementations: V1 and V2. V1 serves as a proof-of-concept design, providing a lightweight KV offloading framework with simple, asynchronous APIs — GET() and PUT(), allowing inference engines to offload KV caches efficiently. These APIs are designed to be fully asynchronous, enabling seamless overlap with inference computation.
<figure>
<img src='images/kv_cache_mgr_design.png' alt='missing' />
<p>Figure 2. Design of Dynamo KV manager V1 </p>
</figure>
The left section of Figure 2 illustrates the execution timeline and data movement sequence in the V1 architecture. Inference engines like vLLM can initiate asynchronous operations with flexible access granularity, enabling various overlapping strategies to optimize execution based on whether the priority is throughput or latency.
The right section of Figure 2 depicts data flow within the runtime. At present, we do not allocate any portion of the GPU's high-bandwidth memory (HBM) beyond what is required by the inference engine, ensuring its exclusive utilization for inference tasks. Within the inference runtime, GPU device memory can either be fully dedicated to key-value (KV) storage or partially allocated for prefix KV caches, which are dynamically managed by the inference engine—similar to vLLM.
When the inference engine determines that some entries in the KV cache should be evicted from GPU memory, it invokes the put_async() API to offload them to the KV manager, which updates its index and transfers the data to the appropriate storage tier (CPU memory or a combination of CPU and SSD). Conversely, if the inference engine fails to locate a required KV entry in its self-managed prefix cache, it issues a get_async() request to the KV manager. If the KV entry already exists, retrieval via get_async() significantly reduces recomputation overhead, ensuring efficient KV management, optimized memory utilization, and improved inference performance.
In the V1 implementation, CPU memory functions as a cache layer for SSD storage. If a required KV entry resides in CPU memory, the system bypasses SSD access, reducing transfer latency. Asynchronous APIs such as get_async() and put_async() also allow transfers to proceed so that they do not impact system performance.
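The CPU-over-SSD lookup order described above can be sketched in Python. This is a minimal sketch under stated assumptions: the class `TieredKVStore`, its method signatures, the LRU eviction order, and the write-back of evicted entries to the SSD tier are hypothetical, and plain dictionaries stand in for real CPU and SSD storage. Only the names `put_async` and `get_async` and the "CPU memory caches SSD" relationship come from the text.

```python
import asyncio
from collections import OrderedDict


class TieredKVStore:
    """Toy two-tier store: a bounded CPU tier in front of an SSD tier."""

    def __init__(self, cpu_capacity):
        self.cpu = OrderedDict()  # least recently used entry first
        self.ssd = {}             # stand-in for SSD-backed storage
        self.cpu_capacity = cpu_capacity

    async def put_async(self, key, block):
        """Store a block in the CPU tier, spilling old entries to the SSD tier."""
        self.cpu[key] = block
        self.cpu.move_to_end(key)
        while len(self.cpu) > self.cpu_capacity:
            old_key, old_block = self.cpu.popitem(last=False)
            self.ssd[old_key] = old_block

    async def get_async(self, key):
        """Return the block for key, or None if no tier holds it."""
        if key in self.cpu:
            # CPU hit: the SSD tier is not touched at all.
            self.cpu.move_to_end(key)
            return self.cpu[key]
        if key in self.ssd:
            block = self.ssd[key]
            await self.put_async(key, block)  # promote for later lookups
            return block
        return None
```

A `None` result corresponds to the case where the inference engine has to recompute the KV entry.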
A key aspect of our implementation is the introduction of multiple parallel queues (or pipelines) for critical operations, including:
- Index matching, updates, and block allocation/free operations
- Data transfers between GPU and CPU
- Data transfers between CPU and SSD
This multi-queue design is crucial because it:
- Enables true asynchronous execution by decoupling blocking operations.
- Maximizes parallelism, allowing multiple requests to be processed concurrently.
- Fully utilizes different hardware resources, such as CPU, GPU, and storage, to avoid bottlenecks.
- Decouples slow operations (e.g., SSD writes) from the critical path of responding to user queries, to improve responsiveness.
- Ensures the correctness of index updates and data transfers, even under high-throughput, concurrent workloads.
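The multi-queue idea above can be illustrated with a small `asyncio` pipeline in Python. This is a sketch, not the V1 implementation: the stage names, the `stage` and `run_pipeline` helpers, and the use of `None` as a shutdown marker are hypothetical, and `asyncio.sleep(0)` stands in for real index updates and transfers.

```python
import asyncio


async def stage(name, inbox, outbox, log):
    """Consume items from inbox, record them, and forward them to outbox."""
    while True:
        item = await inbox.get()
        if item is None:  # shutdown marker: pass it on and stop
            if outbox is not None:
                await outbox.put(None)
            break
        await asyncio.sleep(0)  # stand-in for the real work of this stage
        log.append((name, item))
        if outbox is not None:
            await outbox.put(item)


async def run_pipeline(block_ids):
    """Push block ids through index -> GPU-to-CPU -> CPU-to-SSD queues."""
    log = []
    q_index, q_g2c, q_c2s = asyncio.Queue(), asyncio.Queue(), asyncio.Queue()
    workers = [
        asyncio.create_task(stage("index", q_index, q_g2c, log)),
        asyncio.create_task(stage("gpu_to_cpu", q_g2c, q_c2s, log)),
        asyncio.create_task(stage("cpu_to_ssd", q_c2s, None, log)),
    ]
    for block_id in block_ids:
        await q_index.put(block_id)
    await q_index.put(None)
    await asyncio.gather(*workers)
    return log
```

Because each stage has its own queue, a slow stage such as the SSD write only delays items waiting in its own inbox, while earlier stages keep accepting new blocks. Each queue is first-in, first-out, so blocks leave every stage in the order they entered it.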
Looking ahead, V1 architecture will integrate with NIXL to enable KV reuse across multiple nodes. Additionally, we will add GPUDirect Storage capabilities to reduce the get_async() latency and minimize the CPU overhead while facilitating direct data transfers between GPU memory and SSD. These enhancements will be made available post-GTC.
V1 architecture is an excellent design for quick enablement and execution. However, it does not offer fine-grained control over memory management or interactions with the NVIDIA Dynamo ecosystem. To address this, we are implementing the V2 architecture in parallel, which provides a notion of a distributed KV pool across workers and storage. V2 architecture will be released in coming weeks.
## V2 Implementation
The V2 implementation introduces a distributed KV pool across worker instances and storage, incorporating all features outlined in the design. Development is still in progress, and we welcome collaborators to share their feedback. This documentation aims to offer a high-level overview of the V2 implementation and gather input.
The V2 BlockManager changes the ownership patterns to RAII objects. The primary object will be a KvBlock object which defines the contents of the tokens in the block and the unique sequence hash associated with that block. In Rust, the KvBlock is a generic KvBlock<S: BlockStorage>. This means each KvBlock is strongly typed to the storage type (S) which must conform to the behavior defined in the BlockStorage trait.
KvBlocks are allocated and ownership is transferred to a ReusePool object. The ReusePool object is used to provide free blocks in user defined priority. This specialized Pool is a compound collective, so it can also lookup and extract blocks from the pool by matching on the sequence hash.
When acquired from the ReusePool, the object is a PoolItem<KvBlock<BST>>. PoolItem is an RAII object: when it goes out of scope (Drop), the block is returned to the pool.
A PoolItem<KvBlock<BST>>, which is type-aliased as `UniqueBlock<BST>`, is a uniquely owned and mutable block. In order to make the block shareable and discoverable, the UniqueBlock<BST> must be registered with the ReservedBlockRegistry. Upon registration, a RegisteredBlock<BST> is returned; this block is shared, immutable, and discoverable.
- Immutable - should only provide a const pointer to the storage.
- Shared - internally an atomically reference-counted object, and therefore Cloneable.
  - When refcount → 0, the block is unregistered and the backing `UniqueBlock<BST>` is returned to the ReusePool.
- Discoverable/Reserved
  - Incoming requests can be matched to blocks by sequence hash, which returns a list of `RegisteredBlock<BST>` clones for the matching blocks.
  - Registered block state changes are emitted as events, allowing the KV Aware Router to add/remove the block from the radix tree.
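The block lifecycle described above (acquire from the pool, register, share, return on last release) can be sketched in plain Python. This is only an illustration of the ownership flow, not the Rust implementation: the class and method names (`BlockRegistry`, `take_by_hash`, `give_back`, `release`) and the event labels are assumptions, FIFO order stands in for the user-defined priority, and an explicit `release()` stands in for `Drop`.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)
class KvBlock:
    """Hypothetical block: a sequence hash plus the tokens it covers."""
    sequence_hash: Optional[int] = None
    tokens: tuple = ()


class ReusePool:
    """Free blocks in FIFO order (a stand-in for user-defined priority)."""

    def __init__(self, num_blocks: int) -> None:
        self._free = deque(KvBlock() for _ in range(num_blocks))

    def __len__(self) -> int:
        return len(self._free)

    def acquire(self) -> KvBlock:
        return self._free.popleft()

    def take_by_hash(self, sequence_hash: int) -> Optional[KvBlock]:
        # Blocks keep their hash after release, so they can be found again.
        for block in self._free:
            if block.sequence_hash == sequence_hash:
                self._free.remove(block)
                return block
        return None

    def give_back(self, block: KvBlock) -> None:
        self._free.append(block)


class BlockRegistry:
    """Tracks shared blocks by sequence hash and records state-change events."""

    def __init__(self, pool: ReusePool) -> None:
        self._pool = pool
        self._blocks = {}      # sequence hash -> KvBlock
        self._refcounts = {}   # sequence hash -> number of live handles
        self.events = []       # what a router could consume

    def register(self, block: KvBlock) -> "RegisteredBlock":
        self._blocks[block.sequence_hash] = block
        self._refcounts[block.sequence_hash] = 0
        self.events.append(("stored", block.sequence_hash))
        return self._share(block.sequence_hash)

    def match(self, hashes) -> list:
        return [self._share(h) for h in hashes if h in self._blocks]

    def _share(self, sequence_hash: int) -> "RegisteredBlock":
        self._refcounts[sequence_hash] += 1
        return RegisteredBlock(self, sequence_hash)

    def _unshare(self, sequence_hash: int) -> None:
        self._refcounts[sequence_hash] -= 1
        if self._refcounts[sequence_hash] == 0:
            del self._refcounts[sequence_hash]
            self.events.append(("removed", sequence_hash))
            self._pool.give_back(self._blocks.pop(sequence_hash))


class RegisteredBlock:
    """Shared, read-only handle; release() plays the role of Drop."""

    def __init__(self, registry: BlockRegistry, sequence_hash: int) -> None:
        self._registry = registry
        self.sequence_hash = sequence_hash
        self._released = False

    def release(self) -> None:
        if not self._released:
            self._released = True
            self._registry._unshare(self.sequence_hash)


pool = ReusePool(num_blocks=2)
registry = BlockRegistry(pool)

block = pool.acquire()                     # uniquely owned, still mutable
block.sequence_hash, block.tokens = 0xABC, (1, 2, 3)
shared = registry.register(block)          # now shared and discoverable
matches = registry.match([0xABC, 0xDEF])   # only 0xABC is registered

for handle in [shared, *matches]:
    handle.release()                       # the last release returns the block
```

Once the last handle is released, the block is back in the pool with its sequence hash intact, so a later request for the same hash can pull it out with `take_by_hash` instead of recomputing it.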
All data movement requires either Shared or Unique block ownership, held for the duration of the TransferEngine's operation.
For example,
```rust
pub async fn copy_blocks<D, S>(dst: &[KvBlock<D>], src: &[KvBlock<S>]) -> Result<()>;
```
This allows us to specialize implementations on D and S, which the compiler matches at compile time. In Python, the call is dynamically dispatched, and mismatched types are raised as exceptions.
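As a rough illustration of the Python-side behavior, the sketch below dispatches on the runtime storage types of each block pair and raises `TypeError` for an unsupported combination. The storage classes, the `_COPY_IMPLS` table, and the set of supported pairs are all assumptions made for the example; the actual bindings may look different.

```python
from dataclasses import dataclass


class DeviceStorage: ...
class HostStorage: ...
class DiskStorage: ...


@dataclass
class KvBlock:
    storage: object      # instance of one of the storage classes above
    data: bytearray


def _memcpy(dst, src):
    dst.data[:] = src.data


# (destination storage type, source storage type) -> copy routine
_COPY_IMPLS = {
    (HostStorage, DeviceStorage): _memcpy,
    (DiskStorage, HostStorage): _memcpy,
}


def copy_blocks(dst, src):
    if len(dst) != len(src):
        raise ValueError("dst and src must contain the same number of blocks")
    # Resolve every pair first so a mismatch fails before any data moves.
    impls = []
    for d, s in zip(dst, src):
        key = (type(d.storage), type(s.storage))
        if key not in _COPY_IMPLS:
            raise TypeError(
                f"no transfer from {key[1].__name__} to {key[0].__name__}"
            )
        impls.append(_COPY_IMPLS[key])
    for impl, d, s in zip(impls, dst, src):
        impl(d, s)


gpu = [KvBlock(DeviceStorage(), bytearray(b"kv-data"))]
cpu = [KvBlock(HostStorage(), bytearray(7))]
copy_blocks(cpu, gpu)    # supported pair: device -> host
```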
The underlying storage is layer-aware, which allows us to expose a layer-wise trigger.
```rust
pub async fn copy_blocks_by_layer<D, S>(dst: &[KvBlock<D>], src: &[KvBlock<S>], layers: &[usize]) -> Result<()>;
```
To coordinate layer-wise chaining of transfers, for example GPU -> CPU -> Storage, we will provide a TransferCoordinator that can pipeline layer transfers from one storage tier to the next. For example, the moment a layer or set of layers arrives in CPU memory from the GPU, we can trigger the CPU -> Storage transfer for those layers. This allows the secondary transfers to overlap layer-wise with the primary transfers.
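The layer-wise overlap can be sketched with `asyncio`: a primary task moves layers GPU -> CPU and hands each finished layer to a queue, while a secondary task starts the CPU -> Storage hop as soon as a layer is available. This is a toy model under stated assumptions: `transfer` just sleeps, and the function names and stage labels are hypothetical, not the TransferCoordinator API.

```python
import asyncio


async def transfer(stage, layer, log, seconds=0.01):
    """Stand-in for a real copy: wait, then record that the layer arrived."""
    await asyncio.sleep(seconds)
    log.append((stage, layer))


async def pipeline_layers(layers, log):
    arrived_on_cpu = asyncio.Queue()

    async def primary():
        for layer in layers:
            await transfer("gpu->cpu", layer, log)
            await arrived_on_cpu.put(layer)   # trigger the next hop right away
        await arrived_on_cpu.put(None)        # sentinel: no more layers

    async def secondary():
        while True:
            layer = await arrived_on_cpu.get()
            if layer is None:
                break
            await transfer("cpu->storage", layer, log)

    await asyncio.gather(primary(), secondary())


log = []
asyncio.run(pipeline_layers([0, 1, 2], log))
```

In the resulting log, each layer's `cpu->storage` entry follows its own `gpu->cpu` entry, while the secondary transfer of early layers completes before the primary transfer of the last layer, which is the overlap the coordinator is meant to provide.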
# Dynamo Runtime
<h4>A Datacenter Scale Distributed Inference Serving Framework</h4>
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
Rust implementation of the Dynamo runtime system, enabling distributed computing capabilities for machine learning workloads.
## Prerequisites
### Install Rust and Cargo using [rustup](https://rustup.rs/):
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
### Build
```
cargo build
cargo test
```
### Start Dependencies
#### Docker Compose
The simplest way to deploy the pre-requisite services is using
[docker-compose](https://docs.docker.com/compose/install/linux/),
defined in [deploy/metrics/docker-compose.yml](../../deploy/metrics/docker-compose.yml).
```
docker-compose up -d
```
This will deploy a [NATS.io](https://nats.io/) server and an [etcd](https://etcd.io/)
server used to communicate between and discover components at runtime.
#### Local (alternate)
To deploy the pre-requisite services locally instead of using `docker-compose`
above, you can manually launch each:
- [NATS.io](https://docs.nats.io/running-a-nats-service/introduction/installation) server with [Jetstream](https://docs.nats.io/nats-concepts/jetstream)
- example: `nats-server -js --trace`
- [etcd](https://etcd.io) server
- follow instructions in [etcd installation](https://etcd.io/docs/v3.5/install/) to start an `etcd-server` locally
### Run Examples
When developing or running examples, any process or user that shares your core services (`etcd` and `nats.io`) will
be operating within your distributed runtime.
The current examples use a hard-coded `namespace`. We will address the `namespace` collisions later.
All examples require the `etcd` and `nats.io` pre-requisites to be running and available.
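Before running an example, it can help to confirm that both services accept TCP connections. The following is a minimal sketch that assumes the default client ports (4222 for NATS, 2379 for etcd) on `localhost`; the helper names are hypothetical, and a successful connection only shows that something is listening, not that the service is healthy.

```python
import socket


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Assumed defaults; adjust if your services listen elsewhere.
SERVICES = {"nats": ("localhost", 4222), "etcd": ("localhost", 2379)}


def check_services(services=SERVICES) -> dict:
    return {name: is_reachable(host, port) for name, (host, port) in services.items()}


if __name__ == "__main__":
    for name, ok in check_services().items():
        print(f"{name}: {'reachable' if ok else 'NOT reachable'}")
```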
#### Rust `hello_world`
With two terminals open, in one window:
```
cd examples/hello_world
cargo run --bin server
```
In the second terminal, execute:
```
cd examples/hello_world
cargo run --bin client
```
which should yield some output similar to:
```
Finished `dev` profile [unoptimized + debuginfo] target(s) in 6.25s
Running `target/debug/client`
Annotated { data: Some("h"), id: None, event: None, comment: None }
Annotated { data: Some("e"), id: None, event: None, comment: None }
Annotated { data: Some("l"), id: None, event: None, comment: None }
Annotated { data: Some("l"), id: None, event: None, comment: None }
Annotated { data: Some("o"), id: None, event: None, comment: None }
Annotated { data: Some(" "), id: None, event: None, comment: None }
Annotated { data: Some("w"), id: None, event: None, comment: None }
Annotated { data: Some("o"), id: None, event: None, comment: None }
Annotated { data: Some("r"), id: None, event: None, comment: None }
Annotated { data: Some("l"), id: None, event: None, comment: None }
Annotated { data: Some("d"), id: None, event: None, comment: None }
```
#### Python
See the [README.md](../API/python_bindings.md) for details.
The Python and Rust `hello_world` client and server examples are interchangeable,
so you can start the Python `server.py` and talk to it from the Rust `client`.
# Dynamo Support Matrix
This document provides the support matrix for Dynamo, including hardware, software and build instructions.
......@@ -10,7 +28,9 @@ This document provides the support matrix for Dynamo, including hardware, softwa
| **x86_64** | Supported |
| **ARM64** | Experimental |
> **Note**: While **x86_64** architecture is supported on systems with a minimum of 32 GB RAM and at least 4 CPU cores, the **ARM64** support is experimental and may have limitations.
```{note}
While **x86_64** architecture is supported on systems with a minimum of 32 GB RAM and at least 4 CPU cores, the **ARM64** support is experimental and may have limitations.
```
### GPU Compatibility
......@@ -34,7 +54,9 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
| **Ubuntu** | 24.04 | ARM64 | Experimental |
| **CentOS Stream** | 9 | x86_64 | Experimental |
> **Note**: For **Linux**, the **ARM64** support is experimental and may have limitations. Wheels are built using a manylinux_2_28-compatible environment and they have been validated on CentOS 9 and Ubuntu (22.04, 24.04). Compatibility with other Linux distributions is expected but has not been officially verified yet.
```{note}
For **Linux**, the **ARM64** support is experimental and may have limitations. Wheels are built using a manylinux_2_28-compatible environment and they have been validated on CentOS 9 and Ubuntu (22.04, 24.04). Compatibility with other Linux distributions is expected but has not been officially verified yet.
```
## Software Compatibility
### Runtime Dependency
......@@ -55,7 +77,7 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
> **Note**:
> - *ai-dynamo-vllm v0.8.4.post1 is a customized patch of v0.8.4 from vLLM.
> - **The specific version of TensorRT-LLM (planned v0.19.0) that will be supported by Dynamo is subject to change.
> - **Specific versions of TensorRT-LLM supported by Dynamo are subject to change.
## Build Support
......
# Hello World Example
## Overview
This example demonstrates the basic concepts of Dynamo by creating a simple multi-service pipeline. It shows how to:
1. Create and connect multiple Dynamo services
2. Pass data between services using Dynamo's runtime
3. Set up a simple HTTP API endpoint
4. Deploy and interact with a Dynamo service graph
Pipeline Architecture:
```
Users/Clients (HTTP)
┌─────────────┐
│ Frontend │ HTTP API endpoint (/generate)
└─────────────┘
│ dynamo/runtime
┌─────────────┐
│ Middle │
└─────────────┘
│ dynamo/runtime
┌─────────────┐
│ Backend │
└─────────────┘
```
## Component Descriptions
### Frontend Service
- Serves as the entry point for external HTTP requests
- Exposes a `/generate` HTTP API endpoint that clients can call
- Processes incoming text and passes it to the Middle service
### Middle Service
- Acts as an intermediary service in the pipeline
- Receives requests from the Frontend
- Appends "-mid" to the text and forwards it to the Backend
### Backend Service
- Functions as the final service in the pipeline
- Processes requests from the Middle service
- Appends "-back" to the text and yields tokens
## Running the Example Locally
1. Launch all three services using a single command:
```bash
cd /workspace/examples/hello_world
dynamo serve hello_world:Frontend
```
The `dynamo serve` command deploys the entire service graph, automatically handling the dependencies between Frontend, Middle, and Backend services.
2. Send a request to the frontend using curl:
```bash
curl -X 'POST' \
'http://localhost:8000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"text": "test"
}'
```
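The same request can be sent from Python with only the standard library. This sketch mirrors the curl command above (same URL, headers, and JSON body); the function names are hypothetical, and it assumes the response arrives as newline-delimited text that can be read line by line.

```python
import json
import urllib.request


def build_generate_request(text: str, base_url: str = "http://localhost:8000"):
    """Build the POST request that the curl command above sends."""
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        method="POST",
        headers={
            "accept": "text/event-stream",
            "Content-Type": "application/json",
        },
    )


def stream_generate(text: str, base_url: str = "http://localhost:8000"):
    """Yield non-empty response lines as they arrive (requires a running server)."""
    with urllib.request.urlopen(build_generate_request(text, base_url)) as resp:
        for raw in resp:
            line = raw.decode("utf-8").strip()
            if line:
                yield line


if __name__ == "__main__":
    for line in stream_generate("test"):
        print(line)
```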
## Deploying to and Running the Example in Kubernetes
This example can be deployed to a Kubernetes cluster using [Dynamo Cloud](../../docs/guides/dynamo_deploy/dynamo_cloud.md) and the Dynamo CLI.
### Prerequisites
You must have first followed the instructions in [deploy/cloud/helm/README.md](../../deploy/cloud/helm/README.md) to create your Dynamo cloud deployment.
### Deployment Steps
For detailed deployment instructions, please refer to the [Operator Deployment Guide](../../docs/guides/dynamo_deploy/operator_deployment.md). The following are the specific commands for the hello world example:
```bash
# Set your project root directory
export PROJECT_ROOT=$(pwd)
# Configure environment variables (see operator_deployment.md for details)
export KUBE_NS=hello-world
export DYNAMO_CLOUD=http://localhost:8080 # If using port-forward
# OR
# export DYNAMO_CLOUD=https://dynamo-cloud.nvidia.com # If using Ingress/VirtualService
# Build the Dynamo base image (see operator_deployment.md for details)
export DYNAMO_IMAGE=<your-registry>/<your-image-name>:<your-tag>
# Build the service
cd $PROJECT_ROOT/examples/hello_world
DYNAMO_TAG=$(dynamo build hello_world:Frontend | grep "Successfully built" | awk '{ print $3 }' | sed 's/\.$//')
# Deploy to Kubernetes
export DEPLOYMENT_NAME=ci-hw
dynamo deployment create $DYNAMO_TAG -n $DEPLOYMENT_NAME
```
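The `grep | awk | sed` pipeline above picks the third whitespace-separated field from the "Successfully built" line and strips one trailing period. The Python sketch below does the same for the first matching line; the sample output line in the usage is invented for illustration, since the exact `dynamo build` output format is not shown here.

```python
from typing import Optional


def extract_built_tag(build_output: str) -> Optional[str]:
    """Return the third field of the first 'Successfully built' line, if any."""
    for line in build_output.splitlines():
        if "Successfully built" not in line:
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        tag = fields[2]
        # Equivalent of sed 's/\.$//': drop a single trailing period.
        return tag[:-1] if tag.endswith(".") else tag
    return None


# Hypothetical build log, used only to exercise the parser.
sample_output = "Building service...\nSuccessfully built frontend:abc123.\n"
tag = extract_built_tag(sample_output)
```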
### Testing the Deployment
Once the deployment is complete, you can test it using:
```bash
# Find your frontend pod
export FRONTEND_POD=$(kubectl get pods -n ${KUBE_NS} | grep "${DEPLOYMENT_NAME}-frontend" | sort -k1 | tail -n1 | awk '{print $1}')
# Forward the pod's port to localhost (this command blocks, so run the curl request from a second terminal)
kubectl port-forward pod/$FRONTEND_POD 8000:8000 -n ${KUBE_NS}
# Test the API endpoint
curl -X 'POST' 'http://localhost:8000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{"text": "test"}'
```
For more details on managing deployments, testing, and troubleshooting, please refer to the [Operator Deployment Guide](../../docs/guides/dynamo_deploy/operator_deployment.md).
## Expected Output
When you send the request with "test" as input, the response will show how the text flows through each service:
```
Frontend: Middle: Backend: test-mid-back
```
This demonstrates how:
1. The Frontend receives "test"
2. The Middle service adds "-mid" to create "test-mid"
3. The Backend service adds "-back" to create "test-mid-back"
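The flow can be mimicked with plain Python generators. This is not the Dynamo SDK and involves no runtime or HTTP layer; the three functions are hypothetical stand-ins that only reproduce how each stage transforms the text and prefixes the response on the way back.

```python
def backend(text):
    # Final stage: append "-back" and emit the result.
    yield f"Backend: {text}-back"


def middle(text):
    # Append "-mid", forward to the backend, and prefix each response.
    for response in backend(f"{text}-mid"):
        yield f"Middle: {response}"


def frontend(text):
    # Entry point: forward to the middle stage and prefix each response.
    for response in middle(text):
        yield f"Frontend: {response}"


result = "".join(frontend("test"))
print(result)  # Frontend: Middle: Backend: test-mid-back
```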
# Multinode Examples
Table of Contents
- [Single node sized models](#single-node-sized-models)
- [Multi-node sized models](#multi-node-sized-models)
## Single node sized models
You can deploy Dynamo on multiple nodes via NATS/ETCD-based discovery and communication. Here's an example of deploying disaggregated serving on 3 nodes using `nvidia/Llama-3.1-405B-Instruct-FP8`. Each node must be properly configured with InfiniBand and/or RoCE for communication between decode and prefill workers.
##### Disaggregated Deployment with KV Routing
- Node 1: Frontend, Processor, Router, Decode Worker
- Node 2: Prefill Worker
- Node 3: Prefill Worker
Note that this can easily be extended to more nodes. You can also run the Frontend, Processor, and Router on a separate CPU-only node, as long as all nodes have access to the NATS/ETCD endpoints.
**Step 1**: Start NATS/ETCD on your head node. Ensure you have the correct firewall rules to allow communication between the nodes as you will need the NATS/ETCD endpoints to be accessible by all other nodes.
```bash
# node 1
docker compose -f deploy/metrics/docker-compose.yml up -d
```
**Step 2**: Create the inference graph for this node. Here we will use the `agg_router.py` (even though we are doing disaggregated serving) graph because we want the `Frontend`, `Processor`, `Router`, and `VllmWorker` to spin up (we will spin up the other decode worker and prefill worker separately on different nodes later).
```python
# graphs/agg_router.py
Frontend.link(Processor).link(Router).link(VllmWorker)
```
**Step 3**: Create a configuration file for this node. We've provided a sample one for you in `configs/multinode-405b.yaml` for the 405B model. Note that we still include the `PrefillWorker` component in the configuration file even though we are not using it on node 1. This is because we can reuse the same configuration file on all nodes and just spin up individual workers on the other ones.
**Step 4**: Start the frontend, processor, router, and VllmWorker on node 1.
```bash
# node 1
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg_router:Frontend -f ./configs/multinode-405b.yaml
```
**Step 5**: Start the first prefill worker on node 2.
Since we only want to start the `PrefillWorker` on node 2, run the `PrefillWorker` component directly with the configuration file from before.
```bash
# node 2
export NATS_SERVER='<your-nats-server-address>' # note this should start with nats://...
export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
cd $DYNAMO_HOME/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f ./configs/multinode-405b.yaml
```
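A small pre-flight check can catch a missing or malformed endpoint before a worker is launched. The sketch below only encodes what this guide states (both variables must be set, and `NATS_SERVER` should start with `nats://`); the function name and messages are hypothetical, and the addresses in the usage are placeholders.

```python
import os


def validate_endpoints(env=None) -> list:
    """Return a list of problems found in the discovery-related variables."""
    env = os.environ if env is None else env
    problems = []

    nats = env.get("NATS_SERVER", "")
    if not nats:
        problems.append("NATS_SERVER is not set")
    elif not nats.startswith("nats://"):
        problems.append("NATS_SERVER should start with nats://")

    if not env.get("ETCD_ENDPOINTS", ""):
        problems.append("ETCD_ENDPOINTS is not set")

    return problems


# Placeholder addresses, for illustration only.
good = {"NATS_SERVER": "nats://10.0.0.1:4222", "ETCD_ENDPOINTS": "10.0.0.1:2379"}
bad = {"NATS_SERVER": "10.0.0.1:4222"}

print(validate_endpoints(good))
print(validate_endpoints(bad))
```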
**Step 6**: Start the second prefill worker on node 3.
```bash
# node 3
export NATS_SERVER='<your-nats-server-address>' # note this should start with nats://...
export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
cd $DYNAMO_HOME/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f ./configs/multinode-405b.yaml
```
**Step 7**: [Optional] Start more decode workers on other nodes
This example can be extended to more nodes as well. For example, if you'd like to spin up another decode worker, you can use
```bash
# node X
export NATS_SERVER='<your-nats-server-address>' # note this should start with nats://...
export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
cd $DYNAMO_HOME/examples/llm
dynamo serve components.worker:VllmWorker -f ./configs/multinode-405b.yaml --service-name VllmWorker
```
Note the use of `--service-name`. This will only spin up the worker that you are requesting and ignore any `depends` statements.
### Client
In another terminal:
```bash
# this test request has around 200 tokens isl
curl <node1-ip>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "nvidia/Llama-3.1-405B-Instruct-FP8",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": true,
"max_tokens": 300
}'
```
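Because the request sets `"stream": true`, the response arrives as server-sent events. The sketch below reassembles the generated text from such a stream. It assumes OpenAI-style chunks (`data: {json}` lines, a `choices[].delta.content` field, and a final `data: [DONE]` marker), which is an assumption about the response format rather than something this guide specifies; the sample lines are made up.

```python
import json


def iter_sse_data(lines):
    """Yield the decoded JSON payload of each 'data:' line until [DONE]."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank lines, comments, and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        yield json.loads(payload)


def collect_text(lines) -> str:
    """Concatenate the content deltas from all chunks."""
    parts = []
    for chunk in iter_sse_data(lines):
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                parts.append(content)
    return "".join(parts)


# Made-up stream, used only to exercise the parser.
sample_stream = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
text = collect_text(sample_stream)
```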
## Multi-node sized models
Multinode model support is coming soon. You can track progress [here](https://github.com/ai-dynamo/dynamo/issues/513)!
##### Aggregated Deployment
The steps for aggregated deployment of multi-node sized models are similar to
those for single-node sized models, except that you first need to configure the nodes
to be interconnected according to the framework's multi-node deployment guide.
In the example below, vLLM is used as the framework to serve the `DeepSeek-R1` model
using tensor parallelism of 16 on two H100x8 nodes.
**Step 1**: On each node, set up a Ray cluster so that vLLM can access the resources
collectively:
```bash
# head node
ray start --head --port=6379
# example output and keep note of the IP address of the head node
# Local node IP: <head-node-address>
# set vLLM env arg
export VLLM_HOST_IP=<head-node-address>
# other node
ray start --address=<head-node-address>:6379
export VLLM_HOST_IP=<current-node-address>
# verify the accessibility by checking aggregated GPU count shown in ray status
ray status
# Expected/Sample output for 2 nodes:
# ======== Autoscaler status: 2025-04-16 15:35:42.751688 ========
# Node status
# ---------------------------------------------------------------
# Active:
# 1 node_<hash_1>
# 1 node_<hash_2>
# Pending:
# (no pending nodes)
# Recent failures:
# (no failures)
# Resources
# ---------------------------------------------------------------
# Usage:
# XXX CPU
# XXX GPU
# XXX memory
# XXX object_store_memory
# Demands:
# (no resource demands)
```
**Step 2**: On the head node, follow the [LLM Deployment Guide](./README.md#getting-started) to
set up the Dynamo deployment for aggregated serving, using the configuration file
`configs/multinode_agg_r1.yaml` for DeepSeek-R1:
```bash
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg:Frontend -f ./configs/multinode_agg_r1.yaml
```
### Client
In another terminal, you can send the same curl request as described above, but
with `"model": "deepseek-ai/DeepSeek-R1"`:
```bash
# this test request has around 200 tokens isl
curl <node1-ip>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": true,
"max_tokens": 300
}'
```
##### Disaggregated Deployment
In this example, we deploy two replicas of the model (one prefill worker
and one decode worker). We use 4 H100x8 nodes and group every two of them
into one Ray cluster, in the same way as described in the aggregated deployment.
However, we run the etcd and NATS servers on only
one node, which we treat as the head node of the whole deployment.
Note that if you start the etcd server directly instead of using `docker compose`,
you must pass additional arguments so that it is reachable from the other nodes:
```bash
etcd --advertise-client-urls http://<head-node-ip>:2379 --listen-client-urls http://<head-node-ip>:2379,http://127.0.0.1:2379
```
**Step 1**: On every two nodes, set up a Ray cluster as described in
[aggregated deployment](#aggregated-deployment). After that, you should have
two independent Ray clusters, each with access to 16 GPUs.
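The grouping arithmetic is simple but easy to get wrong when the node count changes. As a quick sanity check, the sketch below computes how many Ray clusters result and how many GPUs each one sees; the function name and the returned fields are hypothetical.

```python
def plan_ray_clusters(total_nodes: int, nodes_per_cluster: int, gpus_per_node: int) -> list:
    """Group nodes into equally sized Ray clusters and report GPUs per cluster."""
    if total_nodes % nodes_per_cluster != 0:
        raise ValueError("total_nodes must be a multiple of nodes_per_cluster")
    return [
        {"cluster": i, "nodes": nodes_per_cluster, "gpus": nodes_per_cluster * gpus_per_node}
        for i in range(total_nodes // nodes_per_cluster)
    ]


# 4 H100x8 nodes, grouped two per Ray cluster, as in this example.
plan = plan_ray_clusters(total_nodes=4, nodes_per_cluster=2, gpus_per_node=8)
for cluster in plan:
    print(cluster)
```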
**Step 2**: Start the deployment by running different flavors of `dynamo serve`
on one node of each Ray cluster, using the configuration file
`configs/mutinode_disagg_r1.yaml`.
For decode, use the command below. This node is the entry point of
the whole deployment; in other words, requests should be sent to the IP address
of this node.
```bash
# if not head node
export NATS_SERVER='nats://<nats-server-ip>:4222'
export ETCD_ENDPOINTS='<etcd-endpoints-ip>:2379'
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg:Frontend -f ./configs/mutinode_disagg_r1.yaml
```
For prefill:
```bash
# if not head node
export NATS_SERVER='nats://<nats-server-ip>:4222'
export ETCD_ENDPOINTS='<etcd-endpoints-ip>:2379'
cd $DYNAMO_HOME/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f ./configs/mutinode_disagg_r1.yaml
```
### Client
In another terminal, you can send the same curl request as described in
[aggregated deployment](#aggregated-deployment), addressed to the IP address of
the decode node.
......@@ -46,13 +46,13 @@ uv pip install maturin
maturin develop --uv
```
# Run Examples
## Run Examples
## Pre-requisite
### Prerequisite
See [README.md](../../runtime/README.md#️-prerequisites).
See [README.md](../runtime/README.md#prerequisites).
## Hello World Example
### Hello World Example
1. Start 3 separate shells, and activate the virtual environment in each
```
......@@ -80,7 +80,7 @@ distributed across the server instances in each server's output. If only one
server instance is started, you should see the requests go to that server
each time.
# Performance
## Performance
The performance impacts of synchronizing the Python and Rust async runtimes
is a critical consideration when optimizing the performance of a highly
......
# Dynamo Runtime
<h4>A Datacenter Scale Distributed Inference Serving Framework</h4>
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
Rust implementation of the Dynamo runtime system, enabling distributed computing capabilities for machine learning workloads.
## 🛠️ Prerequisites
### Install Rust and Cargo using [rustup](https://rustup.rs/):
```bash
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
```
### Build
```
cargo build
cargo test
```
### Start Dependencies
#### Docker Compose
The simplest way to deploy the pre-requisite services is using
[docker-compose](https://docs.docker.com/compose/install/linux/),
defined in [deploy/metrics/docker-compose.yml](../../deploy/metrics/docker-compose.yml).
```
docker-compose up -d
```
This will deploy a [NATS.io](https://nats.io/) server and an [etcd](https://etcd.io/)
server used to communicate between and discover components at runtime.
#### Local (alternate)
To deploy the pre-requisite services locally instead of using `docker-compose`
above, you can manually launch each:
- [NATS.io](https://docs.nats.io/running-a-nats-service/introduction/installation) server with [Jetstream](https://docs.nats.io/nats-concepts/jetstream)
- example: `nats-server -js --trace`
- [etcd](https://etcd.io) server
- follow instructions in [etcd installation](https://etcd.io/docs/v3.5/install/) to start an `etcd-server` locally
### Run Examples
When developing or running examples, any process or user that shares your core services (`etcd` and `nats.io`) will
be operating within your distributed runtime.
The current examples use a hard-coded `namespace`. We will address the `namespace` collisions later.
All examples require the `etcd` and `nats.io` pre-requisites to be running and available.
#### Rust `hello_world`
With two terminals open, in one window:
```
cd examples/hello_world
cargo run --bin server
```
In the second terminal, execute:
```
cd examples/hello_world
cargo run --bin client
```
which should yield some output similar to:
```
Finished `dev` profile [unoptimized + debuginfo] target(s) in 6.25s
Running `target/debug/client`
Annotated { data: Some("h"), id: None, event: None, comment: None }
Annotated { data: Some("e"), id: None, event: None, comment: None }
Annotated { data: Some("l"), id: None, event: None, comment: None }
Annotated { data: Some("l"), id: None, event: None, comment: None }
Annotated { data: Some("o"), id: None, event: None, comment: None }
Annotated { data: Some(" "), id: None, event: None, comment: None }
Annotated { data: Some("w"), id: None, event: None, comment: None }
Annotated { data: Some("o"), id: None, event: None, comment: None }
Annotated { data: Some("r"), id: None, event: None, comment: None }
Annotated { data: Some("l"), id: None, event: None, comment: None }
Annotated { data: Some("d"), id: None, event: None, comment: None }
```
#### Python
See the [README.md](../bindings/python/README.md) for details.
The Python and Rust `hello_world` client and server examples are interchangeable,
so you can start the Python `server.py` and talk to it from the Rust `client`.