Unverified Commit 8d636ebd authored by Suman Tatiraju Committed by GitHub
parent 6d46288c
...@@ -15,18 +15,9 @@ See the License for the specific language governing permissions and ...@@ -15,18 +15,9 @@ See the License for the specific language governing permissions and
limitations under the License. limitations under the License.
--> -->
# Using `dynamo serve` to deploy inference graphs locally # Serving Inference Graphs (`dynamo serve`)
This guide explains how to create, configure, and deploy inference graphs for large language models using the `dynamo serve` command. This guide explains how to create, configure, and deploy inference graphs locally for large language models using the `dynamo serve` command.
## Table of Contents
- [What are inference graphs?](#what-are-inference-graphs)
- [Creating an inference graph](#creating-an-inference-graph)
- [Serving the inference graph](#deploying-the-inference-graph)
- [Guided Example](#guided-example)
## What are inference graphs?
Inference graphs are compositions of service components that work together to handle LLM inference. A typical graph might include: Inference graphs are compositions of service components that work together to handle LLM inference. A typical graph might include:
...@@ -37,9 +28,13 @@ Inference graphs are compositions of service components that work together to ha ...@@ -37,9 +28,13 @@ Inference graphs are compositions of service components that work together to ha
## Creating an inference graph ## Creating an inference graph
Once you've written your various Dynamo services (docs on how to write these can be found [here](../../deploy/sdk/docs/sdk/README.md)), you can create an inference graph by composing these services together using the following two mechanisms: Once you've written Dynamo services ([see the SDK](https://github.com/ai-dynamo/dynamo/blob/main/deploy/dynamo/sdk/docs/sdk/README.md)), create an inference graph by composing them together using the following mechanisms:
1. Dependencies with `depends()`
2. Dynamic composition with `.link()`
See the following sections for more details.
### 1. Dependencies with `depends()` ### Dependencies with `depends()`
```python ```python
from components.worker import VllmWorker from components.worker import VllmWorker
...@@ -58,7 +53,7 @@ Benefits of `depends()`: ...@@ -58,7 +53,7 @@ Benefits of `depends()`:
- Creates type-safe client connections between services - Creates type-safe client connections between services
- Allows calling dependent service methods directly - Allows calling dependent service methods directly
### 2. Dynamic composition with `.link()` ### Dynamic composition with `.link()`
```python ```python
# From examples/llm/graphs/agg.py # From examples/llm/graphs/agg.py
...@@ -82,15 +77,24 @@ The `.link()` method is useful for: ...@@ -82,15 +77,24 @@ The `.link()` method is useful for:
## Deploying the inference graph ## Deploying the inference graph
Once you've defined your inference graph and its configuration, you can deploy it locally using the `dynamo serve` command! We recommend running with the `--dry-run` flag so you can see what arguments will be passed into your final graph. Once you've defined your inference graph and its configuration, deploy it locally using the `dynamo serve` command. We recommend running with the `--dry-run` flag to see what arguments will be passed into your final graph.
Let's walk through an example. Consider the following example.
## Guided Example ### Guided Example
The files referenced here can be found [here](../../examples/llm/components/). You will need 1 GPU minimum to run this example. This example can be run from the `examples/llm` directory The files referenced in this example can be found [here](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/components). You need 1 GPU minimum to run this example. This example can be run from the `examples/llm` directory.
### 1. Define your components This example walks through:
1. [Defining your components](#define-your-components)
2. [Defining your graph](#define-your-graph)
3. [Defining your configuration](#define-your-configuration)
4. [Serving your graph](#serve-your-graph)
See the following sections for details.
#### Define your components
In this example we'll be deploying an aggregated serving graph. Our components include: In this example we'll be deploying an aggregated serving graph. Our components include:
...@@ -125,9 +129,9 @@ class VllmWorker: ...@@ -125,9 +129,9 @@ class VllmWorker:
... ...
``` ```
Note that our prebuilt components have the maximal set of dependencies needed to run the component. This allows you to plug in different components to the same graph to create different architectures. When you write your own components, you can be as flexible as you'd like. Note that our prebuilt components have the maximal set of dependencies needed to run the component, which allows you to plug different components into the same graph to create different architectures. When writing your own components, you can be as flexible as you like.
### 2. Define your graph #### Define your graph
```python ```python
# graphs/agg.py # graphs/agg.py
...@@ -138,18 +142,17 @@ from components.worker import VllmWorker ...@@ -138,18 +142,17 @@ from components.worker import VllmWorker
Frontend.link(Processor).link(VllmWorker) Frontend.link(Processor).link(VllmWorker)
``` ```
### 3. Define your configuration #### Define your configuration
We've provided a set of basic configurations for this example [here](../../examples/llm/configs/agg.yaml). All of these can be changed and also be overridden by passing in CLI flags to serve! We provide [basic configurations](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/configs/agg.yaml) that you can change; you can also override them by passing in CLI flags to `dynamo serve`.
### 4. Serve your graph #### Serve your graph
As a prerequisite, ensure you have NATS and etcd running by running the docker compose in the deploy directory. You can find it [here](../../deploy/metrics/docker-compose.yml). Before serving your graph, ensure that NATS and etcd are running by using the [docker compose file](https://github.com/ai-dynamo/dynamo/blob/main/deploy/metrics/docker-compose.yml) in the deploy directory.
```bash ```bash
docker compose up -d docker compose up -d
``` ```
Note that we point toward the first node in our graph. In this case, it's the `Frontend` service. Note that we point toward the first node in our graph. In this case, it's the `Frontend` service.
```bash ```bash
...@@ -157,7 +160,7 @@ Note that the we point toward the first node in our graph. In this case, it's th ...@@ -157,7 +160,7 @@ Note that the we point toward the first node in our graph. In this case, it's th
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --dry-run dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --dry-run
``` ```
This will print out something like This returns output like:
```bash ```bash
Service Configuration: Service Configuration:
...@@ -200,7 +203,7 @@ You can override any of these configuration options by passing in CLI flags to s ...@@ -200,7 +203,7 @@ You can override any of these configuration options by passing in CLI flags to s
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --Processor.router=random --dry-run dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml --Processor.router=random --dry-run
``` ```
Which will print out something like This prints output like:
```bash ```bash
#... #...
...@@ -237,8 +240,8 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" ...@@ -237,8 +240,8 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json"
## Close deployment ## Close deployment
> [!IMPORTANT] ```{important}
> We are aware of an issue where vLLM subprocesses might not be killed when `ctrl-c` is pressed. We are aware of an issue where vLLM subprocesses might not be killed when `ctrl-c` is pressed.
> We are working on addressing this. Relevant vLLM issues can be found [here](https://github.com/vllm-project/vllm/pull/8492) and [here](https://github.com/vllm-project/vllm/issues/6219#issuecomment-2439257824). We are working on addressing this. Relevant vLLM issues can be found [here](https://github.com/vllm-project/vllm/pull/8492) and [here](https://github.com/vllm-project/vllm/issues/6219#issuecomment-2439257824).
```
To stop the serve, you can press `ctrl-c` which will kill the different components. In order to kill the remaining vLLM subprocesses you can run `nvidia-smi` and `kill -9` the remaining processes or run `pkill python3` from inside of the container. To stop `dynamo serve`, press `ctrl-c`, which kills the components. To kill any remaining vLLM subprocesses, run `nvidia-smi` and `kill -9` the remaining processes, or run `pkill python3` from inside the container.
...@@ -17,7 +17,7 @@ limitations under the License. ...@@ -17,7 +17,7 @@ limitations under the License.
# Planner # Planner
The planner is a component that monitors the state of the system and makes adjustments to workers to ensure that the system is running efficiently. Currently, planner can scale up and down the number of vllm workers based on the kv cache load and prefill queue size: The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently. Currently, the planner can scale the number of vllm workers up and down based on the kv cache load and prefill queue size:
* Backend: * Backend:
* local ✅ * local ✅
* kubernetes ✅ * kubernetes ✅
...@@ -40,12 +40,12 @@ To adjust the number of prefill/decode workers, planner monitors the following m ...@@ -40,12 +40,12 @@ To adjust the number of prefill/decode workers, planner monitors the following m
* Prefill worker: planner monitors the number of requests pending in the prefill queue to estimate the prefill workload. * Prefill worker: planner monitors the number of requests pending in the prefill queue to estimate the prefill workload.
* Decode/aggregated worker: planner monitors the average KV cache utilization rate to estimate the decode/aggregated workload. * Decode/aggregated worker: planner monitors the average KV cache utilization rate to estimate the decode/aggregated workload.
Every `metric-pulling-interval`, planner will gather the aforementioned metrics. Every `adjustment-interval`, planner compares the aggregated metrics in this interval with pre-set thresholds and decides whether to scale up/down prefill/decode workers. To avoid over-compensation, planner only changes the number of workers by 1 in one adjustment interval. In addition, when the number of workers is being adjusted, the planner will block the metric pulling and adjustment. Every `metric-pulling-interval`, planner gathers the aforementioned metrics. Every `adjustment-interval`, planner compares the aggregated metrics in this interval with pre-set thresholds and decides whether to scale up/down prefill/decode workers. To avoid over-compensation, planner only changes the number of workers by 1 in one adjustment interval. In addition, when the number of workers is being adjusted, the planner blocks the metric pulling and adjustment.
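The adjustment cycle described above can be sketched roughly as follows. This is a minimal illustration, not planner's actual code: the function name `decide_worker_delta`, its threshold parameters, and the use of a plain average to aggregate the samples are all assumptions made for the example.

```python
from statistics import mean


def decide_worker_delta(samples, scale_up_threshold, scale_down_threshold):
    """Return +1, -1, or 0 for one adjustment interval.

    `samples` holds the metric values gathered at each metric-pulling
    interval (for example, KV cache utilization or prefill queue size).
    The names and the averaging step are illustrative assumptions.
    """
    if not samples:
        return 0  # nothing observed in this interval, so make no change
    aggregated = mean(samples)
    if aggregated > scale_up_threshold:
        return 1  # add at most one worker per adjustment interval
    if aggregated < scale_down_threshold:
        return -1  # remove at most one worker per adjustment interval
    return 0
```

Returning only `+1`, `-1`, or `0` mirrors the rule that the worker count changes by at most 1 per adjustment interval.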
To scale up a prefill/decode worker, planner just needs to launch the worker in the correct namespace. The auto-discovery mechanism will pick up the workers and add them to the routers. To scale down a prefill worker, planner sends a SIGTERM signal to the prefill worker. The prefill worker stores the signal and exits when it finishes the current request pulled from the prefill queue. This ensures that no remote prefill request is dropped. To scale down a decode worker, currently, planner revokes the etcd lease of the decode worker. When the etcd lease is revoked, the corresponding decode worker will be immediately removed from the router and will not get any new requests. The decode worker will then finish all the current requests in their original stream and exit gracefully. To scale up a prefill/decode worker, planner just needs to launch the worker in the correct namespace. The auto-discovery mechanism picks up the workers and adds them to the routers. To scale down a prefill worker, planner sends a SIGTERM signal to the prefill worker. The prefill worker stores the signal and exits when it finishes the current request pulled from the prefill queue. This ensures that no remote prefill request is dropped. To scale down a decode worker, planner revokes the etcd lease of the decode worker. When the etcd lease is revoked, the corresponding decode worker is immediately removed from the router and won't get any new requests. The decode worker then finishes all the current requests in their original stream and exits gracefully.
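The "store the signal and exit after the current request" behavior of a prefill worker can be illustrated with a small sketch. The class `GracefulWorker`, its methods, and the list-based queue are hypothetical stand-ins, not the real worker implementation.

```python
import signal


class GracefulWorker:
    """Toy worker that defers shutdown until the in-flight request is done."""

    def __init__(self):
        self.shutdown_requested = False

    def install_handler(self):
        # Record SIGTERM instead of exiting immediately (main thread only).
        signal.signal(signal.SIGTERM, self._on_sigterm)

    def _on_sigterm(self, signum, frame):
        self.shutdown_requested = True

    def handle(self, request):
        # Placeholder for the real prefill work.
        return f"done:{request}"

    def run(self, queue):
        """Process requests from `queue` (a plain list here) until asked to stop."""
        handled = []
        while queue and not self.shutdown_requested:
            request = queue.pop(0)
            # The request already pulled from the queue always completes,
            # even if SIGTERM arrives while it is being handled.
            handled.append(self.handle(request))
        return handled
```

Because the flag is only checked between requests, a request that was already pulled from the queue is never dropped.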
There are two additional rules set by planner to prevent over-compensation: There are two additional rules set by planner to prevent over-compensation:
1. After a new decode worker is added, since it needs time to populate the kv cache, planner will not scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals. 1. After a new decode worker is added, since it needs time to populate the kv cache, planner doesn't scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals.
1. We do not scale up prefill workers if the prefill queue size is estimated to drop below the `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals following the trend observed in the current adjustment interval. 1. We do not scale up prefill workers if the prefill queue size is estimated to drop below the `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals following the trend observed in the current adjustment interval.
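The second rule can be sketched as a simple trend check. The function name `should_scale_up_prefill`, its arguments, and the choice of linear extrapolation are assumptions made for illustration; the text above does not specify how planner estimates the trend.

```python
NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD = 3  # value quoted in the rule above


def should_scale_up_prefill(queue_start, queue_end, scale_up_threshold,
                            buffer_period=NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD):
    """Decide whether adding a prefill worker is worthwhile.

    `queue_start` and `queue_end` are the queue sizes at the start and end
    of the current adjustment interval (hypothetical inputs).
    """
    if queue_end <= scale_up_threshold:
        return False  # queue is not above the scale-up threshold
    # Assumed linear trend: change in queue size per adjustment interval.
    trend = queue_end - queue_start
    predicted = queue_end + trend * buffer_period
    # Skip scaling up if the queue is expected to drain below the threshold.
    return predicted >= scale_up_threshold
```

For example, a queue that shrank from 20 to 10 with a threshold of 5 is projected to fall well below the threshold within three intervals, so no worker is added.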
## Comply with SLA ## Comply with SLA
...@@ -144,10 +144,11 @@ We currently support two backends: ...@@ -144,10 +144,11 @@ We currently support two backends:
### Local Backend ### Local Backend
Circus is a Python program which can be used to monitor and control processes and sockets. Dynamo serve uses circus to start each node in a graph and monitor each subprocess. We leverage a core feature to do this called `Watcher`. A `Watcher` is the target program that you would like to run (which in our case is `serve_dynamo.py`). When planner decides to scale up or down, it will either add or remove a watcher from the existing `circus`. Circus is a Python program that can be used to monitor and control processes and sockets. Dynamo serve uses circus to start each node in a graph and monitor each subprocess. We leverage a core feature to do this called `Watcher`. A `Watcher` is the target program that you would like to run (which in our case is `serve_dynamo.py`). When planner decides to scale up or down, it either adds or removes a watcher from the existing `circus`.
> [!NOTE] ``` {note}
> Although circus allows you to `increment` an existing watcher, it was not designed to allow variables to be passed in, which prevents us from scheduling on a GPU. So instead we start a new watcher per process. When planner decides to add or remove a worker, we have logic to handle this adding/removing and incrementing/decrementing the workers. Although circus allows you to `increment` an existing watcher, it was not designed to allow variables to be passed in, which prevents us from scheduling on a GPU. So instead we start a new watcher per process. When planner decides to add or remove a worker, we have logic to handle this adding/removing and incrementing/decrementing the workers.
```
#### Statefile #### Statefile
...@@ -155,7 +156,7 @@ The statefile is a json file created when initially running `dynamo serve` and i ...@@ -155,7 +156,7 @@ The statefile is a json file created when initially running `dynamo serve` and i
When one Decode worker is spun up, the statefile looks like: When one Decode worker is spun up, the statefile looks like:
```json ```none
{ {
"dynamo_VllmWorker": {..., resources={...}}, "dynamo_VllmWorker": {..., resources={...}},
} }
...@@ -163,7 +164,7 @@ When one Decode worker is spun up, the statefile looks like: ...@@ -163,7 +164,7 @@ When one Decode worker is spun up, the statefile looks like:
Now another decode worker is added: Now another decode worker is added:
```json ```none
{ {
"dynamo_VllmWorker": {..., resources={...}}, "dynamo_VllmWorker": {..., resources={...}},
"dynamo_VllmWorker_1": {..., resources={...}}, "dynamo_VllmWorker_1": {..., resources={...}},
...@@ -172,7 +173,7 @@ Now another decode worker is added: ...@@ -172,7 +173,7 @@ Now another decode worker is added:
Then one decode worker is removed: Then one decode worker is removed:
```json ```none
{ {
"dynamo_VllmWorker": {..., resources={...}}, "dynamo_VllmWorker": {..., resources={...}},
} }
...@@ -180,16 +181,17 @@ Then one decode worker is removed: ...@@ -180,16 +181,17 @@ Then one decode worker is removed:
If the last decode worker is removed, the statefile looks like: If the last decode worker is removed, the statefile looks like:
```json ```none
{ {
"dynamo_VllmWorker": {...}, "dynamo_VllmWorker": {...},
} }
``` ```
Note that we keep the initial non-suffix entry in order to know what cmd we will need to spin up another worker. This is the same for prefill workers as well. We keep the initial non-suffix entry in order to know what cmd we'll need to spin up another worker. This is the same for prefill workers as well.
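The statefile transitions shown above can be mimicked with a small in-memory sketch. The helper names `add_worker` and `remove_worker` and the `cmd` field are hypothetical; only the key naming (`dynamo_VllmWorker`, `dynamo_VllmWorker_1`) and the `resources` entry come from the examples above.

```python
def add_worker(state, base, resources):
    """Add a worker and return the statefile key it was stored under."""
    entry = state[base]
    if "resources" not in entry:
        # No worker is running: reuse the retained base entry.
        entry["resources"] = resources
        return base
    index = 1
    while f"{base}_{index}" in state:
        index += 1
    name = f"{base}_{index}"
    # Copy the launch command from the base entry (assumed field name).
    state[name] = {"cmd": entry["cmd"], "resources": resources}
    return name


def remove_worker(state, base):
    """Remove the most recently added worker and return its key."""
    prefix = base + "_"
    suffixed = [k for k in state
                if k.startswith(prefix) and k[len(prefix):].isdigit()]
    if suffixed:
        last = max(suffixed, key=lambda k: int(k[len(prefix):]))
        del state[last]
        return last
    # Last worker: keep the base entry so its command is still known.
    state[base].pop("resources", None)
    return base
```

Keeping the base key after the last removal is what lets a later scale-up reuse the stored command.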
> [!NOTE] ``` {note}
> At the moment, planner works best if your initial replicas per worker are 1. This is because if you specify replicas > 1 when you initially start `dynamo serve`, the current implementation in `serving.py` starts each process in the same watcher. At the moment, planner works best if your initial replicas per worker are 1. This is because if you specify replicas > 1 when you initially start `dynamo serve`, the current implementation in `serving.py` starts each process in the same watcher.
```
### Kubernetes Backend ### Kubernetes Backend
......
...@@ -35,7 +35,7 @@ python sin_synth.py \ ...@@ -35,7 +35,7 @@ python sin_synth.py \
--osl2 150 --osl2 150
``` ```
This will generate a [mooncake style trace](https://github.com/kvcache-ai/Mooncake) with This generates a [mooncake style trace](https://github.com/kvcache-ai/Mooncake) with
* duration = 600 seconds * duration = 600 seconds
* isl/osl = 3000/150 * isl/osl = 3000/150
* request rate varies sinusoidally from 0.75 to 3 requests with a period of 150 seconds * request rate varies sinusoidally from 0.75 to 3 requests with a period of 150 seconds
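As a rough illustration of the request-rate profile listed above, the sketch below evaluates a sinusoid between the stated minimum and maximum. It is not taken from `sin_synth.py`; the function name `request_rate` and the phase (starting at the midpoint) are assumptions.

```python
import math

RATE_MIN = 0.75   # lowest request rate, from the list above
RATE_MAX = 3.0    # highest request rate, from the list above
PERIOD_S = 150.0  # period of the sinusoid in seconds


def request_rate(t_seconds):
    """Request rate at time t, assuming the wave starts at its midpoint."""
    midpoint = (RATE_MAX + RATE_MIN) / 2
    amplitude = (RATE_MAX - RATE_MIN) / 2
    return midpoint + amplitude * math.sin(2 * math.pi * t_seconds / PERIOD_S)
```

Over the 600-second trace this profile completes four full cycles.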
...@@ -76,7 +76,7 @@ and open `http://localhost:6006` in your browser. The following metrics are avai ...@@ -76,7 +76,7 @@ and open `http://localhost:6006` in your browser. The following metrics are avai
* `num_decode_workers`: the number of decode workers * `num_decode_workers`: the number of decode workers
* `num_gpu`: the total number of GPUs used * `num_gpu`: the total number of GPUs used
The benchmark results will be printed out in terminal 3 that runs the `genai-perf` command. The benchmark results are printed out in terminal 3 that runs the `genai-perf` command.
In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--no-operation` flag to watch and log the metrics without making any adjustments: In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--no-operation` flag to watch and log the metrics without making any adjustments:
...@@ -92,7 +92,7 @@ genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deeps ...@@ -92,7 +92,7 @@ genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deeps
The two figures below show the performance comparison between planner and the baseline 2p2d deployment. Planner achieves 1.5x speedup while using 7.4% less GPU resources. The two figures below show the performance comparison between planner and the baseline 2p2d deployment. Planner achieves 1.5x speedup while using 7.4% less GPU resources.
![Planner Performance Comparison](./images/planner_perf.png) ![Two bar charts comparing 2P2D and Planner. Planner shows lower GPU usage and lower average sequence latency.](../../images/planner_perf.png)
![Planner Tensorboard](./images/planner_tensorboard.png) ![Planner Tensorboard; four line graphs comparing two runs: 2p2d_rr5-20_2 and planner_rr5-20.](../../images/planner_tensorboard.png)
..
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
.. This hidden toctree includes readmes etc that aren't meant to be in the main table of contents but should be accounted for in the sphinx project structure
.. toctree::
   :maxdepth: 2
   :hidden:

   guides/README.md
   runtime/README.md
   examples/disagg_skeleton.md
\ No newline at end of file
..
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
Welcome to NVIDIA Dynamo
========================
NVIDIA Dynamo is a high-throughput low-latency inference framework designed for serving generative AI and reasoning models in multi-node distributed environments.
Dive in: Examples
-----------------------
.. grid:: 1 2 2 2
    :gutter: 3
    :margin: 0
    :padding: 3 4 0 0

    .. grid-item-card:: :doc:`Hello World </examples/hello_world>`
        :link: /examples/hello_world
        :link-type: doc

        Demonstrates the basic concepts of Dynamo by creating a simple multi-service pipeline.

    .. grid-item-card:: :doc:`LLM Deployment </examples/llm_deployment>`
        :link: /examples/llm_deployment
        :link-type: doc

        Presents examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.

    .. grid-item-card:: :doc:`Multinode </examples/multinode>`
        :link: /examples/multinode
        :link-type: doc

        Demonstrates deployment for disaggregated serving on 3 nodes using ``nvidia/Llama-3.1-405B-Instruct-FP8``.

    .. grid-item-card:: :doc:`TensorRT-LLM </examples/trtllm>`
        :link: /examples/trtllm
        :link-type: doc

        Presents TensorRT-LLM examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.

Overview
--------
The NVIDIA Dynamo Platform is a high-performance, low-latency inference platform
designed to serve all AI models—across any framework, architecture, or deployment scale.
Dynamo is inference engine agnostic, supporting TRT-LLM, vLLM, SGLang, and others, and captures
LLM-specific capabilities such as:
* **Disaggregated prefill & decode inference** - Maximizes GPU throughput and facilitates trade-offs between throughput and latency.
* **Dynamic GPU scheduling** - Optimizes performance based on fluctuating demand.
* **LLM-aware request routing** - Eliminates unnecessary KV cache re-computation.
* **Accelerated data transfer** - Reduces inference response time using NIXL.
* **KV cache offloading** - Leverages several memory hierarchies for higher system throughput.
Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source
and driven by a transparent, OSS (Open Source Software) first development approach.

.. toctree::
   :hidden:

   Welcome to Dynamo <self>
   Support Matrix <support_matrix.md>
   Getting Started <get_started.md>

.. toctree::
   :hidden:
   :caption: Architecture & Features

   High Level Architecture <architecture/architecture.md>
   Distributed Runtime <architecture/distributed_runtime.md>
   Disaggregated Serving <architecture/disagg_serving.md>
   KV Block Manager <architecture/kvbm_intro.rst>
   KV Cache Routing <architecture/kv_cache_routing.md>
   Planner <guides/planner.md>

.. toctree::
   :hidden:
   :caption: Dynamo Command Line Interface

   CLI Overview <guides/cli_overview.md>
   Running Dynamo (dynamo run) <guides/dynamo_run.md>
   Serving Inference Graphs (dynamo serve) <guides/dynamo_serve.md>
   Building Dynamo (dynamo build) <guides/dynamo_build.md>
   Deploying Inference Graphs (dynamo deploy) <guides/dynamo_deploy/README.md>

.. toctree::
   :hidden:
   :caption: Usage Guides

   Writing Python Workers in Dynamo <guides/backend.md>
   Disaggregation and Performance Tuning <guides/disagg_perf_tuning.md>
   KV Cache Router Performance Tuning <guides/kv_router_perf_tuning.md>
   Planner Benchmark Example <guides/planner_benchmark/benchmark_planner.md>

.. toctree::
   :hidden:
   :caption: Deployment Guides

   Dynamo Cloud Kubernetes Platform <guides/dynamo_deploy/dynamo_cloud.md>
   Deploying Dynamo Inference Graphs to Kubernetes using the Dynamo Cloud Platform <guides/dynamo_deploy/operator_deployment.md>
   Manual Helm Deployment <guides/dynamo_deploy/manual_helm_deployment.md>
   Minikube Setup Guide <guides/dynamo_deploy/minikube.md>

.. toctree::
   :hidden:
   :caption: API

   SDK Reference <API/sdk.md>
   Python API <API/python_bindings.md>

.. toctree::
   :hidden:
   :caption: Examples

   Hello World Example <examples/hello_world.md>
   LLM Deployment Examples <examples/llm_deployment.md>
   Multinode Examples <examples/multinode.md>
   LLM Deployment Examples using TensorRT-LLM <examples/trtllm.md>
../../docs/examples/hello_world.md
\ No newline at end of file