Unverified commit 5be23eb7, authored by Anish, committed by GitHub

Readmes + eks additions (#2157)

parent a8cb6554
...@@ -15,29 +15,10 @@ See the License for the specific language governing permissions and ...@@ -15,29 +15,10 @@ See the License for the specific language governing permissions and
limitations under the License. limitations under the License.
--> -->
# LLM Deployment Examples using TensorRT-LLM # LLM Deployment using TensorRT-LLM
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM. This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
# User Documentation
- [Deployment Architectures](#deployment-architectures)
- [Getting Started](#getting-started)
- [Prerequisites](#prerequisites)
- [Build docker](#build-docker)
- [Run container](#run-container)
- [Run deployment](#run-deployment)
- [Single Node deployment](#single-node-deployments)
- [Multinode deployment](#multinode-deployment)
- [Client](#client)
- [Benchmarking](#benchmarking)
- [Disaggregation Strategy](#disaggregation-strategy)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [More Example Architectures](#more-example-architectures)
- [Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)
# Quick Start
## Use the Latest Release ## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes: We recommend using the latest stable release of dynamo to avoid breaking changes:
...@@ -50,26 +31,52 @@ You can find the latest release [here](https://github.com/ai-dynamo/dynamo/relea ...@@ -50,26 +31,52 @@ You can find the latest release [here](https://github.com/ai-dynamo/dynamo/relea
git checkout $(git describe --tags $(git rev-list --tags --max-count=1)) git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
``` ```
## Deployment Architectures ---
## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#single-node-deployments)
- [Advanced Examples](#advanced-examples)
- [Disaggregation Strategy](#disaggregation-strategy)
- [KV Cache Transfer](#kv-cache-transfer-in-disaggregated-serving)
- [Client](#client)
- [Benchmarking](#benchmarking)
## Feature Support Matrix
### Core Dynamo Features
See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture. | Feature | TensorRT-LLM | Notes |
|---------|--------------|-------|
| [**Disaggregated Serving**](../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | Not supported yet |
| [**KV-Aware Routing**](../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../docs/architecture/sla_planner.md) | 🚧 | Planned |
| [**Load Based Planner**](../../docs/architecture/load_planner.md) | 🚧 | Planned |
| [**KVBM**](../../docs/architecture/kvbm_architecture.md) | 🚧 | Planned |
Note: TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can configure the deployment to always use either aggregate or disaggregated serving. ### Large Scale P/D and WideEP Features
## Getting Started | Feature | TensorRT-LLM | Notes |
|--------------------|--------------|-----------------------------------------------------------------------|
| **WideEP** | ✅ | |
| **DP Rank Routing**| ✅ | |
| **GB200 Support** | ✅ | |
1. Choose a deployment architecture based on your requirements ## Quick Start
2. Configure the components as needed
3. Deploy using the provided scripts
### Prerequisites Below we provide a guide that lets you run all of the common deployment patterns on a single node.
### Start NATS and ETCD in the background
Start using [Docker Compose](../../../deploy/docker-compose.yml)
Start required services (etcd and NATS) using [Docker Compose](../../../deploy/docker-compose.yml)
```bash ```bash
docker compose -f deploy/docker-compose.yml up -d docker compose -f deploy/docker-compose.yml up -d
``` ```
### Build docker ### Build container
```bash ```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance. # TensorRT-LLM uses git-lfs, which needs to be installed in advance.
...@@ -89,17 +96,18 @@ apt-get update && apt-get -y install git git-lfs ...@@ -89,17 +96,18 @@ apt-get update && apt-get -y install git git-lfs
### Run container ### Run container
``` ```bash
./container/run.sh --framework tensorrtllm -it ./container/run.sh --framework tensorrtllm -it
``` ```
## Run Deployment
This figure shows an overview of the major components to deploy: ## Single Node Examples
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each script runs `python3 -m dynamo.frontend <args>` to start the ingress and `python3 -m dynamo.trtllm <args>` to start the workers. You can also run each command in a separate terminal.
This figure shows an overview of the major components to deploy:
``` ```
+------+ +-----------+ +------------------+ +---------------+ +------+ +-----------+ +------------------+ +---------------+
| HTTP |----->| processor |----->| Worker1 |------------>| Worker2 | | HTTP |----->| processor |----->| Worker1 |------------>| Worker2 |
| |<-----| |<-----| |<------------| | | |<-----| |<-----| |<------------| |
...@@ -111,29 +119,23 @@ This figure shows an overview of the major components to deploy: ...@@ -111,29 +119,23 @@ This figure shows an overview of the major components to deploy:
| +---------| kv-router | | +---------| kv-router |
+------------->| | +------------->| |
+------------------+ +------------------+
``` ```
**Note:** The diagram above shows all possible components in a deployment. Depending on the chosen disaggregation strategy, you can configure whether Worker1 handles prefill and Worker2 handles decode, or vice versa. For more information on how to select and configure these strategies, see the [Disaggregation Strategy](#disaggregation-strategy) section below. **Note:** The diagram above shows all possible components in a deployment. Depending on the chosen disaggregation strategy, you can configure whether Worker1 handles prefill and Worker2 handles decode, or vice versa. For more information on how to select and configure these strategies, see the [Disaggregation Strategy](#disaggregation-strategy) section below.
### Single-Node Deployments ### Aggregated
> [!IMPORTANT]
> Below we provide some simple shell scripts that run the components for each configuration. Each shell script is simply running the `dynamo-run` to start up the ingress and using `python3` to start up the workers. You can easily take each command and run them in separate terminals.
#### Aggregated
```bash ```bash
cd $DYNAMO_HOME/components/backends/trtllm cd $DYNAMO_HOME/components/backends/trtllm
./launch/agg.sh ./launch/agg.sh
``` ```
#### Aggregated with KV Routing ### Aggregated with KV Routing
```bash ```bash
cd $DYNAMO_HOME/components/backends/trtllm cd $DYNAMO_HOME/components/backends/trtllm
./launch/agg_router.sh ./launch/agg_router.sh
``` ```
#### Disaggregated ### Disaggregated
> [!IMPORTANT] > [!IMPORTANT]
> Disaggregated serving supports two strategies for request flow: `"prefill_first"` and `"decode_first"`. By default, the script below uses the `"decode_first"` strategy, which can reduce response latency by minimizing extra hops in the return path. You can switch strategies by setting the `DISAGGREGATION_STRATEGY` environment variable. > Disaggregated serving supports two strategies for request flow: `"prefill_first"` and `"decode_first"`. By default, the script below uses the `"decode_first"` strategy, which can reduce response latency by minimizing extra hops in the return path. You can switch strategies by setting the `DISAGGREGATION_STRATEGY` environment variable.
...@@ -143,7 +145,7 @@ cd $DYNAMO_HOME/components/backends/trtllm ...@@ -143,7 +145,7 @@ cd $DYNAMO_HOME/components/backends/trtllm
./launch/disagg.sh ./launch/disagg.sh
``` ```
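The strategy switch described in the note above can be sketched as a small wrapper. This is only an illustration: `pick_strategy` is a hypothetical helper, and it assumes `launch/disagg.sh` reads `DISAGGREGATION_STRATEGY` from the environment, as the note states.

```bash
# Hypothetical helper (not part of Dynamo): validate a strategy name,
# falling back to the documented default, "decode_first".
pick_strategy() {
  local strategy="${1:-decode_first}"
  case "$strategy" in
    prefill_first|decode_first) printf '%s\n' "$strategy" ;;
    *) echo "unknown disaggregation strategy: $strategy" >&2; return 1 ;;
  esac
}

# Example launch (commented out so the snippet has no side effects):
# cd "$DYNAMO_HOME/components/backends/trtllm"
# DISAGGREGATION_STRATEGY="$(pick_strategy prefill_first)" ./launch/disagg.sh
```

Validating the value before launching catches a misspelled strategy name early instead of leaving the script to interpret it.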
#### Disaggregated with KV Routing ### Disaggregated with KV Routing
> [!IMPORTANT] > [!IMPORTANT]
> Disaggregated serving with KV routing uses a "prefill first" workflow by default. Currently, Dynamo supports KV routing to only one endpoint per model. In disaggregated workflow, it is generally more effective to route requests to the prefill worker. If you wish to use a "decode first" workflow instead, you can simply set the `DISAGGREGATION_STRATEGY` environment variable accordingly. > Disaggregated serving with KV routing uses a "prefill first" workflow by default. Currently, Dynamo supports KV routing to only one endpoint per model. In disaggregated workflow, it is generally more effective to route requests to the prefill worker. If you wish to use a "decode first" workflow instead, you can simply set the `DISAGGREGATION_STRATEGY` environment variable accordingly.
...@@ -153,7 +155,7 @@ cd $DYNAMO_HOME/components/backends/trtllm ...@@ -153,7 +155,7 @@ cd $DYNAMO_HOME/components/backends/trtllm
./launch/disagg_router.sh ./launch/disagg_router.sh
``` ```
#### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1 ### Aggregated with Multi-Token Prediction (MTP) and DeepSeek R1
```bash ```bash
cd $DYNAMO_HOME/components/backends/trtllm cd $DYNAMO_HOME/components/backends/trtllm
...@@ -172,21 +174,16 @@ Notes: ...@@ -172,21 +174,16 @@ Notes:
- There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark. - There is a noticeable latency for the first two inference requests. Please send warm-up requests before starting the benchmark.
- MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates. - MTP performance may vary depending on the acceptance rate of predicted tokens, which is dependent on the dataset or queries used while benchmarking. Additionally, `ignore_eos` should generally be omitted or set to `false` when using MTP to avoid speculating garbage outputs and getting unrealistic acceptance rates.
### Multinode Deployment ## Advanced Examples
For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. You can see [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on the single node.
### Client
See [client](../llm/README.md#client) section to learn how to send request to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `dynamo-run in=http`. Below we provide a selected list of advanced examples. Please open up an issue if you'd like to see a specific example!
### Benchmarking ### Multinode Deployment
To benchmark your deployment with GenAI-Perf, see this utility script, configuring the For comprehensive instructions on multinode serving, see the [multinode-examples.md](./multinode/multinode-examples.md) guide. It provides step-by-step deployment examples and configuration tips for running Dynamo with TensorRT-LLM across multiple nodes. While the walkthrough uses DeepSeek-R1 as the model, you can easily adapt the process for any supported model by updating the relevant configuration files. See the [Llama4+eagle](./llama4_plus_eagle.md) guide to learn how to use these scripts when a single worker fits on a single node.
`model` name and `host` based on your deployment: [perf.sh](../../benchmarks/llm/perf.sh)
### Speculative Decoding
- **[Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md)**
## Disaggregation Strategy ## Disaggregation Strategy
...@@ -221,6 +218,13 @@ indicates a request to this model may be migrated up to 3 times to another Backe ...@@ -221,6 +218,13 @@ indicates a request to this model may be migrated up to 3 times to another Backe
The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience. The migrated request will continue responding to the original request, allowing for a seamless transition between Backends, and a reduced overall request failure rate at the Frontend for enhanced user experience.
## More Example Architectures ## Client
See the [client](../llm/README.md#client) section to learn how to send requests to the deployment.
NOTE: To send a request to a multi-node deployment, target the node which is running `python3 -m dynamo.frontend <args>`.
- [Llama 4 Maverick Instruct + Eagle Speculative Decoding](./llama4_plus_eagle.md) ## Benchmarking
To benchmark your deployment with GenAI-Perf, see this utility script, configuring the
`model` name and `host` based on your deployment: [perf.sh](../../../benchmarks/llm/perf.sh)
...@@ -7,33 +7,81 @@ SPDX-License-Identifier: Apache-2.0 ...@@ -7,33 +7,81 @@ SPDX-License-Identifier: Apache-2.0
This directory contains a Dynamo vllm engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation. This directory contains a Dynamo vllm engine and reference implementations for deploying Large Language Models (LLMs) in various configurations using vLLM. For Dynamo integration, we leverage vLLM's native KV cache events, NIXL based transfer mechanisms, and metric reporting to enable KV-aware routing and P/D disaggregation.
## Deployment Architectures ## Use the Latest Release
See [deployment architectures](../llm/README.md#deployment-architectures) to learn about the general idea of the architecture. vLLM supports aggregated, disaggregated, and KV-routed serving patterns. We recommend using the latest stable release of Dynamo to avoid breaking changes:
## Getting Started [![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
### Prerequisites You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
Start required services (etcd and NATS) using [Docker Compose](../../../deploy/docker-compose.yml): ```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
---
## Table of Contents
- [Feature Support Matrix](#feature-support-matrix)
- [Quick Start](#quick-start)
- [Single Node Examples](#run-single-node-examples)
- [Advanced Examples](#advanced-examples)
- [Deploy on Kubernetes](#kubernetes-deployment)
- [Configuration](#configuration)
## Feature Support Matrix
### Core Dynamo Features
| Feature | vLLM | Notes |
|---------|------|-------|
| [**Disaggregated Serving**](../../docs/architecture/disagg_serving.md) | ✅ | |
| [**Conditional Disaggregation**](../../docs/architecture/disagg_serving.md#conditional-disaggregation) | 🚧 | WIP |
| [**KV-Aware Routing**](../../docs/architecture/kv_cache_routing.md) | ✅ | |
| [**SLA-Based Planner**](../../docs/architecture/sla_planner.md) | ✅ | |
| [**Load Based Planner**](../../docs/architecture/load_planner.md) | 🚧 | WIP |
| [**KVBM**](../../docs/architecture/kvbm_architecture.md) | 🚧 | WIP |
### Large Scale P/D and WideEP Features
| Feature | vLLM | Notes |
|--------------------|------|-----------------------------------------------------------------------|
| **WideEP** | ✅ | Support for PPLX / DeepEP not verified |
| **DP Rank Routing**| ✅ | Supported via external control of DP ranks |
| **GB200 Support** | 🚧 | Container functional on main |
## Quick Start
Below we provide a guide that lets you run all of the common deployment patterns on a single node.
### Start NATS and ETCD in the background
Start using [Docker Compose](../../../deploy/docker-compose.yml)
```bash ```bash
docker compose -f deploy/docker-compose.yml up -d docker compose -f deploy/docker-compose.yml up -d
``` ```
### Build and Run docker ### Pull or build container
We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd like to build your own container from source:
```bash ```bash
./container/build.sh --framework VLLM ./container/build.sh --framework VLLM
``` ```
### Run container
```bash ```bash
./container/run.sh -it --framework VLLM [--mount-workspace] ./container/run.sh -it --framework VLLM [--mount-workspace]
``` ```
This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks. This includes the specific commit [vllm-project/vllm#19790](https://github.com/vllm-project/vllm/pull/19790) which enables support for external control of the DP ranks.
## Run Deployment ## Run Single Node Examples
> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `python3 -m dynamo.frontend` to start the ingress and uses `python3 -m dynamo.vllm` to start the vLLM workers. You can also run each command in separate terminals for better log visibility.
This figure shows an overview of the major components to deploy: This figure shows an overview of the major components to deploy:
...@@ -53,12 +101,7 @@ This figure shows an overview of the major components to deploy: ...@@ -53,12 +101,7 @@ This figure shows an overview of the major components to deploy:
Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern. Note: The above architecture illustrates all the components. The final components that get spawned depend upon the chosen deployment pattern.
### Example Architectures ### Aggregated Serving
> [!IMPORTANT]
> Below we provide simple shell scripts that run the components for each configuration. Each shell script runs `dynamo run` to start the ingress and uses `python3 main.py` to start the vLLM workers. You can run each command in separate terminals for better log visibility.
#### Aggregated Serving
```bash ```bash
# requires one gpu # requires one gpu
...@@ -66,7 +109,7 @@ cd components/backends/vllm ...@@ -66,7 +109,7 @@ cd components/backends/vllm
bash launch/agg.sh bash launch/agg.sh
``` ```
#### Aggregated Serving with KV Routing ### Aggregated Serving with KV Routing
```bash ```bash
# requires two gpus # requires two gpus
...@@ -74,7 +117,7 @@ cd components/backends/vllm ...@@ -74,7 +117,7 @@ cd components/backends/vllm
bash launch/agg_router.sh bash launch/agg_router.sh
``` ```
#### Disaggregated Serving ### Disaggregated Serving
```bash ```bash
# requires two gpus # requires two gpus
...@@ -82,7 +125,7 @@ cd components/backends/vllm ...@@ -82,7 +125,7 @@ cd components/backends/vllm
bash launch/disagg.sh bash launch/disagg.sh
``` ```
#### Disaggregated Serving with KV Routing ### Disaggregated Serving with KV Routing
```bash ```bash
# requires three gpus # requires three gpus
...@@ -90,9 +133,9 @@ cd components/backends/vllm ...@@ -90,9 +133,9 @@ cd components/backends/vllm
bash launch/disagg_router.sh bash launch/disagg_router.sh
``` ```
#### Single Node Data Parallel Attention / Expert Parallelism ### Single Node Data Parallel Attention / Expert Parallelism
This example is not meant to be performant but showcases dynamo routing to data parallel workers This example is not meant to be performant but showcases Dynamo routing to data parallel workers
```bash ```bash
# requires four gpus # requires four gpus
...@@ -100,10 +143,13 @@ cd components/backends/vllm ...@@ -100,10 +143,13 @@ cd components/backends/vllm
bash launch/dep.sh bash launch/dep.sh
``` ```
> [!TIP] > [!TIP]
> Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker. > Run a disaggregated example and try adding another prefill worker once the setup is running! The system will automatically discover and utilize the new worker.
## Advanced Examples
Below we provide a selected list of advanced deployments. Please open up an issue if you'd like to see a specific example!
### Kubernetes Deployment ### Kubernetes Deployment
For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations: For Kubernetes deployment, YAML manifests are provided in the `deploy/` directory. These define DynamoGraphDeployment resources for various configurations:
...@@ -118,7 +164,7 @@ For Kubernetes deployment, YAML manifests are provided in the `deploy/` director ...@@ -118,7 +164,7 @@ For Kubernetes deployment, YAML manifests are provided in the `deploy/` director
- **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first. - **Dynamo Cloud**: Follow the [Quickstart Guide](../../../docs/guides/dynamo_deploy/quickstart.md) to deploy Dynamo Cloud first.
- **Container Images**: The deployment files currently require access to `nvcr.io/nvidian/nim-llm-dev/vllm-runtime`. If you don't have access, build and push your own image: - **Container Images**: We have public images available on [NGC Catalog](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/collections/ai-dynamo/artifacts). If you'd prefer to use your own registry, build and push your own image:
```bash ```bash
./container/build.sh --framework VLLM ./container/build.sh --framework VLLM
# Tag and push to your container registry # Tag and push to your container registry
......
...@@ -19,7 +19,7 @@ limitations under the License. ...@@ -19,7 +19,7 @@ limitations under the License.
## Overview ## Overview
Dynamo `DistributedRuntime` is the core infrastructure in dynamo that enables distributed communication and coordination between different dynamo components. It is implemented in rust (`/lib/runtime`) and exposed to other programming languages via binding (i.e., python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure: Dynamo's `DistributedRuntime` is the core infrastructure in the framework that enables distributed communication and coordination between different Dynamo components. It is implemented in rust (`/lib/runtime`) and exposed to other programming languages via bindings (i.e., python bindings can be found in `/lib/bindings/python`). `DistributedRuntime` follows a hierarchical structure:
- `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It maintains connection to external services (e.g., etcd for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens. - `DistributedRuntime`: This is the highest level object that exposes the distributed runtime interface. It maintains connection to external services (e.g., etcd for service discovery and NATS for messaging) and manages lifecycle with cancellation tokens.
- `Namespace`: A `Namespace` is a logical grouping of components that isolates different model deployments from one another.
...@@ -28,13 +28,13 @@ Dynamo `DistributedRuntime` is the core infrastructure in dynamo that enables di ...@@ -28,13 +28,13 @@ Dynamo `DistributedRuntime` is the core infrastructure in dynamo that enables di
While theoretically each `DistributedRuntime` can have multiple `Namespace`s as long as their names are unique (similar logic also applies to `Component/Namespace` and `Endpoint/Component`), in practice each Dynamo component is typically deployed in its own process and thus has its own `DistributedRuntime` object. However, they share the same namespace to discover each other.
For example, the deployment configuration `examples/llm/configs/disagg.yaml` have four workers: For example, a typical deployment configuration (like `components/backends/vllm/deploy/agg.yaml` or `components/backends/sglang/deploy/agg.yaml`) has multiple workers:
- `Frontend`: Start an HTTP server and register a `chat/completions` endpoint. The HTTP server route the request to the `Processor`. - `Frontend`: Starts an HTTP server and handles incoming requests. The HTTP server routes all requests to the worker components.
- `Processor`: When a new request arrives, `Processor` applies the chat template and perform the tokenization. Then, it route the request to the `VllmWorker`. - `VllmDecodeWorker`: Performs the actual decode computation using the vLLM engine through the `DecodeWorkerHandler`.
- `VllmWorker` and `PrefillWorker`: Perform the actual decode and prefill computation. - `VllmPrefillWorker` (in disaggregated deployments): Performs prefill computation using the vLLM engine through the `PrefillWorkerHandler`.
Since the four workers are deployed in different processes, each of them have their own `DistributedRuntime`. Within their own `DistributedRuntime`, they all have their own `Namespace`s named `dynamo`. Then, under their own `dynamo` namespace, they have their own `Component`s named `Frontend/Processor/VllmWorker/PrefillWorker`. Lastly, for the `Endpoint`, `Frontend` has no `Endpoints`, `Processor` and `VllmWorker` each has a `generate` endpoint, and `PrefillWorker` has a placeholder `mock` endpoint. Since the workers are deployed in different processes, each of them has its own `DistributedRuntime`. Within their own `DistributedRuntime`, they all share the same `Namespace` (e.g., `vllm-agg`, `vllm-disagg`, `sglang-agg`). Then, under their namespace, they have their own `Component`s: `Frontend` uses the `make_engine` function which handles HTTP serving and routing automatically, while worker components like `VllmDecodeWorker` and `VllmPrefillWorker` create components with names like `worker`, `decode`, or `prefill` and register endpoints like `generate` and `clear_kv_blocks`. The `Frontend` component doesn't explicitly create endpoints - instead, the `make_engine` function handles the HTTP server and worker discovery. Worker components create their endpoints programmatically using the `component.endpoint()` method and use their respective handler classes (`DecodeWorkerHandler` or `PrefillWorkerHandler`) to process requests. Their `DistributedRuntime`s are initialized in their respective main functions, their `Namespace`s are configured in the deployment YAML, their `Component`s are created programmatically (e.g., `runtime.namespace("vllm-agg").component("worker")`), and their `Endpoint`s are created using the `component.endpoint()` method.
## Initialization ## Initialization
...@@ -55,7 +55,7 @@ The hierarchy and naming in etcd and NATS may change over time, and this documen ...@@ -55,7 +55,7 @@ The hierarchy and naming in etcd and NATS may change over time, and this documen
- `Component`: When a `Component` object is created, similar to `Namespace`, it isn't registered in etcd. When `create_service` is called, it creates a NATS service group using `{namespace_name}.{service_name}` and registers a service in the registry of the `Component`, where the registry is an internal data structure that tracks all services and endpoints within the `DistributedRuntime`.
- `Endpoint`: When an Endpoint object is created and started, it performs two key registrations: - `Endpoint`: When an Endpoint object is created and started, it performs two key registrations:
- NATS Registration: The endpoint is registered with the NATS service group created during service creation. The endpoint is assigned a unique subject following the naming: `{namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}`. - NATS Registration: The endpoint is registered with the NATS service group created during service creation. The endpoint is assigned a unique subject following the naming: `{namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}`.
- etcd Registration: The endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (i.e., two `PrefillWorker`s in one deployment) share the same `Namespace`, `Componenet`, and `Endpoint` name. They are distinguished by their different primary `lease_id` of their `DistributedRuntime`. - etcd Registration: The endpoint information is stored in etcd at a path following the naming: `/services/{namespace}/{component}/{endpoint}-{lease_id}`. Note that the endpoints of different workers of the same type (i.e., two `VllmPrefillWorker`s in one deployment) share the same `Namespace`, `Component`, and `Endpoint` name. They are distinguished by their different primary `lease_id` of their `DistributedRuntime`.
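To make the two naming schemes concrete, the following sketch formats them for sample values. The function names and sample values are made up for illustration. The lease ID is assumed to be rendered as lowercase hex in the NATS subject and is passed through unchanged in the etcd key, since the text does not specify its encoding there. As noted earlier, the actual naming may change over time.

```bash
# Illustrative only: these helpers are not part of Dynamo.
# NATS subject: {namespace_name}.{service_name}.{endpoint_name}-{lease_id_hex}
nats_subject() {
  # $4 is a decimal lease ID; %x renders it as lowercase hex (an assumption).
  printf '%s.%s.%s-%x\n' "$1" "$2" "$3" "$4"
}

# etcd key: /services/{namespace}/{component}/{endpoint}-{lease_id}
etcd_key() {
  # $4 is used verbatim.
  printf '/services/%s/%s/%s-%s\n' "$1" "$2" "$3" "$4"
}

nats_subject vllm-disagg prefill generate 255   # prints vllm-disagg.prefill.generate-ff
etcd_key vllm-disagg prefill generate 255       # prints /services/vllm-disagg/prefill/generate-255
```

Two workers of the same type would produce identical prefixes and differ only in the lease ID suffix, which is what lets a client tell them apart.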
## Calling Endpoints ## Calling Endpoints
...@@ -74,19 +74,6 @@ After selecting which endpoint to hit, the `Client` sends the serialized request ...@@ -74,19 +74,6 @@ After selecting which endpoint to hit, the `Client` sends the serialized request
We provide native rust and python (through binding) examples for basic usage of `DistributedRuntime`: We provide native rust and python (through binding) examples for basic usage of `DistributedRuntime`:
- Rust: `/lib/runtime/examples/` - Rust: `/lib/runtime/examples/`
- Python: `/lib/bindings/python/examples/`. We also provide a complete example of using `DistributedRuntime` for communication and Dynamo's LLM library for prompt templates and (de)tokenization to deploy a vllm-based service. Please refer to `lib/bindings/python/examples/hello_world/server_vllm.py` for details. - Python: We also provide complete examples of using `DistributedRuntime`. Please refer to the engines in `/components/backends` for full implementation details.
```{note}
Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to be slow and requires extensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
You can tune the number of parallel build jobs for building vLLM from source
on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
For example, on an ARM machine with low system resources:
`./container/build.sh --framework vllm --platform linux/arm64 --build-arg VLLM_MAX_JOBS=2`
For example, on a GB200, which has many CPU cores and ample system memory:
`./container/build.sh --framework vllm --platform linux/arm64 --build-arg VLLM_MAX_JOBS=64`
When vLLM has pre-built ARM wheels published, this process can be improved.
```
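One possible way to choose a `VLLM_MAX_JOBS` value from core count and RAM is sketched below. The helper name `pick_vllm_max_jobs` and the default budget of 8 GB of RAM per build job are assumptions made for illustration, not figures documented by vLLM or Dynamo; adjust them to what your machine tolerates.

```shell
# Pick a job count limited by both CPU cores and available RAM.
# Arguments: <cores> <mem_gb> [gb_per_job]
# gb_per_job is an ASSUMED memory budget per parallel job (default 8).
pick_vllm_max_jobs() {
  job_cores="$1"
  job_mem_gb="$2"
  gb_per_job="${3:-8}"
  by_mem=$(( job_mem_gb / gb_per_job ))
  jobs="$job_cores"
  # Use the smaller of the CPU limit and the memory limit.
  if [ "$by_mem" -lt "$jobs" ]; then jobs="$by_mem"; fi
  # Always allow at least one job.
  if [ "$jobs" -lt 1 ]; then jobs=1; fi
  echo "$jobs"
}
```

On Linux, `nproc` and the `MemTotal` entry in `/proc/meminfo` can supply the two inputs, and the result can then be passed as `--build-arg VLLM_MAX_JOBS=<value>` to `./container/build.sh`.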
# Steps to create EKS cluster with EFS
## 1. Install CLIs
### a. Install AWS CLI (steps [here](https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html))
```
sudo apt install unzip
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
```
### b. Install Kubernetes CLI (steps [here](https://docs.aws.amazon.com/eks/latest/userguide/install-kubectl.html))
```
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.30.0/2024-05-12/bin/linux/amd64/kubectl
chmod +x ./kubectl
mkdir -p $HOME/bin && cp ./kubectl $HOME/bin/kubectl && export PATH=$HOME/bin:$PATH
echo 'export PATH=$HOME/bin:$PATH' >> ~/.bashrc
```
### c. Install EKS CLI (steps [here](https://eksctl.io/installation/))
```
ARCH=amd64
PLATFORM=$(uname -s)_$ARCH
curl -sLO "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_$PLATFORM.tar.gz"
curl -sL "https://github.com/eksctl-io/eksctl/releases/latest/download/eksctl_checksums.txt" | grep $PLATFORM | sha256sum --check
tar -xzf eksctl_$PLATFORM.tar.gz -C /tmp && rm eksctl_$PLATFORM.tar.gz
sudo mv /tmp/eksctl /usr/local/bin
```
### d. Install Helm CLI (steps [here](https://docs.aws.amazon.com/eks/latest/userguide/helm.html))
```
curl https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 > get_helm.sh
chmod 700 get_helm.sh
./get_helm.sh
```
## 2. Create an EKS cluster
In this example, we create an EKS cluster with one `g6e.48xlarge` compute node, which has 8 NVIDIA L40S GPUs, and one `c5.2xlarge` CPU node for system workloads (the EKS control plane itself is managed by AWS). We also set up EFA between the compute nodes.
### a. Configure AWS CLI
```
aws configure
```
### b. Create a config file for EKS cluster creation
```
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: <CLUSTER_NAME>
  version: "1.32"
  region: <REGION_NAME>
iam:
  withOIDC: true
managedNodeGroups:
  - name: sys-ng
    instanceType: c5.2xlarge
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        efs: true
        awsLoadBalancerController: true
        cloudWatch: true
        albIngress: true
  - name: efa-compute-ng
    instanceType: g6e.48xlarge
    minSize: 1
    desiredCapacity: 1
    maxSize: 1
    volumeSize: 300
    efaEnabled: true
    privateNetworking: true
    iam:
      withAddonPolicies:
        imageBuilder: true
        autoScaler: true
        ebs: true
        efs: true
        awsLoadBalancerController: true
        cloudWatch: true
        albIngress: true
```
> [!NOTE]
> We set `minSize` and `desiredCapacity` to 1 because cluster creation fails if the requested nodes are not available. For example, if you set `desiredCapacity` to 2 but two nodes are not available, cluster creation times out even though no error is reported. The easiest way to avoid this is to create the cluster with 1 node and increase the node count later in the EKS console. After you increase the number of nodes in your node groups, make sure the GPU nodes are in the same subnet. This is required for EFA to work.
### c. Create the EKS cluster
```
eksctl create cluster -f eks_cluster_config.yaml
```
## 3. Create an EFS file system
We need a common, shared storage location so that pods deployed to multiple nodes can load shards of the same model. This way, they can work together to serve inference requests for models too large to be loaded by the GPUs on a single node. In Kubernetes, these shared storage locations are referred to as persistent volumes. A persistent volume can be mounted into any number of pods and then accessed by processes running inside those pods as if it were part of the pod's file system. We will use EFS as the persistent volume.
Additionally, we need to create a persistent-volume claim, which we can use to assign the persistent volume to a pod.
### a. Create an IAM role
Follow the steps to create an IAM role for your EFS file system: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-create-iam-resources. This role will be used later when you install the EFS CSI Driver.
### b. Install EFS CSI driver
Install the EFS CSI Driver through the Amazon EKS add-on in the AWS console: https://docs.aws.amazon.com/eks/latest/userguide/efs-csi.html#efs-install-driver. Once the installation is done, check the Add-ons section in the EKS console. The driver should show `Active` under Status.
### c. Create EFS file system
Follow the steps to create an EFS file system: https://github.com/kubernetes-sigs/aws-efs-csi-driver/blob/master/docs/efs-create-filesystem.md. Make sure you create the mount targets in the correct subnets in the last step, because this determines whether your nodes can access the EFS file system.
## 4. Test
Follow the steps to check if your EFS file system is working properly with your nodes: https://github.com/kubernetes-sigs/aws-efs-csi-driver/tree/master/examples/kubernetes/multiple_pods. This test is going to mount your EFS file system on all of your available nodes and write a text file to the file system.
## 5. Create StorageClass
You can find your `fileSystemId` in the AWS EFS console. It usually starts with `fs-`.
```
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: efs-sc
  annotations:
    storageclass.kubernetes.io/is-default-class: "true"
provisioner: efs.csi.aws.com
parameters:
  fileSystemId: fs-01e72da3fcdbf8a4d
  provisioningMode: efs-ap
  directoryPerms: "777"
  uid: "1000"
  gid: "1000"
```
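Before applying the manifest, a quick format check on the file system ID can catch copy-paste mistakes. The sketch below assumes that an ID is `fs-` followed by 8 to 40 lowercase hexadecimal characters, which is our reading of the ID pattern in the EFS API reference; the helper name `is_efs_id` is hypothetical.

```shell
# Return success (0) if the argument looks like an EFS file system ID.
# ASSUMPTION: IDs match fs- followed by 8-40 lowercase hex characters.
is_efs_id() {
  printf '%s\n' "$1" | grep -Eq '^fs-[0-9a-f]{8,40}$'
}

if is_efs_id "fs-01e72da3fcdbf8a4d"; then
  echo "looks like a valid EFS file system ID"
else
  echo "unexpected EFS file system ID format" >&2
fi
```

If you prefer the CLI over the console, `aws efs describe-file-systems --query 'FileSystems[].FileSystemId' --output text` lists the IDs in the configured region.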
```
kubectl apply -f storageclass.yaml
```
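Section 3 mentions a persistent-volume claim for assigning the volume to a pod. A minimal sketch of such a claim against the `efs-sc` StorageClass follows. The claim name `model-cache-pvc`, the file name `pvc.yaml`, and the requested size are illustrative assumptions; Kubernetes requires the `storage` field, but EFS does not enforce the capacity.

```shell
# Write a PersistentVolumeClaim manifest that requests storage from efs-sc.
# The claim name and size below are illustrative assumptions.
cat > pvc.yaml <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-cache-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: efs-sc
  resources:
    requests:
      storage: 100Gi
EOF
```

The manifest can then be applied with `kubectl apply -f pvc.yaml -n <NAMESPACE>`. `ReadWriteMany` is used so that pods on different nodes can mount the same volume at the same time.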
# Steps to install Dynamo Cloud from Source
## 1. Build Dynamo Base Image
Create one ECR repository
```
aws configure
aws ecr create-repository --repository-name <ECR_REPOSITORY>
```
Build Image
```
export NAMESPACE=dynamo-cloud
export DOCKER_SERVER=<ECR_REGISTRY>
export DOCKER_USERNAME=AWS
export DOCKER_PASSWORD="$(aws ecr get-login-password --region <ECR_REGION>)"
export IMAGE_TAG=0.3.2.1
./container/build.sh
```
Push Image
```
docker tag dynamo:latest-vllm <ECR_REGISTRY>/<ECR_REPOSITORY>:$IMAGE_TAG
aws ecr get-login-password | docker login --username AWS --password-stdin <ECR_REGISTRY>
docker push <ECR_REGISTRY>/<ECR_REPOSITORY>:$IMAGE_TAG
```
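The `<ECR_REGISTRY>` placeholder used above is a hostname of the form `<account-id>.dkr.ecr.<region>.amazonaws.com`. The helpers below sketch how the registry and the full image reference fit together; the function names, the account ID `123456789012`, and the repository name `dynamo-base` are hypothetical.

```shell
# Build the ECR registry hostname from an AWS account ID and a region.
ecr_registry() {
  printf '%s.dkr.ecr.%s.amazonaws.com\n' "$1" "$2"
}

# Build the image reference used by `docker tag` and `docker push`.
# Arguments: <registry> <repository> <tag>
ecr_image_ref() {
  printf '%s/%s:%s\n' "$1" "$2" "$3"
}

# Example with a made-up account ID and repository name.
registry="$(ecr_registry 123456789012 us-west-2)"
ecr_image_ref "$registry" dynamo-base 0.3.2.1
```

Your real account ID can be read with `aws sts get-caller-identity --query Account --output text`.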
## 2. Install Dynamo Cloud
Build and Push Operator Image
```
cd deploy/cloud/operator
vim Earthfile # change ARG IMAGE_SUFFIX=<ECR_REPOSITORY>
earthly --push +docker --DOCKER_SERVER=$DOCKER_SERVER --IMAGE_TAG=$IMAGE_TAG
```
Create secrets
```
kubectl create namespace ${NAMESPACE}
kubectl create secret docker-registry docker-imagepullsecret \
  --docker-server=${DOCKER_SERVER} \
  --docker-username=${DOCKER_USERNAME} \
  --docker-password=${DOCKER_PASSWORD} \
  --namespace=${NAMESPACE}
export HF_TOKEN=<HF_TOKEN>
kubectl create secret generic hf-token-secret \
  --from-literal=HF_TOKEN=${HF_TOKEN} \
  -n ${NAMESPACE}
```
Install Dynamo Cloud
```
cd dynamo/deploy/cloud/helm
helm install dynamo-crds ./crds/ \
  --namespace default \
  --wait \
  --atomic
```
```
helm dep build ./platform/
# The namespace and docker-imagepullsecret were already created in the "Create secrets" step above;
# creating them again would fail with an AlreadyExists error.
# Install platform
helm install dynamo-platform ./platform/ \
  --namespace ${NAMESPACE} \
  --set "dynamo-operator.controllerManager.manager.image.repository=${DOCKER_SERVER}/dynamo-operator" \
  --set "dynamo-operator.controllerManager.manager.image.tag=${IMAGE_TAG}" \
  --set "dynamo-operator.imagePullSecrets[0].name=docker-imagepullsecret"
```
Your pods should be running, as shown below:
```
ubuntu@ip-192-168-83-157:~/dynamo/components/backends/vllm/deploy$ kubectl get pods -A
NAMESPACE      NAME                                                              READY   STATUS    RESTARTS   AGE
dynamo-cloud   dynamo-platform-dynamo-operator-controller-manager-86795c5f4j4k   2/2     Running   0          4h17m
dynamo-cloud   dynamo-platform-etcd-0                                            1/1     Running   0          4h17m
dynamo-cloud   dynamo-platform-nats-0                                            2/2     Running   0          4h17m
dynamo-cloud   dynamo-platform-nats-box-5dbf45c748-bxqj7                         1/1     Running   0          4h17m
```
# Steps to deploy vLLM example
## 1. Deploy Dynamo Graph
```
cd dynamo/components/backends/vllm/deploy
vim agg_router.yaml  # under metadata, add "namespace: dynamo-cloud" and change the image to your built base image
kubectl apply -f agg_router.yaml
```
Your pods should be running, as shown below:
```
ubuntu@ip-192-168-83-157:~/dynamo/components/backends/vllm/deploy$ kubectl get pods -A
NAMESPACE      NAME                                                              READY   STATUS    RESTARTS   AGE
dynamo-cloud   dynamo-platform-dynamo-operator-controller-manager-86795c5f4j4k   2/2     Running   0          4h17m
dynamo-cloud   dynamo-platform-etcd-0                                            1/1     Running   0          4h17m
dynamo-cloud   dynamo-platform-nats-0                                            2/2     Running   0          4h17m
dynamo-cloud   dynamo-platform-nats-box-5dbf45c748-bxqj7                         1/1     Running   0          4h17m
dynamo-cloud   vllm-agg-router-frontend-79d599bb9c-fg97p                         1/1     Running   0          4m9s
dynamo-cloud   vllm-agg-router-vllmdecodeworker-787d575485-hrcjp                 1/1     Running   0          4m9s
dynamo-cloud   vllm-agg-router-vllmdecodeworker-787d575485-zkwdd                 1/1     Running   0          4m9s
```
Test the Deployment
```
# The port-forward blocks while it is active, so run it in a separate terminal.
kubectl port-forward deployment/vllm-agg-router-frontend 8080:8000 -n dynamo-cloud
# In another terminal, send a test request to the forwarded port.
curl localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-0.6B",
    "messages": [
    {
        "role": "user",
        "content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
    }
    ],
    "stream": false,
    "max_tokens": 30
  }'
```
You should see output similar to the following:
```
{"id":"chatcmpl-bbe52b36-90ed-4479-9872-89e1aa412aa7","choices":[{"index":0,"message":{"content":"<think>\nOkay, so the user wants me to develop a character background for an explorer named someone in Eldoria. The character is part of the","refusal":null,"tool_calls":null,"role":"assistant","function_call":null,"audio":null},"finish_reason":"stop","logprobs":null}],"created":1753417848,"model":"Qwen/Qwen3-0.6B","service_tier":null,"system_fingerprint":null,"object":"chat.completion","usage":{"prompt_tokens":196,"completion_tokens":29,"total_tokens":225,"prompt_tokens_details":null,"completion_tokens_details":null}}
```