Commit 3c500ae7 authored by Graham King, committed by GitHub

docs: Update docs for new UX (#2070)

parent f3d784f3
......@@ -55,96 +55,68 @@ Built in Rust for performance and in Python for extensibility, Dynamo is fully o
The following examples require a few system level packages.
We recommend Ubuntu 24.04 with an x86_64 CPU. See [docs/support_matrix.md](docs/support_matrix.md).
```
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0
python3 -m venv venv
source venv/bin/activate
pip install "ai-dynamo[all]"
```
> [!NOTE]
> To ensure compatibility, please refer to the examples in the release branch or tag that matches the version you installed.
### Building the Dynamo Base Image
1. Install etcd and nats
Although not needed for local development, deploying your Dynamo pipelines to Kubernetes requires you to push a Dynamo base image to your container registry. You can use any container registry of your choice, such as:
- Docker Hub (docker.io)
- NVIDIA NGC Container Registry (nvcr.io)
- Any private registry
To coordinate across the data center, Dynamo relies on an etcd and nats cluster. To run locally, these need to be available.
We publish our images in [nvcr.io](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime) and you can use them.
Alternatively you could build and push an image from source:
- [etcd](https://etcd.io/) can be run directly as `./etcd`.
- [nats](https://nats.io/) needs jetstream enabled: `nats-server -js`.
```bash
./container/build.sh
docker tag dynamo:latest-vllm <your-registry>/dynamo-base:latest-vllm
docker login <your-registry>
docker push <your-registry>/dynamo-base:latest-vllm
The Dynamo team recommends the `uv` Python package manager, although any Python package manager works. Install uv:
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Notes about builds for specific frameworks:
- For specific details on the `--framework vllm` build, [read about the vLLM backend](components/backends/vllm/README.md).
- For specific details on the `--framework tensorrtllm` build, [read about the TensorRT-LLM backend](components/backends/trtllm/README.md).
2. Select an engine
Note about AWS environments:
- If deploying Dynamo in AWS, make sure to build the container with EFA support using the `--make-efa` flag.
We publish Python wheels specialized for each of our supported engines: vllm, sglang, llama.cpp and trtllm. The examples that follow use sglang; read on for other engines.
After building, you can use this image by setting the `DYNAMO_IMAGE` environment variable to point to your built image:
```bash
export DYNAMO_IMAGE=<your-registry>/dynamo-base:latest-vllm
```
uv venv venv
source venv/bin/activate
uv pip install pip
> [!NOTE]
> We are working on leaner base images that can be built using the targets in the top-level Earthfile.
# Choose one
uv pip install "ai-dynamo[sglang]"
uv pip install "ai-dynamo[vllm]"
uv pip install "ai-dynamo[llama_cpp]" # CPU, see later for GPU
```
### Running and Interacting with an LLM Locally
You can run a model and interact with it locally using the commands below.
We support several backends including: `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.
#### Example Commands
```
python -m dynamo.frontend [--http-port 8080]
python -m dynamo.vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
python -m dynamo.frontend --interactive
python -m dynamo.sglang.worker Qwen/Qwen3-4B
```
```
? User › Hello, how are you?
✔ User · Hello, how are you?
Okay, so I'm trying to figure out how to respond to the user's greeting. They said, "Hello, how are you?" and then followed it with "Hello! I'm just a program, but thanks for asking." Hmm, I need to come up with a suitable reply. ...
```
### LLM Serving
If the model is not available locally it will be downloaded from HuggingFace and cached.
Dynamo provides a simple way to spin up a local set of inference
components including:
You can also pass a local path: `python -m dynamo.sglang.worker --model-path ~/llms/Qwen3-0.6B`
### Running an LLM API server
Dynamo provides a simple way to spin up a local set of inference components including:
- **OpenAI Compatible Frontend** – High performance OpenAI compatible http api server written in Rust.
- **Basic and Kv Aware Router** – Route and load balance traffic to a set of workers.
- **Workers** – Set of pre-configured LLM serving engines.
To run a minimal configuration you can use a pre-configured
example.
#### Start Dynamo Distributed Runtime Services
First start the Dynamo Distributed Runtime services:
```bash
docker compose -f deploy/metrics/docker-compose.yml up -d
```
#### Start Dynamo LLM Serving Components
Next serve a minimal configuration with an http server, basic
round-robin router, and a single worker.
# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router:
python -m dynamo.frontend [--http-port 8080]
```bash
cd examples/llm
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
# Start the sglang engine, connecting to nats and etcd to receive requests. You can run several of these,
# both for the same model and for multiple models. The frontend node will discover them.
python -m dynamo.sglang.worker deepseek-ai/DeepSeek-R1-Distill-Llama-8B
```
#### Send a Request
......@@ -163,43 +135,143 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json"
}' | jq
```
Rerun with `curl -N` and change `stream` in the request to `true` to get the responses as soon as the engine issues them.
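With `stream` set to `true`, OpenAI-compatible servers usually send the reply as server-sent events: each event line starts with `data: ` and carries a JSON chunk, and the stream ends with `data: [DONE]`. The sketch below shows how a client might reassemble the text, assuming the Dynamo frontend follows that convention. The function name and the sample lines are hand-written for illustration; they are not captured output.

```python
import json


def collect_stream_text(lines):
    """Join the `delta.content` pieces from OpenAI-style SSE lines."""
    pieces = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not a data event
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # conventional end-of-stream marker
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                pieces.append(content)
    return "".join(pieces)


# Illustrative input only (assumed chunk shape, not real server output).
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": ", world"}}]}',
    "data: [DONE]",
]
text = collect_stream_text(sample)
```

In a real client the `lines` iterable would come from the HTTP response body, read line by line as it arrives.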
### Engines
In the introduction we installed the `sglang` engine. There are other options.
All of these require nats and etcd, as well as a frontend (`python -m dynamo.frontend [--interactive]`).
#### vllm
```
uv pip install ai-dynamo[vllm]
```
Run the backend/worker like this:
```
python -m dynamo.vllm --help
```
vllm attempts to allocate enough KV cache for the full context length at startup. If that does not fit in your available memory pass `--context-length <value>`.
To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.
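To get a feel for why the full-context KV cache mentioned above may not fit in memory, the rough arithmetic can be sketched as follows. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is an assumption in the style of Llama-3.1-8B, and the helper name is hypothetical. vllm's real allocation also depends on its memory-utilization settings and block management, so treat this as an estimate for a single sequence only.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_length, bytes_per_value=2):
    """Approximate KV cache size in bytes for one sequence at the given context length."""
    # Factor 2: each layer stores one key tensor and one value tensor per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * context_length


# Assumed Llama-3.1-8B-style shape; check your model's config for real values.
full_gib = kv_cache_bytes(32, 8, 128, context_length=131072) / 2**30
reduced_gib = kv_cache_bytes(32, 8, 128, context_length=16384) / 2**30
print(f"full context: {full_gib} GiB, reduced context: {reduced_gib} GiB")
```

Under these assumptions, lowering the context length from 131072 to 16384 tokens shrinks the per-sequence cache by a factor of eight, which is the effect `--context-length <value>` is meant to give you.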
#### sglang
```
uv pip install ai-dynamo[sglang]
```
Run the backend/worker like this:
```
python -m dynamo.sglang.worker --help
```
You can pass any sglang flags directly to this worker; see https://docs.sglang.ai/backend/server_arguments.html , which also covers using multiple GPUs.
#### TRT-LLM
This currently requires a container. TODO: add the docs.
#### llama.cpp
To install llama.cpp for CPU inference:
```
uv pip install ai-dynamo[llama_cpp]
```
To build llama.cpp for CUDA:
```
pip install llama-cpp-python -C cmake.args="-DGGML_CUDA=on"
uv pip install uvloop ai-dynamo
```
At time of writing the `uv pip` version does not support that syntax, so use `pip` directly inside the venv.
To build llama.cpp for other accelerators see https://pypi.org/project/llama-cpp-python/ .
Download a GGUF and run the engine like this:
```
python -m dynamo.llama_cpp --model-path ~/llms/Qwen3-0.6B-Q8_0.gguf
```
If you have multiple GPUs, llama.cpp does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.
### Local Development
If you use vscode or cursor, we have a .devcontainer folder built on [Microsoft's extension](https://code.visualstudio.com/docs/devcontainers/containers). For instructions, see the [README](.devcontainer/README.md).
1. Install libraries
Otherwise, to develop locally, we recommend working inside of the container
**Ubuntu:**
```
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
```
```bash
./container/build.sh
./container/run.sh -it --mount-workspace
**macOS:**
- [Homebrew](https://brew.sh/)
```
# if brew is not installed on your system, install it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
- [Xcode](https://developer.apple.com/xcode/)
cargo build --release
mkdir -p /workspace/deploy/sdk/src/dynamo/sdk/cli/bin
cp /workspace/target/release/dynamo-run /workspace/deploy/sdk/src/dynamo/sdk/cli/bin
```
brew install cmake protobuf
uv pip install -e .
export PYTHONPATH=$PYTHONPATH:/workspace/deploy/sdk/src:/workspace/components/planner/src
## Check that Metal is accessible
xcrun -sdk macosx metal
```
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
#### Conda Environment
2. Install Rust
Alternatively, you can use a conda environment
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
```bash
conda activate <ENV_NAME>
3. Create a Python virtual env:
pip install nixl # Or install https://github.com/ai-dynamo/nixl from source
```
uv venv dynamo
source dynamo/bin/activate
```
cargo build --release
4. Install build tools
# To install ai-dynamo-runtime from source
cd lib/bindings/python
pip install .
```
uv pip install pip maturin
```
cd ../../../
pip install ".[all]"
[Maturin](https://github.com/PyO3/maturin) is the Rust<->Python bindings build tool.
5. Build the Rust bindings
```
cd lib/bindings/python
maturin develop --uv
```
Follow the [Quickstart Guide](docs/guides/dynamo_deploy/quickstart.md)
6. Install the wheel
```
cd $PROJECT_ROOT
uv pip install .
```
Note that an editable install (`-e`) does not work because the `dynamo` package is split over multiple directories, one per backend.
You should now be able to run `python -m dynamo.frontend`.
Remember that nats and etcd must be running (see earlier).
Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.
If you use vscode or cursor, we have a .devcontainer folder built on [Microsoft's extension](https://code.visualstudio.com/docs/devcontainers/containers). For instructions, see the [README](.devcontainer/README.md).
### Deployment to Kubernetes
Follow the [Quickstart Guide](docs/guides/dynamo_deploy/quickstart.md) to deploy to Kubernetes.
......@@ -74,18 +74,14 @@ metrics --component MyComponent --endpoint my_endpoint
To run a more realistic deployment to gather metrics from,
see the examples in [examples/llm](../../examples/llm).
For example, for a VLLM + KV Routing based deployment that
exposes statistics on an endpoint labeled
`dynamo/VllmWorker/load_metrics` (note: this does NOT currently work
with any other example such as examples/vllm_v0, vllm_v1, ...):
```bash
cd deploy/examples/llm
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
python -m dynamo.frontend &
python -m dynamo.vllm --model-path <your-model-checkout>
```
Then, to monitor the metrics of these VllmWorkers, run:
```bash
metrics --component VllmWorker --endpoint load_metrics
metrics --component backend --endpoint load_metrics
```
**NOTE**: `load_metrics` is currently a
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo Python Bindings
Python bindings for the Dynamo runtime system, enabling distributed computing capabilities for machine learning workloads.
## 🚀 Quick Start
1. Install `uv`: https://docs.astral.sh/uv/#getting-started
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```
2. Install `protoc` protobuf compiler: https://grpc.io/docs/protoc-installation/.
For example on an Ubuntu/Debian system:
```
apt install protobuf-compiler
```
3. Setup a virtualenv
```
uv venv
source .venv/bin/activate
uv pip install maturin
```
4. Build and install dynamo wheel
```
maturin develop --uv
```
## Run Examples
### Prerequisite
See [README.md](../runtime/README.md#prerequisites).
### Hello World Example
1. Start 3 separate shells, and activate the virtual environment in each
```
source .venv/bin/activate
```
2. In one shell (shell 1), run the example server instance-1
```
python3 ./examples/hello_world/server.py
```
3. (Optional) In another shell (shell 2), run the example server instance-2
```
python3 ./examples/hello_world/server.py
```
4. In the last shell (shell 3), run the example client:
```
python3 ./examples/hello_world/client.py
```
If you run the example client in rapid succession, and you started more than
one server instance above, you should see the requests from the client being
distributed across the server instances in each server's output. If only one
server instance is started, you should see the requests go to that server
each time.
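The distribution described above can be pictured with a small sketch. It assumes a plain round-robin policy, which is a simplification chosen for illustration; the runtime's actual selection policy may differ, and the instance names here are made up.

```python
from collections import Counter
from itertools import cycle


def distribute(requests, instances):
    """Assign each request to the next instance in turn (simple round-robin)."""
    rotation = cycle(instances)
    return [(request, next(rotation)) for request in requests]


# Two server instances: requests alternate between them.
assignments = distribute(range(6), ["server-1", "server-2"])
per_instance = Counter(instance for _, instance in assignments)

# One server instance: every request goes to it.
single = distribute(range(3), ["server-1"])
```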
## Performance
The performance impact of synchronizing the Python and Rust async runtimes
is a critical consideration when optimizing the performance of a highly
concurrent and parallel distributed system.
The Python GIL is a global critical section and is ultimately the death of
parallelism. To compound that, when Rust async futures become ready,
accessing the GIL on those async event loops needs to be considered carefully.
Under high load, accessing the GIL or performing CPU-intensive tasks
on the event loop threads can starve other async tasks of CPU resources.
However, performing a `tokio::task::spawn_blocking` is not without overhead
either.
If you are bouncing many small messages back and forth between the Python and Rust
event loops where Rust requires GIL access, this is a pattern where moving the
code from Python to Rust will give you significant gains.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo SDK
## Introduction
Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. See [Python Bindings](./python_bindings.md).
Dynamo SDK is a layer on top of the core. It is a Python framework that makes it easy to create inference graphs and deploy them locally and onto a target K8s cluster. The SDK was heavily inspired by [BentoML's](https://github.com/bentoml/BentoML) open source deployment patterns. The Dynamo CLI is a companion tool that allows you to spin up an inference pipeline locally, containerize it, and deploy it. You can find a toy hello-world example and instructions for deploying it [here](../examples/hello_world.md).
## Installation
The SDK can be installed using pip:
```bash
pip install ai-dynamo
```
## Core Concepts
As you read about each concept, it is helpful to have the [basic example](../examples/hello_world.md) up as well so you can refer back to it.
### Defining a Service
A Service is a core building block for a project. You can think of it as a logical unit of work. For example, you might have a service responsible for preprocessing and tokenizing and another service running the model worker itself. You define a service using the `@service` decorator on a class.
```python
@service(
    dynamo={
        "namespace": "dynamo",
    },
    resources={"gpu": 2, "cpu": "10", "memory": "20Gi"},
    workers=1,
)
class MyService:  # the decorator is applied to a class; the name is illustrative
    ...
```
Key configuration options:
1. `dynamo`: Dictionary that defines the Dynamo configuration and enables/disables it. When enabled, a dynamo worker is created under the hood which can register with the [Distributed Runtime](../architecture/architecture.md).
2. `resources`: Dictionary defining resource requirements. The `gpu` field is used for local and remote deployment. The other fields are used to determine resources when deploying to K8s.
3. `workers`: Number of parallel instances of the service to spin up.
### Writing a Service
Let's walk through an example to understand how you write a dynamo service.
```python
import ServiceB

@service(dynamo={"namespace": "dynamo"}, resources={"gpu": 1})
class ServiceA:
    # Define service dependencies
    service_b = depends(ServiceB)

    def __init__(self, model_name="meta-llama/Llama-3.1-8B-Instruct"):
        self.model_name = model_name
        self.engine = None

    @async_on_start
    async def async_init(self):
        # Initialize resources that require async operations
        self.engine = await initialize_model_engine(self.model_name)
        print(f"ServiceA initialized with model: {self.model_name}")

    @on_shutdown
    def shutdown(self):
        # Clean up resources
        if self.engine:
            self.engine.shutdown()
            print("ServiceA engine shut down")

    @endpoint()
    async def generate(self, request: ChatCompletionRequest):
        # Call dependent service
        processed_request = await self.service_b.preprocess(request)
        # Use the engine to generate a response
        response = await self.engine.generate(processed_request)
        return response
```
#### Class-Based Architecture
Dynamo follows a class-based architecture similar to BentoML, making it intuitive for users familiar with that framework. Each service is defined as a Python class, with the following components:
1. Class attributes for dependencies using `depends()`
2. An `__init__` method for standard initialization
3. Optional lifecycle hooks like `@async_on_start` and `@on_shutdown`
4. Endpoints defined with `@endpoint()`. Optionally, an endpoint can be given a name
via `@endpoint("my_endpoint_name")`, but otherwise defaults to the name of the
function being decorated if omitted.
This approach provides a clean separation of concerns and makes the service structure easy to understand.
#### Service Dependencies with `depends()`
The `depends()` function is a powerful feature that lets you create a dependency between services. When you use `depends(ServiceB)`, several things happen:
1. It ensures that `ServiceB` is deployed when `ServiceA` is deployed by adding it to an internal service dependency graph
2. It creates a client to the endpoints of `ServiceB` that is being served under the hood.
3. You are able to access `ServiceB` endpoints as if it were a local function!
```python
# What happens internally when you use depends(ServiceB)
service_b = await runtime.namespace("dynamo").component("ServiceB").endpoint("preprocess").client()
# But with Dynamo SDK, you simply write:
service_b = depends(ServiceB)
# And then call methods directly:
result = await service_b.preprocess(data)
```
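To make the three effects of `depends()` more concrete, here is a toy, in-process model of the idea. It is not the SDK's implementation: `toy_depends` and `_LocalClient` are hypothetical names, the "client" simply calls a local instance instead of going through the runtime, and the dependency graph is just a list stored on the class.

```python
import asyncio


class _LocalClient:
    """Toy client: forwards attribute access to async methods of a local instance."""

    def __init__(self, service_cls):
        self._service = service_cls()

    def __getattr__(self, name):
        method = getattr(self._service, name)

        async def call(*args, **kwargs):
            return await method(*args, **kwargs)

        return call


class toy_depends:
    """Toy descriptor: records the dependency edge and hands back a client."""

    def __init__(self, target):
        self.target = target
        self._client = None

    def __set_name__(self, owner, name):
        # Effect 1: add the target to the owner's dependency list.
        owner.dependencies = [*getattr(owner, "dependencies", []), self.target]

    def __get__(self, instance, owner=None):
        # Effect 2: lazily create a client for the target's endpoints.
        if self._client is None:
            self._client = _LocalClient(self.target)
        return self._client


class ServiceB:
    async def preprocess(self, text):
        return text.strip().lower()


class ServiceA:
    service_b = toy_depends(ServiceB)

    async def generate(self, text):
        # Effect 3: the remote endpoint reads like a local async call.
        return await self.service_b.preprocess(text)


result = asyncio.run(ServiceA().generate("  Hello "))
```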
```{note}
Through the SDK, we also provide you with a way to access the underlying bindings if you need. Sometimes you might want to write complicated logic that causes you to directly create a client to another Service without depending on it. To do so:
```
```python
import VllmWorker

# this runtime object gives you access to the underlying python bindings
runtime = dynamo_context["runtime"]
comp_ns, comp_name = VllmWorker.dynamo_address()  # dynamo://{namespace}/{name}
print(f"[Processor] comp_ns: {comp_ns}, comp_name: {comp_name}")
self.worker_client = (
    await runtime.namespace(comp_ns)
    .component(comp_name)
    .endpoint("generate")
    .client()
)
```
This is used in some of our prebuilt examples and is a powerful way to leverage the benefits of the SDK while being able to access Dynamo's core primitives.
#### Lifecycle Hooks
Dynamo supports key lifecycle hooks to manage service initialization and cleanup.
##### `@async_on_start`
The `@async_on_start` hook is called when the service is started and is used to run an async process outside of the main `__init__` function.
```python
@async_on_start
async def async_init(self):
    # Perfect for operations that need to be awaited
    self.db = await connect_to_db()
    self.tokenizer = await load_tokenizer()
    self.engine = await initialize_engine(self.model)
```
This is especially useful for:
- Initializing external connections
- Setting up runtime resources that require async operations
##### `@on_shutdown`
The `@on_shutdown` hook is called when the service is shut down and handles cleanup.
```python
@on_shutdown
def shutdown(self):
    # Gracefully handle shutdown / cleanup
    logger.info("worker shutting down")
```
This ensures resources are properly released, preventing memory leaks and making sure external connections are properly closed. This is helpful to clean up vLLM engines that have been started outside of the main process.
### Configuring a Service
Dynamo SDK provides a flexible configuration system that allows you to define service parameters through multiple methods:
1. Directly in the `@service` decorator
2. Through YAML configuration files
3. Via command-line arguments
4. Using environment variables
These methods can be used together with clear precedence rules, giving you fine-grained control over service configuration across different environments.
#### Configuration via Service Decorator
The most basic method is to specify parameters directly in the service decorator:
```python
@service(
    dynamo={"namespace": "prod"},
    resources={"gpu": 2, "cpu": "4", "memory": "16Gi"},
    workers=2,
)
class MyService:
    def __init__(self, model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", temperature=0.7):
        self.model_name = model_name
        self.temperature = temperature
```
This defines static configuration values in code. Note that the constructor parameters (`model_name` and `temperature`) are also configurable values that can be overridden.
#### Configuration via YAML
For more flexible configuration, especially across environments, you can use YAML files:
```yaml
# config.yaml
MyService:
  # Override service decorator settings
  ServiceArgs:
    workers: 4
    resources:
      gpu: 4
  # Service instance parameters
  model_name: "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
  temperature: 0.8
```
The YAML file has a hierarchical structure:
- Top level keys are service class names
- `ServiceArgs` contains parameters for the service decorator
- Other keys are passed as arguments to the service constructor
- Additional keys specific to the service can be accessed via the config system
#### Loading YAML Configuration
Use the CLI to load configuration from a YAML file:
```bash
dynamo serve service:MyService -f config.yaml
```
The configuration is parsed and stored in the `DYNAMO_SERVICE_CONFIG` environment variable, which is then passed to the service workers.
#### Configuration Precedence
When multiple configuration sources are used, they follow this precedence order (highest to lowest):
1. Command-line arguments
2. YAML configuration
3. Service decorator defaults
4. Constructor defaults
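One way to picture this precedence order is as a layered lookup in which the first source that defines a key wins. The sketch below uses `collections.ChainMap` from the standard library to model that; it is an illustration with made-up values, not the SDK's actual merge logic.

```python
from collections import ChainMap

# Illustrative values for each source, lowest precedence last.
constructor_defaults = {"model_name": "gpt-2", "temperature": 0.7, "max_tokens": 1024}
decorator_defaults = {"workers": 1}
yaml_config = {"workers": 8, "temperature": 0.8}
cli_args = {"temperature": 0.9}

# ChainMap searches its maps left to right, so the highest-precedence source goes first.
effective = dict(ChainMap(cli_args, yaml_config, decorator_defaults, constructor_defaults))
```

Here `temperature` resolves to the command-line value, `workers` to the YAML value, and the remaining keys fall through to the constructor defaults.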
#### Accessing Configuration in Services
Inside a service, you can access configuration using the `ServiceConfig` class:
```python
from dynamo.sdk.lib.config import ServiceConfig

class MyService:
    def __init__(self):
        config = ServiceConfig.get_instance()
        # Get with default value
        self.model_name = config.get("MyService", {}).get("model_name", "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
        self.temperature = config.get("MyService", {}).get("temperature", 0.7)
        # Require a config value (raises error if missing)
        self.api_key = config.require("MyService", "api_key")
        # Get all config for this service
        all_my_config = config.get("MyService", {})
```
#### Parsing Configuration as CLI Arguments
For services that need to extract their configuration as command-line arguments (common when integrating and validating with external libraries), the SDK provides a helper method:
```python
from dynamo.sdk.lib.config import ServiceConfig

def setup_my_lib():
    config = ServiceConfig.get_instance()
    # Get all MyService config with prefix "lib_" as CLI args
    cli_args = config.as_args("MyService", prefix="lib_")
    # Returns: ["--option1", "value1", "--flag2", "--option3", "value3"]
    # Pass to an external library's argument parser
    lib_parser = MyLibArgumentParser()
    lib_args = lib_parser.parse_args(cli_args)
    return lib_args
```
This pattern is used in the example vLLM integration:
```python
def parse_vllm_args(service_name, prefix) -> AsyncEngineArgs:
    config = ServiceConfig.get_instance()
    vllm_args = config.as_args(service_name, prefix=prefix)
    parser = FlexibleArgumentParser()
    # Add custom arguments
    parser.add_argument("--router", type=str, choices=["random", "round-robin", "kv"], default="random")
    parser.add_argument("--remote-prefill", action="store_true", default=False)
    # Add VLLM's arguments (ServiceConfig handles True defaults automatically)
    parser = AsyncEngineArgs.add_cli_args(parser)
    # Parse both custom and VLLM arguments
    args = parser.parse_args(vllm_args)
    # Convert to engine arguments
    engine_args = AsyncEngineArgs.from_cli_args(args)
    # Add custom args to the engine args
    engine_args.router = args.router
    engine_args.remote_prefill = args.remote_prefill
    return engine_args
```
#### Boolean Argument Handling
ServiceConfig uses a targeted approach for boolean arguments to maintain compatibility with different argument parsers:
1. Standard Boolean Handling:
- `true` → outputs just the flag (e.g., `--enable-feature`)
- `false` → omitted entirely (uses parser's default)
2. vLLM True-Default Arguments (targeted override support):
- Automatically detects vLLM arguments that default to `True` and need explicit `false` handling
- `enable-prefix-caching: false` → `--no-enable-prefix-caching` (negative flag)
- `multi-step-stream-outputs: false` → `--no-multi-step-stream-outputs` (negative flag)
```yaml
# Example YAML configuration
VllmWorker:
  # Standard boolean flags (action="store_true" style)
  enforce-eager: true  # → --enforce-eager
  disable-logging: false  # → (omitted)
  # vLLM arguments with True defaults (automatically handled)
  enable-prefix-caching: false  # → --no-enable-prefix-caching
  # Non-boolean arguments
  max-model-len: 16384  # → --max-model-len 16384
```
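The boolean rules above can be expressed as a small conversion function. This is a hypothetical re-implementation written to illustrate the rules, not the code behind `ServiceConfig.as_args`; in particular, the hard-coded `TRUE_DEFAULT_FLAGS` set is an assumption, whereas the text says the SDK detects such arguments automatically.

```python
# Assumed set of flags whose parser default is True (hard-coded for this sketch).
TRUE_DEFAULT_FLAGS = {"enable-prefix-caching", "multi-step-stream-outputs"}


def to_cli_args(options):
    """Convert a flat option mapping into a CLI argument list."""
    args = []
    for key, value in options.items():
        if isinstance(value, bool):
            if value:
                args.append(f"--{key}")  # true: emit just the flag
            elif key in TRUE_DEFAULT_FLAGS:
                args.append(f"--no-{key}")  # false on a True-default flag: negative flag
            # Any other false value is omitted so the parser default applies.
        else:
            args.extend([f"--{key}", str(value)])
    return args


cli = to_cli_args(
    {
        "enforce-eager": True,
        "disable-logging": False,
        "enable-prefix-caching": False,
        "max-model-len": 16384,
    }
)
```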
#### Overriding Service Decorator with ServiceArgs
The `ServiceArgs` section in YAML configuration allows you to override any parameter in the `@service` decorator:
```yaml
MyService:
  ServiceArgs:
    dynamo:
      namespace: "staging"  # Override namespace
    resources:
      gpu: 4  # Use more GPUs
    workers: 8  # Scale up workers
```
This is particularly useful for:
- Changing resource allocations between environments
- Modifying worker counts based on expected load
- Switching between namespaces for different deployments
Under the hood, the `DynamoService` class reads these arguments during initialization:
```python
def _get_service_args(self, service_name: str) -> Optional[dict]:
    """Get ServiceArgs from environment config if specified"""
    config_str = os.environ.get("DYNAMO_SERVICE_CONFIG")
    if config_str:
        config = json.loads(config_str)
        service_config = config.get(service_name, {})
        return service_config.get("ServiceArgs")
    return None
```
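As a usage sketch, the round trip through the environment variable might look like the following. The standalone function mirrors the method above so that the example runs on its own; the service names and values are illustrative.

```python
import json
import os


def get_service_args(service_name):
    """Standalone copy of the lookup above, for illustration."""
    config_str = os.environ.get("DYNAMO_SERVICE_CONFIG")
    if not config_str:
        return None
    return json.loads(config_str).get(service_name, {}).get("ServiceArgs")


# Simulate what the CLI would place in the environment (illustrative values).
os.environ["DYNAMO_SERVICE_CONFIG"] = json.dumps(
    {
        "MyService": {
            "ServiceArgs": {"workers": 8, "resources": {"gpu": 4}},
            "temperature": 0.8,
        }
    }
)

my_args = get_service_args("MyService")
other_args = get_service_args("OtherService")  # no entry for this service
```

Only the `ServiceArgs` sub-dictionary is returned; constructor parameters such as `temperature` are left for the service itself to read.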
#### Complete Configuration Example
Here's a comprehensive example showing how all these pieces fit together:
1. First, define your service with basic defaults:
```python
@service(
    dynamo={"namespace": "default"},
    resources={"gpu": 1},
    workers=1,
)
class LLMService:
    def __init__(self, model_name="gpt-2", temperature=0.7, max_tokens=1024):
        self.model_name = model_name
        self.temperature = temperature
        self.max_tokens = max_tokens
        # Get additional configuration
        config = ServiceConfig.get_instance()
        service_config = config.get("LLMService", {})
        # Extract service-specific parameters
        self.cache_size = service_config.get("cache_size", 1000)
        self.use_kv_cache = service_config.get("use_kv_cache", True)
```
2. Create a YAML configuration for production:
```yaml
# prod_config.yaml
LLMService:
  ServiceArgs:
    dynamo:
      namespace: "prod"
    resources:
      gpu: 4
      memory: "64Gi"
    workers: 8
  # Constructor parameters
  model_name: "llama-3-70b-instruct"
  temperature: 0.8
  max_tokens: 2048
  # Service-specific parameters
  cache_size: 10000
  use_kv_cache: true
```
3. Deploy with mixed configuration:
```bash
dynamo serve service:LLMService -f prod_config.yaml --LLMService.temperature=0.9
```
The service receives the combined configuration with the command-line value taking precedence, resulting in an effective configuration of:
- `dynamo.namespace = "prod"`
- `resources.gpu = 4`
- `workers = 8`
- `model_name = "llama-3-70b-instruct"`
- `temperature = 0.9` (from CLI override)
- `max_tokens = 2048`
- `cache_size = 10000`
- `use_kv_cache = true`
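A dotted override such as `--LLMService.temperature=0.9` can be thought of as a path into the nested configuration. The sketch below applies one such override to a dictionary loaded from YAML. `apply_override` is a hypothetical helper, and parsing the value as JSON is an assumption made for this example; the real CLI may parse values differently.

```python
import copy
import json


def apply_override(config, override):
    """Return a copy of `config` with one `--Service.key=value` override applied."""
    dotted, _, raw = override.lstrip("-").partition("=")
    *parents, leaf = dotted.split(".")
    try:
        value = json.loads(raw)  # numbers, booleans, quoted strings
    except json.JSONDecodeError:
        value = raw  # fall back to the raw string
    result = copy.deepcopy(config)
    node = result
    for name in parents:
        node = node.setdefault(name, {})
    node[leaf] = value
    return result


yaml_config = {"LLMService": {"temperature": 0.8, "max_tokens": 2048}}
merged = apply_override(yaml_config, "--LLMService.temperature=0.9")
```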
#### Service Configuration Best Practices
1. **Use the Service Decorator for Defaults**: Put reasonable defaults in the service decorator
2. **Use Constructor Parameters for Runtime Options**: Parameters that might change between deployments
3. **Use YAML for Environment Configuration**: Separate configuration by environment (dev/staging/prod)
4. **Use CLI for Quick Testing**: Override specific values for experimentation
5. **Document Configuration Keys**: Make sure to document all available configuration options
Following these practices helps create flexible and maintainable Dynamo services that can be easily configured for different environments and use cases.
### Deploying a Single Service
You can deploy a single service for local development even if you have a dependency graph defined using `depends()` using `dynamo serve --service-name <ClassName> <entrypoint> <configuration arguments>`. You can see an example of this in our multinode documentation [here](../examples/multinode.md).
### Composing Services into a Graph
There are two main ways to compose services in Dynamo:
1. Use `depends()` (Recommended)
The `depends()` approach is the recommended way for production deployments:
- Automatically deploys all dependencies
- Creates a static inference graph at deployment time
- Provides type hints and better IDE support
2. Use `.link()` (Experimental)
Our `.link()` syntax is a flexible and experimental way to compose various services. Linking allows you to compose checks at runtime and view behavior. Under the hood, we are editing the dependency graph between various services. This is useful for experimentation and development, but we suggest writing a static graph for your final production deployment.
#### Understanding the `.link()` syntax
Let's take the example of a `Processor` component. This component can currently do 2 things:
1. Process a request and send it to a `Router` to decide what worker to send it to.
2. Process a request and send it to a `Worker` directly.
Consider this snippet of the Processor:
```python
class Processor(ProcessMixIn):
    """
    vLLM pre and post processing
    """

    worker = depends(VllmWorker)
    router = depends(Router)
    # logic for processing a request based on router or worker
```
Think of all the `depends()` statements as the maximal set of edges for the processor. At runtime, you may want to follow only a single path. By default, our processor spins up both the VllmWorker and Router as separate services (because `depends()` is defined for both). However, if you want to spin up only the Router, you can link the Router to the Processor, which removes the `worker` dependency from the Processor.
```python
Processor.link(Router)
```
This removes the `worker` dependency from the Processor and spins up only the Router.
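The pruning effect of `.link()` can be modeled with a toy graph. `ToyService` below is a hypothetical class written for this illustration, not the SDK's service type: it stores the declared dependencies as edges, and `link()` keeps only the chosen one.

```python
class ToyService:
    """Toy graph node with declared dependencies; not the SDK's service class."""

    def __init__(self, name, depends_on=()):
        self.name = name
        self.edges = list(depends_on)

    def link(self, target):
        # Keep only the chosen edge and drop the other declared dependencies.
        if target not in self.edges:
            raise ValueError(f"{self.name} does not declare a dependency on {target.name}")
        self.edges = [target]

    def reachable(self):
        """Names of all services that would be spun up, starting from this one."""
        seen, stack = [], [self]
        while stack:
            node = stack.pop()
            if node not in seen:
                seen.append(node)
                stack.extend(node.edges)
        return sorted(node.name for node in seen)


worker = ToyService("VllmWorker")
router = ToyService("Router")
processor = ToyService("Processor", depends_on=[worker, router])

before = processor.reachable()  # maximal set of edges
processor.link(router)
after = processor.reachable()  # only the linked path remains
```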
......@@ -2,18 +2,6 @@
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance
......@@ -117,78 +105,3 @@ The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows r
- Add prefill worker: no explicit action needed.
- Delete prefill worker: flush engine.
### How this works under the hood
#### Auto-Discovery for new workers
In Dynamo, we use `etcd` (a distributed key-value pair store) as a way to register and discover new components. When a new decode/aggregated worker starts, it adds its endpoint information to `etcd` allowing the router to discover it and route requests to it. For the KV-cache transfer process, newly added decode workers put memory descriptors of their KV cache (used in NIXL transfer) in `etcd`. Newly added prefill workers also register with `etcd` for discovery and simply start pulling prefill requests from the global prefill queue after they spin up. Prefill workers lazy-pull the descriptors when they start serving a remote prefill request for the first time.
You can watch this happen live by running the following:
```bash
# in terminal 1 - run the disaggregated serving example
dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
```
```bash
# in terminal 2 - watch the namespace in etcd
watch -cd etcdctl get --prefix <namespace>
```
You should see something like this show up as the disaggregated serving example starts up:
```bash
# worker information
dynamo/components/PrefillWorker/mock:694d967da694ea1e
{
"component": "PrefillWorker",
"endpoint": "mock",
"namespace": "dynamo",
"lease_id": 7587886413599009310,
"transport": {
"nats_tcp": "dynamo_prefillworker_0d6df828.mock-694d967da694ea1e"
}
}
dynamo/components/Processor/chat/completions:694d967da694ea16
{
"component": "Processor",
"endpoint": "chat/completions",
"namespace": "dynamo",
"lease_id": 7587886413599009302,
"transport": {
"nats_tcp": "dynamo_processor_3816642d.chat/completions-694d967da694ea16"
}
}
dynamo/components/VllmWorker/generate:694d967da694ea1a
{
"component": "VllmWorker",
"endpoint": "generate",
"namespace": "dynamo",
"lease_id": 7587886413599009306,
"transport": {
"nats_tcp": "dynamo_vllmworker_3f6fafd3.generate-694d967da694ea1a"
}
}
dynamo/components/VllmWorker/load_metrics:694d967da694ea1a
{
"component": "VllmWorker",
"endpoint": "load_metrics",
"namespace": "dynamo",
"lease_id": 7587886413599009306,
"transport": {
"nats_tcp": "dynamo_vllmworker_3f6fafd3.load_metrics-694d967da694ea1a"
}
}
# nixl metadata
dynamo/nixl_metadata/e318db87-be55-4c18-9829-8036e1e603e2
```
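The keys in the sample output appear to follow the pattern `<namespace>/components/<component>/<endpoint>:<suffix>`, where the suffix matches the `lease_id` written in hexadecimal. The sketch below parses keys under that assumption; the pattern is inferred from the sample output only, and `EndpointKey` and `parse_endpoint_key` are hypothetical helper names.

```python
from typing import NamedTuple


class EndpointKey(NamedTuple):
    namespace: str
    component: str
    endpoint: str
    lease_id: int


def parse_endpoint_key(key: str) -> EndpointKey:
    # The lease suffix follows the last ':'. Endpoint names such as
    # "chat/completions" may themselves contain '/'.
    path, _, lease_hex = key.rpartition(":")
    namespace, marker, component, endpoint = path.split("/", 3)
    if marker != "components":
        raise ValueError(f"not a component endpoint key: {key!r}")
    return EndpointKey(namespace, component, endpoint, int(lease_hex, 16))


key = parse_endpoint_key(
    "dynamo/components/Processor/chat/completions:694d967da694ea16"
)
print(key)
```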
#### Graceful worker shutdown
Since worker information is stored in etcd, we can shut down workers by simply revoking their etcd leases. After a lease is revoked:
- Decode/aggregated worker endpoints are immediately removed from etcd so that they do not accept new requests. They finish any in-flight requests, shut down their engine, and exit gracefully.
- Prefill workers stop pulling from the prefill queue and exit gracefully after all pending remote KV cache writes finish.
You can also observe this by revoking a worker's etcd lease while it has ongoing requests. Refer to this example script that does this: [revoke_lease.py](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/revoke_lease.py).
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# KV Cache Routing
This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.
To enable KV cache aware routing start the frontend node like this:
```
python -m dynamo.frontend --router-mode kv
```
The engine announces when a KV block is created or removed. The Dynamo router finds the worker with the best match for those KV blocks and directs the traffic to that node.
>[!NOTE]
>This information is temporary and will change soon.
For performance testing, compare a typical workload with `--router-mode random|round-robin` to see if it can benefit from KV-aware routing.
The KV-aware routing arguments:
- `--kv-overlap-score-weight`: Sets the amount of weighting on overlaps with prefix caches, which directly contributes to the prefill cost. A large weight is expected to yield a better TTFT (at the expense of worse ITL). When set to 0, prefix caches are not considered at all (falling back to pure load balancing behavior on the active blocks).
- `--router-temperature`: Sets the temperature when randomly selecting workers to route to via softmax sampling on the router cost logits. Setting it to 0 recovers the deterministic behavior where the min logit is picked.
- `--use-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the KV cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the KV cache hit ratio in each engine. Set false if your backend engine does not emit KV events.
## Architecture
Dynamo's architecture consists of three key concepts:
- **Namespace**: Groups related components (similar to directories in a file system). In our examples, we use the label `dynamo`. This avoids collisions between two different dynamo graphs.
- **Component**: The deployable unit in Dynamo. Components are self-contained and typically map to separate Docker containers. In our examples, we use labels like `VllmWorker`, `Router`, `Processor` for the components. Components can be created in Python or Rust.
- **Endpoint**: Functions attached to components that transform inputs into outputs. Endpoints are discoverable and callable by other components. In our examples we use the label `generate` for most of the endpoints.
A Dynamo graph is a collection of components that are linked together to form a graph. There are two paths through the graph: the request path and the response path. For LLMs the request path is single-in (a single message) and the response path is many-out (streamed output).
A common pattern is to spin up multiple instances of the same component that serve the same endpoints, for example, when you want to duplicate models to serve more requests. Each endpoint will get a unique identifier and you will have to tell Dynamo how to route requests between these endpoints.
Colloquially, we refer to a Dynamo component that serves an endpoint for LLM inference as a **worker**.
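To make the `--router-temperature` argument concrete, here is a minimal sketch of temperature-controlled softmax sampling over per-worker cost logits. It assumes that a lower cost is better, so the costs are negated before the softmax; that sign convention and the `pick_worker` helper are assumptions for illustration, not the router's actual code.

```python
import math
import random


def pick_worker(cost_logits, temperature, rng=random):
    """Pick a worker id from {worker_id: cost}; lower cost is assumed better."""
    if temperature == 0:
        # Deterministic: choose the minimum logit.
        return min(cost_logits, key=cost_logits.get)
    # Softmax over negated costs so cheaper workers get higher probability.
    scaled = {w: -c / temperature for w, c in cost_logits.items()}
    peak = max(scaled.values())  # subtract the max for numerical stability
    weights = {w: math.exp(s - peak) for w, s in scaled.items()}
    workers = list(weights)
    return rng.choices(workers, weights=[weights[w] for w in workers], k=1)[0]


costs = {123456789: 5.0, 987654321: 1.0, 543219876: 9.0}
print(pick_worker(costs, temperature=0))    # always the cheapest worker
print(pick_worker(costs, temperature=1.0))  # usually, but not always, the cheapest
```

Higher temperatures flatten the distribution and spread traffic more evenly, while temperatures close to 0 concentrate traffic on the cheapest worker.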
......@@ -150,32 +144,6 @@ The KVIndexer builds and maintains a global view of cached blocks in a prefix tr
The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks.
Example:
```python
from dynamo.llm import KvIndexer
from dynamo.sdk import dynamo_context
runtime = dynamo_context["runtime"]
kv_listener = runtime.namespace("dynamo").component("VllmWorker")
await kv_listener.create_service()
indexer = KvIndexer(kv_listener, block_size=16)
indexer.find_matches_for_request([INPUT SEQUENCE OF TOKEN IDs])
```
Sample Output:
```
{
123456789: 10,
987654321: 3,
543219876: 7,
}
```
```{note}
This example is designed to help you understand KV cache routing; it won't run outside of the context of dynamo serve. See the examples/ directory for runnable examples.
```
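For intuition about what `find_matches_for_request` returns, the following self-contained toy counts, per worker, how many leading blocks of a request are already cached. It is a simplified stand-in that uses chained hashes and plain sets rather than the prefix tree the KVIndexer maintains; `block_hashes` and `find_matches` are hypothetical names and do not exist in `dynamo.llm`.

```python
from hashlib import sha256


def block_hashes(token_ids, block_size=16):
    """Hash each full block together with everything before it (prefix-aware)."""
    hashes, running = [], sha256()
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        running.update(",".join(map(str, block)).encode() + b";")
        hashes.append(running.copy().hexdigest())
    return hashes


def find_matches(token_ids, cached_by_worker, block_size=16):
    """Return {worker_id: number of leading blocks cached on that worker}."""
    request_hashes = block_hashes(token_ids, block_size)
    scores = {}
    for worker_id, cached in cached_by_worker.items():
        matched = 0
        for block_hash in request_hashes:
            if block_hash not in cached:
                break
            matched += 1
        if matched:
            scores[worker_id] = matched
    return scores


prompt = list(range(48))  # three full blocks of 16 tokens
cached_by_worker = {
    123456789: set(block_hashes(prompt)[:2]),                     # first two blocks
    987654321: set(block_hashes(list(range(16)) + [0] * 32)),     # shares one block
    543219876: set(),                                             # nothing cached
}
print(find_matches(prompt, cached_by_worker))
```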
### WorkerMetricsPublisher
We added a KvMetrics Publisher which sends the following metrics to the KvMetricsAggregator:
- num_requests_waiting
......@@ -191,48 +159,3 @@ Currently, the WorkerMetricsPublisher exists as a Python binding.
### KvMetricsAggregator
The KvMetricsAggregator receives these metrics and aggregates them. It has a method `get_metrics` which returns an object of `AggregatedMetrics`.
Example:
```python
from dynamo.llm import KvMetricsAggregator
from dynamo.sdk import dynamo_context
runtime = dynamo_context["runtime"]
kv_listener = runtime.namespace("dynamo").component("VllmWorker")
await kv_listener.create_service()
metrics_aggregator = KvMetricsAggregator(kv_listener)
for endpoint in metrics_aggregator.get_metrics().endpoints:
print("Worker ID: ", endpoint.worker_id)
print("GPU Cache Usage: ", endpoint.gpu_cache_usage_perc)
print("Number of Requests Waiting: ", endpoint.num_requests_waiting)
print("GPU Prefix Cache Hit Rate: ", endpoint.gpu_prefix_cache_hit_rate)
print("***")
```
Sample Output:
```
Worker ID: 123456789
GPU Cache Usage: 0.5
Number of Requests Waiting: 2
GPU Prefix Cache Hit Rate: 0.1
***
Worker ID: 987654321
GPU Cache Usage: 0.5
Number of Requests Waiting: 1
GPU Prefix Cache Hit Rate: 0.1
***
```
```{note}
This example is for building understanding; it will not run outside of the context of dynamo serve. See the examples/ folder for runnable examples.
```
### [KV Router](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/components/kv_router.py)
The Router component makes intelligent worker selection decisions:
1. Receives incoming requests as tokens
2. Queries the KVIndexer to find potential cache hits across workers
3. Collects performance metrics from workers (via KvMetricsAggregator)
4. Uses a cost function to determine the optimal worker for each request
5. Returns the chosen worker
The processor tokenizes the request, sends it to the KV Router, and, once it receives a response, directs the request to the selected worker using `direct()` routing.
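The selection steps above can be sketched with a simple cost function. The formula below (weighted blocks left to prefill plus active blocks) is an assumption chosen for illustration, as are the names `worker_costs` and `choose_worker`; the actual cost function lives in the linked `kv_router.py`.

```python
def worker_costs(request_blocks, overlaps, active_blocks, overlap_score_weight=1.0):
    """Hypothetical cost: weighted blocks left to prefill plus current load."""
    costs = {}
    for worker_id, active in active_blocks.items():
        # Blocks that still need prefill after reusing this worker's cache.
        prefill_blocks = request_blocks - overlaps.get(worker_id, 0)
        costs[worker_id] = overlap_score_weight * prefill_blocks + active
    return costs


def choose_worker(request_blocks, overlaps, active_blocks, overlap_score_weight=1.0):
    costs = worker_costs(request_blocks, overlaps, active_blocks, overlap_score_weight)
    return min(costs, key=costs.get)


overlaps = {123456789: 8, 987654321: 0}       # matched KV blocks per worker
active_blocks = {123456789: 5, 987654321: 4}  # current load per worker

print(choose_worker(10, overlaps, active_blocks))                            # cache-aware choice
print(choose_worker(10, overlaps, active_blocks, overlap_score_weight=0.0))  # load only
```

With a weight of 0 the overlap term vanishes and the choice depends only on load, which mirrors the fallback described for `--kv-overlap-score-weight`.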
......@@ -19,8 +19,6 @@ If you are a **🧑‍💻 Dynamo Contributor** first follow the instructions in
You would have to rebuild the dynamo platform images as the code evolves. For more details please look at the [Cloud Guide](../guides/dynamo_deploy/dynamo_cloud.md)
Export the [Dynamo Base Image](../get_started.md#building-the-dynamo-base-image) you want to use (or built during the prerequisites step) as the `DYNAMO_IMAGE` environment variable.
```bash
export DYNAMO_IMAGE=<your-registry>/<your-image-name>:<your-tag>
```
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Hello World: Aggregated and Disaggregated Deployment Examples
The `example` directory contains a [hello world example](../examples/hello_world.md) that implements a simplified disaggregated serving architecture used for deploying Large Language Models (LLMs). It removes the LLM related inference code and focuses on how Dynamo handles routing, task queue, and metadata communication between prefill and decode workers.
## Components
- frontend: A simple HTTP server that handles incoming requests
- processor: A pre/post processing server that invokes the router
- router: Handles API requests and routes them to appropriate workers based on the specified strategy
- worker: A dummy decode worker
- prefill-worker: A dummy prefill worker
## Deployment Architectures
This figure shows an overview of the major components to deploy:
```
+----------------+
| prefill worker |-------+
| | |
+----------------+ | pull
v
+------+ +-----------+ +------------------+ push +---------------+
| HTTP |----->| processor |----->| decode/monolith |------------>| prefill queue |
| |<-----| |<-----| worker | | |
+------+ +-----------+ +------------------+ +---------------+
| ^
query best | | return
worker | | worker_id
| | +------------------+
| +---------| router |
+------------->| |
+------------------+
```
## The Aggregated Deployment
This example uses 2 nodes to demo aggregated serving.
- Node 1
- Runs NATS and etcd services
- Deploys Frontend, Processor and Router
- Deploys DummyWorker as the monolith worker
- Node 2
- Deploys DummyWorker as the monolith worker
### Prerequisites
On Node 1, start required services (etcd and NATS) using [Docker Compose](https://github.com/ai-dynamo/dynamo/blob/main/deploy/docker-compose.yml)
```bash
docker compose -f deploy/docker-compose.yml up -d
```
### Run the Deployment
1. Set environment variables for NATS and etcd services
```bash
export NATS_SERVER="nats://Node_1_IP_ADDRESS:4222"
export ETCD_ENDPOINTS="http://Node_1_IP_ADDRESS:2379"
```
2. Launch Frontend, Processor and Router services:
```
cd dynamo/examples/hello_world/disagg_skeleton
dynamo serve components.graph:Frontend
```
3. Open a new terminal on Node 1 and deploy Worker service
```
export NATS_SERVER="nats://Node_1_IP_ADDRESS:4222"
export ETCD_ENDPOINTS="http://Node_1_IP_ADDRESS:2379"
cd dynamo/examples/hello_world/disagg_skeleton
dynamo serve components.worker:DummyWorker
```
4. Go to Node 2 and start the Worker service as in step 3.
You should now see that both workers are ready in Node 1's terminal.
5. Query the Frontend with the following two prompts. The router assigns a different worker to each prompt, which you can observe in the responses.
- `Response: {"worker_output":"Tell me a joke_GeneratedBy_NODE1HOSTNAME","request_id":"id_number"}`
- `Response: {"worker_output":"Which team won 2020 World Series_GeneratedBy_NODE2HOSTNAME","request_id":"id_number"}`
```
curl -X 'POST' \
'http://localhost:8000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Tell me a joke",
"request_id":"id_number"
}'
curl -X 'POST' \
'http://localhost:8000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Which team won 2020 World Series",
"request_id":"id_number"
}'
```
6. Then modify the prompt; prompts with similar prefixes are routed to the same worker due to the routing algorithm used in this demo. For example, the following query is routed to the worker that processed the `Tell me a joke` prompt.
```
curl -X 'POST' \
'http://localhost:8000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Tell me a fact",
"request_id":"id_number"
}'
```
- `Response: {"worker_output":"Tell me a fact_GeneratedBy_NODE1HOSTNAME","request_id":"id_number"}`
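The same queries can be issued from Python instead of `curl`. This is a minimal sketch using only the standard library; it assumes the Frontend is reachable at `http://localhost:8000` as in the commands above, and `build_generate_request` and `stream_response` are hypothetical helper names.

```python
import json
import urllib.request


def build_generate_request(prompt, request_id, base_url="http://localhost:8000"):
    """Build the POST request that the curl commands above send."""
    body = json.dumps({"prompt": prompt, "request_id": request_id}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/generate",
        data=body,
        headers={"accept": "text/event-stream", "Content-Type": "application/json"},
        method="POST",
    )


def stream_response(request):
    """Send the request and print each line of the streamed response."""
    with urllib.request.urlopen(request) as response:
        for raw_line in response:
            print(raw_line.decode("utf-8").rstrip())
```

Calling `stream_response(build_generate_request("Tell me a fact", "id_number"))` requires the deployment from the previous steps to be running.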
## The Disaggregated Deployment
This example uses 3 nodes to demo disaggregated serving.
- Node 1
- Runs NATS and etcd services
- Deploys Frontend and Processor
- Deploys DummyWorker as the decode worker
- Node 2
- Deploys DummyWorker as the decode worker
- Node 3
- Deploys Prefill as the prefill worker
### Run the Deployment
1. Repeat steps 1 to 4 of the aggregated deployment to deploy the Frontend, Processor, Router and 2 Workers as decode workers.
2. Go to Node 3 and start the prefill worker.
```
export NATS_SERVER="nats://Node_1_IP_ADDRESS:4222"
export ETCD_ENDPOINTS="http://Node_1_IP_ADDRESS:2379"
cd dynamo/examples/hello_world/disagg_skeleton
dynamo serve components.prefill_worker:PrefillWorker
```
3. Query the Frontend. This time the decode workers push requests to the prefill queue, and the prefill worker pulls tasks from the queue to simulate the prefill step. The actual prefill is skipped in this demo.
```
curl -X 'POST' \
'http://localhost:8000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "This is prefill disagg serving example",
"request_id":"12345"
}'
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# LLM Deployment Examples
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.
## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## Components
- workers: Prefill and decode workers handle the actual LLM inference
- router: Handles API requests and routes them to appropriate workers based on the specified strategy
- frontend: OpenAI-compatible HTTP server that handles incoming requests
## Deployment Architectures
### Aggregated
Single-instance deployment where both prefill and decode are done by the same worker.
### Disaggregated
Distributed deployment where prefill and decode are done by separate workers that can scale independently.
```mermaid
sequenceDiagram
participant D as VllmWorker
participant Q as PrefillQueue
participant P as PrefillWorker
Note over D: Request is routed to decode
D->>D: Decide if prefill should be done locally or remotely
D->>D: Allocate KV blocks
D->>Q: Put RemotePrefillRequest on the queue
P->>Q: Pull request from the queue
P-->>D: Read cached KVs from Decode
D->>D: Decode other requests
P->>P: Run prefill
P-->>D: Write prefilled KVs into allocated blocks
P->>D: Send completion notification
Note over D: Notification received when prefill is done
D->>D: Schedule decoding
```
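The sequence above can be mimicked with a small in-process simulation. This is only a sketch of the message flow: a `queue.Queue` stands in for the prefill queue, a plain dictionary write stands in for the KV transfer, and the `DecodeWorker` class and `prefill_worker_step` function are hypothetical.

```python
import queue
from dataclasses import dataclass, field


@dataclass
class RemotePrefillRequest:
    request_id: str
    prompt_tokens: list
    block_ids: list  # KV blocks the decode worker allocated for this request


@dataclass
class DecodeWorker:
    kv_blocks: dict = field(default_factory=dict)  # block id -> contents
    ready: list = field(default_factory=list)      # requests ready for decoding
    next_block: int = 0

    def submit(self, request_id, prompt_tokens, prefill_queue, block_size=4):
        n_blocks = -(-len(prompt_tokens) // block_size)  # ceiling division
        block_ids = list(range(self.next_block, self.next_block + n_blocks))
        self.next_block += n_blocks
        for block_id in block_ids:
            self.kv_blocks[block_id] = None  # allocated but not yet filled
        prefill_queue.put(RemotePrefillRequest(request_id, prompt_tokens, block_ids))

    def notify_prefill_done(self, request_id):
        self.ready.append(request_id)  # decoding can now be scheduled


def prefill_worker_step(prefill_queue, decode_worker, block_size=4):
    """Pull one request, 'prefill' it, and write results into the decode blocks."""
    req = prefill_queue.get()
    for i, block_id in enumerate(req.block_ids):
        chunk = req.prompt_tokens[i * block_size:(i + 1) * block_size]
        decode_worker.kv_blocks[block_id] = f"kv({chunk})"  # stand-in for the KV write
    decode_worker.notify_prefill_done(req.request_id)


prefill_queue = queue.Queue()
decode = DecodeWorker()
decode.submit("req-1", list(range(10)), prefill_queue)
prefill_worker_step(prefill_queue, decode)
print(decode.ready)
```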
## Getting Started
1. Choose a deployment architecture based on your requirements
2. Configure the components as needed
3. Deploy using the provided scripts
### Prerequisites
Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml)
```bash
docker compose -f deploy/metrics/docker-compose.yml up -d
```
### Build the container image for your platform
```bash
# On an x86 machine
./container/build.sh --framework VLLM
# On an ARM machine (ex: GB200)
./container/build.sh --framework VLLM --platform linux/arm64
```
```{note}
Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to have performance issues and to require extensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
You can tune the number of parallel build jobs for building VLLM from source
on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
For example, on an ARM machine with low system resources:
`./container/build.sh --framework VLLM --platform linux/arm64 --build-arg VLLM_MAX_JOBS=2`
For example, on a GB200 which has very high CPU cores and memory resource:
`./container/build.sh --framework VLLM --platform linux/arm64 --build-arg VLLM_MAX_JOBS=64`
When vLLM has pre-built ARM wheels published, this process can be improved.
```
### Run the container you have built
```
./container/run.sh -it --framework VLLM
```
## Run Deployment
This figure shows an overview of the major components to deploy:
```
+----------------+
+------| prefill worker |-------+
notify | | | |
finished | +----------------+ | pull
v v
+------+ +-----------+ +------------------+ push +---------------+
| HTTP |----->| processor |----->| decode/monolith |------------>| prefill queue |
| |<-----| |<-----| worker | | |
+------+ +-----------+ +------------------+ +---------------+
| ^ |
query best | | return | publish kv events
worker | | worker_id v
| | +------------------+
| +---------| kv-router |
+------------->| |
+------------------+
```
```{note}
The planner component is enabled by default for all deployment architectures but is set to no-op mode. This means the planner observes metrics but doesn't take scaling actions. To enable active scaling, you can add `--Planner.no-operation=false` to your `dynamo serve` command.
For more details, see [Planner Architecture Overview](../architecture/planner_intro.rst).
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Multinode Examples
## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## Single node sized models
You can deploy dynamo on multiple nodes via NATS/ETCD based discovery and communication. Here's an example of deploying disaggregated serving on 3 nodes using `nvidia/Llama-3.1-405B-Instruct-FP8`. Each node must be properly configured with Infiniband and/or RoCE for communication between decode and prefill workers.
##### Disaggregated Deployment with KV Routing
- Node 1: Frontend, Processor, Router, Decode Worker
- Node 2: Prefill Worker
- Node 3: Prefill Worker
Note that this can be easily extended to more nodes. You can also run the Frontend, Processor, and Router on a separate CPU only node if you'd like as long as all nodes have access to the NATS/ETCD endpoints!
**Step 1**: Start NATS/ETCD on your head node. Ensure you have the correct firewall rules to allow communication between the nodes, because the NATS/ETCD endpoints must be accessible by all other nodes.
```bash
# node 1
docker compose -f deploy/metrics/docker-compose.yml up -d
```
**Step 2**: Create the inference graph for this node. Here we use the `agg_router.py` (even though we are doing disaggregated serving) graph because we want the `Frontend`, `Processor`, `Router`, and `VllmWorker` to spin up (we spin up the other decode worker and prefill worker separately on different nodes later).
```python
# graphs/agg_router.py
Frontend.link(Processor).link(Router).link(VllmWorker)
```
**Step 3**: Create a configuration file for this node. We've provided a sample one for you in `configs/multinode-405b.yaml` for the 405B model. Note that we still include the `PrefillWorker` component in the configuration file even though we are not using it on node 1. This is because we can reuse the same configuration file on all nodes and just spin up individual workers on the other ones.
**Step 4**: Start the frontend, processor, router, and VllmWorker on node 1.
```bash
# node 1
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg_router:Frontend -f ./configs/multinode-405b.yaml
```
**Step 5**: Start the first prefill worker on node 2.
Since we only want to start the `PrefillWorker` on node 2, you can simply run just the PrefillWorker component directly with the configuration file from before.
```bash
# node 2
export NATS_SERVER='<your-nats-server-address>' # note this should start with nats://...
export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
cd $DYNAMO_HOME/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f ./configs/multinode-405b.yaml
```
**Step 6**: Start the second prefill worker on node 3.
```bash
# node 3
export NATS_SERVER='<your-nats-server-address>' # note this should start with nats://...
export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
cd $DYNAMO_HOME/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f ./configs/multinode-405b.yaml
```
**Step 7**: [Optional] Start more decode workers on other nodes
This example can be extended to more nodes as well. For example, if you want to spin up another decode worker, you can use
```bash
# node X
export NATS_SERVER='<your-nats-server-address>' # note this should start with nats://...
export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
cd $DYNAMO_HOME/examples/llm
dynamo serve components.worker:VllmWorker -f ./configs/multinode-405b.yaml --service-name VllmWorker
```
Note the use of `--service-name`. This only spins up the worker that you are requesting and ignores any `depends` statements.
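Because every non-head node relies on `NATS_SERVER` and `ETCD_ENDPOINTS` being exported correctly, a quick check before running `dynamo serve` can catch typos early. The helper below is a hypothetical sketch, not part of Dynamo; it only verifies the `nats://` prefix mentioned in the comments above and that the etcd variable is set.

```python
import os


def check_discovery_env(env=os.environ):
    """Return a list of problems with the NATS/etcd settings; empty means OK."""
    problems = []
    nats = env.get("NATS_SERVER", "")
    if not nats.startswith("nats://"):
        problems.append("NATS_SERVER should start with nats://")
    if not env.get("ETCD_ENDPOINTS"):
        problems.append("ETCD_ENDPOINTS is not set")
    return problems


for problem in check_discovery_env():
    print("warning:", problem)
```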
###### Client
In another terminal:
```bash
# this test request has around 200 tokens isl
curl <node1-ip>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "nvidia/Llama-3.1-405B-Instruct-FP8",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": true,
"max_tokens": 300
}'
```
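Since the request sets `"stream": true`, the response arrives as server-sent events. The sketch below reassembles the generated text from such a stream, assuming the frontend emits OpenAI-style chunks (`data: {...}` lines with `choices[].delta.content`, terminated by `data: [DONE]`). The sample lines are invented for illustration, and `collect_stream_text` is a hypothetical helper.

```python
import json


def collect_stream_text(lines):
    """Concatenate the text deltas from an OpenAI-style chat completion stream."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            parts.append(choice.get("delta", {}).get("content") or "")
    return "".join(parts)


# Invented sample stream, shaped like OpenAI chat completion chunks.
sample_stream = [
    'data: {"choices":[{"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hello"}}]}',
    'data: {"choices":[{"delta":{"content":" world"}}]}',
    "data: [DONE]",
]
print(collect_stream_text(sample_stream))
```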
#### Multi-node sized models
Multinode model support is coming soon. You can track progress [here](https://github.com/ai-dynamo/dynamo/issues/513)!
##### Aggregated Deployment
The steps for aggregated deployment of multi-node sized models are similar to
single-node sized models, except that you need to first configure the nodes
to be interconnected according to the framework's multi-node deployment guide.
In the example below, vLLM is used as the framework to serve the `DeepSeek-R1` model
using tensor parallel 16 on two H100x8 nodes.
**Step 1**: On each of the nodes, set up a Ray cluster so that vLLM can access the resources
collectively:
```bash
# head node
ray start --head --port=6379
# example output and keep note of the IP address of the head node
# Local node IP: <head-node-address>
# set vLLM env arg
export VLLM_HOST_IP=<head-node-address>
# other node
ray start --address=<head-node-address>:6379
export VLLM_HOST_IP=<current-node-address>
# verify the accessibility by checking aggregated GPU count shown in ray status
ray status
# Expected/Sample output for 2 nodes:
# ======== Autoscaler status: 2025-04-16 15:35:42.751688 ========
# Node status
# ---------------------------------------------------------------
# Active:
# 1 node_<hash_1>
# 1 node_<hash_2>
# Pending:
# (no pending nodes)
# Recent failures:
# (no failures)
# Resources
# ---------------------------------------------------------------
# Usage:
# XXX CPU
# XXX GPU
# XXX memory
# XXX object_store_memory
# Demands:
# (no resource demands)
```
**Step 2**: On the head node, follow [LLM deployment examples](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/README.md) to
set up the Dynamo deployment for aggregated serving, using the configuration file
`configs/multinode_agg_r1.yaml` for DeepSeek-R1:
```bash
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg:Frontend -f ./configs/multinode_agg_r1.yaml
```
###### Client
In another terminal, you can send the same curl request as described above but
with `"model": "deepseek-ai/DeepSeek-R1"`
```bash
# this test request has around 200 tokens isl
curl <node1-ip>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": true,
"max_tokens": 300
}'
```
##### Disaggregated Deployment
In this example, we deploy two replicas of the model (one prefill worker
and one decode worker). We use 4 H100x8 nodes and group every two of them
into one Ray cluster in the same way as described in aggregated deployment.
However, we run the etcd and NATS servers on only
one node, which we treat as the head node of the whole deployment.
Note that if you are starting the etcd server directly instead of using `docker compose`,
you should add additional arguments so that it is discoverable from other nodes.
```bash
etcd --advertise-client-urls http://<head-node-ip>:2379 --listen-client-urls http://<head-node-ip>:2379,http://127.0.0.1:2379
```
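Before starting workers on the other nodes, it can help to confirm that the head node's services are reachable over the network. The sketch below is an illustrative check, not a Dynamo tool: it only tests that a TCP connection can be opened, using port 2379 from the etcd command above and port 4222 from the `NATS_SERVER` examples in this guide.

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_head_node(head_ip, etcd_port=2379, nats_port=4222):
    """Report whether the etcd and NATS ports on the head node accept connections."""
    return {
        "etcd": is_reachable(head_ip, etcd_port),
        "nats": is_reachable(head_ip, nats_port),
    }
```

A successful connection only shows that something is listening on the port; it does not verify that the service is healthy.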
**Step 1**: On each pair of nodes, set up a Ray cluster as described in
[aggregated deployment](#aggregated-deployment). After that, you should have
two independent Ray clusters, each with access to 16 GPUs.
**Step 2**: Start the deployment by running different flavors of `dynamo serve`
on one node of each Ray cluster, using the configuration file
`configs/mutinode_disagg_r1.yaml`.
For decode, use the command below. This node is the entry point of
the whole deployment; in other words, send requests to the IP address of
this node.
```bash
# if not head node
export NATS_SERVER='nats://<nats-server-ip>:4222'
export ETCD_ENDPOINTS='<etcd-endpoints-ip>:2379'
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg:Frontend -f ./configs/mutinode_disagg_r1.yaml
```
For prefill:
```bash
# if not head node
export NATS_SERVER='nats://<nats-server-ip>:4222'
export ETCD_ENDPOINTS='<etcd-endpoints-ip>:2379'
cd $DYNAMO_HOME/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f ./configs/mutinode_disagg_r1.yaml
```
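The decode and prefill commands differ only in the `dynamo serve` target, and both need the same two environment variables on non-head nodes. As a sketch under that reading, the helpers below build the environment and the argument list for either role. The function names are hypothetical, and the targets and config path are copied from the commands above.

```python
import os


def worker_env(head_ip, base_env=None):
    """Return an environment pointing NATS and etcd at the head node."""
    env = dict(os.environ if base_env is None else base_env)
    env["NATS_SERVER"] = f"nats://{head_ip}:4222"
    env["ETCD_ENDPOINTS"] = f"{head_ip}:2379"
    return env


def serve_command(role, config="./configs/mutinode_disagg_r1.yaml"):
    """Return the `dynamo serve` argument list for the 'decode' or 'prefill' role."""
    targets = {
        "decode": "graphs.agg:Frontend",
        "prefill": "components.prefill_worker:PrefillWorker",
    }
    return ["dynamo", "serve", targets[role], "-f", config]
```

The result can be passed to `subprocess.run(serve_command("prefill"), env=worker_env("<head-node-ip>"), cwd=...)`, with `cwd` set to the `examples/llm` directory under `$DYNAMO_HOME`.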
###### Client
In another terminal, you can send the same curl request as described in
[aggregated deployment](#aggregated-deployment), addressed to the IP of
the decode node.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# LLM Deployment Examples using TensorRT-LLM
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## Deployment Architectures
See [Deployment Architectures](llm_deployment.md#deployment-architectures) to learn about the general idea of the architecture.
Note that this TensorRT-LLM version does not support all the options yet.
```{note}
TensorRT-LLM disaggregation does not support conditional disaggregation yet. You can only configure the deployment to always use aggregated or disaggregated serving.
```
## Getting Started
1. Choose a deployment architecture based on your requirements
2. Configure the components as needed
3. Deploy using the provided scripts
### Prerequisites
Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml)
```bash
docker compose -f deploy/metrics/docker-compose.yml up -d
```
### Build docker
#### Step 1: Build TensorRT-LLM base container image
Because of the known issue of C++11 ABI compatibility within the NGC pytorch container, we rebuild TensorRT-LLM from source.
See [here](https://nvidia.github.io/TensorRT-LLM/installation/linux.html) for more information.
Use the helper script to build a TensorRT-LLM container base image. The script uses a specific commit id from TensorRT-LLM main branch.
```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
# The script uses python packages like docker-squash to squash image
# layers within trtllm base image
DEBIAN_FRONTEND=noninteractive TZ=America/Los_Angeles apt-get -y install python3 python3-pip python3-venv
./container/build_trtllm_base_image.sh
```
See [here](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#option-1-build-tensorrt-llm-in-one-step) for more details on building from source.
If you already have a TensorRT-LLM container image, you can skip this step.
#### Step 2: Build the Dynamo container
```
# On an x86 machine:
./container/build.sh --framework tensorrtllm
# On an ARM machine:
./container/build.sh --framework tensorrtllm --platform linux/arm64
```
This build script internally points to the base container image built in Step 1. If you skipped the previous step because you already have a container image available, you can run the build script with that image as the base.
```bash
# Build dynamo image with other TRTLLM base image.
./container/build.sh --framework TENSORRTLLM --base-image <trtllm-base-image> --base-image-tag <trtllm-base-image-tag>
```
### Run container
```
./container/run.sh --framework tensorrtllm -it
```
## Run Deployment
This figure shows an overview of the major components to deploy:
```
+------+ +-----------+ +------------------+ +---------------+
| HTTP |----->| processor |----->| Worker |------------>| Prefill |
| |<-----| |<-----| |<------------| Worker |
+------+ +-----------+ +------------------+ +---------------+
| ^ |
query best | | return | publish kv events
worker | | worker_id v
| | +------------------+
| +---------| kv-router |
+------------->| |
+------------------+
```
```{note}
The above architecture illustrates all the components. The final components that get spawned depend upon the chosen graph.
```
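To make the "query best worker" and "publish kv events" arrows concrete, here is a toy model of that exchange. It is only an illustration under simplifying assumptions: the class, the block size, and the longest-cached-prefix rule are invented for this sketch and are not Dynamo's router implementation.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # tokens per KV block; illustrative value only


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Split a token sequence into full blocks, each keyed by its whole prefix."""
    return [
        tuple(token_ids[:end])
        for end in range(block_size, len(token_ids) + 1, block_size)
    ]


class ToyKvRouter:
    def __init__(self):
        # worker_id -> set of prefix keys that worker reported as cached
        self.cached = defaultdict(set)

    def on_kv_event(self, worker_id, token_ids):
        """Record the blocks a worker reports having cached."""
        self.cached[worker_id].update(block_keys(token_ids))

    def best_worker(self, token_ids):
        """Return the worker with the longest cached prefix, or None if no workers."""
        if not self.cached:
            return None

        def overlap(worker_id):
            count = 0
            for key in block_keys(token_ids):
                if key not in self.cached[worker_id]:
                    break
                count += 1
            return count

        # sorted() makes tie-breaking deterministic (first worker id wins)
        return max(sorted(self.cached), key=overlap)
```

In this model, the processor would call `best_worker` with the request's tokens and forward the request to the returned worker ID, while workers call `on_kv_event` as their caches change.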
### Example architectures
#### Aggregated serving
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
```
#### Aggregated serving with KV Routing
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.agg_router:Frontend -f ./configs/agg_router.yaml
```
#### Disaggregated serving
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
```
We are defining TRTLLM_USE_UCX_KVCACHE so that TRTLLM uses UCX for transferring the KV
cache between the context and generation workers.
#### Disaggregated serving with KV Routing
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.disagg_router:Frontend -f ./configs/disagg_router.yaml
```
We are defining TRTLLM_USE_UCX_KVCACHE so that TRTLLM uses UCX for transferring the KV
cache between the context and generation workers.
### Client
See the [client](llm_deployment.md#client) section to learn how to send requests to the deployment.
### Close deployment
See [close deployment](../../docs/guides/dynamo_serve.md#close-deployment) section to learn about how to close the deployment.
Remaining tasks:
- [x] Add support for the disaggregated serving.
- [ ] Add integration test coverage.
- [ ] Add instructions for benchmarking.
- [ ] Add multi-node support.
- [ ] Merge the code base with llm example to reduce the code duplication.
- [ ] Use processor from dynamo-llm framework.
- [ ] Enable NIXL integration with TensorRT-LLM once available. Currently, TensorRT-LLM uses UCX to transfer KV cache.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Getting Started
## Development Environment
This section describes how to set up your development environment.
### Recommended Setup: Using Dev Container
We recommend using our pre-configured development container:
1. Install prerequisites:
- [Docker](https://www.docker.com/products/docker-desktop)
- [Visual Studio Code](https://code.visualstudio.com/)
- [Dev Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
2. Get the code:
```bash
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
```
3. Open in Visual Studio Code:
1. Launch Visual Studio Code
2. Click the button in the bottom-left corner
3. Select **Reopen in Container**
Visual Studio Code builds and starts a container with all necessary dependencies for Dynamo development.
### Alternative Setup: Manual Installation
If you don't want to use the dev container, you can set the environment up manually:
1. Ensure you have:
- Ubuntu 24.04 (recommended)
- x86_64 CPU
- Python 3.x
- Git
See [Support Matrix](support_matrix.md) for more information.
2. **If you plan to use vLLM or SGLang**, you must also install:
- etcd
- NATS.io
Before starting dynamo, run both etcd and NATS.io in separate processes.
3. Install required system packages:
```bash
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0
```
4. Set up the Python environment:
```bash
python3 -m venv venv
source venv/bin/activate
```
5. Install Dynamo:
```bash
pip install "ai-dynamo[all]"
```
> [!Important]
> To ensure compatibility, use the examples in the release branch or tag that matches the version you installed.
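
Before launching anything, a quick check of the prerequisites listed above can save a failed start. The sketch below is an assumption-laden helper, not part of Dynamo: it assumes the etcd and NATS binaries are on `PATH` under the names `etcd` and `nats-server`, and it only reports what it finds.

```python
import platform
import shutil
import sys


def missing_tools(tools=("etcd", "nats-server"), which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


def environment_report():
    """Summarize the Python version, CPU architecture, and missing tools."""
    return {
        "python": ".".join(str(part) for part in sys.version_info[:3]),
        "machine": platform.machine(),
        "missing_tools": missing_tools(),
    }
```

Printing `environment_report()` shows the Python version, the machine architecture (the guide recommends `x86_64`), and any required binary that was not found.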
## Building the Dynamo Base Image
Deploying your Dynamo pipelines to Kubernetes requires you to build and push a Dynamo base image to your container registry.
You can use any private container registry of your choice, including:
- [Docker Hub](https://hub.docker.com/)
- [NVIDIA NGC Container Registry](https://catalog.ngc.nvidia.com/)
To build it:
```bash
./container/build.sh
docker tag dynamo:latest-vllm <your-registry>/dynamo-base:latest-vllm
docker login <your-registry>
docker push <your-registry>/dynamo-base:latest-vllm
```
This documentation describes these frameworks:
- `--framework vllm` build:
See [LLM Deployment Examples](examples/llm_deployment.md).
- `--framework tensorrtllm` build:
See [TRTLLM Deployment Examples](examples/trtllm.md).
After building, use this image by setting the `DYNAMO_IMAGE` environment variable to point to your built image:
```bash
export DYNAMO_IMAGE=<your-registry>/dynamo-base:latest-vllm
```
## Running and Interacting with an LLM Locally
Dynamo supports several backends, including `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.
Use the example commands below to launch a model.
### Example Command
```bash
python -m dynamo.frontend [--http-port 8080]
python -m dynamo.vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
```
```bash
? User › Hello, how are you?
✔ User · Hello, how are you?
Okay, so I'm trying to figure out how to respond to the user's greeting.
They said, "Hello, how are you?" and then followed it with "Hello! I'm just a program, but thanks for asking."
Hmm, I need to come up with a suitable reply. ...
```
## LLM Serving
Dynamo provides a simple way to spin up a local set of inference components including:
- **OpenAI-compatible Frontend**:
High-performance OpenAI-compatible HTTP API server written in Rust.
- **Basic and KV-Aware Router**:
Route and load balance traffic to a set of workers.
- **Workers**:
Set of pre-configured LLM serving engines.
To run a minimal configuration, use a pre-configured example.
### Start Dynamo Distributed Runtime Services
To start the Dynamo Distributed Runtime services the first time:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
### Start Dynamo LLM Serving Components
[Explore the VLLM Example](../components/backends/vllm/README.md)
## Local Development
If you use VS Code or Cursor, use the `.devcontainer` folder built on [Microsoft's Extension](https://code.visualstudio.com/docs/devcontainers/containers).
For instructions, see the Dynamo repository's [devcontainer README](https://github.com/ai-dynamo/dynamo/blob/main/.devcontainer/README.md).
Otherwise, to develop locally, we recommend working inside of the container:
```bash
./container/build.sh
./container/run.sh -it --mount-workspace
cargo build --release
mkdir -p /workspace/deploy/dynamo/sdk/src/dynamo/sdk/cli/bin
cp /workspace/target/release/dynamo-run /workspace/deploy/dynamo/sdk/src/dynamo/sdk/cli/bin
uv pip install -e .
export PYTHONPATH=$PYTHONPATH:/workspace/deploy/dynamo/sdk/src:/workspace/components/planner/src
```
### Conda Environment
Alternatively, use a Conda environment:
```bash
conda activate <ENV_NAME>
pip install nixl # Or install https://github.com/ai-dynamo/nixl from source
cargo build --release
# To install ai-dynamo-runtime from source
cd lib/bindings/python
pip install .
cd ../../../
pip install .[all]
# To test
docker compose -f deploy/docker-compose.yml up -d
python -m dynamo.frontend [--http-port 8080]
python -m dynamo.vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
```
......@@ -2,680 +2,108 @@
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Writing Python Workers in Dynamo
This guide explains how to create your own Python worker in Dynamo and deploy
it via `dynamo serve` or `dynamo deploy`, covering basic concepts as well as
advanced features like enabling KV routing and disaggregated serving.
For detailed information about `dynamo serve` infrastructure, see the
[Dynamo SDK Docs](../API/sdk.md).
For a guide that walks through how to launch a vLLM-based worker with
implementation of Disaggregated Serving and KV-Aware Routing included,
see the [Dynamo Serve Guide](../../docs/guides/dynamo_serve.md).
## Basic Concepts
When you deploy a Python-based worker with `dynamo serve` or `dynamo deploy`, you define it
as a Python class that needs a few key decorators:
- `@service`: used to define a worker class
- `@endpoint`: marks methods that can be called by other workers or clients
For more detailed information on these concepts, see the
[Dynamo SDK Docs](../API/sdk.md).
This guide explains how to create your own Python worker in Dynamo.
### Worker Skeleton
The [dynamo](https://pypi.org/project/ai-dynamo/) Python library allows you to build your own engine and attach it to Dynamo.
Here is the rough outline of what a worker may look like in its simplest form:
The Python file must do three things:
1. Decorate a function to get the runtime
2. Register on the network
3. Attach a request handler
```python
from dynamo.sdk import endpoint, service
@service(
dynamo={
"namespace": "your_namespace",
},
)
class YourWorker:
# Worker implementation
# ...
@endpoint()
async def your_endpoint(self, request: RequestType) -> AsyncIterator[ResponseType]:
# Endpoint Implementation
pass
```
from dynamo.llm import ModelType, register_llm
from dynamo.runtime import DistributedRuntime, dynamo_worker
Workers in Dynamo are identified by a `namespace/component/endpoint` naming schema.
When addressing this worker's endpoint with the `namespace/component/endpoint` schema
based on the definitions above, it would be: `your_namespace/YourWorker/your_endpoint`:
- `namespace="your_namespace"`: Defined in the `@service` decorator
- `component="YourWorker"`: Defined by the Python Class name
- `endpoint="your_endpoint"`: Defined by the `@endpoint` decorator, or by default the name of the function being decorated.
For more details about service configuration, resource management, and dynamo endpoints,
see the [Dynamo SDK Docs](../API/sdk.md).
### Request/Response Types
Request/Response types of endpoints can be defined arbitrarily for your use case's needs, as long as
the client calling your worker matches the expectations.
Define your request and response types using Pydantic models:
```python
from pydantic import BaseModel
# 1. Decorate a function to get the runtime
#
@dynamo_worker(static=False)
async def worker(runtime: DistributedRuntime):
class RequestType(BaseModel):
text: str
# Add other fields as needed
class ResponseType(BaseModel):
text: str
# Add other fields as needed
```
```python
from vllm.entrypoints.openai.protocol import ChatCompletionRequest
class YourLLMWorker:
@endpoint(name="my_chat_completions_endpoint")
async def generate(self, request: ChatCompletionRequest):
# Endpoint Implementation
pass
```
## Basic Worker Example
Here's a simple example of a worker that takes text in and streams text out
via custom RequestType/ResponseType definitions:
```python
# basic_worker.py
# This can be run standalone with `dynamo serve basic_worker:YourWorker`
import logging
from pydantic import BaseModel
from dynamo.sdk import endpoint, service
logger = logging.getLogger(__name__)
class RequestType(BaseModel):
text: str
class ResponseType(BaseModel):
text: str
@service(
dynamo={
"namespace": "your_namespace",
}
)
class YourWorker:
def __init__(self) -> None:
logger.info("Starting worker...")
@endpoint()
async def generate(self, request: RequestType):
"""Generate tokens and stream them back"""
logger.info(f"Worker endpoint received: {request.text}")
for token in request.text.split():
yield ResponseType(text=token).model_dump_json()
```
# 2. Register ourselves on the network
#
component = runtime.namespace("namespace").component("component")
await component.create_service()
model_path = "Qwen/Qwen3-0.6B" # or "/data/models/Qwen3-0.6B"
model_type = ModelType.Backend
endpoint = component.endpoint("endpoint")
# Optional last param to register_llm is model_name. If not present derives it from model_path
await register_llm(model_type, endpoint, model_path)
To see a minimal worker example like the above used in a larger pipeline of
components, see the `dynamo serve`
[Hello World example](../../examples/hello_world).
# Initialize your engine here
# engine = ...
### Client Example
# 3. Attach request handler
#
await endpoint.serve_endpoint(RequestHandler(engine).generate)
Here's a simple example of a client that directly calls the example
worker above through Dynamo without any intermediate services:
class RequestHandler:
```python
import asyncio
from pydantic import BaseModel
from dynamo.runtime import dynamo_worker, DistributedRuntime
def __init__(self, engine):
...
# These could also be imported from a shared file/definition
class RequestType(BaseModel):
text: str
class ResponseType(BaseModel):
text: str
@dynamo_worker()
async def client_worker(runtime: DistributedRuntime):
# Get a client to the worker endpoint from the distributed runtime
client = await runtime.namespace("your_namespace").component("YourWorker").endpoint("generate").client()
# Create a request
request = RequestType(text="Hello, Dynamo!")
# Call the dynamo endpoint exposed by the worker
responses = await client.generate(request.model_dump_json())
async for response in responses:
print(response)
async def generate(self, request):
# Call the engine
# yield result dict
...
if __name__ == "__main__":
asyncio.run(client_worker())
```
If there is an OpenAI `http` service in front of your worker and the worker
is defined to accept ChatCompletions-like requests, you could also use an
OpenAI-based client (or `curl`) that sends requests to the OpenAI HTTP Service,
and internally these requests would be routed to the attached worker endpoints instead.
In more advanced scenarios where your worker may operate on some other intermediate format
that may not directly match an OpenAI-like format, you could setup a separate processor worker
that does something like the following:
- Take in OpenAI Chat Completions requests from the HTTP service
- Convert requests from Chat Completions format to the RequestType format your worker expects
- Forward requests to the worker(s)
- Convert responses from the worker's ResponseType back into Chat Completions response format
- Forward responses back to client
This advanced scenario of a separate
[OpenAI Processor worker](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/components/processor.py)
is demonstrated in this
[vLLM example](https://github.com/ai-dynamo/dynamo/tree/main/examples/llm).
For a more minimal example of deploying a pipeline of components with a custom
API that your client can communicate with, see the
[Hello World example](../../examples/hello_world).
## Advanced Features
### KV Routing for LLMs
KV-aware routing is a powerful feature of Dynamo that optimizes for routing
requests to specific workers while minimizing a specific KV-cache based cost function.
In its simplest form, all a worker needs to do to enable KV-aware routing is to
publish KV metrics through the `WorkerMetricsPublisher`, which is consumed
by a Dynamo KV Router through the `KvMetricsAggregator`:
```python
# kv_metrics_worker.py
# This can be run standalone with `dynamo serve kv_metrics_worker:YourWorker`
import logging
import random
from pydantic import BaseModel
from dynamo.llm import (
WorkerMetricsPublisher,
ForwardPassMetrics,
KvStats,
SpecDecodeStats,
WorkerStats
)
from dynamo.sdk import endpoint, service, dynamo_context
logger = logging.getLogger(__name__)
class RequestType(BaseModel):
text: str
class ResponseType(BaseModel):
text: str
@service(
dynamo={
"namespace": "your_namespace",
}
)
class YourWorker:
def __init__(self):
# Initialize metrics publisher from Dynamo
self.component = dynamo_context["component"]
self.metrics_publisher = WorkerMetricsPublisher()
# Register an endpoint for consumers of the KV Metrics
# (KvMetricsAggregator) to listen/gather on.
self.metrics_publisher.create_endpoint(self.component)
# Initialize some metrics for the worker/class to track
self.request_active_slots = 0
self.request_total_slots = 1024
self.kv_active_blocks = 0
self.kv_total_blocks = 1024
self.num_requests_waiting = 0
self.gpu_cache_usage_perc = 0.0
self.gpu_prefix_cache_hit_rate = 0.0
worker_stats = WorkerStats(
self.request_active_slots,
self.request_total_slots,
self.num_requests_waiting,
data_parallel_rank=None,  # keyword arguments must follow positional ones
)
kv_stats = KvStats(
self.kv_active_blocks,
self.kv_total_blocks,
self.gpu_cache_usage_perc,
self.gpu_prefix_cache_hit_rate
)
# Publish some initial metrics to register
# this worker as a candidate for KV Routing.
metrics = ForwardPassMetrics(
worker_stats=worker_stats,
kv_stats=kv_stats,
spec_decode_stats=None,
)
self.metrics_publisher.publish(metrics)
def publish_kv_metrics(self):
# Populate the frequently changing metrics with random data for
# demonstration. These values should be tracked by the implementation,
# or queried from the underlying inference framework.
self.kv_active_blocks = random.randint(0, 1024)
self.num_requests_waiting = random.randint(0, 100)
self.gpu_cache_usage_perc = random.uniform(0, 1.0)
self.gpu_prefix_cache_hit_rate = random.uniform(0, 1.0)
# Publish the metrics with the current state
worker_stats = WorkerStats(
self.request_active_slots,
self.request_total_slots,
self.num_requests_waiting,
data_parallel_rank=None,  # keyword arguments must follow positional ones
)
kv_stats = KvStats(
self.kv_active_blocks,
self.kv_total_blocks,
self.gpu_cache_usage_perc,
self.gpu_prefix_cache_hit_rate
)
metrics = ForwardPassMetrics(
worker_stats=worker_stats,
kv_stats=kv_stats,
spec_decode_stats=None,
)
self.metrics_publisher.publish(metrics)
@endpoint()
async def generate(self, request: RequestType):
"""Generate tokens, update KV Cache metrics, and stream the tokens back"""
# Increment the number of active requests on receiving one
self.request_active_slots += 1
logger.info(f"Worker endpoint received: {request.text}")
# Simulate each step of token generation
for token in request.text.split():
# Update the metrics with the current state
self.publish_kv_metrics()
print("Returning token:", token)
yield ResponseType(text=token).model_dump_json()
# Decrement the number of active requests when complete with one
self.request_active_slots -= 1
uvloop.install()
asyncio.run(worker())
```
The granularity at which metrics are published is up to the backend/worker implementation.
For example, you may want to update metrics on every single generation step during token
generation, or you may only want to update once per request, depending on your use case.
Assuming long generation time or long output token sequence lengths, it would be more
accurate to publish metrics at every generation step.
With the worker publishing KV metrics, you should now be able to connect it
to a KV Router through the `KvMetricsAggregator`.
These metrics can then be inputs to a cost function to determine which
of the available workers the request should be routed to.
For a [python-based KV Router](../../examples/llm/components/kv_router.py)
implementation, the router is like any other worker, and it can expose
an endpoint that can do arbitrary things based on your use case.
For example, you can initialize the `KvMetricsAggregator` and `KvIndexer`
in your class implementation:
```python
@service(
dynamo={
"namespace": "your_namespace",
},
)
class Router:
# ...
@async_on_start
async def async_init(self):
self.runtime = dynamo_context["runtime"]
# Initialize a listener/aggregator for collecting KV metrics
# from the specified component (workers) publishing them
kv_listener = self.runtime.namespace("your_namespace").component("YourWorker")
await kv_listener.create_service()
self.indexer = KvIndexer(kv_listener, self.args.block_size)
self.metrics_aggregator = KvMetricsAggregator(kv_listener)
```
The `model_path` can be:
- A HuggingFace repo ID, optionally prefixed with `hf://`. It is downloaded and cached locally.
- The path to a checkout of a HuggingFace repo - any folder containing safetensor files as well as `config.json`, `tokenizer.json` and `tokenizer_config.json`.
- The path to a GGUF file, if your engine supports that.
With this flexibility, you can also define your own cost function that takes the
KV metrics from the KvMetricsAggregator above and the set of available workers
as inputs, and returns which worker ID that the request should be routed to.
Since the router is like any other worker in this context, you can also track
your own custom metrics and use them in your cost function:
The `model_type` can be:
- ModelType.Backend. Dynamo handles pre-processing. Your `generate` method receives a `request` dict containing a `token_ids` array of int. It must return a dict also containing a `token_ids` array and an optional `finish_reason` string.
- ModelType.Chat. Your `generate` method receives a `request` and must return a response dict of type [OpenAI Chat Completion](https://platform.openai.com/docs/api-reference/chat). Your engine handles pre-processing.
- ModelType.Completion. Your `generate` method receives a `request` and must return a response dict of the older [Completions](https://platform.openai.com/docs/api-reference/completions). Your engine handles pre-processing.
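For `ModelType.Backend`, the request and response shapes described above can be sketched with a stand-in engine. This is a minimal illustration, not Dynamo's reference handler: `EchoEngine` is hypothetical, the `"stop"` value for `finish_reason` is an assumption, and the sketch assumes each yielded dict carries only newly generated tokens.

```python
import asyncio


class EchoEngine:
    """Stand-in engine (hypothetical): 'generates' by echoing the prompt tokens."""

    async def run(self, token_ids):
        for token_id in token_ids:
            await asyncio.sleep(0)  # give control back to the event loop
            yield token_id


class RequestHandler:
    def __init__(self, engine):
        self.engine = engine

    async def generate(self, request):
        """Receive a dict with `token_ids` and yield dicts with `token_ids`."""
        prompt = request["token_ids"]
        last_index = len(prompt) - 1
        index = 0
        async for token_id in self.engine.run(prompt):
            chunk = {"token_ids": [token_id]}
            if index == last_index:
                # Assumed value; the text above only says the field is optional
                chunk["finish_reason"] = "stop"
            index += 1
            yield chunk


async def collect(handler, request):
    """Gather every chunk from the handler; handy for local testing."""
    return [chunk async for chunk in handler.generate(request)]
```

Running `asyncio.run(collect(RequestHandler(EchoEngine()), {"token_ids": [1, 2, 3]}))` exercises the handler without any Dynamo runtime.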
```python
@service(
dynamo={
"namespace": "your_namespace",
},
)
class Router:
# ...
`register_llm` can also take the following kwargs:
- `model_name`: The name to call the model. The model name in incoming HTTP requests must match this. Defaults to the Hugging Face repo name, the folder name, or the GGUF file name.
- `context_length`: Max model length in tokens. Defaults to the model's set max. Only set this if you need to reduce KV cache allocation to fit into VRAM.
- `kv_cache_block_size`: Size of a KV block for the engine, in tokens. Defaults to 16.
def _cost_function(
self,
scores: OverlapScores | None,
metrics: AggregatedMetrics | None,
token_length: int,
custom_metrics: dict = {},
):
"""
Args:
scores (OverlapScores | None): The number of matching blocks between
the request and the prefix cache of each worker.
metrics (AggregatedMetrics | None): Several worker metrics polled
by the `KvMetricsAggregator`, currently including the
GPU cache usage, number of waiting requests, and the
GPU prefix cache hit rate.
token_length (int): The number of tokens in the request.
custom_metrics (dict): Arbitrary (optional) data from your implementation.
See `components/backends` for full code examples.
Returns:
worker_id (str): The best worker ID based on the inputs.
"""
### Component names
# This is a dummy implementation for demonstration purposes, see the
# llm/tensorrt_llm/hello_world examples for more realistic implementations.
worker_ids = []
A worker needs three names to register itself: namespace.component.endpoint
# KV cache block hit scores
for worker_id, score in scores.scores.items():
print(f"{worker_id} # of matching KV Blocks of size {self.indexer.block_size()}: {score}")
worker_ids.append(worker_id)
* *Namespace*: A pipeline. Usually a model, e.g. "llama_8b". Just a name.
* *Component*: A load balanced service needed to run that pipeline. "backend", "prefill", "decode", "preprocessor", "draft", etc. This typically has some configuration (which model to use, for example).
* *Endpoint*: Like a URL. "generate", "load_metrics".
* *Instance*: A process. Unique. Dynamo assigns each one a unique instance_id. The thing that is running is always an instance. Namespace/component/endpoint can refer to multiple instances.
# Aggregated KVMetrics published by workers
for endpoint_metrics in metrics.endpoints:
print(f"Endpoint metrics: {endpoint_metrics}")
If you run two models, that is two pipelines. Speculative decoding is an exception: the draft model is part of the pipeline of the bigger model.
# Replace this random choice with your custom criteria to
# select the best worker ID.
best_worker_id = random.choice(worker_ids)
If you run two instances of the same model ("data parallel") they are the same namespace+component+endpoint but different instances. The router will spread traffic over all the instances of a namespace+component+endpoint. If you have four prefill workers in a pipeline, they all have the same namespace+component+endpoint and are automatically assigned unique instance_ids.
return best_worker_id
@endpoint()
async def generate(self, request: Tokens) -> AsyncIterator[WorkerId]:
try:
# lora_id is a placeholder for lora support, but not used in this example
lora_id = 0
scores = await self.indexer.find_matches_for_request(
request.tokens, lora_id
)
except Exception as e:
scores = {}
print(f"Error finding matches: {e}")
# Get published/aggregated KV Metrics
metrics = await self.metrics_aggregator.get_metrics()
# (Optional) Add custom metrics to consider in the cost function
# The types and data used here are fully up to your implementation
custom_metrics = {"my_custom_metric": 42}
# Call custom cost function
worker_id = self._cost_function(
scores, metrics, len(request.tokens), custom_metrics
)
# Return worker ID selected from cost function
yield f"{worker_id}"
Example 1: Data parallel load balanced, one model one pipeline two instances.
```
Similarly, for running a Rust-based Router as a standalone binary
rather than as a Python Worker, see the
[WorkerSelector Trait](../../lib/llm/src/kv_router.rs) and the
[Router Component](../../components/router/src/main.rs).
For more details on receiving and routing based on the worker's published KV
metrics, see the [KV Cache Routing Guide](../../docs/architecture/kv_cache_routing.md).
### Disaggregated Serving
#### NIXL
NIXL (NVIDIA Inference Xfer Library) enables efficient GPU memory sharing between processes. In Prefill/Decode disaggregation, we use NIXL to transfer computed KV cache blocks from prefill workers to decode workers. Here are the core concepts:
1. **NIXL Agent Setup**
```python
import uuid
from nixl._api import nixl_agent
class NixlConnector:
def __init__(self, engine_id: str, rank: int):
# Create unique NIXL agent for this worker
self.nixl_wrapper = nixl_agent(str(uuid.uuid4()), None)
self.engine_id = engine_id
self.rank = rank
self.block_len = None # Will be set during registration
Node 1: namespace: qwen3-32b, component: backend, endpoint: generate, model: /data/Qwen3-32B --tensor-parallel-size 2 --base-gpu-id 0
Node 2: namespace: qwen3-32b, component: backend, endpoint: generate, model: /data/Qwen3-32B --tensor-parallel-size 2 --base-gpu-id 2
```
2. **Memory Registration and Transfer Preparation**
```python
def register_kv_caches(self, kv_cache: torch.Tensor):
# Get block size from the KV cache tensor
# Note: KV cache layout depends on specific attention implementation
num_blocks, block_size, num_heads, head_dim = kv_cache.shape
self.block_len = block_size * num_heads * head_dim * kv_cache.element_size()
self.num_blocks = num_blocks
# Register KV cache tensor with NIXL for sharing
base_addr = kv_cache.data_ptr()
region_len = num_blocks * self.block_len
caches_data = [(base_addr, region_len, self.rank, "")]
# Register memory regions with NIXL
descs = self.nixl_wrapper.get_reg_descs(caches_data, "VRAM")
self.nixl_wrapper.register_memory(descs)
# Prepare local side of the transfer
blocks_data = []
for block_id in range(num_blocks):
block_offset = block_id * self.block_len
blocks_data.append((base_addr + block_offset, self.block_len, self.rank))
# Create transfer descriptors and prepare for transfers
self.local_blocks_descs = self.nixl_wrapper.get_xfer_descs(blocks_data, "VRAM")
# Create transfer handle with block descriptors for future transfers
self.local_xfer_side_handle = self.nixl_wrapper.prep_xfer_dlist("", self.local_blocks_descs)
Example 2: Two models, two pipelines.
```
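The block-length and offset arithmetic in the registration step can be checked without a GPU. The following is a minimal sketch in plain Python, assuming the same `(num_blocks, block_size, num_heads, head_dim)` layout as above; the `KvCacheLayout` class, its field names, and the example sizes are hypothetical and are not part of NIXL or Dynamo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KvCacheLayout:
    """Hypothetical stand-in for a KV cache tensor's shape and dtype size."""

    num_blocks: int
    block_size: int
    num_heads: int
    head_dim: int
    element_size: int  # bytes per element, e.g. 2 for fp16

    @property
    def block_len(self) -> int:
        # Bytes occupied by one KV cache block.
        return self.block_size * self.num_heads * self.head_dim * self.element_size

    @property
    def region_len(self) -> int:
        # Bytes occupied by the whole registered region.
        return self.num_blocks * self.block_len

    def block_descriptors(self, base_addr: int, device_id: int):
        # One (address, length, device) tuple per block, mirroring the
        # blocks_data list built in register_kv_caches above.
        return [
            (base_addr + block_id * self.block_len, self.block_len, device_id)
            for block_id in range(self.num_blocks)
        ]


# Illustrative sizes only.
layout = KvCacheLayout(num_blocks=4, block_size=16, num_heads=8, head_dim=128, element_size=2)
descs = layout.block_descriptors(base_addr=0x1000, device_id=0)
```

Each descriptor starts exactly `block_len` bytes after the previous one, which is why a transfer can later be expressed purely in terms of block IDs.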
3. **Remote Agent Communication**
```python
def get_agent_metadata(self):
# Get metadata for sharing with other agents
return self.nixl_wrapper.get_agent_metadata(), self.local_blocks_descs
def add_remote_agent(self, engine_id: str, agent_metadata: bytes, remote_blocks_descs: bytes):
# Connect to remote NIXL agent
agent_name = self.nixl_wrapper.add_remote_agent(agent_metadata)
# Prepare remote side transfer handle using provided block descriptors
self.remote_xfer_side_handle = self.nixl_wrapper.prep_xfer_dlist(agent_name, remote_blocks_descs)
return agent_name
# Example usage:
# On decode worker:
decode_metadata, decode_blocks_descs = nixl_connector.get_agent_metadata()
# Share metadata with prefill worker through your communication channel
# On prefill worker:
nixl_connector.add_remote_agent(decode_engine_id, decode_metadata, decode_blocks_descs)
```
4. **KV Cache Transfer**
```python
def write_blocks(self, local_block_ids, remote_block_ids, notify_msg):
# Initiate asynchronous transfer using block IDs
# Block descriptors were specified during transfer preparation
handle = self.nixl_wrapper.make_prepped_xfer(
"WRITE",
self.local_xfer_side_handle,
local_block_ids,
self.remote_xfer_side_handle,
remote_block_ids,
notify_msg
)
status = self.nixl_wrapper.transfer(handle)
# Example usage:
# On prefill worker:
nixl_connector.write_blocks([0, 3], [12, 16], "kv_transfer")
```
The NIXL connector provides:
- GPU memory registration for sharing between processes
- Connection establishment between Prefill and Decode workers
- Efficient block-based KV cache transfers
- Asynchronous transfer notifications
For a complete implementation example using NIXL for disaggregated serving, see the [vLLM example](../examples/llm_deployment.md).
#### Disaggregation in Dynamo
Aside from the NIXL specifics above, at its core, disaggregation in Dynamo builds
on the same concepts used for any Dynamo client<->worker or worker<->worker
interaction over the DistributedRuntime.
First you can define a worker for each as usual:
```python
class DecodeWorker:
# ...
class PrefillWorker:
# ...
Node 1: namespace: qwen3-32b, component: backend, endpoint: generate, model: /data/Qwen3-32B
Node 2: namespace: llama3-1-8b, component: backend, endpoint: generate, model: /data/Llama-3.1-8B-Instruct/
```
Second, decide when the Decode worker should prefill remotely (by calling a
separate Prefill worker) and when it should simply do the Prefill itself.
In some scenarios it is more efficient for the Decode worker to do the
Prefill itself and avoid the extra communication, for example when the input
sequence length is below some small threshold. To disable disaggregation
entirely, the DecodeWorker can always do the Prefill step itself.
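The length-based decision described above can be sketched as a small helper. This is an illustration only: the function name, the `enabled` flag, and the 512-token default are assumptions made for the example, not Dynamo APIs or recommended values.

```python
def should_prefill_remotely(
    num_input_tokens: int,
    *,
    enabled: bool = True,
    min_remote_prefill_tokens: int = 512,  # hypothetical threshold
) -> bool:
    """Return True if the request should be sent to a separate Prefill worker."""
    if not enabled:
        # Disaggregation disabled: the Decode worker always prefills locally.
        return False
    # Short prompts are cheaper to prefill locally than to ship elsewhere.
    return num_input_tokens >= min_remote_prefill_tokens
```

A Decode worker could call such a helper at the top of its `generate` method and only forward the request to the Prefill client when it returns `True`.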
Example 3: Different endpoints.
```python
@service(
dynamo={
"namespace": "your_namespace",
},
)
class DecodeWorker:
def __init__(self):
self.runtime = dynamo_context["runtime"]
The KV metrics publisher in vLLM adds a `load_metrics` endpoint to the current component. If the `llama3-1-8b.backend` component above is using patched vLLM, it also exposes `llama3-1-8b.backend.load_metrics`.
# Whether the decode worker should call a separate Prefill worker or not
self.do_remote_prefill = True
# Initialize client to PrefillWorker
self.prefill_client = await self.runtime
.namespace("your_namespace")
.component("PrefillWorker")
.endpoint("generate")
.client()
@endpoint()
async def generate(self, request):
if self.do_remote_prefill:
# Forward the request to the prefill worker
prefill_response = await self.prefill_client.generate(...)
# ... framework-specific decode logic ...
@service(
dynamo={
"namespace": "your_namespace",
},
)
class PrefillWorker:
def __init__(self):
# ...
@endpoint()
async def generate(self, request):
# ... framework-specific prefill logic ...
```
Depending on the load distribution of requests and the number of Prefill/Decode
worker instances, it may be advantageous to send Prefill requests into a queue
that the Prefill workers pull from on demand, rather than forwarding requests
directly to the Prefill worker endpoint. You can see an example of that
[here](https://github.com/ai-dynamo/dynamo/blob/main/examples/hello_world/disagg_skeleton/components/prefill_worker.py).
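The pull-based pattern can be illustrated with a self-contained sketch that uses an in-process `asyncio.Queue` as a stand-in for a distributed queue. The worker names, the `None` shutdown sentinel, and the `run_demo` helper are hypothetical; a real deployment would use a queue shared across processes, as in the linked example.

```python
import asyncio


async def prefill_worker(name: str, queue: asyncio.Queue, results: list) -> None:
    """Pull prefill requests from the queue until a None sentinel arrives."""
    while True:
        request = await queue.get()
        try:
            if request is None:
                return
            # Stand-in for framework-specific prefill logic.
            await asyncio.sleep(0)
            results.append((name, request))
        finally:
            queue.task_done()


async def run_demo(num_workers: int = 2, num_requests: int = 5) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    workers = [
        asyncio.create_task(prefill_worker(f"prefill-{i}", queue, results))
        for i in range(num_workers)
    ]
    # The Decode side enqueues work instead of calling a specific worker.
    for request_id in range(num_requests):
        queue.put_nowait(request_id)
    await queue.join()
    # Send one shutdown sentinel per worker.
    for _ in workers:
        queue.put_nowait(None)
    await asyncio.gather(*workers)
    return results


results = asyncio.run(run_demo())
```

Because idle workers pull the next request themselves, a slow prefill on one worker does not hold up requests that another worker could serve.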
For an introductory example on doing disaggregation with Dynamo using simple models, see
[this example](../examples/disagg_skeleton.md).
For more information on Disaggregated Serving, see the
[general guide](../architecture/disagg_serving.md) and [performance tuning guide](disagg_perf_tuning.md).
## Best Practices
1. **Resource Management**: Configure resource requirements based on your needs:
```python
@service(
resources={
"cpu": "10",
"memory": "20Gi",
"gpu": "1",
}
)
```
2. **Async Operations**: Use async/await for I/O operations:
```python
@endpoint()
async def generate(self, request):
# Use async operations for better performance
result = await self.some_async_operation()
```
Example 4: Multiple components in a pipeline.
## Additional Resources
In the P/D disaggregated setup you would have `deepseek-distill-llama8b.prefill.generate` (possibly multiple instances of this) and `deepseek-distill-llama8b.decode.generate`.
- Check the [examples](https://github.com/ai-dynamo/dynamo/tree/main/examples) directory for more detailed implementations
- Refer to the [Dynamo SDK Docs](../API/sdk.md) for API details.
- For Disaggregated Serving, see the [general guide](../architecture/disagg_serving.md) and [performance tuning guide](disagg_perf_tuning.md).
......@@ -137,7 +137,6 @@ This section describes how to use FluxCD for GitOps-based deployment of Dynamo i
- A Kubernetes cluster with [Dynamo Cloud](dynamo_cloud.md) installed
- [FluxCD](https://fluxcd.io/flux/installation/) installed in your cluster
- A Git repository to store your deployment configurations
- [Dynamo CLI](../../get_started.md#alternative-setup-manual-installation) installed locally
### Workflow Overview
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Planner Benchmark Example
......@@ -50,8 +38,8 @@ For other models and GPU SKUs, adjust the request rate ranges accordingly to mat
To measure the performance of Dynamo with the planner, we start from a 1p1d deployment and set the planner to make adjustments every 10 seconds:
```bash
cd examples/llm
dynamo serve graphs.disagg_router:Frontend -f disagg_1p1d.yml
# Start Kubernetes with one frontend node, one prefill and one decode worker
# TODO
# in terminal 2
genai-perf profile \
......@@ -84,7 +72,8 @@ In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--n
```bash
# in terminal 1
dynamo serve graphs.disagg_router:Frontend -f disagg_2p2d.yml
# Start Kubernetes with one frontend node, two prefill and two decode workers
# TODO
# in terminal 2
genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8000 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
......
......@@ -75,7 +75,6 @@ The examples below assume you build the latest image yourself from source. If us
Welcome to Dynamo <self>
Support Matrix <support_matrix.md>
Getting Started <get_started.md>
.. toctree::
:hidden:
......
......@@ -2,18 +2,6 @@
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo Support Matrix
......@@ -72,7 +60,6 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
| :----------------- | :------------ | :----------------------------------- | :----------- |
| ai-dynamo | 0.3.2 | >=2.28 | |
| ai-dynamo-runtime | 0.3.2 | >=2.28 (Python 3.12 has known issues)| |
| ai-dynamo-vllm | 0.8.4.post4¹ | >=2.28 (recommended) | |
| NIXL | 0.4.0 | >=2.27 | >=11.8 |
### Build Dependency
......@@ -80,13 +67,10 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
| **Build Dependency** | **Version** |
| :------------------- | :------------------------------------------------------------------------------- |
| **Base Container** | [25.03](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda-dl-base/tags) |
| **ai-dynamo-vllm** | 0.8.4.post4¹ |
| **TensorRT-LLM** | 1.0.0rc² |
| **NIXL** | 0.4.0 |
> [!Important]
> ¹ ai-dynamo-vllm `v0.8.4.post4` is a customized patch of `v0.8.4` from vLLM.
>
> ² Specific versions of TensorRT-LLM supported by Dynamo are subject to change.
## Build Support
......
......@@ -178,7 +178,6 @@ impl ModelWatcher {
Some(card)
}
Err(err) => {
// `dynamo serve` isn't using MDC yet so can't be an error
tracing::info!(%err, "load_mdc did not complete");
None
}
......