Commit 3c500ae7 authored by Graham King, committed by GitHub

docs: Update docs for new UX (#2070)

parent f3d784f3
......@@ -55,96 +55,68 @@ Built in Rust for performance and in Python for extensibility, Dynamo is fully o
The following examples require a few system level packages.
We recommend Ubuntu 24.04 with an x86_64 CPU. See [docs/support_matrix.md](docs/support_matrix.md)
```
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0
python3 -m venv venv
source venv/bin/activate
pip install "ai-dynamo[all]"
```
> [!NOTE]
> To ensure compatibility, please refer to the examples in the release branch or tag that matches the version you installed.
### Building the Dynamo Base Image
1. Install etcd and nats
Although not needed for local development, deploying your Dynamo pipelines to Kubernetes will require you to push a Dynamo base image to your container registry. You can use any container registry of your choice, such as:
- Docker Hub (docker.io)
- NVIDIA NGC Container Registry (nvcr.io)
- Any private registry
To co-ordinate across the data center Dynamo relies on an etcd and nats cluster. To run locally these need to be available.
We publish our images in [nvcr.io](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/ai-dynamo/containers/vllm-runtime) and you can use them.
Alternatively you could build and push an image from source:
- [etcd](https://etcd.io/) can be run directly as `./etcd`.
- [nats](https://nats.io/) needs jetstream enabled: `nats-server -js`.
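Before starting any Dynamo component it can help to confirm that both services are reachable. The sketch below is an illustration only: it assumes etcd and nats listen on their usual default client ports (2379 and 4222) on localhost, and the helper names `is_port_open` and `missing_services` are hypothetical, not part of Dynamo.

```python
import socket

# Assumed default client ports; adjust if your etcd or nats-server is configured differently.
DEFAULT_SERVICES = {
    "etcd": ("127.0.0.1", 2379),
    "nats": ("127.0.0.1", 4222),
}


def is_port_open(host, port, timeout=0.5):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def missing_services(services=DEFAULT_SERVICES):
    """Return the names of services whose port does not accept connections."""
    return [name for name, (host, port) in services.items()
            if not is_port_open(host, port)]
```

Calling `missing_services()` returns an empty list when both ports accept connections. A successful TCP connection does not prove that JetStream is enabled on nats, so treat this as a quick sanity check only.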
```bash
./container/build.sh
docker tag dynamo:latest-vllm <your-registry>/dynamo-base:latest-vllm
docker login <your-registry>
docker push <your-registry>/dynamo-base:latest-vllm
The Dynamo team recommends the `uv` Python package manager, although any Python package manager works. Install uv:
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```
Notes about builds for specific frameworks:
- For specific details on the `--framework vllm` build [read about the VLLM backend](components/backends/vllm/README.md)
.
- For specific details on the `--framework tensorrtllm` build, see [Read about the TensorRT-LLM backend](components/backends/trtllm/README.md)
.
2. Select an engine
Note about AWS environments:
- If deploying Dynamo in AWS, make sure to build the container with EFA support using the `--make-efa` flag.
We publish Python wheels specialized for each of our supported engines: vllm, sglang, llama.cpp and trtllm. The examples that follow use sglang; read on for other engines.
After building, you can use this image by setting the `DYNAMO_IMAGE` environment variable to point to your built image:
```bash
export DYNAMO_IMAGE=<your-registry>/dynamo-base:latest-vllm
```
uv venv venv
source venv/bin/activate
uv pip install pip
> [!NOTE]
> We are working on leaner base images that can be built using the targets in the top-level Earthfile.
# Choose one
uv pip install "ai-dynamo[sglang]"
uv pip install "ai-dynamo[vllm]"
uv pip install "ai-dynamo[llama_cpp]" # CPU, see later for GPU
```
### Running and Interacting with an LLM Locally
You can run a model and interact with it locally using commands below.
We support several backends including: `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.
#### Example Commands
```
python -m dynamo.frontend [--http-port 8080]
python -m dynamo.vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
python -m dynamo.frontend --interactive
python -m dynamo.sglang.worker Qwen/Qwen3-4B
```
```
? User › Hello, how are you?
✔ User · Hello, how are you?
Okay, so I'm trying to figure out how to respond to the user's greeting. They said, "Hello, how are you?" and then followed it with "Hello! I'm just a program, but thanks for asking." Hmm, I need to come up with a suitable reply. ...
```
### LLM Serving
If the model is not available locally it will be downloaded from HuggingFace and cached.
Dynamo provides a simple way to spin up a local set of inference
components including:
You can also pass a local path: `python -m dynamo.sglang.worker --model-path ~/llms/Qwen3-0.6B`
### Running an LLM API server
Dynamo provides a simple way to spin up a local set of inference components including:
- **OpenAI Compatible Frontend** – High performance OpenAI compatible http api server written in Rust.
- **Basic and Kv Aware Router** – Route and load balance traffic to a set of workers.
- **Workers** – Set of pre-configured LLM serving engines.
To run a minimal configuration you can use a pre-configured
example.
#### Start Dynamo Distributed Runtime Services
First start the Dynamo Distributed Runtime services:
```bash
docker compose -f deploy/metrics/docker-compose.yml up -d
```
#### Start Dynamo LLM Serving Components
Next serve a minimal configuration with an http server, basic
round-robin router, and a single worker.
# Start an OpenAI compatible HTTP server, a pre-processor (prompt templating and tokenization) and a router:
python -m dynamo.frontend [--http-port 8080]
```bash
cd examples/llm
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
# Start the sglang engine, connecting to nats and etcd to receive requests. You can run several of these,
# both for the same model and for multiple models. The frontend node will discover them.
python -m dynamo.sglang.worker deepseek-ai/DeepSeek-R1-Distill-Llama-8B
```
#### Send a Request
......@@ -163,43 +135,143 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json"
}' | jq
```
Rerun with `curl -N` and change `stream` in the request to `true` to get the responses as soon as the engine issues them.
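The same request can also be built from Python using only the standard library. This is a minimal sketch, assuming the frontend listens on `localhost:8000` as in the `curl` command above and accepts the usual OpenAI chat fields (`model`, `messages`, `stream`). The helper names `build_chat_request` and `send` are illustrative and not part of Dynamo.

```python
import json
import urllib.request


def build_chat_request(model, prompt, stream=False, base_url="http://localhost:8000"):
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=60):
    """Send a non-streaming request and return the decoded JSON response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a frontend and a worker running, `send(build_chat_request("Qwen/Qwen3-4B", "Hello, how are you?"))` returns the parsed completion. Streaming responses arrive as incremental chunks and need a different reader than `send`.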
### Engines
In the introduction we installed the `sglang` engine. There are other options.
All of these require nats and etcd, as well as a frontend (`python -m dynamo.frontend [--interactive]`).
# vllm
```
uv pip install ai-dynamo[vllm]
```
Run the backend/worker like this:
```
python -m dynamo.vllm --help
```
vllm attempts to allocate enough KV cache for the full context length at startup. If that does not fit in your available memory pass `--context-length <value>`.
To specify which GPUs to use set environment variable `CUDA_VISIBLE_DEVICES`.
# sglang
```
uv pip install ai-dynamo[sglang]
```
Run the backend/worker like this:
```
python -m dynamo.sglang.worker --help
```
You can pass any sglang flags directly to this worker, see https://docs.sglang.ai/backend/server_arguments.html . See there to use multiple GPUs.
# TRT-LLM
This currently requires a container TODO ADD THE DOCS PLZ THANK YOU
# llama.cpp
To install llama.cpp for CPU inference:
```
uv pip install ai-dynamo[llama_cpp]
```
To build llama.cpp for CUDA:
```
pip install llama-cpp-python -C cmake.args="-DGGML_CUDA=on"
uv pip install uvloop ai-dynamo
```
At time of writing the `uv pip` version does not support that syntax, so use `pip` directly inside the venv.
To build llama.cpp for other accelerators see https://pypi.org/project/llama-cpp-python/ .
Download a GGUF and run the engine like this:
```
python -m dynamo.llama_cpp --model-path ~/llms/Qwen3-0.6B-Q8_0.gguf
```
If you have multiple GPUs, llama.cpp does automatic tensor parallelism. You do not need to pass any extra flags to dynamo-run to enable it.
### Local Development
If you use vscode or cursor, we have a .devcontainer folder built on [Microsoft's Extension](https://code.visualstudio.com/docs/devcontainers/containers). See the [ReadMe](.devcontainer/README.md) for instructions.
1. Install libraries
**Ubuntu:**
```
sudo apt install -y build-essential libhwloc-dev libudev-dev pkg-config libclang-dev protobuf-compiler python3-dev cmake
```
Otherwise, to develop locally, we recommend working inside of the container
**macOS:**
- [Homebrew](https://brew.sh/)
```
# if brew is not installed on your system, install it
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)"
```
- [Xcode](https://developer.apple.com/xcode/)
```bash
./container/build.sh
./container/run.sh -it --mount-workspace
```
brew install cmake protobuf
cargo build --release
mkdir -p /workspace/deploy/sdk/src/dynamo/sdk/cli/bin
cp /workspace/target/release/dynamo-run /workspace/deploy/sdk/src/dynamo/sdk/cli/bin
## Check that Metal is accessible
xcrun -sdk macosx metal
```
If Metal is accessible, you should see an error like `metal: error: no input files`, which confirms it is installed correctly.
2. Install Rust
uv pip install -e .
export PYTHONPATH=$PYTHONPATH:/workspace/deploy/sdk/src:/workspace/components/planner/src
```
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
```
3. Create a Python virtual env:
#### Conda Environment
```
uv venv dynamo
source dynamo/bin/activate
```
Alternately, you can use a conda environment
4. Install build tools
```bash
conda activate <ENV_NAME>
```
uv pip install pip maturin
```
pip install nixl # Or install https://github.com/ai-dynamo/nixl from source
[Maturin](https://github.com/PyO3/maturin) is the Rust<->Python bindings build tool.
cargo build --release
5. Build the Rust bindings
# To install ai-dynamo-runtime from source
```
cd lib/bindings/python
pip install .
maturin develop --uv
```
6. Install the wheel
```
cd $PROJECT_ROOT
uv pip install .
```
Note that editable (`-e`) does not work because the `dynamo` package is split over multiple directories, one per backend.
You should now be able to run `python -m dynamo.frontend`.
Remember that nats and etcd must be running (see earlier).
Set the environment variable `DYN_LOG` to adjust the logging level; for example, `export DYN_LOG=debug`. It has the same syntax as `RUST_LOG`.
If you use vscode or cursor, we have a .devcontainer folder built on [Microsoft's Extension](https://code.visualstudio.com/docs/devcontainers/containers). See the [ReadMe](.devcontainer/README.md) for instructions.
cd ../../../
pip install ".[all]"
### Deployment to Kubernetes
Follow the [Quickstart Guide](docs/guides/dynamo_deploy/quickstart.md)
Follow the [Quickstart Guide](docs/guides/dynamo_deploy/quickstart.md) to deploy to Kubernetes.
```
\ No newline at end of file
......@@ -74,18 +74,14 @@ metrics --component MyComponent --endpoint my_endpoint
To run a more realistic deployment to gather metrics from,
see the examples in [examples/llm](../../examples/llm).
For example, for a VLLM + KV Routing based deployment that
exposes statistics on an endpoint labeled
`dynamo/VllmWorker/load_metrics` (note: this does NOT currently work
with any other example such as examples/vllm_v0, vllm_v1, ...):
```bash
cd deploy/examples/llm
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
python -m dynamo.frontend &
python -m dynamo.vllm --model-path <your-model-checkout>
```
Then, to monitor the metrics of these VllmWorkers, run:
```bash
metrics --component VllmWorker --endpoint load_metrics
metrics --component backend --endpoint load_metrics
```
**NOTE**: `load_metrics` is currently a
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
https://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo Python Bindings
Python bindings for the Dynamo runtime system, enabling distributed computing capabilities for machine learning workloads.
## 🚀 Quick Start
1. Install `uv`: https://docs.astral.sh/uv/#getting-started
```
curl -LsSf https://astral.sh/uv/install.sh | sh
```
2. Install `protoc` protobuf compiler: https://grpc.io/docs/protoc-installation/.
For example on an Ubuntu/Debian system:
```
apt install protobuf-compiler
```
3. Setup a virtualenv
```
uv venv
source .venv/bin/activate
uv pip install maturin
```
4. Build and install dynamo wheel
```
maturin develop --uv
```
## Run Examples
### Prerequisite
See [README.md](../runtime/README.md#prerequisites).
### Hello World Example
1. Start 3 separate shells, and activate the virtual environment in each
```
source .venv/bin/activate
```
2. In one shell (shell 1), run the example server instance-1
```
python3 ./examples/hello_world/server.py
```
3. (Optional) In another shell (shell 2), run the example server instance-2
```
python3 ./examples/hello_world/server.py
```
4. In the last shell (shell 3), run the example client:
```
python3 ./examples/hello_world/client.py
```
If you run the example client in rapid succession, and you started more than
one server instance above, you should see the requests from the client being
distributed across the server instances in each server's output. If only one
server instance is started, you should see the requests go to that server
each time.
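The distribution described above can be pictured with a small simulation. This is a sketch under an assumption: the text does not say which policy the example uses, so plain round-robin is assumed here, and `distribute` is a hypothetical helper rather than a Dynamo API.

```python
from itertools import cycle


def distribute(requests, instances):
    """Assign each request to the next instance in turn (round-robin)."""
    if not instances:
        raise ValueError("at least one server instance is required")
    rotation = cycle(instances)
    return [(request, next(rotation)) for request in requests]


# Two server instances alternate; a single instance receives every request.
two_servers = distribute(range(4), ["instance-1", "instance-2"])
one_server = distribute(range(3), ["instance-1"])
```

With two instances the requests alternate between them, which matches what you should see in each server's output when the client is run repeatedly.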
## Performance
The performance impact of synchronizing the Python and Rust async runtimes
is a critical consideration when optimizing a highly
concurrent and parallel distributed system.
The Python GIL is a global critical section and is ultimately the death of
parallelism. To compound that, when Rust async futures become ready,
accessing the GIL on those async event loops needs to be considered carefully.
Under high load, accessing the GIL or performing CPU-intensive tasks on
the event loop threads can starve other async tasks of CPU resources.
However, performing a `tokio::task::spawn_blocking` is not without overhead
either.
If you bounce many small messages back and forth between the Python and Rust
event loops and Rust requires GIL access, this is a pattern where moving the
code from Python to Rust will give you significant gains.
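The starvation effect can be reproduced with a pure Python analogue. The sketch below uses `asyncio` rather than the Rust runtime, with `asyncio.to_thread` (Python 3.9+) playing a role loosely similar to `tokio::task::spawn_blocking`. The blocking call is `time.sleep`, which releases the GIL, so this models a blocked event loop rather than GIL contention. All function names are illustrative.

```python
import asyncio
import time


async def heartbeat(ticks, interval=0.01):
    """Record a timestamp every `interval` seconds while the loop is free."""
    while True:
        ticks.append(time.monotonic())
        await asyncio.sleep(interval)


def blocking_work(duration=0.2):
    """Stand-in for work that never yields to the event loop."""
    time.sleep(duration)


async def ticks_during_work(offload):
    """Count heartbeat ticks recorded while blocking_work runs."""
    ticks = []
    task = asyncio.create_task(heartbeat(ticks))
    await asyncio.sleep(0)  # let the heartbeat record its first tick
    before = len(ticks)
    if offload:
        await asyncio.to_thread(blocking_work)  # runs in a worker thread
    else:
        blocking_work()  # runs on the event loop thread and stalls it
    after = len(ticks)
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return after - before
```

Running `asyncio.run(ticks_during_work(offload=False))` records no heartbeat ticks during the work, while `offload=True` lets the heartbeat continue. The offloaded variant pays for a thread hand-off, which is the kind of overhead the text attributes to `spawn_blocking`.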
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo SDK
## Introduction
Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. See [Python Bindings](./python_bindings.md).
Dynamo SDK is a layer on top of the core. It is a Python framework that makes it easy to create inference graphs and deploy them locally and onto a target K8s cluster. The SDK was heavily inspired by [BentoML's](https://github.com/bentoml/BentoML) open source deployment patterns. The Dynamo CLI is a companion tool that allows you to spin up an inference pipeline locally, containerize it, and deploy it. You can find a toy hello-world example and instructions for deploying it [here](../examples/hello_world.md).
## Installation
The SDK can be installed using pip:
```bash
pip install ai-dynamo
```
## Core Concepts
As you read about each concept, it is helpful to have the [basic example](../examples/hello_world.md) up as well so you can refer back to it.
### Defining a Service
A Service is a core building block for a project. You can think of it as a logical unit of work. For example, you might have a service responsible for preprocessing and tokenizing and another service running the model worker itself. You define a service using the `@service` decorator on a class.
```python
@service(
    dynamo={
        "namespace": "dynamo",
    },
    resources={"gpu": 2, "cpu": "10", "memory": "20Gi"},
    workers=1,
)
```
Key configuration options:
1. `dynamo`: Dictionary that defines the Dynamo configuration and enables/disables it. When enabled, a dynamo worker is created under the hood which can register with the [Distributed Runtime](../architecture/architecture.md).
2. `resources`: Dictionary defining resource requirements. The `gpu` field is used for local and remote deployment. The other fields are used to determine resources when deploying to K8s.
3. `workers`: Number of parallel instances of the service to spin up.
### Writing a Service
Let's walk through an example to understand how you write a dynamo service.
```python
import ServiceB

@service(dynamo={"namespace": "dynamo"}, resources={"gpu": 1})
class ServiceA:
    # Define service dependencies
    service_b = depends(ServiceB)

    def __init__(self, model_name="meta-llama/Llama-3.1-8B-Instruct"):
        self.model_name = model_name
        self.engine = None

    @async_on_start
    async def async_init(self):
        # Initialize resources that require async operations
        self.engine = await initialize_model_engine(self.model_name)
        print(f"ServiceA initialized with model: {self.model_name}")

    @on_shutdown
    def shutdown(self):
        # Clean up resources
        if self.engine:
            self.engine.shutdown()
            print("ServiceA engine shut down")

    @endpoint()
    async def generate(self, request: ChatCompletionRequest):
        # Call dependent service
        processed_request = await self.service_b.preprocess(request)
        # Use the engine to generate a response
        response = await self.engine.generate(processed_request)
        return response
```
#### Class-Based Architecture
Dynamo follows a class-based architecture similar to BentoML, making it intuitive for users familiar with that framework. Each service is defined as a Python class, with the following components:
1. Class attributes for dependencies using `depends()`
2. An `__init__` method for standard initialization
3. Optional lifecycle hooks like `@async_on_start` and `@on_shutdown`
4. Endpoints defined with `@endpoint()`. Optionally, an endpoint can be given a name
via `@endpoint("my_endpoint_name")`; if omitted, the name defaults to the name of the
function being decorated.
This approach provides a clean separation of concerns and makes the service structure easy to understand.
#### Service Dependencies with `depends()`
The `depends()` function is a powerful feature that lets you create a dependency between services. When you use `depends(ServiceB)`, several things happen:
1. It ensures that `ServiceB` is deployed when `ServiceA` is deployed by adding it to an internal service dependency graph
2. It creates a client to the endpoints of `ServiceB` that are being served under the hood.
3. You are able to access `ServiceB` endpoints as if they were local functions!
```python
# What happens internally when you use depends(ServiceB)
service_b = await runtime.namespace("dynamo").component("ServiceB").endpoint("preprocess").client()
# But with Dynamo SDK, you simply write:
service_b = depends(ServiceB)
# And then call methods directly:
result = await service_b.preprocess(data)
```
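To make the dependency-graph idea concrete, here is a toy model of how class-level dependency markers can be walked to find everything that must be deployed. It is a sketch, not the SDK's implementation: `Depends`, `direct_dependencies`, and `deployment_order` are hypothetical names, and the real `depends()` also creates endpoint clients, which this example leaves out.

```python
class Depends:
    """Hypothetical marker recording that a service needs another service."""

    def __init__(self, target):
        self.target = target


def direct_dependencies(service_cls):
    """Collect the targets of all Depends markers declared on the class."""
    return [value.target for value in vars(service_cls).values()
            if isinstance(value, Depends)]


def deployment_order(root):
    """Return root plus all transitive dependencies, dependencies first."""
    ordered, seen = [], set()

    def visit(cls):
        if cls in seen:
            return
        seen.add(cls)
        for dependency in direct_dependencies(cls):
            visit(dependency)
        ordered.append(cls)

    visit(root)
    return ordered


class ServiceC:
    pass


class ServiceB:
    service_c = Depends(ServiceC)


class ServiceA:
    service_b = Depends(ServiceB)
```

Here `deployment_order(ServiceA)` yields `ServiceC`, `ServiceB`, then `ServiceA`, so deploying `ServiceA` pulls in everything it depends on.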
```{note}
Through the SDK, we also provide you with a way to access the underlying bindings if you need them. Sometimes you might want to write complicated logic that requires you to create a client to another Service directly, without depending on it. To do so:
```
```python
import VllmWorker

# this runtime object gives you access to the underlying python bindings
runtime = dynamo_context["runtime"]
comp_ns, comp_name = VllmWorker.dynamo_address()  # dynamo://{namespace}/{name}
print(f"[Processor] comp_ns: {comp_ns}, comp_name: {comp_name}")
self.worker_client = (
    await runtime.namespace(comp_ns)
    .component(comp_name)
    .endpoint("generate")
    .client()
)
```
This is used in some of our prebuilt examples and is a powerful way to leverage the benefits of the SDK while being able to access Dynamo's core primitives.
#### Lifecycle Hooks
Dynamo supports key lifecycle hooks to manage service initialization and cleanup.
##### `@async_on_start`
The `@async_on_start` hook is called when the service is started and is used to run an async process outside of the main `__init__` function.
```python
@async_on_start
async def async_init(self):
    # Perfect for operations that need to be awaited
    self.db = await connect_to_db()
    self.tokenizer = await load_tokenizer()
    self.engine = await initialize_engine(self.model)
```
This is especially useful for:
- Initializing external connections
- Setting up runtime resources that require async operations
##### `@on_shutdown`
The `@on_shutdown` hook is called when the service is shut down and handles cleanup.
```python
@on_shutdown
def shutdown(self):
    # Gracefully handle shutdown / cleanup
    logger.info("worker shutting down")
```
This ensures resources are properly released, preventing memory leaks and making sure external connections are properly closed. This is helpful to clean up vLLM engines that have been started outside of the main process.
### Configuring a Service
Dynamo SDK provides a flexible configuration system that allows you to define service parameters through multiple methods:
1. Directly in the `@service` decorator
2. Through YAML configuration files
3. Via command-line arguments
4. Using environment variables
These methods can be used together with clear precedence rules, giving you fine-grained control over service configuration across different environments.
#### Configuration via Service Decorator
The most basic method is to specify parameters directly in the service decorator:
```python
@service(
    dynamo={"namespace": "prod"},
    resources={"gpu": 2, "cpu": "4", "memory": "16Gi"},
    workers=2,
)
class MyService:
    def __init__(self, model_name="deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B", temperature=0.7):
        self.model_name = model_name
        self.temperature = temperature
```
This defines static configuration values in code. Note that the constructor parameters (`model_name` and `temperature`) are also configurable values that can be overridden.
#### Configuration via YAML
For more flexible configuration, especially across environments, you can use YAML files:
```yaml
# config.yaml
MyService:
  # Override service decorator settings
  ServiceArgs:
    workers: 4
    resources:
      gpu: 4

  # Service instance parameters
  model_name: "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"
  temperature: 0.8
```
The YAML file has a hierarchical structure:
- Top level keys are service class names
- `ServiceArgs` contains parameters for the service decorator
- Other keys are passed as arguments to the service constructor
- Additional keys specific to the service can be accessed via the config system
#### Loading YAML Configuration
Use the CLI to load configuration from a YAML file:
```bash
dynamo serve service:MyService -f config.yaml
```
The configuration is parsed and stored in the `DYNAMO_SERVICE_CONFIG` environment variable, which is then passed to the service workers.
#### Configuration Precedence
When multiple configuration sources are used, they follow this precedence order (highest to lowest):
1. Command-line arguments
2. YAML configuration
3. Service decorator defaults
4. Constructor defaults
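The precedence order can be read as a layered merge in which later layers overwrite earlier ones. The following is a minimal sketch of that idea, not the SDK's actual merge logic: `resolve_config` is a hypothetical helper, and it merges only top-level keys, so a nested dictionary such as `resources` would be replaced wholesale.

```python
def resolve_config(constructor_defaults, decorator_defaults, yaml_config, cli_overrides):
    """Merge configuration layers from lowest to highest precedence."""
    merged = {}
    # Later layers win: constructor < decorator < YAML < command line.
    for layer in (constructor_defaults, decorator_defaults, yaml_config, cli_overrides):
        merged.update(layer)
    return merged


effective = resolve_config(
    constructor_defaults={"temperature": 0.7, "max_tokens": 1024},
    decorator_defaults={"workers": 1},
    yaml_config={"temperature": 0.8, "workers": 8},
    cli_overrides={"temperature": 0.9},
)
```

In this example `temperature` ends up as the command-line value, `workers` as the YAML value, and `max_tokens` keeps its constructor default because no higher layer sets it.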
#### Accessing Configuration in Services
Inside a service, you can access configuration using the `ServiceConfig` class:
```python
from dynamo.sdk.lib.config import ServiceConfig

class MyService:
    def __init__(self):
        config = ServiceConfig.get_instance()

        # Get with default value
        self.model_name = config.get("MyService", {}).get("model_name", "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B")
        self.temperature = config.get("MyService", {}).get("temperature", 0.7)

        # Require a config value (raises error if missing)
        self.api_key = config.require("MyService", "api_key")

        # Get all config for this service
        all_my_config = config.get("MyService", {})
```
#### Parsing Configuration as CLI Arguments
For services that need to extract their configuration as command-line arguments (common when integrating and validating with external libraries), the SDK provides a helper method:
```python
from dynamo.sdk.lib.config import ServiceConfig

def setup_my_lib():
    config = ServiceConfig.get_instance()

    # Get all MyService config with prefix "lib_" as CLI args
    cli_args = config.as_args("MyService", prefix="lib_")
    # Returns: ["--option1", "value1", "--flag2", "--option3", "value3"]

    # Pass to an external library's argument parser
    lib_parser = MyLibArgumentParser()
    lib_args = lib_parser.parse_args(cli_args)
    return lib_args
```
This pattern is used in the example vLLM integration:
```python
def parse_vllm_args(service_name, prefix) -> AsyncEngineArgs:
    config = ServiceConfig.get_instance()
    vllm_args = config.as_args(service_name, prefix=prefix)
    parser = FlexibleArgumentParser()

    # Add custom arguments
    parser.add_argument("--router", type=str, choices=["random", "round-robin", "kv"], default="random")
    parser.add_argument("--remote-prefill", action="store_true", default=False)

    # Add VLLM's arguments (ServiceConfig handles True defaults automatically)
    parser = AsyncEngineArgs.add_cli_args(parser)

    # Parse both custom and VLLM arguments
    args = parser.parse_args(vllm_args)

    # Convert to engine arguments
    engine_args = AsyncEngineArgs.from_cli_args(args)

    # Add custom args to the engine args
    engine_args.router = args.router
    engine_args.remote_prefill = args.remote_prefill
    return engine_args
```
#### Boolean Argument Handling
ServiceConfig uses a targeted approach for boolean arguments to maintain compatibility with different argument parsers:
1. Standard Boolean Handling:
   - `true` → outputs just the flag (e.g., `--enable-feature`)
   - `false` → omitted entirely (uses parser's default)
2. vLLM True-Default Arguments (targeted override support):
   - Automatically detects vLLM arguments that default to `True` and need explicit `false` handling
   - `enable-prefix-caching: false` → `--no-enable-prefix-caching` (negative flag)
   - `multi-step-stream-outputs: false` → `--no-multi-step-stream-outputs` (negative flag)
```yaml
# Example YAML configuration
VllmWorker:
  # Standard boolean flags (action="store_true" style)
  enforce-eager: true            # → --enforce-eager
  disable-logging: false         # → (omitted)

  # vLLM arguments with True defaults (automatically handled)
  enable-prefix-caching: false   # → --no-enable-prefix-caching

  # Non-boolean arguments
  max-model-len: 16384           # → --max-model-len 16384
```
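The conversion rules can be sketched in a few lines of Python. This is an illustration of the rules as stated here, not the code behind `ServiceConfig.as_args`: `to_cli_args` is a hypothetical helper, and instead of detecting true-default arguments automatically it takes them as an explicit `true_default_flags` set.

```python
def to_cli_args(options, true_default_flags=frozenset()):
    """Convert a mapping of option names to values into a CLI argument list."""
    args = []
    for name, value in options.items():
        flag = f"--{name}"
        if isinstance(value, bool):
            if value:
                args.append(flag)  # true: emit just the flag
            elif name in true_default_flags:
                args.append(f"--no-{name}")  # false on a true-default option
            # false on an ordinary flag: omit and let the parser default apply
        else:
            args.extend([flag, str(value)])  # non-boolean: flag plus value
    return args


example_args = to_cli_args(
    {
        "enforce-eager": True,
        "disable-logging": False,
        "enable-prefix-caching": False,
        "max-model-len": 16384,
    },
    true_default_flags={"enable-prefix-caching"},
)
```

For the YAML shown above this produces `--enforce-eager`, `--no-enable-prefix-caching`, and `--max-model-len 16384`, with `disable-logging` left out.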
#### Overriding Service Decorator with ServiceArgs
The `ServiceArgs` section in YAML configuration allows you to override any parameter in the `@service` decorator:
```yaml
MyService:
  ServiceArgs:
    dynamo:
      namespace: "staging"   # Override namespace
    resources:
      gpu: 4                 # Use more GPUs
    workers: 8               # Scale up workers
```
This is particularly useful for:
- Changing resource allocations between environments
- Modifying worker counts based on expected load
- Switching between namespaces for different deployments
Under the hood, the `DynamoService` class reads these arguments during initialization:
```python
def _get_service_args(self, service_name: str) -> Optional[dict]:
    """Get ServiceArgs from environment config if specified"""
    config_str = os.environ.get("DYNAMO_SERVICE_CONFIG")
    if config_str:
        config = json.loads(config_str)
        service_config = config.get(service_name, {})
        return service_config.get("ServiceArgs")
    return None
```
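The same lookup can be tried outside the SDK. The sketch below is a standalone variant of the method above, written as a plain function. Setting `DYNAMO_SERVICE_CONFIG` by hand stands in for what the CLI would normally do, and the configuration values are made up for illustration.

```python
import json
import os
from typing import Optional


def get_service_args(service_name: str) -> Optional[dict]:
    """Return the ServiceArgs section for a service, or None if absent."""
    config_str = os.environ.get("DYNAMO_SERVICE_CONFIG")
    if not config_str:
        return None
    config = json.loads(config_str)
    return config.get(service_name, {}).get("ServiceArgs")


# Simulate the environment variable that the CLI would populate.
os.environ["DYNAMO_SERVICE_CONFIG"] = json.dumps(
    {"MyService": {"ServiceArgs": {"workers": 8}, "temperature": 0.8}}
)

my_service_args = get_service_args("MyService")
other_service_args = get_service_args("OtherService")
```

`my_service_args` holds only the `ServiceArgs` section, and a service with no entry in the configuration yields `None`.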
#### Complete Configuration Example
Here's a comprehensive example showing how all these pieces fit together:
1. First, define your service with basic defaults:
```python
@service(
    dynamo={"namespace": "default"},
    resources={"gpu": 1},
    workers=1,
)
class LLMService:
    def __init__(self, model_name="gpt-2", temperature=0.7, max_tokens=1024):
        self.model_name = model_name
        self.temperature = temperature
        self.max_tokens = max_tokens

        # Get additional configuration
        config = ServiceConfig.get_instance()
        service_config = config.get("LLMService", {})

        # Extract service-specific parameters
        self.cache_size = service_config.get("cache_size", 1000)
        self.use_kv_cache = service_config.get("use_kv_cache", True)
```
2. Create a YAML configuration for production:
```yaml
# prod_config.yaml
LLMService:
  ServiceArgs:
    dynamo:
      namespace: "prod"
    resources:
      gpu: 4
      memory: "64Gi"
    workers: 8

  # Constructor parameters
  model_name: "llama-3-70b-instruct"
  temperature: 0.8
  max_tokens: 2048

  # Service-specific parameters
  cache_size: 10000
  use_kv_cache: true
```
3. Deploy with mixed configuration:
```bash
dynamo serve service:LLMService -f prod_config.yaml --LLMService.temperature=0.9
```
The service receives the combined configuration with the command-line value taking precedence, resulting in effective configuration of:
- `dynamo.namespace = "prod"`
- `resources.gpu = 4`
- `workers = 8`
- `model_name = "llama-3-70b-instruct"`
- `temperature = 0.9` (from CLI override)
- `max_tokens = 2048`
- `cache_size = 10000`
- `use_kv_cache = true`
#### Service Configuration Best Practices
1. **Use the Service Decorator for Defaults**: Put reasonable defaults in the service decorator
2. **Use Constructor Parameters for Runtime Options**: Parameters that might change between deployments
3. **Use YAML for Environment Configuration**: Separate configuration by environment (dev/staging/prod)
4. **Use CLI for Quick Testing**: Override specific values for experimentation
5. **Document Configuration Keys**: Make sure to document all available configuration options
Following these practices helps create flexible and maintainable Dynamo services that can be easily configured for different environments and use cases.
### Deploying a Single Service
You can deploy a single service for local development even if you have a dependency graph defined using `depends()` using `dynamo serve --service-name <ClassName> <entrypoint> <configuration arguments>`. You can see an example of this in our multinode documentation [here](../examples/multinode.md).
### Composing Services into a Graph
There are two main ways to compose services in Dynamo:
1. Use `depends()` (Recommended)
The `depends()` approach is the recommended way for production deployments:
- Automatically deploys all dependencies
- Creates a static inference graph at deployment time
- Provides type hints and better IDE support
2. Use `.link()` (Experimental)
Our `.link()` syntax is a flexible and experimental way to compose various services. Linking allows you to compose checks at runtime and view behavior. Under the hood, we are editing the dependency graph between various services. This is useful for experimentation and development, but we suggest writing a static graph for your final production deployment.
#### Understanding the `.link()` syntax
Let's take the example of a `Processor` component. This component can currently do 2 things:
1. Process a request and send it to a `Router` to decide what worker to send it to.
2. Process a request and send it to a `Worker` directly.
Consider this snippet of the Processor:
```python
class Processor(ProcessMixIn):
    """
    vLLM pre and post processing
    """

    worker = depends(VllmWorker)
    router = depends(Router)

    # logic for processing a request based on router or worker
```
Think of all the `depends()` statements as the maximal set of edges for the processor. At runtime, you may want to follow only a single path. By default, our processor spins up both the VllmWorker and Router as separate services (because `depends()` is defined for both). However, if you want to spin up only the Router, you can do this by linking the Router to the Processor, which removes the `worker` dependency from the Processor.
```python
Processor.link(Router)
```
This removes the `worker` dependency from the Processor and spins up only the Router.
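One way to picture the effect of linking is as pruning edges in a dependency graph. The toy model below is a sketch only: the source does not show how `.link()` is implemented, the `link` and `reachable` functions are hypothetical, and the graph is simplified so that `Router` has no dependencies of its own.

```python
def link(graph, source, target):
    """Return a copy of the graph in which source keeps only its edge to target."""
    if target not in graph.get(source, set()):
        raise ValueError(f"{source} does not declare a dependency on {target}")
    pruned = {node: set(edges) for node, edges in graph.items()}
    pruned[source] = {target}
    return pruned


def reachable(graph, root):
    """Return every service reachable from root, including root itself."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, set()))
    return seen


# Maximal set of edges declared with depends() on the Processor.
full_graph = {
    "Processor": {"VllmWorker", "Router"},
    "Router": set(),
    "VllmWorker": set(),
}
linked_graph = link(full_graph, "Processor", "Router")
```

Before linking, serving `Processor` reaches all three services. After linking, only `Processor` and `Router` remain reachable, and the original graph is left unchanged.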
......@@ -2,18 +2,6 @@
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo Disaggregation: Separating Prefill and Decode for Enhanced Performance
......@@ -117,78 +105,3 @@ The prefill queue and NIXL-based KV transfer design in Dynamo naturally allows r
- Add prefill worker: no explicit action needed.
- Delete prefill worker: flush engine.
### How this works under the hood
#### Auto-Discovery for new workers
In Dynamo, we use `etcd` (a distributed key-value pair store) as a way to register and discover new components. When a new decode/aggregated worker starts, it adds its endpoint information to `etcd` allowing the router to discover it and route requests to it. For the KV-cache transfer process, newly added decode workers put memory descriptors of their KV cache (used in NIXL transfer) in `etcd`. Newly added prefill workers also register with `etcd` for discovery and simply start pulling prefill requests from the global prefill queue after they spin up. Prefill workers lazy-pull the descriptors when they start serving a remote prefill request for the first time.
You can watch this happen live by running the following:
```bash
# in terminal 1 - run the disaggregated serving example
dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
```
```bash
# in terminal 2 - watch the namespace in etcd
watch -cd etcdctl get --prefix <namespace>
```
You should see something like this show up as the disaggregated serving example starts up:
```bash
# worker information
dynamo/components/PrefillWorker/mock:694d967da694ea1e
{
"component": "PrefillWorker",
"endpoint": "mock",
"namespace": "dynamo",
"lease_id": 7587886413599009310,
"transport": {
"nats_tcp": "dynamo_prefillworker_0d6df828.mock-694d967da694ea1e"
}
}
dynamo/components/Processor/chat/completions:694d967da694ea16
{
"component": "Processor",
"endpoint": "chat/completions",
"namespace": "dynamo",
"lease_id": 7587886413599009302,
"transport": {
"nats_tcp": "dynamo_processor_3816642d.chat/completions-694d967da694ea16"
}
}
dynamo/components/VllmWorker/generate:694d967da694ea1a
{
"component": "VllmWorker",
"endpoint": "generate",
"namespace": "dynamo",
"lease_id": 7587886413599009306,
"transport": {
"nats_tcp": "dynamo_vllmworker_3f6fafd3.generate-694d967da694ea1a"
}
}
dynamo/components/VllmWorker/load_metrics:694d967da694ea1a
{
"component": "VllmWorker",
"endpoint": "load_metrics",
"namespace": "dynamo",
"lease_id": 7587886413599009306,
"transport": {
"nats_tcp": "dynamo_vllmworker_3f6fafd3.load_metrics-694d967da694ea1a"
}
}
# nixl metadata
dynamo/nixl_metadata/e318db87-be55-4c18-9829-8036e1e603e2
```
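In the sample output above, each key appears to follow the layout `<namespace>/components/<component>/<endpoint>:<suffix>`, where the suffix is the `lease_id` written in hexadecimal. The following sketch parses a key under that assumption; `parse_endpoint_key` is a hypothetical helper, and the layout is inferred from the sample rather than from a documented format.

```python
def parse_endpoint_key(key):
    """Split '<namespace>/components/<component>/<endpoint>:<lease-hex>'."""
    path, _, lease_hex = key.rpartition(":")
    namespace, marker, rest = path.split("/", 2)
    if marker != "components":
        raise ValueError(f"unexpected key layout: {key!r}")
    component, _, endpoint = rest.partition("/")
    return {
        "namespace": namespace,
        "component": component,
        "endpoint": endpoint,  # may itself contain '/', e.g. chat/completions
        "lease_id": int(lease_hex, 16),
    }


info = parse_endpoint_key(
    "dynamo/components/Processor/chat/completions:694d967da694ea16"
)
print(info["component"], info["endpoint"], info["lease_id"])
# prints: Processor chat/completions 7587886413599009302
```

The decoded value matches the `lease_id` field in the JSON body of the same entry, which is what ties a registered endpoint to the lease that keeps it alive.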
#### Graceful worker shutdown
Since worker information is stored in etcd, we can shut down workers by simply revoking their etcd leases. After a lease is revoked:
- Decode/aggregated worker endpoints are immediately removed from etcd so that they no longer accept new requests. They finish any in-flight requests, shut down their engine, and exit gracefully.
- Prefill workers stop pulling from the prefill queue and exit gracefully after all pending remote KV cache writes finish.
You can also visualize this by revoking a worker's etcd lease while it has ongoing requests. Refer to this example script that does this: [revoke_lease.py](https://github.com/ai-dynamo/dynamo/blob/main/lib/bindings/python/examples/hello_world/revoke_lease.py).
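The shutdown sequence can be illustrated with an in-memory stand-in for etcd. This is a sketch under stated assumptions: `ToyRegistry` and `ToyDecodeWorker` are hypothetical classes, and the only etcd behavior modeled is that revoking a lease deletes every key attached to it.

```python
class ToyRegistry:
    """In-memory stand-in for etcd: keys live only as long as their lease."""

    def __init__(self):
        self._keys = {}  # key -> (lease_id, value)

    def put(self, key, value, lease_id):
        self._keys[key] = (lease_id, value)

    def revoke(self, lease_id):
        """Delete every key attached to the lease."""
        doomed = [k for k, (lease, _) in self._keys.items() if lease == lease_id]
        for key in doomed:
            del self._keys[key]
        return doomed

    def discover(self, prefix):
        return sorted(k for k in self._keys if k.startswith(prefix))


class ToyDecodeWorker:
    def __init__(self, registry, lease_id, in_flight):
        self.registry, self.lease_id = registry, lease_id
        self.in_flight = list(in_flight)
        self.finished = []
        key = f"dynamo/components/VllmWorker/generate:{lease_id:x}"
        registry.put(key, "endpoint-info", lease_id)

    def shutdown(self):
        # 1. Revoke the lease so the router can no longer discover this worker.
        self.registry.revoke(self.lease_id)
        # 2. Drain the requests that were accepted before the revocation.
        while self.in_flight:
            self.finished.append(self.in_flight.pop(0))


registry = ToyRegistry()
w1 = ToyDecodeWorker(registry, 0x1A, in_flight=["req-1", "req-2"])
w2 = ToyDecodeWorker(registry, 0x1B, in_flight=[])
w1.shutdown()
print(registry.discover("dynamo/components/VllmWorker/"))
print(w1.finished)
```

After `w1.shutdown()`, only the second worker remains discoverable, while the first worker still completes its two in-flight requests.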
\ No newline at end of file
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
# KV Cache Routing
This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.
http://www.apache.org/licenses/LICENSE-2.0
To enable KV cache aware routing start the frontend node like this:
```
python -m dynamo.frontend --router-mode kv
```
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
The engine announces when a KV block is created or removed. The Dynamo router finds the worker with the best match for those KV blocks and directs the traffic to that node.
>[!NOTE]
>This information is temporary and will change soon.
For performance testing, compare a typical workload with `--router-mode random|round-robin` to see if it can benefit from KV-aware routing.
# KV Cache Routing
This documentation explains how Key-Value (KV) cache routing works in Dynamo, providing optimized inference for large language models by intelligently directing requests to workers with the most relevant cached data while simultaneously load balancing based on utilization metrics sent by the workers.
The KV-aware routing arguments:
## Architecture
Dynamo's architecture consists of three key concepts:
- `--kv-overlap-score-weight`: Sets the amount of weighting on overlaps with prefix caches, which directly contributes to the prefill cost. A large weight is expected to yield a better TTFT (at the expense of worse ITL). When set to 0, prefix caches are not considered at all (falling back to pure load balancing behavior on the active blocks).
- `--router-temperature`: Sets the temperature when randomly selecting workers to route to via softmax sampling on the router cost logits. Setting it to 0 recovers the deterministic behavior where the min logit is picked.
- **Namespace**: Groups related components (similar to directories in a file system). In our examples, we use the label `dynamo`. This avoids collisions between two different dynamo graphs.
- **Component**: The deployable unit in Dynamo. Components are self-contained and typically map to separate Docker containers. In our examples, we use labels like `VllmWorker`, `Router`, and `Processor` for the components. Components can be created in Python or Rust.
- **Endpoint**: Functions attached to components that transform inputs into outputs. Endpoints are discoverable and callable by other components. In our examples we use the label `generate` for most of the endpoints.
- `--use-kv-events`: Sets whether to listen to KV events for maintaining the global view of cached blocks. If true, then we use the `KvIndexer` to listen to the block creation and deletion events. If false, `ApproxKvIndexer`, which assumes the kv cache of historical prompts exists for fixed time durations (hard-coded to 120s), is used to predict the kv cache hit ratio in each engine. Set false if your backend engine does not emit KV events.
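To make the effect of `--kv-overlap-score-weight` and `--router-temperature` concrete, here is a minimal sketch of softmax sampling over per-worker costs. The cost formula in `routing_cost` is an assumption chosen to be consistent with the descriptions above (a weight of 0 leaves only the active-block load); it is not taken from the router's source, and both function names are hypothetical.

```python
import math
import random


def routing_cost(new_prefill_blocks, active_blocks, overlap_weight):
    """Assumed cost: weighted prefill work plus current load (lower is better)."""
    return overlap_weight * new_prefill_blocks + active_blocks


def pick_worker(costs, temperature, rng=random):
    """costs maps worker_id -> cost; sample a worker via softmax on -cost."""
    if temperature == 0:
        # Deterministic behavior: always take the minimum cost.
        return min(costs, key=costs.get)
    workers = list(costs)
    low = min(costs.values())  # shift by the minimum for numerical stability
    weights = [math.exp(-(costs[w] - low) / temperature) for w in workers]
    return rng.choices(workers, weights=weights, k=1)[0]


costs = {
    "worker-a": routing_cost(new_prefill_blocks=8, active_blocks=2, overlap_weight=1.0),
    "worker-b": routing_cost(new_prefill_blocks=1, active_blocks=4, overlap_weight=1.0),
}
print(pick_worker(costs, temperature=0))    # always the cheapest worker
print(pick_worker(costs, temperature=5.0))  # may pick either worker
```

A higher temperature flattens the distribution and spreads requests more evenly, while a temperature near 0 concentrates traffic on the cheapest worker.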
A Dynamo graph is a collection of components that are linked together to form a graph. There are two paths through the graph: the request path and the response path. For LLMs the request path is single-in (a single message) and the response path is many-out (streamed output).
A common pattern is to spin up multiple of the same components that serve the same endpoints, for example, when you want to duplicate models to serve more requests. Each endpoint will get a unique identifier and you will have to tell Dynamo how to route requests between these endpoints.
## Architecture
Colloquially, we refer to a Dynamo component that serves an endpoint for LLM inference as a **worker**.
......@@ -150,32 +144,6 @@ The KVIndexer builds and maintains a global view of cached blocks in a prefix tr
The KVIndexer has a method `find_matches_for_request`, which takes in tokens and returns a dictionary with keys of worker id and values of the number of matched KV Blocks.
Example:
```python
from dynamo.llm import KvIndexer
from dynamo.sdk import dynamo_context
runtime = dynamo_context["runtime"]
kv_listener = runtime.namespace("dynamo").component("VllmWorker")
await kv_listener.create_service()
indexer = KvIndexer(kv_listener, block_size=16)
indexer.find_matches_for_request([INPUT SEQUENCE OF TOKEN IDs])
```
Sample Output:
```
{
123456789: 10,
987654321: 3,
543219876: 7,
}
```
```{note}
This example is designed to help you understand KV cache routing; it won't run outside of the context of dynamo serve. See the examples/ directory for runnable examples.
```
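The shape of the `find_matches_for_request` result can be reproduced with a small, self-contained model. This is a sketch under stated assumptions: `ToyIndexer` and `block_hashes` are hypothetical, the chained-hash scheme is a simplification chosen for illustration, and none of this reflects the internals of the real `KvIndexer`.

```python
from collections import defaultdict


def block_hashes(tokens, block_size):
    """Hash each full block together with everything before it."""
    hashes, parent = [], 0
    for start in range(0, len(tokens) - block_size + 1, block_size):
        parent = hash((parent, tuple(tokens[start:start + block_size])))
        hashes.append(parent)
    return hashes


class ToyIndexer:
    def __init__(self, block_size):
        self.block_size = block_size
        self.owners = defaultdict(set)  # prefix hash -> worker ids holding it

    def on_blocks_stored(self, worker_id, tokens):
        for h in block_hashes(tokens, self.block_size):
            self.owners[h].add(worker_id)

    def find_matches_for_request(self, tokens):
        """Return worker_id -> number of leading blocks already cached."""
        matches, candidates = {}, None
        for h in block_hashes(tokens, self.block_size):
            holders = self.owners.get(h, set())
            candidates = holders if candidates is None else candidates & holders
            if not candidates:
                break
            for worker_id in candidates:
                matches[worker_id] = matches.get(worker_id, 0) + 1
        return matches


indexer = ToyIndexer(block_size=4)
indexer.on_blocks_stored(1, list(range(12)))                # three blocks
indexer.on_blocks_stored(2, [0, 1, 2, 3, 99, 99, 99, 99])   # shares block one
print(indexer.find_matches_for_request(list(range(10))))
# prints: {1: 2, 2: 1}
```

Only full blocks are counted, so the 10-token request contributes two blocks; worker 1 has both cached and worker 2 has only the first.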
### WorkerMetricsPublisher
We added a KvMetrics Publisher which sends the following metrics to the KvMetricsAggregator:
- num_requests_waiting
......@@ -191,48 +159,3 @@ Currently, the WorkerMetricsPublisher exists as a Python binding.
### KvMetricsAggregator
The KvMetricsAggregator receives these metrics and aggregates them. It has a method `get_metrics` which returns an object of `AggregatedMetrics`.
Example:
```python
from dynamo.llm import KvMetricsAggregator
from dynamo.sdk import dynamo_context
runtime = dynamo_context["runtime"]
kv_listener = runtime.namespace("dynamo").component("VllmWorker")
await kv_listener.create_service()
metrics_aggregator = KvMetricsAggregator(kv_listener)
for endpoint in metrics_aggregator.get_metrics().endpoints:
print("Worker ID: ", endpoint.worker_id)
print("GPU Cache Usage: ", endpoint.gpu_cache_usage_perc)
print("Number of Requests Waiting: ", endpoint.num_requests_waiting)
print("GPU Prefix Cache Hit Rate: ", endpoint.gpu_prefix_cache_hit_rate)
print("***")
```
Sample Output:
```
Worker ID: 123456789
GPU Cache Usage: 0.5
Number of Requests Waiting: 2
GPU Prefix Cache Hit Rate: 0.1
***
Worker ID: 987654321
GPU Cache Usage: 0.5
Number of Requests Waiting: 1
GPU Prefix Cache Hit Rate: 0.1
***
```
```{note}
This example is for building understanding, it will not run outside of the context of dynamo serve. See the examples/ folder for runnable examples.
```
### [KV Router](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/components/kv_router.py)
The Router component makes intelligent worker selection decisions:
1. Receives incoming requests as tokens
2. Queries the KVIndexer to find potential cache hits across workers
3. Collects performance metrics from workers (via KvMetricsAggregator)
4. Uses a cost function to determine the optimal worker for each request
5. Returns chosen worker
The processor manages tokenizing the request, sending it to the KV Router and then once it receives a response, directs the request to the selected worker using direct() routing.
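The selection steps above can be sketched as a single function that combines cache matches with load metrics. The cost function here is purely illustrative and is an assumption, not the one used by the KV Router; `WorkerMetrics` and `select_worker` are hypothetical names, and the metrics for the third worker are made up for the example.

```python
from dataclasses import dataclass


@dataclass
class WorkerMetrics:
    worker_id: int
    gpu_cache_usage_perc: float
    num_requests_waiting: int


def select_worker(matches, metrics, request_blocks):
    """Pick the worker with the lowest illustrative cost.

    matches: worker_id -> matched KV blocks (as returned by an indexer)
    metrics: iterable of WorkerMetrics (as reported by an aggregator)
    """
    best_id, best_cost = None, float("inf")
    for m in metrics:
        overlap = matches.get(m.worker_id, 0) / max(request_blocks, 1)
        # Assumed cost: uncached fraction + cache pressure + queue length.
        cost = (1.0 - overlap) + m.gpu_cache_usage_perc + m.num_requests_waiting
        if cost < best_cost:
            best_id, best_cost = m.worker_id, cost
    return best_id


matches = {123456789: 10, 987654321: 3, 543219876: 7}
metrics = [
    WorkerMetrics(123456789, 0.5, 2),
    WorkerMetrics(987654321, 0.5, 1),
    WorkerMetrics(543219876, 0.2, 0),  # made-up values for illustration
]
print(select_worker(matches, metrics, request_blocks=10))
```

With these inputs the worker with the best cache overlap is not chosen, because its queue is the longest; a real cost function would weigh these terms differently.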
......@@ -19,8 +19,6 @@ If you are a **🧑‍💻 Dynamo Contributor** first follow the instructions in
You would have to rebuild the dynamo platform images as the code evolves. For more details please look at the [Cloud Guide](../guides/dynamo_deploy/dynamo_cloud.md)
Export the [Dynamo Base Image](../get_started.md#building-the-dynamo-base-image) you want to use (or built during the prerequisites step) as the `DYNAMO_IMAGE` environment variable.
```bash
export DYNAMO_IMAGE=<your-registry>/<your-image-name>:<your-tag>
```
......@@ -74,4 +72,4 @@ kubectl port-forward svc/${SERVICE_NAME}-frontend 8000:8000 -n ${NAMESPACE}
Consult the [Port Forward Documentation](https://kubernetes.io/docs/tasks/access-application-cluster/port-forward-access-application-cluster/)
More on [LLM examples](llm_deployment.md)
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Hello World: Aggregated and Disaggregated Deployment Examples
The `example` directory contains a [hello world example](../examples/hello_world.md) that implements a simplified disaggregated serving architecture used for deploying Large Language Models (LLMs). It removes the LLM related inference code and focuses on how Dynamo handles routing, task queue, and metadata communication between prefill and decode workers.
## Components
- frontend: A simple HTTP server that handles incoming requests
- processor: A pre/post-processing server that invokes the router server
- router: Handles API requests and routes them to appropriate workers based on specified strategy
- worker: A dummy decode worker
- prefill-worker: A dummy prefill worker
## Deployment Architectures
This figure shows an overview of the major components to deploy:
```
+----------------+
| prefill worker |-------+
| | |
+----------------+ | pull
v
+------+ +-----------+ +------------------+ push +---------------+
| HTTP |----->| processor |----->| decode/monolith |------------>| prefill queue |
| |<-----| |<-----| worker | | |
+------+ +-----------+ +------------------+ +---------------+
| ^
query best | | return
worker | | worker_id
| | +------------------+
| +---------| router |
+------------->| |
+------------------+
```
## The Aggregated Deployment
This example uses 2 nodes to demo aggregated serving.
- Node 1
- Runs NATS and etcd services
- Deploys Frontend, Processor and Router
- Deploys DummyWorker as the monolith worker
- Node 2
- Deploys DummyWorker as the monolith worker
### Prerequisites
On Node 1, start required services (etcd and NATS) using [Docker Compose](https://github.com/ai-dynamo/dynamo/blob/main/deploy/docker-compose.yml)
```bash
docker compose -f deploy/docker-compose.yml up -d
```
### Run the Deployment
1. Set environment variables for NATS and etcd services
```bash
export NATS_SERVER="nats://Node_1_IP_ADDRESS:4222"
export ETCD_ENDPOINTS="http://Node_1_IP_ADDRESS:2379"
```
2. Launch Frontend, Processor and Router services:
```
cd dynamo/examples/hello_world/disagg_skeleton
dynamo serve components.graph:Frontend
```
3. Open a new terminal on Node 1 and deploy Worker service
```
export NATS_SERVER="nats://Node_1_IP_ADDRESS:4222"
export ETCD_ENDPOINTS="http://Node_1_IP_ADDRESS:2379"
cd dynamo/examples/hello_world/disagg_skeleton
dynamo serve components.worker:DummyWorker
```
4. Go to Node 2 and start Worker service as in step 3.
Now you should see both workers are ready in Node 1's terminal.
5. Query the Frontend with the following two prompts. The router assigns a different worker to each prompt, which you can observe in the responses.
- `Response: {"worker_output":"Tell me a joke_GeneratedBy_NODE1HOSTNAME","request_id":"id_number"}`
- `Response: {"worker_output":"Which team won 2020 World Series_GeneratedBy_NODE2HOSTNAME","request_id":"id_number"}`
```
curl -X 'POST' \
'http://localhost:8000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Tell me a joke",
"request_id":"id_number"
}'
curl -X 'POST' \
'http://localhost:8000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Which team won 2020 World Series",
"request_id":"id_number"
}'
```
6. Then modify the prompt; prompts with similar prefixes are routed to the same worker due to the routing algorithm used in this demo. For example, the following query is routed to the worker that processed the `Tell me a joke` prompt.
```
curl -X 'POST' \
'http://localhost:8000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "Tell me a fact",
"request_id":"id_number"
}'
```
- `Response: {"worker_output":"Tell me a fact_GeneratedBy_NODE1HOSTNAME","request_id":"id_number"}`
## The Disaggregated Deployment
This example uses 3 nodes to demo the disagg serving.
- Node 1
- Runs NATS and etcd services
- Deploys Frontend and Processor
- Deploys DummyWorker as the decode worker
- Node 2
- Deploys DummyWorker as the decode worker
- Node 3
- Deploys Prefill as the prefill worker
### Run the Deployment
1. Repeat steps 1 to 4 of the aggregated deployment to deploy the Frontend, Processor, Router, and 2 Workers as decode workers.
2. Go to Node 3 and start the prefill worker.
```
export NATS_SERVER="nats://Node_1_IP_ADDRESS:4222"
export ETCD_ENDPOINTS="http://Node_1_IP_ADDRESS:2379"
cd dynamo/examples/hello_world/disagg_skeleton
dynamo serve components.prefill_worker:PrefillWorker
```
3. Query the Frontend. This time the decode workers push requests to the prefill queue, and the prefill worker pulls tasks from the queue to simulate the prefill task. The actual prefill is skipped in this demo.
```
curl -X 'POST' \
'http://localhost:8000/generate' \
-H 'accept: text/event-stream' \
-H 'Content-Type: application/json' \
-d '{
"prompt": "This is prefill disagg serving example",
"request_id":"12345"
}'
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# LLM Deployment Examples
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations.
## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## Components
- workers: Prefill and decode workers handle the actual LLM inference
- router: Handles API requests and routes them to appropriate workers based on specified strategy
- frontend: OpenAI-compatible HTTP server that handles incoming requests
## Deployment Architectures
### Aggregated
Single-instance deployment where both prefill and decode are done by the same worker.
### Disaggregated
Distributed deployment where prefill and decode are done by separate workers that can scale independently.
```mermaid
sequenceDiagram
participant D as VllmWorker
participant Q as PrefillQueue
participant P as PrefillWorker
Note over D: Request is routed to decode
D->>D: Decide if prefill should be done locally or remotely
D->>D: Allocate KV blocks
D->>Q: Put RemotePrefillRequest on the queue
P->>Q: Pull request from the queue
P-->>D: Read cached KVs from Decode
D->>D: Decode other requests
P->>P: Run prefill
P-->>D: Write prefilled KVs into allocated blocks
P->>D: Send completion notification
Note over D: Notification received when prefill is done
D->>D: Schedule decoding
```
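The hand-off in the sequence diagram can be simulated with `asyncio`. This is a toy sketch under stated assumptions: an `asyncio.Queue` stands in for the prefill queue, a dictionary stands in for the allocated KV blocks, and an `asyncio.Event` stands in for the completion notification. All function names are hypothetical and no real KV transfer takes place.

```python
import asyncio


async def decode_worker(request_id, prompt_tokens, queue, kv_store, log):
    kv_store[request_id] = None              # allocate KV blocks
    done = asyncio.Event()
    await queue.put((request_id, prompt_tokens, done))
    log.append(f"decode: queued {request_id}")
    await done.wait()                        # completion notification
    log.append(f"decode: decoding {request_id} with {kv_store[request_id]}")


async def prefill_worker(queue, kv_store, log):
    while True:
        request_id, prompt_tokens, done = await queue.get()
        # Stand-in for running prefill and writing KVs into allocated blocks.
        kv_store[request_id] = f"kv[{len(prompt_tokens)} tokens]"
        log.append(f"prefill: wrote KV for {request_id}")
        done.set()
        queue.task_done()


async def main():
    queue, kv_store, log = asyncio.Queue(), {}, []
    worker = asyncio.create_task(prefill_worker(queue, kv_store, log))
    await decode_worker("req-1", [1, 2, 3, 4], queue, kv_store, log)
    worker.cancel()                          # stop the idle prefill worker
    return log


log = asyncio.run(main())
print("\n".join(log))
```

Because the prefill worker only pulls from the queue, adding more prefill workers in this model needs no change on the decode side, which is the property the queue-based design is meant to provide.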
## Getting Started
1. Choose a deployment architecture based on your requirements
2. Configure the components as needed
3. Deploy using the provided scripts
### Prerequisites
Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml)
```bash
docker compose -f deploy/metrics/docker-compose.yml up -d
```
### Build the container image for your platform
```bash
# On an x86 machine
./container/build.sh --framework VLLM
# On an ARM machine (ex: GB200)
./container/build.sh --framework VLLM --platform linux/arm64
```
```{note}
Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to be slow and to require extensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
You can tune the number of parallel build jobs for building VLLM from source
on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
For example, on an ARM machine with low system resources:
`./container/build.sh --framework VLLM --platform linux/arm64 --build-arg VLLM_MAX_JOBS=2`
For example, on a GB200, which has a very high CPU core count and plenty of memory:
`./container/build.sh --framework VLLM --platform linux/arm64 --build-arg VLLM_MAX_JOBS=64`
When vLLM has pre-built ARM wheels published, this process can be improved.
```
### Run the container you have built
```
./container/run.sh -it --framework VLLM
```
## Run Deployment
This figure shows an overview of the major components to deploy:
```
+----------------+
+------| prefill worker |-------+
notify | | | |
finished | +----------------+ | pull
v v
+------+ +-----------+ +------------------+ push +---------------+
| HTTP |----->| processor |----->| decode/monolith |------------>| prefill queue |
| |<-----| |<-----| worker | | |
+------+ +-----------+ +------------------+ +---------------+
| ^ |
query best | | return | publish kv events
worker | | worker_id v
| | +------------------+
| +---------| kv-router |
+------------->| |
+------------------+
```
```{note}
The planner component is enabled by default for all deployment architectures but is set to no-op mode. This means the planner observes metrics but doesn't take scaling actions. To enable active scaling, you can add `--Planner.no-operation=false` to your `dynamo serve` command.
For more details, see [Planner Architecture Overview](../architecture/planner_intro.rst).
```
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Multinode Examples
## Use the Latest Release
We recommend using the latest stable release of dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## Single node sized models
You can deploy dynamo on multiple nodes via NATS/ETCD based discovery and communication. Here's an example of deploying disaggregated serving on 3 nodes using `nvidia/Llama-3.1-405B-Instruct-FP8`. Each node must be properly configured with Infiniband and/or RoCE for communication between decode and prefill workers.
##### Disaggregated Deployment with KV Routing
- Node 1: Frontend, Processor, Router, Decode Worker
- Node 2: Prefill Worker
- Node 3: Prefill Worker
Note that this can be easily extended to more nodes. You can also run the Frontend, Processor, and Router on a separate CPU-only node, as long as all nodes have access to the NATS/ETCD endpoints.
**Step 1**: Start NATS/ETCD on your head node. Ensure you have the correct firewall rules to allow communication between the nodes; the NATS/ETCD endpoints must be accessible by all other nodes.
```bash
# node 1
docker compose -f deploy/metrics/docker-compose.yml up -d
```
**Step 2**: Create the inference graph for this node. Here we use the `agg_router.py` (even though we are doing disaggregated serving) graph because we want the `Frontend`, `Processor`, `Router`, and `VllmWorker` to spin up (we spin up the other decode worker and prefill worker separately on different nodes later).
```python
# graphs/agg_router.py
Frontend.link(Processor).link(Router).link(VllmWorker)
```
**Step 3**: Create a configuration file for this node. We've provided a sample one for you in `configs/multinode-405b.yaml` for the 405B model. Note that we still include the `PrefillWorker` component in the configuration file even though we are not using it on node 1. This is because we can reuse the same configuration file on all nodes and just spin up individual workers on the other ones.
**Step 4**: Start the frontend, processor, router, and VllmWorker on node 1.
```bash
# node 1
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg_router:Frontend -f ./configs/multinode-405b.yaml
```
**Step 5**: Start the first prefill worker on node 2.
Since we only want to start the `PrefillWorker` on node 2, you can simply run just the PrefillWorker component directly with the configuration file from before.
```bash
# node 2
export NATS_SERVER='<your-nats-server-address>' # note this should start with nats://...
export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
cd $DYNAMO_HOME/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f ./configs/multinode-405b.yaml
```
**Step 6**: Start the second prefill worker on node 3.
```bash
# node 3
export NATS_SERVER='<your-nats-server-address>' # note this should start with nats://...
export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
cd $DYNAMO_HOME/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f ./configs/multinode-405b.yaml
```
**Step 7**: [Optional] Start more decode workers on other nodes
This example can be extended to more nodes as well. For example, if you want to spin up another decode worker, you can use
```bash
# node X
export NATS_SERVER='<your-nats-server-address>' # note this should start with nats://...
export ETCD_ENDPOINTS='<your-etcd-endpoints-address>'
cd $DYNAMO_HOME/examples/llm
dynamo serve components.worker:VllmWorker -f ./configs/multinode-405b.yaml --service-name VllmWorker
```
Note the use of `--service-name`. This spins up only the worker that you are requesting and ignores any `depends` statements.
###### Client
In another terminal:
```bash
# this test request has around 200 tokens isl
curl <node1-ip>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "nvidia/Llama-3.1-405B-Instruct-FP8",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": true,
"max_tokens": 300
}'
```
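Because the request sets `"stream": true`, the response arrives as a stream of server-sent events. The sketch below extracts the generated text from such a stream, assuming the common OpenAI-style chunk layout (`data: {...}` lines terminated by `data: [DONE]`). The helper name `iter_stream_text` is hypothetical, and the sample lines are hand-written for illustration rather than captured from a Dynamo deployment.

```python
import json


def iter_stream_text(lines):
    """Yield content deltas from OpenAI-style 'data: {...}' stream lines."""
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines in the assumed format.
sample = [
    'data: {"choices": [{"delta": {"role": "assistant"}}]}',
    "",
    'data: {"choices": [{"delta": {"content": "Hello"}}]}',
    'data: {"choices": [{"delta": {"content": " world"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
# prints: Hello world
```

The same function can be fed the decoded lines of a real HTTP response body, for example by iterating over the response object returned by an HTTP client of your choice.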
#### Multi-node sized models
Multinode model support is coming soon. You can track progress [here](https://github.com/ai-dynamo/dynamo/issues/513)!
##### Aggregated Deployment
The steps for aggregated deployment of multi-node sized models are similar to
those for single-node sized models, except that you first need to configure the nodes
to be interconnected according to the framework's multi-node deployment guide.
In the example below, vLLM is used as the framework to serve the `DeepSeek-R1` model
using tensor parallel 16 on two H100x8 nodes.
**Step 1**: On each of the nodes, set up a Ray cluster so that vLLM can access the resources
collectively:
```bash
# head node
ray start --head --port=6379
# example output and keep note of the IP address of the head node
# Local node IP: <head-node-address>
# set vLLM env arg
export VLLM_HOST_IP=<head-node-address>
# other node
ray start --address=<head-node-address>:6379
export VLLM_HOST_IP=<current-node-address>
# verify the accessibility by checking aggregated GPU count shown in ray status
ray status
# Expected/Sample output for 2 nodes:
# ======== Autoscaler status: 2025-04-16 15:35:42.751688 ========
# Node status
# ---------------------------------------------------------------
# Active:
# 1 node_<hash_1>
# 1 node_<hash_2>
# Pending:
# (no pending nodes)
# Recent failures:
# (no failures)
# Resources
# ---------------------------------------------------------------
# Usage:
# XXX CPU
# XXX GPU
# XXX memory
# XXX object_store_memory
# Demands:
# (no resource demands)
```
**Step 2**: On the head node, follow [LLM deployment examples](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/README.md) to
set up the Dynamo deployment for aggregated serving, using the configuration file
`configs/multinode_agg_r1.yaml` for DeepSeek-R1:
```bash
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg:Frontend -f ./configs/multinode_agg_r1.yaml
```
###### Client
In another terminal, you can send the same curl request as described above but
with `"model": "deepseek-ai/DeepSeek-R1"`
```bash
# this test request has around 200 tokens isl
curl <node1-ip>:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "deepseek-ai/DeepSeek-R1",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream": true,
"max_tokens": 300
}'
```
##### Disaggregated Deployment
In this example, we deploy two replicas of the model: one prefill worker
and one decode worker. We use four H100x8 nodes and group them in pairs,
each pair forming one Ray cluster as described in aggregated deployment.
The etcd and NATS servers, however, run on only one node,
which we treat as the head node of the whole deployment.
If you start the etcd server directly instead of using `docker compose`,
pass additional arguments so that the other nodes can discover it:
```bash
etcd --advertise-client-urls http://<head-node-ip>:2379 --listen-client-urls http://<head-node-ip>:2379,http://127.0.0.1:2379
```
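Before starting workers on the other nodes, it can help to confirm that they can reach etcd and NATS on the head node. The following is a minimal sketch that only tests TCP connectivity; it does not speak the etcd or NATS protocols. The function names are hypothetical, and the ports are the ones used in this guide (2379 for etcd, 4222 for NATS).

```python
import socket


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(head_ip, etcd_port=2379, nats_port=4222):
    """Report whether etcd and NATS on the head node accept TCP connections."""
    return {
        "etcd": is_reachable(head_ip, etcd_port),
        "nats": is_reachable(head_ip, nats_port),
    }
```

Running `check_services("<head-node-ip>")` on a non-head node returns a dictionary such as `{"etcd": True, "nats": True}` when both ports accept connections. A `False` entry for etcd often points to a missing `--listen-client-urls` entry for the head node's IP or to a firewall rule.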
**Step 1**: On each pair of nodes, set up a Ray cluster as described in
[aggregated deployment](#aggregated-deployment). You should then have
two independent Ray clusters, each with access to 16 GPUs.
**Step 2**: Start the deployment by running a different flavor of `dynamo serve`
on one node of each Ray cluster, using the configuration file
`configs/mutinode_disagg_r1.yaml`.
For decode, use the command below. This node is the entry point of
the whole deployment, so send requests to its IP address.
```bash
# if not head node
export NATS_SERVER='nats://<nats-server-ip>:4222'
export ETCD_ENDPOINTS='<etcd-endpoints-ip>:2379'
cd $DYNAMO_HOME/examples/llm
dynamo serve graphs.agg:Frontend -f ./configs/mutinode_disagg_r1.yaml
```
For prefill:
```bash
# if not head node
export NATS_SERVER='nats://<nats-server-ip>:4222'
export ETCD_ENDPOINTS='<etcd-endpoints-ip>:2379'
cd $DYNAMO_HOME/examples/llm
dynamo serve components.prefill_worker:PrefillWorker -f ./configs/mutinode_disagg_r1.yaml
```
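The two launch commands differ only in the serve target, so the setup for a non-head node can be scripted. The following is a minimal sketch with hypothetical helper names (`worker_env`, `serve_command`). It assumes that etcd and NATS both run on the head node, as described above, and reuses the environment variable formats and config path from the commands shown.

```python
import os


def worker_env(head_ip, base_env=None, nats_port=4222, etcd_port=2379):
    """Return an environment that points a worker at the head node's services."""
    env = dict(os.environ if base_env is None else base_env)
    env["NATS_SERVER"] = f"nats://{head_ip}:{nats_port}"
    env["ETCD_ENDPOINTS"] = f"{head_ip}:{etcd_port}"
    return env


def serve_command(target, config="./configs/mutinode_disagg_r1.yaml"):
    """Return the `dynamo serve` argument list for the given graph or component."""
    return ["dynamo", "serve", target, "-f", config]
```

For example, passing `serve_command("components.prefill_worker:PrefillWorker")` and `env=worker_env("<head-node-ip>")` to `subprocess.run`, with `cwd` set to `$DYNAMO_HOME/examples/llm`, would start the prefill worker with the same settings as the shell commands.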
###### Client
In another terminal, send the same curl request as described in
[aggregated deployment](#aggregated-deployment), addressed to the IP of
the decode node.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# LLM Deployment Examples using TensorRT-LLM
This directory contains examples and reference implementations for deploying Large Language Models (LLMs) in various configurations using TensorRT-LLM.
## Use the Latest Release
We recommend using the latest stable release of Dynamo to avoid breaking changes:
[![GitHub Release](https://img.shields.io/github/v/release/ai-dynamo/dynamo)](https://github.com/ai-dynamo/dynamo/releases/latest)
You can find the latest release [here](https://github.com/ai-dynamo/dynamo/releases/latest) and check out the corresponding branch with:
```bash
git checkout $(git describe --tags $(git rev-list --tags --max-count=1))
```
## Deployment Architectures
See [Deployment Architectures](llm_deployment.md#deployment-architectures) to learn about the general idea of the architecture.
Note that this TensorRT-LLM version does not support all the options yet.
```{note}
TensorRT-LLM does not support conditional disaggregation yet. You can only configure the deployment to always use aggregated serving or always use disaggregated serving.
```
## Getting Started
1. Choose a deployment architecture based on your requirements
2. Configure the components as needed
3. Deploy using the provided scripts
### Prerequisites
Start the required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml):
```bash
docker compose -f deploy/metrics/docker-compose.yml up -d
```
### Build docker
#### Step 1: Build TensorRT-LLM base container image
Because of a known C++11 ABI compatibility issue in the NGC PyTorch container, we rebuild TensorRT-LLM from source.
See the [TensorRT-LLM Linux installation instructions](https://nvidia.github.io/TensorRT-LLM/installation/linux.html) for more information.
Use the helper script to build a TensorRT-LLM base container image. The script uses a specific commit ID from the TensorRT-LLM main branch.
```bash
# TensorRT-LLM uses git-lfs, which needs to be installed in advance.
apt-get update && apt-get -y install git git-lfs
# The script uses python packages like docker-squash to squash image
# layers within trtllm base image
DEBIAN_FRONTEND=noninteractive TZ=America/Los_Angeles apt-get -y install python3 python3-pip python3-venv
./container/build_trtllm_base_image.sh
```
For more details on building from source, see the [TensorRT-LLM build instructions](https://nvidia.github.io/TensorRT-LLM/installation/build-from-source-linux.html#option-1-build-tensorrt-llm-in-one-step).
If you already have a TensorRT-LLM container image, you can skip this step.
#### Step 2: Build the Dynamo container
```bash
# On an x86 machine:
./container/build.sh --framework tensorrtllm
# On an ARM machine:
./container/build.sh --framework tensorrtllm --platform linux/arm64
```
This build script internally points to the base container image built in Step 1. If you skipped the previous step because you already have a container image available, run the build script with that image as the base:
```bash
# Build dynamo image with other TRTLLM base image.
./container/build.sh --framework TENSORRTLLM --base-image <trtllm-base-image> --base-image-tag <trtllm-base-image-tag>
```
### Run container
```bash
./container/run.sh --framework tensorrtllm -it
```
## Run Deployment
This figure shows an overview of the major components to deploy:
```
+------+ +-----------+ +------------------+ +---------------+
| HTTP |----->| processor |----->| Worker |------------>| Prefill |
| |<-----| |<-----| |<------------| Worker |
+------+ +-----------+ +------------------+ +---------------+
| ^ |
query best | | return | publish kv events
worker | | worker_id v
| | +------------------+
| +---------| kv-router |
+------------->| |
+------------------+
```
```{note}
The above architecture illustrates all the components. The final components that get spawned depend upon the chosen graph.
```
### Example architectures
#### Aggregated serving
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.agg:Frontend -f ./configs/agg.yaml
```
#### Aggregated serving with KV Routing
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.agg_router:Frontend -f ./configs/agg_router.yaml
```
#### Disaggregated serving
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.disagg:Frontend -f ./configs/disagg.yaml
```
We define `TRTLLM_USE_UCX_KVCACHE` so that TensorRT-LLM uses UCX to transfer the KV
cache between the context and generation workers.
#### Disaggregated serving with KV Routing
```bash
cd /workspace/examples/tensorrt_llm
dynamo serve graphs.disagg_router:Frontend -f ./configs/disagg_router.yaml
```
We define `TRTLLM_USE_UCX_KVCACHE` so that TensorRT-LLM uses UCX to transfer the KV
cache between the context and generation workers.
### Client
See the [client](llm_deployment.md#client) section to learn how to send requests to the deployment.
### Close deployment
See the [close deployment](../../docs/guides/dynamo_serve.md#close-deployment) section to learn how to close the deployment.
Remaining tasks:
- [x] Add support for the disaggregated serving.
- [ ] Add integration test coverage.
- [ ] Add instructions for benchmarking.
- [ ] Add multi-node support.
- [ ] Merge the code base with llm example to reduce the code duplication.
- [ ] Use processor from dynamo-llm framework.
- [ ] Enable NIXL integration with TensorRT-LLM once available. Currently, TensorRT-LLM uses UCX to transfer KV cache.
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Getting Started
## Development Environment
This section describes how to set up your development environment.
### Recommended Setup: Using Dev Container
We recommend using our pre-configured development container:
1. Install prerequisites:
- [Docker](https://www.docker.com/products/docker-desktop)
- [Visual Studio Code](https://code.visualstudio.com/)
- [Dev Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
2. Get the code:
```bash
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
```
3. Open in Visual Studio Code:
1. Launch Visual Studio Code
2. Click the button in the bottom-left corner
3. Select **Reopen in Container**
Visual Studio Code builds and starts a container with all necessary dependencies for Dynamo development.
### Alternative Setup: Manual Installation
If you don't want to use the dev container, you can set the environment up manually:
1. Ensure you have:
- Ubuntu 24.04 (recommended)
- x86_64 CPU
- Python 3.x
- Git
See [Support Matrix](support_matrix.md) for more information.
2. **If you plan to use vLLM or SGLang**, you must also install:
- etcd
- NATS.io
Before starting dynamo, run both etcd and NATS.io in separate processes.
3. Install required system packages:
```bash
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0
```
4. Set up the Python environment:
```bash
python3 -m venv venv
source venv/bin/activate
```
5. Install Dynamo:
```bash
pip install "ai-dynamo[all]"
```
> [!Important]
> To ensure compatibility, use the examples in the release branch or tag that matches the version you installed.
## Building the Dynamo Base Image
Deploying your Dynamo pipelines to Kubernetes requires you to build and push a Dynamo base image to your container registry.
You can use any private container registry of your choice, including:
- [Docker Hub](https://hub.docker.com/)
- [NVIDIA NGC Container Registry](https://catalog.ngc.nvidia.com/)
To build it:
```bash
./container/build.sh
docker tag dynamo:latest-vllm <your-registry>/dynamo-base:latest-vllm
docker login <your-registry>
docker push <your-registry>/dynamo-base:latest-vllm
```
This documentation describes these frameworks:
- `--framework vllm` build:
See [LLM Deployment Examples](examples/llm_deployment.md).
- `--framework tensorrtllm` build:
See [TRTLLM Deployment Examples](examples/trtllm.md).
After building, use this image by setting the `DYNAMO_IMAGE` environment variable to point to your built image:
```bash
export DYNAMO_IMAGE=<your-registry>/dynamo-base:latest-vllm
```
## Running and Interacting with an LLM Locally
Dynamo supports several backends, including `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.
Use the example commands below to launch a model.
### Example Command
```bash
python -m dynamo.frontend [--http-port 8080]
python -m dynamo.vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
```
```bash
? User › Hello, how are you?
✔ User · Hello, how are you?
Okay, so I'm trying to figure out how to respond to the user's greeting.
They said, "Hello, how are you?" and then followed it with "Hello! I'm just a program, but thanks for asking."
Hmm, I need to come up with a suitable reply. ...
```
## LLM Serving
Dynamo provides a simple way to spin up a local set of inference components including:
- **OpenAI-compatible Frontend**:
High-performance OpenAI-compatible HTTP API server written in Rust.
- **Basic and KV-Aware Router**:
Route and load balance traffic to a set of workers.
- **Workers**:
Set of pre-configured LLM serving engines.
To run a minimal configuration, use a pre-configured example.
### Start Dynamo Distributed Runtime Services
To start the Dynamo Distributed Runtime services the first time:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
### Start Dynamo LLM Serving Components
[Explore the VLLM Example](../components/backends/vllm/README.md)
## Local Development
If you use VS Code or Cursor, use the `.devcontainer` folder, which is built on [Microsoft's Dev Containers extension](https://code.visualstudio.com/docs/devcontainers/containers).
For instructions, see the Dynamo repository's [devcontainer README](https://github.com/ai-dynamo/dynamo/blob/main/.devcontainer/README.md).
Otherwise, to develop locally, we recommend working inside the container:
```bash
./container/build.sh
./container/run.sh -it --mount-workspace
cargo build --release
mkdir -p /workspace/deploy/dynamo/sdk/src/dynamo/sdk/cli/bin
cp /workspace/target/release/dynamo-run /workspace/deploy/dynamo/sdk/src/dynamo/sdk/cli/bin
uv pip install -e .
export PYTHONPATH=$PYTHONPATH:/workspace/deploy/dynamo/sdk/src:/workspace/components/planner/src
```
### Conda Environment
Alternatively, use a Conda environment:
```bash
conda activate <ENV_NAME>
pip install nixl # Or install https://github.com/ai-dynamo/nixl from source
cargo build --release
# To install ai-dynamo-runtime from source
cd lib/bindings/python
pip install .
cd ../../../
pip install .[all]
# To test
docker compose -f deploy/docker-compose.yml up -d
python -m dynamo.frontend [--http-port 8080]
python -m dynamo.vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
```
......@@ -137,7 +137,6 @@ This section describes how to use FluxCD for GitOps-based deployment of Dynamo i
- A Kubernetes cluster with [Dynamo Cloud](dynamo_cloud.md) installed
- [FluxCD](https://fluxcd.io/flux/installation/) installed in your cluster
- A Git repository to store your deployment configurations
- [Dynamo CLI](../../get_started.md#alternative-setup-manual-installation) installed locally
### Workflow Overview
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Planner Benchmark Example
......@@ -50,8 +38,8 @@ For other models and GPU SKUs, adjust the request rate ranges accordingly to mat
To measure the performance of dynamo with planner, we start from a 1p1d deployment and set planner to make adjustments every 10 seconds:
```bash
cd examples/llm
dynamo serve graphs.disagg_router:Frontend -f disagg_1p1d.yml
# Start Kubernetes with one frontend node, one prefill and one decode worker
# TODO
# in terminal 2
genai-perf profile \
......@@ -84,7 +72,8 @@ In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--n
```bash
# in terminal 1
dynamo serve graphs.disagg_router:Frontend -f disagg_2p2d.yml
# Start Kubernetes with one frontend node, two prefill and two decode workers
# TODO
# in terminal 2
genai-perf profile --tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B -m deepseek-ai/DeepSeek-R1-Distill-Llama-8B --service-kind openai --endpoint-type chat --url http://localhost:8000 --streaming --input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
......
......@@ -75,7 +75,6 @@ The examples below assume you build the latest image yourself from source. If us
Welcome to Dynamo <self>
Support Matrix <support_matrix.md>
Getting Started <get_started.md>
.. toctree::
:hidden:
......
......@@ -2,18 +2,6 @@
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
# Dynamo Support Matrix
......@@ -72,7 +60,6 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
| :----------------- | :------------ | :----------------------------------- | :----------- |
| ai-dynamo | 0.3.2 | >=2.28 | |
| ai-dynamo-runtime | 0.3.2 | >=2.28 (Python 3.12 has known issues)| |
| ai-dynamo-vllm | 0.8.4.post4¹ | >=2.28 (recommended) | |
| NIXL | 0.4.0 | >=2.27 | >=11.8 |
### Build Dependency
......@@ -80,13 +67,10 @@ If you are using a **GPU**, the following GPU models and architectures are suppo
| **Build Dependency** | **Version** |
| :------------------- | :------------------------------------------------------------------------------- |
| **Base Container** | [25.03](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda-dl-base/tags) |
| **ai-dynamo-vllm** | 0.8.4.post4¹ |
| **TensorRT-LLM** | 1.0.0rc² |
| **NIXL** | 0.4.0 |
> [!Important]
> ¹ ai-dynamo-vllm `v0.8.4.post4` is a customized patch of `v0.8.4` from vLLM.
>
> ² Specific versions of TensorRT-LLM supported by Dynamo are subject to change.
## Build Support
......
......@@ -178,7 +178,6 @@ impl ModelWatcher {
Some(card)
}
Err(err) => {
// `dynamo serve` isn't using MDC yet so can't be an error
tracing::info!(%err, "load_mdc did not complete");
None
}
......