Unverified Commit 0e7d4d82 authored by Kristen Kelleher, committed by GitHub

docs: DIS-133 and DIS-134 plus copyediting (#1439)


Signed-off-by: Kristen Kelleher <kkelleher@nvidia.com>
Co-authored-by: Ryan McCormick <rmccormick@nvidia.com>
parent cb71be92
......@@ -17,17 +17,9 @@ limitations under the License.
# Dynamo SDK
# Table of Contents
- [Introduction](#introduction)
- [Installation](#installation)
- [Core Concepts](#core-concepts)
- [Writing a Service](#writing-a-service)
- [Configuring a Service](#configuring-a-service)
- [Composing Services into an Graph](#composing-services-into-an-graph)
## Introduction
Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. See Python Bindings](./python_bindings.md).
Dynamo is a flexible and performant distributed inferencing solution for large-scale deployments. It is an ecosystem of tools, frameworks, and abstractions that makes the design, customization, and deployment of frontier-level models onto datacenter-scale infrastructure easy to reason about and optimized for your specific inferencing workloads. Dynamo's core is written in Rust and contains a set of well-defined Python bindings. See [Python Bindings](./python_bindings.md).
Dynamo SDK is a layer on top of the core. It is a Python framework that makes it easy to create inference graphs and deploy them locally and onto a target K8s cluster. The SDK was heavily inspired by [BentoML's](https://github.com/bentoml/BentoML) open source deployment patterns. The Dynamo CLI is a companion tool that allows you to spin up an inference pipeline locally, containerize it, and deploy it. You can find a toy hello-world example and instructions for deploying it [here](../examples/hello_world.md).
......
......@@ -20,13 +20,13 @@ limitations under the License.
Dynamo is NVIDIA's high-throughput, low-latency inference framework that's designed to serve generative AI and reasoning models in multi-node distributed environments. It's inference engine agnostic, supporting TRT-LLM, vLLM, SGLang and others, while capturing essential LLM capabilities:
- **Disaggregated prefill & decode inference** Maximizes GPU throughput and helps you balance throughput and latency
- **Dynamic GPU scheduling** Optimizes performance based on real-time demand
- **LLM-aware request routing** Eliminates unnecessary KV cache recomputation
- **Accelerated data transfer** Reduces inference response time using NIXL
- **KV cache offloading** Uses multiple memory hierarchies for higher system throughput
- **Disaggregated prefill & decode inference**: Maximizes GPU throughput and helps you balance throughput and latency
- **Dynamic GPU scheduling**: Optimizes performance based on real-time demand
- **LLM-aware request routing**: Eliminates unnecessary KV cache recomputation
- **Accelerated data transfer**: Reduces inference response time using NIXL
- **KV cache offloading**: Uses multiple memory hierarchies for higher system throughput and lower latency
Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, OSS (Open Source Software) first development approach
Built in Rust for performance and in Python for extensibility, Dynamo is fully open-source and driven by a transparent, Open Source Software (OSS)-first development approach.
## Motivation behind Dynamo
......
......@@ -61,11 +61,11 @@ The hierarchy and naming in etcd and NATS may change over time, and this documen
Dynamo uses a `Client` object to call an endpoint. When a `Client` object is created, it is given the name of the `Namespace`, `Component`, and `Endpoint`. It then sets up an etcd watcher to monitor the prefix `/services/{namespace}/{component}/{endpoint}`. The etcd watcher continuously updates the `Client` with information about the available `Endpoint`s, including each one's `lease_id` and NATS subject.
The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_routers.rs](../../lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
The user can decide which load balancing strategy to use when calling the `Endpoint` from the `Client`, which is done in [push_router.rs](../../lib/runtime/src/pipeline/network/egress/push_router.rs). Dynamo supports three load balancing strategies:
- `random`: randomly select an endpoint to hit,
- `round_robin`: select endpoints in round-robin order,
- `direct`: direct the request to a specific endpoint by specifying the `lease_id` of the endpoint.
- `random`: randomly select an endpoint to hit
- `round_robin`: select endpoints in round-robin order
- `direct`: direct the request to a specific endpoint by specifying the `lease_id` of the endpoint
After selecting which endpoint to hit, the `Client` sends the serialized request to the NATS subject of the selected `Endpoint`. The `Endpoint` receives the request and creates a TCP response stream using the connection information from the request, which establishes a direct TCP connection to the `Client`. Then, as the worker generates the response, it serializes each response chunk and sends the serialized data over the TCP connection.
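The three load balancing strategies can be illustrated with a minimal Python sketch. It is not the logic in `push_router.rs`: the `EndpointInstance` and `EndpointSelector` names and their fields are hypothetical, and the list of instances stands in for what the etcd watcher would keep up to date.

```python
import random
from dataclasses import dataclass


@dataclass
class EndpointInstance:
    """One live endpoint instance as seen by the client (hypothetical fields)."""
    lease_id: int
    nats_subject: str


class EndpointSelector:
    """Illustrative selector for the random, round_robin, and direct strategies."""

    def __init__(self, instances, seed=None):
        self._instances = list(instances)
        self._rng = random.Random(seed)
        self._next_index = 0

    def random(self):
        # Pick any available instance uniformly at random.
        return self._rng.choice(self._instances)

    def round_robin(self):
        # Cycle through the instances in a fixed order.
        instance = self._instances[self._next_index % len(self._instances)]
        self._next_index += 1
        return instance

    def direct(self, lease_id):
        # Target one specific instance by its lease_id.
        for instance in self._instances:
            if instance.lease_id == lease_id:
                return instance
        raise KeyError(f"no endpoint with lease_id {lease_id}")
```

Whichever strategy is used, the result is the instance whose NATS subject the request would be sent to.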
......@@ -77,7 +77,7 @@ We provide native rust and python (through binding) examples for basic usage of
- Python: `/lib/bindings/python/examples/`. We also provide a complete example of using `DistributedRuntime` for communication and Dynamo's LLM library for prompt templates and (de)tokenization to deploy a vllm-based service. Please refer to `lib/bindings/python/examples/hello_world/server_vllm.py` for details.
```{note}
Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to have performance issues to require exgtensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
Building a vLLM docker image for ARM machines currently involves building vLLM from source, which is known to be slow and requires extensive system RAM; see [vLLM Issue 8878](https://github.com/vllm-project/vllm/issues/8878).
You can tune the number of parallel build jobs for building vLLM from source
on ARM based on your available cores and system RAM with `VLLM_MAX_JOBS`.
......
......@@ -23,7 +23,7 @@ The design of the KVBM is inspired from vLLM and SGLang KV block managers but wi
![Internal architecture and key modules in the Dynamo KVBM. ](../images/kvbm-internal-arch.png)
**Internal architecture and key modules in the Dynamo KVBM**
#### KvBlockManager as Orchestration Layer
## KvBlockManager as Orchestration Layer
The `KvBlockManager<H, D>` acts as a coordinator across memory tiers, namely host (CPU), device (GPU), and remote, by managing per-backend block pools and exposing consistent block lifecycle APIs. It tracks KV block locations across device memory (G1), CPU memory within and across nodes (G2), local/pooled SSDs (G3), and remote storage (G4). G1-G4 are key tiers enabled by KVBM. Note that KVBM treats G4 storage as an opaque blob store and is unaware of internal layout optimizations.
......@@ -36,12 +36,12 @@ The `KvBlockManager <H, D>` acts as a coordinator across memory tiers—host (CP
Implementation-wise, `KvBlockManagerState` holds the logic: it's initialized by `KvBlockManagerConfig`, which merges runtime, model, and layout configurations. `NixlOptions` injects remote awareness.
#### Block Layout and Memory Mapping
## Block Layout and Memory Mapping
Each block is a 2D array `[num_layers][page_size × inner_dim]`. The `BlockLayout` trait abstracts the memory layout. The default implementation, `FullyContiguous`, stores all layers for all blocks in one region with alignment-aware stride computation:
```
```none
block_stride_in_bytes = align_up(num_layers × layer_stride, alignment);
```
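As a worked illustration of the stride formula, the following Python sketch assumes that `align_up` rounds a size up to the next multiple of `alignment`. The actual implementation is in Rust and may differ (for example, it may require a power-of-two alignment), and the input values below are arbitrary.

```python
def align_up(value: int, alignment: int) -> int:
    """Round value up to the next multiple of alignment (assumed semantics)."""
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    return ((value + alignment - 1) // alignment) * alignment


def block_stride_in_bytes(num_layers: int, layer_stride: int, alignment: int) -> int:
    """Bytes from the start of one block to the start of the next."""
    return align_up(num_layers * layer_stride, alignment)


# 32 layers of 4100 bytes each occupy 131200 bytes, which is padded
# up to the next multiple of 256.
print(block_stride_in_bytes(32, 4100, 256))  # 131328
```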
......@@ -55,12 +55,12 @@ Both CPU and GPU pools share this memory layout, but they use storage-specific b
Each layout is constructed using a `LayoutConfig`, and storage is either passed directly or allocated using a StorageAllocator.
#### BlockPool and Memory Pools (Active and Inactive)
## BlockPool and Memory Pools (Active and Inactive)
Each `BlockPool<T>` (where `T` is `DeviceStorage`, `PinnedStorage`, and so forth) tracks two sub-pools:
* `ActivePool`: Contains blocks currently in use by sequences
* `InactivePool`: Recycled blocks ready for allocation. Think free list.
* `InactivePool`: Recycled blocks ready for allocation; think free list
When a token block is requested (for example, `get_mutable_block()`), the allocator pops from `InactivePool`, transitions its state, and returns a writable handle. On sequence commit or eviction, the system resets blocks and returns them to the inactive pool.
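The active/inactive split can be sketched in a few lines of Python. This is a simplified model rather than the KVBM implementation: blocks are plain integer IDs, and the `SimpleBlockPool` class and its `release` method are hypothetical names.

```python
from collections import deque


class SimpleBlockPool:
    """Toy pool with an inactive free list and an active set."""

    def __init__(self, num_blocks: int):
        self._inactive = deque(range(num_blocks))  # recycled blocks, ready to allocate
        self._active = set()                       # blocks currently in use

    def get_mutable_block(self) -> int:
        # Pop from the inactive pool and mark the block active.
        if not self._inactive:
            raise RuntimeError("no free blocks available")
        block_id = self._inactive.popleft()
        self._active.add(block_id)
        return block_id

    def release(self, block_id: int) -> None:
        # On commit or eviction, return the block to the inactive pool.
        self._active.remove(block_id)
        self._inactive.append(block_id)

    @property
    def free_count(self) -> int:
        return len(self._inactive)
```

In this model, allocation and release are both constant-time operations on the free list.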
......@@ -93,7 +93,7 @@ Consider this example lifecycle of a block in the KVBM; in it, a sequence reques
5. `register()` → Block is hashed and moved to Registered. Blocks can now be used for lookup.
6. On eviction or end-of-life → `drop()` of RAII handle returns block to Reset
#### Lifecycle Management using RAII and Event Plane
## Lifecycle Management using RAII and Event Plane
The system uses RAII for memory lifecycle management. Every block holds metadata and registration state, and registration is coupled with an `EventManager`. On registration and drop:
......@@ -102,13 +102,13 @@ The system uses RAII for memory lifecycle management. Every block holds metadata
This pattern ensures consistency for shared memory tracking across workers without requiring explicit deallocation logic. The events are propagated in the Dynamo Events plane. Any Dynamo component subscribed to the events plane can listen to these changes. Note that even the storage provider can subscribe to the events plane and create an internal prefix tree representation that is tailored and optimized for the specific platform.
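Python has no RAII, but a context manager gives a rough analogue of the pattern. The sketch below assumes that registration publishes a `StoreEvent` and that dropping the handle publishes a `RemoveEvent`. `EventLog` and `registered_block` are hypothetical names, and an in-memory list stands in for the events plane.

```python
from contextlib import contextmanager


class EventLog:
    """In-memory stand-in for the events plane (illustrative only)."""

    def __init__(self):
        self.events = []

    def publish(self, kind, block_hash):
        self.events.append((kind, block_hash))


@contextmanager
def registered_block(event_log, block_hash):
    """Publish a store event on entry and a remove event on exit."""
    event_log.publish("StoreEvent", block_hash)
    try:
        yield block_hash
    finally:
        # Runs even if the body raises, so no explicit deallocation call is needed.
        event_log.publish("RemoveEvent", block_hash)
```

Because the removal is tied to leaving the scope, a subscriber sees a matching remove for every store without the caller having to remember to emit one.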
#### Remote Memory Integration using NIXL
## Remote Memory Integration using NIXL
The NIXL agent exposes remote memory buffers using `NixlBlockSet`, `RemoteBlocks`, and layout descriptors. Key operations include:
* `nixl_register()`: Registers memory region with NIXL runtime
* `serialize() / deserialize()`: Converts layout and memory into transferable descriptors
* `import_remote_blockset()`: Loads remote nodes block layouts into the manager
* `import_remote_blockset()`: Loads remote node's block layouts into the manager
* `get_remote_blocks_mutable()`: Fetches transferable memory views from another node
`RemoteBlocks` is a lightweight abstraction over shared memory for cross-node block usage (through UCX or other backends).
......@@ -173,42 +173,27 @@ The left side of the figure in [Understanding KVBM Components](#understanding-kv
The system can batch and publish registration events using a Publisher, optimizing performance under high concurrency.
#### Storage backends and pluggability
## Storage backends and pluggability
You can integrating KVBM with a storage backend by extending or wrapping `NixlEnabledStorage` to support cross-node RDMA registration. All layouts and block pools are generic over these backends, allowing for fine-grained control over memory tiers. We defer detailed integration guidance, since we collaborate with storage partners to simplify and standardize these integration paths.
You can integrate KVBM with a storage backend by extending or wrapping `NixlEnabledStorage` to support cross-node RDMA registration. All layouts and block pools are generic over these backends, allowing for fine-grained control over memory tiers. We defer detailed integration guidance, since we collaborate with storage partners to simplify and standardize these integration paths.
```
An example system architecture
+------------------------------+
|Distributed Inference engine |
+------------------------------+
|
v
+------------------------------+
| Dynamo KV Block Manager |
+------------------------------+
|
+----------------+----------------+
| |
v v
+------------------------------+ +----------------------------+
| NIXL Storage Agent | | Event Plane |
| - Volume registration | | - NATS-based Pub/Sub |
| - get()/put() abstraction | | - StoreEvent / RemoveEvent |
+------------------------------+ +----------------------------+
| |
v v
+-----------------------------+ +-----------------------------+
| G4 Storage Infrastructure | | Storage Provider Subscriber |
| (SSD, Object store, etc.) | | - Parse Events |
| - Store KV blocks | | - Build fast tree/index |
+-----------------------------+ | - Optimize G4 tiering |
+-----------------------------+
```mermaid
---
title: Example KVBM System Architecture
---
flowchart TD
A["Distributed Inference Engine"] --> B["Dynamo KV Block Manager"]
B --> C["NIXL Storage Agent<br/>- Volume registration<br/>- get()/put() abstraction"]
B --> D["Event Plane<br/>- NATS-based Pub/Sub<br/>- StoreEvent / RemoveEvent"]
C --> E["G4 Storage Infrastructure<br/>(SSD, Object store, etc.)<br/>- Store KV blocks"]
D --> F["Storage Provider Subscriber<br/>- Parse Events<br/>- Build fast tree/index<br/>- Optimize G4 tiering"]
```
For now, the following breakdown provides a high-level understanding of how KVBM interacts with external storage using the NIXL storage interface and the Dynamo Event Plane:
##### NIXL Storage Interface (for Backend Integration)
### NIXL Storage Interface (for Backend Integration)
The NIXL interface abstracts volume interaction and decouples it from mounting, metadata tracking, or direct system I/O. It provides:
......@@ -216,9 +201,9 @@ The NIXL interface abstracts volume interaction and decouples it from mounting,
* unregisterVolume(): Cleanly deregister and release volume mappings.
* get() / put(): Block-level APIs used by KVBM to fetch and store token blocks.
These abstractions allow backends to be integrated without tying into the hosts file system stack, enabling safe interaction with block devices, local filesystems, and RDMA-capable volumes. Please note that these APIs are still being finalized.
These abstractions allow backends to be integrated without tying into the host's file system stack, enabling safe interaction with block devices, local filesystems, and RDMA-capable volumes. Please note that these APIs are still being finalized.
##### Dynamo Event Plane (Pub/Sub Coordination Layer)
### Dynamo Event Plane (Pub/Sub Coordination Layer)
To support external storage optimizations without modifying KVBM logic, we provide an **event plane** built on NATS.io that emits lifecycle events for all block operations. In particular, two events are emitted.
......@@ -236,17 +221,17 @@ Each KVEvent (\~100 bytes) contains:
For scalability, the system batches and publishes these events periodically (for example, every \~10s, or dynamically based on system load).
##### A conceptual design of a storage advisor
### A conceptual design of a storage advisor
This section provides an overview for the storage provider who is interested in integrating as a custom backend to KVBM and providing optimized performance. ***Please note, this is optional and not required for KVBM to integrate with a backend.***
This section provides an overview for the storage provider who is interested in integrating as a custom backend to KVBM and providing optimized performance. **Please note, this is optional for KVBM integration with a backend.**
External storage systems are not tightly coupled with Dynamos execution pipeline. Instead, they passively observe KV block lifecycle events through a subscription model:
External storage systems are not tightly coupled with Dynamo's execution pipeline. Instead, they passively observe KV block lifecycle events through a subscription model:
* Storage volumes are pre-provisioned and mounted by the storage provider.
* These volumes are then registered with Dynamo through the NIXL Storage Agent using registerVolume() APIs. Dynamo itself does not manage mounts or provisioning.
* The Dynamo KV Block Manager interacts only with logical block-level APIs (that is, get() and put()).
* In parallel, the Event Plane asynchronously broadcasts KV lifecycle events using a NATS-based pub/sub channel.
* Storage vendors implement a lightweight subscriber process that listens to these events without interfering with the KV Managers runtime behavior.
* Storage vendors implement a lightweight subscriber process that listens to these events without interfering with the KV Manager's runtime behavior.
* This decoupling ensures that external storage systems can optimize block placement and lifecycle tracking without modifying or instrumenting the core Dynamo codebase.
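The subscriber side of this model can be sketched as a small index that is updated from batches of events. The event shape used here (a dict with `type`, `block_hash`, and `tier` keys) is an assumption for illustration only and is not the actual KVEvent schema. A real subscriber would receive the events from NATS rather than from a Python list.

```python
class BlockIndex:
    """Toy index of block locations, built from lifecycle events."""

    def __init__(self):
        self._locations = {}

    def apply(self, event):
        kind = event["type"]
        if kind == "StoreEvent":
            # Record (or update) where the block now lives.
            self._locations[event["block_hash"]] = event["tier"]
        elif kind == "RemoveEvent":
            # Forget the block; ignore removes for unknown blocks.
            self._locations.pop(event["block_hash"], None)
        else:
            raise ValueError(f"unknown event type: {kind!r}")

    def apply_batch(self, events):
        # Events arrive in batches, so apply them in order.
        for event in events:
            self.apply(event)

    def lookup(self, block_hash):
        return self._locations.get(block_hash)

    def __len__(self):
        return len(self._locations)
```

A storage vendor could replace the dictionary with a prefix tree or another structure tuned to its platform, since the index lives entirely outside the KVBM runtime.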
Now, to enable fast lookup and dynamic tiering, storage vendors may build internal data structures using the received event stream. Here is a high level conceptual design:
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
......@@ -18,21 +18,23 @@ limitations under the License.
# Planner
The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently. Currently, the planner can scale the number of vllm workers up and down based on the kv cache load and prefill queue size:
* Backend:
* local ✅
* kubernetes ✅
* LLM framework:
* vllm ✅
* tensorrt-llm ❌
* SGLang ❌
* llama.cpp ❌
* Serving type:
* Aggregated ✅
* Disaggregated ✅
* Planner actions:
* Load-based scaling up/down prefill/decode workers ✅
* SLA-based scaling up/down prefill/decode workers ✅ (with some limitations)
* Adjusting engine knobs ❌
| | | Feature |
| :---------------- | :--| :-----------------|
| **Backend** | ✅ | Local |
| | ✅ | Kubernetes |
| **LLM Framework** | ✅ | vLLM |
| | ❌ | TensorRT-LLM |
| | ❌ | SGLang |
| | ❌ | llama.cpp |
| **Serving Type** | ✅ | Aggregated |
| | ✅ | Disaggregated |
| **Planner Actions** | ✅ | Load-based scaling up/down prefill/decode workers |
| | ✅ | SLA-based scaling up/down prefill/decode workers **<sup>[1]</sup>** |
| | ✅ | Adjusting engine knobs |
**<sup>[1]</sup>** Supported with some limitations.
## Load-based Scaling Up/Down Prefill/Decode Workers
......@@ -48,6 +50,9 @@ There are two additional rules set by planner to prevent over-compensation:
1. After a new decode worker is added, since it needs time to populate the kv cache, planner doesn't scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals.
1. We do not scale up prefill workers if the prefill queue size is estimated to drop below the `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals, following the trend observed in the current adjustment interval.
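The two rules can be sketched as follows. This is one reading of the rules, not the planner's code: the function names are hypothetical, the trend is assumed to be a linear extrapolation of the change seen in the current interval, and a scale-up is assumed to be considered only when the queue is above the threshold.

```python
NEW_DECODE_WORKER_GRACE_PERIOD = 3
NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD = 3


def may_scale_down_decode(intervals_since_last_add):
    """Rule 1: block decode scale-down during the grace period.

    intervals_since_last_add is None if no decode worker was added yet.
    """
    if intervals_since_last_add is None:
        return True
    return intervals_since_last_add > NEW_DECODE_WORKER_GRACE_PERIOD


def should_scale_up_prefill(queue_at_start, queue_at_end, scale_up_threshold):
    """Rule 2: skip scale-up if the queue is already trending below the threshold."""
    if queue_at_end <= scale_up_threshold:
        return False  # queue is not above the threshold, nothing to do
    # Assumed: extrapolate this interval's change linearly over the buffer period.
    trend_per_interval = queue_at_end - queue_at_start
    projected = queue_at_end + trend_per_interval * NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD
    return projected >= scale_up_threshold
```

For example, a queue that fell from 10 to 8 during the current interval projects to 2 after three more intervals, so with a threshold of 5 no prefill worker would be added.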
For benchmarking recommendations, see the [Planner benchmark example](../../docs/guides/planner_benchmark/benchmark_planner.md).
## Comply with SLA
To ensure dynamo serve complies with the SLA, we provide a pre-deployment script to profile the model performance with different parallelization mappings and recommend the parallelization mapping for prefill and decode workers and planner configurations. To use this script, the user needs to provide the target ISL, OSL, TTFT SLA, and ITL SLA.
......@@ -55,7 +60,7 @@ To ensure dynamo serve complies with the SLA, we provide a pre-deployment script
```{note}
The script considers a fixed ISL/OSL without KV cache reuse. If the real ISL/OSL has a large variance or a significant amount of KV cache can be reused, the result might be inaccurate.
We assume there is no piggy-backed prefill requests in the decode engine. Even if there are some short piggy-backed prefill requests in the decode engine, it should not affect the ITL too much in most conditions. However, if the piggy-backed prefill requests are too much, the ITL might be inaccurate.
We assume there are no piggybacked prefill requests in the decode engine. A few short piggybacked prefill requests in the decode engine should not noticeably affect the ITL in most cases. However, if there are too many piggybacked prefill requests, the estimated ITL might be inaccurate.
```
```bash
......@@ -68,18 +73,19 @@ python -m utils.profile_sla \
--itl <target-itl-(ms)>
```
The script will first detect the number of available GPUs on the current nodes (multi-node engine not supported yet). Then, it will profile the prefill and decode performance with different TP sizes. For prefill, since there is no in-flight batching (assume isl is long enough to saturate the GPU), the script directly measures the TTFT for a request with given isl without kv-reusing. For decode, since the ITL (or iteration time) is relevant with how many requests are in-flight, the script will measure the ITL under different number of in-flight requests. The range of the number of in-flight requests is from 1 to the maximum number of requests that the kv cache of the engine can hold. To measure the ITL without being affected by piggy-backed prefill requests, the script will enable kv-reuse and warm up the engine by issuing the same prompts before measuring the ITL. Since the kv cache is sufficient for all the requests, it can hold the kv cache of the pre-computed prompts and skip the prefill phase when measuring the ITL.
The script first detects the number of available GPUs on the current nodes (multi-node engine not supported yet). Then, it profiles the prefill and decode performance with different TP sizes. For prefill, since there is no in-flight batching (assuming the ISL is long enough to saturate the GPU), the script directly measures the TTFT for a request with the given ISL without KV reuse. For decode, since the ITL (or iteration time) depends on how many requests are in flight, the script measures the ITL under different numbers of in-flight requests. The number of in-flight requests ranges from 1 to the maximum number of requests that the engine's KV cache can hold. To measure the ITL without interference from piggybacked prefill requests, the script enables KV reuse and warms up the engine by issuing the same prompts before measuring the ITL. Since the KV cache is sufficient for all the requests, it can hold the KV cache of the pre-computed prompts and skip the prefill phase when measuring the ITL.
After the profiling finishes, two plots will be generated in the `output-dir`. For example, here are the profiling results for `examples/llm/configs/disagg.yaml`:
After the profiling finishes, two plots are generated in the `output-dir`. For example, here are the profiling results for `examples/llm/configs/disagg.yaml`:
![Prefill Performance](../images/h100_prefill_performance.png)
![Decode Performance](../images/h100_decode_performance.png)
For the prefill performance, the script will plot the TTFT for different TP sizes and select the best TP size that meet the target TTFT SLA and delivers the best throughput per GPU. Based on how close the TTFT of the selected TP size is to the SLA, the script will also recommend the upper and lower bounds of the prefill queue size to be used in planner.
For the prefill performance, the script plots the TTFT for different TP sizes and selects the best TP size that meets the target TTFT SLA and delivers the best throughput per GPU. Based on how close the TTFT of the selected TP size is to the SLA, the script also recommends the upper and lower bounds of the prefill queue size to be used in planner.
For the decode performance, the script will plot the ITL for different TP sizes and different in-flight requests. Similarly, it will select the best point that satisfies the ITL SLA and delivers the best throughput per GPU and recommend the upper and lower bounds of the kv cache utilization rate to be used in planner.
For the decode performance, the script plots the ITL for different TP sizes and different in-flight requests. Similarly, it selects the best point that satisfies the ITL SLA and delivers the best throughput per GPU and recommends the upper and lower bounds of the kv cache utilization rate to be used in planner.
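The selection rule described above (keep only the points that meet the SLA, then take the one with the best throughput per GPU) can be sketched as follows. This is not the profiling script's code: `ProfilePoint` and `select_best` are hypothetical names, and the numbers in the usage example are made up for illustration, not measurements.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ProfilePoint:
    """One profiled configuration (hypothetical structure)."""
    tp_size: int
    latency_ms: float          # TTFT for prefill, ITL for decode
    throughput_per_gpu: float  # tokens/s/GPU


def select_best(points: Sequence[ProfilePoint], sla_ms: float) -> Optional[ProfilePoint]:
    """Return the SLA-compliant point with the highest throughput per GPU."""
    feasible = [p for p in points if p.latency_ms <= sla_ms]
    if not feasible:
        return None  # no configuration meets the SLA
    return max(feasible, key=lambda p: p.throughput_per_gpu)


# Made-up values: TP=1 is fastest per GPU but misses an 80 ms SLA.
points = [
    ProfilePoint(tp_size=1, latency_ms=120.0, throughput_per_gpu=900.0),
    ProfilePoint(tp_size=2, latency_ms=70.0, throughput_per_gpu=800.0),
    ProfilePoint(tp_size=4, latency_ms=45.0, throughput_per_gpu=600.0),
]
best = select_best(points, sla_ms=80.0)
print(best.tp_size)  # 2
```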
The following information will be printed out in the terminal:
The following information is printed out in the terminal:
```none
2025-05-16 15:20:24 - __main__ - INFO - Analyzing results and generate recommendations...
2025-05-16 15:20:24 - __main__ - INFO - Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for prefill queue size: 0.24/0.10
......@@ -87,7 +93,7 @@ The following information will be printed out in the terminal:
2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for decode kv cache utilization: 0.20/0.10
```
After finding the best TP size for prefill and decode, the script will then interpolate the TTFT with ISL and ITL with active KV cache and decode context length. This is to provide a more accurate estimation of the performance when ISL and OSL changes. The results will be saved to `<output_dir>/<decode/prefill>_tp<best_tp>_interpolation`.
After finding the best TP size for prefill and decode, the script interpolates the TTFT with ISL and ITL with active KV cache and decode context length. This provides a more accurate estimate of the performance when ISL and OSL change. The results are saved to `<output_dir>/<decode/prefill>_tp<best_tp>_interpolation`.
## Usage
......
......@@ -126,7 +126,8 @@ curl -X 'POST' \
"request_id":"id_number"
}'
```
-`Response: {"worker_output":"Tell me a fact_GeneratedBy_NODE1HOSTNAME","request_id":"id_number"}`
- `Response: {"worker_output":"Tell me a fact_GeneratedBy_NODE1HOSTNAME","request_id":"id_number"}`
## The Disaggregated Deployment
......
......@@ -214,7 +214,7 @@ These examples can be deployed to a Kubernetes cluster using [Dynamo Cloud](../g
### Prerequisites
You must first follow the instructions in [deploy/cloud/helm/README.md](../../deploy/cloud/helm/README.md) to install Dynamo Cloud on your Kubernetes cluster.
You must first follow the instructions in [deploy/cloud/helm/README.md](https://github.com/ai-dynamo/dynamo/blob/main/deploy/cloud/helm/README.md) to install Dynamo Cloud on your Kubernetes cluster.
```{note}
The `KUBE_NS` variable in the following steps must match the Kubernetes namespace where you installed Dynamo Cloud. You must also expose the `dynamo-store` service externally. This will be the endpoint the CLI uses to interface with Dynamo Cloud.
......
......@@ -50,7 +50,7 @@ TensorRT-LLM disaggregation does not support conditional disaggregation yet. You
Start required services (etcd and NATS) using [Docker Compose](../../deploy/metrics/docker-compose.yml)
```bash
docker compose -f deploy/docker-compose.yml up -d
docker compose -f deploy/metrics/docker-compose.yml up -d
```
### Build docker
......@@ -103,8 +103,6 @@ This build script internally points to the base container image built with step
This figure shows an overview of the major components to deploy:
```
+------+ +-----------+ +------------------+ +---------------+
......@@ -121,8 +119,9 @@ This figure shows an overview of the major components to deploy:
```
Note: The above architecture illustrates all the components. The final components
that get spawned depend upon the chosen graph.
```{note}
The above architecture illustrates all the components. The final components that get spawned depend upon the chosen graph.
```
### Example architectures
......
......@@ -54,21 +54,29 @@ If you don't want to use the dev container, you can set the environment up manua
* Python 3.x
* Git
See [Support Matrix](support_matrix.md) for more information.
See [Support Matrix](support_matrix.md) for more information.
2. Install required system packages:
2. **If you plan to use vLLM or SGLang**, you must also install:
* etcd
* NATS.io
Before starting Dynamo, run both etcd and NATS.io in separate processes.
3. Install required system packages:
```bash
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -yq python3-dev python3-pip python3-venv libucx0
```
3. Set up Python environment:
4. Set up the Python environment:
```bash
python3 -m venv venv
source venv/bin/activate
```
4. Install Dynamo:
5. Install Dynamo:
```bash
pip install "ai-dynamo[all]"
```
......@@ -79,10 +87,10 @@ To ensure compatibility, use the examples in the release branch or tag that matc
## Building the Dynamo Base Image
Although not needed for local development, deploying your Dynamo pipelines to Kubernetes requires you to build and push a Dynamo base image to your container registry. You can use any container registry of your choice, such as:
- Docker Hub (docker.io)
- NVIDIA NGC Container Registry (nvcr.io)
- Any private registry
Deploying your Dynamo pipelines to Kubernetes requires you to build and push a Dynamo base image to your container registry. You can use any private container registry of your choice, including:
- [Docker Hub](https://hub.docker.com/)
- [NVIDIA NGC Container Registry](https://catalog.ngc.nvidia.com/)
To build it:
......@@ -104,7 +112,7 @@ export DYNAMO_IMAGE=<your-registry>/dynamo-base:latest-vllm
## Running and Interacting with an LLM Locally
To run a model and interact with it locally, call `dynamo run` with a Hugging Face model. `dynamo run` supports several backends, including: `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.
To run a model and interact with it locally, call `dynamo run` with a Hugging Face model. `dynamo run` supports several backends, including `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.
### Example Command
......@@ -124,9 +132,9 @@ Hmm, I need to come up with a suitable reply. ...
Dynamo provides a simple way to spin up a local set of inference components including:
- **OpenAI Compatible Frontend**High performance OpenAI compatible http api server written in Rust.
- **Basic and Kv Aware Router**Route and load balance traffic to a set of workers.
- **Workers**Set of pre-configured LLM serving engines.
- **OpenAI-compatible Frontend**: High-performance OpenAI-compatible HTTP API server written in Rust.
- **Basic and KV-aware Router**: Route and load balance traffic to a set of workers.
- **Workers**: Set of pre-configured LLM serving engines.
To run a minimal configuration, use a pre-configured example.
......@@ -165,7 +173,7 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json"
## Local Development
If you use vscode or cursor, use the .devcontainer folder built on [Microsofts Extension](https://code.visualstudio.com/docs/devcontainers/containers). For instructions, see the [ReadMe](https://github.com/ai-dynamo/dynamo/blob/main/.devcontainer/README.md).
If you use VS Code or Cursor, use the `.devcontainer` folder built on [Microsoft's extension](https://code.visualstudio.com/docs/devcontainers/containers). For instructions, see the Dynamo repository's [devcontainer README](https://github.com/ai-dynamo/dynamo/blob/main/.devcontainer/README.md).
Otherwise, to develop locally, we recommend working inside of the container:
......
......@@ -35,7 +35,7 @@ With the Dynamo CLI, you can:
Use `run` to start an interactive chat session with a model. This command executes the `dynamo-run` Rust binary under the hood. For more details, see [Running Dynamo](dynamo_run.md).
**Example**
#### Example
```bash
dynamo run deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
```
......@@ -44,22 +44,22 @@ dynamo run deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B
Use `serve` to run your defined inference graph locally. You'll need to specify your file and intended class using the file:Class syntax. For more details, see [Serving Inference Graphs](dynamo_serve.md).
**Usage**
#### Usage
```bash
dynamo serve [SERVICE]
```
**Arguments**
#### Arguments
* `SERVICE`: Specify the service to start using file:Class syntax
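
The `file:Class` syntax can be read as a module name and a class name joined by a colon. The following is a minimal sketch of that split, not the CLI's actual parser; `ServiceTarget` and `parse_service_target` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class ServiceTarget(NamedTuple):
    """Hypothetical holder for the two halves of a file:Class target."""

    module: str
    class_name: str


def parse_service_target(target: str) -> ServiceTarget:
    """Split a target such as 'hello_world:Frontend' at the first colon."""
    module, sep, class_name = target.partition(":")
    # Reject targets with no colon or with an empty half.
    if not sep or not module or not class_name:
        raise ValueError(f"expected file:Class, got {target!r}")
    return ServiceTarget(module, class_name)


print(parse_service_target("hello_world:Frontend"))
```
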
**Flags**
#### Flags
* `--file`/`-f`: Path to optional YAML configuration file. For configuration examples, see the [SDK docs](../API/sdk.md)
* `--dry-run`: Print the dependency graph and values without starting services
* `--service-name`: Start only the specified service name
* `--working-dir`: Set the directory for finding the Service instance
* Additional flags following Class.key=value pattern are passed to the service constructor. For details, see the configuration section of the [SDK docs](../API/sdk.md)
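
To illustrate the `Class.key=value` pattern, here is a minimal sketch that groups such flags by class name. It is an assumption about how the pattern could be interpreted, not the SDK's implementation: `parse_overrides` is a hypothetical helper, the flag names in the example are made up, and values are kept as plain strings because the source does not say how the CLI converts types.

```python
from collections import defaultdict


def parse_overrides(args: list[str]) -> dict[str, dict[str, str]]:
    """Group flags shaped like --Class.key=value by their class name."""
    overrides: dict[str, dict[str, str]] = defaultdict(dict)
    for arg in args:
        # Drop leading dashes, then split at the first '=' and the first '.'.
        name, eq, value = arg.lstrip("-").partition("=")
        cls, dot, key = name.partition(".")
        if not (eq and dot and cls and key):
            raise ValueError(f"expected Class.key=value, got {arg!r}")
        overrides[cls][key] = value
    return dict(overrides)


# Hypothetical flags, shown only to demonstrate the grouping.
print(parse_overrides(["--Frontend.model=demo", "--Middle.max_tokens=16"]))
```
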
**Example**
#### Example
```bash
cd examples
# Start the Frontend, Middle, and Backend components
......@@ -73,19 +73,19 @@ dynamo serve --service-name Middle hello_world:Frontend
Use `build` to package your inference graph and its dependencies into an archive. Combine this with the `--containerize` flag to create a single Docker container for your inference graph. As with `serve`, you point toward the first service in your dependency graph. For more details, see [Serving Inference Graphs](dynamo_serve.md).
**Usage**
#### Usage
```bash
dynamo build [SERVICE]
```
**Arguments**
#### Arguments
* `SERVICE`: Specify the service to build using file:Class syntax
**Flags**
#### Flags
* `--working-dir`: Specify the directory for finding the Service instance
* `--containerize`: Choose whether to create a container from the dynamo artifact after building
**Example**
#### Example
```bash
cd examples/hello_world
dynamo build hello_world:Frontend
......@@ -95,15 +95,15 @@ dynamo build hello_world:Frontend
Use `deploy` to create a pipeline on Dynamo Cloud using either interactive prompts or a YAML configuration file. For more details, see [Deploying Inference Graphs to Kubernetes](dynamo_deploy/README.md).
**Usage**
#### Usage
```bash
dynamo deploy [PIPELINE]
```
**Arguments**
#### Arguments
* `PIPELINE`: The pipeline to deploy; defaults to *None*; required
**Flags**
#### Flags
* `--name`/`-n`: Set the deployment name. Defaults to *None*; required
* `--config-file`/`-f`: Specify the configuration file path. Defaults to *None*; required
* `--wait`/`--no-wait`: Choose whether to wait for deployment readiness. Defaults to wait
......
......@@ -166,7 +166,7 @@ spec:
## GitOps Deployment with FluxCD
This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](../../../examples/llm/README.md) to demonstrate the workflow.
This section describes how to use FluxCD for GitOps-based deployment of Dynamo inference graphs. GitOps enables you to manage your Dynamo deployments declaratively using Git as the source of truth. We'll use the [aggregated vLLM example](https://github.com/ai-dynamo/dynamo/blob/main/examples/llm/README.md) to demonstrate the workflow.
### Prerequisites
......@@ -353,9 +353,9 @@ kubectl get dynamocomponentdeployment -n $KUBE_NS
The operator is built using Kubebuilder and the operator-sdk, with the following structure:
- `controllers/` Reconciliation logic
- `api/v1alpha1/` CRD types
- `config/` Manifests and Helm charts
- `controllers/`: Reconciliation logic
- `api/v1alpha1/`: CRD types
- `config/`: Manifests and Helm charts
## References
......
......@@ -321,7 +321,6 @@ spec:
- [Fluid Documentation](https://fluid-cloudnative.github.io/)
- [Alluxio Documentation](https://docs.alluxio.io/)
- [MinIO Documentation](https://min.io/docs/)
- [HuggingFace Hub](https://huggingface.co/docs/hub/index)
- [Dynamo README](../../../README.md)
- [Dynamo Documentation](https://docs.nvidia.com/dynamo/)
- [Fluid](https://fluid-cloudnative.github.io/docs)
\ No newline at end of file
- [Hugging Face Hub](https://huggingface.co/docs/hub/index)
- [Dynamo README](https://github.com/ai-dynamo/dynamo/blob/main/README.md)
- [Dynamo Documentation](https://docs.nvidia.com/dynamo/latest/index.html)
# Running Dynamo (`dynamo run`)
This guide explains the`dynamo run` command.
This guide explains the `dynamo run` command.
`dynamo-run` is a CLI tool for exploring the Dynamo components. It's also an example of how to use components from Rust. If you use the Python wheel, it's available as `dynamo run`.
......@@ -137,7 +137,7 @@ Example 3: Different endpoints.
The KV metrics publisher in vLLM adds a `load_metrics` endpoint to the current component. If the `llama3-1-8b.backend` component above is using patched vLLM, it will also expose `llama3-1-8b.backend.load_metrics`.
Example 4: Multiple component in a pipeline
Example 4: Multiple components in a pipeline.
In the P/D disaggregated setup, you would have `deepseek-distill-llama8b.prefill.generate` (possibly multiple instances of this) and `deepseek-distill-llama8b.decode.generate`.
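
The endpoint names in these examples appear to follow a three-part, dot-separated shape, which this guide's examples suggest is namespace, component, and endpoint. A minimal sketch under that assumption follows; `split_endpoint` is a hypothetical helper, not part of Dynamo.

```python
def split_endpoint(path: str) -> tuple[str, str, str]:
    """Split 'namespace.component.endpoint' into its three parts."""
    parts = path.split(".")
    # Assumption: exactly three non-empty, dot-separated segments.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace.component.endpoint, got {path!r}")
    namespace, component, endpoint = parts
    return namespace, component, endpoint


for name in (
    "deepseek-distill-llama8b.prefill.generate",
    "deepseek-distill-llama8b.decode.generate",
):
    print(split_endpoint(name))
```
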
......@@ -349,7 +349,7 @@ Pass `--tensor-parallel-size <NUM-GPUS>` to `dynamo-run`.
dynamo-run out=sglang ~/llms/Llama-4-Scout-17B-16E-Instruct/ --tensor-parallel-size 8
```
To specify which GPU to start from pass `--base-gpu-id <num>`, for example on a shared eight GPU machine where GPUs 0-3 are already in use:
To specify the GPU to start from, pass `--base-gpu-id <num>`; for example, on a shared eight GPU machine where GPUs 0-3 are already in use:
```
dynamo-run out=sglang <model> --tensor-parallel-size 4 --base-gpu-id 4
```
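
As a rough illustration of how the two flags interact, the sketch below lists the GPU IDs a run would occupy, assuming the engine allocates contiguous IDs starting at the base ID. That allocation rule is an assumption made for this example, and `gpu_ids` is a hypothetical helper.

```python
def gpu_ids(tensor_parallel_size: int, base_gpu_id: int = 0) -> list[int]:
    """Return the GPU IDs used, assuming contiguous allocation from the base ID."""
    if tensor_parallel_size < 1 or base_gpu_id < 0:
        raise ValueError("tensor_parallel_size must be >= 1 and base_gpu_id >= 0")
    return list(range(base_gpu_id, base_gpu_id + tensor_parallel_size))


# Mirrors the command above: four GPUs starting at ID 4.
print(gpu_ids(4, base_gpu_id=4))
```

Under this assumption, the command above would leave GPUs 0-3 untouched for the other users of the machine.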
......
......@@ -19,6 +19,17 @@ Welcome to NVIDIA Dynamo
The NVIDIA Dynamo Platform is a high-performance, low-latency inference framework designed to serve all AI models—across any framework, architecture, or deployment scale.
.. admonition:: 💎 Discover the latest developments!
:class: seealso
This guide is a snapshot of the `Dynamo GitHub Repository <https://github.com/ai-dynamo/dynamo>`_ for a specific release. For the latest information and examples, see:
- `Dynamo README <https://github.com/ai-dynamo/dynamo/blob/main/README.md>`_
- `Architecture and features doc <https://github.com/ai-dynamo/dynamo/blob/main/docs/architecture/>`_
- `Usage guides <https://github.com/ai-dynamo/dynamo/tree/main/docs/guides>`_
- `Dynamo examples repo <https://github.com/ai-dynamo/examples>`_
Dive in: Examples
-----------------
......@@ -100,7 +111,6 @@ and is driven by a transparent development approach. Check out our repo at https
Writing Python Workers in Dynamo <guides/backend.md>
Disaggregation and Performance Tuning <guides/disagg_perf_tuning.md>
KV Cache Router Performance Tuning <guides/kv_router_perf_tuning.md>
Planner Benchmark Example <guides/planner_benchmark/benchmark_planner.md>
Working with Dynamo Kubernetes Operator <guides/dynamo_deploy/dynamo_operator.md>
.. toctree::
......@@ -113,6 +123,13 @@ and is driven by a transparent development approach. Check out our repo at https
Minikube Setup Guide <guides/dynamo_deploy/minikube.md>
Model Caching with Fluid <guides/dynamo_deploy/model_caching_with_fluid.md>
.. toctree::
:hidden:
:caption: Benchmarking
Planner Benchmark Example <guides/planner_benchmark/benchmark_planner.md>
.. toctree::
:hidden:
:caption: API
......
<!--
SPDX-FileCopyrightText: Copyright (c) 2024-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
Licensed under the Apache License, Version 2.0 (the "License");
......
......@@ -50,7 +50,7 @@ maturin develop --uv
### Prerequisite
See [README.md](../../../docs/../../docs/runtime/README.md).
See [README.md](../../../docs/runtime/README.md#prerequisites).
### Hello World Example
......