Commit 6f8c68c1 authored by J Wyman, committed by GitHub

docs: Cleanup & Standardize Guides (#1357)

parent 1da05309
......@@ -17,21 +17,22 @@ limitations under the License.
# Planner
The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently. Currently, the planner can scale the number of vllm workers up and down based on the kv cache load and prefill queue size:
| | | Feature |
| :---------------- | :--| :-----------------|
| **Backend** | ✅ | Local |
| | ✅ | Kubernetes |
| **LLM Framework** | ✅ | vLLM |
| | ❌ | TensorRT-LLM |
| | ❌ | SGLang |
| | ❌ | llama.cpp |
| **Serving Type** | ✅ | Aggregated |
| | ✅ | Disaggregated |
| **Planner Actions** | ✅ | Load-based scaling up/down prefill/decode workers |
The planner monitors the state of the system and adjusts workers to ensure that the system runs efficiently.
Currently, the planner can scale the number of vLLM workers up and down based on the KV cache load and prefill queue size:
| | | Feature |
| :------------------ | - | :------------------------------------------------------------------ |
| **Backend** | ✅ | Local |
| | ✅ | Kubernetes |
| **LLM Framework** | ✅ | vLLM |
| | ❌ | TensorRT-LLM |
| | ❌ | SGLang |
| | ❌ | llama.cpp |
| **Serving Type** | ✅ | Aggregated |
| | ✅ | Disaggregated |
| **Planner Actions** | ✅ | Load-based scaling up/down prefill/decode workers |
| | ✅ | SLA-based scaling up/down prefill/decode workers **<sup>[1]</sup>** |
| | ✅ | Adjusting engine knobs |
| | ✅ | Adjusting engine knobs |
**<sup>[1]</sup>** Supported with some limitations.
......@@ -39,29 +40,48 @@ The planner monitors the state of the system and adjusts workers to ensure that
## Load-based Scaling Up/Down Prefill/Decode Workers
To adjust the number of prefill/decode workers, planner monitors the following metrics:
* Prefill worker: planner monitors the number of requests pending in the prefill queue to estimate the prefill workload.
* Decode/aggregated worker: planner monitors the average KV cache utilization rate to estimate the decode/aggregated workload.
Every `metric-pulling-interval`, planner gathers the aforementioned metrics. Every `adjustment-interval`, planner compares the aggregated metrics in this interval with pre-set thresholds and decide to scale up/down prefill/decode workers. To avoid over-compensation, planner only changes the number of workers by 1 in one adjustment interval. In addition, when the number of workers is being adjusted, the planner blocks the metric pulling and adjustment.
Every `metric-pulling-interval`, planner gathers the aforementioned metrics.
Every `adjustment-interval`, planner compares the aggregated metrics in this interval with pre-set thresholds and decides whether to scale prefill/decode workers up or down.
To avoid over-compensation, planner only changes the number of workers by 1 in one adjustment interval.
In addition, when the number of workers is being adjusted, the planner blocks the metric pulling and adjustment.
To scale up a prefill/decode worker, planner just need to launch the worker in the correct namespace. The auto-discovery mechanism picks up the workers and add them to the routers. To scale down a prefill worker, planner send a SIGTERM signal to the prefill worker. The prefill worker store the signal and exit when it finishes the current request pulled from the prefill queue. This ensures that no remote prefill request is dropped. To scale down a decode worker, planner revokes the etcd lease of the decode worker. When the etcd lease is revoked, the corresponding decode worker is immediately removed from the router and won't get any new requests. The decode worker then finishes all the current requests in their original stream and exits gracefully.
To scale up a prefill/decode worker, planner only needs to launch the worker in the correct namespace.
The auto-discovery mechanism picks up the new workers and adds them to the routers.
To scale down a prefill worker, planner sends a SIGTERM signal to the prefill worker.
The prefill worker stores the signal and exits once it finishes the current request pulled from the prefill queue.
This ensures that no remote prefill request is dropped.
To scale down a decode worker, planner revokes the etcd lease of the decode worker.
When the etcd lease is revoked, the corresponding decode worker is immediately removed from the router and won't get any new requests.
The decode worker then finishes all the current requests in their original stream and exits gracefully.
There are two additional rules set by planner to prevent over-compensation:
1. After a new decode worker is added, since it needs time to populate the kv cache, planner doesn't scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals.
1. We do not scale up prefill worker if the prefill queue size is estimated to reduce below the `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals following the trend observed in the current adjustment interval.
1. After a new decode worker is added, since it needs time to populate the kv cache,
planner doesn't scale down the number of decode workers in the next `NEW_DECODE_WORKER_GRACE_PERIOD=3` adjustment intervals.
2. Planner does not scale up prefill workers if the prefill queue size is projected to drop below the `--prefill-queue-scale-up-threshold` within the next `NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD=3` adjustment intervals, following the trend observed in the current adjustment interval.
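The decision logic described in this section can be sketched roughly as follows. This is a minimal illustration, not the planner's actual implementation: the `Thresholds` class, the `decide_decode` and `decide_prefill` functions, and the linear trend extrapolation are all assumptions made for the example. Only the threshold defaults and the two constants come from the documentation above.

```python
from dataclasses import dataclass
from statistics import mean

# Constants named in the text above (units: adjustment intervals).
NEW_DECODE_WORKER_GRACE_PERIOD = 3
NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD = 3


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container; defaults mirror the documented planner options."""
    decode_kv_scale_up: float = 0.9
    decode_kv_scale_down: float = 0.5
    prefill_queue_scale_up: float = 0.5
    prefill_queue_scale_down: float = 0.2


def decide_decode(kv_samples, intervals_since_scale_up, t=Thresholds()):
    """Return +1, -1, or 0: the change in decode workers for this interval."""
    avg_kv = mean(kv_samples)
    if avg_kv > t.decode_kv_scale_up:
        return 1
    # Rule 1: a freshly added decode worker needs time to fill its KV cache,
    # so scale-down is suppressed during the grace period.
    in_grace = intervals_since_scale_up < NEW_DECODE_WORKER_GRACE_PERIOD
    if avg_kv < t.decode_kv_scale_down and not in_grace:
        return -1
    return 0


def decide_prefill(queue_samples, t=Thresholds()):
    """Return +1, -1, or 0: the change in prefill workers for this interval."""
    avg_queue = mean(queue_samples)
    if avg_queue > t.prefill_queue_scale_up:
        # Rule 2 (assumed linear extrapolation): if the trend observed in this
        # interval would bring the queue below the threshold soon, do nothing.
        trend = queue_samples[-1] - queue_samples[0]
        predicted = queue_samples[-1] + trend * NEW_PREFILL_WORKER_QUEUE_BUFFER_PERIOD
        return 0 if predicted < t.prefill_queue_scale_up else 1
    if avg_queue < t.prefill_queue_scale_down:
        return -1
    return 0
```

Both functions return at most one worker of change per call, which matches the rule that planner adjusts the worker count by 1 per adjustment interval.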
For benchmarking recommendations, see the [Planner benchmark example](../../docs/guides/planner_benchmark/benchmark_planner.md).
## Comply with SLA
To ensure dynamo serve complies with the SLA, we provide a pre-deployment script to profile the model performance with different parallelization mappings and recommend the parallelization mapping for prefill and decode workers and planner configurations. To use this script, the user needs to provide the target ISL, OSL, TTFT SLA, and ITL SLA.
To ensure `dynamo serve` complies with the SLA, we provide a pre-deployment script that profiles model performance under different parallelization mappings and recommends a parallelization mapping for prefill and decode workers, along with planner configurations.
To use this script, the user needs to provide the target ISL, OSL, TTFT SLA, and ITL SLA.
```{note}
The script considers a fixed ISL/OSL without KV cache reuse. If the real ISL/OSL has a large variance or a significant amount of KV cache can be reused, the result might be inaccurate.
We assume there are no piggybacked prefill requests in the decode engine. Even if there are some short piggybacked prefill requests in the decode engine, it should not affect the ITL in most cases. However, if the piggybacked prefill requests are too much, the ITL might be inaccurate.
```
> [!Note]
> The script considers a fixed ISL/OSL without KV cache reuse.
> If the real ISL/OSL has a large variance or a significant amount of KV cache can be reused, the result might be inaccurate.
>
> We assume there are no piggybacked prefill requests in the decode engine.
> Even if there are some short piggybacked prefill requests in the decode engine, it should not affect the ITL in most cases.
> However, if there are too many piggybacked prefill requests, the ITL estimate might be inaccurate.
```bash
python -m utils.profile_sla \
......@@ -73,19 +93,30 @@ python -m utils.profile_sla \
--itl <target-itl-(ms)>
```
The script first detects the number of available GPUs on the current nodes (multi-node engine not supported yet). Then, it profiles the prefill and decode performance with different TP sizes. For prefill, since there is no in-flight batching (assume isl is long enough to saturate the GPU), the script directly measures the TTFT for a request with given isl without kv-reuse. For decode, since the ITL (or iteration time) is relevant to how many requests are in-flight, the script measures the ITL under a different number of in-flight requests. The range of the number of in-flight requests is from 1 to the maximum number of requests that the kv cache of the engine can hold. To measure the ITL without being affected by piggybacked prefill requests, the script enables kv-reuse and warm up the engine by issuing the same prompts before measuring the ITL. Since the kv cache is sufficient for all the requests, it can hold the kv cache of the pre-computed prompts and skip the prefill phase when measuring the ITL.
The script first detects the number of available GPUs on the current nodes (multi-node engine not supported yet).
Then, it profiles the prefill and decode performance with different TP sizes.
For prefill, since there is no in-flight batching (assuming the ISL is long enough to saturate the GPU), the script directly measures the TTFT for a request with the given ISL without KV reuse.
For decode, since the ITL (or iteration time) depends on how many requests are in flight, the script measures the ITL under different numbers of in-flight requests.
The number of in-flight requests ranges from 1 to the maximum number of requests that the KV cache of the engine can hold.
To measure the ITL without being affected by piggybacked prefill requests, the script enables KV reuse and warms up the engine by issuing the same prompts before measuring the ITL.
Since the kv cache is sufficient for all the requests, it can hold the kv cache of the pre-computed prompts and skip the prefill phase when measuring the ITL.
After the profiling finishes, two plots are generated in the `output-dir`. For example, here are the profiling results for `examples/llm/configs/disagg.yaml`:
After the profiling finishes, two plots are generated in the `output-dir`.
For example, here are the profiling results for `examples/llm/configs/disagg.yaml`:
![Prefill Performance](../images/h100_prefill_performance.png)
![Decode Performance](../images/h100_decode_performance.png)
For the prefill performance, the script plots the TTFT for different TP sizes and selects the best TP size that meets the target TTFT SLA and delivers the best throughput per GPU. Based on how close the TTFT of the selected TP size is to the SLA, the script also recommends the upper and lower bounds of the prefill queue size to be used in planner.
For the prefill performance, the script plots the TTFT for different TP sizes and selects the best TP size that meets the target TTFT SLA and delivers the best throughput per GPU.
Based on how close the TTFT of the selected TP size is to the SLA, the script also recommends the upper and lower bounds of the prefill queue size to be used in planner.
For the decode performance, the script plots the ITL for different TP sizes and different in-flight requests. Similarly, it selects the best point that satisfies the ITL SLA and delivers the best throughput per GPU and recommends the upper and lower bounds of the kv cache utilization rate to be used in planner.
For the decode performance, the script plots the ITL for different TP sizes and different in-flight requests.
Similarly, it selects the best point that satisfies the ITL SLA and delivers the best throughput per GPU
and recommends the upper and lower bounds of the kv cache utilization rate to be used in planner.
The following information is printed out in the terminal:
```none
```text
2025-05-16 15:20:24 - __main__ - INFO - Analyzing results and generate recommendations...
2025-05-16 15:20:24 - __main__ - INFO - Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for prefill queue size: 0.24/0.10
......@@ -93,11 +124,16 @@ The following information is printed out in the terminal:
2025-05-16 15:20:24 - __main__ - INFO - Suggested planner upper/lower bound for decode kv cache utilization: 0.20/0.10
```
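The prefill selection step described above (meet the TTFT SLA, then maximize throughput per GPU) can be sketched as follows. This is an illustrative sketch only: `PrefillProfile` and `pick_prefill_tp` are hypothetical names, and the TP1 and TP2 data points are invented for the example. Only the TP4 point is taken from the log output above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrefillProfile:
    """One profiled configuration (hypothetical structure for this sketch)."""
    tp: int
    ttft_ms: float
    tokens_per_s_per_gpu: float


def pick_prefill_tp(profiles, ttft_sla_ms):
    """Return the SLA-compliant profile with the best per-GPU throughput, or None."""
    feasible = [p for p in profiles if p.ttft_ms <= ttft_sla_ms]
    if not feasible:
        return None
    return max(feasible, key=lambda p: p.tokens_per_s_per_gpu)


profiles = [
    PrefillProfile(tp=1, ttft_ms=150.0, tokens_per_s_per_gpu=20000.0),  # invented
    PrefillProfile(tp=2, ttft_ms=80.0, tokens_per_s_per_gpu=18000.0),   # invented
    PrefillProfile(tp=4, ttft_ms=48.37, tokens_per_s_per_gpu=15505.23),  # from the log
]

best = pick_prefill_tp(profiles, ttft_sla_ms=50.0)
print(f"Suggested prefill TP:{best.tp}")
```

With a looser SLA, a smaller TP size with higher per-GPU throughput becomes feasible, which is why the script filters on the SLA first and only then ranks by throughput.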
After finding the best TP size for prefill and decode, the script interpolates the TTFT with ISL and ITL with active KV cache and decode context length. This is to provide a more accurate estimation of the performance when ISL and OSL changes. The results are saved to `<output_dir>/<decode/prefill>_tp<best_tp>_interpolation`.
After finding the best TP size for prefill and decode, the script interpolates the TTFT with ISL and ITL with active KV cache and decode context length.
This provides a more accurate estimate of the performance when the ISL and OSL change.
The results are saved to `<output_dir>/<decode/prefill>_tp<best_tp>_interpolation`.
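To illustrate the idea of interpolating TTFT over ISL, here is a minimal sketch that uses piecewise-linear interpolation between profiled points. The interpolation method the script actually uses is not stated here, so the linear scheme, the clamping at the range boundaries, the function name `interpolate_ttft`, and the sample data are all assumptions.

```python
from bisect import bisect_left


def interpolate_ttft(isl_points, ttft_points, isl):
    """Piecewise-linear TTFT estimate; clamps outside the profiled ISL range.

    isl_points must be sorted in ascending order and aligned with ttft_points.
    """
    if isl <= isl_points[0]:
        return ttft_points[0]
    if isl >= isl_points[-1]:
        return ttft_points[-1]
    hi = bisect_left(isl_points, isl)  # first profiled ISL >= isl
    lo = hi - 1
    x0, x1 = isl_points[lo], isl_points[hi]
    y0, y1 = ttft_points[lo], ttft_points[hi]
    return y0 + (y1 - y0) * (isl - x0) / (x1 - x0)


# Invented profiling results: ISL in tokens, TTFT in milliseconds.
isl_points = [1000, 2000, 4000]
ttft_points = [20.0, 40.0, 90.0]

print(interpolate_ttft(isl_points, ttft_points, 3000))  # 65.0
```

The ITL interpolation over active KV cache and decode context length would need a two-dimensional version of the same idea, which is omitted here.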
## Usage
`dynamo serve` automatically starts the planner. Configure it through YAML files or command-line arguments:
`dynamo serve` automatically starts the planner.
Configure it through YAML files or command-line arguments:
Usage:
```bash
# YAML configuration
......@@ -109,55 +145,105 @@ Planner:
no-operation: false
log-dir: log/planner
# Command-line configuration
dynamo serve graphs.disagg:Frontend -f disagg.yaml --Planner.environment=local --Planner.no-operation=false --Planner.log-dir=log/planner
# Configure the planner through CLI arguments
dynamo serve graphs.disagg:Frontend \
-f disagg.yaml \
--Planner.environment=local \
--Planner.no-operation=false \
--Planner.log-dir=log/planner
```
Configuration options:
* `namespace` (str, default: "dynamo"): Target namespace for planner operations
* `environment` (str, default: "local"): Target environment (local, kubernetes)
* `no-operation` (bool, default: false): Run in observation mode only
* `log-dir` (str, default: None): Tensorboard log directory
* `adjustment-interval` (int, default: 30): Seconds between adjustments
* `metric-pulling-interval` (int, default: 1): Seconds between metric pulls
* `max-gpu-budget` (int, default: 8): Maximum GPUs for all workers
* `min-gpu-budget` (int, default: 1): Minimum GPUs per worker type
* `decode-kv-scale-up-threshold` (float, default: 0.9): KV cache threshold for scale-up
* `decode-kv-scale-down-threshold` (float, default: 0.5): KV cache threshold for scale-down
* `prefill-queue-scale-up-threshold` (float, default: 0.5): Queue threshold for scale-up
* `prefill-queue-scale-down-threshold` (float, default: 0.2): Queue threshold for scale-down
* `decode-engine-num-gpu` (int, default: 1): GPUs per decode engine
* `prefill-engine-num-gpu` (int, default: 1): GPUs per prefill engine
The planner accepts the following options:
- `namespace` (str, default: "dynamo"):
Namespace planner will look at
- `environment` (str, default: "local"):
Environment to run the planner in (local, kubernetes)
- `no-operation` (bool, default: false):
Do not make any adjustments, just observe the metrics and log to tensorboard
- `log-dir` (str, default: None):
Tensorboard logging directory
- `adjustment-interval` (int, default: 30):
Interval in seconds between scaling adjustments
- `metric-pulling-interval` (int, default: 1):
Interval in seconds between metric pulls
- `max-gpu-budget` (int, default: 8):
Maximum number of GPUs to use; planner will not scale up beyond this number of GPUs for prefill plus decode workers
- `min-gpu-budget` (int, default: 1):
Minimum number of GPUs to use; planner will not scale down below this number of GPUs for prefill or decode workers
- `decode-kv-scale-up-threshold` (float, default: 0.9):
KV cache utilization threshold to scale up decode workers
- `decode-kv-scale-down-threshold` (float, default: 0.5):
KV cache utilization threshold to scale down decode workers
- `prefill-queue-scale-up-threshold` (float, default: 0.5):
Queue utilization threshold to scale up prefill workers
- `prefill-queue-scale-down-threshold` (float, default: 0.2):
Queue utilization threshold to scale down prefill workers
- `decode-engine-num-gpu` (int, default: 1):
Number of GPUs per decode engine
- `prefill-engine-num-gpu` (int, default: 1):
Number of GPUs per prefill engine
Run as standalone process:
```bash
PYTHONPATH=/workspace/examples/llm python components/planner.py --namespace=dynamo --served-model-name=vllm --no-operation --log-dir=log/planner
PYTHONPATH=/workspace/examples/llm python components/planner.py \
--namespace=dynamo \
--served-model-name=vllm \
--no-operation \
--log-dir=log/planner
```
Monitor metrics with Tensorboard:
### Tensorboard
Planner logs to tensorboard to visualize the metrics and the scaling actions.
You can start tensorboard with the following command:
```bash
tensorboard --logdir=<path-to-tensorboard-log-dir>
```
## Backends
The planner supports local and kubernetes backends for worker management.
### Local Backend
The local backend uses Circus to control worker processes. A Watcher tracks each `serve_dynamo.py` process. The planner adds or removes watchers to scale workers.
The local backend uses Circus to control worker processes. A Watcher tracks each `serve_dynamo.py` process.
The planner adds or removes watchers to scale workers.
Note: Circus's `increment` feature doesn't support GPU scheduling variables, so we create separate watchers per process.
#### State Management
The planner maintains state in a JSON file at `~/.dynamo/state/{namespace}.json`. This file:
* Tracks worker names as `{namespace}_{component_name}`
* Records GPU allocations from the allocator
* Updates after each planner action
* Cleans up automatically when the arbiter exits
- Tracks worker names as `{namespace}_{component_name}`.
- Records GPU allocations from the allocator.
- Updates after each planner action.
- Cleans up automatically when the arbiter exits.
Example state file evolution:
```none
# Initial decode worker
{
......@@ -181,11 +267,17 @@ Example state file evolution:
}
```
Note: Start with one replica per worker. Multiple initial replicas currently share a single watcher.
> [!Note]
> Start with one replica per worker.
> Multiple initial replicas currently share a single watcher.
### Kubernetes Backend
The Kubernetes backend scales workers by updating DynamoGraphDeployment replica counts. When scaling needs change, the planner:
The Kubernetes backend scales workers by updating DynamoGraphDeployment replica counts.
When scaling needs change, the planner:
1. Updates the deployment's replica count
2. Lets the Kubernetes operator create/remove pods
3. Maintains seamless scaling without manual intervention
......@@ -18,6 +18,7 @@ limitations under the License.
# Getting Started
## Development Environment
This section describes how to set up your development environment.
......@@ -27,20 +28,23 @@ This section describes how to set up your development environment.
We recommend using our pre-configured development container:
1. Install prerequisites:
* [Docker](https://www.docker.com/products/docker-desktop)
* [Visual Studio Code](https://code.visualstudio.com/)
* [Dev Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
- [Docker](https://www.docker.com/products/docker-desktop)
- [Visual Studio Code](https://code.visualstudio.com/)
- [Dev Containers extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode-remote.remote-containers)
2. Get the code:
```bash
git clone https://github.com/ai-dynamo/dynamo.git
cd dynamo
```
3. Open in Visual Studio Code:
* Launch Visual Studio Code
* Click the button in the bottom-left corner
* Select **Reopen in Container**
1. Launch Visual Studio Code
2. Click the button in the bottom-left corner
3. Select **Reopen in Container**
Visual Studio Code builds and starts a container with all necessary dependencies for Dynamo development.
......@@ -49,21 +53,20 @@ Visual Studio Code builds and starts a container with all necessary dependencies
If you don't want to use the dev container, you can set the environment up manually:
1. Ensure you have:
* Ubuntu 24.04 (recommended)
* x86_64 CPU
* Python 3.x
* Git
- Ubuntu 24.04 (recommended)
- x86_64 CPU
- Python 3.x
- Git
See [Support Matrix](support_matrix.md) for more information.
2. **If you plan to use vLLM or SGLang**, you must also install:
* etcd
* NATS.io
- etcd
- NATS.io
Before starting Dynamo, run both etcd and NATS.io in separate processes.
3. Install required system packages:
```bash
apt-get update
......@@ -81,13 +84,15 @@ If you don't want to use the dev container, you can set the environment up manua
pip install "ai-dynamo[all]"
```
```{note}
To ensure compatibility, use the examples in the release branch or tag that matches the version you installed.
```
> [!Important]
> To ensure compatibility, use the examples in the release branch or tag that matches the version you installed.
## Building the Dynamo Base Image
Deploying your Dynamo pipelines to Kubernetes requires you to build and push a Dynamo base image to your container registry. You can use any private container registry of your choice, including:
Deploying your Dynamo pipelines to Kubernetes requires you to build and push a Dynamo base image to your container registry.
You can use any private container registry of your choice, including:
- [Docker Hub](https://hub.docker.com/)
- [NVIDIA NGC Container Registry](https://catalog.ngc.nvidia.com/)
......@@ -102,25 +107,32 @@ docker push <your-registry>/dynamo-base:latest-vllm
```
This documentation describes these frameworks:
- `--framework vllm` build: see [here](examples/llm_deployment.md).
- `--framework tensorrtllm` build: see [here](examples/trtllm.md).
- `--framework vllm` build:
See [LLM Deployment Examples](examples/llm_deployment.md).
- `--framework tensorrtllm` build:
See [TRTLLM Deployment Examples](examples/trtllm.md).
After building, use this image by setting the `DYNAMO_IMAGE` environment variable to point to your built image:
```bash
export DYNAMO_IMAGE=<your-registry>/dynamo-base:latest-vllm
```
## Running and Interacting with an LLM Locally
To run a model and interact with it locally, call `dynamo run` with a Hugging Face model. `dynamo run` supports several backends, including `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.
To run a model and interact with it locally, call `dynamo run` with a Hugging Face model.
`dynamo run` supports several backends, including `mistralrs`, `sglang`, `vllm`, and `tensorrtllm`.
### Example Command
```
```bash
dynamo run out=vllm deepseek-ai/DeepSeek-R1-Distill-Llama-8B
```
```
```bash
? User › Hello, how are you?
✔ User · Hello, how are you?
Okay, so I'm trying to figure out how to respond to the user's greeting.
......@@ -128,13 +140,19 @@ They said, "Hello, how are you?" and then followed it with "Hello! I'm just a pr
Hmm, I need to come up with a suitable reply. ...
```
## LLM Serving
Dynamo provides a simple way to spin up a local set of inference components including:
- **OpenAI-compatible Frontend**—High-performance OpenAI compatible http api server written in Rust.
- **Basic and Kv Aware Router**—Route and load balance traffic to a set of workers.
- **Workers**—Set of pre-configured LLM serving engines.
- **OpenAI-compatible Frontend**:
High-performance, OpenAI-compatible HTTP API server written in Rust.
- **Basic and KV-aware Router**:
Routes and load-balances traffic to a set of workers.
- **Workers**:
Set of pre-configured LLM serving engines.
To run a minimal configuration, use a pre-configured example.
......@@ -145,6 +163,7 @@ To start the Dynamo Distributed Runtime services the first time:
```bash
docker compose -f deploy/docker-compose.yml up -d
```
### Start Dynamo LLM Serving Components
Next, serve a minimal configuration with an http server, basic
......@@ -158,7 +177,9 @@ dynamo serve graphs.agg:Frontend -f configs/agg.yaml
### Send a Request
```bash
curl localhost:8000/v1/chat/completions -H "Content-Type: application/json" -d '{
curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-Distill-Llama-8B",
"messages": [
{
......@@ -171,9 +192,11 @@ curl localhost:8000/v1/chat/completions -H "Content-Type: application/json"
}' | jq
```
## Local Development
If you use vscode or cursor, use the .devcontainer folder built on [Microsofts Extension](https://code.visualstudio.com/docs/devcontainers/containers). For instructions, see the Dynamo repository's [devcontainer README](https://github.com/ai-dynamo/dynamo/blob/main/.devcontainer/README.md).
If you use VS Code or Cursor, use the `.devcontainer` folder, which is built on [Microsoft's Extension](https://code.visualstudio.com/docs/devcontainers/containers).
For instructions, see the Dynamo repository's [devcontainer README](https://github.com/ai-dynamo/dynamo/blob/main/.devcontainer/README.md).
Otherwise, to develop locally, we recommend working inside of the container:
......@@ -214,5 +237,3 @@ docker compose -f deploy/docker-compose.yml up -d
cd examples/llm
dynamo serve graphs.agg:Frontend -f configs/agg.yaml
```
......@@ -17,82 +17,132 @@ limitations under the License.
# Disaggregation and Performance Tuning
Disaggregation gains performance by separating the prefill and decode into different engines to reduce interferences between the two. However, performant disaggregation requires careful tuning of the inference parameters. Specifically, there are three sets of parameters that needs to be tuned:
Disaggregation gains performance by separating the prefill and decode into different engines to reduce interferences between the two.
However, performant disaggregation requires careful tuning of the inference parameters.
Specifically, there are three sets of parameters that need to be tuned:
1. Engine knobs (e.g. parallelization mapping, maximum number of tokens, etc.).
1. Disaggregated router knobs.
1. Number of prefill and decode engines.
1. Engine configuration and options (e.g. parallelization mapping, maximum number of tokens, etc.).
2. Disaggregated router configuration and options.
3. Number of prefill and decode engines.
This guide describes the process of tuning these parameters.
## Engine Knobs
The most important engine knob to tune is the parallelization mapping. For most dense models, the best setting is to use TP within node and PP across nodes. For example, for llama 405b w8a8 on H100, TP8 on a single node or TP8PP2 on two nodes is usually the best choice. The next thing to decide is how many numbers of GPU to serve the model. Typically, the number of GPUs vs the performance follows the following pattern:
## Engine Configuration and Tuning
Number of GPUs | Performance
--- | ---
Cannot hold weights in VRAM | OOM
(Barely hold weights in VRAM) | (KV cache is too small to maintain large enough sequence length or reasonable batch size)
Minimum number with fair amount of KV cache | Best overall throughput/GPU, worst latency/user
Between minimum and maximum | Tradeoff between throughput/GPU and latency/user
Maximum number limited by communication scalability | Worst overall throughput/GPU, best latency/user
More than maximum | Communication overhead dominates, poor performance
The most important engine configuration to tune is the parallelization mapping.
For most dense models, the best setting is to use TP within node and PP across nodes.
For example, for Llama-405b w8a8 on H100, TP8 on a single node or TP8PP2 on two nodes is usually the best choice.
The next thing to decide is how many GPUs to use to serve the model.
Typically, performance varies with the number of GPUs as follows:
Note that for decode-only engines, sometimes larger number of GPUs has to larger kv cache per GPU and more decoding requests running in parallel, which leads to both better throughput/GPU and better latency/user. For example, for llama3.3 70b NVFP4 quantization on B200 in vllm with 0.9 free GPU memory fraction:
| Number of GPUs | Performance |
| :-------------------------------------------------- | :---------------------------------------------------------------------------------------- |
| Cannot hold weights in VRAM | OOM |
| (Barely hold weights in VRAM) | (KV cache is too small to maintain large enough sequence length or reasonable batch size) |
| Minimum number with fair amount of KV cache | Best overall throughput/GPU, worst latency/user |
| Between minimum and maximum | Tradeoff between throughput/GPU and latency/user |
| Maximum number limited by communication scalability | Worst overall throughput/GPU, best latency/user |
| More than maximum | Communication overhead dominates, poor performance |
TP Size | KV Cache Size (GB) | KV Cache per GPU (GB) | Per GPU Improvement over TP1
--- | --- | --- | ---
1 | 113 | 113 | 1.00x
2 | 269 | 135 | 1.19x
4 | 578 | 144 | 1.28x
> [!Note]
> For decode-only engines, a larger number of GPUs sometimes leads to a larger KV cache per GPU and more decoding requests running in parallel, which improves both throughput/GPU and latency/user.
>
> For example, for Llama-3.3-70b NVFP4 quantization on B200 in vLLM with 0.9 free GPU memory fraction:
The best number of GPUs to use in the prefill and decode engines can be determined by running a few fixed isl/osl/concurrency test using [GenAI-Perf](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf) and compare with the SLA. GenAI-Perf is pre-installed in the dynamo container.
| TP Size | KV Cache Size (GB) | KV Cache per GPU (GB) | Per GPU Improvement over TP1 |
| ------: | -----------------: | --------------------: | ---------------------------: |
| 1 | 113 | 113 | 1.00x |
| 2 | 269 | 135 | 1.19x |
| 4 | 578 | 144 | 1.28x |
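The per-GPU figures in the table follow from simple arithmetic on the total KV cache size. The short sketch below reproduces them. The function name `per_gpu_improvement` is introduced only for this example, and the input numbers are the TP sizes and total KV cache sizes from the table.

```python
# (TP size, total KV cache in GB), taken from the table above.
measurements = [(1, 113), (2, 269), (4, 578)]


def per_gpu_improvement(measurements):
    """Return (tp, kv_cache_per_gpu_gb, improvement_over_first_row) per row."""
    base_tp, base_total = measurements[0]
    base_per_gpu = base_total / base_tp
    rows = []
    for tp, total_gb in measurements:
        per_gpu = total_gb / tp
        rows.append((tp, per_gpu, round(per_gpu / base_per_gpu, 2)))
    return rows


for tp, per_gpu, gain in per_gpu_improvement(measurements):
    print(f"TP{tp}: {per_gpu:.1f} GB per GPU, {gain:.2f}x over TP1")
```

The table reports the per-GPU cache size in whole gigabytes, while the sketch keeps one decimal place. The improvement ratios match the table.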
Besides the parallelization mapping, other common knobs to tune are maximum batch size, maximum number of tokens, and block size. For prefill engines, usually a small batch size and large max_num_token is preferred. For decode engines, usually a large batch size and medium max_num_token is preferred. For details on tuning the max_num_token and max_batch_size, see the next section.
The best number of GPUs to use in the prefill and decode engines can be determined by running a few fixed ISL/OSL/concurrency tests using [GenAI-Perf](https://github.com/triton-inference-server/perf_analyzer/tree/main/genai-perf) and comparing the results with the SLA.
GenAI-Perf is pre-installed in the dynamo container.
> [!Tip]
> If you are unfamiliar with GenAI-Perf, please see this helpful [tutorial](https://github.com/triton-inference-server/perf_analyzer/blob/main/genai-perf/docs/tutorial.md) to get you started.
Besides the parallelization mapping, other common knobs to tune are maximum batch size, maximum number of tokens, and block size.
For prefill engines, usually a small batch size and large `max_num_token` is preferred.
For decode engines, usually a large batch size and medium `max_num_token` is preferred.
For details on tuning `max_num_token` and `max_batch_size`, see the next section.
If the block size is too small, the P->D KV cache transfer moves small memory chunks and performs poorly.
A block size that is too small also causes memory fragmentation in the attention calculation, but the impact is usually insignificant.
If the block size is too large, it leads to a low prefix cache hit ratio.
For most dense models, we find block size 128 is a good choice.
For block size, if the block size is too small, it leads to small memory chunks in the P->D KV cache transfer and poor performance. Too small block size also leads to memory fragmentation in the attention calculation, but the impact is usually insignificant. If the block size is too large, it leads to low prefix cache hit ratio. For most dense models, we find block size 128 is a good choice.
## Disaggregated Router
Disaggregated router decides whether to prefill a request in the remote prefill engine or locally in the decode engine using chunked prefill. For most frameworks, when chunked prefill is enabled and one forward iteration gets a mixture of prefilling and decoding request, three kernels are launched:
1. The attention kernel for context tokens (context_fmha kernel in trtllm).
2. The attention kernel for decode tokens (xqa kernel in trtllm).
3. Dense kernel for the combined active tokens in prefills and decodes.
The disaggregated router decides whether to prefill a request in the remote prefill engine or locally in the decode engine using chunked prefill.
For most frameworks, when chunked prefill is enabled and one forward iteration gets a mixture of prefilling and decoding requests, three kernels are launched:
1. The attention kernel for context tokens (context_fmha kernel in TRTLLM).
2. The attention kernel for decode tokens (xqa kernel in TRTLLM).
3. Dense kernel for the combined active tokens in prefills and decodes.
### Prefill Engine
In the prefill engine, the best strategy is to operate at the smallest batch size that saturates the GPUs so that the average TTFT is minimized. For example, for llama3.3 70b NVFP4 quantization on B200 TP1 in vllm, the below figure shows the prefill time with different isl (prefix caching is turned off):
In the prefill engine, the best strategy is to operate at the smallest batch size that saturates the GPUs so that the average time to first token (TTFT) is minimized.
For example, for Llama3.3-70b with NVFP4 quantization on B200 TP1 in vLLM, the figure below shows the prefill time for different input sequence lengths (ISL), with prefix caching turned off:
![Combined bar and line chart showing "Prefill Time". Bar chart represents TTFT (Time To First Token) in milliseconds against ISL (Input Sequence Length). The line chart shows TTFT/ISL (milliseconds per token) against ISL.](../images/prefill_time.png)
For isl less than 1000, the prefill efficiency is low because the GPU is not fully saturated. For isl larger than 4000, the prefill time per token increases because the attention takes longer to compute with a longer history.
Currently, prefill engines in Dynamo operate at a batch size of 1. To keep the prefill engine saturated, users can set `max-local-prefill-length` to the saturation point.
For ISL below 1000, the prefill efficiency is low because the GPU is not fully saturated.
For ISL above 4000, the prefill time per token increases because attention takes longer to compute with a longer history.
Currently, prefill engines in Dynamo operate at a batch size of 1.
To keep the prefill engine saturated, users can set `max-local-prefill-length` to the saturation point.
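One way to read a saturation point off measurements like those in the figure is to take the smallest input length whose per-token prefill time is close to the best one observed. The following is a minimal sketch under that assumption. The function name, the 5% tolerance, and the sample numbers are all hypothetical and are not taken from the figure.

```python
def saturation_isl(ttft_ms_by_isl: dict[int, float], tolerance: float = 0.05) -> int:
    """Return the smallest ISL whose TTFT per token is within `tolerance`
    (relative) of the lowest per-token time among the measurements."""
    if not ttft_ms_by_isl:
        raise ValueError("need at least one measurement")
    if any(isl <= 0 for isl in ttft_ms_by_isl):
        raise ValueError("ISL values must be positive")
    # Milliseconds per input token for each measured ISL.
    per_token = {isl: ttft / isl for isl, ttft in ttft_ms_by_isl.items()}
    best = min(per_token.values())
    return min(isl for isl, cost in per_token.items() if cost <= best * (1 + tolerance))


if __name__ == "__main__":
    # Made-up measurements: ISL -> TTFT in milliseconds.
    measurements = {250: 25.0, 500: 30.0, 1000: 40.0, 2000: 78.0, 4000: 160.0, 8000: 400.0}
    print(saturation_isl(measurements))
```

The returned value is a candidate for `max-local-prefill-length`; it should be re-measured for each model, GPU, and parallelization mapping.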
### Decode Engine
In the decode engine, maximum batch size and maximum number of tokens affect the size of intermediate tensors. With a larger batch size and number of tokens, the size of intermediate tensors increases and the size of KV cache decreases. Trtllm has a good [summary](https://nvidia.github.io/TensorRT-LLM/reference/memory.html) on the memory footprint where similar ideas also apply to other llm frameworks.
In the decode engine, the maximum batch size and maximum number of tokens affect the size of intermediate tensors.
With a larger batch size and number of tokens, the size of intermediate tensors increases and the size of the KV cache decreases.
TensorRT-LLM (TRTLLM) has a good [summary](https://nvidia.github.io/TensorRT-LLM/reference/memory.html) of the memory footprint; similar ideas also apply to other LLM frameworks.
With chunked prefill enabled, the maximum number of tokens controls the longest prefill that can be piggybacked onto decode and therefore controls the inter-token latency (ITL).
For the same prefill requests, a large maximum number of tokens leads to fewer but longer stalls in the generation, while a small maximum number of tokens leads to more but shorter stalls in the generation.
However, chunked prefill is currently not supported in Dynamo (vLLM backend).
Hence, the current best strategy is to set the maximum batch size to the optimized KV cache size and set the maximum number of tokens to the maximum local prefill length + maximum batch size (since one decode request has one active token).
With chunked prefill enabled, the maximum number of tokens controls the longest prefill that can be piggybacked to decode and controls the ITL. For the same prefill requests, a large maximum number of tokens leads to fewer but longer stalls in the generation, while a small maximum number of tokens leads to more but shorter stalls in the generation. However, chunked prefill is currently not supported in Dynamo (vllm backend). Hence, the current best strategy is to set the maximum batch size to the optimized kv cache size and set the maximum number of tokens to the maximum local prefill length + maximum batch size (since one decode request has one active token).
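The sizing rule in the last sentence is simple arithmetic: one local prefill of at most the maximum local prefill length, plus one active token per decode request in the batch. A minimal sketch follows. The function name and the dictionary keys are illustrative placeholders, not actual Dynamo or vLLM option names.

```python
def decode_engine_limits(max_local_prefill_length: int, max_batch_size: int) -> dict[str, int]:
    """Apply the rule from the text:
    max number of tokens = max local prefill length + max batch size,
    since each decode request contributes exactly one active token."""
    if max_local_prefill_length < 0 or max_batch_size <= 0:
        raise ValueError("lengths must be non-negative and batch size positive")
    return {
        "max_batch_size": max_batch_size,
        "max_num_tokens": max_local_prefill_length + max_batch_size,
    }


if __name__ == "__main__":
    # Hypothetical values: 1000-token local prefill limit, batch of 64 decodes.
    print(decode_engine_limits(max_local_prefill_length=1000, max_batch_size=64))
```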
## Number of Prefill and Decode Engines
The best dynamo knob choices depend on the operating condition of the model. Based on the load, we define three operating conditions:
1. **Low load**: The endpoint is hit by a single user (single-stream) most of the time.
2. **Medium load**: The endpoint is hit by multiple users, but the KV cache of the decode engines is never fully utilized.
3. **High load**: The endpoint is hit by multiple users and the requests are queued up due to no available KV cache in the decode engines.
The best Dynamo knob choices depend on the operating condition of the model.
Based on the load, we define three operating conditions:
1. **Low load**:
The endpoint is hit by a single user (single-stream) most of the time.
At low load, disaggregation would not benefit much as prefill and decode are usually computed separately. It is usually better to use a single monolithic engine.
2. **Medium load**:
The endpoint is hit by multiple users, but the KV cache of the decode engines is never fully utilized.
At medium load, disaggregation allows better ITL compared with prefill-prioritized and chunked prefill engines and better TTFT compared with chunked prefill engine and decode-only engine for each user. Dynamo users can adjust the number of prefill and decode engines based on TTFT and ITL SLA.
3. **High load**:
The endpoint is hit by multiple users and the requests are queued up due to no available KV cache in the decode engines.
At low load, disaggregation would not benefit much as prefill and decode are usually computed separately.
It is usually better to use a single monolithic engine.
At medium load, disaggregation allows better ITL compared with prefill-prioritized and chunked prefill engines and better TTFT compared with chunked prefill engine and decode-only engine for each user.
Dynamo users can adjust the number of prefill and decode engines based on TTFT and ITL SLA.
At high load where KV cache capacity is the bottleneck, disaggregation has the following effect on the KV cache usage in the decode engines:
* Increase the total amount of KV cache:
* Being able to use larger TP in decode engines leads to more KV cache per GPU and higher prefix cache hit rate.
* When a request is prefilled remotely, the decode engine does not need to maintain its KV cache (currently not implemented in Dynamo).
* Lower ITL reduces the decode time and allows the same amount of KV cache to serve more requests.
* Decrease the total amount of KV cache:
* Some GPUs are configured as prefill engines whose KV cache is not used in the decode phase.
Since Dynamo currently allocates the KV blocks immediately when the decode engine gets the requests, it is advisable to use as few prefill engines as possible (even no prefill engine) to maximize the available KV cache in decode engines. To prevent queueing at prefill engines, users can set a large `max-local-prefill-length` and piggyback more prefill requests at decode engines.
\ No newline at end of file
* Increase the total amount of KV cache:
* Being able to use greater TP values in decode engines leads to more KV cache per GPU and a higher prefix cache hit rate.
* When a request is prefilled remotely, the decode engine does not need to maintain its KV cache (currently not implemented in Dynamo).
* Lower ITL reduces the decode time and allows the same amount of KV cache to serve more requests.
* Decrease the total amount of KV cache:
* Some GPUs are configured as prefill engines whose KV cache is not used in the decode phase.
Since Dynamo currently allocates the KV blocks immediately when the decode engine gets the requests,
it is advisable to use as few prefill engines as possible (even no prefill engine) to maximize the available KV cache in decode engines.
To prevent queueing at prefill engines, users can set a large `max-local-prefill-length` and piggyback more prefill requests at decode engines.
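The three operating conditions and the guidance attached to each can be summarized as a small decision helper. This is only a sketch of the reasoning above: the two input signals, the threshold of one concurrent request for "single-stream", and all names are assumptions chosen for illustration, not metrics that Dynamo exposes under these names.

```python
from enum import Enum


class Load(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def classify_load(avg_concurrent_requests: float, requests_waiting_for_kv: int) -> Load:
    """Map two coarse signals onto the three operating conditions."""
    if requests_waiting_for_kv > 0:
        # Requests queue because decode engines have no free KV cache.
        return Load.HIGH
    if avg_concurrent_requests <= 1.0:
        # Single-stream traffic most of the time.
        return Load.LOW
    # Multiple users, but the decode KV cache is never fully utilized.
    return Load.MEDIUM


# Guidance paraphrased from the text for each condition.
GUIDANCE = {
    Load.LOW: "use a single monolithic engine",
    Load.MEDIUM: "disaggregate; size prefill and decode engines from TTFT and ITL SLAs",
    Load.HIGH: "use as few prefill engines as possible; raise max-local-prefill-length",
}


if __name__ == "__main__":
    condition = classify_load(avg_concurrent_requests=12.0, requests_waiting_for_kv=0)
    print(condition.value, "->", GUIDANCE[condition])
```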
......@@ -19,22 +19,28 @@ limitations under the License.s
This guide explains how to use the `dynamo build` command to containerize Dynamo inference graphs (pipelines) for deployment.
`dynamo build` is a command-line tool that helps containerize inference graphs created with Dynamo SDK. Run `dynamo build --containerize` to build a stand-alone Docker container that encapsulates your entire inference graph. This image can then be shared and run standalone.
`dynamo build` is a command-line tool that helps containerize inference graphs created with Dynamo SDK.
Run `dynamo build --containerize` to build a stand-alone Docker container that encapsulates your entire inference graph.
The generated container-image can then be shared and/or run standalone.
> [!Caution]
> This experimental feature is tested on the examples in the `examples/` directory.
> You need to make some modifications.
> Pay particular attention if your inference graph introduces custom dependencies.
```{note}
This experimental feature is tested on the examples in the `examples/` directory. You need to make some modifications. Pay particular attention if your inference graph introduces custom dependencies.
```
## Building a containerized inference graph
The basic workflow for using `dynamo build` includes:
#. Defining your inference graph and testing locally with `dynamo serve`
#. Specifying a base image for your inference graph. More on this below.
#. Running `dynamo build` to build a containerized inference graph
1. Defining your inference graph and testing locally with `dynamo serve`.
2. Specifying a base image for your inference graph. More on this below.
3. Running `dynamo build` to build a containerized inference graph.
### Basic Usage
```bash
dynamo build <graph_definition> --containerize
```
\ No newline at end of file
```
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
File suppressed by a .gitattributes entry or the file's encoding is unsupported.
......@@ -22,71 +22,81 @@ This document provides the support matrix for Dynamo, including hardware, softwa
## Hardware Compatibility
| **CPU Architecture** | **Status** |
| :------------------- | :----------- |
| **x86_64** | Supported |
| **ARM64** | Experimental |
| **CPU Architecture** | **Status** |
|-----------------------|---------------|
| **x86_64** | Supported |
| **ARM64** | Experimental |
```{note}
While **x86_64** architecture is supported on systems with a minimum of 32 GB RAM and at least 4 CPU cores, the **ARM64** support is experimental and may have limitations.
```
> [!Warning]
> While **x86_64** architecture is supported on systems with a minimum of 32 GB RAM and at least 4 CPU cores,
> the **ARM64** support is experimental and may have limitations.
### GPU Compatibility
If you are using a **GPU**, the following GPU models and architectures are supported:
| **GPU Architecture** | **Status** |
|-------------------------------------|---------------|
| **NVIDIA Blackwell Architecture** | Supported |
| **NVIDIA Hopper Architecture** | Supported |
| **NVIDIA Ada Lovelace Architecture**| Supported |
| **NVIDIA Ampere Architecture** | Supported |
| **GPU Architecture** | **Status** |
| :----------------------------------- | :--------- |
| **NVIDIA Blackwell Architecture** | Supported |
| **NVIDIA Hopper Architecture** | Supported |
| **NVIDIA Ada Lovelace Architecture** | Supported |
| **NVIDIA Ampere Architecture** | Supported |
## Platform Architecture Compatibility
**Dynamo** is compatible with the following platforms:
| **Operating System** | **Version** | **Architecture** | **Status** |
|----------------------|-------------|------------------|--------------|
| :------------------- | :---------- | :--------------- | :----------- |
| **Ubuntu** | 22.04 | x86_64 | Supported |
| **Ubuntu** | 24.04 | x86_64 | Supported |
| **Ubuntu** | 24.04 | ARM64 | Experimental |
| **CentOS Stream** | 9 | x86_64 | Experimental |
```{note}
For **Linux**, the **ARM64** support is experimental and may have limitations. Wheels are built using a manylinux_2_28-compatible environment and they have been validated on CentOS 9 and Ubuntu (22.04, 24.04). Compatibility with other Linux distributions is expected but has not been officially verified yet.
> [!Note]
> For **Linux**, the **ARM64** support is experimental and may have limitations.
> Wheels are built using a manylinux_2_28-compatible environment and they have been validated on CentOS 9 and Ubuntu (22.04, 24.04).
>
> Compatibility with other Linux distributions is expected but has not been officially verified yet.
> [!Caution]
> KV Block Manager is supported only with Python 3.12. Python 3.12 support is currently limited to Ubuntu 24.04.
**Known Issues**: KV Block Manager is supported only with Python 3.12. Python 3.12 support is currently limited to Ubuntu 24.04.
```
## Software Compatibility
### Runtime Dependency
| **Python Package** | **Version** | glibc version | CUDA Version |
|--------------------|---------------|----------------------|--------------|
| ai-dynamo | 0.3.0 | >=2.28 | |
| ai-dynamo-runtime | 0.3.0 | >=2.28 | |
| ai-dynamo-vllm | 0.8.4.post2* | >=2.28 (recommended) | |
| NIXL | 0.3.0 | >=2.27 | >=11.8 |
| :----------------- | :------------ | :------------------- | :----------- |
| ai-dynamo | 0.3.0 | >=2.28 | |
| ai-dynamo-runtime | 0.3.0 | >=2.28 | |
| ai-dynamo-vllm | 0.8.4.post2¹ | >=2.28 (recommended) | |
| NIXL | 0.3.0 | >=2.27 | >=11.8 |
### Build Dependency
| **Build Dependency** | **Version** |
|----------------------|-------------|
| **Base Container** | [25.03](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda-dl-base/tags) |
| **ai-dynamo-vllm** |0.8.4.post2* |
| **TensorRT-LLM** | 0.19.0** |
| **NIXL** | 0.3.0 |
```{note}
*ai-dynamo-vllm v0.8.4.post2 is a customized patch of v0.8.4 from vLLM.
| **Build Dependency** | **Version** |
| :------------------- | :------------------------------------------------------------------------------- |
| **Base Container** | [25.03](https://catalog.ngc.nvidia.com/orgs/nvidia/containers/cuda-dl-base/tags) |
| **ai-dynamo-vllm** | 0.8.4.post2¹ |
| **TensorRT-LLM** | 0.19.0² |
| **NIXL** | 0.3.0 |
**Specific versions of TensorRT-LLM supported by Dynamo are subject to change.
```
> [!Important]
> ¹ ai-dynamo-vllm `v0.8.4.post2` is a customized patch of `v0.8.4` from vLLM.
>
> ² Specific versions of TensorRT-LLM supported by Dynamo are subject to change.
## Build Support
**Dynamo** currently provides build support in the following ways:
- **Wheels**: Pre-built Python wheels are only available for **x86_64 Linux**. No wheels are available for other platforms at this time.
- **Container Images**: We distribute only the source code for container images; both **x86_64 Linux** and **ARM64** are supported. Users must build the container image from source if they require it.
- **Wheels**: Pre-built Python wheels are only available for **x86_64 Linux**.
No wheels are available for other platforms at this time.
- **Container Images**: We distribute only the source code for container images; both **x86_64 Linux** and **ARM64** are supported.
Users must build the container image from source if they require it.
Once you've confirmed that your platform and architecture are compatible, you can install **Dynamo** by following the instructions in the [Quick Start Guide](https://github.com/ai-dynamo/dynamo/blob/main/README.md#installation).
\ No newline at end of file
Once you've confirmed that your platform and architecture are compatible, you can install **Dynamo** by following the instructions in the [Quick Start Guide](https://github.com/ai-dynamo/dynamo/blob/main/README.md#installation).
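A quick local check against the tables above can be scripted with the Python standard library. The sketch below is an unofficial convenience, not a Dynamo tool. It assumes that Linux reports ARM64 as `aarch64` (or `arm64` on some systems), and it uses the `>=2.28` glibc bound from the runtime dependency table as its default minimum.

```python
import platform

# Status per CPU architecture, from the Hardware Compatibility table.
ARCH_STATUS = {
    "x86_64": "Supported",
    "aarch64": "Experimental",  # assumed Linux name for ARM64
    "arm64": "Experimental",
}


def arch_status(machine: str) -> str:
    """Look up the support status for a value such as platform.machine()."""
    return ARCH_STATUS.get(machine.lower(), "Not listed")


def glibc_at_least(version: str, minimum: tuple[int, int] = (2, 28)) -> bool:
    """Compare a glibc version string such as '2.35' against a minimum."""
    try:
        parts = tuple(int(p) for p in version.split(".")[:2])
    except ValueError:
        return False  # empty or non-numeric version string
    return len(parts) == 2 and parts >= minimum


if __name__ == "__main__":
    lib, version = platform.libc_ver()
    print("architecture:", platform.machine(), "->", arch_status(platform.machine()))
    print("libc:", lib or "unknown", version or "unknown",
          "-> glibc >= 2.28:", lib == "glibc" and glibc_at_least(version))
```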