Commit 89cf9107 authored by hhzhang16, committed by GitHub

docs: add Planner Quickstart doc (#3358)


Signed-off-by: Hannah Zhang <hannahz@nvidia.com>
Signed-off-by: hhzhang16 <54051230+hhzhang16@users.noreply.github.com>
Co-authored-by: Hongkuan Zhou <tedzhouhk@gmail.com>
parent 1faf0152
...@@ -237,7 +237,7 @@ args: ...@@ -237,7 +237,7 @@ args:
- **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/create_deployment.md) - **Deployment Guide**: [Creating Kubernetes Deployments](../../../../docs/kubernetes/create_deployment.md)
- **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md) - **Quickstart**: [Deployment Quickstart](../../../../docs/kubernetes/README.md)
- **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md) - **Platform Setup**: [Dynamo Cloud Installation](../../../../docs/kubernetes/installation_guide.md)
- **SLA Planner**: [SLA Planner Deployment Guide](../../../../docs/kubernetes/sla_planner_deployment.md) - **SLA Planner**: [SLA Planner Quickstart Guide](../../../../docs/kubernetes/sla_planner_quickstart.md)
- **Examples**: [Deployment Examples](../../../../docs/examples/README.md) - **Examples**: [Deployment Examples](../../../../docs/examples/README.md)
- **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md) - **Architecture Docs**: [Disaggregated Serving](../../../../docs/architecture/disagg_serving.md), [KV-Aware Routing](../../../../docs/architecture/kv_cache_routing.md)
......
...@@ -23,10 +23,16 @@ Currently, the planner can scale the number of vllm workers up and down based on ...@@ -23,10 +23,16 @@ Currently, the planner can scale the number of vllm workers up and down based on
Key features include: Key features include:
* **Load-based scaling** that monitors KV cache utilization and prefill queue size to make scaling decisions
* **SLA-based scaling** that uses predictive modeling and performance interpolation to proactively meet TTFT and ITL targets * **SLA-based scaling** that uses predictive modeling and performance interpolation to proactively meet TTFT and ITL targets
* **Graceful scaling** that ensures no requests are dropped during scale-down operations * **Graceful scaling** that ensures no requests are dropped during scale-down operations
.. admonition:: 🚀 Quick Start
:class: seealso
**New to SLA Planner?** Start with the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md) for a complete, step-by-step workflow.
**Prerequisites**: The SLA-based planner requires pre-deployment profiling (2-4 hours on real silicon, or a few minutes using the simulator) before deployment. The Quick Start guide includes everything you need.
.. list-table:: .. list-table::
:widths: 20 5 75 :widths: 20 5 75
:header-rows: 1 :header-rows: 1
...@@ -35,7 +41,7 @@ Key features include: ...@@ -35,7 +41,7 @@ Key features include:
- -
- Feature - Feature
* - **Backend** * - **Backend**
- -
- Local - Local
* - * -
- ✅ - ✅
...@@ -47,7 +53,7 @@ Key features include: ...@@ -47,7 +53,7 @@ Key features include:
- ✅ - ✅
- TensorRT-LLM - TensorRT-LLM
* - * -
- -
- SGLang - SGLang
* - **Serving Type** * - **Serving Type**
- ✅ - ✅
...@@ -56,7 +62,7 @@ Key features include: ...@@ -56,7 +62,7 @@ Key features include:
- ✅ - ✅
- Disaggregated - Disaggregated
* - **Planner Actions** * - **Planner Actions**
- -
- Load-based scaling up/down prefill/decode workers - Load-based scaling up/down prefill/decode workers
* - * -
- ✅ - ✅
...@@ -71,6 +77,6 @@ Key features include: ...@@ -71,6 +77,6 @@ Key features include:
:hidden: :hidden:
Overview <self> Overview <self>
SLA Planner Quick Start <../kubernetes/sla_planner_quickstart>
Pre-Deployment Profiling <../benchmarks/pre_deployment_profiling.md> Pre-Deployment Profiling <../benchmarks/pre_deployment_profiling.md>
Load-based Planner <load_planner.md>
SLA-based Planner <sla_planner.md> SLA-based Planner <sla_planner.md>
# SLA-based Planner # SLA-based Planner
This document covers SLA-based planner in `examples/common/utils/planner_core.py`. > [!TIP]
> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
This document covers information regarding the SLA-based planner in `examples/common/utils/planner_core.py`.
The SLA (Service Level Agreement)-based planner is an intelligent autoscaling system that monitors system performance and adjusts the number of prefill and decode workers to meet specified TTFT and ITL targets. Unlike the load-based planner that scales based on resource utilization thresholds, the SLA planner uses predictive modeling and performance interpolation to proactively scale the workers. The SLA (Service Level Agreement)-based planner is an intelligent autoscaling system that monitors system performance and adjusts the number of prefill and decode workers to meet specified TTFT and ITL targets. Unlike the load-based planner that scales based on resource utilization thresholds, the SLA planner uses predictive modeling and performance interpolation to proactively scale the workers.
...@@ -10,6 +13,24 @@ The SLA (Service Level Agreement)-based planner is an intelligent autoscaling sy ...@@ -10,6 +13,24 @@ The SLA (Service Level Agreement)-based planner is an intelligent autoscaling sy
> [!WARNING] > [!WARNING]
> Bare metal deployment with local connector is deprecated. Please deploy the SLA planner in k8s. > Bare metal deployment with local connector is deprecated. Please deploy the SLA planner in k8s.
## Architecture Overview
**Components:**
- **Frontend**: Serves requests and exposes `/metrics`
- **Prometheus**: Scrapes frontend metrics every 5s (by default, can be updated in the podmonitor manifest)
- **Planner**: Queries Prometheus and adjusts worker scaling every adjustment interval
- **Workers**: prefill and backend workers handle inference
The adjustment interval can be defined in the planner manifest as an argument. The default interval value can be found in this [file](/components/planner/src/dynamo/planner/defaults.py).
```mermaid
flowchart LR
Frontend --"/metrics"--> Prometheus
Planner --"query API"--> Prometheus
Planner --"scaling decisions"--> Workers
Frontend -.->|"requests"| Workers
```
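The components above suggest a simple control loop. The following is a minimal sketch of one planner iteration, not the code in `planner_core.py`: `Observation`, `Replicas`, `planner_step`, and `toy_decide` are hypothetical names, and the reactive rule in `toy_decide` only stands in for the predictive modeling the SLA planner actually uses. The default targets reuse the `--ttft 0.2` and `--itl 0.02` values from the profiling example and are assumed here to be in seconds.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Observation:
    """Metrics the planner reads back from Prometheus (hypothetical container)."""
    num_req: float
    isl: float
    osl: float
    ttft: float
    itl: float


@dataclass(frozen=True)
class Replicas:
    """Desired worker counts (hypothetical container)."""
    prefill: int
    decode: int


def toy_decide(obs: Observation, current: Replicas,
               ttft_target: float = 0.2, itl_target: float = 0.02) -> Replicas:
    """Hypothetical reactive rule, NOT the SLA planner's predictive algorithm."""
    return Replicas(
        prefill=current.prefill + (1 if obs.ttft > ttft_target else 0),
        decode=current.decode + (1 if obs.itl > itl_target else 0),
    )


def planner_step(query_metrics: Callable[[], Observation],
                 decide: Callable[[Observation, Replicas], Replicas],
                 apply_scaling: Callable[[Replicas], None],
                 current: Replicas) -> Replicas:
    """Run one adjustment interval: observe, decide, and scale only on change."""
    obs = query_metrics()
    desired = decide(obs, current)
    if desired != current:
        apply_scaling(desired)
    return desired
```

Passing the Prometheus query and the scaling action in as callables keeps the loop testable without a cluster; a real deployment would call `planner_step` once per adjustment interval.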
## Features ## Features
* **SLA-driven scaling**: Automatically scales prefill/decode workers to meet TTFT and ITL targets * **SLA-driven scaling**: Automatically scales prefill/decode workers to meet TTFT and ITL targets
...@@ -108,15 +129,7 @@ Finally, SLA planner applies the change by scaling up/down the number of prefill ...@@ -108,15 +129,7 @@ Finally, SLA planner applies the change by scaling up/down the number of prefill
## Deploying ## Deploying
### K8s Deployment For complete deployment instructions, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
For detailed deployment instructions including setup, configuration, troubleshooting, and architecture overview, see the [SLA Planner Deployment Guide](../kubernetes/sla_planner_deployment.md).
**To deploy SLA Planner:**
```bash
cd components/backends/vllm/deploy
kubectl apply -f disagg_planner.yaml -n ${NAMESPACE}
```
> [!NOTE] > [!NOTE]
> The SLA planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with the number of requests, ISL, OSL, TTFT, and ITL in the correct format. The dynamo frontend provides these metrics automatically. > The SLA planner requires a frontend that reports metrics at the `/metrics` HTTP endpoint with the number of requests, ISL, OSL, TTFT, and ITL in the correct format. The dynamo frontend provides these metrics automatically.
......
# Pre-Deployment Profiling # Pre-Deployment Profiling
> [!TIP]
> **New to SLA Planner?** For a complete workflow including profiling and deployment, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
## Profiling Script ## Profiling Script
To ensure Dynamo deployments comply with the SLA, we provide a pre-deployment script to profile the model performance with different parallelization mappings and recommend the parallelization mapping for prefill and decode workers and planner configurations. To use this script, the user needs to provide the target ISL, OSL, TTFT SLA, and ITL SLA. To ensure Dynamo deployments comply with the SLA, we provide a pre-deployment script to profile the model performance with different parallelization mappings and recommend the parallelization mapping for prefill and decode workers and planner configurations. To use this script, the user needs to provide the target ISL, OSL, TTFT SLA, and ITL SLA.
...@@ -93,40 +96,16 @@ After suggesting the optimal TP configuration, two `.npz` files that describe th ...@@ -93,40 +96,16 @@ After suggesting the optimal TP configuration, two `.npz` files that describe th
SLA planner can work with any interpolation data that follows the above format. For best results, use fine-grained and high coverage interpolation data for the prefill and decode engines. SLA planner can work with any interpolation data that follows the above format. For best results, use fine-grained and high coverage interpolation data for the prefill and decode engines.
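To illustrate why fine-grained, high-coverage data helps, here is a minimal sketch of piecewise-linear interpolation over profiled points. It does not read the `.npz` files or reproduce the planner's interpolation code; the function name `interpolate` and the sample ISL/TTFT values are hypothetical. With sparse samples, every estimate between two points is a straight-line guess, and queries outside the profiled range are simply clamped.

```python
from bisect import bisect_left
from typing import Sequence


def interpolate(xs: Sequence[float], ys: Sequence[float], x: float) -> float:
    """Piecewise-linear interpolation, clamped to the profiled range.

    `xs` must be strictly increasing and the same length as `ys`.
    """
    if not xs or len(xs) != len(ys):
        raise ValueError("xs and ys must be non-empty and of equal length")
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_left(xs, x)  # index of the first profiled point >= x
    x0, x1 = xs[i - 1], xs[i]
    y0, y1 = ys[i - 1], ys[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)


# Hypothetical profile: TTFT (seconds) measured at three input sequence lengths.
profiled_isl = [1000.0, 2000.0, 4000.0]
profiled_ttft = [0.05, 0.10, 0.30]
estimated_ttft = interpolate(profiled_isl, profiled_ttft, 3000.0)
```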
## Running the Profiling Script in Kubernetes ## Detailed Kubernetes Profiling Instructions
Set up your Kubernetes namespace for profiling (one-time per namespace). First ensure Dynamo Cloud platform is installed by following the [main installation guide](/docs/kubernetes/installation_guide.md), then set up profiling resources using [deploy/utils/README](/deploy/utils/README.md). If your namespace is already set up, skip this step.
**Prerequisites**: Ensure all dependencies are installed. If you ran the setup script above, dependencies are already installed. Otherwise, install them manually:
```bash
pip install -r deploy/utils/requirements.txt
```
**Step 1: Inject your DGD configuration** > [!TIP]
> For a complete step-by-step workflow, see the [SLA Planner Quick Start Guide](/docs/kubernetes/sla_planner_quickstart.md).
Use the injector utility to place your DGD manifest into the PVC. The profiling job will read the path you specify. This section provides detailed technical information for advanced users who need to customize the profiling process.
```bash
# Use default disagg.yaml config
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml
# Or use a custom disagg config file
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml
# Or specify a custom target path in the PVC
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/profiling_results/my-disagg.yaml
```
> **Note**: All paths must start with `/data/` for security reasons. If you forget this prefix, the script will show a helpful error message with the correct path. ### Configuration Options
**Step 2: Set SLA target** **For dense models**, configure `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_job.yaml`:
For dense models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_job.yaml` to set the target ISL, OSL, TTFT, and ITL. Also, set the backend type to match the dynamo deployment in the `DGD_CONFIG_FILE`.
For MoE models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_moe_job.yaml` to set the target TEP, DEP, TTFT, and ITL.
> [!NOTE]
> If the model is too large to be downloaded every time, you can create a multi-attach PVC to cache the model. Refer to [recipes](../../recipes/README.md) for more details.
```yaml ```yaml
spec: spec:
...@@ -147,36 +126,13 @@ spec: ...@@ -147,36 +126,13 @@ spec:
- <vllm/sglang> - <vllm/sglang>
``` ```
**Step 3: Define the container image and config path** **For MoE models**, use `profile_sla_moe_job.yaml` with TEP/DEP configuration instead.
1. **Set the container image:**
```bash
export DOCKER_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
```
2. **Set the config path for the profiling job:**
```bash
export DGD_CONFIG_FILE=/data/configs/disagg.yaml # should be the same path you set for --dest in Step 1
```
**Step 4: Run profiling (required)**
```bash
# for dense models
envsubst < benchmarks/profiler/deploy/profile_sla_job.yaml | kubectl apply -f -
# for MoE models ### Advanced Configuration
envsubst < benchmarks/profiler/deploy/profile_sla_moe_job.yaml | kubectl apply -f -
# using aiconfigurator instead of real sweeping (see below for more details) - **Model caching**: For large models, create a multi-attach PVC to cache the model. See [recipes](../../recipes/README.md) for details.
envsubst < benchmarks/profiler/deploy/profile_sla_aic_job.yaml | kubectl apply -f - - **Custom configurations**: Use the manifest injector to place custom DGD configurations in the PVC.
``` - **Resource allocation**: Modify the job YAML to adjust GPU and memory requirements.
**Step 5: Wait for profiling to complete**
```bash
kubectl get jobs -n $NAMESPACE
kubectl logs job/profile-sla -n $NAMESPACE
```
### Viewing Profiling Results ### Viewing Profiling Results
...@@ -265,53 +221,54 @@ If you see `ErrImagePull` or `ImagePullBackOff` errors with 401 unauthorized mes ...@@ -265,53 +221,54 @@ If you see `ErrImagePull` or `ImagePullBackOff` errors with 401 unauthorized mes
3. The service account should show `imagePullSecrets` containing `nvcr-imagepullsecret`. 3. The service account should show `imagePullSecrets` containing `nvcr-imagepullsecret`.
## Running the Profiling Script with `aiconfigurator` ## Running the Profiling Script with AI Configurator
The profiling script can be run much quicker by using `aiconfigurator` to estimate perf numbers instead of running and benchmarking real dynamo deployments. To enable estimation using `aiconfigurator`, pass the `--use-ai-configurator` flag to the profiling script.
> [!NOTE]
> **TensorRT-LLM Only**: AI Configurator currently supports TensorRT-LLM only. Support for vLLM and SGLang is coming soon.
The profiling script can be run much faster using AI Configurator to estimate performance numbers instead of running real Dynamo deployments. This completes profiling in 20-30 seconds using performance simulation.
**Advantages** of `--use-ai-configurator`: **Advantages** of `--use-ai-configurator`:
* Script will finish in seconds rather than hours. * Script completes in seconds rather than hours
* No k8s or GPU access is required. * No Kubernetes or GPU access required
* Ideal for rapid prototyping and testing
**Disadvantages**: **Disadvantages**:
* Estimated perf could contain some error, especially when the input dimensions out-of-distribution compared to the sampled values in aiconfigurator. * Estimated performance may contain errors, especially for out-of-distribution input dimensions
* `aiconfigurator` has a limited list of supported models. * Limited list of supported models, systems, and backends
* `aiconfigurator`'s database has a limited list of systems and backends. * Less accurate than real deployment profiling
### Prerequisites ### Prerequisites
You will need a virtual environment with `dynamo` installed. Either use the local dev environment or the docker images. If using local environment, install the required dependencies:
Install AI Configurator:
```bash ```bash
pip install -r deploy/utils/requirements.txt pip install aiconfigurator
``` ```
Additionally, install `aiconfigurator`: If using local environment, also install:
```bash ```bash
pip install aiconfigurator pip install -r deploy/utils/requirements.txt
``` ```
### Available Models, Systems, and Backends ### Check Support Matrix
`aiconfigurator` supports a limited list of models, systems, and backends.
You can use the `aiconfigurator` CLI to see the support matrix: View supported models, systems, and backends:
```bash ```bash
aiconfigurator cli --help aiconfigurator cli --help
``` ```
This will display:
``` **Supported configurations:**
...options...
--model {GPT_7B,GPT_13B,GPT_30B,GPT_66B,GPT_175B,LLAMA2_7B,LLAMA2_13B,LLAMA2_70B,LLAMA3.1_8B,LLAMA3.1_70B,LLAMA3.1_405B,MOE_Mixtral8x7B,MOE_Mixtral8x22B,DEEPSEEK_V3,KIMI_K2,QWEN2.5_1.5B,QWEN2.5_7B,QWEN2.5_32B,QWEN2.5_72B,QWEN3_32B,QWEN3_235B,QWEN3_480B,Nemotron_super_v1.1}
Model name
--system {h100_sxm,h200_sxm}
System name
--backend {trtllm,sglang,vllm}
Backend name, suport trtllm for now
--version VERSION Version, 0.20.0,1.0.0rc3 for trtllm
...more options...
``` ```
Models: GPT_7B, GPT_13B, GPT_30B, GPT_66B, GPT_175B, LLAMA2_7B, LLAMA2_13B, LLAMA2_70B, LLAMA3.1_8B, LLAMA3.1_70B, LLAMA3.1_405B, MOE_Mixtral8x7B, MOE_Mixtral8x22B, DEEPSEEK_V3, KIMI_K2, QWEN2.5_1.5B, QWEN2.5_7B, QWEN2.5_32B, QWEN2.5_72B, QWEN3_32B, QWEN3_235B, QWEN3_480B, Nemotron_super_v1.1
### Running the Script Systems: h100_sxm, h200_sxm
In addition to passing the `--use-ai-configurator` flag, you must also provide the `--aic-system`, `--aic-model-name`, and `--backend-version` arguments. Backends: trtllm (vllm and sglang support coming soon)
```
Example command: ### Running Fast Profiling
Example command for TensorRT-LLM:
```bash ```bash
python3 -m benchmarks.profiler.profile_sla \ python3 -m benchmarks.profiler.profile_sla \
--config ./components/backends/trtllm/deploy/disagg.yaml \ --config ./components/backends/trtllm/deploy/disagg.yaml \
...@@ -319,6 +276,11 @@ python3 -m benchmarks.profiler.profile_sla \ ...@@ -319,6 +276,11 @@ python3 -m benchmarks.profiler.profile_sla \
--aic-system h200_sxm \ --aic-system h200_sxm \
--aic-model-name QWEN3_32B \ --aic-model-name QWEN3_32B \
--backend trtllm \ --backend trtllm \
--backend-version 0.20.0 --backend-version 0.20.0 \
--isl 3000 \
--osl 150 \
--ttft 0.2 \
--itl 0.02
``` ```
The output will be written to `./profiling_results/`.
The output will be written to `./profiling_results/` and can be used directly with SLA planner deployment.
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Common:
model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
block-size: 64
max-model-len: 16384
kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
router: kv-load
Frontend:
served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
endpoint: dynamo.Processor.chat/completions
port: 8000
Processor:
common-configs: [model, block-size, router]
Router:
min-workers: 1
common-configs: [model, block-size, router]
VllmWorker:
remote-prefill: true
conditional-disagg: false
ServiceArgs:
workers: 1
resources:
gpu: 1
common-configs: [model, block-size, max-model-len, kv-transfer-config]
PrefillWorker:
max-num-batched-tokens: 16384
ServiceArgs:
workers: 1
resources:
gpu: 1
common-configs: [model, block-size, max-model-len, kv-transfer-config]
Planner:
environment: local
no-operation: false
metric-pulling-interval: 1
adjustment-interval: 10
prefill-queue-scale-down-threshold: 0.2
prefill-queue-scale-up-threshold: 10
decode-kv-scale-down-threshold: 0.3
decode-kv-scale-up-threshold: 0.6
log-dir: log/planner
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Common:
model: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
block-size: 64
max-model-len: 16384
kv-transfer-config: '{"kv_connector":"DynamoNixlConnector"}'
router: kv-load
Frontend:
served_model_name: deepseek-ai/DeepSeek-R1-Distill-Llama-8B
endpoint: dynamo.Processor.chat/completions
port: 8000
Processor:
router: kv-load
common-configs: [model, block-size, router]
Router:
min-workers: 1
common-configs: [model, block-size, router]
VllmWorker:
remote-prefill: true
conditional-disagg: false
ServiceArgs:
workers: 2
resources:
gpu: 1
common-configs: [model, block-size, max-model-len, kv-transfer-config]
PrefillWorker:
max-num-batched-tokens: 16384
ServiceArgs:
workers: 2
resources:
gpu: 1
common-configs: [model, block-size, max-model-len, kv-transfer-config]
Planner:
environment: local
no-operation: true
log-dir: log/2p2d
<!--
SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# Planner Benchmark Example
This guide shows an example of benchmarking `LocalPlanner` performance with synthetic data. In this example, we focus on 8x H100 SXM GPUs and the `deepseek-ai/DeepSeek-R1-Distill-Llama-8B` model with TP1 prefill and decode engines.
> [!WARNING]
> Bare metal deployment with local connector is deprecated. The only option to deploy planner is via k8s. We will update the examples in this document soon.
## Synthetic Data Generation
We first generate synthetic data with a request rate varying from 5 to 20 using the provided `sin_synth.py` script.
```bash
python sin_synth.py \
--time-duration 600 \
--request-rate-min 5 \
--request-rate-max 20 \
--request-rate-period 150 \
--isl1 3000 \
--osl1 150 \
--isl2 3000 \
--osl2 150
```
This generates a [Mooncake-style trace](https://github.com/kvcache-ai/Mooncake) with:
* duration = 600 seconds
* isl/osl = 3000/150
* request rate varies sinusoidally from 5 to 20 (matching `--request-rate-min` and `--request-rate-max`) with a period of 150 seconds
For other models and GPU SKUs, adjust the request rate ranges accordingly to match the load.
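When picking a range for another model or GPU SKU, it can help to see how the three rate flags shape the load. The sketch below is one plausible reading of a sinusoidal schedule, assuming the rate starts at the midpoint and swings between `--request-rate-min` and `--request-rate-max` once per `--request-rate-period`; the script's actual phase and waveform are not shown here, and `request_rate` is a hypothetical helper.

```python
import math


def request_rate(t: float, rate_min: float, rate_max: float, period: float) -> float:
    """Request rate at time t (seconds) for a sine wave between rate_min and rate_max."""
    midpoint = (rate_min + rate_max) / 2
    amplitude = (rate_max - rate_min) / 2
    return midpoint + amplitude * math.sin(2 * math.pi * t / period)


# Values taken from the command above: min 5, max 20, period 150 seconds.
peak = request_rate(37.5, 5, 20, 150)     # a quarter period in: the maximum
trough = request_rate(112.5, 5, 20, 150)  # three quarters in: the minimum
```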
## Run the Benchmark
To measure the performance of dynamo with planner, we start from a 1p1d deployment and set planner to make adjustments every 10 seconds:
```bash
# in terminal 1
# Start Kubernetes with one frontend node, one prefill and one decode worker
# TODO
# in terminal 2
genai-perf profile \
--tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
-m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--endpoint-type chat \
--url http://localhost:8000 \
--streaming \
--input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
```
To view the performance metrics and planner decisions, launch tensorboard with
```bash
tensorboard --logdir log
```
and open `http://localhost:6006` in your browser. The following metrics are available:
* `average_kv_load`: the average KV load in decode workers
* `prefill_queue_size`: the size of the prefill queue
* `num_queued_request`: the number of requests queued in decode workers
* `num_prefill_workers`: the number of prefill workers
* `num_decode_workers`: the number of decode workers
* `num_gpu`: the total number of GPUs used
The benchmark results are printed in terminal 2, which runs the `genai-perf` command.
In this example, we use a fixed 2p2d engine as baseline. Planner provides a `--no-operation` flag to watch and log the metrics without making any adjustments:
```bash
# in terminal 1
# Start Kubernetes with one frontend node, two prefill and two decode workers
# TODO
# in terminal 2
genai-perf profile \
--tokenizer deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
-m deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
--service-kind openai \
--endpoint-type chat \
--url http://localhost:8000 \
--streaming \
--input-file payload:sin_b512_t600_rr5.0-20.0-150.0_io3000150-3000150-0.2-0.8-10.jsonl
```
## Results
The two figures below show the performance comparison between planner and the baseline 2p2d deployment. Planner achieves a 1.5x speedup while using 7.4% fewer GPU resources.
![Two bar charts comparing 2P2D and Planner. Planner shows lower GPU usage and lower average sequence latency.](../../images/planner_perf.png)
![Planner Tensorboard; four line graphs comparing two runs: 2p2d_rr5-20_2 and planner_rr5-20.](../../images/planner_tensorboard.png)
...@@ -33,7 +33,6 @@ ...@@ -33,7 +33,6 @@
kubernetes/model_caching_with_fluid.md kubernetes/model_caching_with_fluid.md
kubernetes/README.md kubernetes/README.md
guides/dynamo_run.md guides/dynamo_run.md
kubernetes/sla_planner_deployment.md
guides/metrics.md guides/metrics.md
guides/run_kvbm_in_vllm.md guides/run_kvbm_in_vllm.md
guides/run_kvbm_in_trtllm.md guides/run_kvbm_in_trtllm.md
......
...@@ -72,7 +72,7 @@ Quickstart ...@@ -72,7 +72,7 @@ Quickstart
:caption: Developer Guide :caption: Developer Guide
Benchmarking Guide <benchmarks/benchmarking.md> Benchmarking Guide <benchmarks/benchmarking.md>
Planner Benchmark Example <guides/planner_benchmark/README.md> SLA Planner (Autoscaling) Quickstart <kubernetes/sla_planner_quickstart>
Logging <guides/logging.md> Logging <guides/logging.md>
Health Checks <guides/health_check.md> Health Checks <guides/health_check.md>
Tuning Disaggregated Serving Performance <guides/disagg_perf_tuning.md> Tuning Disaggregated Serving Performance <guides/disagg_perf_tuning.md>
......
...@@ -188,7 +188,7 @@ kubectl get pods -n ${NAMESPACE} ...@@ -188,7 +188,7 @@ kubectl get pods -n ${NAMESPACE}
3. **Optional:** 3. **Optional:**
- [Set up Prometheus & Grafana](metrics.md) - [Set up Prometheus & Grafana](metrics.md)
- [SLA Planner Deployment Guide](sla_planner_deployment.md) (for advanced SLA-aware scheduling and autoscaling) - [SLA Planner Quickstart Guide](sla_planner_quickstart.md) (for SLA-aware scheduling and autoscaling)
## Troubleshooting ## Troubleshooting
......
# SLA Planner Deployment Guide
Quick deployment guide for the disaggregated planner with automatic scaling.
> [!NOTE]
> For high-level architecture and concepts, see [SLA-based Planner](/docs/architecture/sla_planner.md).
## Architecture Overview
**Components:**
- **Frontend**: Serves requests and exposes `/metrics`
- **Prometheus**: Scrapes frontend metrics every 5s (by default, can be updated in the podmonitor manifest)
- **Planner**: Queries Prometheus and adjusts worker scaling every adjustment interval
- **Workers**: prefill and backend workers handle inference
The adjustment interval can be defined in the planner manifest as an argument. The default interval value can be found in this [file](/components/planner/src/dynamo/planner/defaults.py).
```mermaid
flowchart LR
Frontend --"/metrics"--> Prometheus
Planner --"query API"--> Prometheus
Planner --"scaling decisions"--> Workers
Frontend -.->|"requests"| Workers
```
## Prerequisites
- Kubernetes cluster with GPU nodes
- [Pre-Deployment Profiling](/docs/benchmarks/pre_deployment_profiling.md) completed and its results saved to `dynamo-pvc` PVC.
- Prefill and decode workers use the best parallelization mapping suggested by the pre-deployment profiling script.
- [kube-prometheus-stack](/docs/kubernetes/metrics.md) installed and running. By default, the Prometheus server is expected in the `monitoring` namespace. If it is deployed to a different namespace, set `dynamo-operator.dynamo.metrics.prometheusEndpoint="http://prometheus-kube-prometheus-prometheus.<namespace>.svc.cluster.local:9090"`.
> [!NOTE]
> **Important**: The profiling that occurs before Planner deployment requires additional Kubernetes manifests (ServiceAccount, Role, RoleBinding, PVC) that are not included in standard Dynamo deployments. Apply these manifests in the same namespace as `$NAMESPACE`. For a complete setup, start with the [Quick Start guide](/deploy/utils/README.md#quick-start), which provides a fully encapsulated deployment including all required manifests.
```bash
export NAMESPACE=your-namespace
```
## 1. Deploy the System
We use vLLM as the backend engine in this guide. SLA planner also supports SGLang and TensorRT-LLM. Check out `disagg_planner.yaml` in their example deployment folders for more details. The deployment is the same for all backends.
```bash
# Apply the disaggregated planner deployment
kubectl apply -f components/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE # for vllm
kubectl apply -f components/backends/sglang/deploy/disagg_planner.yaml -n $NAMESPACE # for sglang
kubectl apply -f components/backends/trtllm/deploy/disagg_planner.yaml -n $NAMESPACE # for trtllm
# Check deployment status
kubectl get pods -n $NAMESPACE
```
Expected pods (all should be `1/1 Running`):
```
# For vLLM:
vllm-disagg-planner-frontend-* 1/1 Running
vllm-disagg-planner-planner-* 1/1 Running
vllm-disagg-planner-backend-* 1/1 Running
vllm-disagg-planner-prefill-* 1/1 Running
```
## 2. Test the System
```bash
# Port forward to frontend
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000
# Send a request
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "In the heart of Eldoria, an ancient land of boundless magic and mysterious creatures, lies the long-forgotten city of Aeloria. Once a beacon of knowledge and power, Aeloria was buried beneath the shifting sands of time, lost to the world for centuries. You are an intrepid explorer, known for your unparalleled curiosity and courage, who has stumbled upon an ancient map hinting at ests that Aeloria holds a secret so profound that it has the potential to reshape the very fabric of reality. Your journey will take you through treacherous deserts, enchanted forests, and across perilous mountain ranges. Your Task: Character Background: Develop a detailed background for your character. Describe their motivations for seeking out Aeloria, their skills and weaknesses, and any personal connections to the ancient city or its legends. Are they driven by a quest for knowledge, a search for lost familt clue is hidden."
}
],
"stream":true,
"max_tokens": 30
}'
```
## 3. Monitor Scaling
```bash
# Check planner logs for scaling decisions
kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-planner --tail=10
# Expected successful output (after streaming requests):
# New adjustment interval started!
# Observed num_req: X.XXX isl: X.XXX osl: X.XXX
# Observed ttft: X.XXXs itl: X.XXXs
# Number of prefill workers: 1, number of decode workers: 1
```
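If you want to track scaling decisions over time rather than read them by eye, the worker counts can be pulled out of the log text. This is a small sketch that assumes the log line format shown in the expected output above stays stable; `worker_counts` is a hypothetical helper, not part of Dynamo.

```python
import re
from typing import Optional, Tuple

# Matches the "Number of prefill workers: N, number of decode workers: M" line.
_WORKER_LINE = re.compile(
    r"Number of prefill workers: (\d+), number of decode workers: (\d+)"
)


def worker_counts(log_text: str) -> Optional[Tuple[int, int]]:
    """Return (prefill, decode) from the most recent matching line, or None."""
    matches = _WORKER_LINE.findall(log_text)
    if not matches:
        return None
    prefill, decode = matches[-1]
    return int(prefill), int(decode)
```

The text passed in would be the output of the `kubectl logs` command above.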
### Metrics Requirements
- **Basic metrics** (request count): Available with any request type
- **Latency metrics** (TTFT/ITL): Available for both streaming and non-streaming requests
- **Scaling decisions**: Require sufficient request volume
## 4. Troubleshooting
**Connection Issues:**
```bash
# Verify Prometheus is accessible
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090
curl "http://localhost:9090/api/v1/query?query=up"
```
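The same check can be scripted. The sketch below builds the instant-query URL used by the `curl` command and reads values out of a response body; it assumes the standard Prometheus HTTP API response shape for a `vector` result and does not handle other result types. `build_query_url` and `instant_values` are hypothetical helper names.

```python
import json
from typing import Dict, List, Tuple
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Return the /api/v1/query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def instant_values(response_text: str) -> List[Tuple[Dict[str, str], float]]:
    """Extract (labels, value) pairs from an instant-query vector response."""
    payload = json.loads(response_text)
    if payload.get("status") != "success":
        raise RuntimeError(f"Prometheus query failed: {payload.get('error')}")
    # Each sample's "value" is [timestamp, "value-as-string"].
    return [
        (sample["metric"], float(sample["value"][1]))
        for sample in payload["data"]["result"]
    ]
```

A value of `1.0` for the `up` query means Prometheus considers that scrape target reachable.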
**Missing Metrics:**
```bash
# Check frontend metrics
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000
curl http://localhost:8000/metrics | grep nv_llm_http_service
```
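The `grep` above can be mirrored in a script when you want the values as numbers. This is a minimal sketch of parsing the Prometheus text exposition format, assuming samples carry no trailing timestamp; `metrics_with_prefix` is a hypothetical helper, and the metric name in the usage comment is illustrative rather than a confirmed Dynamo metric.

```python
from typing import Dict


def metrics_with_prefix(text: str, prefix: str) -> Dict[str, float]:
    """Map each sample (name plus labels) starting with `prefix` to its value."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Split on the last space so label values containing spaces stay intact.
        name_and_labels, _, value = line.rpartition(" ")
        if name_and_labels.startswith(prefix):
            samples[name_and_labels] = float(value)
    return samples


# Usage (illustrative metric name):
# metrics_with_prefix(body, "nv_llm_http_service")
```

An empty result for the `nv_llm_http_service` prefix points to the same problem as an empty `grep`: the frontend is not exporting the metrics the planner needs.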
**Worker Issues:**
- Large models can take 10+ minutes to initialize
- Check worker logs: `kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-backend`
- Ensure GPU resources are available for workers
**Unknown Field subComponentType:**
If you encounter the following error when attempting to apply the deployment:
```bash
Error from server (BadRequest): error when creating "components/backends/vllm/deploy/disagg.yaml": DynamoGraphDeployment in version "v1alpha1" cannot be handled as a DynamoGraphDeployment: strict decoding error: unknown field "spec.services.DecodeWorker.subComponentType", unknown field "spec.services.PrefillWorker.subComponentType"
```
This error occurs because the `subComponentType` field exists only in newer versions of the DynamoGraphDeployment CRD (> 0.5.0). To resolve it, upgrade the CRD by following the [installation guide](/docs/kubernetes/installation_guide.md).
# SLA Planner Quick Start Guide
This guide walks through the complete workflow for deploying SLA-based autoscaling for Dynamo deployments, consolidating all necessary steps into a single sequential process.
> [!IMPORTANT]
> **Prerequisites**: This guide assumes you have a Kubernetes cluster with GPU nodes and have completed the [Dynamo Platform installation](/docs/kubernetes/installation_guide.md).
## Overview
The SLA Planner automatically scales prefill and decode workers to meet your TTFT (Time To First Token) and ITL (Inter-Token Latency) targets.
The deployment process consists of two mandatory phases:
1. **Pre-Deployment Profiling** (2-4 hours) - Generates performance data
2. **SLA Planner Deployment** (5-10 minutes) - Enables autoscaling
> [!TIP]
> **Fast Profiling with AI Configurator**: For TensorRT-LLM users, we provide AI Configurator (AIC), which can complete profiling in 20-30 seconds by using performance simulation instead of real deployments. Support for vLLM and SGLang is coming soon. See the [AI Configurator section](/docs/benchmarks/pre_deployment_profiling.md#running-the-profiling-script-with-aiconfigurator) in the Profiling Guide.
```mermaid
flowchart TD
A[Start Setup] --> B{Profiling Done?}
B -->|No| C[Run Profiling<br/>2-4 hours]
C --> D[Verify Results]
D --> E[Deploy Planner<br/>5-10 minutes]
B -->|Yes| E
E --> F[Test System]
F --> G[Ready!]
style A fill:#e1f5fe
style C fill:#fff3e0
style E fill:#e8f5e8
style G fill:#f3e5f5
style B fill:#fff8e1
```
## Phase 1: Pre-Deployment Profiling (REQUIRED)
> [!WARNING]
> **MANDATORY**: Pre-deployment profiling must be completed before deploying SLA planner. This process analyzes your model's performance characteristics to determine optimal tensor parallelism configurations and scaling parameters.
### Step 1.1: Set Up Profiling Environment
Set up your Kubernetes namespace for profiling (one-time per namespace). If your namespace is already set up, skip this step.
```bash
export NAMESPACE=your-namespace
```
**Prerequisites**: Ensure all dependencies are installed:
```bash
pip install -r deploy/utils/requirements.txt
```
### Step 1.2: Inject Your Configuration
Use the injector utility to place your DGD manifest into the PVC:
```bash
# Use default disagg.yaml config
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src components/backends/vllm/deploy/disagg.yaml --dest /data/configs/disagg.yaml
# Or use a custom disagg config file
python3 -m deploy.utils.inject_manifest --namespace $NAMESPACE --src my-custom-disagg.yaml --dest /data/configs/disagg.yaml
```
> **Note**: The `--dest` path inside the PVC must start with `/data/` for security reasons.
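
Since destination paths must start with `/data/`, it can help to check the path before calling the injector. The following is a minimal sketch; `check_dest_path` is a hypothetical helper, not part of `deploy.utils`, and it only checks the prefix:

```bash
# Hypothetical helper (not part of deploy.utils): succeed only when the
# destination path begins with /data/ and names something beneath it.
check_dest_path() {
  case "$1" in
    /data/?*) return 0 ;;
    *) echo "destination must start with /data/: $1" >&2; return 1 ;;
  esac
}
```

For example, `check_dest_path /data/configs/disagg.yaml && python3 -m deploy.utils.inject_manifest ...` would skip the injection when the destination is malformed.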
### Step 1.3: Configure SLA Targets
For dense models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_job.yaml`:
```yaml
spec:
template:
spec:
containers:
- name: profile-sla
args:
- --isl
- "3000" # average ISL is 3000 tokens
- --osl
- "150" # average OSL is 150 tokens
- --ttft
- "200" # target TTFT is 200ms
- --itl
- "20" # target ITL is 20ms
- --backend
- <vllm/sglang>
```
For MoE models, edit `$DYNAMO_HOME/benchmarks/profiler/deploy/profile_sla_moe_job.yaml` instead.
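As a rough sanity check on these targets, the end-to-end latency of a single request can be approximated as the TTFT plus one ITL for every output token after the first. This is a common back-of-the-envelope estimate, not something the profiler is documented to compute; `estimate_e2e_ms` is a hypothetical helper:

```bash
# Hypothetical helper: approximate end-to-end latency in milliseconds.
# Arguments: target TTFT (ms), target ITL (ms), average OSL (tokens).
estimate_e2e_ms() {
  echo $(( $1 + ($3 - 1) * $2 ))
}
```

With the example targets above, `estimate_e2e_ms 200 20 150` gives 200 + 149 × 20 = 3180 ms, so a typical request would take a little over three seconds if both SLAs are just met.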
### Step 1.4: Run Profiling
Set the container image and config path:
```bash
export DOCKER_IMAGE=nvcr.io/nvidia/ai-dynamo/vllm-runtime:my-tag
export DGD_CONFIG_FILE=/data/configs/disagg.yaml
```
Run profiling:
```bash
# Apply only the one manifest that matches your setup.
# Dense models:
envsubst < benchmarks/profiler/deploy/profile_sla_job.yaml | kubectl apply -f -
# MoE models:
envsubst < benchmarks/profiler/deploy/profile_sla_moe_job.yaml | kubectl apply -f -
# AI Configurator instead of a real sweep (see the AI Configurator tip in the Overview):
envsubst < benchmarks/profiler/deploy/profile_sla_aic_job.yaml | kubectl apply -f -
```
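Only one of these manifests should be applied for a given run. One way to make that choice explicit is a small selector function; this is a sketch, and `profile_manifest_for` and its mode names (`dense`, `moe`, `aic`) are hypothetical, while the manifest paths are the ones used in this step:

```bash
# Hypothetical helper: map a profiling mode to the job manifest for this step.
profile_manifest_for() {
  case "$1" in
    dense) echo benchmarks/profiler/deploy/profile_sla_job.yaml ;;
    moe)   echo benchmarks/profiler/deploy/profile_sla_moe_job.yaml ;;
    aic)   echo benchmarks/profiler/deploy/profile_sla_aic_job.yaml ;;
    *)     echo "unknown mode: $1 (expected dense, moe, or aic)" >&2; return 1 ;;
  esac
}
```

Usage: `envsubst < "$(profile_manifest_for dense)" | kubectl apply -f -`.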
### Step 1.5: Monitor Profiling Progress
```bash
kubectl get jobs -n $NAMESPACE
kubectl logs job/profile-sla -n $NAMESPACE
```
> [!NOTE]
> **Time Investment**: This profiling process is comprehensive and typically takes **2-4 hours** to complete. The script systematically tests multiple tensor parallelism configurations and load conditions to find optimal performance settings.
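
Rather than polling by hand, you can block until the job finishes with `kubectl wait`. The sketch below assumes the job is named `profile-sla`, as in the log command above; the `wait_for_profiling` wrapper and its 5h default timeout are assumptions chosen to leave headroom over the typical 2-4 hour run:

```bash
# Hypothetical helper: block until the profile-sla job reports completion.
# The first argument overrides the assumed default timeout of 5h.
wait_for_profiling() {
  kubectl wait --for=condition=complete job/profile-sla \
    -n "$NAMESPACE" --timeout="${1:-5h}"
}
```

`kubectl wait` exits non-zero on timeout. A job that fails never reaches the `complete` condition, so check `kubectl get jobs -n $NAMESPACE` if the wait appears to hang.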
### Step 1.6: Download Profiling Results (Optional)
If you want to view the profiling results and performance plots:
```bash
# Download to directory
python3 -m deploy.utils.download_pvc_results --namespace $NAMESPACE --output-dir ./results --folder /data/profiling_results
```
For detailed information about the output structure, performance plots, and how to analyze the results, see the [Viewing Profiling Results](/docs/benchmarks/pre_deployment_profiling.md#viewing-profiling-results) section in the Profiling Guide.
**Verify Success**: Look for terminal output like:
```
Suggested prefill TP:4 (TTFT 48.37 ms, throughput 15505.23 tokens/s/GPU)
Suggested decode TP:4 (ITL 4.83 ms, throughput 51.22 tokens/s/GPU)
```
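If you want to reuse the suggested values in a script, they can be extracted from the log text. This is a sketch that assumes the lines keep the `Suggested <role> TP:<n>` shape shown above; `suggested_tp` is a hypothetical helper:

```bash
# Hypothetical helper: read profiler output on stdin and print the suggested
# tensor-parallel size for the given role (prefill or decode).
# If the line appears more than once, the last occurrence wins.
suggested_tp() {
  sed -n "s/.*Suggested $1 TP:\([0-9][0-9]*\).*/\1/p" | tail -n 1
}
```

Usage: `kubectl logs job/profile-sla -n $NAMESPACE | suggested_tp prefill`.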
## Phase 2: Deploy SLA Planner
### Step 2.1: Verify Prerequisites
Before deploying the SLA planner, ensure:
- **Pre-deployment profiling completed successfully** (from Phase 1)
- **Profiling results saved to `dynamo-pvc` PVC**
- **[kube-prometheus-stack](/docs/kubernetes/metrics.md) installed and running.** By default, the Prometheus server is expected in the `monitoring` namespace. If it is deployed to a different namespace, set `dynamo-operator.dynamo.metrics.prometheusEndpoint="http://prometheus-kube-prometheus-prometheus.<namespace>.svc.cluster.local:9090"`.
- **Dynamo platform installed** (see [Installation Guide](/docs/kubernetes/installation_guide.md))
- **Prefill and decode workers use the best parallelization mapping from profiling**
### Step 2.2: Deploy the System
This guide uses vLLM as the backend engine. The SLA planner also supports SGLang and TensorRT-LLM; apply only the manifest for the backend you are using.
```bash
# Apply the disaggregated planner deployment for your backend (run only one of the three)
kubectl apply -f components/backends/vllm/deploy/disagg_planner.yaml -n $NAMESPACE # for vllm
kubectl apply -f components/backends/sglang/deploy/disagg_planner.yaml -n $NAMESPACE # for sglang
kubectl apply -f components/backends/trtllm/deploy/disagg_planner.yaml -n $NAMESPACE # for trtllm
# Check deployment status
kubectl get pods -n $NAMESPACE
```
**Expected pods** (all should be `1/1 Running`):
```
vllm-disagg-planner-frontend-* 1/1 Running
vllm-disagg-planner-planner-* 1/1 Running
vllm-disagg-planner-backend-* 1/1 Running
vllm-disagg-planner-prefill-* 1/1 Running
```
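The readiness check can also be scripted. The sketch below reads `kubectl get pods --no-headers` output and assumes the `vllm-disagg-planner-` pod name prefix shown above; `all_planner_pods_ready` is a hypothetical helper:

```bash
# Hypothetical helper: read `kubectl get pods --no-headers` output on stdin.
# Fails if no vllm-disagg-planner pod is listed, or if any of them is not
# "1/1 Running"; prints the name of each pod that is not ready.
all_planner_pods_ready() {
  awk '
    $1 ~ /^vllm-disagg-planner-/ {
      seen++
      if ($2 != "1/1" || $3 != "Running") { print "not ready: " $1; bad++ }
    }
    END { exit (seen == 0 || bad > 0) ? 1 : 0 }
  '
}
```

Usage: `kubectl get pods -n $NAMESPACE --no-headers | all_planner_pods_ready`.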
### Step 2.3: Test the System
```bash
# Port forward to frontend
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000
# Send a request
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-0.6B",
"messages": [
{
"role": "user",
"content": "Hello, how are you?"
}
],
"stream":true,
"max_tokens": 30
}'
```
### Step 2.4: Monitor Scaling
```bash
# Check planner logs for scaling decisions
kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-planner --tail=10
```
**Expected successful output** (after streaming requests):
```
New adjustment interval started!
Observed num_req: X.XXX isl: X.XXX osl: X.XXX
Observed ttft: X.XXXs itl: X.XXXs
Number of prefill workers: 1, number of decode workers: 1
```
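To track scaling over time, the worker counts can be pulled out of the planner logs. This is a sketch that assumes the log line keeps the exact wording shown above; `latest_worker_counts` is a hypothetical helper:

```bash
# Hypothetical helper: read planner logs on stdin and print the most recent
# worker counts as "<prefill> <decode>".
latest_worker_counts() {
  sed -n 's/.*Number of prefill workers: \([0-9][0-9]*\), number of decode workers: \([0-9][0-9]*\).*/\1 \2/p' | tail -n 1
}
```

Usage: `kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-planner | latest_worker_counts`.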
## Phase 3: Production Readiness
### Monitoring Metrics
- **Basic metrics** (request count): Available with any request type
- **Latency metrics** (TTFT/ITL): Available for both streaming and non-streaming requests
- **Scaling decisions**: Require sufficient request volume
### Troubleshooting
**Connection Issues:**
```bash
# Verify Prometheus is accessible
kubectl port-forward svc/prometheus-kube-prometheus-prometheus -n monitoring 9090:9090
curl "http://localhost:9090/api/v1/query?query=up"
```
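The Prometheus HTTP API wraps every response in a JSON envelope with a `status` field, so the query above can be turned into a pass/fail check. This is a minimal sketch; `prometheus_query_ok` is a hypothetical helper that only inspects the envelope, not the query results:

```bash
# Hypothetical helper: read a Prometheus HTTP API response on stdin and
# succeed only if its envelope reports "status": "success".
prometheus_query_ok() {
  grep -Eq '"status"[[:space:]]*:[[:space:]]*"success"'
}
```

Usage: `curl -s "http://localhost:9090/api/v1/query?query=up" | prometheus_query_ok && echo reachable`. A successful status with an empty result list means Prometheus is reachable but is not scraping any targets.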
**Missing Metrics:**
```bash
# Check frontend metrics
kubectl port-forward -n $NAMESPACE deployment/vllm-disagg-planner-frontend 8000:8000
curl http://localhost:8000/metrics | grep nv_llm_http_service
```
**Worker Issues:**
- Large models can take 10+ minutes to initialize
- Check worker logs: `kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-backend`
- Ensure GPU resources are available for workers
**Unknown Field subComponentType:**
If you encounter the following error when applying the deployment:
```bash
Error from server (BadRequest): error when creating "components/backends/vllm/deploy/disagg.yaml": DynamoGraphDeployment in version "v1alpha1" cannot be handled as a DynamoGraphDeployment: strict decoding error: unknown field "spec.services.DecodeWorker.subComponentType", unknown field "spec.services.PrefillWorker.subComponentType"
```
This error occurs because the `subComponentType` field exists only in newer versions of the DynamoGraphDeployment CRD (> 0.5.0). To resolve it, upgrade the CRD by following the [installation guide](/docs/kubernetes/installation_guide.md).
## Next Steps
- **Architecture Details**: See [SLA-based Planner Architecture](/docs/architecture/sla_planner.md) for technical details
- **Performance Tuning**: See [Pre-Deployment Profiling Guide](/docs/benchmarks/pre_deployment_profiling.md) for advanced profiling options
- **Load Testing**: See [SLA Planner Load Test](/tests/planner/README.md) for comprehensive testing tools
## Quick Reference
| Phase | Duration | Purpose | Status Check |
|-------|----------|---------|--------------|
| Profiling | 2-4 hours | Generate performance data | `kubectl logs job/profile-sla -n $NAMESPACE` |
| Deployment | 5-10 minutes | Enable autoscaling | `kubectl get pods -n $NAMESPACE` |
| Testing | 5 minutes | Verify functionality | `kubectl logs -n $NAMESPACE deployment/vllm-disagg-planner-planner` |
---
> [!TIP]
> **Need Help?** If you encounter issues, check the [troubleshooting section](#troubleshooting) or refer to the detailed guides linked in [Next Steps](#next-steps).
\ No newline at end of file