Commit 66f7832a authored by Hongkuan Zhou, committed by GitHub

feat(planner): unify throughput and load scaling on FPM regression (#7961)

parent 0b7a18ce
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# SLA Planner Load Test
......@@ -19,52 +15,13 @@ You have two options to obtain the pre-deployment profiling data:
### Option A: Use Test Configuration (Quickstart)
Use the pre-configured test deployment with sample profiling data. We provide the results and the deployment configuration for the following model and hardware configurations:
- `nvidia/Llama-3.1-8B-Instruct-FP8` on H200 with max context length 16384, TP1 Prefill, and TP1 Decode. At ISL/OSL 3000/150, it achieves 40k tokens/s/gpu prefill with 80ms TTFT and 10k tokens/s/gpu decode with 10ms ITL. See `../tests/data/profiling_results/H200_TP1P_TP1D/`.
### Option B: Use Your Own Profiling Results
1. Run pre-deployment profiling for your specific setup. See the [pre-deployment profiling documentation](../../../../../../docs/components/profiler/profiler-guide.md) for detailed instructions.
## Interpolator Testing
SLA planner uses two interpolators to estimate the performance of prefill and decode. You can test the interpolators with the following command:
```bash
python components/src/dynamo/planner/core/throughput/interpolation.py \
--profile_results_dir <path_to_profile_results> \
--isl <ISL> \
--osl <OSL> \
--ttft <TTFT(ms)> \
--itl <ITL(ms)>
```
The script interpolates from the ISL, OSL, and the TTFT and ITL SLAs, then reports the load that would saturate the engine.
For example, to test the interpolator for `nvidia/Llama-3.1-8B-Instruct-FP8` on H200 (target TTFT=200ms, ITL=10ms):
```bash
python components/src/dynamo/planner/core/throughput/interpolation.py \
--profile_results_dir components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/ \
--isl 3000 \
--osl 300 \
--ttft 200 \
--itl 10
# output:
ISL=3000, OSL=300
TTFT=200ms, ITL=10ms
Using profile results from components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/
Interpolating prefill performance ...
Estimated TTFT=60.00ms <= target TTFT=200.00ms. Requests can queue 140.00ms maximally while meeting TTFT SLA.
Estimated throughput: 49481.09 tokens/s/gpu. Request rate at 16.49 requests/s will saturate one GPU.
Interpolating decode performance ...
Average context length: isl + osl/2 = 3150.
Estimated ITL=9.70ms <= target ITL=10.00ms at 16.16% active kv usage.
Estimated throughput: 4555.68 token/s/gpu. Request rate at 15.19 requests/s will saturate one GPU.
```
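The saturating request rates in the output above follow from dividing the estimated per-GPU throughput by the tokens each request contributes (ISL for prefill, OSL for decode). The sketch below reproduces that arithmetic using only the numbers printed above. The function name is illustrative and is not part of the planner.

```python
def saturating_request_rate(throughput_tokens_per_s: float, tokens_per_request: int) -> float:
    """Requests/s at which one GPU is saturated, given its token throughput."""
    return throughput_tokens_per_s / tokens_per_request


isl, osl = 3000, 300

# Prefill processes ISL tokens per request; decode generates OSL tokens per request.
prefill_rate = saturating_request_rate(49481.09, isl)
decode_rate = saturating_request_rate(4555.68, osl)

# Average context length seen by decode over a request's lifetime.
avg_context_len = isl + osl / 2

# Time a request may wait in the queue while still meeting the TTFT target.
queue_budget_ms = 200.0 - 60.0

print(f"prefill saturates at {prefill_rate:.2f} requests/s")
print(f"decode saturates at {decode_rate:.2f} requests/s")
print(f"average context length: {avg_context_len:.0f}")
print(f"queueing budget: {queue_budget_ms:.2f} ms")
```

The two rates round to the 16.49 and 15.19 requests/s reported in the log.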
## Generating Load Dataset
We provide a tool to generate a load dataset with a varying request rate. More details can be found in [sin_load_generator](../../../../../../benchmarks/sin_load_generator/README.md).
......@@ -89,36 +46,6 @@ python benchmarks/sin_load_generator/sin_synth.py \
The dataset starts at 5 requests/s, increases to 45 requests/s at t=300s, decreases back to 5 requests/s at t=600s, and repeats.
The total duration is 30 minutes or 1800 seconds.
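To make the shape of this load concrete, the sketch below models the request rate as a sinusoid with a 600-second period that matches the three points described above. The exact waveform and phase used by `sin_synth.py` are not shown here, so the formula and the function name `request_rate` are assumptions for illustration only.

```python
import math


def request_rate(t_s: float, min_rate: float = 5.0, max_rate: float = 45.0,
                 period_s: float = 600.0) -> float:
    """Assumed sinusoidal request rate (requests/s) at time t_s seconds."""
    midpoint = (min_rate + max_rate) / 2
    amplitude = (max_rate - min_rate) / 2
    # Negative cosine starts at the minimum and peaks at half the period.
    return midpoint - amplitude * math.cos(2 * math.pi * t_s / period_s)


# Sample the assumed curve at the trough, the peak, and the next trough.
for t in (0, 300, 600):
    print(t, round(request_rate(t), 2))

# Number of full cycles in the 1800-second dataset.
num_cycles = 1800 / 600
```

Under this assumption the 30-minute dataset covers three full cycles.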
## Planner Dry Run
Before testing SLA planner on real deployments, we provide a dry run feature to test the autoscaling behavior on a given dataset. Specifically, in dry run mode:
- The load predictor will be tested. However, the load metrics will differ from those of a real deployment because the actual OSL is only known after the requests are processed.
- There will be no SLA predictions. Instead, SLA planner will show the safe throughput limit that ensures the requests can be processed within the SLA.
- The correction factor will be disabled because there are no SLA metrics to use as a reference.
To dry run SLA planner:
```bash
python components/src/dynamo/planner/tests/manual/unit/planner_sla_dryrun.py \
--config '{"environment":"kubernetes","backend":"vllm","ttft":200,"itl":10,"profile_results_dir":"components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D","throughput_adjustment_interval":60,"no_correction":true}' \
--dataset rr-5-45_i3000o300.jsonl \
--start-num-p 1 \
--start-num-d 1 \
--output-plot dryrun_plot.png
```
Below is the dryrun result:
![Dryrun Plot](./figures/dryrun_plot.png)
The first plot shows the actual request rate and the predicted request rate (in the unit of requests/adjustment_interval).
The second plot shows the actual ISL/OSL and the predicted ISL/OSL. The first two plots are useful when tuning the performance of the load predictor.
The third plot shows the actual prefill throughput, the number of prefill workers that planner scales to, and the safe throughput limit for that number of prefill workers. If the actual throughput is below the safe throughput limit, the deployment has the capacity to adhere to the TTFT SLA. Note that in a real deployment, other factors such as queueing, load balancing, KV cache transfer latency, and ISL variance mean there is no guarantee that the deployment adheres to the TTFT SLA.
The fourth plot, similar to the third plot, shows the actual decode throughput, the number of decode workers that planner scales to, and the safe throughput limit for that number of decode workers. If the actual throughput is below the safe throughput limit, the deployment has the capacity to adhere to the ITL SLA. Note that in a real deployment, other factors such as load balancing and OSL variance mean there is no guarantee that the deployment adheres to the ITL SLA.
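The relationship between load, worker count, and the safe throughput limit can be sketched as follows. This is not the planner's actual replica calculation. It assumes one GPU per worker (TP1), treats the per-GPU throughputs from the interpolator example in this README as the safe per-worker limits, and uses hypothetical function names.

```python
import math


def workers_needed(request_rate: float, tokens_per_request: int,
                   safe_tokens_per_s_per_worker: float) -> int:
    """Smallest worker count whose combined safe limit covers the token demand."""
    demand_tokens_per_s = request_rate * tokens_per_request
    return max(1, math.ceil(demand_tokens_per_s / safe_tokens_per_s_per_worker))


def has_capacity(actual_tokens_per_s: float, num_workers: int,
                 safe_tokens_per_s_per_worker: float) -> bool:
    """True when actual throughput stays at or below the safe limit for num_workers."""
    return actual_tokens_per_s <= num_workers * safe_tokens_per_s_per_worker


# Peak and trough of the dry run dataset (45 and 5 requests/s, ISL 3000, OSL 300).
peak_prefill = workers_needed(45, 3000, 49481.09)
peak_decode = workers_needed(45, 300, 4555.68)
trough_prefill = workers_needed(5, 3000, 49481.09)
trough_decode = workers_needed(5, 300, 4555.68)
print(peak_prefill, peak_decode, trough_prefill, trough_decode)
```

Under these assumptions the peak load needs three prefill and three decode workers, while the trough fits on one of each.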
## Scaling Tests
This directory contains comprehensive tests for validating the SLA planner's scaling behavior. The tests validate both the replica calculation logic and end-to-end scaling behavior. The scaling test uses a graduated load approach rather than dataset files, as it proved more reliable for metric generation and scaling triggers.
......@@ -132,6 +59,7 @@ This directory contains comprehensive tests for validating the SLA planner's sca
### Quick Start for Unit Tests and End-to-End Tests
#### Run Unit Tests Only
Test the replica calculation logic without requiring Kubernetes:
```bash
......@@ -175,6 +103,7 @@ components/src/dynamo/planner/tests/manual/scaling/run_scaling_test.sh --namespa
### Instructions for End-to-End Perf Tests
In this test, we compare performance (goodput and goodput/GPU) across the following four deployments, using the aforementioned 8B FP8 model on H200 and the dataset used in the dry run:
- Config 1 with inefficient P/D ratio: 3xTP1P_1xTP1D_4GPU
`./perf_test_configs/disagg_8b_3p1d.yaml`
- Config 2 with best static deployment: 2xTP1P_2xTP1D_4GPU
......@@ -214,12 +143,13 @@ aiperf profile \
#### E2E Perf Test Results
![Results](./figures/sla_planner_perf.png)
Results
The table below shows the performance improvement of SLA planner across different deployment configurations:
| Baseline | Goodput Improvement | Goodput/GPU Improvement |
|---------------|-----------------|-------------------------|
| Inefficient P/D ratio | 725% | 600% |
| Inefficient parallelization mapping | 311% | 249% |
| Best static deployment | 52% | 29% |
| Baseline | Goodput Improvement | Goodput/GPU Improvement |
| ----------------------------------- | ------------------- | ----------------------- |
| Inefficient P/D ratio | 725% | 600% |
| Inefficient parallelization mapping | 311% | 249% |
| Best static deployment | 52% | 29% |
......@@ -82,7 +82,7 @@ spec:
- dynamo.planner
args:
- --config
- '{"environment": "kubernetes", "backend": "vllm", "ttft": 200, "itl": 10, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/", "throughput_adjustment_interval": 60, "metric_reporting_prometheus_port": 9085, "no_correction": true}'
- '{"environment": "kubernetes", "backend": "vllm", "ttft": 200, "itl": 10, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/", "throughput_adjustment_interval": 60, "metric_reporting_prometheus_port": 9085}'
VllmDecodeWorker:
envFromSecret: hf-token-secret
componentType: worker
......
......@@ -25,7 +25,7 @@ spec:
- dynamo.planner
args:
- --config
- '{"environment": "kubernetes", "backend": "vllm", "throughput_adjustment_interval": 60, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D", "no_correction": true}'
- '{"environment": "kubernetes", "backend": "vllm", "throughput_adjustment_interval": 60, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D"}'
VllmDecodeWorker:
envFromSecret: hf-token-secret
componentType: worker
......
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import argparse
import logging
from dynamo.planner.config.planner_config import PlannerConfig
from dynamo.planner.offline.dryrun import run_sla_planner_dryrun
logger = logging.getLogger(__name__)
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Planner Dryrun")
    parser.add_argument(
        "--config",
        required=True,
        help="JSON string or path to a JSON/YAML config file",
    )
    parser.add_argument(
        "--dataset", type=str, required=True, help="Path to the jsonl dataset file"
    )
    parser.add_argument(
        "--start-num-p",
        type=int,
        default=1,
        help="Number of prefill workers to start with",
    )
    parser.add_argument(
        "--start-num-d",
        type=int,
        default=1,
        help="Number of decode workers to start with",
    )
    parser.add_argument(
        "--output-plot",
        type=str,
        default="dryrun_plot.png",
        help="Path to the output plot file",
    )
    args = parser.parse_args()
    config = PlannerConfig.from_config_arg(args.config)
    run_sla_planner_dryrun(
        config,
        dataset=args.dataset,
        start_num_p=args.start_num_p,
        start_num_d=args.start_num_d,
        output_plot=args.output_plot,
    )
......@@ -87,3 +87,36 @@ def test_throughput_metrics_source_invalid():
    """throughput_metrics_source rejects invalid values."""
    with pytest.raises(ValidationError):
        PlannerConfig(namespace="test-ns", throughput_metrics_source="invalid")


@pytest.mark.parametrize("bucket_size", [1, 4, 9, 16, 25])
def test_fpm_sample_bucket_size_accepts_perfect_squares(bucket_size):
    """fpm_sample_bucket_size must be a perfect square (valid values)."""
    config = PlannerConfig(namespace="test-ns", fpm_sample_bucket_size=bucket_size)
    assert config.fpm_sample_bucket_size == bucket_size


@pytest.mark.parametrize("bucket_size", [2, 3, 5, 7, 10])
def test_fpm_sample_bucket_size_rejects_non_squares(bucket_size):
    """fpm_sample_bucket_size rejects values that are not perfect squares."""
    with pytest.raises(ValidationError, match="perfect square"):
        PlannerConfig(namespace="test-ns", fpm_sample_bucket_size=bucket_size)


def test_max_num_fpm_samples_field():
    """max_num_fpm_samples configures the FPM sample retention (formerly load_learning_window)."""
    config = PlannerConfig(namespace="test-ns", max_num_fpm_samples=100)
    assert config.max_num_fpm_samples == 100


def test_agg_mode_supports_throughput_scaling():
    """Agg mode supports throughput-based scaling."""
    config = PlannerConfig(
        namespace="test-ns",
        mode="agg",
        enable_throughput_scaling=True,
        enable_load_scaling=False,
    )
    assert config.mode == "agg"
    assert config.enable_throughput_scaling is True
    assert config.scaling_enabled() is True
......@@ -12,4 +12,4 @@ pmdarima==2.1.1
prometheus-api-client==0.6.0
prophet==1.2.1
scikit-learn==1.7.2
scipy<1.14.0 # Upper bound for pmdarima compatibility
scipy>=1.14.0,<2.0
......@@ -25,8 +25,8 @@ The Dynamo **Planner** is an autoscaler purpose-built for these constraints. It
The Planner supports two scaling modes that can run independently or together:
- **Throughput-based scaling**: Uses pre-deployment profiling data and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
- **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No profiling data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
- **Throughput-based scaling**: Uses pre-deployment engine performance data (from self-benchmark or profiler) and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
- **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No pre-deployment data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor.
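The floor relationship described above can be pictured as taking the larger of the two recommendations. The sketch below is an illustration only. The function `combine_replicas` and its `max_replicas` cap are hypothetical and do not reflect the Planner's actual code or options.

```python
from typing import Optional


def combine_replicas(throughput_floor: int, load_based_target: int,
                     max_replicas: Optional[int] = None) -> int:
    """Pick a replica count that never drops below the throughput-based floor."""
    # Load-based scaling may raise the count above the floor, but not lower it.
    target = max(throughput_floor, load_based_target)
    # Hypothetical cap, for example a GPU budget for the deployment.
    if max_replicas is not None:
        target = min(target, max_replicas)
    return target


print(combine_replicas(throughput_floor=2, load_based_target=5))  # burst above the floor
print(combine_replicas(throughput_floor=3, load_based_target=1))  # floor holds during a lull
```

In the first call the burst wins and the result is 5. In the second the floor holds and the result is 3.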
......@@ -36,12 +36,12 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
|---------|:----------------:|:-------------------------:|
| **Deployment** | | |
| Disaggregated | Supported | Supported |
| Aggregated | Unsupported | Supported |
| Aggregated | Supported | Supported |
| **LLM Framework** | | |
| SGLang | Supported | Supported |
| TensorRT-LLM | Supported | Supported |
| vLLM | Supported | Supported |
| **Requires Profiling Data** | Yes | No |
| **Requires Pre-deployment Data** | Yes (self-benchmark or profiler) | No |
| **Load Predictors** | ARIMA, Prophet, Kalman, Constant | N/A |
| **Router** | | |
| Any (round-robin, random, etc.) | Supported | Not supported |
......@@ -52,8 +52,8 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
## When to Use Which Mode
- **Throughput-based scaling** should be enabled whenever engine profiling data is available (through pre-deployment profiling). It provides stable, prediction-based capacity planning.
- **Load-based scaling** should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring profiling data.
- **Throughput-based scaling** should be enabled whenever engine performance data is available (through self-benchmark or pre-deployment profiling). It provides stable, prediction-based capacity planning.
- **Load-based scaling** should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring pre-deployment data.
- **Both modes together**: For the best of both worlds, enable both. Throughput-based scaling provides a lower bound (long-term capacity), while load-based scaling handles bursts above that floor. When both are enabled, use a longer `--adjustment-interval` for throughput-based scaling.
## Quick Start
......@@ -63,7 +63,7 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
- Dynamo platform installed on Kubernetes ([Installation Guide](../../kubernetes/installation-guide.md))
- kube-prometheus-stack installed ([Metrics Setup](../../kubernetes/observability/metrics.md))
For throughput-based scaling, pre-deployment profiling is also required ([Profiling Guide](../profiler/profiler-guide.md)).
For throughput-based scaling, pre-deployment engine performance data is also required (via self-benchmark mode or [Profiling Guide](../profiler/profiler-guide.md)).
### Throughput-Based Scaling (with DGDR)
......@@ -141,13 +141,11 @@ Load-based scaling has the following known limitations. Throughput-based scaling
| `--adjustment-interval` | `180` | Seconds between throughput-based scaling decisions |
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
| `--no-correction` | `true` | Disable correction factors (auto-disabled when load-based scaling is on) |
| **Load-based scaling** | | |
| `--enable-loadbased-scaling` | `false` | Enable load-based scaling |
| `--disable-throughput-scaling` | `false` | Disable throughput-based scaling (required for `agg` mode) |
| `--loadbased-router-metrics-url` | auto-discovered | URL to router's `/metrics` endpoint |
| `--loadbased-adjustment-interval` | `5` | Seconds between load-based scaling decisions |
| `--loadbased-learning-window` | `50` | Sliding window size for regression model |
| `--loadbased-adjustment-interval` | `5` | Seconds between FPM regression updates and load-based scaling decisions |
| `--max-num-fpm-samples` | `64` | Maximum retained FPM observations for regression |
| `--fpm-sample-bucket-size` | `16` | Number of buckets for observation retirement (must be perfect square) |
| `--loadbased-scaling-down-sensitivity` | `80` | Scale-down sensitivity 0-100 (0=never, 100=aggressive) |
| `--loadbased-metric-samples` | `10` | Number of metric samples per adjustment interval |
| `--loadbased-min-observations` | `5` | Minimum observations before regression activates |
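To illustrate how the sample-retention options above might interact, here is a minimal sketch of a bounded online linear regression. It is not the Planner's implementation: it retires the oldest observations first (FIFO) rather than using the bucketed retirement the options describe, the choice of features (`x` as tokens per forward pass, `y` as latency in ms) is an assumption, and all names are hypothetical. The `is_perfect_square` helper mirrors the constraint on `--fpm-sample-bucket-size`.

```python
import math
from collections import deque
from typing import Optional, Tuple


def is_perfect_square(n: int) -> bool:
    """True for 1, 4, 9, 16, ... (the constraint on the bucket-size option)."""
    return n > 0 and math.isqrt(n) ** 2 == n


class BoundedLinearRegression:
    """Least-squares line over the most recent observations (FIFO retirement)."""

    def __init__(self, max_samples: int = 64, min_observations: int = 5):
        self.samples = deque(maxlen=max_samples)  # oldest sample drops out first
        self.min_observations = min_observations

    def add(self, x: float, y: float) -> None:
        self.samples.append((x, y))

    def fit(self) -> Optional[Tuple[float, float]]:
        """Return (slope, intercept), or None until enough data is available."""
        n = len(self.samples)
        if n < self.min_observations:
            return None
        mean_x = sum(x for x, _ in self.samples) / n
        mean_y = sum(y for _, y in self.samples) / n
        var_x = sum((x - mean_x) ** 2 for x, _ in self.samples)
        if var_x == 0:
            return None  # all x values identical, so the slope is undefined
        cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in self.samples)
        slope = cov_xy / var_x
        return slope, mean_y - slope * mean_x


model = BoundedLinearRegression()
for tokens in range(10):
    model.add(tokens, 2.0 * tokens + 1.0)  # synthetic data on the line y = 2x + 1
print(model.fit())
```

With the synthetic data the fit recovers a slope close to 2 and an intercept close to 1. Until five observations have been added, `fit()` returns `None`, which mirrors the role of a minimum-observations threshold.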
......@@ -175,7 +173,7 @@ The dashboard shows:
- Worker counts and GPU usage over time
- Observed TTFT, ITL, request rate, sequence lengths
- Predicted load and recommended replica counts
- Correction factors (actual vs. expected performance)
- FPM regression model status
### Prometheus Metrics
......