"tests/vscode:/vscode.git/clone" did not exist on "4fb8beefaa8b2c4bd2cd3b336b01ff006dc98bdc"
Commit 66f7832a authored by Hongkuan Zhou, committed by GitHub

feat(planner): unify throughput and load scaling on FPM regression (#7961)

parent 0b7a18ce
<!--
SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES.
All rights reserved.
SPDX-License-Identifier: Apache-2.0
-->
# SLA Planner Load Test
......@@ -19,52 +15,13 @@ You have two options to obtain the pre-deployment profiling data:
### Option A: Use Test Configuration (Quickstart)
Use the pre-configured test deployment with sample profiling data. We provide the results and the deployment configuration for the following model and hardware configurations:
- `nvidia/Llama-3.1-8B-Instruct-FP8` on H200 with max context length 16384, TP1 Prefill, and TP1 Decode. At ISL/OSL 3000/150, it achieves 40k tokens/s/gpu prefill with 80ms TTFT and 10k tokens/s/gpu decode with 10ms ITL. See `../tests/data/profiling_results/H200_TP1P_TP1D/`.
### Option B: Use Your Own Profiling Results
1. Run pre-deployment profiling for your specific setup. See the [pre-deployment profiling documentation](../../../../../../docs/components/profiler/profiler-guide.md) for detailed instructions.
## Interpolator Testing
SLA planner uses two interpolators to estimate the performance of prefill and decode. You can test the interpolators with the following command:
```bash
python components/src/dynamo/planner/core/throughput/interpolation.py \
--profile_results_dir <path_to_profile_results> \
--isl <ISL> \
--osl <OSL> \
--ttft <TTFT(ms)> \
--itl <ITL(ms)>
```
The script interpolates prefill and decode performance from the given ISL and OSL and the TTFT and ITL SLAs, then reports the request rate that would saturate one GPU.
For example, to test the interpolator for `nvidia/Llama-3.1-8B-Instruct-FP8` on H200 (target TTFT=200ms, ITL=10ms):
```bash
python components/src/dynamo/planner/core/throughput/interpolation.py \
--profile_results_dir components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/ \
--isl 3000 \
--osl 300 \
--ttft 200 \
--itl 10
# output:
ISL=3000, OSL=300
TTFT=200ms, ITL=10ms
Using profile results from components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/
Interpolating prefill performance ...
Estimated TTFT=60.00ms <= target TTFT=200.00ms. Requests can queue 140.00ms maximally while meeting TTFT SLA.
Estimated throughput: 49481.09 tokens/s/gpu. Request rate at 16.49 requests/s will saturate one GPU.
Interpolating decode performance ...
Average context length: isl + osl/2 = 3150.
Estimated ITL=9.70ms <= target ITL=10.00ms at 16.16% active kv usage.
Estimated throughput: 4555.68 token/s/gpu. Request rate at 15.19 requests/s will saturate one GPU.
```
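The request rates in the sample output are consistent with dividing each per-GPU token throughput by the tokens one request contributes (ISL for prefill, OSL for decode). The following sketch re-derives them from the printed numbers. The variable names are illustrative and are not taken from `interpolation.py`.

```python
# Re-derive the figures in the sample output above.
# All inputs are copied from the printed output; the names are illustrative only.
isl, osl = 3000, 300
ttft_target_ms, est_ttft_ms = 200.0, 60.0
prefill_thpt = 49481.09  # tokens/s/gpu, from the prefill section
decode_thpt = 4555.68    # tokens/s/gpu, from the decode section

# Time a request may wait in the queue while still meeting the TTFT SLA.
queue_headroom_ms = ttft_target_ms - est_ttft_ms

# Each request costs ISL prefill tokens and OSL decode tokens.
prefill_rps = prefill_thpt / isl
decode_rps = decode_thpt / osl

# Average context length seen by decode over the life of a request.
avg_context_len = isl + osl / 2

print(f"queue headroom: {queue_headroom_ms:.2f} ms")    # 140.00 ms
print(f"prefill saturates at {prefill_rps:.2f} req/s")  # 16.49 req/s
print(f"decode saturates at {decode_rps:.2f} req/s")    # 15.19 req/s
print(f"average context length: {avg_context_len:.0f}")  # 3150
```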
## Generating Load Dataset
We provide a tool to generate a load dataset with a varying request rate. More details can be found in [sin_load_generator](../../../../../../benchmarks/sin_load_generator/README.md).
......@@ -89,36 +46,6 @@ python benchmarks/sin_load_generator/sin_synth.py \
The dataset starts at 5 requests/s, increases to 45 requests/s at t=300s, decreases back to 5 requests/s at t=600s, and repeats.
The total duration is 30 minutes or 1800 seconds.
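The rate profile described above (5 requests/s at t=0, 45 requests/s at t=300s, back to 5 requests/s at t=600s) can be sketched as a shifted cosine with a 600-second period. This is a minimal sketch under that assumption. `request_rate` is a hypothetical helper, and the exact waveform and phase used by `sin_synth.py` may differ.

```python
import math


def request_rate(t_s, rate_min=5.0, rate_max=45.0, period_s=600.0):
    """Illustrative request rate (requests/s) at time t_s (seconds)."""
    mid = (rate_min + rate_max) / 2  # 25 requests/s
    amp = (rate_max - rate_min) / 2  # 20 requests/s
    # Negative cosine so the profile starts at its minimum.
    return mid - amp * math.cos(2 * math.pi * t_s / period_s)


# Sample the profile at the quarter points of the first period.
for t in (0, 150, 300, 450, 600):
    print(t, round(request_rate(t), 2))
```

Over the 1800-second dataset this profile completes three full periods.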
## Planner Dry Run
Before you test SLA planner on a real deployment, you can use the dry run feature to check its autoscaling behavior on a given dataset. In dry run mode:
- The load predictor is exercised. However, the load metrics differ from those of a real deployment because the actual OSL is known only after the requests are processed.
- There are no SLA predictions. Instead, SLA planner shows the safe throughput limit below which requests can be processed within the SLA.
- The correction factor is disabled because there are no SLA metrics to use as a reference.
To dry run SLA planner:
```bash
python components/src/dynamo/planner/tests/manual/unit/planner_sla_dryrun.py \
--config '{"environment":"kubernetes","backend":"vllm","ttft":200,"itl":10,"profile_results_dir":"components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D","throughput_adjustment_interval":60,"no_correction":true}' \
--dataset rr-5-45_i3000o300.jsonl \
--start-num-p 1 \
--start-num-d 1 \
--output-plot dryrun_plot.png
```
Below is the dryrun result:
![Dryrun Plot](./figures/dryrun_plot.png)
The first plot shows the actual request rate and the predicted request rate (in requests per adjustment interval).
The second plot shows the actual ISL/OSL and the predicted ISL/OSL. The first two plots are useful when tuning the performance of the load predictor.
The third plot shows the actual prefill throughput, the number of prefill workers that the planner scales to, and the safe throughput limit for that number of prefill workers. If the actual throughput is below the safe throughput limit, the deployment has the capacity to adhere to the TTFT SLA. Note that in a real deployment, other factors such as queueing, load balancing, KV cache transfer latency, and ISL variance mean there is no guarantee that the deployment adheres to the TTFT SLA.
The fourth plot, similar to the third, shows the actual decode throughput, the number of decode workers that the planner scales to, and the safe throughput limit for that number of decode workers. If the actual throughput is below the safe throughput limit, the deployment has the capacity to adhere to the ITL SLA. Note that in a real deployment, other factors such as load balancing and OSL variance mean there is no guarantee that the deployment adheres to the ITL SLA.
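The capacity check that the third and fourth plots illustrate can be sketched as follows. This is a minimal sketch, assuming the safe throughput limit scales linearly with the number of workers. `has_capacity` and the sample numbers are hypothetical and do not come from the planner code.

```python
def has_capacity(actual_thpt, num_workers, safe_thpt_per_worker):
    """Return (ok, safe_limit) for one adjustment interval.

    Assumption: the safe limit is the per-worker safe throughput
    multiplied by the number of workers the planner has scaled to.
    """
    safe_limit = num_workers * safe_thpt_per_worker
    return actual_thpt < safe_limit, safe_limit


# Illustrative numbers only: two workers at 40k tokens/s each.
ok, limit = has_capacity(actual_thpt=70_000, num_workers=2, safe_thpt_per_worker=40_000)
print(ok, limit)
```

As the text notes, passing this check only indicates available capacity. It does not account for queueing, load balancing, or request-length variance.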
## Scaling Tests
This directory contains comprehensive tests for validating the SLA planner's scaling behavior. The tests validate both the replica calculation logic and end-to-end scaling behavior. The scaling test uses a graduated load approach rather than dataset files, as it proved more reliable for metric generation and scaling triggers.
......@@ -132,6 +59,7 @@ This directory contains comprehensive tests for validating the SLA planner's sca
### Quick Start for Unit Tests and End-to-End Tests
#### Run Unit Tests Only
Test the replica calculation logic without requiring Kubernetes:
```bash
......@@ -175,6 +103,7 @@ components/src/dynamo/planner/tests/manual/scaling/run_scaling_test.sh --namespa
### Instructions for End-to-End Perf Tests
In this test, we compare performance (goodput and goodput/GPU) across the following four deployments, using the aforementioned 8B FP8 model on H200 and the dataset used in the dry run:
- Config 1 with inefficient P/D ratio: 3xTP1P_1xTP1D_4GPU
`./perf_test_configs/disagg_8b_3p1d.yaml`
- Config 2 with best static deployment: 2xTP1P_2xTP1D_4GPU
......@@ -214,12 +143,13 @@ aiperf profile \
#### E2E Perf Test Results
![Results](./figures/sla_planner_perf.png)
The table below shows the performance improvement of SLA planner across different deployment configurations:
| Baseline | Goodput Improvement | Goodput/GPU Improvement |
|---------------|-----------------|-------------------------|
| Inefficient P/D ratio | 725% | 600% |
| Inefficient parallelization mapping | 311% | 249% |
| Best static deployment | 52% | 29% |
| Baseline | Goodput Improvement | Goodput/GPU Improvement |
| ----------------------------------- | ------------------- | ----------------------- |
| Inefficient P/D ratio | 725% | 600% |
| Inefficient parallelization mapping | 311% | 249% |
| Best static deployment | 52% | 29% |
......@@ -82,7 +82,7 @@ spec:
- dynamo.planner
args:
- --config
- '{"environment": "kubernetes", "backend": "vllm", "ttft": 200, "itl": 10, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/", "throughput_adjustment_interval": 60, "metric_reporting_prometheus_port": 9085, "no_correction": true}'
- '{"environment": "kubernetes", "backend": "vllm", "ttft": 200, "itl": 10, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D/", "throughput_adjustment_interval": 60, "metric_reporting_prometheus_port": 9085}'
VllmDecodeWorker:
envFromSecret: hf-token-secret
componentType: worker
......
......@@ -25,7 +25,7 @@ spec:
- dynamo.planner
args:
- --config
- '{"environment": "kubernetes", "backend": "vllm", "throughput_adjustment_interval": 60, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D", "no_correction": true}'
- '{"environment": "kubernetes", "backend": "vllm", "throughput_adjustment_interval": 60, "profile_results_dir": "/workspace/components/src/dynamo/planner/tests/data/profiling_results/H200_TP1P_TP1D"}'
VllmDecodeWorker:
envFromSecret: hf-token-secret
componentType: worker
......
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0

import argparse
import logging

from dynamo.planner.config.planner_config import PlannerConfig
from dynamo.planner.offline.dryrun import run_sla_planner_dryrun

logger = logging.getLogger(__name__)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Planner Dryrun")
    parser.add_argument(
        "--config",
        required=True,
        help="JSON string or path to a JSON/YAML config file",
    )
    parser.add_argument(
        "--dataset", type=str, required=True, help="Path to the jsonl dataset file"
    )
    parser.add_argument(
        "--start-num-p",
        type=int,
        default=1,
        help="Number of prefill workers to start with",
    )
    parser.add_argument(
        "--start-num-d",
        type=int,
        default=1,
        help="Number of decode workers to start with",
    )
    parser.add_argument(
        "--output-plot",
        type=str,
        default="dryrun_plot.png",
        help="Path to the output plot file",
    )
    args = parser.parse_args()

    config = PlannerConfig.from_config_arg(args.config)
    run_sla_planner_dryrun(
        config,
        dataset=args.dataset,
        start_num_p=args.start_num_p,
        start_num_d=args.start_num_d,
        output_plot=args.output_plot,
    )
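The `--config` help text says the argument accepts either an inline JSON string or a path to a config file. One way such an argument could be resolved is sketched below. `load_config_arg` is a hypothetical stand-in, not the implementation of `PlannerConfig.from_config_arg`, and it handles only JSON (YAML files are left out to keep the sketch dependency-free).

```python
import json
from pathlib import Path


def load_config_arg(value: str) -> dict:
    """Hypothetical resolver: inline JSON object or path to a JSON file."""
    text = value.strip()
    if text.startswith("{"):
        # Looks like an inline JSON object, e.g. '{"ttft": 200, "itl": 10}'.
        return json.loads(text)
    # Otherwise treat the value as a path to a JSON file.
    return json.loads(Path(text).read_text())


cfg = load_config_arg('{"backend": "vllm", "ttft": 200, "itl": 10}')
print(cfg["backend"], cfg["ttft"], cfg["itl"])
```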
......@@ -19,7 +19,7 @@ from dynamo.common.forward_pass_metrics import (
)
from dynamo.planner.config.planner_config import PlannerConfig
from dynamo.planner.core.decode import DecodePlanner
from dynamo.planner.core.load.fpm_regression import (
from dynamo.planner.core.perf_model import (
AggRegressionModel,
DecodeRegressionModel,
PrefillRegressionModel,
......@@ -70,19 +70,24 @@ def _make_fpm(
class TestPrefillRegressionModel:
def test_insufficient_data(self):
model = PrefillRegressionModel(window_size=50, min_observations=5)
model = PrefillRegressionModel(
max_num_fpm_samples=50, min_observations=5, bucket_count=16
)
assert not model.has_sufficient_data()
assert model.estimate_next_ttft(0, 2048) is None
def test_heartbeat_skipped(self):
model = PrefillRegressionModel(window_size=50, min_observations=3)
model = PrefillRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
fpm = _make_fpm(wall_time=0.0, sum_prefill_tokens=100, num_prefill_requests=1)
model.add_observation(fpm)
assert model.num_observations == 0
def test_basic_regression_and_ttft_estimate(self):
model = PrefillRegressionModel(window_size=50, min_observations=3)
# wall_time = 0.001 * prefill_tokens + 0.002 (linear relationship)
model = PrefillRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
for tokens in [500, 1000, 1500, 2000, 2500]:
fpm = _make_fpm(
sum_prefill_tokens=tokens,
......@@ -93,9 +98,6 @@ class TestPrefillRegressionModel:
assert model.has_sufficient_data()
# Single iteration: queued=0, avg_isl should be mean of [500..2500]=1500
# total_tokens = 0 + avg_isl ≈ 1500
# 1 iteration at max_num_batched_tokens=2048 (1500 < 2048)
est = model.estimate_next_ttft(
queued_prefill_tokens=0, max_num_batched_tokens=2048
)
......@@ -103,8 +105,9 @@ class TestPrefillRegressionModel:
assert est > 0
def test_chunked_ttft_simulation(self):
model = PrefillRegressionModel(window_size=50, min_observations=3)
# Simple: wall_time = 0.001 * prefill_tokens (slope=0.001, intercept≈0)
model = PrefillRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
for tokens in [100, 200, 300, 400, 500]:
fpm = _make_fpm(
sum_prefill_tokens=tokens,
......@@ -113,11 +116,6 @@ class TestPrefillRegressionModel:
)
model.add_observation(fpm)
# avg_isl = mean([100,200,300,400,500]) = 300
# total_tokens = 5000 (queued) + 300 (next ISL) = 5300
# max_num_batched_tokens = 2048
# iterations: ceil(5300/2048) = 3
# chunk1=2048, chunk2=2048, chunk3=1204
est = model.estimate_next_ttft(
queued_prefill_tokens=5000, max_num_batched_tokens=2048
)
......@@ -125,7 +123,9 @@ class TestPrefillRegressionModel:
assert est > 0.003 # at least 3 iterations worth
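The comments in the chunked TTFT test describe how queued tokens plus the next request's ISL are split into iterations of at most `max_num_batched_tokens`. The sketch below reproduces that arithmetic and sums a per-chunk linear cost. `chunk_sizes` and `estimate_ttft` are hypothetical helpers, and the linear cost model is an assumption based on the test comments rather than the actual `estimate_next_ttft` implementation.

```python
import math


def chunk_sizes(queued_tokens, next_isl, max_num_batched_tokens):
    """Split the total prefill work into per-iteration chunks."""
    total = queued_tokens + next_isl
    if total <= 0:
        return []
    n_iter = math.ceil(total / max_num_batched_tokens)
    # All chunks are full except possibly the last one.
    chunks = [max_num_batched_tokens] * (n_iter - 1)
    chunks.append(total - max_num_batched_tokens * (n_iter - 1))
    return chunks


def estimate_ttft(chunks, slope, intercept=0.0):
    """Sum a linear per-iteration cost: slope * tokens + intercept."""
    return sum(slope * c + intercept for c in chunks)


# Values from the test comments: 5000 queued tokens, avg_isl of 300.
chunks = chunk_sizes(queued_tokens=5000, next_isl=300, max_num_batched_tokens=2048)
print(chunks)  # [2048, 2048, 1204]
print(estimate_ttft(chunks, slope=0.001))
```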
def test_avg_isl_tracking(self):
model = PrefillRegressionModel(window_size=50, min_observations=3)
model = PrefillRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
for isl in [1000, 2000, 3000]:
fpm = _make_fpm(
sum_prefill_tokens=isl, num_prefill_requests=1, wall_time=0.01
......@@ -133,39 +133,219 @@ class TestPrefillRegressionModel:
model.add_observation(fpm)
assert abs(model.avg_isl - 2000.0) < 1.0
def test_sliding_window_eviction(self):
model = PrefillRegressionModel(window_size=5, min_observations=3)
for i in range(10):
fpm = _make_fpm(sum_prefill_tokens=100 * (i + 1), wall_time=0.01)
def test_find_best_engine_prefill_rps(self):
model = PrefillRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
for tokens in [500, 1000, 1500, 2000, 2500]:
fpm = _make_fpm(
sum_prefill_tokens=tokens,
num_prefill_requests=1,
wall_time=0.001 * tokens + 0.002,
)
model.add_observation(fpm)
rps, actual_ttft_ms = model.find_best_engine_prefill_rps(
ttft_sla=2000.0, isl=1000.0
)
assert rps > 0
# wall_time ~1.002s for 1000 tokens -> rps ~ 1/1.002 ~ 0.998
assert 0.5 < rps < 2.0
assert actual_ttft_ms > 0
assert 1000 < actual_ttft_ms < 2000
def test_find_best_engine_prefill_rps_zero_isl(self):
model = PrefillRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
for tokens in [500, 1000, 1500]:
fpm = _make_fpm(
sum_prefill_tokens=tokens,
num_prefill_requests=1,
wall_time=0.001 * tokens,
)
model.add_observation(fpm)
rps, _ = model.find_best_engine_prefill_rps(ttft_sla=1000.0, isl=0.0)
assert rps == 0.0
def test_load_benchmark_fpms(self):
model = PrefillRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
fpms = [
_make_fpm(sum_prefill_tokens=t, num_prefill_requests=1, wall_time=0.001 * t)
for t in [500, 1000, 1500, 2000, 2500]
]
model.load_benchmark_fpms(fpms)
assert model.num_observations == 5
assert model.has_sufficient_data()
est = model.estimate_next_ttft(
queued_prefill_tokens=0, max_num_batched_tokens=2048
)
assert est is not None
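The comment in `test_find_best_engine_prefill_rps` reasons that a wall time of about 1.002 s for 1000 tokens gives roughly 1/1.002 requests/s. The sketch below spells out that reasoning for a linear model. `best_prefill_rps` is a hypothetical function. Treating the engine as serving one request at a time and returning 0 when the SLA is exceeded are both assumptions, not confirmed behavior of `find_best_engine_prefill_rps`.

```python
def best_prefill_rps(slope_s_per_token, intercept_s, isl, ttft_sla_ms):
    """Return (requests/s, estimated TTFT in ms) under a linear wall-time model."""
    if isl <= 0:
        # Mirrors the zero-ISL test, which expects an rps of 0.0.
        return 0.0, 0.0
    ttft_s = slope_s_per_token * isl + intercept_s
    ttft_ms = ttft_s * 1000.0
    if ttft_ms > ttft_sla_ms:
        # Assumption: an engine that cannot meet the SLA contributes no capacity.
        return 0.0, ttft_ms
    # Assumption: one request per iteration, so rate is the inverse of wall time.
    return 1.0 / ttft_s, ttft_ms


# Same linear relationship as the test data: wall_time = 0.001 * tokens + 0.002.
rps, ttft_ms = best_prefill_rps(0.001, 0.002, isl=1000.0, ttft_sla_ms=2000.0)
print(round(rps, 3), round(ttft_ms, 1))  # 0.998 1002.0
```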
# ── Bucketed retirement tests ─────────────────────────────────────────
class TestBucketedRetirement:
def test_total_capped_at_max(self):
"""Total observations never exceed max_num_fpm_samples."""
model = PrefillRegressionModel(
max_num_fpm_samples=10, min_observations=3, bucket_count=4
)
for i in range(20):
fpm = _make_fpm(
sum_prefill_tokens=100 * (i + 1),
num_prefill_requests=1,
wall_time=0.01 * (i + 1),
)
model.add_observation(fpm)
assert model.num_observations == 10
def test_most_populated_bucket_loses_oldest(self):
"""When evicting, the oldest entry from the most-populated bucket is removed."""
model = PrefillRegressionModel(
max_num_fpm_samples=6, min_observations=1, bucket_count=4
)
# 3 observations at low tokens (bucket 0 area)
for i in range(3):
fpm = _make_fpm(
sum_prefill_tokens=10 + i,
num_prefill_requests=1,
wall_time=0.001 * (10 + i),
)
model.add_observation(fpm)
# 3 observations at high tokens (different bucket)
for i in range(3):
fpm = _make_fpm(
sum_prefill_tokens=1000 + i * 100,
num_prefill_requests=1,
wall_time=0.001 * (1000 + i * 100),
)
model.add_observation(fpm)
assert model.num_observations == 6
# One more at low tokens; total would exceed 6 so most-populated
# bucket loses its oldest entry.
fpm = _make_fpm(
sum_prefill_tokens=15,
num_prefill_requests=1,
wall_time=0.015,
)
model.add_observation(fpm)
assert model.num_observations == 6
def test_uniform_distribution_preserved(self):
"""Bucketed eviction keeps observations across operating points."""
model = DecodeRegressionModel(
max_num_fpm_samples=10, min_observations=3, bucket_count=16
)
# Many observations at a single operating point
for _ in range(15):
fpm = _make_fpm(
num_decode_requests=32,
sum_decode_kv_tokens=32000,
wall_time=0.01,
)
model.add_observation(fpm)
assert model.num_observations == 10
# Add a different operating point; the concentrated bucket loses one
fpm = _make_fpm(
num_decode_requests=4,
sum_decode_kv_tokens=4000,
wall_time=0.005,
)
model.add_observation(fpm)
assert model.num_observations == 10
def test_2d_bucketed_retirement(self):
"""2D models retire from the most-populated grid cell."""
model = AggRegressionModel(
max_num_fpm_samples=8, min_observations=1, bucket_count=16
)
# Fill with varied data
for p, d in [(100, 500), (200, 1000), (300, 1500), (400, 2000)]:
fpm = _make_fpm(
sum_prefill_tokens=p,
num_prefill_requests=1,
sum_decode_kv_tokens=d,
num_decode_requests=5,
wall_time=0.001 * p + 0.0001 * d,
)
model.add_observation(fpm)
# Concentrate 4 more in one region
for _ in range(4):
fpm = _make_fpm(
sum_prefill_tokens=100,
num_prefill_requests=1,
sum_decode_kv_tokens=500,
num_decode_requests=5,
wall_time=0.15,
)
model.add_observation(fpm)
assert model.num_observations == 8
# Overflow triggers retirement from the concentrated cell
fpm = _make_fpm(
sum_prefill_tokens=350,
num_prefill_requests=1,
sum_decode_kv_tokens=1800,
num_decode_requests=5,
wall_time=0.5,
)
model.add_observation(fpm)
assert model.num_observations == 8
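The retirement policy these tests describe (cap the total sample count, and on overflow drop the oldest entry from the most-populated bucket) can be sketched in one dimension as follows. `BucketedSamples` is a hypothetical class. Assigning buckets from a fixed `max_value` range is an assumption made for illustration, and the real regression models may assign buckets differently.

```python
from collections import deque


class BucketedSamples:
    """Illustrative 1D store with bucketed retirement."""

    def __init__(self, max_samples, bucket_count, max_value):
        self.max_samples = max_samples
        self.bucket_count = bucket_count
        self.max_value = max_value  # assumed fixed upper bound of x
        self.buckets = [deque() for _ in range(bucket_count)]

    def _bucket_index(self, x):
        idx = int(x / self.max_value * self.bucket_count)
        return min(max(idx, 0), self.bucket_count - 1)

    def __len__(self):
        return sum(len(b) for b in self.buckets)

    def add(self, x, y):
        self.buckets[self._bucket_index(x)].append((x, y))
        if len(self) > self.max_samples:
            # Retire the oldest sample from the most-populated bucket.
            fullest = max(self.buckets, key=len)
            fullest.popleft()


# Mirror the shape of test_most_populated_bucket_loses_oldest.
store = BucketedSamples(max_samples=6, bucket_count=4, max_value=2000)
for tokens in (10, 11, 12, 1000, 1100, 1200, 15):
    store.add(tokens, 0.001 * tokens)
print(len(store), [x for x, _ in store.buckets[0]])  # 6 [11, 12, 15]
```

Compared with a plain sliding window, this keeps samples spread across operating points even when recent traffic concentrates in one region.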
# ── DecodeRegressionModel tests ──────────────────────────────────────
class TestDecodeRegressionModel:
def _train_2d(self, model: DecodeRegressionModel) -> None:
"""Populate with 2D data: wall_time = f(num_decode_requests, sum_decode_kv_tokens)."""
for n_req, kv in [
(5, 1000),
(10, 2000),
(15, 3000),
(20, 4000),
(25, 5000),
]:
fpm = _make_fpm(
sum_decode_kv_tokens=kv,
num_decode_requests=n_req,
wall_time=0.0001 * kv + 0.0005 * n_req + 0.001,
)
model.add_observation(fpm)
def test_insufficient_data(self):
model = DecodeRegressionModel(window_size=50, min_observations=5)
model = DecodeRegressionModel(
max_num_fpm_samples=50, min_observations=5, bucket_count=16
)
assert not model.has_sufficient_data()
assert model.estimate_next_itl(0, 0) is None
def test_heartbeat_skipped(self):
model = DecodeRegressionModel(window_size=50, min_observations=3)
model = DecodeRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
fpm = _make_fpm(wall_time=0.0, sum_decode_kv_tokens=100, num_decode_requests=1)
model.add_observation(fpm)
assert model.num_observations == 0
def test_basic_itl_estimate(self):
model = DecodeRegressionModel(window_size=50, min_observations=3)
# wall_time = 0.0001 * decode_kv + 0.001
for kv in [1000, 2000, 3000, 4000, 5000]:
fpm = _make_fpm(
sum_decode_kv_tokens=kv,
num_decode_requests=10,
wall_time=0.0001 * kv + 0.001,
)
model.add_observation(fpm)
model = DecodeRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
self._train_2d(model)
assert model.has_sufficient_data()
est = model.estimate_next_itl(scheduled_decode_kv=3000, queued_decode_kv=0)
......@@ -173,7 +353,9 @@ class TestDecodeRegressionModel:
assert est > 0
def test_avg_decode_length_tracking(self):
model = DecodeRegressionModel(window_size=50, min_observations=3)
model = DecodeRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
for total_kv, num_req in [(1000, 10), (2000, 10), (3000, 10)]:
fpm = _make_fpm(
sum_decode_kv_tokens=total_kv,
......@@ -183,35 +365,99 @@ class TestDecodeRegressionModel:
model.add_observation(fpm)
assert abs(model.avg_decode_length - 200.0) < 1.0
def _train_thpt_model(self, model: DecodeRegressionModel) -> None:
"""Populate with 2D data at decode-realistic wall-time scale."""
for n_req, kv in [
(5, 5000),
(10, 10000),
(20, 20000),
(30, 30000),
(40, 40000),
]:
fpm = _make_fpm(
sum_decode_kv_tokens=kv,
num_decode_requests=n_req,
wall_time=0.00001 * kv + 0.001,
)
model.add_observation(fpm)
def test_find_best_engine_decode_rps(self):
model = DecodeRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
self._train_thpt_model(model)
rps, actual_itl = model.find_best_engine_decode_rps(
itl=50.0, context_length=1000.0, osl=150.0
)
assert rps > 0
assert actual_itl > 0
assert actual_itl <= 50.0
def test_find_best_engine_decode_rps_zero_context(self):
model = DecodeRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
self._train_2d(model)
rps, itl_ms = model.find_best_engine_decode_rps(
itl=50.0, context_length=0.0, osl=150.0
)
assert rps == 0.0
assert itl_ms == 0.0
def test_load_benchmark_fpms(self):
model = DecodeRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
fpms = [
_make_fpm(
num_decode_requests=n,
sum_decode_kv_tokens=n * 1000,
wall_time=0.001 * n,
)
for n in [5, 10, 15, 20, 25]
]
model.load_benchmark_fpms(fpms)
assert model.num_observations == 5
assert model.has_sufficient_data()
# ── AggRegressionModel tests ─────────────────────────────────────────
class TestAggRegressionModel:
def _train_agg(self, model: AggRegressionModel) -> None:
for p, d in [(100, 1000), (200, 2000), (300, 3000), (400, 4000), (500, 5000)]:
fpm = _make_fpm(
sum_prefill_tokens=p,
num_prefill_requests=1,
sum_decode_kv_tokens=d,
num_decode_requests=10,
wall_time=0.001 * p + 0.0001 * d + 0.001,
)
model.add_observation(fpm)
def test_insufficient_data(self):
model = AggRegressionModel(window_size=50, min_observations=5)
model = AggRegressionModel(
max_num_fpm_samples=50, min_observations=5, bucket_count=16
)
assert not model.has_sufficient_data()
assert model.estimate_next_ttft(0, 2048, 0) is None
assert model.estimate_next_itl(0, 0) is None
def test_heartbeat_skipped(self):
model = AggRegressionModel(window_size=50, min_observations=3)
model = AggRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
fpm = _make_fpm(wall_time=0.0, sum_prefill_tokens=100, sum_decode_kv_tokens=200)
model.add_observation(fpm)
assert model.num_observations == 0
def test_2d_regression(self):
model = AggRegressionModel(window_size=50, min_observations=3)
# wall_time = 0.001 * prefill + 0.0001 * decode_kv + 0.001
for p, d in [(100, 1000), (200, 2000), (300, 3000), (400, 4000), (500, 5000)]:
fpm = _make_fpm(
sum_prefill_tokens=p,
num_prefill_requests=1,
sum_decode_kv_tokens=d,
num_decode_requests=10,
wall_time=0.001 * p + 0.0001 * d + 0.001,
)
model.add_observation(fpm)
model = AggRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
self._train_agg(model)
assert model.has_sufficient_data()
......@@ -227,6 +473,37 @@ class TestAggRegressionModel:
assert itl is not None
assert itl > 0
def test_find_best_engine_agg_rps(self):
model = AggRegressionModel(
max_num_fpm_samples=50, min_observations=3, bucket_count=16
)
self._train_agg(model)
thpt, actual_ttft, actual_itl = model.find_best_engine_agg_rps(
isl=2048.0,
osl=150.0,
max_num_batched_tokens=4096,
ttft_sla=500.0,
itl_sla=50.0,
)
assert isinstance(thpt, float)
assert thpt > 0
assert actual_ttft >= 0
assert actual_itl >= 0
def test_find_best_engine_agg_rps_insufficient_data(self):
model = AggRegressionModel(
max_num_fpm_samples=50, min_observations=5, bucket_count=16
)
thpt, _, _ = model.find_best_engine_agg_rps(
isl=2048.0,
osl=150.0,
max_num_batched_tokens=4096,
ttft_sla=500.0,
itl_sla=50.0,
)
assert thpt == 0.0
# ── Planner integration tests (with mocked FPM subscriber) ──────────
......@@ -249,7 +526,6 @@ def _build_load_config(**overrides) -> PlannerConfig:
itl=50.0,
backend="vllm",
no_operation=True,
no_correction=True,
metric_pulling_prometheus_endpoint="http://localhost:9090",
metric_reporting_prometheus_port=0,
load_predictor="constant",
......@@ -266,7 +542,8 @@ def _build_load_config(**overrides) -> PlannerConfig:
enable_load_scaling=True,
enable_throughput_scaling=True,
load_adjustment_interval=5,
load_learning_window=50,
max_num_fpm_samples=50,
fpm_sample_bucket_size=16,
load_scaling_down_sensitivity=80,
load_metric_samples=10,
load_min_observations=5,
......@@ -294,7 +571,6 @@ class TestPrefillFpmScaling:
planner.model_name = "test-model"
planner.prefill_worker_info = WorkerInfo(max_num_batched_tokens=2048)
# Train regression: wall_time grows linearly with prefill tokens
for tokens in range(200, 1200, 100):
fpm = _make_fpm(
sum_prefill_tokens=tokens,
......@@ -303,7 +579,6 @@ class TestPrefillFpmScaling:
)
planner.ttft_regression.add_observation(fpm)
# Both engines have heavy queued prefill -> high estimated TTFT
stats = {
("w1", 0): _make_fpm(
worker_id="w1",
......@@ -335,8 +610,6 @@ class TestPrefillFpmScaling:
planner.model_name = "test-model"
planner.prefill_worker_info = WorkerInfo(max_num_batched_tokens=2048)
# Train with short ISL (100 tokens each) so avg_isl stays low.
# Regression: wall_time ≈ 0.001 * prefill_tokens
for tokens in range(100, 600, 50):
fpm = _make_fpm(
sum_prefill_tokens=tokens,
......@@ -345,9 +618,6 @@ class TestPrefillFpmScaling:
)
planner.ttft_regression.add_observation(fpm)
# All engines idle (no queued prefill).
# estimate_next_ttft: total = 0 + avg_isl(~100) = ~100 tokens
# predicted wall_time ≈ 0.001 * 100 = 0.1s = 100ms < 500ms SLA
stats = {
(f"w{i}", 0): _make_fpm(
worker_id=f"w{i}",
......@@ -372,7 +642,6 @@ class TestPrefillFpmScaling:
planner.model_name = "test-model"
planner.prefill_worker_info = WorkerInfo(max_num_batched_tokens=2048)
# Only 2 observations, need 5
for tokens in [100, 200]:
fpm = _make_fpm(sum_prefill_tokens=tokens, wall_time=0.01)
planner.ttft_regression.add_observation(fpm)
......@@ -394,11 +663,18 @@ class TestDecodeFpmScaling:
planner = DecodePlanner(None, config, shared_state=shared_state)
planner.model_name = "test-model"
for kv in range(1000, 6000, 500):
# 2D regression: vary both num_decode_requests and sum_decode_kv_tokens
for n_req, kv in [
(5, 1000),
(10, 2000),
(15, 3000),
(20, 4000),
(25, 5000),
]:
fpm = _make_fpm(
sum_decode_kv_tokens=kv,
num_decode_requests=10,
wall_time=0.0001 * kv + 0.001,
num_decode_requests=n_req,
wall_time=0.0001 * kv + 0.0005 * n_req + 0.001,
)
planner.itl_regression.add_observation(fpm)
......@@ -431,40 +707,17 @@ class TestDecodeFpmScaling:
planner = DecodePlanner(None, config, shared_state=shared_state)
planner.model_name = "test-model"
fpm = _make_fpm(sum_decode_kv_tokens=1000, wall_time=0.01)
fpm = _make_fpm(
sum_decode_kv_tokens=1000, num_decode_requests=5, wall_time=0.01
)
planner.itl_regression.add_observation(fpm)
stats = {("w1", 0): _make_fpm(sum_decode_kv_tokens=5000, wall_time=0.5)}
stats = {
("w1", 0): _make_fpm(
sum_decode_kv_tokens=5000, num_decode_requests=10, wall_time=0.5
)
}
planner.fpm_subscriber = _mock_fpm_subscriber(stats)
result = planner.load_plan_adjustment()
assert result is None
# ── Correction factor auto-disable tests ─────────────────────────────
class TestCorrectionFactorAutoDisable:
def test_correction_factor_disabled_when_load_enabled(self):
config = PlannerConfig(
enable_load_scaling=True,
enable_throughput_scaling=True,
no_correction=False,
)
assert config.no_correction is True
def test_correction_factor_stays_disabled_if_already_set(self):
config = PlannerConfig(
enable_load_scaling=True,
enable_throughput_scaling=True,
no_correction=True,
)
assert config.no_correction is True
def test_correction_factor_not_disabled_without_loadbased(self):
config = PlannerConfig(
enable_load_scaling=False,
enable_throughput_scaling=True,
no_correction=False,
)
assert config.no_correction is False
......@@ -87,3 +87,36 @@ def test_throughput_metrics_source_invalid():
"""throughput_metrics_source rejects invalid values."""
with pytest.raises(ValidationError):
PlannerConfig(namespace="test-ns", throughput_metrics_source="invalid")
@pytest.mark.parametrize("bucket_size", [1, 4, 9, 16, 25])
def test_fpm_sample_bucket_size_accepts_perfect_squares(bucket_size):
"""fpm_sample_bucket_size must be a perfect square (valid values)."""
config = PlannerConfig(namespace="test-ns", fpm_sample_bucket_size=bucket_size)
assert config.fpm_sample_bucket_size == bucket_size
@pytest.mark.parametrize("bucket_size", [2, 3, 5, 7, 10])
def test_fpm_sample_bucket_size_rejects_non_squares(bucket_size):
"""fpm_sample_bucket_size rejects values that are not perfect squares."""
with pytest.raises(ValidationError, match="perfect square"):
PlannerConfig(namespace="test-ns", fpm_sample_bucket_size=bucket_size)
def test_max_num_fpm_samples_field():
"""max_num_fpm_samples configures the FPM sample retention (formerly load_learning_window)."""
config = PlannerConfig(namespace="test-ns", max_num_fpm_samples=100)
assert config.max_num_fpm_samples == 100
def test_agg_mode_supports_throughput_scaling():
"""Agg mode supports throughput-based scaling."""
config = PlannerConfig(
namespace="test-ns",
mode="agg",
enable_throughput_scaling=True,
enable_load_scaling=False,
)
assert config.mode == "agg"
assert config.enable_throughput_scaling is True
assert config.scaling_enabled() is True
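The perfect-square constraint on `fpm_sample_bucket_size` that the parametrized tests exercise can be checked with integer arithmetic alone. The helper below is a hypothetical sketch, not the validator in `PlannerConfig`, whose implementation is not shown here.

```python
import math


def is_perfect_square(n: int) -> bool:
    """True when n is a positive integer of the form k * k."""
    if n < 1:
        return False
    root = math.isqrt(n)  # exact integer square root, no float rounding
    return root * root == n


accepted = [n for n in (1, 4, 9, 16, 25) if is_perfect_square(n)]
rejected = [n for n in (2, 3, 5, 7, 10) if not is_perfect_square(n)]
print(accepted)  # [1, 4, 9, 16, 25]
print(rejected)  # [2, 3, 5, 7, 10]
```

A plausible reason for the constraint, inferred from the 2D retirement test rather than stated in the source, is that a square bucket count divides evenly into a grid with the same number of cells along each axis.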
......@@ -5,7 +5,7 @@
Unit tests for SLA planner replica calculation logic.
These tests focus specifically on the replica calculation formulas without
testing load prediction, interpolation, or correction factors.
testing load prediction or regression internals.
"""
import asyncio
......@@ -42,9 +42,9 @@ class PlannerHarness:
if not self.shared_state.last_metrics.is_valid():
return
p_endpoints, d_endpoints = await self.prefill_planner.get_workers_info()
self.shared_state.p_endpoints = p_endpoints
self.shared_state.d_endpoints = d_endpoints
num_p, num_d, is_stable = await self.prefill_planner.get_workers_info()
self.shared_state.num_p_workers = num_p
self.shared_state.num_d_workers = num_d
next_num_p = self.prefill_planner.plan_adjustment()
next_num_d = self.decode_planner.plan_adjustment()
......@@ -86,14 +86,12 @@ class PlannerHarness:
"config",
}
prefill_attrs = {
"prefill_interpolator",
"ttft_regression",
"prefill_worker_info",
"p_correction_factor",
}
decode_attrs = {
"decode_interpolator",
"itl_regression",
"decode_worker_info",
"d_correction_factor",
}
if name == "last_metrics":
return self.shared_state.last_metrics
......@@ -119,8 +117,8 @@ class PlannerHarness:
"config",
"get_workers_info",
}
prefill_attrs = {"prefill_interpolator", "p_correction_factor"}
decode_attrs = {"decode_interpolator", "d_correction_factor"}
prefill_attrs = {"ttft_regression"}
decode_attrs = {"itl_regression"}
if name == "last_metrics":
self.shared_state.last_metrics = value
return None
......@@ -159,7 +157,6 @@ def planner():
itl=10.0,
backend="vllm",
no_operation=True,
no_correction=False,
metric_pulling_prometheus_endpoint="http://localhost:9090",
metric_reporting_prometheus_port=0,
load_predictor="constant",
......@@ -176,12 +173,13 @@ def planner():
enable_load_scaling=False,
load_predictor_warmup_trace=None,
load_predictor_log1p=False,
max_num_fpm_samples=50,
fpm_sample_bucket_size=16,
load_min_observations=5,
)
# Mock the runtime
mock_runtime = Mock()
# Patch Prometheus Gauge to avoid registry conflicts
with patch("dynamo.planner.monitoring.planner_metrics.Gauge") as mock_gauge:
mock_gauge.return_value = Mock()
......@@ -206,9 +204,21 @@ def planner():
decode_planner.prefill_worker_info = prefill_planner.prefill_worker_info
decode_planner.decode_worker_info = prefill_planner.decode_worker_info
# Mock the interpolators to return fixed values for testing
planner.prefill_interpolator = Mock()
planner.decode_interpolator = Mock()
planner.ttft_regression = Mock()
# Default: 40000 tokens/s at isl=3000 → 40000/3000 rps
planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
40000.0 / 3000.0,
75.0,
)
planner.ttft_regression.has_sufficient_data.return_value = True
planner.itl_regression = Mock()
# Default: 10000 tokens/s at osl=150 → 10000/150 rps
planner.itl_regression.find_best_engine_decode_rps.return_value = (
10000.0 / 150.0,
9.5,
)
planner.itl_regression.has_sufficient_data.return_value = True
# Mock the predictors to return fixed values
planner.num_req_predictor = Mock()
......@@ -221,14 +231,9 @@ def planner():
# Mock prometheus client
planner.prometheus_traffic_client = Mock()
# Set up some baseline correction factors
planner.p_correction_factor = 1.0
planner.d_correction_factor = 1.0
planner.config = config
yield planner
# Cleanup is automatic with context manager
class TestReplicaCalculation:
......@@ -239,59 +244,40 @@ class TestReplicaCalculation:
@pytest.mark.performance
def test_prefill_replica_calculation_basic(self, planner):
"""Test basic prefill replica calculation."""
# Setup test data
next_num_req = 10
next_isl = 3000
prefill_thpt_per_gpu = 40000 # tokens/s/gpu (from the test data)
engine_rps = 40000.0 / next_isl
# Mock the predictor outputs
planner.num_req_predictor.predict_next.return_value = next_num_req
planner.isl_predictor.predict_next.return_value = next_isl
planner.osl_predictor.predict_next.return_value = 150
# Mock interpolator output
planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = (
prefill_thpt_per_gpu
planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
engine_rps,
75.0,
)
planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
10000,
0.01,
0.5,
planner.itl_regression.find_best_engine_decode_rps.return_value = (
10000.0 / 150.0,
9.5,
)
# Calculate expected result manually
pred_prefill_load_per_gpu = (
next_num_req
* next_isl
/ planner.config.throughput_adjustment_interval
* min(1, planner.p_correction_factor)
)
expected_prefill_replicas = math.ceil(
pred_prefill_load_per_gpu
/ prefill_thpt_per_gpu
/ planner.config.prefill_engine_num_gpu
# Formula: ceil(num_req / interval / engine_rps)
pred_prefill_demand = (
next_num_req / planner.config.throughput_adjustment_interval
)
expected_prefill_replicas = math.ceil(pred_prefill_demand / engine_rps)
# Set up valid metrics to trigger calculation
planner.last_metrics = Metrics(
num_req=10, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
)
# Mock workers info
async def mock_get_workers_info():
return (["prefill1"], ["decode1"])
async def mock_get_workers_info(*args, **kwargs):
return (1, 1, True)
planner.get_workers_info = mock_get_workers_info
# Mock interpolation calls for correction factor calculation
planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
planner.decode_interpolator.interpolate_itl.return_value = 10.0
# Run the calculation
asyncio.run(planner.make_adjustments())
# Extract the calculated values from the log calls or by checking the mock calls
# Since we mocked the connector, we can check what replicas were requested
prefill_component = "VllmPrefillWorker"
calculated_prefill_replicas = _replica_count(
planner.last_target_replicas, prefill_component
......@@ -299,7 +285,6 @@ class TestReplicaCalculation:
print(f"Expected prefill replicas: {expected_prefill_replicas}")
print(f"Calculated prefill replicas: {calculated_prefill_replicas}")
# Allow for small differences due to min_endpoint constraints
assert (
max(expected_prefill_replicas, planner.config.min_endpoint)
== calculated_prefill_replicas
......@@ -310,52 +295,39 @@ class TestReplicaCalculation:
@pytest.mark.performance
def test_decode_replica_calculation_basic(self, planner):
"""Test basic decode replica calculation."""
# Setup test data
next_num_req = 10
next_osl = 150
decode_thpt_per_gpu = 10000 # tokens/s/gpu
engine_rps = 10000.0 / next_osl
# Mock the predictor outputs
planner.num_req_predictor.predict_next.return_value = next_num_req
planner.isl_predictor.predict_next.return_value = 3000
planner.osl_predictor.predict_next.return_value = next_osl
# Mock interpolator outputs
planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
decode_thpt_per_gpu,
0.01,
0.5,
planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
40000.0 / 3000.0,
75.0,
)
planner.itl_regression.find_best_engine_decode_rps.return_value = (
engine_rps,
9.5,
)
# Calculate expected result manually
# Formula: ceil(num_req / interval / engine_rps)
expected_decode_replicas = math.ceil(
next_num_req
* next_osl
/ planner.config.throughput_adjustment_interval
/ decode_thpt_per_gpu
/ planner.config.decode_engine_num_gpu
next_num_req / planner.config.throughput_adjustment_interval / engine_rps
)
# Set up valid metrics
planner.last_metrics = Metrics(
num_req=10, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
)
# Mock workers info
async def mock_get_workers_info():
return (["prefill1"], ["decode1"])
async def mock_get_workers_info(*args, **kwargs):
return (1, 1, True)
planner.get_workers_info = mock_get_workers_info
# Mock interpolation calls for correction factor calculation
planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
planner.decode_interpolator.interpolate_itl.return_value = 10.0
# Run the calculation
asyncio.run(planner.make_adjustments())
# Check the results
decode_component = "VllmDecodeWorker"
calculated_decode_replicas = _replica_count(
planner.last_target_replicas, decode_component
......@@ -363,46 +335,43 @@ class TestReplicaCalculation:
print(f"Expected decode replicas: {expected_decode_replicas}")
print(f"Calculated decode replicas: {calculated_decode_replicas}")
# Allow for small differences due to min_endpoint constraints
assert (
max(expected_decode_replicas, planner.config.min_endpoint)
== calculated_decode_replicas
)
@pytest.mark.parametrize(
"num_req,decode_thpt,expected_p,expected_d",
"num_req,decode_rps,expected_p,expected_d",
[
(10, 10000, 1, 1), # low_load_10_req_per_second
(500, 1000, 1, 2), # high_load_500_req_per_second (lower decode throughput)
(10, 10000.0 / 150.0, 1, 1), # low_load_10_req_per_second
(
500,
1000.0 / 150.0,
1,
2,
), # high_load_500_req_per_second (lower decode rps)
],
)
@pytest.mark.nightly
@pytest.mark.gpu_2
@pytest.mark.performance
def test_scaling_scenario_low_to_high_load(
self, planner, num_req, decode_thpt, expected_p, expected_d
self, planner, num_req, decode_rps, expected_p, expected_d
):
"""Test scaling from low to high load scenarios."""
# Reset the planner state
planner.p_correction_factor = 1.0
planner.d_correction_factor = 1.0
# Mock predictor outputs for this case
planner.num_req_predictor.predict_next.return_value = num_req
planner.isl_predictor.predict_next.return_value = 3000
planner.osl_predictor.predict_next.return_value = 150
# Mock interpolator outputs (based on H200 1P1D profiling data)
planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = (
40000 # tokens/s/gpu
planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
40000.0 / 3000.0,
75.0,
)
planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
decode_thpt,
0.01,
0.5,
planner.itl_regression.find_best_engine_decode_rps.return_value = (
decode_rps,
9.5,
)
# Set up metrics
planner.last_metrics = Metrics(
num_req=num_req,
isl=3000,
......@@ -412,23 +381,14 @@ class TestReplicaCalculation:
request_duration=100.0,
)
# Mock workers info
async def mock_get_workers_info():
return (["prefill1"], ["decode1"])
async def mock_get_workers_info(*args, **kwargs):
return (1, 1, True)
planner.get_workers_info = mock_get_workers_info
# Mock interpolation calls for correction factor calculation
planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
planner.decode_interpolator.interpolate_itl.return_value = 10.0
# Reset the mock
planner.connector.reset_mock()
# Run calculation
asyncio.run(planner.make_adjustments())
# Verify results
prefill_replicas = _replica_count(
planner.last_target_replicas, "VllmPrefillWorker"
)
......@@ -449,41 +409,32 @@ class TestReplicaCalculation:
@pytest.mark.performance
def test_gpu_budget_constraint(self, planner):
"""Test that GPU budget constraints are properly applied."""
# Set a low GPU budget
planner.config.max_gpu_budget = 3
# Mock predictor outputs that would normally require more GPUs
planner.num_req_predictor.predict_next.return_value = 50 # High load
planner.num_req_predictor.predict_next.return_value = 50
planner.isl_predictor.predict_next.return_value = 3000
planner.osl_predictor.predict_next.return_value = 150
# Mock interpolator outputs
planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
10000,
0.01,
0.5,
planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
40000.0 / 3000.0,
75.0,
)
planner.itl_regression.find_best_engine_decode_rps.return_value = (
10000.0 / 150.0,
9.5,
)
# Set up metrics
planner.last_metrics = Metrics(
num_req=50, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
)
# Mock workers info
async def mock_get_workers_info():
return (["prefill1"], ["decode1"])
async def mock_get_workers_info(*args, **kwargs):
return (1, 1, True)
planner.get_workers_info = mock_get_workers_info
# Mock interpolation calls
planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
planner.decode_interpolator.interpolate_itl.return_value = 10.0
# Run calculation
asyncio.run(planner.make_adjustments())
# Verify that total GPU usage doesn't exceed budget
prefill_replicas = _replica_count(
planner.last_target_replicas, "VllmPrefillWorker"
)
......@@ -510,38 +461,30 @@ class TestReplicaCalculation:
"""Test that minimum endpoint constraints are respected."""
planner.config.min_endpoint = 2
# Mock predictor outputs that would normally require fewer workers
planner.num_req_predictor.predict_next.return_value = 1 # Very low load
planner.num_req_predictor.predict_next.return_value = 1
planner.isl_predictor.predict_next.return_value = 100
planner.osl_predictor.predict_next.return_value = 10
# Mock interpolator outputs
planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
10000,
0.01,
0.5,
planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
40000.0 / 100.0,
75.0,
)
planner.itl_regression.find_best_engine_decode_rps.return_value = (
10000.0 / 10.0,
9.5,
)
# Set up metrics
planner.last_metrics = Metrics(
num_req=1, isl=100, osl=10, ttft=80.0, itl=10.0, request_duration=100.0
)
# Mock workers info
async def mock_get_workers_info():
return (["prefill1"], ["decode1"])
async def mock_get_workers_info(*args, **kwargs):
return (1, 1, True)
planner.get_workers_info = mock_get_workers_info
# Mock interpolation calls
planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
planner.decode_interpolator.interpolate_itl.return_value = 10.0
# Run calculation
asyncio.run(planner.make_adjustments())
# Verify minimum constraints are respected
prefill_replicas = _replica_count(
planner.last_target_replicas, "VllmPrefillWorker"
)
......@@ -557,182 +500,47 @@ class TestReplicaCalculation:
decode_replicas >= planner.config.min_endpoint
), "Decode replicas below minimum"
@pytest.mark.nightly
@pytest.mark.gpu_2
@pytest.mark.performance
def test_prefill_correction_factor_clamping(self, planner):
"""Test that prefill correction factor > 1 is clamped to 1."""
# Set a high correction factor > 1
planner.p_correction_factor = 2.5
planner.d_correction_factor = 1.0
# Mock predictor outputs
planner.num_req_predictor.predict_next.return_value = 10
planner.isl_predictor.predict_next.return_value = 3000
planner.osl_predictor.predict_next.return_value = 150
# Mock interpolator outputs
planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
10000,
0.01,
0.5,
)
# Set up metrics
planner.last_metrics = Metrics(
num_req=10, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
)
# Mock workers info
async def mock_get_workers_info():
return (["prefill1"], ["decode1"])
planner.get_workers_info = mock_get_workers_info
# Mock interpolation calls
planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
planner.decode_interpolator.interpolate_itl.return_value = 10.0
# Calculate expected result manually with clamping
# Should use min(1, 2.5) = 1
pred_prefill_load_per_gpu = (
10
* 3000
/ planner.config.throughput_adjustment_interval
* min(1, 2.5) # Should be * 1
)
expected_prefill_replicas = math.ceil(
pred_prefill_load_per_gpu / 40000 / planner.config.prefill_engine_num_gpu
)
# Run calculation
asyncio.run(planner.make_adjustments())
# Verify that correction factor was effectively clamped
prefill_replicas = _replica_count(
planner.last_target_replicas, "VllmPrefillWorker"
)
print(
f"Correction factor clamping test: Expected={expected_prefill_replicas}, Got={prefill_replicas}"
)
assert prefill_replicas == max(
expected_prefill_replicas, planner.config.min_endpoint
), "Prefill correction factor should be clamped to 1"
@pytest.mark.nightly
@pytest.mark.gpu_2
@pytest.mark.performance
def test_decode_correction_factor_zero_handling(self, planner):
"""Test handling of d_correction_factor <= 0."""
# Test both 0 and negative values
for correction_factor in [0.0, -1.0]:
planner.p_correction_factor = 1.0
planner.d_correction_factor = correction_factor
# Mock predictor outputs
planner.num_req_predictor.predict_next.return_value = 10
planner.isl_predictor.predict_next.return_value = 3000
planner.osl_predictor.predict_next.return_value = 150
# Mock interpolator outputs
planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
10000,
0.01,
0.5,
)
# Set up metrics
planner.last_metrics = Metrics(
num_req=10,
isl=3000,
osl=150,
ttft=80.0,
itl=10.0,
request_duration=100.0,
)
# Mock workers info
async def mock_get_workers_info():
return (["prefill1"], ["decode1"])
planner.get_workers_info = mock_get_workers_info
# Mock interpolation calls
planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
planner.decode_interpolator.interpolate_itl.return_value = 10.0
# Run calculation
asyncio.run(planner.make_adjustments())
# Should handle gracefully without crashing
# The code should use args.itl directly instead of dividing by 0
decode_replicas = _replica_count(
planner.last_target_replicas, "VllmDecodeWorker"
)
print(
f"Correction factor {correction_factor} test: Decode replicas={decode_replicas}"
)
# Should get a valid result (not crash)
assert (
decode_replicas >= 1
), f"Should handle correction factor {correction_factor} gracefully"
@pytest.mark.nightly
@pytest.mark.gpu_2
@pytest.mark.performance
def test_multi_gpu_engines(self, planner):
"""Test replica calculation with multi-GPU engines."""
# Set multi-GPU configuration
planner.config.prefill_engine_num_gpu = 2
planner.config.decode_engine_num_gpu = 4
# Mock predictor outputs
planner.num_req_predictor.predict_next.return_value = 20
planner.isl_predictor.predict_next.return_value = 3000
planner.osl_predictor.predict_next.return_value = 150
# Mock interpolator outputs
planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 40000
planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
5000,
0.01,
0.5,
) # Lower for scaling
# Engine-level request rate (already accounts for multi-GPU)
prefill_engine_rps = 40000.0 / 3000.0
decode_engine_rps = 5000.0 / 150.0
planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
prefill_engine_rps,
75.0,
)
planner.itl_regression.find_best_engine_decode_rps.return_value = (
decode_engine_rps,
9.5,
)
# Set up metrics
planner.last_metrics = Metrics(
num_req=20, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
)
# Mock workers info
async def mock_get_workers_info():
return (["prefill1"], ["decode1"])
async def mock_get_workers_info(*args, **kwargs):
return (1, 1, True)
planner.get_workers_info = mock_get_workers_info
# Mock interpolation calls
planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
planner.decode_interpolator.interpolate_itl.return_value = 10.0
# Calculate expected results manually
pred_prefill_load_per_gpu = (
20 * 3000 / planner.config.throughput_adjustment_interval * 1.0
)
# No engine_num_gpu division — regression returns engine-level rps
expected_prefill_replicas = math.ceil(
pred_prefill_load_per_gpu / 40000 / 2
) # 2 GPUs per engine
20 / planner.config.throughput_adjustment_interval / prefill_engine_rps
)
expected_decode_replicas = math.ceil(
20 * 150 / planner.config.throughput_adjustment_interval / 5000 / 4
) # 4 GPUs per engine
20 / planner.config.throughput_adjustment_interval / decode_engine_rps
)
# Run calculation
asyncio.run(planner.make_adjustments())
prefill_replicas = _replica_count(
......@@ -742,10 +550,10 @@ class TestReplicaCalculation:
planner.last_target_replicas, "VllmDecodeWorker"
)
print(
f"Multi-GPU test: P={prefill_replicas} (expected ~{expected_prefill_replicas}), D={decode_replicas} (expected ~{expected_decode_replicas})"
f"Multi-GPU test: P={prefill_replicas} (expected ~{expected_prefill_replicas}), "
f"D={decode_replicas} (expected ~{expected_decode_replicas})"
)
# Verify calculations account for multiple GPUs per engine
assert prefill_replicas == max(
expected_prefill_replicas, planner.config.min_endpoint
)
......@@ -757,42 +565,39 @@ class TestReplicaCalculation:
@pytest.mark.gpu_2
@pytest.mark.performance
def test_complex_gpu_budget_scaling(self, planner):
"""Test complex GPU budget scaling with proportional reduction and decode adjustment."""
# Set tight GPU budget that will trigger complex scaling
"""Test complex GPU budget scaling with proportional reduction."""
planner.config.max_gpu_budget = 5
planner.config.prefill_engine_num_gpu = 2
planner.config.decode_engine_num_gpu = 2
planner.config.min_endpoint = 1
# High load that would normally require more GPUs
planner.num_req_predictor.predict_next.return_value = 100
planner.isl_predictor.predict_next.return_value = 3000
planner.osl_predictor.predict_next.return_value = 150
# Lower throughput to trigger higher replica needs
planner.prefill_interpolator.interpolate_thpt_per_gpu.return_value = 10000
planner.decode_interpolator.find_best_throughput_per_gpu.return_value = (
1000,
0.01,
0.5,
planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
10000.0 / 3000.0,
300.0,
)
planner.itl_regression.find_best_engine_decode_rps.return_value = (
1000.0 / 150.0,
9.5,
)
# Set up metrics
planner.last_metrics = Metrics(
num_req=100, isl=3000, osl=150, ttft=80.0, itl=10.0, request_duration=100.0
num_req=100,
isl=3000,
osl=150,
ttft=80.0,
itl=10.0,
request_duration=100.0,
)
# Mock workers info
async def mock_get_workers_info():
return (["prefill1"], ["decode1"])
async def mock_get_workers_info(*args, **kwargs):
return (1, 1, True)
planner.get_workers_info = mock_get_workers_info
# Mock interpolation calls
planner.prefill_interpolator.interpolate_ttft.return_value = 80.0
planner.decode_interpolator.interpolate_itl.return_value = 10.0
# Run calculation
asyncio.run(planner.make_adjustments())
prefill_replicas = _replica_count(
......@@ -801,14 +606,14 @@ class TestReplicaCalculation:
decode_replicas = _replica_count(
planner.last_target_replicas, "VllmDecodeWorker"
)
# Verify total GPU usage doesn't exceed budget
total_gpus = (
prefill_replicas * planner.config.prefill_engine_num_gpu
+ decode_replicas * planner.config.decode_engine_num_gpu
)
print(
f"Complex GPU budget test: P={prefill_replicas}, D={decode_replicas}, Total GPUs={total_gpus}"
f"Complex GPU budget test: P={prefill_replicas}, D={decode_replicas}, "
f"Total GPUs={total_gpus}"
)
assert (
......@@ -820,6 +625,3 @@ class TestReplicaCalculation:
assert (
decode_replicas >= planner.config.min_endpoint
), "Should respect min_endpoint for decode"
# No need for unittest.main() with pytest!
# SPDX-FileCopyrightText: Copyright (c) 2025-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
import argparse
import asyncio
import math
import os
from unittest.mock import Mock, patch
from unittest.mock import MagicMock, Mock, patch
import pytest
......@@ -15,7 +14,6 @@ from dynamo.planner.core.decode import DecodePlanner
from dynamo.planner.core.prefill import PrefillPlanner
from dynamo.planner.core.state import PlannerSharedState
from dynamo.planner.errors import DeploymentValidationError
from dynamo.planner.offline.dryrun import run_sla_planner_dryrun
pytestmark = [
pytest.mark.gpu_0,
......@@ -24,6 +22,10 @@ pytestmark = [
pytest.mark.planner,
]
PREFILL_ENGINE_RPS = 10.0
DECODE_ENGINE_RPS = 5.0
DECODE_ACTUAL_ITL_MS = 40.0
@pytest.fixture(autouse=True)
def mock_prometheus_metrics():
......@@ -43,7 +45,6 @@ def _build_config():
itl=50.0,
backend="vllm",
no_operation=True,
no_correction=True,
metric_pulling_prometheus_endpoint="http://localhost:9090",
metric_reporting_prometheus_port=0,
load_predictor="constant",
......@@ -90,6 +91,20 @@ def _build_planners(config, prometheus_client):
prefill_planner.model_name = "test-model"
decode_planner.model_name = "test-model"
prefill_planner.ttft_regression = MagicMock()
prefill_planner.ttft_regression.find_best_engine_prefill_rps.return_value = (
PREFILL_ENGINE_RPS,
75.0,
)
prefill_planner.ttft_regression.has_sufficient_data.return_value = True
decode_planner.itl_regression = MagicMock()
decode_planner.itl_regression.find_best_engine_decode_rps.return_value = (
DECODE_ENGINE_RPS,
DECODE_ACTUAL_ITL_MS,
)
decode_planner.itl_regression.has_sufficient_data.return_value = True
async def mock_get_workers_info(require_prefill=True, require_decode=True):
return (
1 if require_prefill else 0,
......@@ -103,32 +118,20 @@ def _build_planners(config, prometheus_client):
def _expected_prefill(config, prefill_planner, sample):
pred_prefill_throughput = (
sample["num_req"] * sample["isl"] / config.throughput_adjustment_interval
)
thpt_per_gpu = prefill_planner.prefill_interpolator.interpolate_thpt_per_gpu(
sample["isl"]
)
expected = math.ceil(
pred_prefill_throughput / thpt_per_gpu / config.prefill_engine_num_gpu
demand_rps = sample["num_req"] / config.throughput_adjustment_interval
engine_rps, _ = prefill_planner.ttft_regression.find_best_engine_prefill_rps(
ttft_sla=config.ttft, isl=sample["isl"]
)
expected = math.ceil(demand_rps / engine_rps)
return max(expected, config.min_endpoint)
def _expected_decode(config, decode_planner, sample):
(
pred_decode_thpt_per_gpu,
_,
_,
) = decode_planner.decode_interpolator.find_best_throughput_per_gpu(
demand_rps = sample["num_req"] / config.throughput_adjustment_interval
engine_rps, _ = decode_planner.itl_regression.find_best_engine_decode_rps(
itl=config.itl, context_length=sample["isl"] + sample["osl"] / 2
)
pred_decode_throughput = (
sample["num_req"] * sample["osl"] / config.throughput_adjustment_interval
)
expected = math.ceil(
pred_decode_throughput / pred_decode_thpt_per_gpu / config.decode_engine_num_gpu
)
expected = math.ceil(demand_rps / engine_rps)
return max(expected, config.min_endpoint)
......@@ -210,128 +213,114 @@ def test_disagg_scale_down():
assert low_d < high_d
# Tests for _initialize_gpu_counts
class TestInitializeGpuCounts:
@staticmethod
def _make_config(**overrides):
defaults = dict(prefill_engine_num_gpu=None, decode_engine_num_gpu=None)
defaults.update(overrides)
return PlannerConfig.model_construct(**defaults)
def test_kubernetes_mode_reads_from_dgd(self):
"""Test that GPU counts are read from DGD in Kubernetes mode"""
args = argparse.Namespace()
args.prefill_engine_num_gpu = None
args.decode_engine_num_gpu = None
config = self._make_config()
connector = Mock()
connector.get_gpu_counts = Mock(return_value=(2, 4))
_initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True
config, connector, require_prefill=True, require_decode=True
)
assert args.prefill_engine_num_gpu == 2
assert args.decode_engine_num_gpu == 4
assert config.prefill_engine_num_gpu == 2
assert config.decode_engine_num_gpu == 4
connector.get_gpu_counts.assert_called_once_with(
require_prefill=True, require_decode=True
)
def test_kubernetes_mode_prefill_only(self):
"""Test GPU count initialization for prefill-only mode"""
args = argparse.Namespace()
args.prefill_engine_num_gpu = None
args.decode_engine_num_gpu = None
config = self._make_config()
connector = Mock()
connector.get_gpu_counts = Mock(return_value=(2, 0))
_initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=False
config, connector, require_prefill=True, require_decode=False
)
assert args.prefill_engine_num_gpu == 2
assert args.decode_engine_num_gpu == 0
assert config.prefill_engine_num_gpu == 2
assert config.decode_engine_num_gpu == 0
connector.get_gpu_counts.assert_called_once_with(
require_prefill=True, require_decode=False
)
def test_virtual_mode_uses_cli_args(self):
"""Test that GPU counts come from CLI args in virtual mode"""
args = argparse.Namespace()
args.prefill_engine_num_gpu = 2
args.decode_engine_num_gpu = 4
"""Test that GPU counts come from config in virtual mode"""
config = self._make_config(prefill_engine_num_gpu=2, decode_engine_num_gpu=4)
# Virtual connector doesn't have get_gpu_counts method
connector = Mock(spec=[])
_initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True
config, connector, require_prefill=True, require_decode=True
)
# Values should remain unchanged
assert args.prefill_engine_num_gpu == 2
assert args.decode_engine_num_gpu == 4
assert config.prefill_engine_num_gpu == 2
assert config.decode_engine_num_gpu == 4
def test_virtual_mode_missing_prefill_raises_error(self):
"""Test that missing prefill GPU flag raises error in virtual mode"""
args = argparse.Namespace()
args.prefill_engine_num_gpu = None
args.decode_engine_num_gpu = 4
"""Test that missing prefill GPU config raises error in virtual mode"""
config = self._make_config(decode_engine_num_gpu=4)
connector = Mock(spec=[])
with pytest.raises(DeploymentValidationError) as exc_info:
_initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True
config, connector, require_prefill=True, require_decode=True
)
assert "prefill_engine_num_gpu" in str(exc_info.value)
def test_virtual_mode_missing_decode_raises_error(self):
"""Test that missing decode GPU flag raises error in virtual mode"""
args = argparse.Namespace()
args.prefill_engine_num_gpu = 2
args.decode_engine_num_gpu = None
"""Test that missing decode GPU config raises error in virtual mode"""
config = self._make_config(prefill_engine_num_gpu=2)
connector = Mock(spec=[])
with pytest.raises(DeploymentValidationError) as exc_info:
_initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True
config, connector, require_prefill=True, require_decode=True
)
assert "decode_engine_num_gpu" in str(exc_info.value)
def test_virtual_mode_missing_both_raises_error_with_both_messages(self):
"""Test that missing both GPU flags shows both error messages"""
args = argparse.Namespace()
args.prefill_engine_num_gpu = None
args.decode_engine_num_gpu = None
"""Test that missing both GPU configs shows both error messages"""
config = self._make_config()
connector = Mock(spec=[])
with pytest.raises(DeploymentValidationError) as exc_info:
_initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True
config, connector, require_prefill=True, require_decode=True
)
assert len(exc_info.value.errors) == 2
def test_virtual_mode_decode_only_no_prefill_error(self):
"""Test decode-only mode doesn't require prefill GPU flag"""
args = argparse.Namespace()
args.prefill_engine_num_gpu = None
args.decode_engine_num_gpu = 4
"""Test decode-only mode doesn't require prefill GPU config"""
config = self._make_config(decode_engine_num_gpu=4)
connector = Mock(spec=[])
# Should not raise - prefill not required
_initialize_gpu_counts(
args, connector, require_prefill=False, require_decode=True
config, connector, require_prefill=False, require_decode=True
)
assert args.decode_engine_num_gpu == 4
assert config.decode_engine_num_gpu == 4
def test_kubernetes_mode_fallback_to_cli_on_dgd_error(self):
"""Test that K8s mode falls back to CLI flags when DGD parsing fails"""
args = argparse.Namespace()
args.prefill_engine_num_gpu = 2
args.decode_engine_num_gpu = 4
"""Test that K8s mode falls back to config when DGD parsing fails"""
config = self._make_config(prefill_engine_num_gpu=2, decode_engine_num_gpu=4)
connector = Mock()
connector.get_gpu_counts = Mock(
......@@ -339,18 +328,15 @@ class TestInitializeGpuCounts:
)
_initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True
config, connector, require_prefill=True, require_decode=True
)
# Should use CLI flag values after fallback
assert args.prefill_engine_num_gpu == 2
assert args.decode_engine_num_gpu == 4
assert config.prefill_engine_num_gpu == 2
assert config.decode_engine_num_gpu == 4
def test_kubernetes_mode_fallback_missing_cli_flags_raises_error(self):
"""Test that K8s fallback raises error when CLI flags are also missing"""
args = argparse.Namespace()
args.prefill_engine_num_gpu = None
args.decode_engine_num_gpu = None
"""Test that K8s fallback raises error when config also missing"""
config = self._make_config()
connector = Mock()
connector.get_gpu_counts = Mock(
......@@ -359,16 +345,14 @@ class TestInitializeGpuCounts:
with pytest.raises(DeploymentValidationError) as exc_info:
_initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True
config, connector, require_prefill=True, require_decode=True
)
assert len(exc_info.value.errors) == 2
def test_kubernetes_mode_fallback_partial_cli_flags(self):
"""Test K8s fallback with only one CLI flag provided"""
args = argparse.Namespace()
args.prefill_engine_num_gpu = 2
args.decode_engine_num_gpu = None
"""Test K8s fallback with only one config value provided"""
config = self._make_config(prefill_engine_num_gpu=2)
connector = Mock()
connector.get_gpu_counts = Mock(
......@@ -377,73 +361,7 @@ class TestInitializeGpuCounts:
with pytest.raises(DeploymentValidationError) as exc_info:
_initialize_gpu_counts(
args, connector, require_prefill=True, require_decode=True
config, connector, require_prefill=True, require_decode=True
)
assert "decode_engine_num_gpu" in str(exc_info.value)
# Tests for dryrun GPU defaults
class TestDryrunGpuDefaults:
@staticmethod
def _build_dryrun_config(**overrides) -> PlannerConfig:
defaults = dict(
throughput_adjustment_interval=60,
prefill_engine_num_gpu=1,
decode_engine_num_gpu=1,
min_endpoint=1,
max_gpu_budget=-1,
ttft=500.0,
itl=50.0,
backend="vllm",
no_operation=True,
no_correction=True,
metric_pulling_prometheus_endpoint="http://localhost:9090",
metric_reporting_prometheus_port=0,
load_predictor="constant",
load_predictor_warmup_trace=None,
load_predictor_log1p=False,
profile_results_dir=os.path.join(
os.path.dirname(__file__),
"..",
"data",
"profiling_results",
"H200_TP1P_TP1D",
),
environment="kubernetes",
namespace="test-namespace",
mode="disagg",
enable_throughput_scaling=True,
enable_load_scaling=False,
)
defaults.update(overrides)
return PlannerConfig.model_construct(**defaults)
def test_dryrun_defaults_gpu_counts_when_none(self):
"""Test that dryrun sets default GPU counts of 1 when None"""
config = self._build_dryrun_config(
prefill_engine_num_gpu=None, decode_engine_num_gpu=None
)
try:
run_sla_planner_dryrun(config, dataset="nonexistent.jsonl")
except (FileNotFoundError, ValueError):
pass
assert config.prefill_engine_num_gpu == 1
assert config.decode_engine_num_gpu == 1
def test_dryrun_preserves_cli_gpu_counts(self):
"""Test that dryrun preserves GPU counts provided via config"""
config = self._build_dryrun_config(
prefill_engine_num_gpu=2, decode_engine_num_gpu=4
)
try:
run_sla_planner_dryrun(config, dataset="nonexistent.jsonl")
except (FileNotFoundError, ValueError):
pass
assert config.prefill_engine_num_gpu == 2
assert config.decode_engine_num_gpu == 4
......@@ -12,4 +12,4 @@ pmdarima==2.1.1
prometheus-api-client==0.6.0
prophet==1.2.1
scikit-learn==1.7.2
scipy<1.14.0 # Upper bound for pmdarima compatibility
scipy>=1.14.0,<2.0
......@@ -25,8 +25,8 @@ The Dynamo **Planner** is an autoscaler purpose-built for these constraints. It
The Planner supports two scaling modes that can run independently or together:
- **Throughput-based scaling**: Uses pre-deployment profiling data and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
- **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No profiling data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
- **Throughput-based scaling**: Uses pre-deployment engine performance data (from self-benchmark or profiler) and traffic prediction to compute the replica count needed to meet TTFT and ITL targets. Adjusts on a longer interval (default 180s). This is the primary mode for production deployments.
- **Load-based scaling**: Uses ForwardPassMetrics (FPM) from the Dynamo event plane and fits an online linear regression to make scaling decisions. No pre-deployment data or KV Router required. Adjusts on a short interval (default 5s) to respond quickly to bursts.
When both modes are enabled, throughput-based scaling provides a capacity floor (long-term planning) while load-based scaling handles real-time adjustments above that floor.
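The floor behavior described above can be illustrated with a minimal sketch. The function name and arguments are hypothetical and are not taken from the planner source; the sketch only shows that the load-based target is never allowed to fall below the throughput-based count.

```python
def combine_replica_targets(
    throughput_floor: int,
    load_based_target: int,
    min_endpoint: int = 1,
) -> int:
    """Pick the replica count when both scaling modes are enabled.

    Hypothetical helper: the throughput-based result acts as a lower
    bound, and the load-based result may only raise the count above it.
    """
    return max(min_endpoint, throughput_floor, load_based_target)


# Load-based scaling wants fewer replicas than the floor: the floor wins.
floor_wins = combine_replica_targets(throughput_floor=3, load_based_target=2)
# A burst pushes the load-based target above the floor: the burst wins.
burst_wins = combine_replica_targets(throughput_floor=3, load_based_target=5)
```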
......@@ -36,12 +36,12 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
|---------|:----------------:|:-------------------------:|
| **Deployment** | | |
| Disaggregated | Supported | Supported |
| Aggregated | Unsupported | Supported |
| Aggregated | Supported | Supported |
| **LLM Framework** | | |
| SGLang | Supported | Supported |
| TensorRT-LLM | Supported | Supported |
| vLLM | Supported | Supported |
| **Requires Profiling Data** | Yes | No |
| **Requires Pre-deployment Data** | Yes (self-benchmark or profiler) | No |
| **Load Predictors** | ARIMA, Prophet, Kalman, Constant | N/A |
| **Router** | | |
| Any (round-robin, random, etc.) | Supported | Not supported |
......@@ -52,8 +52,8 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
## When to Use Which Mode
- **Throughput-based scaling** should be enabled whenever engine profiling data is available (through pre-deployment profiling). It provides stable, prediction-based capacity planning.
- **Load-based scaling** should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring profiling data.
- **Throughput-based scaling** should be enabled whenever engine performance data is available (through self-benchmark or pre-deployment profiling). It provides stable, prediction-based capacity planning.
- **Load-based scaling** should be enabled when traffic is bursty or hard to predict. It reacts quickly to real-time load changes without requiring pre-deployment data.
- **Both modes together**: For the best of both worlds, enable both. Throughput-based scaling provides a lower bound (long-term capacity), while load-based scaling handles bursts above that floor. When both are enabled, use a longer `--adjustment-interval` for throughput-based scaling.
## Quick Start
......@@ -63,7 +63,7 @@ When both modes are enabled, throughput-based scaling provides a capacity floor
- Dynamo platform installed on Kubernetes ([Installation Guide](../../kubernetes/installation-guide.md))
- kube-prometheus-stack installed ([Metrics Setup](../../kubernetes/observability/metrics.md))
For throughput-based scaling, pre-deployment profiling is also required ([Profiling Guide](../profiler/profiler-guide.md)).
For throughput-based scaling, pre-deployment engine performance data is also required (via self-benchmark mode or [Profiling Guide](../profiler/profiler-guide.md)).
### Throughput-Based Scaling (with DGDR)
......@@ -141,13 +141,11 @@ Load-based scaling has the following known limitations. Throughput-based scaling
| `--adjustment-interval` | `180` | Seconds between throughput-based scaling decisions |
| `--profile-results-dir` | `profiling_results` | Path to profiling data (NPZ/JSON) |
| `--load-predictor` | `arima` | Prediction model (`arima`, `prophet`, `kalman`, `constant`) |
| `--no-correction` | `true` | Disable correction factors (auto-disabled when load-based scaling is on) |
| **Load-based scaling** | | |
| `--enable-loadbased-scaling` | `false` | Enable load-based scaling |
| `--disable-throughput-scaling` | `false` | Disable throughput-based scaling (required for `agg` mode) |
| `--loadbased-router-metrics-url` | auto-discovered | URL to router's `/metrics` endpoint |
| `--loadbased-adjustment-interval` | `5` | Seconds between load-based scaling decisions |
| `--loadbased-learning-window` | `50` | Sliding window size for regression model |
| `--loadbased-adjustment-interval` | `5` | Seconds between FPM regression updates and load-based scaling decisions |
| `--max-num-fpm-samples` | `64` | Maximum retained FPM observations for regression |
| `--fpm-sample-bucket-size` | `16` | Number of buckets for observation retirement (must be a perfect square) |
| `--loadbased-scaling-down-sensitivity` | `80` | Scale-down sensitivity 0-100 (0=never, 100=aggressive) |
| `--loadbased-metric-samples` | `10` | Number of metric samples per adjustment interval |
| `--loadbased-min-observations` | `5` | Minimum observations before regression activates |
......@@ -175,7 +173,7 @@ The dashboard shows:
- Worker counts and GPU usage over time
- Observed TTFT, ITL, request rate, sequence lengths
- Predicted load and recommended replica counts
- Correction factors (actual vs. expected performance)
- FPM regression model status
### Prometheus Metrics
......
......@@ -12,12 +12,12 @@ For a quick overview, see the [Planner overview](README.md). For architecture in
The planner supports two scaling modes that can be used independently or together:
- **Throughput-based scaling** (`enable_throughput_scaling: true`): Uses pre-deployment engine interpolation data and traffic prediction to plan capacity. Best for stable, predictable workloads. Requires profiling data generated by the [Profiler](../profiler/profiler-guide.md).
- **Load-based scaling** (`enable_load_scaling: true`): Uses real-time per-worker engine metrics and online regression. Best for bursty or unpredictable traffic. Does not require profiling data. Requires the [KV Router](../router/README.md) — see [Current Limitations](README.md#current-limitations).
- **Throughput-based scaling** (`enable_throughput_scaling: true`): Uses pre-deployment engine performance data (from self-benchmark or profiler) and traffic prediction to plan capacity. Best for stable, predictable workloads.
- **Load-based scaling** (`enable_load_scaling: true`): Uses real-time ForwardPassMetrics (FPM) from the Dynamo event plane and online regression to make scaling decisions. Best for bursty or unpredictable traffic. Does not require pre-deployment data.
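
As a rough illustration of "online regression" over a bounded set of observations, the sketch below fits a one-variable least-squares line to the most recent samples. It is an assumption-laden simplification: the class name is hypothetical, the real planner regresses on ForwardPassMetrics fields not shown here, and it retires observations by bucket (see `fpm_sample_bucket_size`) rather than by the simple first-in-first-out window used below.

```python
from collections import deque
from typing import Optional, Tuple


class WindowedLinearRegression:
    """Least-squares line over the most recent `max_samples` points."""

    def __init__(self, max_samples: int = 64) -> None:
        # Oldest samples are dropped automatically once the deque is full.
        self.samples: deque = deque(maxlen=max_samples)

    def observe(self, x: float, y: float) -> None:
        self.samples.append((x, y))

    def fit(self) -> Optional[Tuple[float, float]]:
        """Return (slope, intercept), or None if the fit is undefined."""
        n = len(self.samples)
        if n < 2:
            return None
        mean_x = sum(x for x, _ in self.samples) / n
        mean_y = sum(y for _, y in self.samples) / n
        sxx = sum((x - mean_x) ** 2 for x, _ in self.samples)
        if sxx == 0:
            return None  # all x values identical, slope is undefined
        sxy = sum((x - mean_x) * (y - mean_y) for x, y in self.samples)
        slope = sxy / sxx
        return slope, mean_y - slope * mean_x


model = WindowedLinearRegression(max_samples=4)
for x in range(4):
    model.observe(float(x), 2.0 * x + 1.0)  # synthetic data on y = 2x + 1
slope, intercept = model.fit()
```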
**When to use which:**
- Enable **throughput-based scaling** whenever profiling data is available. It provides stable, prediction-based capacity planning.
- Enable **throughput-based scaling** whenever pre-deployment performance data is available (via self-benchmark or profiler). It provides stable, prediction-based capacity planning.
- Enable **load-based scaling** when traffic is bursty. It reacts quickly to real-time load changes.
- Enable **both** for the best of both worlds: throughput-based provides a capacity floor, load-based handles bursts above it. When both are enabled, use a longer `throughput_adjustment_interval`.
......@@ -39,8 +39,8 @@ features:
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `enable_throughput_scaling` | bool | `true` | Enable throughput-based scaling (requires pre-deployment profiling data). |
| `enable_load_scaling` | bool | `false` | Enable load-based scaling (no pre-deployment profiling data required). |
| `enable_throughput_scaling` | bool | `true` | Enable throughput-based scaling (requires pre-deployment performance data). |
| `enable_load_scaling` | bool | `false` | Enable load-based scaling. |
At least one scaling mode must be enabled.
......@@ -48,9 +48,9 @@ At least one scaling mode must be enabled.
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `pre_deployment_sweeping_mode` | string | `rapid` | How to generate engine interpolation data: `rapid` (AIC simulation, ~30s), `thorough` (real GPUs, 2-4h), or `none` (skip). |
| `pre_deployment_sweeping_mode` | string | `rapid` | How to generate engine performance data: `rapid` (AIC simulation, ~30s), `thorough` (real GPUs, 2-4h), or `none` (skip). |
When throughput-based scaling is enabled, the planner needs interpolation curves that map ISL to TTFT (prefill) and KV-cache utilization to ITL (decode). The profiler generates this data based on the `pre_deployment_sweeping_mode` setting. See the [Profiler Guide](../profiler/profiler-guide.md) for details on how this data is produced.
When throughput-based scaling is enabled, the planner needs engine performance data. At startup, it first tries to fetch self-benchmark results from the `get_perf_metrics` Dynamo endpoint (see PR #7779). If unavailable, it falls back to profiler-generated data (npz or JSON) at `profile_results_dir`. Both sources are converted to ForwardPassMetrics and fed into the FPM regression model.
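
The startup fallback order can be sketched as follows. Both callables are hypothetical stand-ins supplied by the caller: one represents a request to the `get_perf_metrics` endpoint, the other represents reading profiler output from `profile_results_dir`. The exceptions caught are an assumption about how an unavailable endpoint would surface.

```python
from typing import Callable, Optional, Tuple


def load_engine_perf_data(
    fetch_self_benchmark: Callable[[], Optional[dict]],
    load_profile_results: Callable[[str], dict],
    profile_results_dir: str,
) -> Tuple[str, dict]:
    """Return (source, data), preferring self-benchmark results."""
    try:
        data = fetch_self_benchmark()
    except (ConnectionError, TimeoutError):
        data = None  # endpoint unreachable: fall through to profiler data
    if data:
        return "self_benchmark", data
    return "profiler", load_profile_results(profile_results_dir)


def _unavailable() -> Optional[dict]:
    raise ConnectionError("get_perf_metrics not reachable")


# Endpoint is down, so the profiler directory is used instead.
source, data = load_engine_perf_data(
    _unavailable,
    lambda path: {"loaded_from": path},
    "profiling_results",
)
```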
### Throughput-Based Scaling Settings
......@@ -61,14 +61,14 @@ When throughput-based scaling is enabled, the planner needs interpolation curves
| `max_gpu_budget` | int | `8` | Maximum total GPUs the planner may allocate. |
| `ttft` | float | `500.0` | TTFT SLA target (ms) for scaling decisions. |
| `itl` | float | `50.0` | ITL SLA target (ms) for scaling decisions. |
| `no_correction` | bool | `true` | Disable latency correction factor. Auto-disabled when load-based scaling is on. |
### Load-Based Scaling Settings
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `load_adjustment_interval` | int | `5` | Seconds between load-based scaling decisions. Must be shorter than `throughput_adjustment_interval`. |
| `load_learning_window` | int | `50` | Sliding window size for regression model. |
| `load_adjustment_interval` | int | `5` | Seconds between FPM regression updates and load-based scaling decisions. Even when only throughput scaling is enabled, live FPM observations are fed into the regression at this interval. Must be shorter than `throughput_adjustment_interval`. |
| `max_num_fpm_samples` | int | `64` | Maximum retained FPM observations for regression. |
| `fpm_sample_bucket_size` | int | `16` | Number of buckets for observation retirement (must be a perfect square). |
| `load_scaling_down_sensitivity` | int | `80` | Scale-down sensitivity 0–100 (0=never, 100=aggressive). |
| `load_metric_samples` | int | `10` | Number of metric samples to collect per decision. |
| `load_min_observations` | int | `5` | Minimum observations before making scaling decisions. |
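
Two constraints from the table above (the interval ordering and the perfect-square bucket count) can be checked with a small sketch. The function is hypothetical and is not the planner's own validation; it only mirrors the rules stated in the field descriptions.

```python
import math
from typing import List


def validate_load_settings(
    load_adjustment_interval: int,
    throughput_adjustment_interval: int,
    fpm_sample_bucket_size: int,
) -> List[str]:
    """Return a list of human-readable problems; empty means valid."""
    errors = []
    if load_adjustment_interval >= throughput_adjustment_interval:
        errors.append(
            "load_adjustment_interval must be shorter than "
            "throughput_adjustment_interval"
        )
    # The < 1 check runs first so math.isqrt never sees a negative value.
    if (
        fpm_sample_bucket_size < 1
        or math.isqrt(fpm_sample_bucket_size) ** 2 != fpm_sample_bucket_size
    ):
        errors.append("fpm_sample_bucket_size must be a perfect square")
    return errors


defaults_ok = validate_load_settings(5, 180, 16)
both_bad = validate_load_settings(200, 180, 15)
```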
......@@ -105,8 +105,8 @@ When throughput-based scaling is enabled, the planner needs interpolation curves
When the profiler runs with planner enabled, it:
1. Selects the best prefill and decode engine configurations
2. Generates interpolation curves (TTFT vs ISL, ITL vs KV-cache utilization)
3. Saves the `PlannerConfig` and profiling data into separate Kubernetes ConfigMaps
2. Generates engine performance data (prefill TTFT vs ISL, decode ITL vs KV-cache utilization)
3. Saves the `PlannerConfig` and performance data into separate Kubernetes ConfigMaps
4. Adds the planner service to the generated DGD, configured to read from those ConfigMaps
The planner receives its config via `--config /path/to/planner_config.json` which is mounted from the `planner-config-XXXX` ConfigMap. Profiling data is mounted from the `planner-profile-data-XXXX` ConfigMap.
......